As digitalization progresses, the volume of data in companies is growing rapidly. However, much of the valuable information still sits unused in texts, websites, media files, documents, and e-mails. Numerous innovations in the field of Natural Language Processing (NLP) now make it possible to evaluate this information to an extent that was not possible before. In many industries, this means faster access to information and a competitive advantage. A central building block in NLP is the recognition of semantic concepts in texts, known as Named Entity Recognition (NER).

Companies continuously produce text data such as e-mails, work protocols, manuals, patents, and much more. Clients produce text data via e-mail, social media channels, questionnaires, reviews, comments, and other sources. This text data comes from different sources, is written by different authors in different languages, and often contains spelling mistakes. Companies are making significant efforts to store this data in so-called “data lakes”. Organizing this data by hand is often difficult and time-consuming, but automatic text analysis makes it feasible.

More efficient than manual analysis by humans

Finding relevant content in complex text collections requires new concepts for document analysis and search. Common methods such as searching for certain terms, i.e. the exact matching of letter sequences, prove inefficient in times of big data. Manual checking and classification of millions of texts by humans is, in turn, hardly economical and very time-consuming. As a result, far too much time is spent on tasks that a machine can do faster and more precisely.

Get valuable insights out of all your data

Nevertheless, it is extremely important for companies to be able to include all available data in their decisions. In a due diligence review, for example, a data room comprising several gigabytes would ideally be checked in its entirety instead of only a sample of documents. The same applies to a very large archive of digitized texts and documents, or to research across an entire online content network in which all articles and URIs (even millions of entries) can be analyzed in a database.

Thanks to modern techniques such as Named Entity Recognition, large amounts of data can be analyzed quickly, either in real time or in batches at defined time slots. With NLP solutions like hyScore|analyze, these processes run automatically 24 hours a day, 365 days a year.

In research, the automatic recognition of real-world objects in text is known as Named Entity Recognition (NER). General object types such as persons, places, and organizations can be recognized, as can more specific ones such as aircraft, companies, phones, or cryptocurrencies.

Image: Difference between rule-based character search (left) and intelligent detection of entities (right). In the example on the left, the system does not find the character string “UC Berkeley” because it does not occur in the text. In the example on the right, the system recognizes the text section “University of California, Berkeley” as an organization. Similarity measures can then be used to link this organization to UC Berkeley. Furthermore, a rule-based system cannot distinguish between the company “Apple” and the fruit “apple”. An intelligent system like hyScore|analyze can!
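To make the contrast in the image concrete, here is a minimal Python sketch. It is a toy illustration and not how hyScore|analyze works internally: the sample sentence, the small alias table KNOWN_ORGS, the helper names, and the 0.5 threshold are all assumptions made up for this example. It also assumes that an NER model has already marked “University of California, Berkeley” as an organization; the sketch only covers the exact search and the similarity-based linking step, using simple word overlap (Jaccard similarity) as the similarity measure.

```python
import re

# Hypothetical sample sentence, invented for this illustration.
TEXT = "She studied at the University of California, Berkeley before joining a startup."

# Hypothetical mini knowledge base: canonical name -> known surface forms.
KNOWN_ORGS = {
    "UC Berkeley": ["UC Berkeley", "University of California, Berkeley"],
}


def exact_search(text, query):
    """Rule-based character search: succeeds only if the exact letter sequence occurs."""
    return query in text


def tokens(phrase):
    """Lowercased set of alphabetic words in a phrase."""
    return set(re.findall(r"[a-z]+", phrase.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two phrases, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def link_entity(mention, knowledge_base, threshold=0.5):
    """Return the canonical name whose aliases best match the mention, or None."""
    best_name, best_score = None, 0.0
    for canonical, aliases in knowledge_base.items():
        for alias in aliases:
            score = jaccard(mention, alias)
            if score > best_score:
                best_name, best_score = canonical, score
    return best_name if best_score >= threshold else None


if __name__ == "__main__":
    # Left side of the image: the exact string is not in the text.
    print(exact_search(TEXT, "UC Berkeley"))
    # Right side: a detected mention is linked to its canonical entity.
    print(link_entity("University of California, Berkeley", KNOWN_ORGS))
    # A slightly different wording still links, because most words overlap.
    print(link_entity("California University Berkeley", KNOWN_ORGS))
```

Word overlap is only one of many possible similarity measures. It is used here because it needs no external libraries and keeps the linking step easy to follow.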

The development of NER systems goes back to the early 1990s, but has recently been boosted by the application of deep neural networks. The gain in accuracy comes from two fundamental improvements. First, neural networks can include entire sentences or even entire documents in the analysis, whereas older systems were always limited to a few words. Second, the mathematical representation of individual words is much more advanced than before.
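The idea of a mathematical representation of words can be sketched in a few lines of Python. In modern systems, each word (or word sense) is represented as a vector of numbers, and words with related meanings end up with similar vectors. The three-dimensional vectors below are made up purely for illustration; real models learn vectors with hundreds of dimensions from large text collections, and the names TOY_VECTORS and cosine are assumptions of this sketch.

```python
import math

# Made-up toy vectors, for illustration only. They are not taken from any real model.
TOY_VECTORS = {
    "apple_company": [0.9, 0.1, 0.2],
    "microsoft": [0.8, 0.2, 0.1],
    "apple_fruit": [0.1, 0.9, 0.3],
    "banana": [0.2, 0.8, 0.4],
}


def cosine(u, v):
    """Cosine similarity: close to 1.0 when two vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    company = TOY_VECTORS["apple_company"]
    fruit = TOY_VECTORS["apple_fruit"]
    # With these toy values, the company sense sits closer to "microsoft",
    # while the fruit sense sits closer to "banana".
    print(cosine(company, TOY_VECTORS["microsoft"]), cosine(company, TOY_VECTORS["banana"]))
    print(cosine(fruit, TOY_VECTORS["banana"]), cosine(fruit, TOY_VECTORS["microsoft"]))
```

Representations of this kind give a system a basis for telling the company “Apple” apart from the fruit, provided the surrounding sentence is taken into account when the vector is chosen.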

End of “Named Entity Recognition (NER) – Part 1”.
Part 2 of the article will be published soon. Sign up for our newsletter to stay informed!