NER

What is Named Entity Recognition?

A named entity detector (Named Entity Recognition, or NER) is an analysis tool based on dedicated knowledge bases. Its aim is to identify the key semantic elements of a sentence, giving us access to the general knowledge present in a text.

What does it do? How does it work?

Named Entity Recognition is an investigative process that allows us to identify the most relevant semantic units because they are anchored in reality: person, place, date, purpose... these tags identify not just actors but also landmarks. They are essential in the process of reading and understanding a text.

Named Entity Recognition is based either on a set of rules or on machine learning trained on a manually labelled corpus. In NLP, this detection is detailed work. Initially, it was limited to the most important models or patterns of a text: the detector had to identify the five proper nouns or their equivalents (person, place, purpose, company or organization, date) to understand the who and the what of the text. It’s a little gem of computational linguistics.
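To make the machine-learning side concrete, here is a minimal sketch using spaCy and its small pretrained English pipeline. This is an assumption chosen for illustration; the article does not prescribe a particular library, and the example sentence is invented.

```python
# Minimal sketch of machine-learning-based NER with spaCy (illustrative choice,
# not the tooling described in this article).
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Angela Merkel visited Renault-Nissan in Paris on 30 April 2019.")

# Each detected entity carries a label such as PERSON, ORG, GPE or DATE,
# matching the classic who / where / when categories described above.
for ent in doc.ents:
    print(ent.text, ent.label_)
```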

A named entity must:

  • designate a reality perceptible in the world
  • represent an element that is essential for the understanding of the text

A LITTLE BIT OF GEEK CULTURE

Often used in research, NER depends on semantics and understanding. To understand a text, we split it up in order to find relevant elements. We then assign a type to the detected elements: it’s an ontological effort. NER is closely related to the work of the tokenizer, and notably to the step of "re-tokenization" that assembles multiple tokens that together carry a single meaning.

Example:

In the sentence “I drive at 100 m/hour,” the following tokens are clearly distinct:

[ 100 ] [ m ] [ / ] [ hour ] 

“Re-tokenization” allows us to group them, creating the entity “speed,” which corresponds to the underlying concept of speed in this sentence.
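A hedged sketch of what such a re-tokenization step might look like is shown below; the token list, the merging rule and the `retokenize_speed` helper are all invented for illustration.

```python
import re

# Hypothetical re-tokenization: merge adjacent tokens that together express
# a single concept (here, a speed) into one entity.
tokens = ["I", "drive", "at", "100", "m", "/", "hour"]

def retokenize_speed(tokens):
    merged = []
    i = 0
    while i < len(tokens):
        # Look for the pattern <number> <unit> "/" <time unit>.
        if (i + 3 < len(tokens)
                and re.fullmatch(r"\d+", tokens[i])
                and tokens[i + 2] == "/"):
            merged.append({"text": " ".join(tokens[i:i + 4]), "entity": "speed"})
            i += 4
        else:
            merged.append({"text": tokens[i], "entity": None})
            i += 1
    return merged

print(retokenize_speed(tokens))
# ... {'text': '100 m / hour', 'entity': 'speed'} ...
```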

How does it work? What comes before and after?

In reality, one named entity can take different forms. The elements "30 April 2019", "30/04/2019" and "2019/04/30" all refer to the same date. Only proper names have a single form, given their unique reference.
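As a small illustration of this normalisation, the sketch below maps the three surface forms above onto one underlying date. The chosen format strings are assumptions made to cover exactly these variants.

```python
from datetime import date, datetime

# Assumed format strings covering the three surface forms quoted above.
FORMATS = ["%d %B %Y", "%d/%m/%Y", "%Y/%m/%d"]

def parse_date(text: str) -> date:
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {text!r}")

# All three variants resolve to the same underlying entity: 2019-04-30.
print({parse_date(s) for s in ["30 April 2019", "30/04/2019", "2019/04/30"]})
```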

We sometimes must resort to embedding in order to treat constructions composed of multiple elements. For example, "the president of Renault-Nissan" contains several entities. For this, we have two available methods. The first is corpus linguistics, in relation to the targets that interest us. From a corpus we can derive, for example, definite descriptions (locutions that describe an individual or a singular object, like “the first monkey to go to space” or “the 9th American president”). The second method is pattern detection via sufficiently generic regular expressions (regex).
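The regular expressions below are illustrative assumptions, not the patterns actually used in production: one targets a definite-description shape, the other a generic "role of organization" construction that exposes the embedded entity.

```python
import re

# Assumed, illustrative patterns for the two methods mentioned above.
DEFINED_DESCRIPTION = re.compile(r"\bthe \d+(st|nd|rd|th) \w+ president\b", re.IGNORECASE)
ROLE_OF_ORG = re.compile(r"\bthe (president|CEO|director) of ([A-Z][\w-]+)\b")

text = "We met the president of Renault-Nissan and discussed the 9th American president."

print(DEFINED_DESCRIPTION.search(text).group(0))   # the 9th American president
m = ROLE_OF_ORG.search(text)
print(m.group(0), "->", m.group(2))                # embedded ORG entity: Renault-Nissan
```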

This double methodology makes NER a meticulous process that takes place at the end of the chain, once each word of the text has been isolated and labelled.

What’s the Lettria approach?

At Lettria, we’ve chosen to combine machine learning with regular expressions. This double approach allows us to gain precision and to facilitate the maintenance of our models, all while increasing the number of entities returned by our API. First, we apply an initial series of regular expressions to isolate iconic text patterns (dates, numbers, etc.); this is entity recognition. Then we do the more generic work (hypernyms), which applies a semantic filter to the text to encompass all patterns.

So the types of entities are perfectly understood. For example, the entity 30.04.2019 is broken down into 30 (a day, from 01 to 31) + 04 (a month of the year) + 2019 (a year).
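A minimal sketch of this decomposition step is given below; the regular expression and its named groups are assumptions for illustration, not Lettria’s actual pattern.

```python
import re

# Assumed sketch of the regex stage: a date such as "30.04.2019" is decomposed
# into named components rather than matched as an opaque string.
DATE = re.compile(
    r"(?P<day>0[1-9]|[12]\d|3[01])"    # day, 01 to 31
    r"\.(?P<month>0[1-9]|1[0-2])"      # month, 01 to 12
    r"\.(?P<year>\d{4})"               # four-digit year
)

match = DATE.fullmatch("30.04.2019")
print(match.groupdict())  # {'day': '30', 'month': '04', 'year': '2019'}
```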

Today, we can detect 40 numerical and named entities, including telephone numbers, volumes, time intervals, sums of money, measurements and more. Find our exhaustive list of covered entities here.

Example:

In the sentence, “He comes to the house from Tuesday to Friday,” Lettria’s tools detect a series of tokens [from Tuesday to Friday] which form the “time interval” entity.
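For illustration only, the snippet below groups those tokens into a single "time interval" entity with a simple regular expression; the pattern is an assumption and does not reflect how Lettria’s detector is implemented.

```python
import re

# Hedged illustration: group the tokens "from <day> to <day>" into one entity.
DAY = r"(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
INTERVAL = re.compile(rf"\bfrom {DAY} to {DAY}\b", re.IGNORECASE)

sentence = "He comes to the house from Tuesday to Friday."
m = INTERVAL.search(sentence)
print({"text": m.group(0), "entity": "time interval"})
# {'text': 'from Tuesday to Friday', 'entity': 'time interval'}
```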

Key takeaways

The work of Named Entity Recognition comes down to labelling. By labelling markers or expressions in the text that have a real and unique referent in the world, we manage to isolate the elements that are most useful from a semantic point of view.

Each named entity refers to an independent label that constitutes a landmark or anchor in the text. Essential for the extraction of information, named entities are completely dependent on a language and its culture.

At Lettria, our NER stays agile because it is not based solely on fixed learning from computational linguistics, but also on the processing of regular expressions.