Those who like the notion of ontologies may be familiar with the term "entity" as something general. In the text analysis community the term "named entity" is very popular. In fact, anything recognisable by text analysis software is called a "named entity", or so it seems. I'm not entirely sure of the background of the term "named entity", but it was probably thought up by the military. So what are named entities? Persons, locations, organizations, dates and times, and amounts. The military interest becomes real when considering a sentence like: "Karadzic was seen in Krajina last Sunday" (I'll check the visitor stats later to find out whether "they" are monitoring my blog).
For some reason or another I thought the "named entity" problem (according to the above description) had been solved. Reviewing recent literature, this is not the case: recall and precision are generally higher than 80%, but some small experiments revealed that such scores are not unattainable with relatively naive algorithms. I used three corpora to find names of people: a corporate forum (84%), a mailing list with calls for papers (79%), and the BlogWalk weblogs (93%). Of these the mailing list is tricky, as it contains pseudo-sentences like: "Fred Blogs Joe GoodFellow ..." (as members of the programme committee, for example). The weblog corpus scores high because bloggers are very precise in how and what they write. Most of the articles evaluating named entity algorithms are based on well-controlled sources (e.g. newspaper clippings). I wonder how well these systems work on uncontrolled sources like a corporate forum or mailing lists?
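To make "relatively naive" a bit more concrete, here is a minimal sketch of one possible heuristic. It is not the algorithm used in the experiments above (that is not described here). It simply assumes that a person name is a run of two or more capitalised words, and the names `NAME_PATTERN`, `find_names` and `precision_recall` are made up for this illustration. It also shows how precision and recall would be computed against a hand-labelled list of names.

```python
import re

# Assumption for illustration only: a "name" is two or more consecutive
# capitalised words. The first word must be plain Titlecase; later words
# may contain inner capitals (e.g. "GoodFellow").
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][A-Za-z]+)+\b")


def find_names(text):
    """Return every run of capitalised words found in the text."""
    return [match.group(0) for match in NAME_PATTERN.finditer(text)]


def precision_recall(predicted, gold):
    """Compare predicted names with hand-labelled (gold) names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# A well-formed sentence works fine.
print(find_names("I talked to Fred Blogs yesterday."))
# ['Fred Blogs']

# The programme committee pseudo-sentence is glued into one "name".
print(find_names("Fred Blogs Joe GoodFellow ..."))
# ['Fred Blogs Joe GoodFellow']

# Single-word names are missed entirely by this heuristic.
print(find_names("Karadzic was seen in Krajina last Sunday"))
# []
```

The second and third examples hint at why a calls-for-papers mailing list is harder than a weblog: without sentence structure there is nothing to tell the heuristic where one name ends and the next begins.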