Most of the text analysis research I do relies heavily on statistics, and on natural language processing (NLP) where necessary. There is an excellent textbook that saves an awful lot of googling: "Foundations of Statistical Natural Language Processing" by Christopher D. Manning and Hinrich Schütze (MIT Press, 1999).
Chapter 10 of the book is about part-of-speech (POS) tagging. The objective of POS tagging is to assign a word class (noun, verb, etc.) to each word. I have often wondered why the first step in just about every text analysis tool is POS tagging. First, if the POS tagger makes a mistake, it can never be corrected in later stages (and even the best taggers make a mistake in every sentence, on average; the sketch after the quote shows what I mean). Second, how is the tagging information supposed to be used anyway? Manning and Schütze:
The widespread interest in tagging is founded on the belief that NLP applications will benefit from syntactically disambiguated text. Given this ultimate motivation for part-of-speech tagging, it is surprising that there seem to be more papers on stand-alone tagging than on applying tagging to a task of immediate interest.
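To make the error-propagation point concrete, here is roughly what that ubiquitous first step looks like; a minimal sketch using NLTK's pos_tag (the "Time flies" sentence is my own stock example of the ambiguity I mean, and the exact tags you get depend on the tagger version):

```python
# Minimal POS-tagging sketch with NLTK (resource names are NLTK's current ones).
import nltk

nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

tokens = nltk.word_tokenize("Time flies like an arrow.")
print(nltk.pos_tag(tokens))
# Typically something like:
#   [('Time', 'NN'), ('flies', 'VBZ'), ('like', 'IN'),
#    ('an', 'DT'), ('arrow', 'NN'), ('.', '.')]
# 'flies' could just as well be a noun ("time flies" as insects);
# once the tagger commits to VBZ, every later stage inherits that choice.
```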
I use a pattern-matching language called AIML for NLP (I'm a co-founder of aiml.info, a site that offers a large collection of tools for, and information about, AIML), and I've often wondered about POS tagging in the same way. I understand the usefulness of the POS concept for theoretical linguistics, but writing code that specifically looks for nouns and verbs as a form of preprocessing is something I've always found pointless.
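For anyone wondering what matching without a tagging step looks like, here is a minimal sketch using the PyAIML interpreter (the `aiml` Python package; the category below is a toy of my own, not one of the sets on aiml.info):

```python
# Minimal AIML pattern-matching sketch via PyAIML: the pattern matches
# surface words directly; no word classes are assigned anywhere.
import aiml

# A single illustrative category (toy example, not from aiml.info).
RULES = """<?xml version="1.0" encoding="UTF-8"?>
<aiml version="1.0">
  <category>
    <pattern>WHAT IS POS TAGGING</pattern>
    <template>Assigning a word class, such as noun or verb, to each word.</template>
  </category>
</aiml>
"""

with open("toy.aiml", "w") as f:
    f.write(RULES)

kernel = aiml.Kernel()
kernel.learn("toy.aiml")                       # compile patterns into the match tree
print(kernel.respond("What is POS tagging?"))  # prints the template text above
```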
Posted by: Dirk Scheuring | April 20, 2005 at 11:09 PM