I'm currently dealing with chat data from an educational setting. Statistics reveal that there are 6150 distinct words in these chats, 3157 of which are not known to the dictionary. More than 50% of the words are unknown! To be able to do anything with these chats they need to be "cleaned". Wilson Wong and his co-workers have come up with a method for cleaning dirty texts which is the first non-trivial attempt I have seen to attack the problem. Kudos. Matthew Hurst, jokingly on April 1st, quotes the boss of Google: Our spelling correction...is an example of AI. Matthew sounded a little bit cynical, but isn't spelling correction a fundamental step in connecting language and meaning? If all of that wonderful technology stumbles on something as trivial as a spelling error, what is left?
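The coverage statistic itself is easy to reproduce. Here is a minimal sketch of how such a dictionary-coverage check might look; the tiny `dictionary` and `chats` below are hypothetical stand-ins for a real word list and the real chat data:

```python
# Sketch: measuring dictionary coverage of a chat corpus.
# Both collections are toy stand-ins, not the actual data.
dictionary = {"ik", "heb", "geen", "idee"}        # stand-in for a real Dutch word list
chats = ["k hbe geni dee", "ik heb geen idee"]    # stand-in for the chat data

# Collect the distinct words, then see which ones the dictionary knows.
tokens = {w for line in chats for w in line.lower().split()}
unknown = {w for w in tokens if w not in dictionary}

print(f"{len(tokens)} distinct words, {len(unknown)} unknown "
      f"({100 * len(unknown) / len(tokens):.0f}%)")  # 8 distinct words, 4 unknown (50%)
```

With the real data this same count gives the 3157-out-of-6150 figure above; the hard part is not measuring the problem but fixing it.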
I'm applying Wong's ideas to my data and the results are encouraging, but not as spectacular as the 96%+ correct corrections he reports. Perhaps it is my data :-), perhaps it is an imperfection in his method.
One of the most challenging examples in my data is k hbe geni dee (in Dutch). All four words are misspelled. The correct spelling is: ik heb geen idee, which translates as: I have no idea, something social researchers of educational data find significant.
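To see why this example is hard, consider a naive word-by-word corrector that snaps each token to its nearest dictionary entry by string similarity. This is a baseline sketch, not Wong's method, and the tiny `vocab` is a hypothetical stand-in for a real Dutch word list:

```python
import difflib

# Toy vocabulary (hypothetical stand-in for a real Dutch dictionary).
# "genie" (genius) and "mee" (along/with) are real Dutch words included
# to show how a context-free corrector can be led astray.
vocab = ["ik", "heb", "geen", "idee", "genie", "mee"]

def correct_word(word):
    # Return the closest dictionary entry, or the word itself if none is close.
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=0.6)
    return matches[0] if matches else word

corrected = [correct_word(w) for w in "k hbe geni dee".split()]
print(" ".join(corrected))  # prints "ik heb genie idee", not "ik heb geen idee"
```

The corrector fixes "k" and "hbe" fine, but "geni" snaps to "genie" rather than "geen": correcting one word at a time, with no context, there is no way to prefer the reading geen idee, and that is exactly what makes multi-word errors harder than single-word typos.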
Conclusion: spelling correction is taken seriously by researchers and by tools like Google. I think this is good news. Google performs well on single-word errors and poorly on multi-word errors, and it only corrects the query, not the results.