Topics: on and off
Back to the topic of topics. For the time being I'm convinced that the only reasonable way to think about the notion of a topic is to consider it as a set of terms. In other words, a document is about a topic if it contains (many of) the terms associated with the topic.
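To make the "set of terms" reading concrete, here is a minimal sketch. The function names, the coverage measure and the 0.3 threshold are my own assumptions for illustration; "many of" is not pinned down any further in this post.

```python
def topic_coverage(document_terms, topic_terms):
    """Fraction of the topic's terms that occur in the document."""
    topic = set(topic_terms)
    if not topic:
        return 0.0
    return len(topic & set(document_terms)) / len(topic)


def is_about(document_terms, topic_terms, threshold=0.3):
    """Treat a document as 'about' a topic if it covers enough topic terms.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return topic_coverage(document_terms, topic_terms) >= threshold
```

A document containing two of four topic terms gets a coverage of 0.5 and would count as being about the topic under this (arbitrary) threshold.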
Under this interpretation, it actually becomes possible to derive a topic from an arbitrary corpus of documents. The topic search starts from a cue term (the label of the topic) and returns a list of the N terms that best "fit" the topic.
An example. The cue term is skype; the top 30 terms describing this topic (extracted from a certain corpus) are: skype, caller, cellphones, pstn, telephony, skype call, landline, instant messaging system, dial, pda, conference call, buddy list, voice mail, skypeout, presence, voip, sound quality, interconnect, messenger, skype api, long distance, conferencing, headset, sip, premium, telecom, pipe, msn, telecoms, telephone.
This looks pretty "accurate". The order is by decreasing association to the cue term.
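How "association" is measured is not spelled out here, so the following is only a sketch of one way a cue-term search could work. It assumes each document has already been reduced to a list of terms, and it swaps in a simple document-level Jaccard overlap as the association score. The function name `top_terms` and the choice of Jaccard are assumptions, not the method used to produce the lists in this post.

```python
from collections import Counter


def top_terms(documents, cue, n=30):
    """Return the n terms most associated with `cue`, in decreasing order.

    Association is the Jaccard overlap between the set of documents that
    contain a term and the set of documents that contain the cue
    (an assumed measure, chosen for simplicity).
    """
    docs = [set(d) for d in documents]
    doc_freq = Counter(t for d in docs for t in d)
    cue_docs = [d for d in docs if cue in d]
    if not cue_docs:
        return []
    # How often each term shares a document with the cue.
    joint = Counter(t for d in cue_docs for t in d)
    cue_freq = len(cue_docs)
    scores = {
        term: both / (doc_freq[term] + cue_freq - both)
        for term, both in joint.items()
    }
    # Highest score first; the cue wins ties, then alphabetical order.
    return sorted(scores, key=lambda t: (-scores[t], t != cue, t))[:n]
```

The cue term always scores 1.0 against itself, which is why it heads the list, just as in the example above.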
A slightly vaguer cue term is knowledge worker: knowledge worker, personal productivity, knowledge work, personal content management, dave snowden, routine, knowledge transfer, cynefin, front line, information system, jack vinson, personal information management, good tool, personal knowledge management, sequence, dissemination, knowledge management system, collaborator, information overload, sheet, co-ordination, information management, management tool, association of knowledgework, surroundings, workforce, stephanie, effectiveness, databases, nowledgeboard.
Given that the corpus is a collection of knowledge management (KM) blogs, the results are "reasonable".
A more focussed term within KM would be semantic web: semantic web, module, ontology, rdf, owl, rogier, spider, service provider, take advantage, critical mass, metadata, sigmund, demonstration, friend of a friend, geeks, late post, tim berners lee, aggregators, disc, mechanic, web designer, silicon, multimedia, label, demo, richard, restriction, text analysis, tim, api.
Some noise, but still. Next try, a cue term that has nothing to do with KM. vegetarian: vegetarian, meat, diet, substitute, agriculture, grain, famine, pesticide, biodiversity, hunter gatherer, beef, human population, population growth, conservation, wilderness, chemical, poison, predator, ingredient, toxic, plastic, food production, cow, chicken, squirrel, denial, cream, riot, cage, kilo.
Pretty "amazing", given that the terms for vegetarian are extracted from a collection of KM blogs. Another example of off-topicness. bush: bush, patriot act, civil liberty, court, voter, neocons, litigation, opposition, john kerry, republican, welfare, abortion, gay, immigrant, vote, americans, swing, candidate, iran, gun, iraqis, mob, kerry, ballot, tax cut, county, legislation, province, charter, delay.
In a recent paper, Bettina Berendt and Roberto Navigli (Finding your way through blogspace: Using semantics for cross-domain blog analysis. Proceedings of the AAAI 2006 Symposium on Computational Approaches to Analysing Weblogs) addressed the issue of topic detection in weblogs. I'll not go into details about the paper itself, but I have the impression that the above confirms that bloggers have an on-topic and an off-topic mode. The off-topic mode makes automatic classification of an entire weblog (or set of weblogs) extremely hard, even though off-topic bloggers still seem to use terms that are pretty appropriate and that are shared by bloggers from other communities who are also off-topic.
A nice thing to think about: even when bloggers are off-topic they make some sense. Or so it seems.