While searching for a reasonable approach to creating Visual Settlements based on language rather than linkage, I considered the following idea. In Information Retrieval, inverse document frequency (the IDF in the TF/IDF algorithm) is often used to find "unique documents". In a (virtual) community there is a shared interest, but there will also be "personal" differences, and the idea of inverse document frequency might come in handy for identifying the uniqueness of a single blog within the community.
Partial motivation came from a paper on identifying virtual communities using linking structures. One of the results reported in the paper was that more than 5000 sites had a common link: namely a link to LaTeX to HTML (i.e. the software that had generated the site). It goes to show that automated analysis has its flaws. But as long as these flaws generate funny results, who cares?
Below are the results of the exercise. The procedure was as follows:
- Sigmund was run on the blogs involved to generate the terms used.
- From the Sigmund terms generated, all terms that occurred infrequently or very frequently were removed. "Infrequently" was defined as a frequency of less than sqrt(NoPosts)/2 and "very frequently" as a frequency of more than sqrt(NoPosts)*2, where NoPosts is the number of posts in the blog.
- Finally, all remaining terms that appeared in at least two blogs in the community were removed.
The result should be a list of terms, per blog, which defines that blog's uniqueness within the community. Several disclaimers can be made. Some blogs are mainly in languages other than English; I have manually deleted the non-English terms that crept in (but of course the results will still be skewed because of the procedure above).
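For readers who want to try something similar, the filtering procedure above can be sketched roughly as follows. This is only an illustration, not Sigmund's actual code: the input format (a dictionary holding the number of posts and a term-to-frequency table per blog), the function name `unique_terms` and the toy data are all assumptions, and "frequency" is taken to be a simple count.

```python
import math
from collections import Counter


def unique_terms(blogs):
    """Return, per blog, the terms that survive both filtering steps.

    `blogs` maps a blog name to (number_of_posts, {term: frequency}).
    This input format is hypothetical; it is not Sigmund's output format.
    """
    kept = {}
    for name, (n_posts, term_freq) in blogs.items():
        low = math.sqrt(n_posts) / 2   # below this: "infrequent"
        high = math.sqrt(n_posts) * 2  # above this: "very frequent"
        kept[name] = {t for t, f in term_freq.items() if low <= f <= high}

    # Count in how many blogs each surviving term still appears.
    blog_count = Counter(t for terms in kept.values() for t in terms)

    # Drop every term that appears in at least two blogs.
    return {
        name: sorted(t for t in terms if blog_count[t] < 2)
        for name, terms in kept.items()
    }


# Toy data, invented purely for illustration.
example = {
    "a": (16, {"chess": 3, "blog": 40, "ontology": 2, "community": 4}),
    "b": (25, {"community": 5, "tea": 3, "blog": 60, "rare": 1}),
}
result = unique_terms(example)
print(result)  # {'a': ['chess', 'ontology'], 'b': ['tea']}
```

In the toy data, "blog" is dropped as too frequent, "rare" as too infrequent, and "community" because it survives in both blogs.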
No conclusions, as usual.
Alex Halavais: agent, assignment, association, authority, campus, candidate, citizen, communication technology, grad student, grade, graduate student, guest, journalism, nation, peer, period, porn, slashdot, terrorist, undergrad, undergraduate, venue, web page, wikipedia.
Anjo Anjewierden: blog spider, chess, ontology, sentence, sigmund, text analysis, vocabulary.
Lilia Efimova: aggregation, artifact, awareness, blog community, blog conversation, blog network, blog reading, blog research, bookmark, contribution, discovery, energy, integration, mode, moscow, news aggregator, overlap, overview, personal information management, personal knowledge management, phd research, proceedings, progress, relation, rhythm, setting, skype, slide, theme, training, visualization.
Piers Young: ancient, british, cup, fame, flag, rhetoric, scientist, tea, treat.
Ton Zijlstra: addendum, barrier, blogwalk meeting, collective, corner, dialogue, enschede, neighbour, open space, parallel, poster, presenter, schedule, small scale, vienna, wiki page.
Carla Verwijs: marathon, submission, virtual community.
Andy Boyd: community of practice, furniture make, km europe, km work, shell, transfer, wale.
Elmine Wijnia: communicative action, habermas theory, ideal speech situation, masterthesis, passage, teacher.
Jill Walker: electronic literature, grant, hypertext, new media, wireless.
Jeremy Aarons: australia, brief, bureau, epistemology, flaw, forecast, forecasters, justification, keynote, knowledge management research, melbourne, meteorology, methodology, monash university, positivism, positivist, relevance, specification, task based knowledge management, weather forecast, weather.
Janine Swaak: knowledge animal, selfishness, territory.
Marc Canter: bet, buddy, cable, calendar, credit, dan, digital lifestyle aggregation, digital lifestyle aggregator, director, dude, editor, enterprise, fee, foafnet, founder, fuck, hire, hook, macromedia, manager, marqui, module, open standard, paris, plug, rap, really simple syndication feed, ship, shot, stream, suck, trieste, vancouver, video clip, wall.
Matt Mower: firefox, iraq, java, mozilla, salon, terrorism.
Paolo Valdemarin: folder.
Torill Mortensen: computer game, dark, gamers, husband, lunch, padding, solid.
Peter Caputa: advertiser, beat, blogdex, boston, buck, cheap, competitor, directory, ebay, eurekster, feedster, gmail account, hotmail, marketer, minute abs, overture, permission, promotion, purchase, search engine, search result, social software blog, statement, syndicate, toolbar, waypath, weblogsinc.
Riccardo Cambiassi: dive, italy, layer, natural interaction, plugin, procedure, proof, smartmobs, speed, toy.
Suw Charman: bolt, business blogging, geek, irc, laptop, marketing blog, subethaedit, vodka, water.
Nancy White: blend, coach, distribute community, grey, on line community, on line facilitation, on line group, on line interaction, peek, principles, sector, telephone, thread, web based.
Jim McGee: book challenge, chicago, congratulations, pointer, productivity, trick.
Stephanie Hendrick: humlab, jokkmokk blog, master thesis, mental space, mylookingglass, proposal, research blog, social network analysis, sweden, warm.
Danah Boyd: abuse, adult, battle, blog entry, blow, california, critique, dance, drink, earth, everyday life, fight, frame, frustration, gay, gender, hair, intention, joy, land, laugh, lesson, mail list, nytimes, pain, parent, phenomenon, population, pressure, privilege, production, refuse, responsibility, scream, smile, sms, status, stranger, super, tear, technologist, tendency, upset, violence, wake, yasns.
This and your earlier post on visual settlements is really intriguing. I'm not sure I've wrapped my head around it, but am looking forward to following your work now that I've found you!
Posted by: Nancy White | January 10, 2005 at 04:31 AM
Hi Anjo,
I arrived here via a link from Alex Halavais' site. I think this is an absolutely fascinating little project you've done here. With Six Apart's recent acquisition of LiveJournal http://www.sixapart.com/pronet/2005/01/professional_ne.html , I think of how cool it would be to apply this type of analysis to LJ communities (assuming it hasn't already been done). Six Apart might be game for something like this, who knows!
Posted by: tom sherman | January 10, 2005 at 07:07 AM
Very interesting! Are you making Sigmund available so we can run this on our own blogs as well?
Posted by: Elin | January 10, 2005 at 04:18 PM
This is fantastic! Please keep me posted on where you go with this!
Posted by: zephoria | January 14, 2005 at 07:16 AM
It is my intention to keep track of my working life on this blog. Does not always work out that way!
Contact Lilia and/or myself if there is anything specific. (Also applies to others, of course).
anjo science uva nl (insert @ and some dots).
Posted by: Anjo | January 14, 2005 at 11:44 PM
So I only use either very original words or very common words?
Posted by: Seb | January 15, 2005 at 06:02 AM
Seb, Had a quick look under the hood and it appears you are using the same words as others semi-frequently. So, blend well with this community ...
Posted by: Anjo | January 15, 2005 at 03:10 PM
I'm still working at getting chocolate to show up on my list! ;-)
Posted by: Nancy White | February 06, 2005 at 10:40 PM
