Supporting Blog Research
After a short detour into the domain of radar systems I'm back in blog research mode, or more specifically, working out how to create an infrastructure (platform) to support blog research. Blog research seems to center around the following themes:
- Communities. Also called "virtual settlements"; see the recent paper by Lilia Efimova and Stephanie Hendrick.
- Conversations. A set of posts, distributed over several weblogs, that relate to a particular topic.
- Language analysis. Analysis of the vocabulary used in a weblog, for example to classify favourite topics of the blogger. Sigmund is an example.
Supporting research on these themes requires different kinds of information from weblogs. Communities mainly require link data, Conversations additionally require shallow text analysis of particular posts, and Language analysis obviously requires (all) full posts.
The question therefore is whether it is possible to create a Blog Research Repository that accommodates the above themes. The data acquisition methods described in the paper by Lilia and Stephanie illustrate that blog research is by-and-large only supported by hard work and regular expressions (or tools that know what regular expressions are :-)).
Motivated by the themes, and practical considerations, the proposal is to organise the repository around the following types of data-sets:
- Structure. This is essentially the same as an RSS feed without the content of the posts, but with all links that can be found in posts.
- Content. Identical to full post RSS feeds.
- Abstractions / Aggregations. Any number of data-sets that contain abstractions or aggregations on a weblog for a particular research purpose. For example, Sigmund requires a data-set that contains the relation between terms in posts.
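To make the split between Structure and Content a bit more concrete, here is a minimal sketch in Python. The class and field names (ContentPost, StructurePost, body_html and so on) are my own assumptions for illustration, not part of any existing tool; the only point is that a Structure entry can be derived from a Content entry by dropping the body and keeping the links found in it.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href values of <a> tags in the order they appear."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


@dataclass
class ContentPost:
    """Content data-set entry: the full post, as in a full-post feed."""
    url: str
    title: str
    date: str
    body_html: str


@dataclass
class StructurePost:
    """Structure data-set entry: feed metadata plus links, no body."""
    url: str
    title: str
    date: str
    links: list = field(default_factory=list)


def to_structure(post: ContentPost) -> StructurePost:
    """Derive the Structure entry by parsing links out of the post body."""
    collector = LinkCollector()
    collector.feed(post.body_html)
    collector.close()
    return StructurePost(post.url, post.title, post.date, collector.links)
```

Using an HTML parser rather than regular expressions is a choice made for this sketch; it tends to cope better with attribute order and quoting, but real weblog markup may still need extra cleaning.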
The practical considerations regarding efficiency are that the structure can be kept in memory for a fairly large set of weblogs (say 10,000), the content can be retrieved from disk on demand, and the abstractions can be defined on-the-fly.
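A rough sketch of how these three access patterns could sit together is given below. BlogRepository and its methods are hypothetical names, the one-JSON-file-per-post layout is just an assumption to keep the example short, and post ids are assumed to be safe to use as file names.

```python
import json
from collections import Counter
from pathlib import Path


class BlogRepository:
    """Structure in memory, content on disk, abstractions computed on demand."""

    def __init__(self, content_dir):
        self.content_dir = Path(content_dir)
        self.content_dir.mkdir(parents=True, exist_ok=True)
        # post_id -> {"weblog": ..., "links": [...]}; small enough to keep in memory
        self.structure = {}

    def add_post(self, post_id, weblog, links, body):
        # Only metadata and links stay in memory; the body goes to disk.
        self.structure[post_id] = {"weblog": weblog, "links": list(links)}
        path = self.content_dir / f"{post_id}.json"
        path.write_text(json.dumps({"body": body}), encoding="utf-8")

    def content(self, post_id):
        # Read the full post from disk only when somebody asks for it.
        path = self.content_dir / f"{post_id}.json"
        return json.loads(path.read_text(encoding="utf-8"))["body"]

    def inbound_link_counts(self):
        # An example abstraction defined on-the-fly from the structure alone:
        # how many posts link to each target URL.
        counts = Counter()
        for entry in self.structure.values():
            counts.update(set(entry["links"]))
        return counts
```

An abstraction such as inbound_link_counts never touches the disk, which is the reason for keeping the structure separate from the content in the first place.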
The repository should use public standards for representation. For the structure RDF(S) appears the obvious choice. A basic structure that includes classes like weblog, post, link (etc.) provides a starting point that can be refined. Content is represented in RSS 1.0, the de facto standard. Abstractions and aggregations are represented in RDF where possible to preserve the relation to the structure and content.
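As an illustration of what the structure might look like in RDF, the sketch below writes one post as N-Triples. The namespace http://example.org/blogresearch# and the terms Weblog, Post, partOf and linksTo are placeholders I made up for this example, not an agreed vocabulary; a link is modelled here as a plain property rather than as a class of its own, which is a simplification. URLs are assumed to be absolute and free of characters that would need escaping.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BLOG = "http://example.org/blogresearch#"  # hypothetical vocabulary namespace


def triple(subject, predicate, obj):
    """Format one N-Triples statement in which all three terms are URIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def post_to_ntriples(weblog_url, post_url, links):
    """Describe a post, its weblog and its outgoing links as N-Triples."""
    lines = [
        triple(weblog_url, RDF_TYPE, BLOG + "Weblog"),
        triple(post_url, RDF_TYPE, BLOG + "Post"),
        triple(post_url, BLOG + "partOf", weblog_url),
    ]
    for target in links:
        lines.append(triple(post_url, BLOG + "linksTo", target))
    return "\n".join(lines)


if __name__ == "__main__":
    print(post_to_ntriples(
        "http://example.org/weblog/",
        "http://example.org/weblog/2004/11/post.html",
        ["http://example.net/other/post.html"],
    ))
```

Because abstractions can refer to the same post URIs, they stay connected to the structure and the content without any extra bookkeeping.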
I think the above is a reasonable approach to address the weblog research support problem in a principled manner and colleagues seem to agree. Further challenges are: (1) How to turn weblogs on the web into the above representation (structure and content); (2) Visualisation. More later ...
Anjo, you may want to look at the presentation from http://costarica.cs.northwestern.edu/bmd/blogs/nmh/archives/000797.html
L.
Posted by: Lilia | November 22, 2004 at 06:20 PM