Given that weblogs are now becoming an object of research for social and computer scientists alike, it is perhaps an idea to look for a model (I hate the word).
One model element some think is important is size. The Weblogging Ecosystem workshop (Edinburgh 2006) is running a data challenge:
Much of the interest in research relating to weblogs involves the analysis of large quantities of data. As part of this workshop, we are very excited to provide a data set to the research community. The aim is to encourage the use of this data to focus the various views and analyses of the blogosphere over a common space...
Apart from stating that the data set consists of 10M posts, little more is said about what potential participants are supposed to do with it. Throwing away 99.99% of the posts would be a good starting point, I think.
Another conference (not on the web yet), originating from text analysis, is also offering a data set of weblog posts (4 to 40 gigabytes), and one of the organisers asked me to come up with interesting research questions related to this data set for their challenge.
What is really challenging regarding weblog research is to come up with ideas, analyses and tools that are relevant to bloggers and that take into account the social aspect of blogging. Suppose that someone, after analysing the data sets above, came up with the observation that 7.6% of the posts start with the word "Today" or that the average post is linked to by 0.03 other posts. Is that interesting?
A model for weblog research may have to take into account all of the following five dimensions:
- Person. Trivial. Blogs are written by real people.
- Document. Trivial. Posts are part of blogs.
- Link. The above two dimensions are very general; blogs (and other online material) add the notion of an explicit link. Links can be social or informative, and they can hint at communities. They are very important for weblog research.
- Term (Concept). Documents (posts) contain sets of terms (concepts). In the future it will perhaps be possible to add even more semantics.
- Time. Not so trivial, and a dimension many researchers would like to ignore.
These dimensions are always in the back of my mind when thinking about weblog research and its potential applications. The dimensions can also serve as a simple tool to classify existing blog services. For example, Technorati caters for Person x Link, Google provides some assistance for Document x Term and Bloglines is Person x Document.
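For readers who prefer code to prose, here is a minimal sketch in Python of how the five dimensions might be written down as a record, and how the service classification could be used to see which pairs of dimensions nobody covers yet. The field names, the `Post` record and the `uncovered_pairs` helper are all made up for illustration; only the three service classifications come from the paragraph above.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations

# The five dimensions of the model, in the order listed above.
DIMENSIONS = ("Person", "Document", "Link", "Term", "Time")


@dataclass(frozen=True)
class Post:
    """One weblog post seen along all five dimensions (hypothetical layout)."""
    author: str          # Person: who wrote it
    url: str             # Document: the post itself
    links: tuple         # Link: URLs of the posts it points to
    terms: frozenset     # Term: words or concepts it contains
    published: datetime  # Time: when it appeared


# Services classified by the pair of dimensions they cater for,
# exactly as stated in the text.
SERVICES = {
    "Technorati": ("Person", "Link"),
    "Google": ("Document", "Term"),
    "Bloglines": ("Person", "Document"),
}


def uncovered_pairs(services):
    """Return the dimension pairs that no listed service covers."""
    covered = {frozenset(pair) for pair in services.values()}
    return [
        pair
        for pair in combinations(DIMENSIONS, 2)
        if frozenset(pair) not in covered
    ]


if __name__ == "__main__":
    for pair in uncovered_pairs(SERVICES):
        print(" x ".join(pair))
```

With only the three services named here, every pair involving Time is left open, which fits the remark that Time is the dimension many would like to ignore.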
It is actually possible to compress all five dimensions into a single visualisation when we forget about millions of posts and gigabytes of data. Below is a visualisation of a weblog conversation. The dimensions are visualised as follows:
- Person. Top to bottom.
- Document. Coloured rectangles.
- Link. Lines between the rectangles.
- Term. Wavering blue line. This one depicts usage of the term "blog research" over time.
- Time. Left to right.
(Details about how the image was constructed are omitted. My fellow researchers think it is so nice we should write a paper on it [and submit it to the Weblogging Ecosystem workshop]).