Blog Research Repository

After wondering what would happen if I made Sigmund public, I have concluded that it is not as straightforward as it seems. The spidering problem, mentioned in the same post, is only a technical glitch. Matt Mower has addressed it in Making blogs spider friendly, although his approach requires the cooperation of the blogger. I have the skills to solve the spidering problem technically and have already got very far, but anyone who succeeds before I return from holiday (October 25th) will get a citation in every blog-related paper I write until infinity.

Let us imagine the blog spider exists and we want to do research on a set of blogs. What would the consequences be? First of all, for a 1000-post blog we have to retrieve 1000 x 50 KB = 50 MB of data. The blog spider reduces this to about 5 MB by removing the redundant formatting and so forth, leaving just the "raw" content of the posts. Ideally, this removal of redundancy should be performed only once. That is, blog researchers and potential users of tools like Sigmund need a repository that turns blogs into the appropriate format.
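To make the size argument concrete, here is a minimal Python sketch of the "strip the formatting, keep the raw content" step. It is not how Sigmund or any existing spider works; the names `PostTextExtractor`, `extract_raw_content` and `estimated_download_mb` are made up for illustration. It also assumes that dropping all markup, scripts and stylesheets is a fair stand-in for "removing the redundant formatting". A real spider would additionally have to separate the post body from sidebars and navigation.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._chunks = []      # text fragments found so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_raw_content(html):
    """Return the text of an HTML page with all markup removed (hypothetical helper)."""
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def estimated_download_mb(posts, kb_per_post):
    """Back-of-the-envelope size estimate, using 1 MB = 1000 KB."""
    return posts * kb_per_post / 1000


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head>"
        "<body><p>Hello <b>blog</b></p><script>var x=1;</script></body></html>"
    )
    print(extract_raw_content(page))
    print(estimated_download_mb(1000, 50), "MB before stripping")
    print(estimated_download_mb(1000, 5), "MB after stripping (the post's estimate)")
```

The point of a shared repository is that this extraction would run once per post and the stripped result would be stored, instead of every researcher downloading and cleaning the same 50 MB again.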

Some companies have already created a repository like this. A repository that everyone can tap into seems extremely valuable. Can this be organised?

Comments

Would love to talk to you about this at some point. There are a bunch of folks who may be interested in such a repository, and some early attempts (e.g., the blog census). The issue is that this would have to be updated with some frequency (4x a year?), but I think the community would still benefit from a sample that has been investigated by others.

Cameron Marlowe at MIT is working on something like this called "up flux"
