Blog Research Repository
Having wondered what would happen if I made Sigmund public, I have come to the conclusion that it is not as straightforward as it seems. The spidering problem, mentioned in the same post, is only a technical glitch. Matt Mower has addressed it in Making blogs spider friendly, although his approach requires the cooperation of the blogger. I have the skills to solve the spidering problem technically and have already got very far, but anyone who succeeds before I return from holiday (October 25th) will get a citation in every blog-related paper I write until infinity.
Let us imagine the blog spider exists and we want to do research on a set of blogs. What would the consequences be? First of all, for a 1000-post blog we have to retrieve 1000 x 50 KB = 50 MB of data. The blog spider reduces this to about 5 MB by removing all the redundant formatting and so forth, leaving just the "raw" content of the posts. Under ideal circumstances, this removal of redundancy should obviously be performed only once. That is, blog researchers and potential users of tools like Sigmund need a repository that converts blogs into the appropriate format.
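To make the reduction step concrete, here is a minimal sketch in Python of what "removing the redundant formatting" could look like. It is not how Sigmund or any existing spider works; the names `PostTextExtractor`, `strip_formatting` and `estimate_megabytes` are hypothetical, and it assumes posts arrive as HTML and that 1 MB is counted as 1000 KB, as in the arithmetic above.

```python
from html.parser import HTMLParser


class PostTextExtractor(HTMLParser):
    """Collects visible text, ignoring markup and script/style contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse the whitespace left by removed markup.
        # Simplification: text split by inline tags mid-word gains a space.
        return " ".join(" ".join(self._chunks).split())


def strip_formatting(html):
    """Return the 'raw' text content of one HTML page (hypothetical helper)."""
    parser = PostTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def estimate_megabytes(posts, kilobytes_per_post):
    """Rough download size, assuming 1 MB = 1000 KB as in the post."""
    return posts * kilobytes_per_post / 1000


if __name__ == "__main__":
    page = "<html><body><p>Hello <b>blog</b> world</p></body></html>"
    print(strip_formatting(page))
    print(estimate_megabytes(1000, 50), "MB before stripping")
    print(estimate_megabytes(1000, 5), "MB after stripping")
```

A real spider would also have to separate the post itself from sidebars, blogrolls and comment forms, which this sketch does not attempt. The point is only that the stripping is mechanical and would be the same for every researcher, which is why doing it once in a shared repository makes sense.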
There are companies that have created a repository like this. Creating such a repository for everyone to tap into seems extremely valuable. Can this be organised?
Would love to talk to you about this at some point. There are a bunch of folks who may be interested in such a repository, and some early attempts (e.g., the blog census). The issue is that this would have to be updated with some frequency (4x a year?), but I think the community would still benefit from a sample that has been investigated by others.
Posted by: Alex Halavais | October 19, 2004 at 03:56 PM
Cameron Marlowe at MIT is working on something like this called "up flux"
Posted by: Christina Pikas | November 22, 2004 at 09:31 PM