Erik Borra, a student at the University who started a project on Tracking Election Issues in weblogs, pointed me to a number of papers on spidering weblogs (or, more generally, the content of news websites):
Davi de Castro Reis, Paulo B. Golgher, Altigran S. da Silva and Alberto H. F. Laender. Automatic web news extraction using tree edit distance. Proceedings of the 13th international conference on World Wide Web, New York, May 2004.
Tomoyuki Nanno, Yasuhiro Suzuki, Toshiaki Fujiki and Manabu Okumura. Automatic Collection and Monitoring of Japanese Weblogs. WWW 2004 Workshop Weblogging Ecosystem: Aggregation, Analysis and Dynamics, New York, May 2004.
The most interesting difference between the two is that the first applies a top-down approach based on established principles, whereas the second is bottom-up and heuristically oriented. Both papers report recall/precision figures in the 85% range. The preliminary conclusion is therefore that the problem of spidering weblogs for the posts they contain has not yet been solved.
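The core intuition shared by both papers is that pages from the same site share a template, and the parts that *vary* between pages are the actual posts. As a minimal sketch of that idea (not the actual tree-edit-distance algorithm of Reis et al., and with a hypothetical `Node` stand-in for a parsed DOM tree), one can compare two page trees and keep only the subtrees that differ:

```python
from dataclasses import dataclass, field

# Hypothetical minimal stand-in for a parsed DOM node; a real spider
# would build these from an HTML parser.
@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

def diff_content(a: Node, b: Node, out: list) -> None:
    """Collect text from subtrees that differ between two pages from
    the same site; identical subtrees are assumed to be template
    boilerplate and are skipped."""
    if a.tag != b.tag or len(a.children) != len(b.children):
        # Structurally different: treat the whole subtree as content.
        out.append(a.text)
        return
    if a.text != b.text:
        out.append(a.text)
    for ca, cb in zip(a.children, b.children):
        diff_content(ca, cb, out)

# Two pages sharing a template (the h1) but with different post bodies.
page1 = Node("html", children=[Node("h1", "My Blog"),
                               Node("div", "Post about elections")])
page2 = Node("html", children=[Node("h1", "My Blog"),
                               Node("div", "Post about spidering")])

found = []
diff_content(page1, page2, found)
print(found)  # only the varying part of page1, i.e. its post body
```

This toy version breaks as soon as the template itself varies (rotating ads, date stamps), which is exactly why the papers need either tree edit distance or heuristics rather than exact subtree matching.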