This post summarizes the architecture and status of BlogTrace. BlogTrace is an interactive environment that supports weblog research; it includes Sigmund. In the months to come I hope to address several of the issues raised below, and this post also serves as a single point of reference for BlogTrace should I decide to blog about it in the future :-).
Supporting Blog Research is the post that first outlined the idea of BlogTrace (I'm surprised this is dated November 18th, 2004). Since then it has materialised and Visual Settlements, Making a Difference and Knowledge Flows have been developed on top of BlogTrace.
One of the outcomes of our research planning is that resources to develop BlogTrace further have been made available. Many thanks to Telematica Instituut (in particular Janine Swaak) for that. Sigmund is moving from "weblogs only" to other types of text documents (email boxes, communities of practice, knowledge mapping, etc.) and is now part of my regular research (text analysis to support the development of ontologies).
The main motivation for BlogTrace comes from Lilia's Ph.D. research and the knowledge flows work with Robert de Hoog and Rogier Brussee. Lilia, more or less, defines what is given the highest priority in terms of functionality. If you have ideas related to BlogTrace you should contact her or both of us.
Although BlogTrace already contains some interesting features, there are several reasons not to make it widely available right now (see the overview below). The medium-term objective is to make it available as Open Source once the usability is at an acceptable level. At the moment an alpha version of BlogTrace is available to weblog researchers in my immediate research neighbourhood.
Before delving into detail I would like to thank everybody who blogged about, linked to, commented on and otherwise showed an interest in BlogTrace and its parts. Special thanks to Lilia (also for going on holiday for a month, that helped :-)).
The BlogTrace architecture uses RDF(S) for representation, which opens up an opportunity to define additional abstractions and to use the data outside of BlogTrace (the details of why this design decision is really important are omitted here). The architecture is shown in the image (taken from ).
Weblog spider. The weblog spider takes the URL of a weblog as input. It then locates the RSS feed and the archives. Locating the RSS feed is hardly ever a problem, but some weblogs do not have an (explicit) link to the archives on the home page; such weblogs cannot be spidered. Next, the spider extracts the archives and, using a set of heuristics and induction algorithms, determines the individual posts. These posts are written out as an RSS 1.0 feed containing the full text. Issues:
- Quality. How accurate is the spider? This needs to be evaluated.
- Date stamps. The most common mistake is failing to identify the correct date stamp of a post.
- Non-clean HTML. Some HTML is not clean: it does not conform to the standard, or it uses Windows-specific characters.
- Performance. The induction algorithms are unpredictable in how long they take to find the patterns that define the posts. For small sets of weblogs this is not a problem if the process is started shortly before going to bed.
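The post-boundary induction step could be sketched as follows. This is a toy heuristic, assuming posts in an archive page are delimited by date-stamped lines; the regex and splitting strategy are my own illustration, not BlogTrace's actual algorithm:

```python
import re

# Assumption for illustration: a post starts at a line containing a
# date stamp of the form "Month D, YYYY". Real archives vary wildly,
# which is exactly why BlogTrace needs induction over many patterns.
DATE_RE = re.compile(
    r"(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}"
)

def split_posts(archive_text):
    """Split archive text into posts at date-stamp boundaries."""
    boundaries = [m.start() for m in DATE_RE.finditer(archive_text)]
    if not boundaries:
        return []  # no recognisable date stamps: cannot spider this archive
    boundaries.append(len(archive_text))
    # each post runs from one date stamp up to the next
    return [archive_text[boundaries[i]:boundaries[i + 1]].strip()
            for i in range(len(boundaries) - 1)]

archive = """May 3, 2005
First post text.
May 1, 2005
Second post text."""
posts = split_posts(archive)
# posts[0] starts with "May 3, 2005"
```

A real spider would induce the delimiter pattern per weblog instead of hard-coding one regex, which is also where the unpredictable running times come from.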
Links and metadata. Together with Rogier Brussee I have defined an ontology that represents the metadata of a blog and the links contained in its posts. The spider instantiates this ontology for a particular weblog. Issues:
- Uniqueness. The anchor (href) of a link often differs from the URL it ultimately resolves to. For example, "anjo.blogs.com" and "anjo.blogs.com/metis/" resolve to the same page. In many cases BlogTrace notices that the two are identical, but sometimes it does not.
- Errors. A significant number of links contain typing mistakes and other anomalies. Links, for example, sometimes contain spaces.
- Types of links. For answering some research questions it is necessary to know the "type" of the link. At the moment the following types of link targets are distinguished: self (own homepage), community homepage, own post, community post, and other. I would be interested in techniques to determine types of links.
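One way the five link-target types could be assigned is sketched below, assuming we know the blogger's own homepage and a list of community homepages. The normalisation rules and function names are illustrative assumptions, not BlogTrace's implementation (and this simple normalisation would not catch the "anjo.blogs.com/metis/" case above, which needs actual resolution):

```python
from urllib.parse import urlsplit

def normalise(url):
    """Reduce trivial variation: strip scheme, 'www.' and trailing slash."""
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.netloc or "").lower().removeprefix("www.")
    return host + parts.path.rstrip("/")

def link_type(target, own_home, community_homes):
    """Classify a link target as one of the five types from the post."""
    t = normalise(target)
    own = normalise(own_home)
    if t == own:
        return "self"
    if t.startswith(own + "/"):
        return "own post"
    for home in community_homes:
        h = normalise(home)
        if t == h:
            return "community homepage"
        if t.startswith(h + "/"):
            return "community post"
    return "other"

print(link_type("http://anjo.blogs.com/metis/2005/05/x.html",
                "anjo.blogs.com", []))
# prints "own post"
```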
Term extraction. BlogTrace extracts terms, person names and abbreviations from the posts. Issues:
- Quality. This appears to work rather well. Victor de Boer has compared the person name extraction to some other named entity recognisers and my approach seems to perform well, especially on noisy data.
- Languages. It has been tested for both English and Dutch. Support for German would be possible, although some of the algorithms probably don't work because in German all nouns are capitalised.
- Quoting. The biggest problem is that the term extractor does not know about quoting. The implication is that by quoting somebody else in a post, their terms get imported into your vocabulary. This, of course, is a bug that is non-trivial to fix.
- Abbreviations. Long forms of abbreviations are determined by looking at the posts of a single blog. This works most of the time, but in a community abbreviations sometimes travel from one blogger to another, and some mechanism for sharing abbreviations between blogs has to be introduced.
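Long-form detection of the kind described above could be sketched as follows, loosely in the spirit of the Schwartz and Hearst abbreviation heuristic. The function and its matching rules are illustrative assumptions, not BlogTrace's implementation:

```python
import re

def find_long_form(text, abbrev):
    """Look for 'long form (ABBR)' where the words immediately before the
    parenthesis start with the abbreviation's letters, one word per letter.
    (A toy rule; real heuristics allow inner-letter matches and stop words.)"""
    for m in re.finditer(r"\(" + re.escape(abbrev) + r"\)", text):
        words = text[:m.start()].split()
        candidate = words[-len(abbrev):]
        if len(candidate) == len(abbrev) and all(
            w[0].lower() == c.lower() for w, c in zip(candidate, abbrev)
        ):
            return " ".join(candidate)
    return None

print(find_long_form(
    "Research on communities of practice (CoP) is popular.", "CoP"))
# prints "communities of practice"
```

Sharing the resulting (abbreviation, long form) pairs across a community's blogs would address the travelling-abbreviation problem mentioned above.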
Dictionaries. The text analysis part of BlogTrace only works when dictionaries are available. Unfortunately, most high quality dictionaries are licensed material. I'm now in the process of rewriting parts of BlogTrace in such a way that the dictionaries can be distributed. Issues:
- Dictionary ontology. Is there an accepted RDF ontology for representing the core of a dictionary (word classes, inflections and orthography)?
- Exploiting other language resources. WordNet for example.
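In the absence of a known standard, a minimal home-grown RDF representation of a dictionary entry could look like the sketch below (Turtle syntax). The namespace and all class and property names are invented for illustration only; no accepted ontology is implied:

```turtle
@prefix dict: <http://example.org/dictionary#> .

# One lemma with its word class, orthography and two inflected forms.
dict:walk a dict:Lemma ;
    dict:wordClass   dict:Verb ;
    dict:orthography "walk" ;
    dict:inflection  [ dict:form "walked" ; dict:feature dict:PastTense ] ,
                     [ dict:form "walks"  ; dict:feature dict:ThirdSingular ] .
```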
Link analysis. This module includes functions related to links between posts (or, more generally, documents) and links between weblogs (or, more generally, people). It answers simple questions such as "how often does blogger A link to blogger B?" or "which links do bloggers A and B have in common?", but also more complicated ones such as "find all conversations (linked posts) in which at least five posts appear and at least three bloggers are involved". Issues:
- Functions. To be enumerated.
- Scalability. Text analysis does not scale well; even Google has problems with it (see, for example, More arithmetic problems at Google).
- Performance. There is a serious performance bottleneck in the "shared conceptualisations" function in Sigmund. This is caused by the quadratic nature of the algorithm, a design flaw, and the use of Prolog. At the moment it takes Sigmund 10 minutes to compute the tables (for a community of 30 weblogs), and this is clearly unacceptable.
- Other. There is no end of other issues related to linguistic analysis. But that has nothing to do with BlogTrace.
- Visualisation. Finding suitable visualisations for the data is a major (research) challenge.
- Formats. Import formats of external analysis tools.
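The two simple link-analysis questions mentioned above ("how often does blogger A link to blogger B" and "which links do bloggers A and B have in common") can be sketched over an assumed representation of links as (source blogger, target blogger) pairs. The data and function names are illustrative, not BlogTrace's API:

```python
# Assumed toy data: each pair is one link from one blogger to another.
links = [
    ("A", "B"), ("A", "B"), ("A", "C"),
    ("B", "C"), ("B", "A"),
]

def link_count(links, src, dst):
    """How often does blogger src link to blogger dst?"""
    return sum(1 for s, d in links if s == src and d == dst)

def common_targets(links, a, b):
    """Which link targets do bloggers a and b have in common?"""
    targets = lambda who: {d for s, d in links if s == who}
    return targets(a) & targets(b)

print(link_count(links, "A", "B"))             # prints 2
print(sorted(common_targets(links, "A", "B"))) # prints ['C']
```

The "find all conversations" question is a graph search over linked posts rather than a simple count, which is where the more serious scalability issues come in.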
Anjo Anjewierden, Rogier Brussee and Lilia Efimova. Shared conceptualisations in weblogs. To be published in Proceedings of BlogTalk 2.0, Thomas N. Burg (ed.), Vienna, July 2004. PDF.
Anjo Anjewierden, Robert de Hoog, Rogier Brussee and Lilia Efimova. Detecting knowledge flows in weblogs. Submitted, 2005. PDF.