This post summarizes the architecture and status of BlogTrace, an interactive environment that supports weblog research and that includes Sigmund. In the months to come I hope to address several of the issues raised below. This post also serves as a single place of reference for BlogTrace, should I decide to blog about it in the future :-).
Supporting Blog Research is the post that first outlined the idea of BlogTrace (I'm surprised this is dated November 18th, 2004). Since then it has materialised and Visual Settlements, Making a Difference and Knowledge Flows have been developed on top of BlogTrace.
One of the outcomes of our research planning is that resources to develop BlogTrace further have been made available. Many thanks to Telematica Instituut (in particular Janine Swaak) for that. Sigmund is moving from "weblogs only" to other types of text documents (email boxes, communities of practice, knowledge mapping, etc.) and is now part of my regular research (text analysis to support the development of ontologies).
The main motivation for BlogTrace comes from Lilia's Ph.D. research and the knowledge flows work with Robert de Hoog and Rogier Brussee. Lilia, more or less, defines what is given the highest priority in terms of functionality. If you have ideas related to BlogTrace you should contact her or both of us.
Although BlogTrace already contains some interesting features, there are several reasons not to make it widely available right now (see the overview below). The medium-term objective is to make it available as Open Source once the usability is at an acceptable level. At this moment an alpha version of BlogTrace is available to weblog researchers in my immediate research neighbourhood.
Before delving into detail I would like to thank everybody who blogged about, linked to, commented on and otherwise showed an interest in BlogTrace and its parts. Special thanks to Lilia (also for going on holiday for a month, that helped :-)).
The BlogTrace architecture uses RDF(S) for representation and this opens up an opportunity to define additional abstractions and usage outside of BlogTrace (details of why this design decision is really important are omitted). The architecture is shown in the image (taken from [2]).
Weblog spider. The weblog spider takes the URL of a weblog as input. It then locates the RSS feed and the archives. Locating the RSS feed is hardly ever a problem, but some weblogs do not have an (explicit) link to the archives on the home page. Such weblogs cannot be spidered. Next, the spider extracts the archives and, using a set of heuristics and induction algorithms, determines the posts. These posts are written out as an RSS 1.0 feed with the full text. Issues:
- Quality. How accurate is the spider? This needs to be evaluated.
- Date stamps. The biggest mistake the spider makes is failing to identify the correct date stamp of a post.
- Non-clean HTML. Some HTML is not clean: it does not conform to the standard, or it uses Windows-specific characters.
- Performance. The induction algorithms are unpredictable in how long they take to find the patterns that define the posts. For small sets of weblogs this is not a problem if the process is started shortly before going to bed.
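The first step of the spider (locating the RSS feed and the archives from the home page) can be illustrated with a minimal sketch. This is not the BlogTrace code. It assumes that the feed is advertised through a `<link rel="alternate">` element and that the archives link mentions the word "archive" in its text or URL; the names `FeedAndArchiveFinder` and `find_feed_and_archives` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise a feed (assumption for this sketch).
FEED_TYPES = {"application/rss+xml", "application/atom+xml", "application/rdf+xml"}


class FeedAndArchiveFinder(HTMLParser):
    """Collects feed URLs and candidate archive URLs from a home page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []
        self.archives = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text seen inside that <a> element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            mime = (attrs.get("type") or "").lower()
            if "alternate" in rel and mime in FEED_TYPES and attrs.get("href"):
                self.feeds.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).lower()
            # Heuristic: the link text or the URL mentions "archive".
            if "archive" in text or "archive" in self._href.lower():
                self.archives.append(urljoin(self.base_url, self._href))
            self._href = None


def find_feed_and_archives(html, base_url):
    """Return (feed_urls, archive_urls) found in the given home page HTML."""
    finder = FeedAndArchiveFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.feeds, finder.archives
```

A weblog for which the second list comes back empty corresponds to the case described above: without an explicit archives link there is nothing to spider.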
Links and metadata. Together with Rogier Brussee I have defined an ontology that represents the metadata about a blog and the links that appear in its posts. The spider instantiates this ontology for a particular weblog. Issues:
- Uniqueness. The anchor (href) of a link is often different from the real URL. For example, "anjo.blogs.com" and "anjo.blogs.com/metis/" resolve to the same URL. In many cases BlogTrace notices that the two are identical, but sometimes it does not.
- Errors. A significant number of links contain typing mistakes and other anomalies. Links, for example, sometimes contain spaces.
- Types of links. For answering some research questions it is necessary to know the "type" of the link. At the moment the following types of link targets are distinguished: self (own homepage), community homepage, own post, community post, and other. I would be interested in techniques to determine types of links.
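To make the uniqueness and link-type issues concrete, here is a minimal sketch, not the BlogTrace implementation. It assumes that hrefs are absolute URLs or bare host names, and that the homepages of the community weblogs are known in advance; `normalise` and `link_type` are hypothetical helper names. Note that it only removes syntactic variation (case, a missing scheme, trailing slashes, fragments, stray spaces). Two different URLs that resolve to the same page through a redirect, as in the example above, can only be detected by actually fetching them.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(href):
    """Reduce purely syntactic variation in a link (assumes an absolute URL or bare host)."""
    href = href.strip().replace(" ", "%20")   # links sometimes contain spaces
    if "://" not in href:
        href = "http://" + href               # assumption: a bare host name means http
    parts = urlsplit(href)
    path = parts.path.rstrip("/") or "/"      # "/metis/" and "/metis" become the same
    # Lower-case scheme and host, and drop the fragment.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def link_type(href, own_home, community_homes):
    """Classify a link target as one of the five types listed above."""
    url = normalise(href)
    own = normalise(own_home)
    homes = {normalise(h) for h in community_homes} - {own}
    if url == own:
        return "self"
    if url in homes:
        return "community homepage"
    # Heuristic: anything below a homepage is treated as a post of that weblog.
    if url.startswith(own.rstrip("/") + "/"):
        return "own post"
    if any(url.startswith(h.rstrip("/") + "/") for h in homes):
        return "community post"
    return "other"
```

The prefix test is the weak point: it treats every page below a homepage as a post, which is why patterns for recognising permalinks would be a welcome refinement.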
Term and name extraction. This is essentially the part of Sigmund that determines what you blog about and who your friends are. Issues:
- Quality. This appears to work rather well. Victor de Boer has compared person name extraction to some other named entity recognisers, and my approach seems to perform well, especially on noisy data.
- Languages. It has been tested for both English and Dutch. Support for German would be possible, although some of the algorithms probably would not work, because in German all nouns are capitalized.
- Quoting. The biggest problem is that the term extractor does not know about quoting. The implication is that by quoting in a post the terms of somebody else get imported into your vocabulary. This, of course, is a bug that is non-trivial to fix.
- Abbreviations. Long forms of abbreviations are determined by looking at the posts of a single blog. This works most of the time, but in a community abbreviations sometimes travel from one blogger to another, so some mechanism for sharing abbreviations between blogs has to be introduced.
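One possible approach to the quoting issue is to drop quoted material before the terms are extracted. The sketch below is only an illustration and is not part of Sigmund. It assumes that quoted text is marked up with the blockquote tag, which will not hold for every weblog; `OwnWordsExtractor` and `own_words` are hypothetical names.

```python
from html.parser import HTMLParser


class OwnWordsExtractor(HTMLParser):
    """Collects the text of a post, skipping everything inside <blockquote>."""

    def __init__(self):
        super().__init__()
        self._depth = 0     # nesting depth of open <blockquote> elements
        self._chunks = []   # text fragments written by the blogger

    def handle_starttag(self, tag, attrs):
        if tag == "blockquote":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "blockquote" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def own_words(post_html):
    """Return the text of a post without the material quoted from others."""
    parser = OwnWordsExtractor()
    parser.feed(post_html)
    parser.close()
    return parser.text()
```

Quotes placed inline between quotation marks are not caught this way, so this would reduce the problem rather than remove it.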
Dictionaries. The text analysis part of BlogTrace only works when dictionaries are available. Unfortunately, most high quality dictionaries are licensed material. I'm now in the process of rewriting parts of BlogTrace in such a way that the dictionaries can be distributed. Issues:
- Dictionary ontology. Is there an accepted RDF ontology for representing the core of a dictionary (word classes, inflections and orthography)?
- Exploiting other language resources. WordNet for example.
Link analysis. This module includes functions related to links between posts (or, more generally, documents) and links between weblogs (or, more generally, people). It answers simple questions such as "how often does blogger A link to blogger B" or "which links do bloggers A and B have in common", but also more complicated questions such as "find all conversations (linked posts) in which at least five posts appear and in which at least three bloggers are involved". Issues:
- Functions. To be enumerated.
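The three example questions can be sketched in a few lines. This is an illustration under assumptions, not the BlogTrace module: links are given as (source post, target URL) pairs, `author_of` maps every known post to its blogger, and a "conversation" is taken to be a connected group of posts when link direction is ignored. All function names are made up for the example.

```python
from collections import Counter, defaultdict


def blogger_link_counts(links, author_of):
    """How often does blogger A link to blogger B? Returns a Counter keyed by (A, B)."""
    counts = Counter()
    for src, dst in links:
        if dst in author_of:   # skip targets that are not known posts
            counts[(author_of[src], author_of[dst])] += 1
    return counts


def common_links(links, author_of, a, b):
    """Which link targets do bloggers a and b have in common?"""
    targets = defaultdict(set)
    for src, dst in links:
        targets[author_of[src]].add(dst)
    return targets[a] & targets[b]


def conversations(links, author_of, min_posts=5, min_bloggers=3):
    """Groups of linked posts with enough posts and enough distinct bloggers."""
    neighbours = defaultdict(set)
    for src, dst in links:
        if src in author_of and dst in author_of:
            neighbours[src].add(dst)
            neighbours[dst].add(src)
    seen, result = set(), []
    for start in list(neighbours):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            post = stack.pop()
            if post in component:
                continue
            component.add(post)
            stack.extend(neighbours[post] - component)
        seen |= component
        bloggers = {author_of[p] for p in component}
        if len(component) >= min_posts and len(bloggers) >= min_bloggers:
            result.append(component)
    return result
```

In this sketch a blogger linking to their own post is counted as a link from A to A; whether that is desirable depends on the research question.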
Post analysis. This module includes all functions related to the textual and structural nature of posts, as well as date stamping and sequencing. Issues:
- Functions. To be enumerated.
Linguistic analysis. All functions related to terminology. Examples are the co-occurrence analysis by Sigmund or the detection of knowledge flows. Issues:
- Scalability. Text analysis does not scale well; even Google has problems with it (see for example: More arithmetic problems at Google).
- Performance. There is a serious performance bottleneck in the "shared conceptualisations" function in Sigmund. This is caused by the quadratic nature of the algorithm, by a design flaw and by the use of Prolog. At the moment it takes Sigmund 10 minutes to compute the tables (for a community of 30 weblogs) and this is clearly unacceptable.
- Other. There is no end of other issues related to linguistic analysis. But that has nothing to do with BlogTrace.
User interface. BlogTrace is an interactive tool, although not all of the functionality that has been implemented is available through the user interface. Issues:
- Finding suitable visualisations for the data is a major (research) challenge.
Exporting data. In order to support analysis by external tools (Excel, SPSS, Social Network Analysis, etc.) it should be possible to export the raw data. Issues:
- Formats. Import formats of external analysis tools.
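As a minimal sketch of what an export could look like, the function below writes blogger-to-blogger link counts as a CSV edge list, a format that spreadsheets and most statistics packages can read. The column names and the function name `edge_list_csv` are assumptions made for this example; the import formats of the individual tools would still have to be checked.

```python
import csv
import io


def edge_list_csv(link_counts):
    """Serialise {(source, target): weight} as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["source", "target", "weight"])
    # Sort the edges so that repeated exports of the same data are identical.
    for (source, target), weight in sorted(link_counts.items()):
        writer.writerow([source, target, weight])
    return out.getvalue()
```

The returned string can be written to a file with the usual file operations.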
References.
[1] Anjo Anjewierden, Rogier Brussee and Lilia Efimova. Shared conceptualisations in weblogs. To be published in "Proceedings of BlogTalk 2.0", Thomas N. Burg (ed.), Vienna, 2004 (July). PDF.
[2] Anjo Anjewierden, Robert de Hoog, Rogier Brussee and Lilia Efimova. Detecting knowledge flows in weblogs. Submitted. 2005. PDF.
Interesting stuff! Let's talk, drop me a line at dsifry at technorati dot com...
Dave
Posted by: David Sifry | January 29, 2005 at 02:05 AM
Hmm, noticed my About page disappeared. It is now back.
Posted by: Anjo | January 29, 2005 at 09:33 PM
Some of my own thoughts:
. Glad you could let Lilia travel. It was fun meeting her F2F.
. Is there something that bloggers (or the tools) could do that would make these kinds of analyses "easier?" I've seen some suggestion for providing an rdf/xml file for each post, for example.
. Beyond "self" and "other," it would be difficult to identify the type of link from strictly the anchor tag, wouldn't it? (What is the definition of "community" in this context?)
. Quoting: While it may not work everywhere, it seems that most authors (and many WYSIWYG tools) use the BlockQuote tag to identify material that comes from elsewhere.
. Can link analysis / conversation discovery look beyond blogs to things like KnowledgeBoard or websites where articles of interest are posted and then linked into the blogosphere?
The attention of Dave Sifry can't be a bad thing. I'm guessing Technorati already does some of this, while generally only seeing "current" conversations. What I've seen of your work looks to history of conversations.
Jack
Posted by: Jack Vinson | January 30, 2005 at 12:21 AM
Jack,
Q: "Is there something that bloggers (or the tools) could do". Full text RSS feeds are best. (I have never understood why most RSS feeds contain only the first 100 words or so.)
Q: "Beyond "self" and "other," it would be difficult to identify the type of link from strictly the anchor tag, wouldn't it?" Perhaps. I'm sure someone has looked into this before. URLs that are permalinks seem to follow a set of patterns. In general, you are right.
Q: "Communities". How does one define a (weblog) community? I have some ideas on this, and slowly, very slowly Lilia is joining me.
Q: "Quoting". Of course you are right again, blockquote is used by most. This is simply something that has to be implemented.
Q: "BlogTrace and Technorati." We'll see. My interest, and that of Lilia also, is to look at weblog communities. Technorati is very useful as the Google equivalent for weblogs, but the results it produces are not fit for serious research purposes.
Posted by: Anjo | January 30, 2005 at 02:22 AM
Very impressive. Keep up the good work!!
--
http://linuxhelp.blogspot.com
Posted by: brahmo | February 02, 2005 at 09:59 AM