The first step of the From Weblogs to Ontologies experiment is to get the posts of the Chocolate and Zucchini blog (CZ) into tOKo, that is, to create a corpus. The most obvious method for doing this is using BlogTrace, which can generate a corpus for use with Sigmund. tOKo and Sigmund use the same format (in fact Sigmund uses tOKo for most of the text analysis).
Applying BlogTrace to CZ results in some weird behaviour. The archives page of CZ is not a list of the (monthly) archives; instead it contains all posts, including comments. The file is huge, more than 4 MB, and I'm not sure that this is what was intended. For the experiment it is actually a pleasant surprise, as I can now be reasonably sure that all (464) posts are found.
A corpus itself is simply the collection of all documents (blog posts in this case), each with the following minimal set of meta-data elements: a unique identifier for internal use (for blogs the permalink is used) and a human-readable title. The result of loading the CZ corpus is shown in the figure below. On the left are the posts of CZ and on the right is the text of the selected post.
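To make the corpus description above concrete, here is a minimal sketch in Python. It is not the format BlogTrace, Sigmund or tOKo actually use, which is not described here. The names `Document`, `Corpus` and `build_example_corpus`, and the example.org permalinks, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One blog post with the minimal meta-data described above."""
    identifier: str  # unique identifier; for blogs, the permalink
    title: str       # human-readable title
    text: str        # body of the post


class Corpus:
    """A collection of documents, keyed by their unique identifier."""

    def __init__(self):
        self._documents = {}

    def add(self, document):
        # The identifier must be unique within the corpus.
        if document.identifier in self._documents:
            raise ValueError("duplicate identifier: " + document.identifier)
        self._documents[document.identifier] = document

    def get(self, identifier):
        return self._documents[identifier]

    def titles(self):
        # Titles in insertion order, e.g. for a list of posts on the left.
        return [doc.title for doc in self._documents.values()]

    def __len__(self):
        return len(self._documents)


def build_example_corpus():
    # Placeholder posts; the permalinks are hypothetical, not real CZ URLs.
    corpus = Corpus()
    corpus.add(Document("http://example.org/blog/first-post",
                        "First sample post", "Text of the first post."))
    corpus.add(Document("http://example.org/blog/second-post",
                        "Second sample post", "Text of the second post."))
    return corpus
```

Using the permalink as the key means that adding the same post twice is rejected, so the length of the corpus is the number of distinct posts.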