
Cooking: Creating a Corpus

The first step of the From Weblogs to Ontologies experiment is to create a corpus, that is, to get the posts of the Chocolate and Zucchini blog (CZ) into tOKo. The most obvious method is to use BlogTrace, which can generate a corpus for use with Sigmund. tOKo and Sigmund use the same format (in fact, Sigmund uses tOKo for most of its text analysis).

Applying BlogTrace to CZ results in some weird behaviour. The archives page of CZ is not a list of the (monthly) archives; instead it contains all posts, including comments. The file is huge, more than 4 MB, and I'm not sure that this is what was intended. For the experiment it is actually a pleasant surprise, as I can now be reasonably sure that all (464) posts are found.

A corpus itself is simply the collection of all documents (blog posts in this case), each with the following minimal set of metadata elements: a unique identifier for internal use (for blogs the permalink is used) and a human-readable title. The result of loading the CZ corpus is shown in the figure below. On the left are the posts of CZ and on the right is the text of the selected post.
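
To make the description concrete, here is a minimal Python sketch of such a corpus. It is not the format that BlogTrace, tOKo or Sigmund actually use; the names Document and Corpus, the field names and the example.org permalinks are all made up for illustration. It only shows the idea of documents keyed by a unique identifier and carrying a human-readable title.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One document in the corpus (hypothetical structure, not tOKo's format)."""
    identifier: str  # unique id for internal use; for a blog post, its permalink
    title: str       # human-readable title
    text: str = ""   # body of the post


class Corpus:
    """A collection of documents, keyed by their unique identifier."""

    def __init__(self):
        self._docs = {}

    def add(self, doc):
        # The identifier must be unique, so a second post with the same
        # permalink is rejected rather than silently overwritten.
        if doc.identifier in self._docs:
            raise ValueError("duplicate identifier: " + doc.identifier)
        self._docs[doc.identifier] = doc

    def get(self, identifier):
        return self._docs[identifier]

    def titles(self):
        # Titles in insertion order, e.g. for the list on the left of a viewer.
        return [doc.title for doc in self._docs.values()]

    def __len__(self):
        return len(self._docs)


# Made-up posts with placeholder permalinks, purely for illustration.
corpus = Corpus()
corpus.add(Document("http://example.org/blog/post-1", "First post", "Some text."))
corpus.add(Document("http://example.org/blog/post-2", "Second post", "More text."))
```

Keying on the permalink means a post that turns up twice (for example once in a monthly archive and once on the all-in-one archives page) would be caught by the duplicate check instead of being counted twice.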
