Now that the CZ corpus is available, it is time to get acquainted with it. If CZ had been a book, I would probably have started on the first page and read on until I discovered that most of it can be read in random order. tOKo is a little quicker: it reads the whole of CZ in seconds and can summarise the results just as quickly, applying techniques from statistical Natural Language Processing (NLP). The simplest example is to count the words, ignoring stop words, and order them by frequency (see figure on the right).
In all the other corpora loaded into tOKo so far, a domain term was also the most frequent word. Although counting words is rather crude, the simple fact that words such as "little" and "like" are used very frequently illustrates the writing style of CZ. Fortunately (I am, after all, trying to develop a cooking ontology), some domain terms also seem to be used with high frequency: chocolate (why would that be?), food, cheese, etc. The list demonstrates that if an author is passionate about a topic, this will show up in the statistics :-).
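The word count described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not tOKo's implementation; the stop-word list, the tokenising pattern and the sample text are all assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "i", "with", "for"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lower-cased words in text, skipping stop words."""
    # Assumed tokenisation: runs of letters, optionally joined by - or '.
    words = re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())
    return Counter(w for w in words if w not in stop_words)

# Made-up sample text standing in for the corpus.
sample = "I like chocolate. The chocolate cake is a little cake with a little cheese."
freq = word_frequencies(sample)

# Print the most frequent words first.
for word, count in freq.most_common(3):
    print(word, count)
```

Note that a word such as "little" survives the filter because it is not a stop word, which is how style words end up next to domain terms in the list.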
For the ontology it seems to make sense to define some kind of division. For example, separate branches could deal with hardware to prepare food, ingredients, drinks, types (styles) of food, places to eat and so forth. One of the first posts of CZ is E. Dehillerin, which is a cooking utensils outlet. Trusting that cooking utensils is the general term for hardware related to preparing food, I selected it in the post (it becomes yellow), and after releasing the mouse it also appears in the text entry box. A concept is created by hitting the button with a green C. We now have an ontology with two concepts: CZ concept is the top-level concept and cooking utensils is a sub-concept of it.

In the same post I read that CZ has bought some knives in the shop. knife is therefore added as a kind of cooking utensil in the ontology. Further down:
"A mezza-luna (chopping tool with two handles and two half-moon blades). In French, it's called a "berceuse" because of the cradling movement you make while using it."
This seems to suggest that chopping tool is also a cooking utensil and that a mezza-luna is a kind of chopping tool. Moreover, chopping tools and knives consist of one or more blades and handles. The latter can be modelled by using the hasPart relation (shown as a triangle with a P). See figure on the right.
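The small ontology built so far can be written down as a data structure. The sketch below is an assumption about how one might represent it in Python, with one "kind of" link per concept and a set of hasPart links; the class and method names are hypothetical and say nothing about how tOKo stores concepts internally.

```python
from collections import defaultdict

class Ontology:
    """Minimal sketch: each concept has one super-concept and any number of parts."""

    def __init__(self, root):
        self.root = root
        self.parent = {root: None}       # "kind of": concept -> super-concept
        self.parts = defaultdict(set)    # hasPart: concept -> set of parts

    def add_concept(self, name, parent=None):
        parent = self.root if parent is None else parent
        if parent not in self.parent:
            raise KeyError(f"unknown super-concept: {parent}")
        self.parent[name] = parent

    def add_part(self, whole, part):
        if whole not in self.parent:
            raise KeyError(f"unknown concept: {whole}")
        self.parts[whole].add(part)

    def ancestors(self, name):
        """Return the chain of super-concepts, nearest first."""
        chain = []
        current = self.parent[name]
        while current is not None:
            chain.append(current)
            current = self.parent[current]
        return chain

# The concepts and relations named in the post.
onto = Ontology("CZ concept")
onto.add_concept("cooking utensils")
onto.add_concept("knife", "cooking utensils")
onto.add_concept("chopping tool", "cooking utensils")
onto.add_concept("mezza-luna", "chopping tool")
for whole in ("knife", "chopping tool"):
    onto.add_part(whole, "blade")
    onto.add_part(whole, "handle")

print(onto.ancestors("mezza-luna"))
```

Whether a mezza-luna should inherit the blades and handles of chopping tool is left open here; that is a modelling decision the sketch does not make.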

Perhaps the most useful function in tOKo is to select a point of view and then zoom in or out. Wondering whether there are different kinds of knives, I first selected knife and then applied the prefix button. This shows all word pairs in which knife is the second word. The result is shown on the left.
Which of these are types of knife? In most cases it is obvious, although the composition of chef knife is curious. Is a paring knife a kind of knife? tOKo provides three ways of finding out: (1) clicking on a term lists the posts in which it occurs, which can then be studied; (2) selecting the KWIC (keyword in context) option shows the context; and (3) a collocation algorithm can be used (more about that later). In this case KWIC answers the question: it is a kind of knife (see the figure below). The KWIC concordancing technique is very old; it was used, for example, to study religious texts.
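Both the prefix view and KWIC are simple to sketch. The Python below is a minimal illustration under stated assumptions, not tOKo's code: the function names, the tokenising pattern and the sample sentences are made up, and the keyword is assumed to be a single word.

```python
import re
from collections import Counter

def tokenize(text):
    """Assumed tokenisation: lower-cased runs of letters, optionally joined by - or '."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower())

def prefix_pairs(tokens, term):
    """Count word pairs in which term is the second word.

    Punctuation is dropped by tokenize, so pairs may span sentence boundaries.
    """
    return Counter((a, b) for a, b in zip(tokens, tokens[1:]) if b == term)

def kwic(tokens, term, width=3):
    """Return (left context, keyword, right context) for each occurrence of term."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Made-up sample text standing in for the corpus.
sample = ("I bought a paring knife and a bread knife. "
          "A paring knife is a small knife for peeling.")
tokens = tokenize(sample)

pairs = prefix_pairs(tokens, "knife")
for (first, second), count in pairs.most_common():
    print(first, second, count)

# Keyword in context: one line per occurrence, keyword in the middle column.
contexts = kwic(tokens, "paring")
for left, keyword, right in contexts:
    print(f"{left:>20} | {keyword} | {right}")
```

The prefix view only proposes candidates; it is the surrounding context in the KWIC lines that lets a reader decide whether a pair names a kind of knife.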

It is time for some preliminary conclusions on the experiment:
- CZ is not only a nice read, it is also very carefully worded (see the quoted example above). If you are listening, Clotilde: in English there is no space before the : (colon).
- Developing an ontology is not easy, especially finding general terms that cover a set of more specific concepts. The cooking domain suffers from this to the extreme, or so it seems: a knife can be used for food preparation and it can also be used while eating. Social tagging tools ignore problems of this kind altogether. This makes them simple to use ("this is a photo of my sister"), but who is to benefit from such tags?
- The experiment started off as "From Weblogs to Ontologies", suggesting that a weblog and common sense would be sufficient to create an ontology. Perhaps this was slightly over-ambitious and other resources are needed. Prime candidates are WordNet and, of course, Wikipedia.