Finding weblog communities: some fundamental mathematics

For a long time, until today, I thought that finding a (weblog) community required both a list of inbound links and a list of outbound links. This incorrect line of thinking stems from confusing the possible members of a community with its actual members. To be a member of a community it is necessary to have at least one inbound link to the community (by definition), and given this simple axiom it suffices to have the inbound links only to find the entire community. I'll skip the proof, which is trivial: think about a "community member" with 0 inbound links to the other community members, then one with 1 inbound link, and apply mathematical induction.
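For the curious, here is a minimal Python sketch of one possible reading of that argument. It assumes a hypothetical dictionary `inbound` that maps each weblog to the set of weblogs linking to it (the "Who links to me" list); the function names and the example data are made up for illustration. Candidates are collected by following inbound links from a seed, and candidates without any inbound link from another member are then removed until nothing changes.

```python
from collections import deque


def candidates(inbound, seed):
    """Collect possible members: blogs reached from seed via 'who links to me'."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        blog = queue.popleft()
        for linker in inbound.get(blog, ()):
            if linker not in seen:
                seen.add(linker)
                queue.append(linker)
    return seen


def community(inbound, seed):
    """Keep only candidates that have at least one inbound link from another member."""
    members = candidates(inbound, seed)
    changed = True
    while changed:
        changed = False
        for blog in list(members):
            # Self-links do not count as a link from the community.
            linkers = set(inbound.get(blog, ())) - {blog}
            if not (linkers & members):
                members.discard(blog)
                changed = True
    return members


# Hypothetical data: a and b link to each other, c links to a but nobody links to c.
inbound = {"a": {"b", "c"}, "b": {"a"}, "c": set(), "d": {"a"}}
print(sorted(candidates(inbound, "a")))
print(sorted(community(inbound, "a")))
```

Note that no outbound list is needed anywhere: every link from one blog to another already appears in the inbound list of the blog being linked to.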

Tools such as Technorati and BlogPulse either discovered this independently or their Web 2.0 impetus prompted them to implement the inbound links (Who links to me) in their interfaces, while considering that outbound links (Who I link to) are not very interesting for most users.

Suddenly, finding potential weblog community members appears rather easy :-).

Literate programming

Last week I was on holiday, cycling of course, and after a little bit of careful planning I found a hotel that had a Biergarten, free wifi and, to top it all off, free wifi in the Biergarten. There was therefore little reason to reject an offer from Jan Wielemaker to write a paper on PlDoc, the literate programming environment he has developed for SWI-Prolog. My contribution to the paper was minimal: writing a user experience section, providing mental help, and correcting some typos :-).

Literate programming is an old idea. Donald Knuth proposed, practiced and propagated it more than 30 years ago through his work on TeX. Knuth's original idea is simple: you write the source code along with the documentation in a single file. That file is handed to the compiler to generate the program and to a document processor to generate the documentation. I met Knuth at University College London when I was an intern working on computer networks in 1977. He wanted access to the network so he could read his mail, and I helped out. A lot has changed since then: 30 years later you just turn on your portable in a Biergarten with wifi ...

I think there are several reasons for literate programming to take off for real. One is open source. Open source without documentation will simply die. The second is that the wiki style of writing web content mixes rather well with source code. Just insert the documentation as a comment in the source code. A third point, which Jan Wielemaker wanted to stress in the paper, is that programmers should reap the benefits of writing documentation immediately. PlDoc does this in a rather clever way by taking advantage of the interpretive nature of Prolog: once the source code file has changed, both the program and the documentation are immediately updated.
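The single-source idea can be sketched in Python. This is only a loose analogue, not PlDoc and not Knuth's system: it assumes a made-up function `momentum` whose documentation sits next to the code as a docstring, and it uses the standard `inspect` module as a stand-in for the document processor.

```python
import inspect


def momentum(mass, velocity):
    """Return the momentum p = m * v of a body.

    mass is in kilograms and velocity in metres per second.
    """
    return mass * velocity


# The "program" path: the interpreter runs the code.
print(momentum(2.0, 3.0))

# The "documentation" path: the same source yields the text.
print(inspect.getdoc(momentum))
```

Change the function and its comment in one place, reload the file, and both the behaviour and the documentation follow.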

I have started documenting the internals of tOKo using PlDoc and perhaps I should quote Knuth on what this feels like:

The literate programmer can be regarded as an essayist that explains the solution to a human by crisply defining the components and delicately weaving them together into a complete artistic creation.

As Jan pointed out while I was enjoying a Hefe Weizen reading the latest draft, tOKo consists of 321 modules with about 130,000 lines of code. Wonder how many lines there will be after becoming a fluent literate programmer.

Reference: Jan Wielemaker and Anjo Anjewierden. "PlDoc: Wiki style Literate Programming for Prolog", 17th Workshop on Logic-Based Methods in Programming Environments (WLPE 2007), Porto, 2007. Submitted.

Communities: donor codicil vs. those who need organs

I have had a donor codicil for a very long time. The codicil states that my organs are available for anyone who needs them, after I die.

What strikes me as strange is that after you fill out a codicil nothing happens; I did not even receive confirmation. From a social perspective this is very bad practice, and from a community perspective this looks like a lost chance. It appears that there is a single community, the I-need-an-organ community, and millions of individuals who are willing to donate and who will remain anonymous until they die (their organs may be useful afterwards). The publicity around the Dutch "donor show" once again reinforced my view that the I-need-an-organ community only wants signed donor codicils without thinking about what it means for donors.

Perhaps it is an idea to create a single community of potential organ recipients and those carrying a codicil. A simple thought is that every codicil owner receives a birthday card designed by someone who needs an organ. Obviously, the picture on the card should be something pleasant: how life will improve after receiving an organ (e.g., a visit to the park; I like pictures of baby ducks).

The community perspective may be obvious. Moreover, people tend to celebrate their birthdays with family and friends, and perhaps the card makes others more aware of the option of filling out a donor codicil, and also that you, as the recipient of the birthday card, actually have good reasons for having done so.

There is a whole host of practical issues in implementing this idea. The process has to be entirely anonymous, the same postcard has to be sent to a great many people, and so forth. The cost, compared to government publicity campaigns, is minimal and it might be a lot more effective.

Getting the learner in the loop

One interpretation of Educational Data Mining is to perform data mining on educational data. This would potentially be beneficial to some of the stakeholders, in particular those who research learning environments or those scoring learner performance. A more forward-looking idea is whether EDM could also benefit learners while they are learning. The key here seems to be to provide learners with feedback about their behaviour. Generally, in controlled learning environments, feedback is given in the form of pre-programmed prompts: "Your answer was correct". Prompts like this suggest the learner is being controlled by the learning environment, and one wonders whether it is possible for learners to become part of the learning environment. An idea is to make the learner part of the environment, literally. The animation below provides an example.

The animation shows two learners chatting while solving physics problems. The learners both look at the same screen on separate computers, and a chat tool is used to communicate. To solve the physics problems the learners can, roughly speaking, chat about:


  • Domain: Their interpretation of what is happening in the simulations shown on the computer screen (learning about the momentum of bouncing balls in this case).
  • Regulatory: Discussing what is the correct answer, whether to move on to the next question and so forth.
  • Social: Compliments (both positive and negative) and other social talk.
  • Technical: What buttons to press, interpretation of the simulations etc.

In the animation the domain is the head, the regulatory chats are the body, the arms are the social talk and the legs are technical. At the end of the animation the learner on the left has a large body and a small head, suggesting s/he was more into discussing regulatory things and little interested in discussing the meat of the domain. The learner on the right found a better balance between discussing the domain and regulatory chats, an indication of being more focussed.

Although the animation is crude, it provides pointers to what EDM could deliver in the future: the behaviour of a learner is made part of the learning environment.

The animation is based on a model of data from experiments. For each of the classifications above (domain, regulatory, etc.) typical terms and linguistic patterns are defined. For example, phrases like "I agree", "What do you think?", "The answer is 5" are regulatory, whereas "the momentum increases", "what is the value of p" are domain related.
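As a rough illustration of how such a model could work, here is a small Python sketch. It is not the model behind the animation: the pattern lists are assumptions, seeded with the example phrases above and a few invented ones, and the names `PATTERNS`, `classify` and `profile` are made up. Each chat message gets the first category whose pattern matches, and the shares per category could then be used to scale the head, body, arms and legs.

```python
import re
from collections import Counter

# Hypothetical patterns; only the phrases quoted in the text come from the data.
PATTERNS = {
    "domain": [r"\bmomentum\b", r"\bvalue of p\b", r"\bvelocity\b"],
    "regulatory": [r"\bi agree\b", r"\bwhat do you think\b",
                   r"\bthe answer is\b", r"\bnext question\b"],
    "social": [r"\bwell done\b", r"\bthanks\b"],
    "technical": [r"\bpress\b", r"\bbutton\b", r"\bclick\b"],
}


def classify(message):
    """Return the first category (in dictionary order) with a matching pattern."""
    text = message.lower()
    for category, patterns in PATTERNS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return category
    return "unclassified"


def profile(messages):
    """Return the share of messages per category for one learner."""
    counts = Counter(classify(message) for message in messages)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


chats = ["I agree", "the momentum increases", "The answer is 5", "hello"]
print(profile(chats))
```

A learner whose profile is dominated by "regulatory" would end up with the large body and small head seen in the animation.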

Spelling errors taken seriously

I'm currently dealing with chat data from an educational setting. Statistics reveal that there are 6150 different words in these chats, of which 3157 are not known by the dictionary. More than 50% of the words are not known! To be able to do anything with these chats they need to be "cleaned". Wilson Wong and his co-workers have come up with a method for Cleaning dirty texts, which is the first non-trivial attempt I have seen to attack the problem. Kudos. Matthew Hurst, jokingly on April 1st, quotes the boss of Google: Our spelling correction...is an example of AI. Matthew sounded a little bit cynical, but is spelling correction not a fundamental step in connecting language and meaning? If all of the wonderful technology stumbles on something as trivial as a spelling error, what is left?

I'm applying Wong's ideas to my data and the results are encouraging, but not as spectacular as the 96%+ correct corrections he is quoting. Perhaps it is my data :-), perhaps it is an imperfection in his method.

One of the most challenging examples in my data is k hbe geni dee (in Dutch). All four words are incorrect. The correct spelling is: ik heb geen idee, which translates as: I have no idea, something social researchers of educational data find significant.
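To make the example concrete, here is a toy Python sketch. It is not Wong's method: it simply swaps in the string similarity from the standard `difflib` module and assumes a four-word lexicon (`LEXICON`, a made-up name) that happens to contain exactly the right words. A real dictionary with thousands of entries offers many more competing candidates, which is precisely what makes the problem hard.

```python
import difflib

# Assumption: a tiny lexicon containing just the words of the target sentence.
LEXICON = ["ik", "heb", "geen", "idee"]


def correct_word(word, lexicon=LEXICON, cutoff=0.6):
    """Return the closest lexicon entry, or the word itself if nothing is close enough."""
    if word in lexicon:
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, lexicon=LEXICON):
    """Correct a chat line word by word, ignoring context."""
    return " ".join(correct_word(word, lexicon) for word in text.split())


print(correct_text("k hbe geni dee"))
```

Each word is corrected in isolation here; a serious method would also look at the neighbouring words.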

Conclusion: spelling correction is taken seriously by researchers and tools like Google. I think this is good news. Google performs well at single-word errors and poorly at multi-word errors, and it only corrects the query (not the results).

Google and first names

Casper Hulshof, one of my new colleagues, told me that typing anjo into Google catapults one directly to my weblog. Remarkable.

I'm wondering whether there are other first names that earn the Google I'm feeling lucky distinction of pointing to a weblog or personal homepage.

Social networks and the semantic web

Someone kindly gave me a copy of Social Networks and the Semantic Web, the Ph.D. thesis of Péter Mika at the Free University of Amsterdam. The thesis contains precisely what the title suggests: the application of Semantic Web technology to Social Network Analysis. I guess the most visible application of Péter's research is Flink, which takes co-authored papers as the main source of the social network.

Chapter 3 contains an excellent overview of SNA in general, covering several issues I have thought about (what is a network, what are the boundaries and so forth).

Page 58 is the best page in the thesis. It contains Figure 4.1, an illustration of the structure of a weblog (links, quotes, comments and such). Although the illustration, a screenshot, is anonymous I recognised it immediately. It is the Thinking and berries in Umea post of Lilia. Further down the page it gets even better:

The early work of Efimova and Anjewierden also stands out in that they were among the first to study blogs from a communication perspective. [Understanding weblog communities through digital traces: a framework, a tool and an example, International Workshop on Community Informatics (COMINF 2006), Montpellier, France.]

A nice social touch in a nice thesis.

Settled

I'm more or less settled in my new job. The major findings:

Simple things

  • The key that opens my office door also fits on the bicycle parking lot.
  • All employees have a key for the coffee machine. Initially, I thought this was to prevent students from getting free coffee, but it is intended to increase the social cohesion. This works as follows. Stick the key into the coffee machine, get coffee, walk back to the office and a few minutes later a colleague will bring back your key. This sheds an entirely new light on coffee machines as Knowledge Management tools.
  • All discussions, emails and meetings are in Dutch. There is actually one non-native Dutch speaker. She is American (troetel allochtoon) and speaks some Dutch.
  • The first two weeks I worked on a Windows machine, waiting for a new computer on which I would install Linux. Unfortunately, the head of ICT thinks Linux is not a good idea. No one in the entire faculty is using Linux, ICT does not support Linux and, given that I could also work under Windows (perhaps after following some courses!), it seemed wholly inappropriate that an exception be made here.
  • After a lot of discussion at the highest levels it was agreed that I was allowed to install Linux. The conditions: no calls to the ICT helpdesk and no promotion of Linux in the faculty.

Research

The topic of my research is Educational Data Mining (EDM). For all I know, my previous research was Weblog Data Mining, Community Data Mining, Semantic Relation Extraction Data Mining and so forth. You probably get the picture: Educational Data Mining is a label that currently has little substance and scientific papers on the matter are very hard to find.

After a little internal discussion we agreed on the following. Data Mining is about extracting knowledge from data (this is a well accepted definition) and EDM is about extracting behavioural knowledge from educational data. Whereas data mining in general concentrates on large amounts of anonymous data (e.g. analysing transaction slips from shops, tracking terms used in weblogs, etc.), EDM should concentrate on finding out what learners are doing inside learning environments and understand more about their individual behaviour. The long-term aim, given that EDM is possible, is to adapt the learning environment to the learner rather than the other way around.

In order to test the approach to getting EDM research going I'm running a number of experiments on data that is available in the department. These are low-key experiments in cooperation with the researchers (Ph.D. students and post-docs) who currently perform the knowledge extraction manually. I'll post results here.

A nice thing is that I'm familiar with a lot of topic areas relevant to EDM: visualisation, the role of time, text analysis (chats and open answers in EDM). Although the raw data looks different, the methods and techniques should largely be applicable.

Wattle trees

New job, new papers to read. During my search for papers on Educational Data Mining, there aren't many, I stumbled on several papers by researchers from the University of Sydney. This group has come up with a visualisation of collaboration they call Wattle trees and it simply is brilliant! The example below, taken from Wattle trees: What'll it tell us, illustrates that some have started the trajectory from simple statistics to sophisticated visualisations. I feel at home :-).

Educational Data Mining

Sometimes history repeats itself, albeit slowly. More than 27 years ago I got a phone call: "Would I like to work in the Psychology laboratory of the University of Amsterdam (UvA)?" Having just finished my studies, I gladly accepted the offer. Six weeks ago I got another phone call: "Would I like to work at the Instructional Technology department of the University of Twente?". This time the choice was not so easy, especially given that I have worked with some colleagues, Bob Wielinga and Jan Wielemaker in particular, for the entire period and I have learned an awful lot from them.

And how, you may wonder, does history repeat itself? Oh well, I'm surrounded by psychologists again :-). The assignment is to think about Educational Data Mining (EDM) for the next two years and the nice bonus is that a continuation of weblog research, which would have been difficult at UvA, is also on the cards. EDM is not well-defined, but it is obviously related to (e-)learning environments in the broadest sense. The department I'm moving to has a vacancy for a post-doc with an educational background. Have a look at it if you like (science) education and telling me what to do (let us call that knowledge sharing :-)).

Finally, a remark on long-term planning. Ten years ago I bought a house in Enschede, opposite the University of Twente (guess why!). Now, this means that I can actually cycle to work in about 6 minutes.