Today Lilia is presenting a joint paper with Stephanie and myself at AOIR in Chicago. It also happens that it's Lilia's birthday today and thinking about a present resulted in this post. Happy birthday Lilia, and I hope you like it :-).
The paper is about mapping a weblog community; the principal device for finding members of the community is reciprocal linking. Apart from linking, there are obviously various other indicators of a community. For a community of "professional" bloggers around KM (e-learning, internet research), the terms used are probably also a good indicator, and this post investigates this.
The data is the community as identified in the paper (all posts in 2004 from all members). A few members were deleted because their blogs are mainly in German. The full data-set for the text analysis consists of 59 blogs, 17,784 posts and 32 MB of text.
First, Sigmund was used to find the terms used by the community as a whole with a frequency of at least 10. There are 12,882 such terms. For example, the term birthday was used 102 times by the community. Next, some statistics were applied to the data so that the following query could be made:
t(Term, Focus, Background) = Weight
Here Term is a term (e.g. birthday), Focus a sub-set of the data (e.g. all posts by Lilia) and Background another sub-set of the data (e.g. all posts not by Lilia). Weight is a number (0.0 ... 1.0) which states how strongly the term indicates that we are in the Focus as compared to the Background set. (Technical note: the weight is not the same as a probability, but this is not important for the purposes of this post.) For example:
t(birthday, LiliaEfimova, all) = 0.63
where all is shorthand for all other blogs. Another example:
t(boyfriend, female, all) = 0.85
where female is the collection of all blogs authored by women and all is all blogs not authored by women. We can now define a passion as follows:
t(Term, Focus, Background) > 0.9
A few of the passions in the community are:
t(anke, AndyBoyd, all) = 1.0
t(digital video, AdrianMiles, all) = 0.95
t(cycle, AnjoAnjewierden, all) = 0.97
t(travel plan, LiliaEfimova, all) = 0.98
t(fingers crossed, LiliaEfimova, all) = 0.92
t(alarm clock, LiliaEfimova, all) = 0.90
t(chocolate, NancyWhite, all) = 0.94
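The passion definition above amounts to a simple threshold filter. Here is a minimal sketch in Python, assuming the weights have already been computed and stored in a dictionary keyed by (term, focus). The `weights` dictionary and the `passions` helper are hypothetical names for illustration and are not part of Sigmund:

```python
# Hypothetical container: (term, focus) -> weight against all other blogs.
# The values are a selection of the ones quoted in this post.
weights = {
    ("birthday", "LiliaEfimova"): 0.63,
    ("boyfriend", "female"): 0.85,
    ("anke", "AndyBoyd"): 1.0,
    ("digital video", "AdrianMiles"): 0.95,
    ("cycle", "AnjoAnjewierden"): 0.97,
    ("chocolate", "NancyWhite"): 0.94,
}

PASSION_THRESHOLD = 0.9


def passions(weights, threshold=PASSION_THRESHOLD):
    """Return (term, focus) pairs whose weight exceeds the threshold, highest first."""
    selected = [(key, w) for key, w in weights.items() if w > threshold]
    selected.sort(key=lambda item: item[1], reverse=True)
    return [key for key, _ in selected]


passion_list = passions(weights)
```

With these values, birthday (0.63) and boyfriend (0.85) fall below the threshold, so they count as characteristic terms but not as passions.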
With all the technical machinery in place, we can now try to find out whether the community exists because its members blog about similar things (as well as about personal passions). For this, we define the community blog as the collection of all posts that contain a link to another member of the community, and compare these to all posts that contain no link within the community. For normalisation purposes, we subtract a number from the community weights such that they all sum to zero. This results in, for example:
n(collaborative note taking, linked, notlinked) = 0.43
n(security, linked, notlinked) = -0.35
suggesting that collaborative note taking is a term used more frequently when linking within the community, whereas security is not used much in linked posts.
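The normalisation step can be sketched as follows, assuming that the number subtracted is the mean of the community weights. The post does not say which number is used; subtracting the mean is simply the most direct choice that makes the weights sum to zero. The raw weights below are made-up illustration values, not results from the data-set, and `normalise` is a hypothetical helper name:

```python
def normalise(raw_weights):
    """Shift weights by their mean so that they sum to zero (assumed method)."""
    mean = sum(raw_weights.values()) / len(raw_weights)
    return {term: w - mean for term, w in raw_weights.items()}


# Made-up raw t(term, linked, notlinked) weights, for illustration only.
raw = {"term a": 0.9, "term b": 0.5, "term c": 0.1}
normalised = normalise(raw)
```

After the shift, terms that lean towards linked posts get a positive weight and terms that lean towards not-linked posts get a negative one.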
For every individual post we can now determine whether it fits in the community, according to the terms used, by adding up the normalised weights of the terms occurring in the post. If the sum is positive, community terminology is used; if it is negative, non-community terminology is used. The following table lists the results:
            Linked   Not-linked
Positive      1189          823
Negative      3141        12631
The posts classified as Positive-Linked and Negative-Not-Linked are correct if we assume that linking and the use of shared terminology are related. Over all posts, 73% are correctly classified. Correctness rapidly increases to over 90% if the number of terms in a post is also taken into account. The preliminary conclusion, therefore, has to be that both linking and use of shared terminology are strong indicators of belonging to a weblog community.
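The classification step can be sketched like this. It is an illustration under assumptions: each post is represented as a set of terms plus a flag for whether it links within the community, terms without a known weight count as zero, and a sum of exactly zero is treated as Negative. The `score`, `classify` and `cross_tabulate` helpers are hypothetical names. Only the two normalised weights quoted earlier are used, and the three toy posts are invented:

```python
from collections import Counter

# Normalised weights quoted in this post; all other terms are treated as 0.0.
n = {"collaborative note taking": 0.43, "security": -0.35}


def score(terms, weights):
    """Sum the normalised weights of the terms occurring in a post."""
    return sum(weights.get(term, 0.0) for term in terms)


def classify(terms, weights):
    """Positive if the summed weight is above zero, otherwise Negative (assumption for ties)."""
    return "Positive" if score(terms, weights) > 0 else "Negative"


def cross_tabulate(posts, weights):
    """Count posts per (classification, linked / not-linked) cell."""
    table = Counter()
    for terms, linked in posts:
        column = "Linked" if linked else "Not-linked"
        table[(classify(terms, weights), column)] += 1
    return table


# Invented toy posts: (terms in the post, links within the community?)
posts = [
    ({"collaborative note taking", "weblog"}, True),
    ({"security"}, False),
    ({"security", "collaborative note taking"}, False),
]

table = cross_tabulate(posts, n)
# Posts where terminology and linking agree.
agree = table[("Positive", "Linked")] + table[("Negative", "Not-linked")]
accuracy = agree / len(posts)
```

The third toy post shows how the sum works: the positive weight of collaborative note taking outweighs the negative weight of security, so the post is classified as Positive even though it contains no link within the community.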
Finally, below is a table that ranks the community blogs according to their use of common terms. The icing on the cake is that Lilia comes first. Happy birthday, once again.