Saturday, January 13, 2007

...against all odds


On Wednesday I attended a talk given by Michael Strube from EML Research on "World Knowledge Induced from Wikipedia - A New Prospect of Knowledge-Based NLP". He showed how the (by now famous) collaborative encyclopedia can be used for information retrieval purposes in a way similar to (more traditional) lexical resources such as WordNet and - although not as well structured - yields results of almost equal quality.
The first point was that for their work, Strube and his colleague regarded each Wikipedia page as the representation of a concept (we already had some arguments about that, as you might remember...). Next, they developed a metric for the similarity of concepts with respect to the concept hierarchy (which is where the Wikipedia-defined 'concepts' come into play). Since 2004, Wikipedia has featured a user-defined concept hierarchy. This hierarchy of concepts can also be regarded as a folksonomy, simply because it is not a knowledge representation carefully designed by some designated domain expert, but built by the Wikipedia community in a collaborative way. Unfortunately, the Wikipedia concept hierarchy suffers from exactly that fact.

From my point of view it seems problematic to compare the proposed similarity measure (based on the Wikipedia concept hierarchy) with other similarity measures (based on commonly shared expert ontologies). O.k., you might argue that the Wikipedia concept hierarchy indeed IS commonly shared, because it has been developed by the Wikipedia community...but is the knowledge represented in Wikipedia really 'common'? Just remember the diversity and sheer number of Star Wars characters or Star Trek episodes in Wikipedia, compared with, e.g., the history of smaller European countries. As with all ontologies, the view and the knowledge of the ontology designer always have to be considered. The Wikipedia concept hierarchy - although partly quite appropriate - somehow reminds me of that famous entry of a literary Chinese encyclopedia classifying 'animals', which is quoted by Jorge Luis Borges. Another problem lies in the fact that the different language versions of Wikipedia have developed different concept hierarchies (sic!).
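Just to make the idea of a similarity measure over the category hierarchy a bit more concrete, here is a minimal sketch of a path-length based score in the spirit of Leacock-Chodorow applied to a (toy) category graph. The graph fragment, the depth value and the exact formula are my own illustrative assumptions, not necessarily the metric Strube and his colleague actually used:

```python
# Hedged sketch: path-length based similarity over a toy category graph,
# roughly in the spirit of Leacock-Chodorow adapted to Wikipedia categories.
# Graph and formula are illustrative assumptions, not the talk's exact metric.
from collections import deque
from math import log

# Toy fragment of a category hierarchy: page/category -> parent categories.
PARENTS = {
    "Albert Einstein": ["Physicists", "Nobel laureates"],
    "Niels Bohr": ["Physicists", "Nobel laureates"],
    "Physicists": ["Scientists"],
    "Nobel laureates": ["People"],
    "Scientists": ["People"],
    "People": [],
}

def shortest_path_length(a, b):
    """Breadth-first search over an undirected view of the category graph."""
    adj = {n: set(ps) for n, ps in PARENTS.items()}
    for node, parents in PARENTS.items():
        for p in parents:
            adj.setdefault(p, set()).add(node)  # add reverse edges
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two pages are not connected at all

def lch_similarity(a, b, depth=3):
    """Leacock-Chodorow style score: -log(path / (2 * depth)).

    depth=3 is simply the depth of the toy taxonomy above; path + 1 avoids
    log(0) when both arguments denote the same page.
    """
    path = shortest_path_length(a, b)
    if path is None:
        return 0.0
    return -log((path + 1) / (2.0 * depth))

print(lch_similarity("Albert Einstein", "Niels Bohr"))  # ~0.69 on this toy graph
```

Of course, on the real category graph one would have to worry about cycles, administrative categories and the sheer size of the dump, but the basic idea stays the same.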

In the end, I asked how the proposed Wikipedia-based information retrieval could be improved by considering a 'Semantic Wikipedia', such as the Semantic MediaWiki (given that those semantic wikis would contain sufficient data). Instead of answering my question, Michael Strube cited Peter Norvig's argument against the Semantic Web from last year's AAAI 2006. To sum it up: the Semantic Web will not become reality because of the inability of its users to provide correct semantic annotations. But hey...this guy (Strube) was talking about Wikipedia. Doesn't this argument sound familiar? Just remember the time 5 or 10 years ago. Nobody (well, almost nobody) would have believed that it would be possible to write an entire encyclopedia collaboratively on an open source basis - just because web users did not seem to be able to write 'correct' articles....

1 comment:

Anonymous said...

Hi,

Nice entry, I found it very interesting. My name is Norberto Fernández; I'm a PhD student who has been working since 2005 on using Wikipedia as a kind of ontology for semantic annotation, as part of a system called SQAPS which tries to integrate semantic annotation with querying search engines.

I completely agree that we cannot consider Wikipedia a formal ontology: for instance, it does not provide a unique identifier for each concept, and the categorization system is also an informal one. But I still believe it is useful for the purpose of semantic annotation. In this paper I elaborate on the advantages of using Wikipedia for annotation, like, for instance, that for a user creating an HTML page with his/her favourite tool it is easy to add a link to a Wikipedia page to explain the meaning of a certain piece of text, and by doing so, he/she is annotating the page. In that paper I also suggest that the multiple-identifier problem could be (at least partially) addressed by using reasoning. The idea is that the transitive closure of the multilingual links of a Wikipedia page can be computed, and all the entries in that transitive closure can be considered as aliases for a single concept, instead of considering each single page as a concept.
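To make the alias idea a bit more concrete, here is a minimal sketch of my own (with made-up link pairs, not SQAPS code): computing the transitive closure of the interlanguage links amounts to finding connected components and treating each component as one concept with several aliases, which union-find does almost for free.

```python
# Hedged sketch: merge Wikipedia pages connected by interlanguage links into
# alias sets via union-find (connected components = transitive closure).
# The link pairs below are illustrative assumptions, not real dump data.

LANGUAGE_LINKS = [
    ("en:Germany", "de:Deutschland"),
    ("de:Deutschland", "fr:Allemagne"),
    ("en:France", "fr:France"),
]

parent = {}

def find(x):
    """Return the representative of x's alias set (with path compression)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # shortcut the chain as we walk it
        x = parent[x]
    return x

def union(a, b):
    """Merge the alias sets of a and b."""
    parent[find(a)] = find(b)

for a, b in LANGUAGE_LINKS:
    union(a, b)

# Group pages by their representative: each group is one "concept".
concepts = {}
for page in list(parent):
    concepts.setdefault(find(page), set()).add(page)

for aliases in concepts.values():
    print(aliases)
# {'en:Germany', 'de:Deutschland', 'fr:Allemagne'} and {'en:France', 'fr:France'}
```

Union-find keeps the merging essentially linear in the number of link pairs, which matters once you process a full Wikipedia dump instead of three toy links.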

Regarding the fact that not all domains are equally covered by Wikipedia: that is of course true, but you can also argue that the same holds for the Web as a whole (not all topics have the same number of pages covering them). Additionally, the wiki approach of Wikipedia gives Web surfers the freedom to add new entries which they consider missing (of course, through a community process, to ensure a reasonable degree of quality in the contents).