Saturday, December 30, 2006

Neal Stephenson - The Confusion (Vol. 2 of the Baroque Cycle)


I have finished 'Confusion' ... more than a thousand pages and what a story :)
As I read the book in its German translation, the short review refers to that edition....

Ok, so I'm through. First of all, big praise to the two translators, Nikolaus Stingl and Juliane Gräbener-Müller. The weighty 1000 pages come across with a lightness that makes this book a true reading pleasure. The book (strictly speaking, according to the author's preface, it is actually two) offers so many subplots, narrative descriptions and flourishes, and abstruse plot twists that one may truly call it 'baroque'. Two main plot lines are intertwined ('confused'): the conspirators (actually galley slaves) around Jack Shaftoe, their hunt for Solomon's gold (which is quickly snatched away from them again), and their mercantile expedition (-> quicksilver) around the globe... as well as the story of Eliza (meanwhile Duchess of Arcachon (sic!) and Qwghlm), the (historical) emergence of cashless payment and its effects on the economy and politics (Confusion!) of the Europe of that time, plus Eliza's revenge on her former (Duc d'Arcachon sr.) and current (von Hacklheber) tormentors.
The story around Daniel Waterhouse (including Newton, Leibniz, etc.), the Royal Society, and the scientific discoveries of the period has turned out somewhat short (in contrast to the first volume, 'Quicksilver'). Instead, this volume focuses on the economic achievements of the late 17th and early 18th centuries, such as the global operations of the Dutch East India Company, Spain's plundering of the New World, and the emergence of cashless trade.

What I enjoy again and again about Stephenson's novels are his depictions of completely absurd situations in which important dialogues and plot developments are sometimes embedded, be it the large-scale (makeshift) production of phosphorus from urine in the Indian hinterland, or the hunt for a bat in the dining room with the help of Leibniz's rusty rapier.
In any case, I'm already looking forward to the last part of the trilogy, for whose translation we will probably have to wait another good year...

Links: -> Clearing up the Confusion in Wired News

Thursday, December 21, 2006

Semantic Search ... confusion ahead


When I was attending a talk by Thilo Götz on UIMA, the conversation turned to 'Semantic Search'. Up to that point, I had been quite sure about the meaning of this term, but I had to realize that different people understand it in rather different ways.
As I understand the term, 'Semantic Search' refers to all techniques and activities that deploy semantic web technology at any stage of the search process. Thilo Götz (and he is not alone in this) referred to 'Semantic Search' as a 'traditional' search engine that uses a semantically enriched search index (i.e. a search index that incorporates ontologies, or information/relationships inferred from ontologies).

From my point of view, the latter definition covers only a part of the whole process. Let's take a brief look at search engine technology: you have to consider index generation (including the crawling processes, information retrieval for creating descriptors, inverse indexing, overall ranking, clustering, ...) as well as user interaction (including query string evaluation, query string refinement, visualization of search results, and navigation inside the search domain), not to forget personalization (a personalized ranking of the search results, including some kind of 'pre-selection' according to the personal information needs of the user, a personalized visualization, etc.), which will become much more important in the near future.

But to generate a semantic search index, there are several possibilities to consider:

  • Using unstructured web-data (html, etc. ...) in combination with information retrieval techniques to map the information represented in the web-data to (commonly agreed) ontologies with well-defined semantics.

  • Using semi-structured web-data that already includes links to well-defined ontologies (commonly agreed upon, or at least mapped to some standard ontologies).


In both cases, the generation of a semantic index requires more than just the compilation of the retrieved data. Although the index might contain unstructured web-data together with ontologies of well-defined semantics, the main purpose of the index is to provide fast access to the information represented in it. To generate the answer to a query, the search engine simply does not have enough time to perform logical inferences that deduce knowledge (a.k.a. answers) online. This logical inference has to be done beforehand (i.e. offline), similar to the way today's PageRank is computed.
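
To make this offline step concrete, here is a minimal sketch (in Python, with a toy made-up subclass hierarchy) of materializing inferred knowledge before indexing, in the same spirit as precomputing PageRank:

```python
# Offline materialization sketch: before indexing, expand each entry's
# descriptors with terms inferred from a toy subclass hierarchy, so that
# no inference is needed at query time. All names are made up.
SUBCLASS_OF = {'jaguar_cat': 'big_cat', 'big_cat': 'animal'}

def materialize(terms):
    """Add all ancestors of every term to the descriptor set."""
    expanded = set(terms)
    for t in terms:
        while t in SUBCLASS_OF:
            t = SUBCLASS_OF[t]
            expanded.add(t)
    return expanded

print(materialize({'jaguar_cat'}))  # {'jaguar_cat', 'big_cat', 'animal'}
```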

So, what is the use of machine-processable semantics in a search engine's index data structure? The following possibilities can be considered (the list is open for additional suggestions...):

  • to add new cross-references between single index entries (associations),

  • to find similarities between index entries (= web-data) w.r.t. their content, and

  • to discover dependencies between single index entries to enable
    • better visualization of the retrieved domain of information, and also

    • efficient navigation to better fulfill the user's information needs.


  • of course also to disambiguate and to cluster web-data for better precision and recall (this is already done with current IR techniques).


To compile a semantic index, the crawling process also has to be reconsidered. While the primary goal of a web crawler is to gather as much web-data as possible as fast as possible (and, of course, to keep its collection consistent), a 'semantic' web crawler also has to look for semantic data besides unstructured web-data, e.g., RDF and OWL files, and for possible connections between unstructured web-data and semantic data. For crawling RDF or OWL, a traditional crawler has to be modified: while the traditional crawler just analyzes the HTML data for link tags, RDF and OWL don't contain link tags, but they often include several namespaces that determine new input for the crawler. Besides mere data gathering, the crawler should also preprocess the data for inclusion in the index. This task is often implemented as a separate step (and denoted as 'information retrieval'), but it influences the crawler's direction and crawling strategy and thus also has to be considered here.
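
As an illustration, here is a minimal Python sketch (function names are made up) of how such a crawler might harvest new crawl targets from an RDF/XML file: instead of looking for link tags, it collects namespace declarations and rdf:resource references.

```python
# Sketch: harvest crawl targets from RDF/XML (namespaces + rdf:resource).
import xml.etree.ElementTree as ET
from io import BytesIO

def extract_crawl_targets(rdf_bytes):
    """Collect namespace URIs and rdf:resource references as new frontier entries."""
    targets = set()
    # 'start-ns' events yield a (prefix, uri) pair for every xmlns declaration
    for event, value in ET.iterparse(BytesIO(rdf_bytes), events=('start-ns', 'start')):
        if event == 'start-ns':
            prefix, uri = value
            targets.add(uri)
        else:  # 'start': an element; rdf:resource attributes point to other resources
            for attr, attr_value in value.attrib.items():
                if attr.endswith('}resource') and attr_value.startswith('http'):
                    targets.add(attr_value)
    return targets
```
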
Web crawlers are often implemented in a distributed way to increase their efficiency by working in parallel. New URLs found in the crawled web pages can be assigned according to the (geographic) location of their domain: an instance of the distributed crawler then only receives new URLs that are located in the same (or a nearby) domain. The same argument holds for the resources to be crawled by semantic web crawlers. But for semantic crawlers, the (semantic) domain of the crawled data might also be of importance, e.g., an instance of the distributed crawler might be responsible for crawling a specific domain (= topic), or only domains that are (semantically) closely related to the domain of interest.
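
A corresponding partitioning step might look like the following sketch (both routing policies and all names are hypothetical):

```python
# Sketch: route discovered URLs/topics to distributed crawler instances.
import zlib
from urllib.parse import urlparse

def route_by_host(url, num_instances):
    """Traditional partitioning: the same host always goes to the same instance."""
    host = urlparse(url).netloc
    return zlib.crc32(host.encode()) % num_instances

def route_by_topic(topic, topic_to_instance, default=0):
    """Semantic partitioning: each instance is responsible for related topics."""
    return topic_to_instance.get(topic, default)
```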


For a semantic search engine, the compilation of an index from the web pages delivered by the crawler differs from the compilation process of a traditional search engine. Let us first recall the traditional index compilation process (for text-related data, i.e. this does not hold for multimedia data such as images or video clips):

  1. resource normalization, i.e. all resources that contain explicit textual content have to be transformed into text files

  2. word stemming, i.e. transform all terms of a retrieved and normalized web-document to their word stems

  3. stop word removal, i.e. cut out all terms that are not well suited for identifying the processed text file (i.e. that are not suitable as descriptors). Often only nouns are taken as descriptors (this can partly be achieved by applying POS taggers (= part-of-speech taggers)).

  4. black list processing, i.e. terms that for some reason do not qualify as descriptors are cut out.


This process results in a list of descriptors that describe the processed web-document. For each descriptor, a weight according to its descriptive value for the text file has to be calculated (e.g., by term frequency / inverse document frequency (tf-idf) or another weight function). The table resulting from combining the weighted descriptors with their underlying web-documents constitutes the index. By inverting this index, a descriptor delivers all related web-documents in combination with their pre-calculated weights (which determine how well a given descriptor is suited to describe the content of the corresponding web-document). To increase precision and recall, the general relevance of the web-documents can be computed beforehand (which is nothing else but Google's PageRank).
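
The following minimal Python sketch (with a crude toy stemmer; real systems would use, e.g., Porter stemming) illustrates steps 2-4 together with tf-idf weighting and the inversion of the index:

```python
# Sketch: descriptor extraction, tf-idf weighting, and index inversion.
import math
from collections import Counter, defaultdict

STOP_WORDS = {'the', 'a', 'is', 'of', 'and'}

def descriptors(text):
    """Crudely 'stem' by stripping a plural 's', then drop stop words."""
    terms = [w.lower().rstrip('s') for w in text.split()]
    return [t for t in terms if t and t not in STOP_WORDS]

def build_inverted_index(docs):
    """docs: dict mapping document id -> text; returns term -> [(doc, tf-idf)]."""
    n = len(docs)
    term_counts = {doc: Counter(descriptors(text)) for doc, text in docs.items()}
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    index = defaultdict(list)
    for doc, counts in term_counts.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total                  # term frequency
            idf = math.log(n / doc_freq[term])  # inverse document frequency
            index[term].append((doc, tf * idf))
    for postings in index.values():             # best-matching documents first
        postings.sort(key=lambda p: p[1], reverse=True)
    return index

index = build_inverted_index({'d1': 'the cat hunts mice', 'd2': 'the dog eats bones'})
print(index['cat'])  # only d1, weighted by its tf-idf score
```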

For a 'semantic index', metadata (such as, e.g., ontologies) have to be taken into account and combined with the traditional index...

...to be continued

Thursday, November 30, 2006

Literary Confusion according to Babel


As mentioned in a previous article, I have decided to also write about (i.e. to review) the books I'm currently reading. I realized that my ability to express myself in English is rather limited when it comes to authoring non-scientific content. I was also wondering whether it makes sense to write (in English) about books (written in German)... esp. if I want to address a German-speaking audience (at least concerning the non-scientific content of this blog). Fortunately, the new beta-blogger allows the use of tags, and thus it will be possible to categorize articles (and to distinguish between content written in English or German). To make it short: from now on, book reviews of non-scientific books written in German will also be written in German. Everything else (including reviews of books that I have read in English) will be written in English.
To all those who are also interested in these book reviews: you might consider using Babelfish for translation (although I have no idea about the quality of the result :)...

When I attended the ISWC a few weeks ago, I had already complained that my current reading material was simply too 'heavy' for my carry-on luggage. Consequently, I faced the problem of what to read during the two flights of more than 9 hours each (of course I also slept during that time... as well as I could...). Normally, I don't like reading several books in parallel, and since I wanted to avoid ending up with two half-finished books after my trip, I decided on short stories and novellas. A small volume of stories by Thomas Mann had been sitting on my bookshelf for quite some time, and so far I had not been able to summon the muse to read it. Since it met the requirements for my travel reading (at least in terms of weight), I lived through the world of 'Tonio Kröger' during the flight, tasted the curse of 'Wälsungenblut', and, as a contemporary who is completely indifferent towards dogs, acquainted myself with the abysses of the relationship between 'Herr und Hund' ('Man and Dog')...

All right, maybe I should begin by mentioning that I never had to make the acquaintance of Thomas Mann's works in my German classes at school. Many of my acquaintances groan out loud as soon as the name 'Thomas Mann' is even mentioned... certainly because unpleasant, long-suppressed memories of the depths of the soap opera 'Buddenbrooks', interpreted and analyzed ad nauseam in school, find their way back to the surface. But not for me. I first faced the 'Buddenbrooks' at the 'tender age' of almost 30. I thought, 'that should last me through Christmas week', but this monster of a family saga drew me in so thoroughly that after two and a half days I put it back on the shelf (sighing softly, because it was 'already' over). Recently, however, at a lunch during the Frankfurt Book Fair, the conversation turned to the most 'overrated' German authors. After expressing to my neighbor my incomprehension of her assessment that Fontane ranked at the very top of her list, 'Der Zauberberg' and especially 'Doktor Faustus' came to my mind. Without going into depth here, let me just briefly note that Thomas Mann would certainly have kept a better reputation in my memory if he had at least refrained from writing the latter of those two novels...

'Tonio Kröger' offers all the highs and lows of Mann's narrative tradition in condensed form. The relatively short novella begins with the story of Tonio Kröger's childhood and adolescence in the best entertaining 'Buddenbrooks' manner, and in its second half pours itself into an introspection (-> see 'Der Zauberberg') of Tonio in his role as an artist (in dialogue with and in letters to his artist friend Lisaweta). In the last third, our hero journeys back to the places of his childhood and begins to grasp that, as an artist, he is very much part of the 'society' he contemptuously rejects, and that feelings do not inhibit an artist's work but actually shape it... (-> for this transformation, see 'Doktor Faustus').
Verdict: a short trip through Thomas Mann's microcosm, highly recommended for anyone who wants a 'quick taste' of his world without having to take the risk of facing one of the aforementioned 'monsters' :)

'Wälsungenblut' ('The Blood of the Walsungs') is knit somewhat differently, even though the impressively vivid portrayal of the somewhat curiously decadent banker family Aarenhold is at times reminiscent of the Addams Family... Thomas Mann caricatures the pompous pathos of Richard Wagner's operatic atmosphere, here quite specifically that of the 'Walküre'. The novella depicts the incestuous love of the Walsung siblings Siegmund and Sieglinde, and Mann succeeds in delivering a staging perfected by its love of detail, ranging from the artistic pleasure of the opera quotation with a cognac cherry to the lovers' tryst on the bearskin (-> see Wagner's 'Walküre').
Verdict: an entertaining piece of whimsical prose in which Thomas Mann proves himself a master of pointed portrayal and an 'anti-Wagnerian'...

Tuesday, November 28, 2006

UIMA - Unstructured Information Management Architecture

This morning, we were invited to a talk given by Thilo Götz from IBM about UIMA (Unstructured Information Management Architecture), IBM's framework for the management of unstructured information, which happened to take place at the department of computational linguistics.
UIMA comprises (1) an architecture and (2) a software framework for the analysis of unstructured data (just for the record: structured data refers to data that has been formally structured, e.g. data within a relational database, while unstructured data refers, e.g., to text in natural language, speech, images, or video data). The purpose of UIMA is to provide a modular framework that enables easy integration and reuse of data analysis modules. In general, the UIMA framework distinguishes three steps in data analysis:

(1) reading data from distinguished sources
(2) (multiple) data analysis
(3) presentation of data/results to the 'consumer'
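
Schematically, the chain looks like this (a toy sketch of the pipeline idea only, not the actual UIMA API, which is a Java SDK with its own interfaces):

```python
# Toy pipeline in the spirit of UIMA's reader -> analysis -> consumer chain.
def run_pipeline(reader, analysis_engines, consumer):
    """Each document flows through every analysis engine, accumulating annotations."""
    for document in reader():
        annotations = []
        for engine in analysis_engines:
            annotations.extend(engine(document, annotations))
        consumer(document, annotations)

def read_documents():                      # step (1): a source of raw documents
    yield 'IBM released UIMA as a free SDK.'

def capitalized_words(doc, annotations):   # step (2): a toy 'analysis engine'
    return [(w, 'CandidateName') for w in doc.split() if w[0].isupper()]

def print_consumer(doc, annotations):      # step (3): hand results to the consumer
    print(doc, '->', annotations)

run_pipeline(read_documents, [capitalized_words], print_consumer)
```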

The UIMA framework also enables remote processing (and thus simple parallelization of analysis tasks). Unfortunately, at least up to now, there is no grid support for large-scale parallel execution.
Also, simple applications of UIMA, e.g. in semantic search, were presented (although their approach to semantic search means: do information retrieval on unstructured data and fit the resulting data into the index of the 'semantic search engine'...).
Nevertheless, we will take a closer look at UIMA. We are planning to map the workflow of our automated semantic annotation process (see [1]) onto the UIMA architecture, and I will tell you about the experiences we make...
UIMA is available as a free SDK, and the core Java framework is also available as open source.

References:
[1] H. Sack, J. Waitelonis: Automated Annotations of Synchronized Multimedia Presentations, in Proceedings of Mastering the Gap : From Information Extraction to Semantic Representation (MTG06 / ESWC2006), Budva, Montenegro, June 12, 2006.

Tuesday, November 21, 2006

Document Retrieval vs. Fact Retrieval - In Search for a Qualified User Interface


Today, if you are looking for information on the Web, you enter a set of keywords (a query string) into a search engine, and in return you receive a list (= ordered set) of documents that are supposed to contain those keywords (or their word stems). This list of documents (hence 'document retrieval') is ordered according to each document's relevance with respect to the user's query string. 'Relevance', at least for Google, refers to PageRank. To make it short, PageRank reflects the number of links referring to the document under consideration, each link weighted with its own relevance and adjusted by the total number of links starting at the document that contains the link (in addition to some black magic that is still under patent protection, see U.S. Patent 6,285,999).
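
For reference, the simplified PageRank formula from the original paper reads as follows, where d is a damping factor (usually set to 0.85), T_1,...,T_n are the pages linking to A, and C(T_i) is the number of outgoing links of T_i:

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)
```
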
But is this list really what the user expects as an answer? O.k., meanwhile we, the users, have become used to this kind of search engine interface. In fact, there are books and courses about how to use search engines in order to get the information you want. The interesting fact is that it is the user who has to adapt to the search engine interface... and not vice versa.
Instead, it should be the other way around: the search engine interface should adapt to the user, and even better, to each individual user! But how, then, should a search engine interface look? In fact, there are already search engines that are able to answer simple questions ('What is the capital of Italy?'). But they still fail at answering more complex questions ('What was the reason for Galileo's house arrest?').

In real life (at least if you happen to have one), if you are in need of information, you have different possibilities to get it:

  1. If there is somebody you can ask, then ask.
  2. If there is nobody to ask, then look it up (e.g. in a book).
  3. If there is nobody to ask, and if there is no way to look it up, then think!

Let's consider the first two possibilities. Both have their drawbacks. Asking somebody is only helpful if the person being asked knows the answer. (O.k., there is also the social aspect that you might get another reward just by making social contact... instead of getting the answer.) If the person does not know the answer, maybe she/he knows whom to ask or where to look it up; we might consider this a kind of referential answer. On the other hand, even if the person does know the answer, she/he might not be able to communicate it. Maybe you speak different languages (not necessarily different languages in the sense of 'English' and 'Swahili'; also consider a philosopher answering the question of an engineer...). Sometimes you have to read between the lines to understand somebody's answer. At least in some sense, we have to 'adapt' to the way the other person gives the answer in order to understand it.
Considering the other possibility of looking up the information, we have the same situation as when asking the WWW search engine. E.g., if we look up an article in an encyclopedia, we use our knowledge of how to access the encyclopedia (the alphabetical order of entries, reading the article, following links to other articles... being able to read...).
Have you noticed that in both cases we have to adapt ourselves to an interface? Even when asking somebody, we have to adapt to the way this person talks to us (her/his level of expertise, background, context, language, etc.). From this point of view, adapting to the search engine interface of Google does not seem to be such a bad thing after all...

When it comes to fact retrieval, the first thing to do is to understand the user's query. To understand an ordinary query (and not only a list of interconnected query keywords), natural language processing is the key (or even, as they say, the 'holy grail'). But even if the query phrase can be parsed correctly, we have to consider (a) the context and (b) the user's background knowledge. While the context helps to disambiguate and to find the correct meaning of the user's query, the user's background determines her/his level of expertise and the level of detail at which the answer is best suited for the user.

Thus, I propose that there is no such thing as 'the perfect user interface'. Rather, different kinds of interfaces might serve different users in different situations. No matter what the interface will look like, we, the users, will adapt (because we are used to doing that, and we learn very quickly). Of course, if the search engine is able to identify the circumstances of the user (maybe she/he is retrieving information orally with a cell phone, or the user is sitting in front of a keyboard with a huge display), the search engine may choose (according to the user's infrastructure) the suitable interface for entering the query as well as for presenting the answer...

WebMonday 2 in Jena - Aftermath


Yesterday evening, the 2nd WebMonday took place in Jena's Intershop Tower. I thought that the number of participants who happened to come by last time could not be surpassed (we had almost 50 people up there), but believe it or not, I counted more than 70 people this time! Lars Zapf moderated the event, and we had 4 interesting speakers this evening.
For me, the most interesting talk was the presentation of Prof. Benno Stein from the Bauhaus University Weimar about information retrieval and current projects. He addressed the way we use the web today for retrieving information. Most current search engines only offer 'document retrieval', i.e. after evaluating the keywords given in the user's query string, the search engine presents an ordered list of documents that the user has to read in order to get the information. Instead, the more natural way to get information would be to ask a question and to receive a 'real' answer (= fact retrieval). I will discuss these different types of 'user interfaces' in an upcoming post. It is interesting to mention that Weimar is so close to Jena and that our research really seems to have some interconnections (thus, this new contact might be considered another networking success of WebMonday).
After that, Matthias Leonhard gave the first part of a series of talks related to Microsoft's .NET 3.0.
Then, Ryan Orrock addressed the problem of 'localization' and translation of applications. When translating an application into another language, simply translating all text parts is not sufficient: there are also different units of measure to consider, as well as the adaptation of the screen design if texts in different languages have different sizes.
In the last presentation, Karsten Schmidt addressed networking with openBC/Xing, an interesting social networking tool that is supposed to facilitate business contacts. (At least now I know that I need some other tool to store (physically) my (and other people's) business cards :) ).
Even more interesting was, as always, the socializing part after the presentations. Markus Kämmerer took some photos.

Here you can find other blog articles on the 2nd WebMonday:

Wednesday, November 15, 2006

wikipedia to serve as a global ontology....



Today, I met Lars Zapf for a quick coffee, enjoying the rare late-afternoon November sun. We exchanged news about ISWC, WebMonday, recent projects, and stuff like that. While talking about semantic annotation, Lars pointed out that instead of using (or developing) your own ontologies for annotating (and authoring) documents, you could also use a wikipedia reference to indicate the semantic concept that you are writing about. Thus, as he already wrote in a comment, you could, e.g., use the link http://en.wikipedia.org/wiki/Rome to indicate that you are referring to the city of Rome, the capital of Italy.
Of course, you might object that there are several language versions of wikipedia, and thus there are several (different) articles that refer to the city of Rome. To use wikipedia as a 'commonly agreed and shared conceptualization' (fulfilling at least some points of Tom Gruber's ontology definition, as long as wikipedia lacks the 'formal' aspect of machine understandability), we can make use of the fact that articles in wikipedia can be identified with the articles in other language versions via the language indicators at the lower left side of wikipedia's user interface. To serve as a real ontology, each wikipedia article should (at least) be connected to a formalized concept (maybe encoded in RDF or OWL). This concept does not necessarily have to reflect all the aspects that are reported in the natural-language wikipedia article. E.g., Semantic MediaWiki is working on a wiki extension to capture simple conceptualizations (such as classes or relationships).
An application for authoring documents could easily be upgraded by offering links to related wikipedia articles. If the author enters the string 'Rome', the application could offer the related wikipedia link to Rome [or a selection of related offers], and according to the author's choice, this link can automatically be encoded as a semantic annotation (link).
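
A toy sketch of that authoring aid (all names, and the rel attribute in particular, are hypothetical) could look like this:

```python
# Sketch: offer Wikipedia links for a typed term and emit the chosen one
# as an embedded semantic annotation.
CANDIDATES = {
    'Rome': ['http://en.wikipedia.org/wiki/Rome',
             'http://en.wikipedia.org/wiki/Rome,_Georgia'],
}

def annotate(term, choice=0):
    """Wrap a term in a link whose target identifies the intended concept."""
    uri = CANDIDATES[term][choice]
    return '<a href="%s" rel="wikipedia-concept">%s</a>' % (uri, term)

print(annotate('Rome'))  # the link target disambiguates which 'Rome' is meant
```
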
O.k., that sounds pretty simple. Are there any students out there who would like to implement it (anybody in need of credit points??)? I would highly appreciate that...

International Semantic Web Conference 2006 (ISWC 2006) - Aftermath


Back home again; the jetlag is almost gone, while I am already travelling to Potsdam again for a talk on 'Semantic Annotation in Use'... After all, ISWC 2006 was a very nice and interesting conference, although it was set up at a rather remote location (at least from my point of view). One of its highlights (as already pointed out) was the panel discussion about web 2.0 and the semantic web. Leo Sauermann raised the question why there is such bad marketing of semantic web technology. Obviously, as TBL replied, because the W3C invests all funding into hiring scientists, not marketing people. One of the major problems is that semantic web applications don't look 'cool' and 'sexy'... and therefore they don't get public attention. BTW, right at the same time, the 3rd Web 2.0 conference took place in San Francisco. Why did the conference organizers of ISWC not try to set up a panel discussion with a live connection between the two conferences? At ISWC, of course, there were only 'Semantic Web' people and (at least as far as I noticed) nobody from the 'Web 2.0' community. All right, you can be both SemWeb and Web 2.0. But as long as you are focused on the Semantic Web, most people will share a common focus (and argument) on Web 2.0. Another point of view would surely have been interesting to listen to.

Closely related to the lack-of-marketing question is the question about the Semantic Web killer application. Nobody knows what type of application it will be, of course... otherwise the application would already be there. But as with all killer applications, it will not necessarily be something 'really' useful :) Consider that the killer application for the WWW (at least in its early beginnings back at CERN) was the 'telephone book'. Not to mention SMS and the mobile phone. Maybe the semantic web killer application will be related to rather ordinary applications, such as a dating service that is really able to find a match...

BTW, I have switched to the new beta release of blogger.com... (comments should finally be working now, and also keywords)... and the other guy in the picture is Ulrich Küster (also from FSU Jena) at the ISWC dinner reception...

Thursday, November 09, 2006

International Semantic Web Conference 2006 (ISWC 2006), Athens (GA), USA - Day 3


Thursday, the last day of ISWC, started with a keynote by Rudi Studer from the University of Karlsruhe on 'The Semantic Web: Suppliers and Customers'. He drew the historic connection from databases to the Semantic Web as a web of human-readable content connected to machine-interpretable semantic resources. He also pointed out the importance of interdisciplinary research for realizing the Semantic Web, while, on the other hand, the Semantic Web also contributes to other disciplines and communities. After that, I listened to an interesting talk by Andreas Harth from DERI on 'Crawling and Indexing the Semantic Web', where he introduced an architecture for a semantic web crawler and gave some first results.
The most interesting talk of the day was the one by Ivan Herman from the W3C (here you can find his FOAF data) on 'Semantic Web @ W3C: Activities, Recommendations and State of Adoption'. He proposed 2007 to be the 'year of rules', because we might finally come to a recommendation concerning rule languages for the Semantic Web. He also mentioned the efforts of integrating RDF data into XHTML via RDFa or, vice versa, getting RDF data out of XHTML with GRDDL.
The ISWC closed with the announcement of the best paper awards and the winners of this year's semantic web challenge.
If you are interested in the conference, you might have a look at the video recordings of the talks.

International Semantic Web Conference 2006 (ISWC 2006), Athens (GA), USA - Day 2


Wednesday... the 2nd day of ISWC started with a keynote by Jane E. Fountain from the University of Massachusetts in Amherst about 'The Semantic Web and Networked Governance'. From her point of view, governments have to be considered major information-processing [and knowledge-creating] entities in the world, and she tried to point out the key challenges faced by governments in a networked world (for me, the topic was not that interesting...). Today's sessions, at least those that I attended, were not that exciting either. I liked one presentation given by Natasha Noy from Stanford on 'A Framework for Ontology Evolution in Collaborative Environments' in the 'Collaboration and Cooperation' session. She presented an extension of the Protégé ontology editor for collaborative ontology development.
The most interesting session for me was the 'Web 2.0' panel in the afternoon. Among the panelists were Prof. Jürgen Angele (Ontoprise), Dave Beckett (Yahoo!), Sir Tim Berners-Lee (W3C), Prof. Benjamin Grosof (MIT Sloan School of Management), and Tom Gruber. The panel discussed the role of semantic web technology for web 2.0 applications.


Jürgen Angele pointed out that the only thing that is really new about web 2.0 is ad-hoc remixability. Everything else is nothing but 'old' technology. But, as he stated, web 2.0 could be a driving force for semantic web technology.

Dave Beckett did some advertising for Yahoo!, pointing out that Yahoo! indeed makes use of semantic web technology (at least in their new system called Yahoo! Food) and that Yahoo! is a great participation platform with more than 500 million visitors per month.

Tim Berners-Lee gave a survey of the flaws and drawbacks of web 2.0 and of how semantic web technology could help. While web 2.0 is not able to provide real inter-application integration, the semantic web, on the other side, does not provide such cool interfaces to data. Together, in combination, they could become interesting.
All the so-called new aspects of web 2.0 have already been goals of the original web (1.0): easy creation of content, collaborative spaces, intercreativity, collective intelligence from designing together, creating relationships, reuse of information, and of course user-generated content. The web 2.0 architecture consists of client-side (AJAX) interaction, server-side data processing (a.k.a. the good old 'client-server' paradigm), and mashups (one per application / each needs coding in JavaScript, each needs scraping/converting/...). Essentially, web 2.0 is fully centralized. So why are Skype, del.icio.us, or flickr websites instead of protocols (as FOAF is)? The reuse of web 2.0 data is limited to the host side. Only with the help of feeds are data able to break out of centralized sites. What will happen with all of your tags? Will they end up as simply being words, or will they become real (and useful) URIs?
With semantic web technology, web 2.0 enables multiple identities for you. You may have many URIs, enabling you to access different sorts of data and to fulfill different expectations concerning trust, accuracy, and persistence. In the end, web 2.0 and the semantic web, while being good separately, could be great together!

Benjamin Grosof asked where semantic web technology could help web 2.0. He focused on backend semantic integration and mediation (augment your information via shallow inferences), collaboration, and semantic search. Semantic search will enable a more human-centered search interface, e.g., 'Give me all recipes for cake... but I don't like any fruit' and 'I want a good recommendation from a well-reputed web site'. He sees semantic web technology piggybacking on web 2.0 interactions ('web 2.0 = search for terrestrial intelligence in the crowd' :). The semantic web should exploit web 2.0 to obtain knowledge.

Tom Gruber asked: 'Where is the mojo in Web 2.0?'. He characterized web 2.0 as a fundamentally democratic architecture, driven by social and entertainment payoffs (universal appeal...), while the web 1.0 business model actually keeps working ('attention economy'). He discussed the way from today's 'collected intelligence' to real 'collective intelligence'. He concluded: 'don't ask what the web knows... ask what the world knows!' and 'don't make the web smart... make the world smart'.

Wednesday, November 08, 2006

International Semantic Web Conference 2006 (ISWC 2006), Athens (GA), USA - Day 1


Tuesday morning, 9 a.m. ... ISWC 2006 starts with the keynote of Tom Gruber (godfather of the computer-science-based definition of the term 'ontology') on 'Where the Social Web Meets the Semantic Web'. He focused on 'Collective Intelligence' as the reason that companies like Google or Amazon survived the first dot-com bubble: they were making use of their users' collective knowledge. Google uses other people's intelligence by computing a page rank out of the users' links to other webpages, Amazon uses people's choices for its recommendation system, and eBay uses people's reputations. The interesting thing about this is that the notion of 'Collective Intelligence' (a.k.a. 'Social Web', a.k.a. 'Web 2.0') was already addressed by Douglas Engelbart in the late 60s. Engelbart did not only invent the mouse, the window-based user interface, and many other important things that are part of today's computing environment; his driving force, as Gruber said, was 'Collective Intelligence'... to cope with the set of growing problems that humanity is facing today. Thus, as I have also stated in another post, the semantic web, too, depends on the collaboration and participation of its users, and therefore on 'Collective Intelligence', to become a success.

BTW, I prefer using the term 'Social Web' instead of 'Web 2.0'. From my point of view 'Social Web' hits exactly the point and does not suggest any new and exciting technology (but only the fact that people are using existing web technology in a collaborative way to interact with each other).

After the keynote, I visited the 'Knowledge Representation' session with an interesting talk by Sören Auer on OntoWiki (a semantic wiki system... interesting, because one of my students is also implementing a semantic wiki). In the afternoon sessions, I especially liked the talks about representation and visualization (esp. the talk of Eyal Oren on 'Extending faceted navigation for RDF data', where he presented a nice server application that is able to visualize arbitrary RDF data). In the evening, a dinner buffet (including Cuban music) was combined with the poster sessions and the 'Semantic Web Challenge' exhibition, where I found a possibility for cooperation with Siegfried Handschuh from DERI (on semantic authoring and annotation...).

Oh...I already forgot to mention that there is also a flickr group with ISWC photographs...

Monday, November 06, 2006

International Semantic Web Conference 2006 (ISWC 2006), Athens (GA), USA - Day 0


The very first day here at ISWC... ok, it's the workshop and tutorial day. Officially, the conference will start tomorrow morning with Tom Gruber's keynote. I already arrived here in Athens on Saturday. The 10-hour flight from Frankfurt was really exhausting... at least I had no stop-over. Athens is about 90 minutes away from Atlanta and is famous for its university, which is the oldest publicly funded university in the US. It has a really nice historic campus (I will provide some pictures later on).
Today started with the '1st Semantic Authoring and Annotation Workshop' (SAAW 2006), where I had two papers to present... two days ago, I was told (by email) that the short presentations would be 'lightning talks' of 5 minutes each. I had prepared slides for 15-minute talks :) ... and was a little bit 'pissed off' at having to throw away all the 'interesting stuff'. But at least I could raise the interest of a few people. The afternoon workshop (on Web Content Mining with Human Language Technologies) also had some interesting topics. In particular, I liked the talk of Gijs Geleijnse about 'Instance Classification using Co-Occurrences on the Web'. It was about classifying musicians and artists (as instances) by their genre (as concepts) through finding co-occurrence relationships of terms with the help of Google.

Wednesday, November 01, 2006

From the 'Hiroshima Gate' to 'Confusion'......


O.k.... I decided to start putting reviews of the books I am currently reading into the blog. The very first one is 'The Hiroshima Gate' by Ilkka Remes. I don't think it is available in the U.S. yet... I guess the reason for this is that the book was originally written in Finnish, and the stories told by Ilkka Remes put a Finnish executive of the European Community, Timo Nortamo, in the focus. Timo Nortamo is occupied with solving a strange case of murder that happened in Paris. While delivering a disk with secret KGB data (old but seemingly important data from the times of the cold war), a woman jumps from a bridge into the river Seine... and is found with her throat cut. But besides the KGB data, which might put Finland's current prime minister under suspicion of having conspired with the KGB, the disk contains additional data that attracts secret agencies from around the world... you have to listen to a lot of conspiracy theories, ranging from aliens from outer space (Erich von Däniken revisited...) and ancient superior civilizations up to anti-matter bombs...
Well... in the end, everything turns out to be rather down-to-earth (without giving away the story...). An interesting mixture of Dan Brown and Michael Crichton (well... not really: I don't like these very short chapters... just cliffhangers and cliffhangers again...), but less mystical and less scientific (if you can say so). Remes introduces you to a Europe-centered world with the USA and China acting as the villains. The characters remain pretty flat, although you read a lot about our hero's family problems (marriage and cheating, father-son conflicts, alcohol problems, or Nortamo losing his temper...). I wouldn't read the book a second time... thus, I guess, this means at most 2-3 stars out of 5.
Yesterday, I began to read 'The Confusion', the second book of Neal Stephenson's Baroque Cycle. Unfortunately, it is much too heavy (almost 1.5 kg) to take with me on Saturday's flight to the ISWC (International Semantic Web Conference) in Athens (GA, USA)...

Wednesday, October 25, 2006

From Web to Semantic Web - the 'Missing Link'


Starting with WebMonday's keynote in September, the topic seems to get more attention every day (see also the discussion in Markus Trapp's blog entries on the Semantic Web and Web 2.0...). Web 2.0 almost seems to have reached the top of Gartner's hype cycle, and you will surely have noticed that purchasing Web 2.0 companies has become rather expensive. Are we chasing a new bubble? But that's another question, which I won't discuss today.
I want to emphasize the thesis that, to become a success, the 'Semantic Web' depends on 'Web 2.0 paradigms'.
Just follow this simple train of thoughts:

  1. The 'Semantic Web' assumes that web pages provide semantic annotation, i.e. subjects discussed in a web page are linked to semantic metadata (ontologies) that provide well-defined meaning processable by machines.
  2. The current Web comprises billions of web pages (at least more than 20 billion web pages seem to be indexed by Google).
  3. How to annotate billions of web pages?
    (a) Currently, there's no way to annotate web pages automatically. A sound semantic annotation is only possible with true text understanding.
    (b) Of course, authors might provide annotations for their own web pages. But even if there were efficient tools for manual annotation, there are still billions of 'old' web pages that would also have to be annotated.
    (c) So why not engage all web users? Just think of wikipedia... If there were tools for collaborative semantic annotation of web pages, users would be able to annotate the web pages that are of interest to them.
Thus, under the assumptions that (a) there are already sound ontologies available and (b) there are tools for collaborative annotation, the annotation of billions of already existing web pages seems viable.

Just think of current collaborative tagging systems (CTS). Despite its deficiencies, collaborative tagging shows the way to providing metadata in a collaborative manner.

Monday, October 23, 2006

Last night at the opera...


On Sunday, we spent a diverting evening at the opera... well, not exactly... we visited the 'Deutsches Nationaltheater Weimar' (DNT), where they performed 'Il cappello di paglia di Firenze' (The Italian Straw Hat), an opera by Nino Rota, a famous Italian composer particularly remembered for his work on film scores, esp. 'The Godfather' series and a number of films by Federico Fellini. The story of the opera could be right out of a Marx Brothers movie: not only has Fadinard's coach horse just nibbled on the Italian straw hat of a totally unknown lady, no... it also happens to be his own wedding day. The wedding guests are just assembling for the ceremony, while the groom brings a total stranger into his house, who insists on getting her hat replaced immediately! What about Mr. Fadinard's bride, Helene... and what about his father-in-law, who seems eager to call off the engagement... Fadinard hunts for a replacement straw hat with the whole wedding party close on his heels... and it all ends in total confusion...
The opera premiered in 1955, and it was rather obvious to me that Rota was also responsible for a lot of film scores, although there were many citations of Rossini, Verdi, and Puccini. All in all, it was great fun, almost a slapstick comedy, although the actors could have acted a little more briskly. As usual, the 'Staatskapelle Weimar', this time under conductor Marco Comin (I hadn't seen him before...), did a great job, and I especially want to highlight the choir (I really loved the two nuns... esp. the one with the pocket flask). BTW... the libretto of the opera (based on a play by the 19th-century French author Eugène Labiche) was also the subject of an old German movie from 1939 ('Der Florentiner Hut') with Heinz Rühmann.
So... if you want to spend an entertaining night at the opera and you like film scores as well as Rossini and Puccini, then try the 'Florentiner Hut'... :-)

Saturday, October 21, 2006

Searching Multimedia at the ETH Zürich


On Wednesday, I was invited to give a presentation about 'Efficient Search in Multimedia Presentations based on Automated and Collaborative Annotation' at the ETH Zürich. We, Jörg Waitelonis and myself, had been invited by Olaf Schulte (head of Video Services at the ETH) to present our solution for lecture recording (at the FSU Jena) and, of course, the possibility of making the content of the recorded video lectures accessible via content-based search, with as little human interaction (i.e. manual annotation of video data, etc.) as possible. Just before we started with our presentation, we noticed some problems with the RealPlayer interface that occur when synchronizing different video streams not from the start, but from the point in time that is returned as a search result (we're working on it...). Right now, our lecture recording system records the video signal of the lecturer and the desktop being presented by the lecturer as two separate video streams that have to be synchronized for replay (with SMIL). If both streams are transferred via the internet, we have sometimes observed serious synchronization problems... Maybe we should put both streams together (including a table of contents for in-video navigation) into a single MPEG-4 container... we'll see.
When it came to the point that recorded video lectures should be archived for the long term, the argument was about which video format to choose (... with RealMedia not being on the favoured side).
Zürich is a rather nice place... but also really expensive. We had a room in a rather nice hotel (designhotel plattenhof) near the ETH, including 'designer lamps', 'designer furniture', and a 'designer bathroom'. Thus, I almost amalgamated with my surroundings by wearing my 'designer stubble' :-). Travelling to Zürich by train is somewhat stressful... simply because of the 7-8 hours you need to get there from Jena (Weimar).

Sunday, October 15, 2006

Semantic Web - Semester Start on Monday, October 16th


Why is the semester break always so short...?? Okay, we almost had 3 months of 'vacation' (our so-called 'vacation' means time for REAL work, i.e. doing research, writing papers, visiting other researchers, attending conferences and workshops, writing applications for research grants, trying to establish new research cooperations, etc...). This semester, I will give a lecture on the 'Semantic Web'. So this time, lecturing will be great fun, because of the close connection to one of my most important research topics. As usual, all lectures will be recorded and annotated (for content-based search). You may participate via live streaming (Mondays, 12.15pm - 1.45pm) or have a look into the video archive. I'm looking forward to meeting the roughly 15 students who have already signed up for the lecture. The lecture will cover the following subjects:
  • The limits of the current WWW
  • The vision of the Semantic Web
  • Languages of the Semantic Web
    • XML / RDF / RDFS / OWL / SPARQL / SWRL /...
  • Ontologies
    • Philosophy / Computer Science / Ontology Design
    • Description Logics / Inference Systems / Frameworks
    • Folksonomies
    • Ontology Engineering (Alignment / Merging / ...)
  • Semantic Web Applications
Heiko Peter will supervise the lab courses, and, if we are lucky, some of the students will take over a thesis (Studienarbeit / Diplomarbeit) related to Semantic Web technologies.

Tuesday, October 10, 2006

...and now for something completely different

I found a wonderful poem written by T. S. Eliot in 1925:

The Hollow Men

We are the hollow men
We are the stuffed men
Leaning together
Headpiece filled with straw. Alas!
Our dried voices, when
We whisper together
Are quiet and meaningless
As wind in dry grass
Or rats' feet over broken glass
In our dry cellar

Shape without form, shade without colour,
Paralysed force, gesture without motion;

Those who have crossed
With direct eyes, to death's other Kingdom
Remember us -- if at all -- not as lost
Violent souls, but only
As the hollow men
The stuffed men.
...
(This is only the beginning of the first part of the poem. The rest (parts 2 -- 5) you may find here...)

Frankfurt Book Fair and Library 2.0


On Friday, October 6th, Abebooks hosted a panel discussion on library 2.0 ('Revolution im Karteikasten: Bibliotheken & Web 2.0') at the Frankfurt Book Fair 2006. My part in the panel, as the 'Web 2.0 Experte' (at least that's what was written on my badge :)), was to explain web 2.0 technology (of course not too technically...). The other members of the panel were Marcus Polke, executive manager of Abebooks Europe, Tim Spalding, founder of LibraryThing, my dear colleague and friend Steffen Büffel from the University of Trier (together we had already moderated a workshop on library 2.0 at the InetBib in Münster), and Dr. Holger Schmidt, editor at FAZ, who anchored the panel.
Despite our topic, a 'real' librarian was missing from the panel, but at least there were a number of librarians in the audience, as we realized in the discussion later on. The result of our discussion was quite obvious... as expected, everybody welcomed the opportunities that web 2.0 offers for the library business, esp. the possibility of connecting collaborative tagging systems with library online catalogues via OpenURL. Thus, books can be tagged by the community of all users/readers (not only by the users of the local library), and the tagging information (metadata) can be used for improving the search capabilities within the local library.
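
As a sketch, such a connection could be as simple as generating an OpenURL for each tagged book, assuming the OpenURL 1.0 KEV format (the resolver address below is made up):

```python
# Sketch: build an OpenURL for a tagged book so that a local library's link
# resolver can answer 'do we hold this?'.
from urllib.parse import urlencode

def openurl_for_isbn(isbn, resolver='http://resolver.example-library.org/'):
    params = {'url_ver': 'Z39.88-2004',
              'rft_val_fmt': 'info:ofi/fmt:kev:mtx:book',
              'rft.isbn': isbn}
    return resolver + '?' + urlencode(params)

print(openurl_for_isbn('0060930837'))
```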

Another account and nice pictures of the panel discussion can be found here:

Wednesday, September 27, 2006

XML Tage Berlin 2006


I'm just back from the XML-Tage in Berlin, an annual event dedicated to XML technology, which means a really broad spectrum of topics. For a mere 'national' event (the conference language was German), a huge number of participants showed up; the list of participants contains more than 500 entries. But the lecture room where our sessions took place wasn't crowded at all (30 - 50 people). Interesting for me were the two invited talks: Heiner Stuckenschmidt from Uni Mannheim asked 'what's wrong with the Semantic Web?', making the case for modular ontologies and distributed reasoning; thus, for the semantic web, we will need some kind of distributed semantic google... Andreas Hotho from Uni Kassel gave a presentation on BibSonomy, the well-known social tagging (BibTeX database) service. In the same way as del.icio.us, BibSonomy tags URLs. But in addition, you are able to manage (and tag) your own (and others') BibTeX bibliographies... a rather useful tool for authoring scientific papers.
If everything works out well, we will cooperate with Andreas and his group by enhancing BibSonomy's features as well as by exploiting BibSonomy's database for ontology learning and data mining.

Tuesday, September 19, 2006

Web Monday in Jena - Aftermath


Yesterday evening, I visited the first Web Monday in Jena (see yesterday's blog post). All in all, it was a real success. With almost 50 participants (ok... some 10 or 20 of them coming from synchronity alone), I had a large audience for my presentation on 'Semantic Web 2.0 Hype...'. After Lars had welcomed all the guests, he briefly explained the idea of Web Monday and its relation to web 2.0 in general. It is interesting to mention that I became acquainted with Lars via openBC, although his office is only a few minutes from mine. He told me about Web Monday about a month ago, and I volunteered to give the very first keynote speech, although I first thought that I only had to give a 5-10 minute spotlight talk (as is the traditional way for Web Mondays).
My talk went rather smoothly, and according to the echoes from our audience (see the blog references below), they seemed to like it.
For me, the following discussion and socializing were even more interesting. To mention only a few of the emerging topics:
  • web 2.0 and handicapped users -- a rather interesting and IMHO also important topic (application of specialized CSS, etc...)
  • user generated metadata beyond mere tagging (semantic wikipedia, ontology editors, etc...)
  • semantic web technology vs. information retrieval technology
  • ajax vs. modern programming paradigms and wasting resources (bandwidth)
  • applications of the web of trust
All in all, the evening was a real success, and we should be curious about the upcoming Web Mondays (proposed for October 16th and November 20th). So stay tuned!
Finally some echoes from other blogs:

Monday, September 18, 2006

Web Monday in Jena

Today, the first Web Monday in Jena will take place at TowerByte eG, c/o Intershop Tower, Leutragraben 1, 07743 Jena at 19.30. Lars Zapf volunteered to bring the idea of Web Monday to Jena. Web Monday is kind of an informal meeting focussed on the topic of web 2.0 (in the broadest sense). It connects users, developers, founders, entrepreneurs, venture capitalists, researchers, web pioneers, bloggers, podcasters, designers and other folks interested in Web 2.0 topics. I will host the 'opening session' with a talk about 'Semantic Web / web 2.0 - Hype and Reality'.
The tricky thing for me in this talk is how to connect the semantic web and web 2.0... and the answer is pretty simple: What is needed to bring the Semantic Web to life? [short theatrical pause] Yes, it's metadata! And what's the 'dernier cri' concerning web 2.0? [another pause...] Yes, collaborative tagging! So maybe we should choose collaborative tagging as a kind of 'missing link' between web 2.0 and the semantic web.
Of course, you will say, tags are only flat (shallow) metadata without any structure, etc. ... But why not create structured metadata (or even 'deep' and complex metadata) in a collaborative way? What's missing are, again, suitable interfaces (technology) for easy (non-expert) metadata (ontology) editing. In the semantic web, we somehow have the same situation as in pre-blog/pre-wiki times... only experts are able to create complex metadata... but there are too few of them... and thus, too little metadata. Semantic Wiki is one initiative that points in the right direction. Rico Landefeld is one of my students, who is implementing a fully functional semantic wiki (maariwa) for his diploma thesis. More to come soon...
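
To illustrate the difference in a nutshell (purely hypothetical data): flat tags are bare words, while structured metadata relates a resource to well-defined concepts.

```python
# The same photo, described by flat tags vs. minimal structured metadata.
flat_tags = ['rome', 'vacation', '2006']

structured = [('photo42', 'depicts', 'http://en.wikipedia.org/wiki/Rome'),
              ('photo42', 'takenInYear', '2006')]
```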

Monday, September 11, 2006

InetBib 2006 in Münster


On Wednesday, September 6th, I visited InetBib 2006 (a conference on libraries and the internet) in Münster. Together with Steffen Büffel and Michael Schaarwächter, I contributed to the workshop 'Web 2.0-Technologien: Zukunft der Bibliothek - Bibliothek der Zukunft' ('the future of libraries' and 'the library of the future'). From my point of view (as a computer scientist), libraries and Web 2.0 in combination constitute a rather interesting subject. Decentralization of information, active interaction with and involvement of the users (esp. w.r.t. providing keywords and tags for the library index (catalog)), and blogs and wikis as new (living) web-based library information systems are only a few of the rather promising approaches. I have collected a few links to articles and further resources about InetBib 2006:

Friday, September 01, 2006

An Integrated View on Document Annotation.... (part 3)

As already mentioned in [1], we identify three semantically interrelated structures within a document or a collection of documents:
  • logical structure (chapters, paragraphs, pages, ...)
  • conceptual structure (index entries, concepts, definitions, ...)
  • referential structure (references, associations, links, ...)

All three in concert form the so-called dependency graph (or reading graph).

Now, what is the purpose of this dependency graph? Imagine that your intention is to understand a certain topic. What you usually do is enter one or more descriptive keywords related to that topic into a search engine. More traditionally, you would look up those keywords in the index of a textbook to identify some relevant sections to read in order to understand the given topic. Probably you look up a certain index entry, and the corresponding section of the text contains other terms that you don't understand. Thus, you look up those terms in the index to read the referring sections, where you will probably again find terms that you don't understand. You have entered an iterative process that stops when you have read all sections covering all the terms that you didn't understand before looking up the first index entry.

In this way, we recursively define a rather simple concept of understanding: we have understood a term once we have understood all terms that are mentioned in a descriptive definition of that term.

This iterative process is closely related to self-containment. To understand a topic therefore means to calculate the self-containment closure of that topic. The size of the closure is determined by where the iterative process ends, and the process ends when there are no new terms that are not yet understood (i.e. read). Thus, the size of the closure depends on how much the user already knows (i.e. has read), differs between individual users, and changes (adapts) while the user is reading (understanding).
The dependency (reading) graph is then ordered sequentially (with the help of the logical document structure) to form a roadmap that leads to a better understanding of the given topic.
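Sketched in Python, the closure computation could look like this; the 'definitions' mapping (which terms occur in the definition of which other term) and the 'position' mapping derived from the logical document structure are assumed inputs, not something I have implemented:

    def closure(topic, definitions, already_known=frozenset()):
        """Collect all terms needed to understand 'topic', minus what is already known."""
        to_read, seen, result = [topic], set(already_known), set()
        while to_read:
            term = to_read.pop()
            if term in seen:
                continue               # already understood -> smaller closure
            seen.add(term)
            result.add(term)
            to_read.extend(definitions.get(term, ()))
        return result

    definitions = {"closure": ["set", "iteration"], "iteration": ["set"], "set": []}
    position = {"set": 1, "iteration": 2, "closure": 3}  # logical document order

    reading_list = sorted(closure("closure", definitions), key=position.get)
    print(reading_list)  # ['set', 'iteration', 'closure'] -- the roadmap

Note how the 'already_known' parameter makes the closure user-dependent, exactly as described above.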
(to be continued....)


References:
[1] An Integrated View on Document Annotation (part 1) (part 2)

Tuesday, August 15, 2006

flickr!

Quite a lot has been written about flickr. Of course it's a great tool combining a web-based picture database with a collaborative tagging system, and I do like it. Image processing and tagging, creating groups and social networks... really well done. But, from my point of view, what really sucks is flickr's tag-based image search. Here you can find all the negative stuff that can be argued against the use of collaborative tagging systems (CTS).

Ok, let's start with the general shortcomings of keyword based search engines:
(1) Homonymy: The keyword used to query the search engine can have multiple meanings. This means you will get many non-relevant results (bad precision), because in those results the keyword refers to another meaning.
(2) Synonymy: There are other words that have the same meaning as the keyword you entered as the query string. Thus, you won't get all relevant results (bad recall). This is aggravated by the fact that flickr is an international platform and the tags in use are of multilingual origin.

Usually, one tries to cope with these shortcomings by refining the query string -- broadening or narrowing the search results by connecting additional keywords with Boolean operators (AND / OR).
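A little Python sketch of this refinement (the corpus and the synonym table are made up): AND narrows the result set to fight homonymy, OR broadens it to fight synonymy.

    docs = {
        1: {"jaguar", "car", "speed"},
        2: {"jaguar", "cat", "jungle"},
        3: {"ocelot", "cat", "jungle"},
    }
    synonyms = {"wildcat": {"jaguar", "ocelot"}}

    def search(must_have, any_of=frozenset()):
        """must_have is ANDed (precision), any_of is ORed (recall)."""
        return [d for d, tags in docs.items()
                if must_have <= tags and (not any_of or any_of & tags)]

    print(search({"cat"}, synonyms["wildcat"]))  # [2, 3]: OR improves recall
    print(search({"jaguar", "cat"}))             # [2]: AND improves precision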
In a CTS there are additional problems that originate from the way people generate tags.
(3) Tagging: Different kinds of tags can be distinguished. A rough classification distinguishes between descriptional and functional tags. While descriptional tags -- as the name says -- describe the tagged resource in a general way (and thus might also be useful for others), functional tags are most of the time relevant only to the single user who tagged the resource.
Classic example: A picture of an apple can be tagged as 'apple' (relevant for all users) as well as 'breakfast' (relevant for maybe a fraction of users).
Taking all three arguments together, collaborative tagging systems really have some problems if they are considered as search engines, because bad precision and recall lead to bad results.

Another point is relevance. Search engines such as Google introduce relevance weights (e.g. Google's PageRank) to account for the importance of each single search result; the results are then ordered according to their relevance. Google's PageRank roughly reflects the fact that a resource can be considered important if many other resources link to it (in addition, the number of outgoing links and the weight of each single link are considered). So, let's try to adapt this concept to flickr. First, a picture doesn't link to other pictures. No big deal: flickr lets users choose 'favourites' among all the pictures. In addition there are some statistical indicators, e.g. page views, an indicator of 'interestingness', and the number of comments. Thus, a relevance indicator could be moulded and used for ranking search results.
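Just as an illustration, such an indicator could be a weighted sum; the weights and the log-damping of the raw view count below are pure assumptions of mine, not anything flickr actually computes:

    import math

    def relevance(favourites, views, comments):
        # Favourites count most; raw views are damped, as they are easy to inflate.
        return 2.0 * favourites + 1.0 * comments + 0.5 * math.log1p(views)

    # (name, favourites, views, comments) -- invented sample data
    photos = [("colosseum.jpg", 40, 12000, 25), ("my_lunch.jpg", 1, 300, 2)]
    ranked = sorted(photos, key=lambda p: relevance(*p[1:]), reverse=True)
    print([name for name, *_ in ranked])  # ['colosseum.jpg', 'my_lunch.jpg']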

Unfortunately, if you try a search in flickr you will get a huge amount of results without any visible order or relevance. (Simply try 'Rome'; you will get more than 290,000 results. Of course there are pictures of the city of Rome among the results... but also many other pictures where the tag 'Rome' has been assigned for other reasons.)
One problem with flickr for sure is that it is not a 'real' CTS. It seems to be focused on people tagging their 'own' resources and not the resources of others. In addition, the tags assigned to a resource are treated as a set and not as a list (or bag, where you can see how often a tag has been assigned and by how many users). Thus, tag convergence is out of reach.
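The difference is easy to see in a few lines of Python (the tag assignments are invented): with a bag you can spot the emerging consensus, with a set it is thrown away.

    from collections import Counter

    assignments = ["rome", "rome", "rome", "holiday", "rom", "rome"]  # one tag per user

    as_set = set(assignments)      # {'rome', 'rom', 'holiday'} -- no frequencies left
    as_bag = Counter(assignments)  # Counter({'rome': 4, 'holiday': 1, 'rom': 1})

    # Tags assigned by at least half of the taggers form the 'consensus'.
    consensus = [t for t, n in as_bag.items() if n >= 0.5 * sum(as_bag.values())]
    print(consensus)               # ['rome'] -- convergence becomes visible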



On the other hand, there are some nice tools around. Just try this nice little flickr tagbrowser application. OK, it has the already mentioned shortcomings, but it's a nice little gadget.

Friday, August 11, 2006

The Secret Life of Meta Search Engines...

As you probably know, during the last two weeks I had to spend a lot of time in oral examinations on last semester's "web technologies" lecture. Whenever you do examinations, you get some insight into what the students understood and what they kept in mind from your lectures. Sometimes you wonder and ask yourself, "What the f.... were they doing while sitting in your lecture??" You try to make it simple and give some hints on what is important and what is not. But I guess sometimes you just make it too simple for them... and thus they underrate the whole thing, and you get the bill during the exam...
Anyway... sometimes the students seem to have some 'cloak-and-dagger' sources of information... as I had to experience last Tuesday. I was just starting the exam with a question about search engine types, which finally led to the (not so important) topic of 'meta search engines'. Usually, I ask how meta search engines work and try to get quickly to some more interesting topics. On Tuesday I received a strange answer... and the funny thing is that another student had told me the same queer answer the week before:
"Meta search engines have their name because they are search engines that analyse the meta-tags in the header of HTML documents..."
Okay... stunned, I looked at my assessor, and the assessor immediately stopped writing his protocol, an equally stunned expression on his face. "Where did you get that from?", I asked the student. But he explained to me that this fact is just obvious and everybody knows it...
Sometimes you really wonder what they are doing while sitting in your lecture...
If a third student comes up with this answer, I'll have to uncover the source of this "information".
(At least - as we have more friendly temperatures now - they are not wearing muscle-shirts and flip-flops... :-)

Tuesday, August 08, 2006

An Integrated View on Document Annotation.... (part 2)

It is important to state the dependencies between the three identified characteristics (logical, conceptual, and referential) that are common to any kind of document despite their heterogeneity.

Consider, e.g., a large text document (a textbook). The document's index can be considered part of the conceptual document structure, while the linear text flow with its hierarchical structure defines a part of the logical document structure. The index of a textbook can be considered of good quality if an index entry refers in the first place to the location where the concept denoted by that index entry is defined, and subsequently to the other locations where it is used. Thus, for the document index (conceptual structure) we distinguish between concept definition and concept usage (referring to the logical structure). Obviously, both are dependent on each other. The same holds for the bibliographic references (referential structure) together with the table of contents (logical structure) and the document index.
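Made explicit in code, such an index entry could look like this little Python sketch (class and field names are hypothetical):

    from dataclasses import dataclass, field

    @dataclass
    class IndexEntry:
        concept: str
        defined_at: str                               # where the concept is defined
        used_at: list = field(default_factory=list)   # where it is merely used

    entry = IndexEntry("closure", defined_at="sec-2.1", used_at=["sec-3.4", "sec-5.2"])
    # A 'good' index lists entry.defined_at first, then the locations in entry.used_at.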

We claim that these dependencies within a document have to be made explicit. As additional document annotation, these dependencies can help to represent document semantics without the necessity to really "understand" the content of the document. Acting in concert with external (user- and/or author-defined) annotations, the document dependency structure can facilitate cross-annotation reasoning, which is a prerequisite for the semantic web.

Thursday, August 03, 2006

An Integrated View on Document Annotation...

Did you realize that documents - no matter what type of document, e.g. text documents, graphic documents, or videos - can all be subsumed under one abstract view?

We can define a document to be a string of addressable tokens. Some of these tokens are special tokens, the so-called tags. In contrast to all the other tokens, which can be regarded as the document content, tags have a special function: they mark up single tokens or groups of tokens (document units) to be interpreted in a special syntactic or semantic way. On the other hand, tags can also be interpreted as document annotation, i.e. the document can be regarded as an unformatted string of tokens, while the tags (annotations) define the document structure. We distinguish between different types of tags:
  • structural tags define the structural elements of a document; in a text document, e.g., they define sentences, paragraphs, sections, chapters, headings, annotations, etc.
  • referential tags define relationships between document units within the same document (internal) or in another document (external); in a text document, e.g., see-references or bibliographic references.
  • conceptual tags define concepts and relationships between concepts within a document; in a text document, e.g., index entries and hierarchical relationships between index entries.
Thus, within a document, we can distinguish between the logical structure, the referential structure, and the conceptual structure, which are mutually dependent. Additionally, we distinguish between tags supplied by the author (e.g. structural tags) and tags provided by the user (e.g. "tags", annotations, reviews, etc.), as sketched below.
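Here is a minimal Python sketch of this abstract model (all names are illustrative): a document is a list of addressable tokens, and tags are typed annotations over token ranges.

    from dataclasses import dataclass

    @dataclass
    class Tag:
        kind: str     # 'structural' | 'referential' | 'conceptual'
        start: int    # address of the first token marked up
        end: int      # address of the last token marked up
        value: str    # e.g. 'sentence', 'see:part-3', 'index:closure'
        source: str   # 'author' or 'user'

    tokens = "the closure of a topic is computed iteratively".split()
    tags = [
        Tag("structural", 0, 7, "sentence", "author"),
        Tag("conceptual", 1, 1, "index:closure", "author"),
        Tag("referential", 6, 7, "see:part-3", "user"),
    ]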

We can apply this view to other types of documents as well, e.g. video documents: The smallest units - the tokens - of a video document are single pixels. Considering the logical structure, pixels can be subsumed into blocks, which form macroblocks, which can be subsumed into slices, which together constitute a frame (picture). Unlike a text document, a video document also depends on time. Thus, we can identify groups of pictures (GOPs) that form the entire video sequence. In MPEG-4 we even have the possibility to identify objects within a picture that can be subsumed within a scene. With MPEG-7, conceptual and referential information can also be added to the video in the form of metadata about the author or about objects and scenes within the video.
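The logical structure of a video is then simply a containment hierarchy over these units; a tiny Python sketch (the levels as listed above, the helper is my own):

    hierarchy = ["pixel", "block", "macroblock", "slice", "frame", "GOP", "sequence"]

    def contains(outer, inner):
        """True if 'outer' subsumes 'inner' in the logical video structure."""
        return hierarchy.index(outer) > hierarchy.index(inner)

    print(contains("frame", "macroblock"))  # True
    print(contains("slice", "GOP"))         # False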

Tuesday, August 01, 2006

Men and Flipflops....


Summer seems to be taking a little break - at least the temperatures are below the 30s - and immediately you start to miss it. Have you noticed that most people don't seem to realize that the temperatures have dropped? They seem to pretend we're still in the tropics... at least judging by their style of clothing.
Yesterday I had a student in an oral examination wearing only a tight muscle-shirt... you could really see the sweat of anxiety, cooled down by the air conditioning, dripping from his bare skin... I wonder what will be next, if the temperatures start to rise again. Bare chest, bare feet, or even..... I don't want to imagine?!
But this reminds me of the "flipflop dilemma". Yes, men in flipflops... not those flipflops made of transistors that serve as basic electronic components. For decades those electronic flipflops were the only flipflops to be associated with the male (geek). No... I'm talking about those bathing shoes called flipflops. They have their name because of the sound they produce while walking. Yesterday I read a flaming article about men and flipflops in spiegel online. Nice quote: "Shoes that are able to say their name, being worn by men who are not able to spell the word style...". Don't get me wrong. Of course you can wear whatever you like, but please consider the occasion!
During the last weeks it really was rather hot. But the office is air-conditioned (you will freeze), and an examination is an examination... and not a seaside holiday pool game.