Saturday, March 24, 2007

A short note on Semantic Search Engines

Lars featured an article on Hakia, a 'semantic search engine', in his blog today.
Unlike Google, Hakia uses natural language processing (NLP) to 'understand' search queries given in natural language (and not as plain keywords). Ok, this is not new.
Just remember AskJeeves. But unlike that search engine, Hakia claims to perform a 'semantic search' (and not a keyword-based search). If you read a little bit further in their technical description at Hakia labs, you will find that they are using a parser called 'OntoSem', which, as they claim, is able to perform a 'deep semantic analysis' of sentences. This parser is used for query string analysis and also for the generation of the search index. But their search index is different from the one we know from 'traditional' search engines.
Just remember: traditional keyword-based search engines extract so-called 'descriptors' from web pages that are used to describe the content of those pages. These descriptors are managed within an inverted index. Thus, by accessing the inverted index with a descriptor (= keyword in the query string), a list of web pages is returned, together with weights indicating the relevance of the descriptor for each page.
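To make this concrete, here is a minimal toy sketch of such an inverted index in Python (an illustration of the general technique only, of course not the code of any real search engine; the simple term-frequency weighting is my own assumption):

```python
# Toy inverted index: maps each descriptor (keyword) to the pages
# containing it, weighted here by simple term frequency.
from collections import defaultdict

def build_inverted_index(pages):
    """Build descriptor -> {page_id: weight} from a dict of page texts."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        words = text.lower().split()
        for word in words:
            # weight = occurrences of the word relative to page length
            index[word][page_id] = index[word].get(page_id, 0) + 1 / len(words)
    return index

pages = {
    "page1": "semantic web search",
    "page2": "keyword search engine search",
}
index = build_inverted_index(pages)

# Accessing the index with the descriptor 'search' returns the
# matching pages together with their relevance weights.
print(sorted(index["search"].items(), key=lambda kv: -kv[1]))
```

A query string is then simply split into keywords, each keyword is looked up in the index, and the per-page weights are combined into a ranking.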
So, how does the index work in Hakia? They claim that their parser analyzes all web pages to be indexed, sentence by sentence. In effect, all possible questions that can be posed about each sentence are generated, forming what they call the 'QDEX data'. All possible questions for all sentences in all pages have to be stored in an index-like data structure (simply for fast and efficient access). If a query string now contains a question, this question is mapped against the index, resulting in a large number of 'relevant' (questions, sentences, and in the end...) web pages. Then they apply a 'smart' algorithm called 'Semantic Rank', which orders the resulting list of documents according to their relevance wrt. the question given in the query string. More details about the technique are not published (at least not to my knowledge).
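My reading of that description can be sketched as follows. This is only a guess at the general idea, not Hakia's actual algorithm: the toy question generator below handles just the trivial 'X is Y' pattern, whereas their OntoSem parser supposedly performs a deep semantic analysis.

```python
# Hedged sketch of a QDEX-style index: for every sentence of every
# page, generate candidate questions and map each question to the
# pages whose sentences could answer it. (Assumed structure only.)
from collections import defaultdict

def generate_questions(sentence):
    """Toy generator: turns 'X is Y' into 'what is x?'."""
    questions = []
    words = sentence.split()
    if "is" in words:
        subject = " ".join(words[:words.index("is")])
        questions.append(f"what is {subject.lower()}?")
    return questions

def build_qdex(pages):
    qdex = defaultdict(set)
    for page_id, text in pages.items():
        for sentence in text.split("."):
            sentence = sentence.strip()
            if not sentence:
                continue
            for question in generate_questions(sentence):
                qdex[question].add(page_id)
    return qdex

pages = {"page1": "RDF is a data model. OWL is an ontology language."}
qdex = build_qdex(pages)

# A question in the query string maps straight to the relevant pages.
print(qdex["what is rdf?"])
```

The point of the contrast: the inverted index stores keywords, while a QDEX-style index stores whole (generated) questions as its keys, so a question in the query can be matched directly.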

Thus, the only way to find out about the quality of their approach is to try out their search engine. A nice feature is that they have included small example applications where you can try out OntoSem or QDEX interactively yourself. I have only tried OntoSem (for QDEX you have to sign up with a request), and the result was not really different from that of any standard NLP parser (and thus not really convincing). I will sign up and give QDEX a try... and of course I will post the result.

If I compare their approach to Google's, the problem is that Google also claims to incorporate semantic technology. So, again, I can only compare the results of my queries. I have tried several queries and have come to the following results:

  • in general, the results from Hakia do look very promising!

  • in comparison to Google's results, they are not really that different (this may be because of my 'queries', and thus my results are probably not really objective).
    Let me give you an example: I asked both (Google and Hakia) the question 'When did the Semantic Web start?'. Ok, it's not so easy to understand at all. First, it didn't really start (as an implementation), but the topic itself started several years ago... and so I was curious about the results:

Ok, as stated before, this is not representative at all, but I will keep an eye on Hakia anyway. As soon as I have more representative experiences (and hopefully some comments or other interesting comparisons), I will write about it.

One thing in the end: after all that I have read about Hakia and on their web site, they seem to have a completely different notion of 'Semantic Search'. What they do is apply (enhanced) information retrieval techniques based on natural language processing. They do NOT evaluate any additionally given semantic annotations (RDF, OWL, etc...), which might be added to the web pages or connected to them. They claim (same as Peter Norvig) that the average user is much too stupid to supply semantic annotations (because, as they say, for doing that he would need to be a linguist).
Of course, if you are following the PLAIN trail of the W3C and encode everything by hand (HTML, XML, RDF, OWL, SWRL, etc...), then you have to be a real expert. But today, you (at least most of you) don't encode HTML on your own. Remember, there are lots of really nice WYSIWYG editors for doing that AND (even more) there are several simplifications (just think of writing your blog). In the same way, there will be smart user interfaces and editors providing help in generating semantic annotations for your texts. The first and simplest step is providing labels and keywords (just 'tags') for your blog posts. Most people do that... just because (1) they want to put some order into the set of their postings and (2) they want their postings to be found.