Tuesday, January 23, 2007

SOFSEM 2007 - Day 3

Today started with a keynote given by Ricardo Baeza-Yates from Yahoo! Research on 'Mining Web Queries'. In particular, he showed how to identify categories of user queries and how to use this information to create an appropriate ranking of the search results. Besides the already established 'coarse' categories, i.e. queries being 'informational', 'navigational', or 'transactional' (which means that the user wants (a) information about a specified topic, (b) a starting point for further research, or (c) a homepage related to the resource for transactional purposes, e.g. shopping), he addressed several graphs that can be compiled from the search engine logfile, e.g. the URL cover graph, the URL link graph, and the session graph. These graphs can be used for identifying polysemous expressions, similar or related queries, clusterings of queries, or even a (pseudo)taxonomy of queries.
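To make the idea of a graph compiled from the logfile more concrete, here is a minimal Python sketch. It rests on my own assumptions, not on details from the talk: the log is reduced to (query, clicked URL) pairs, and two queries are connected whenever they share at least one clicked URL, which is one plausible reading of a 'URL cover graph'. The function name `build_query_graph` and the sample log entries are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_query_graph(click_log):
    """Connect queries that share clicked URLs.

    click_log: iterable of (query, clicked_url) pairs (assumed log format).
    Returns a dict mapping (query_a, query_b) to the number of shared URLs.
    """
    # Collect the set of clicked URLs for every query.
    urls_by_query = defaultdict(set)
    for query, url in click_log:
        urls_by_query[query].add(url)

    # Add an edge for every pair of queries with at least one common URL.
    edges = {}
    for q1, q2 in combinations(sorted(urls_by_query), 2):
        shared = urls_by_query[q1] & urls_by_query[q2]
        if shared:
            edges[(q1, q2)] = len(shared)
    return edges


# Hypothetical log entries, purely for illustration.
log = [
    ("jaguar", "cars.example/jaguar"),
    ("jaguar", "zoo.example/jaguar"),
    ("jaguar car", "cars.example/jaguar"),
    ("big cats", "zoo.example/jaguar"),
]
graph = build_query_graph(log)
```

In this toy log, 'jaguar' is linked both to 'jaguar car' and to 'big cats', while those two are not linked to each other. A query whose neighbours fall into unconnected groups like this is a candidate for an ambiguous (polysemous) expression.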
Besides web query mining, he mentioned some interesting numbers concerning Yahoo, e.g. that Yahoo manages about 20 petabytes of data with more than 10 terabytes of data traffic per day. On the other hand, he gave an estimate of the actual world knowledge and related it to the amount of data managed by Yahoo today: given that a person creates about 10 pages of data concerning a distinct event, if we estimate the number of events at about 5000 in a lifetime, and if we multiply that number by the world's population, we end up with only about 0.0057% of the 'world knowledge' currently being represented in Yahoo.
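The estimate can be retraced as a back-of-the-envelope calculation. The 10 pages per event, the 5000 events per lifetime, and the 20 petabytes are the figures quoted above; the world population (6.5 billion) and the size of a page (1 megabyte) are my own assumptions, since I did not note which values were used in the talk.

```python
# Figures quoted in the talk.
PAGES_PER_EVENT = 10
EVENTS_PER_LIFETIME = 5000
YAHOO_BYTES = 20e15  # about 20 petabytes

# Assumptions of this sketch, not taken from the talk.
WORLD_POPULATION = 6.5e9  # rough 2007 figure
BYTES_PER_PAGE = 1e6      # 1 megabyte per page


def knowledge_fraction(stored_bytes, population, bytes_per_page):
    """Return the share of the estimated 'world knowledge' held in stored_bytes."""
    pages_per_person = PAGES_PER_EVENT * EVENTS_PER_LIFETIME
    world_bytes = pages_per_person * population * bytes_per_page
    return stored_bytes / world_bytes


fraction = knowledge_fraction(YAHOO_BYTES, WORLD_POPULATION, BYTES_PER_PAGE)
print(f"{fraction * 100:.4f}% of the estimated world knowledge")
```

With these assumed values the result lands in the same order of magnitude as the figure quoted in the talk; other choices for population or page size shift it accordingly.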

