Wednesday, July 30, 2014

Help Us with a Research Problem

As you might know, we have previously tried to let the public participate in our research. Last time, we developed several games (with a purpose). This time, unfortunately, it is not a game, simply because developing a good game is really expensive. But let's get to the point: what is the task all about, and where can you help us?

As you know, my research group is working on semantic technologies. Semantics in this sense means that we are trying to (automatically) understand what information (or data) is about and what it means. Sometimes, information is ambiguous. This makes it difficult to understand, because you have to resolve ambiguities with the help of context.

On the other hand, sometimes you have several different pieces of information about a subject. How do you determine which piece of information or fact is more important or relevant than another? Just a quick example. Let's assume we have the following two facts:

(1) Albert Einstein is a physicist.
(2) Albert Einstein is a vegetarian.

Which of the two facts is more important or relevant? Yes, this is difficult to answer, simply because the truth often lies in the eye of the beholder. For a vegetarian, maybe the second fact is more important. But, what about the most common opinion? What would the mainstream think? Probably, most people would say that fact (1) in general is more important.

So, what we are doing is developing heuristics that determine the importance of facts (relative to other facts). To get an idea of the quality of our heuristics, we have to do an evaluation, i.e. somebody has to judge whether the decision of a heuristic was right or wrong. Unfortunately, no ground truth exists for this task, called "fact ranking". Therefore, we are about to create a new ground truth that will later be publicly available and open to all researchers.

This ground truth is created with the little 'voting' application that you will find here [1]. You just have to register with the tool, and then the task will be explained to you in detail. We took 500 popular concepts from Wikipedia, and you have (1) to think about the most important facts about these concepts that come to your mind and then (2) to rate the (new) facts presented to you according to their relevance. There is no right or wrong answer. Just vote as seems right to you. Afterwards, we will aggregate the votes of all participants to determine the general (mainstream) relevance of the presented facts.
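The aggregation step could look roughly like the following minimal sketch. It assumes that each vote is a (fact, rating) pair on a numeric scale and that votes are combined by a simple mean; the actual rating scale and aggregation method of the application are not specified in this post, so the function name and the sample ratings are purely illustrative.

```python
from collections import defaultdict
from statistics import mean


def aggregate_votes(votes):
    """Rank facts by their mean rating, highest first.

    `votes` is an iterable of (fact, rating) pairs; both the numeric
    scale and the use of the mean are assumptions for illustration.
    """
    ratings_by_fact = defaultdict(list)
    for fact, rating in votes:
        ratings_by_fact[fact].append(rating)
    scores = {fact: mean(r) for fact, r in ratings_by_fact.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up votes from two participants on the two example facts.
votes = [
    ("Albert Einstein is a physicist", 5),
    ("Albert Einstein is a vegetarian", 2),
    ("Albert Einstein is a physicist", 4),
    ("Albert Einstein is a vegetarian", 1),
]
ranking = aggregate_votes(votes)
```

With more participants, a median or a rank-based aggregation might be more robust against individual outliers than the mean.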

You may interrupt your rating of the presented facts at any time and continue later. To make it a bit more interesting, you can also score points, and of course there is a highscore list. We would really appreciate your help with this task. Please also spread the word. The more participants, the more valid our ground truth will be.

We know that this is a difficult and sometimes rather boring task. All the more, we would be really grateful for your assistance!

[1] Fact Ranking Web-Application

Times of War

For us Western Europeans, war always seems to be far, far away in some other country (or at some distant time in the past). Usually, we read about it in the newspapers or see the pictures in the media. But we are not affected directly. This also includes me as a researcher. Of course, we also have students at our institute who come from countries or regions in crisis. But they are here and the crisis is there... elsewhere.

As you might know, I recently finished my OpenHPI online course 'Knowledge Engineering with Semantic Web Technologies'. The course was rather popular, with a total of 4,623 enrolled students from all over the world. 611 students took part in the final course exam and 450 students finished the course successfully (yeah!!).

Of course, the means of interaction with the students are limited in an online course. You follow the stream of discussions on the OpenHPI platform, answering a question here and there. Sometimes you also receive an email from one of the course participants...

Today, I received an email from a course participant in Gaza, Palestine. He wrote to me about his appreciation for the OpenHPI team offering courses like this and about the projects he carried out during his university studies. Unfortunately, as he wrote, due to the current situation in Gaza, infrastructure has been destroyed, resulting in power outages as well as network failures. Of course, this makes it difficult, if not impossible, to continue the course (not to mention all the other major problems that arise for the people from this conflict). I am deeply impressed that in a situation like this, people still continue their efforts to invest in their education...and their future.

And yes, war has finally also knocked on the door of our small island of the fortunate...

Wednesday, July 16, 2014

New DBpedia Graph Statistics

Recently, we have been working on the DBpedia / Wikipedia page link dataset, considering the English and German language versions. The current DBpedia 3.9 page links datasets contain about 18 million entities for English and 6 million for German. The original DBpedia, however, contains only about 4 million and 1 million distinct entities for the English and German versions, respectively. This significant difference arises mainly because the current DBpedia pagelinks dataset includes redirect pages as well as page links to resources that are not considered entities (e.g. thumbnails and other images). So we decided to clean up the DBpedia pagelinks dataset before computing statistical parameters (e.g. PageRank or HITS). For the cleanup, we removed all unnecessary and redundant RDF triples from the pagelinks dataset, i.e. all redirect pages (redirect pages are just URIs that automatically forward a user to another Wikipedia page, but do not represent entities) as well as RDF triples representing resources that do not have their own rdfs:label (according to the DBpedia documentation, every entity has an rdfs:label).
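The cleanup rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code we actually ran: it assumes the page links are already available as (source, target) URI pairs, and that the set of redirect URIs and the set of URIs carrying an rdfs:label have been extracted beforehand. The function name and the sample URIs are hypothetical.

```python
def clean_pagelinks(links, redirects, labeled):
    """Keep only page links whose both ends are proper entities.

    links     -- iterable of (source_uri, target_uri) pairs
    redirects -- set of URIs that are redirect pages
    labeled   -- set of URIs that have an rdfs:label
    """
    def is_entity(uri):
        # An entity must carry a label and must not be a redirect page.
        return uri in labeled and uri not in redirects

    return [(s, t) for s, t in links if is_entity(s) and is_entity(t)]


# Hypothetical toy data with shortened URIs.
links = [
    ("dbr:Albert_Einstein", "dbr:Physics"),
    ("dbr:Einstein", "dbr:Albert_Einstein"),          # source is a redirect
    ("dbr:Albert_Einstein", "dbr:File:Einstein.jpg"),  # target has no label
]
redirects = {"dbr:Einstein"}
labeled = {"dbr:Albert_Einstein", "dbr:Physics", "dbr:Einstein"}

cleaned = clean_pagelinks(links, redirects, labeled)
```

For the full dataset with millions of triples, one would stream the links from disk instead of building lists in memory.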

One of the benefits of the cleaned-up pagelinks dataset is the faster computation of statistical graph measures. The cleanup does not influence the overall statistics, because redirect pages usually don't have incoming links and the other removed resources (e.g. images) don't have outgoing links. Based on this dataset we have computed PageRank, Hubs and Authorities (HITS), PageInlink Counts and PageOutLink Counts. Please find the details of the datasets on our research group's webpage [1].

For the computation of the DBpedia graph statistics we have used JUNG, the Java Universal Network/Graph Framework. Please find the source code for the PageRank and HITS computation on GitHub [2].
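For readers who just want to see the idea behind PageRank without digging into the JUNG-based code, here is a minimal power-iteration sketch in plain Python. It is not the published implementation: the damping factor of 0.85, the fixed number of iterations, and the uniform redistribution of rank from pages without outgoing links are common textbook choices that are assumed here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration on a list of (source, target) links."""
    nodes = sorted({node for edge in links for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in links:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in out_links[v]:
                new_rank[target] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank


# Toy graph: A and C both link to B, and B links back to A.
ranks = pagerank([("A", "B"), ("C", "B"), ("B", "A")])
```

In this toy graph B collects the most rank, because it is the only page with two incoming links.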

References and further Reading:
[1] New PageRank Computations for DBpedia 3.9 (English/German) at SemanticMultimedia
[2] Source code for DBpedia Graph Statistics

Monday, July 14, 2014

Harald's Original Miscellany - More Truth about Football - Part 4

Finally, Germany has won the 2014 Soccer World Cup. Therefore, our little series of soccer statistics also comes to an end with today's post. You might ask yourself what kind of data is left for soccer players in Wikipedia and DBpedia. Well, unfortunately only a little. But we will try to make something out of it. Last time, we asked about the number of team changes and its correlation with popularity and goals scored. What is left, if we look at the available data?

We have the data about the years in which the football players were active or have played in their national soccer team. Let's start with the national team years [1]:

nationalyears NumPlayers
8 3
7 20
6 85
5 371
4 1070
3 2479
2 6116
1 28211

Well, it was to be expected that most players have only one or two years in the national team. But there are exceptional players who achieved as many as 8 years. Who are these long-term players? [2]:

nationalyears Player Team
8 Wojciech Łobodziński "Poland"@en
8 Wojciech Łobodziński "Poland Under 16"@en
8 Wojciech Łobodziński "Poland Under 17"@en
8 Wojciech Łobodziński "Poland Under 21"@en
8 Wojciech Łobodziński "Poland Under 18"@en
8 Santiago Cañizares "Spain Under-17"@en
8 Santiago Cañizares "Spain Under-16"@en
8 Santiago Cañizares "Spain Under-21"@en
8 Santiago Cañizares "Spain Under-18"@en
8 Santiago Cañizares "Spain Under-23"@en
8 Santiago Cañizares "Spain Under-19"@en
8 Santiago Cañizares "Spain Under-20"@en
7 Aydın Yılmaz "Turkey Under-21"@en
7 Aydın Yılmaz "Turkey"@en
7 Aydın Yılmaz "Turkey"@en
7 Aydın Yılmaz "Turkey Under-19s"@en
7 Aydın Yılmaz "Turkey Under-17"@en
7 Aydın Yılmaz "Turkey A2"@en
7 Ismael Urzaiz "Spain Under-17"@en
7 Ismael Urzaiz "Spain Under-16"@en

Possibly you have never heard of Poland's Wojciech Łobodziński or Spain's Santiago Cañizares. Here another flaw in the data becomes visible. There is no such thing as a single national team. We have "under 16", "under 17", "under 18", and so on... So you start your career as early as 15, and after 8 years you would be 23 and possibly in the "real" national team.

Let's have a look at the active years of players. Unfortunately, here the data is rather messy [3]:
From To
2000 9223372036854775807
2006 200720082008
2008 200420052006
1998 200320042005
1998 20112012
1995 20092010
2003 20092010
2007 20092010
1994 20082009
2004 20082009
2003 20072008
2004 20072008
1994 20062010
2001 20062007
1999 20032007
2000 20032004
2005 20032006
1992 20022003
2005 20022004
1998 20012004

Possibly this is because people write not only single years into the Wikipedia infoboxes, but also time spans and other things. The list above shows only the top 20. Just remove the LIMIT from the SPARQL query and further down you will find more valid data.
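Values such as 200720082008 in the list above look like several four-digit years glued together. One possible way to recover them is sketched below; this is a guess at a cleanup heuristic, not something we applied to the data, and the plausibility bounds of 1850 and 2014 as well as the function name are assumptions.

```python
def split_years(raw, earliest=1850, latest=2014):
    """Split a digit string into 4-digit years, or return [] if implausible."""
    text = str(raw)
    if not text.isdigit() or len(text) % 4 != 0:
        return []
    years = [int(text[i:i + 4]) for i in range(0, len(text), 4)]
    # Reject the whole value if any chunk is not a plausible year.
    if all(earliest <= year <= latest for year in years):
        return years
    return []


span = split_years("200720082008")        # three concatenated years
pair = split_years("20112012")            # two concatenated years
junk = split_years(9223372036854775807)   # 19 digits, cannot be split evenly
```

The 19-digit value is the largest signed 64-bit integer, which suggests an overflow during extraction rather than anything an author typed into an infobox.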

OK, let's come to our last problem related to football. Is there a correlation between the height of a player (simply because we have that data) and the number of achieved goals [4]?
Height sumgoals
243.84 1
23.0 59
215.9 1
206.0 0
204.0 0
203.2 24
203.2 47
203.0 0
203.0 0
203.0 51
202.0 19
202.0 0
202.0 15
201.0 20
201.0 0
201.0 0
200.66 11
200.66 48
200.66 0
200.66 0
200.66 0
200.66 124
200.66 146
200.66 0
200.66 1
200.66 16
200.66 0
200.66 102
200.66 6
200.66 0
200.0 0
200.0 20
199.0 0
199.0 0
199.0 78
199.0 47
199.0 0
199.0 62
199.0 32
199.0 84

It is rather difficult to recognize anything in this data. You see heights and the number of goals that players of that height have scored. It is fascinating that there seems to be a significant number of players who are taller than 2 meters. I guess that the list leader with 2.43 meters is just incorrect data. Now, this is so much data that we have to visualize it to recognize anything...

Correlation between height of soccer players and scored goals
In the diagram to the left you see the soccer players' height (x-axis) vs. the number of scored goals (y-axis). An interesting thing to notice is that the heights approximately show a Gaussian distribution, i.e. most players have a "middle" height, and at the extremes there are only a few. Well, there seems to be an exception. On the far left you will notice a large fraction of players with a height of 1.52m and goal scores ranging from 0 to 300. This is extraordinary, because I have no idea what kind of group this is or whether there is simply an error in the data again. What I noticed is that among this group there seems to be a larger fraction of female Asian soccer players. Maybe they are responsible for that large number of outliers, but this requires further investigation.
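Besides plotting, one could also put a number on the relationship. The sketch below computes a Pearson correlation coefficient after dropping implausible heights. It uses only a handful of rows copied from the table above, so the result says nothing about the full dataset, and the plausibility range of 140 to 220 cm is an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (height in cm, goals) rows taken from the table above.
rows = [
    (243.84, 1), (23.0, 59), (215.9, 1), (203.2, 24),
    (203.0, 51), (202.0, 19), (201.0, 20), (199.0, 78),
]

# Drop heights that are almost certainly data errors (assumed bounds).
plausible = [(h, g) for h, g in rows if 140.0 <= h <= 220.0]
r = pearson([h for h, _ in plausible], [g for _, g in plausible])
```

A rank-based measure such as Spearman's coefficient would be less sensitive to the few players with very high goal counts.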

Finally, I want to add one last table. For each player there is information on which position she or he plays. Of course, Wikipedia authors are far from any sort of agreement on how to name a player's position. Thus, there is a rather huge variety. Nevertheless, I will leave you with the table to make sense of it. Enjoy [5]:
position number avheight avgoals
midfielder 4450 176.822828644929299 18.077078651685393
defender 3429 181.37523034450506 8.058326042578011
striker 2643 180.169761559433703 65.213015512674991
goalkeeper 2086 186.186064557569709 0.174496644295302
forward 1787 178.865346269495465 46.388919977616116
centre back 786 185.7391350821381 10.651399491094148
winger 648 174.69354896725695 27.583333333333333
attacking midfielder 490 175.897877222177931 38.006122448979592
left back 461 177.661735253323698 6.995661605206074
defensive midfielder 427 179.741452037869347 11.978922716627635
right back 405 177.829456545982828 7.301234567901235
central defender 196 185.005408462213008 8.897959183673469
central midfielder 161 178.223105294363834 20.099378881987578
full back 158 174.659366173080242 6.759493670886076
centre forward 140 179.09685701642717 80.792857142857143
left winger 113 174.790176753449224 28.398230088495575
defender, midfielder 102 179.789412255380659 10.607843137254902
inside forward 96 172.094789346059145 64.083333333333333
defender/midfielder 82 179.178292995545911 11.463414634146341
defender / midfielder 79 180.188100887250282 15.050632911392405
left-back 65 175.642922504131603 7.692307692307692
right winger 62 175.247418803553424 30.096774193548387
centre-back 56 521.327500479561941 9.375
right-back 54 176.101110952871815 6.111111111111111
midfielder/forward 52 172.013075924836657 22.557692307692308
midfield 52 176.982691251314588 30.673076923076923
defender (retired) 50 181.37344055175781 14.48
second striker 48 174.673332850138346 63.854166666666667
full-back 44 175.908408771861666 2.681818181818182
striker / winger 41 178.235609845417295 49.219512195121951

Of course, I limited the table to 30 rows. Interestingly, the average height of goalkeepers is greater than that of midfielders or strikers. As expected, strikers on average score more goals than defenders or goalkeepers.
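Many rows in the table differ only in spelling, e.g. "left back" vs. "left-back" or "defender/midfielder" vs. "defender / midfielder". A simple normalization, sketched below, would merge such variants before counting. The rules (lowercasing, hyphens to spaces, dropping parenthesized remarks, tightening slashes) are my assumptions; real synonyms such as "centre back" and "central defender" are deliberately not merged here, since that needs domain knowledge.

```python
import re
from collections import Counter


def normalize_position(label):
    """Reduce spelling variants of a position label to one canonical form."""
    label = label.lower().replace("-", " ")
    label = re.sub(r"\s*\(.*?\)", "", label)   # drop remarks like "(retired)"
    label = re.sub(r"\s*/\s*", "/", label)     # "a / b" becomes "a/b"
    return re.sub(r"\s+", " ", label).strip()


# (position, number of players) rows taken from the table above.
rows = [
    ("centre back", 786), ("centre-back", 56),
    ("left back", 461), ("left-back", 65),
    ("full back", 158), ("full-back", 44),
    ("defender/midfielder", 82), ("defender / midfielder", 79),
]

merged = Counter()
for label, players in rows:
    merged[normalize_position(label)] += players
```

Average heights and goals would have to be merged as player-weighted means, not by simply averaging the per-row averages.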

SPARQL Queries and original data:

Thursday, July 03, 2014

Harald's Original Miscellany - More Truth about Football - Part 3

To change the team means to earn more money...what about the football millionaires? How often do they change the team?
Are you ready for more statistics on your favorite kind of sport? Well, data is fun, and obviously Big Data means Big Fun. There are lots of interesting things to discover while exploring data, and Wikipedia (i.e. DBpedia for the insiders) provides all the necessary means.

Have you ever wondered about this kind of slave trade in professional football? Well, I wouldn't exactly call the transfer of a millionaire to a higher-paying job a 'slave trade'. But have you ever thought about the following question: Do the really good (and well-paid) players change teams more often, or is it the other way round, with teams trying to get rid of players who have a bad season or are on the decline? Who knows? Let's have a look at the data:
TeamChanges NumPlayers
16 2
15 26
14 84
13 287
12 792
11 2247
10 3848
9 5109
8 6464
7 8110
6 9790
5 11264
4 11837
3 11448
2 10961
1 6515
Here, we have a table providing an overview of how many players (in Wikipedia) have changed their team how many times [1]. It seems to be some kind of Gaussian distribution with a peak between 2 and 6 team switches. OK, what about the players? Where are the top players listed in this table? Well, David Beckham switched teams 11 times according to Wikipedia, Cristiano Ronaldo 8 times, Thierry Henry 9 times. At least these numbers are above the average, which we had identified to be between 2 and 6. This seems to support our original assumption.
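To be a bit more precise about what "average" means here, the mean and the most frequent number of team changes can be computed directly from the table above. The snippet is just a quick check on the published counts; the variable names are arbitrary.

```python
# Number of team changes mapped to the number of players, from the table above.
distribution = {
    16: 2, 15: 26, 14: 84, 13: 287, 12: 792, 11: 2247, 10: 3848, 9: 5109,
    8: 6464, 7: 8110, 6: 9790, 5: 11264, 4: 11837, 3: 11448, 2: 10961, 1: 6515,
}

total_players = sum(distribution.values())
# Weighted mean: each number of changes counts once per player.
mean_changes = sum(k * n for k, n in distribution.items()) / total_players
# Mode: the number of changes shared by the most players.
mode_changes = max(distribution, key=distribution.get)
```

Both values fall inside the range of 2 to 6 switches, so 8, 9 and 11 switches are indeed above average.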

TeamChanges popularity
8 1794
11 1572
9 1414
7 1404
5 1343
5 1188
4 1111
8 1047
10 1037
6 1011

But, we get a better overview, if we look at the average popularity of each switching group in the table [3]:
TeamChanges NumPlayers avgindegree
16 2 107.5
15 26 76.769230769230769
14 84 47.297619047619048
13 287 60.885017421602787
12 788 45.073604060913706
11 2235 37.206263982102908
10 3800 32.577631578947368
9 5023 30.939080230937687
8 6332 29.685881238155401
7 7935 25.499054820415879
6 9525 23.188346456692913
5 10886 18.682895462061363
4 11423 14.937844699290904
3 10842 11.131802250507286
2 10089 7.704628803647537
1 5534 5.19588001445609
As we had originally thought, on average, the popularity of the players rises with the number of team changes. However, the top group with 16 changes is far from the highest possible popularity scores (e.g. 1572 for David Beckham). Hmm, maybe there is a correlation between the number of goals scored and the number of team changes? Is it more likely that a top goal hunter switches teams more often? Let's have a look [4]:
TeamChanges NumPlayers AvgGoals
16 2 22.5
15 26 28.807692307692308
14 83 37.855421686746988
13 287 38.062717770034843
12 781 39.939820742637644
11 2205 38.625850340136054
10 3784 35.646141649048626
9 4980 32.826907630522088
8 6296 29.489517153748412
7 7826 24.455788397648863
6 9314 21.937835516426884
5 10471 18.655142775284118
4 10627 16.317869577491296
3 9612 12.756450270495214
2 7193 9.911858751564021
1 4624 4.444204152249135

Looks interesting. The top goal scorers have 9 to 14 team switches. This is way above the average. Thus, the more goals you score, the more often you will have the chance of being transferred (and thus of earning more money). Players who don't score goals will obviously not be transferred (that often).