As you might know, we have tried before to let the public participate in our research. Last time, we developed several games (with a purpose). This time, unfortunately, it is not a game, simply because developing a good game is really expensive. But let's get to the point: what is the task all about, and where can you help us?
As you know, my research group is working on semantic technologies. Semantics in this sense means that we are trying to (automatically) understand what information (or data) is about and what it means. Sometimes information is ambiguous. This makes it difficult to understand, because you have to resolve the ambiguities with the help of context.
On the other hand, sometimes you have many different pieces of information about a subject. How do you determine which piece of information or fact is more important or relevant than another? Just a quick example. Let's assume we have the following two facts:
(1) Albert Einstein is a physicist.
(2) Albert Einstein is a vegetarian.
Which of the two facts is more important or relevant? Yes, this is difficult to answer, simply because the truth often lies in the eye of the beholder. For a vegetarian, maybe the second fact is more important. But what about the most common opinion? What would the mainstream think? Probably most people would say that fact (1) is more important in general.
So what we do is develop heuristics that determine the importance of facts (relative to other facts). To get an idea of the quality of our heuristics, we have to evaluate them, i.e. somebody has to decide whether the decision of a heuristic was right or wrong. Unfortunately, no ground truth exists yet for this task, which is called "fact ranking". Therefore, we are about to create a new ground truth that will later be publicly available and open to all researchers.
This ground truth is created with the little 'voting' application that you will find here [1]. You just have to register with the tool, and then the task will be explained to you in detail. We took 500 popular concepts from Wikipedia, and you have to (1) think about the most important facts about these concepts that come to your mind and then (2) rate the (new) facts presented to you according to their relevance. There is no right or wrong answer. Just vote as seems right to you. Afterwards, we will aggregate the votes of all participants to determine the general (mainstream) relevance of the presented facts.
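To illustrate the aggregation step, here is a minimal Python sketch of how votes from several participants could be turned into a ranking. It is only an assumption about the general idea: the rating scale (1 to 5), the plain mean, and all names and votes in the code are hypothetical and not taken from the actual application.

```python
from collections import defaultdict
from statistics import mean


def aggregate_votes(votes):
    """Return (fact, mean rating) pairs, ordered by mean rating, highest first.

    votes: iterable of (participant, fact, rating) tuples.
    """
    ratings = defaultdict(list)
    for _participant, fact, rating in votes:
        ratings[fact].append(rating)
    averaged = {fact: mean(values) for fact, values in ratings.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes on an assumed 1-5 scale, invented purely for illustration.
example_votes = [
    ("alice", "Albert Einstein is a physicist", 5),
    ("bob", "Albert Einstein is a physicist", 4),
    ("alice", "Albert Einstein is a vegetarian", 2),
    ("bob", "Albert Einstein is a vegetarian", 3),
]

ranking = aggregate_votes(example_votes)
for fact, score in ranking:
    print(f"{score:.2f}  {fact}")
```

A plain mean is the simplest choice; a real evaluation might also have to account for how many votes each fact received.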
You may interrupt your rating of the presented facts at any time and continue later. To make it a bit more interesting, you can also score points, and of course there is a high-score list.
We would really appreciate your help with this task. Please also spread the word. The more participants, the more valid our ground truth will be.
We know that this is a difficult and sometimes rather boring task. All the more, we would be really grateful for your assistance!
[1] Fact Ranking Web-Application, http://s16a.org/fr/
...just a few words about life, the universe, and research on topics related to the semantic web
Wednesday, July 30, 2014
...in Times of War
For us Western Europeans, war always seems to be far, far away in some other country (or in some distant time in the past). Usually, we read about it in the newspapers or see the pictures in the media. But we are not directly affected. This also includes me as a researcher. Of course, we also have students in our institute who come from countries or regions in crisis. But they are here and the crisis is there... elsewhere.
As you might know, I recently finished my OpenHPI online course 'Knowledge Engineering with Semantic Web Technologies'. The course was rather popular, with a total of 4,623 enrolled students from all over the world. 611 students took part in the final course exam, and 450 students finished the course successfully (yeah!!).
Of course, the means of interaction with students are limited in an online course. You follow the stream of discussions on the OpenHPI platform, answering a question now and then. Sometimes you also receive an email from one of the course participants...
Today, I received an email from a course participant in Gaza, Palestine. He wrote to me about his appreciation for the OpenHPI team offering courses like this and about the projects he carried out during his university studies. Unfortunately, as he wrote, due to the current situation in Gaza, infrastructure has been destroyed, leading to power outages as well as network failures. Of course, this makes it next to impossible to continue the course (not to mention all the other major problems that this conflict causes for the people). I am deeply impressed that in a situation like this, people still continue their efforts to invest in their education... and their future.
And yes, war has finally also knocked on the door of our small island of the fortunate...
Wednesday, July 16, 2014
New DBpedia Graph Statistics
Recently, we have been working on the DBpedia / Wikipedia page link dataset. We considered the English and the German language versions for this project. In the current DBpedia 3.9 page link datasets, 18 million (English) and 6 million (German) entities are represented. But the original DBpedia contains only about 4 million and 1 million distinct entities for the English and German versions, respectively.
This significant difference is mainly due to the fact that the current DBpedia pagelinks dataset includes redirect pages and page links to resources that are not considered entities (e.g. thumbnails and other images). So we decided to clean up the DBpedia pagelinks dataset for the computation of statistical parameters (e.g. PageRank or HITS). For the cleanup, we removed all unnecessary and redundant RDF triples from the pagelinks dataset, i.e. all redirect pages (redirect pages are just URIs that automatically forward a user to another Wikipedia page, but do not represent entities) as well as RDF triples representing resources that do not have their own rdfs:label (according to the DBpedia documentation, every entity has an rdfs:label).
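The cleanup rule can be sketched in a few lines of Python. This is only an illustration, not the code used for the published datasets, and it rests on simplifying assumptions: page links are given as (source, target) pairs, and the sets of redirect URIs and of URIs with an rdfs:label are already loaded. The function name and the tiny example data are hypothetical.

```python
def clean_pagelinks(links, redirects, labelled):
    """Keep only links where both ends are real entities.

    links:     iterable of (source_uri, target_uri) pairs
    redirects: set of URIs that are redirect pages
    labelled:  set of URIs that have an rdfs:label
    """
    def is_entity(uri):
        return uri in labelled and uri not in redirects

    return [(s, t) for s, t in links if is_entity(s) and is_entity(t)]


# Tiny made-up example (URIs shortened for readability).
links = [
    ("dbr:Albert_Einstein", "dbr:Physics"),
    ("dbr:Einstein", "dbr:Albert_Einstein"),           # source is a redirect
    ("dbr:Albert_Einstein", "dbr:File:Einstein.jpg"),  # target has no label
]
redirects = {"dbr:Einstein"}
labelled = {"dbr:Albert_Einstein", "dbr:Physics", "dbr:Einstein"}

cleaned = clean_pagelinks(links, redirects, labelled)
print(cleaned)
```

In this toy example only the first link survives, because the other two touch a redirect page or an unlabelled resource.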
One of the benefits of the cleaned-up pagelinks dataset is the faster computation of statistical graph measures (without influencing the overall statistics, since redirect pages usually don't have incoming links and the other removed resources, e.g. images, don't have outgoing links). Based on this dataset we have computed PageRank, Hubs and Authorities (HITS), page in-link counts, and page out-link counts. Please find the details of the datasets on our research group's webpage [1].
For the computation of the DBpedia graph statistics we used JUNG, the Java Universal Network/Graph Framework. Please find the source code for the PageRank and HITS computation on GitHub [2].
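For readers who have not met PageRank before, the following Python sketch shows the basic power iteration on a tiny link graph. It is not the JUNG-based code referenced in [2]; the damping factor of 0.85, the fixed number of iterations, the handling of pages without outgoing links, and the toy graph are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    graph: dict mapping each node to the list of nodes it links to.
    Nodes without outgoing links spread their rank evenly over all nodes.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank held by dangling nodes is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in graph.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph, invented for illustration: D links to C but has no incoming links.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_graph)
for node, value in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(node, round(value, 4))
```

On a graph with millions of nodes one would of course rely on an optimized library rather than plain dictionaries, which is where the smaller, cleaned-up dataset pays off.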
References and further Reading:
[1] New PageRank Computations for DBpedia 3.9 (English/German) at SemanticMultimedia
[2] Source code for DBpedia Graph Statistics
Monday, July 14, 2014
Harald's Original Miscellany - More Truth about Football - Part 4
Finally, Germany has won the Soccer World Cup 2014. Therefore, our little soccer statistics series also comes to an end with today's post. You might ask yourself what kind of data on soccer players is left in Wikipedia and DBpedia. Well, unfortunately only a little. But we will try to make something out of it. Last time, we asked about the number of team changes and its correlation with popularity and scored goals for soccer players. What is left if we look at the available data?
We have data about the years in which the football players were active or played in their national soccer team. Let's start with the national team years [1]:
nationalyears | NumPlayers |
---|---|
8 | 3 |
7 | 20 |
6 | 85 |
5 | 371 |
4 | 1070 |
3 | 2479 |
2 | 6116 |
1 | 28211 |
Well, it was obvious that most players have only one or two years in the national team. But there are exceptional players who achieved as many as 8 years. So who are these long-term players? [2]:
nationalyears | Player | Team |
---|---|---|
8 | Wojciech Łobodziński | "Poland"@en |
8 | Wojciech Łobodziński | "Poland Under 16"@en |
8 | Wojciech Łobodziński | "Poland Under 17"@en |
8 | Wojciech Łobodziński | "Poland Under 21"@en |
8 | Wojciech Łobodziński | "Poland Under 18"@en |
8 | Santiago Cañizares | "Spain Under-17"@en |
8 | Santiago Cañizares | "Spain Under-16"@en |
8 | Santiago Cañizares | "Spain Under-21"@en |
8 | Santiago Cañizares | "Spain Under-18"@en |
8 | Santiago Cañizares | "Spain Under-23"@en |
8 | Santiago Cañizares | "Spain Under-19"@en |
8 | Santiago Cañizares | "Spain Under-20"@en |
7 | Aydın Yılmaz | "Turkey Under-21"@en |
7 | Aydın Yılmaz | "Turkey"@en |
7 | Aydın Yılmaz | "Turkey"@en |
7 | Aydın Yılmaz | "Turkey Under-19s"@en |
7 | Aydın Yılmaz | "Turkey Under-17"@en |
7 | Aydın Yılmaz | "Turkey A2"@en |
7 | Ismael Urzaiz | "Spain Under-17"@en |
7 | Ismael Urzaiz | "Spain Under-16"@en |
You have possibly never heard of Poland's Wojciech Łobodziński or Spain's Santiago Cañizares. Here another flaw in the data becomes visible. There is no such thing as a single national team. We have "under 16", "under 17", "under 18", and so on... So you start your career at 15, and after 8 years you are 23 and possibly in the "real" national team.
Let's have a look at the active years of players. Unfortunately, the data here is rather messy [3]:
Possibly this is because people write not only single years into the Wikipedia infoboxes, but also time spans and other things. The list above shows only the top 20. Just remove the LIMIT from the SPARQL query, and further down you will find more valid data.
OK, let's come to our last football-related problem. Is there a correlation between the height of a player (simply because we have that data) and the number of goals scored [4]?
Height | sumgoals |
---|---|
243.84 | 1 |
23.0 | 59 |
215.9 | 1 |
206.0 | 0 |
204.0 | 0 |
203.2 | 24 |
203.2 | 47 |
203.0 | 0 |
203.0 | 0 |
203.0 | 51 |
202.0 | 19 |
202.0 | 0 |
202.0 | 15 |
201.0 | 20 |
201.0 | 0 |
201.0 | 0 |
200.66 | 11 |
200.66 | 48 |
200.66 | 0 |
200.66 | 0 |
200.66 | 0 |
200.66 | 124 |
200.66 | 146 |
200.66 | 0 |
200.66 | 1 |
200.66 | 16 |
200.66 | 0 |
200.66 | 102 |
200.66 | 6 |
200.66 | 0 |
200.0 | 0 |
200.0 | 20 |
199.0 | 0 |
199.0 | 0 |
199.0 | 78 |
199.0 | 47 |
199.0 | 0 |
199.0 | 62 |
199.0 | 32 |
199.0 | 84 |
It is rather difficult to recognize anything in this data. You see heights and the number of goals that a player of that height has scored. It is fascinating that there seems to be a significant number of players who are taller than 2 meters. I guess that the list leader with 2.43 meters is just incorrect data. Now, this is so much data that we have to visualize it to recognize anything...
Figure: Correlation between height of soccer players and scored goals
In the diagram you see the soccer players' height (x-axis) vs. the number of scored goals (y-axis). It is interesting to notice that the heights show an approximately Gaussian distribution, i.e. most players have a "middle" height and there are only a few at the extremes. Well, there seems to be an exception. On the far left you will notice a large group of players with a height of 1.52 m and goal counts ranging from 0 to 300. This is extraordinary, because I have no idea what kind of group this is or whether there is simply an error in the data again. What I noticed is that this group seems to contain a larger fraction of female Asian soccer players. Maybe they are responsible for that large number of outliers, but this requires further investigation.
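A diagram helps, but the strength of a linear relationship can also be expressed as a single number. The following Python sketch computes the Pearson correlation coefficient for a handful of (height, goals) rows copied from the table above. The plausibility range for heights (140 to 220 cm) is an assumption introduced here to filter out obviously wrong values, and the small sample is only meant to show the mechanics, not to answer the question for the full dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (height in cm, summed goals): a few rows copied from the table above.
rows = [
    (243.84, 1), (23.0, 59), (215.9, 1), (206.0, 0), (203.2, 24),
    (203.2, 47), (203.0, 51), (202.0, 19), (201.0, 20), (200.66, 124),
    (200.0, 20), (199.0, 78),
]

# Assumed plausibility range for a player's height in cm.
plausible = [(h, g) for h, g in rows if 140.0 <= h <= 220.0]

heights = [h for h, _ in plausible]
goals = [g for _, g in plausible]
r = pearson(heights, goals)
print(f"{len(plausible)} plausible rows, r = {r:.3f}")
```

A value of r close to 0 would indicate no linear relationship, while values close to +1 or -1 would indicate a strong one.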
Finally, I want to add one last table. For each player there is information on which position she or he plays. Of course, Wikipedia authors are far from any sort of agreement on how to name a player's position. Thus, there is a rather huge variety. Nevertheless, I will leave you with the table to make sense of it yourself. Enjoy [5]:
position | number | avheight | avgoals |
---|---|---|---|
midfielder | 4450 | 176.822828644929299 | 18.077078651685393 |
defender | 3429 | 181.37523034450506 | 8.058326042578011 |
striker | 2643 | 180.169761559433703 | 65.213015512674991 |
goalkeeper | 2086 | 186.186064557569709 | 0.174496644295302 |
forward | 1787 | 178.865346269495465 | 46.388919977616116 |
centre back | 786 | 185.7391350821381 | 10.651399491094148 |
winger | 648 | 174.69354896725695 | 27.583333333333333 |
attacking midfielder | 490 | 175.897877222177931 | 38.006122448979592 |
left back | 461 | 177.661735253323698 | 6.995661605206074 |
defensive midfielder | 427 | 179.741452037869347 | 11.978922716627635 |
right back | 405 | 177.829456545982828 | 7.301234567901235 |
central defender | 196 | 185.005408462213008 | 8.897959183673469 |
central midfielder | 161 | 178.223105294363834 | 20.099378881987578 |
full back | 158 | 174.659366173080242 | 6.759493670886076 |
centre forward | 140 | 179.09685701642717 | 80.792857142857143 |
left winger | 113 | 174.790176753449224 | 28.398230088495575 |
defender, midfielder | 102 | 179.789412255380659 | 10.607843137254902 |
inside forward | 96 | 172.094789346059145 | 64.083333333333333 |
defender/midfielder | 82 | 179.178292995545911 | 11.463414634146341 |
defender / midfielder | 79 | 180.188100887250282 | 15.050632911392405 |
left-back | 65 | 175.642922504131603 | 7.692307692307692 |
right winger | 62 | 175.247418803553424 | 30.096774193548387 |
centre-back | 56 | 521.327500479561941 | 9.375 |
right-back | 54 | 176.101110952871815 | 6.111111111111111 |
midfielder/forward | 52 | 172.013075924836657 | 22.557692307692308 |
midfield | 52 | 176.982691251314588 | 30.673076923076923 |
defender (retired) | 50 | 181.37344055175781 | 14.48 |
second striker | 48 | 174.673332850138346 | 63.854166666666667 |
full-back | 44 | 175.908408771861666 | 2.681818181818182 |
striker / winger | 41 | 178.235609845417295 | 49.219512195121951 |
Of course, I limited the table to 30 rows. Interestingly, the average height of goalkeepers is greater than that of midfielders or strikers. As expected, strikers on average score more goals than defenders or goalkeepers.
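The spelling variants in the table (for example "left back" and "left-back") could be merged before averaging. The following Python sketch shows one way to do that for a few rows copied from the table, with values rounded to one decimal place. The normalization rule (lowercase, hyphens turned into spaces) is an assumption and deliberately simple; it does not try to merge real synonyms such as "centre back" and "central defender". Note also that the "centre-back" row carries an average height of more than 521 cm, which points to yet another data error, so merged averages should be read with care.

```python
from collections import defaultdict


def normalize(position):
    """Map spelling variants such as 'Left-Back' to 'left back'."""
    return " ".join(position.lower().replace("-", " ").split())


def merge_positions(rows):
    """Merge rows with the same normalized position.

    rows: iterable of (position, number_of_players, avg_height, avg_goals).
    Averages are weighted by the number of players in each row.
    """
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for position, number, avg_height, avg_goals in rows:
        entry = totals[normalize(position)]
        entry[0] += number
        entry[1] += number * avg_height
        entry[2] += number * avg_goals
    return {
        position: (number, height_sum / number, goal_sum / number)
        for position, (number, height_sum, goal_sum) in totals.items()
    }


# A few rows from the table above, values rounded to one decimal place.
rows = [
    ("left back", 461, 177.7, 7.0),
    ("left-back", 65, 175.6, 7.7),
    ("right back", 405, 177.8, 7.3),
    ("right-back", 54, 176.1, 6.1),
    ("full back", 158, 174.7, 6.8),
    ("full-back", 44, 175.9, 2.7),
]

merged = merge_positions(rows)
for position, (number, avg_height, avg_goals) in merged.items():
    print(f"{position}: {number} players, {avg_height:.1f} cm, {avg_goals:.1f} goals")
```

Weighting by the number of players matters here, because a plain average of the two row averages would give the small "left-back" group the same influence as the much larger "left back" group.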
SPARQL Queries and original data:
[2] player name and team name of national team players with the longest playing terms
[3] players ordered by number of active years
[4] is there a correlation between soccer player height and scored goals?
[5] is there any correlation between player position, height, and scored goals?
Labels:
DBpedia,
soccer,
soccer worldcup 2014,
SPARQL,
statistics
Thursday, July 03, 2014
Harald's Original Miscellany - More Truth about Football - Part 3
Figure: To change teams means to earn more money... what about the football millionaires? How often do they change teams?
Have you ever wondered about this kind of slave trade in professional football? Well, I wouldn't exactly call the transfer of a millionaire to a higher-paying job a 'slave trade'. But have you ever thought about the following question: do the really good (and well-paid) players change teams more often, or is it the other way round, with teams trying to get rid of players who have a bad season or are on the decline? Who knows? Let's have a look at the data [1]:
TeamChanges | NumPlayers |
---|---|
16 | 2 |
15 | 26 |
14 | 84 |
13 | 287 |
12 | 792 |
11 | 2247 |
10 | 3848 |
9 | 5109 |
8 | 6464 |
7 | 8110 |
6 | 9790 |
5 | 11264 |
4 | 11837 |
3 | 11448 |
2 | 10961 |
1 | 6515 |
The top 10 most popular soccer players and their number of team changes [2]:
person | TeamChanges | popularity |
---|---|---|
http://dbpedia.org/resource/Cristiano_Ronaldo | 8 | 1794 |
http://dbpedia.org/resource/David_Beckham | 11 | 1572 |
http://dbpedia.org/resource/Thierry_Henry | 9 | 1414 |
http://dbpedia.org/resource/Lionel_Messi | 7 | 1404 |
http://dbpedia.org/resource/Wayne_Rooney | 5 | 1343 |
http://dbpedia.org/resource/Frank_Lampard | 5 | 1188 |
http://dbpedia.org/resource/Pel%C3%A9 | 4 | 1111 |
http://dbpedia.org/resource/Didier_Drogba | 8 | 1047 |
http://dbpedia.org/resource/Ronaldo | 10 | 1037 |
http://dbpedia.org/resource/Michael_Owen | 6 | 1011 |
But we get a better overview if we look at the average popularity of each group of players with the same number of team changes [3]:
TeamChanges | NumPlayers | avgindegree |
---|---|---|
16 | 2 | 107.5 |
15 | 26 | 76.769230769230769 |
14 | 84 | 47.297619047619048 |
13 | 287 | 60.885017421602787 |
12 | 788 | 45.073604060913706 |
11 | 2235 | 37.206263982102908 |
10 | 3800 | 32.577631578947368 |
9 | 5023 | 30.939080230937687 |
8 | 6332 | 29.685881238155401 |
7 | 7935 | 25.499054820415879 |
6 | 9525 | 23.188346456692913 |
5 | 10886 | 18.682895462061363 |
4 | 11423 | 14.937844699290904 |
3 | 10842 | 11.131802250507286 |
2 | 10089 | 7.704628803647537 |
1 | 5534 | 5.19588001445609 |
The average number of goals per soccer player with respect to the number of team changes [4]:
TeamChanges | NumPlayers | AvgGoals |
---|---|---|
16 | 2 | 22.5 |
15 | 26 | 28.807692307692308 |
14 | 83 | 37.855421686746988 |
13 | 287 | 38.062717770034843 |
12 | 781 | 39.939820742637644 |
11 | 2205 | 38.625850340136054 |
10 | 3784 | 35.646141649048626 |
9 | 4980 | 32.826907630522088 |
8 | 6296 | 29.489517153748412 |
7 | 7826 | 24.455788397648863 |
6 | 9314 | 21.937835516426884 |
5 | 10471 | 18.655142775284118 |
4 | 10627 | 16.317869577491296 |
3 | 9612 | 12.756450270495214 |
2 | 7193 | 9.911858751564021 |
1 | 4624 | 4.444204152249135 |
Looks interesting. The top goal scorers have 9 to 14 team changes. This is way above the average. Thus, the more goals you score, the better your chances of being transferred (and thus of earning more money). Players who don't score goals will obviously not be transferred (that often).
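How far above the average is that? The overall average can be estimated directly from the first table of this post (team changes vs. number of players) as a weighted mean. The Python sketch below does this; the numbers are copied from that table, while the variable names are just illustrative. Note that it only covers the players listed in that table, i.e. those with at least one team change.

```python
# (team changes, number of players), copied from the first table of this post.
team_changes = [
    (16, 2), (15, 26), (14, 84), (13, 287), (12, 792), (11, 2247),
    (10, 3848), (9, 5109), (8, 6464), (7, 8110), (6, 9790), (5, 11264),
    (4, 11837), (3, 11448), (2, 10961), (1, 6515),
]

total_players = sum(players for _, players in team_changes)
total_changes = sum(changes * players for changes, players in team_changes)
average_changes = total_changes / total_players

print(f"{total_players} players, {average_changes:.2f} team changes on average")
```

With these numbers the mean comes out at a little over five team changes per player, so 9 to 14 changes is indeed well above it.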
References:
[1] How many players have how many team changes overall?
[2] The Top10 popular soccer players and their number of team changes
[3] The average popularity of football players regarding the number of team switches
[4] Number of average goals per soccer player with respect to the number of team changes
Labels:
data mining,
DBpedia,
football,
soccer,
SPARQL,
statistics