Thursday, December 04, 2014

The Fact Ranking Challenge continues....

Yes, you might remember our experiment with fact ranking [1,2,3,4]. After almost 4 months, this is the current state of intermediate results (cf. below).

To make it short: YES, there has been some progress. NO, it is not sufficient so far.

Therefore, PLEASE HELP!
If you have already started to work with the application [1], PLEASE CONTINUE!
If you don't know the application [1], PLEASE START!
PLEASE PARTICIPATE!



We know that it is not an easy task and we also know that it takes time. Therefore, WE REALLY DO APPRECIATE YOUR HELP VERY MUCH!!

Please keep on playing and help us to gather more data!
Please tell all your family and friends to support us!
Please tell all your colleagues and fellow students to support us!

THANK YOU VERY MUCH!

[1] The Fact Ranking Quiz Application, http://s16a.org/fr/
[2] Help us with a Research Problem, July 30, 2014
[3] The Importance of Relevance - Intermediate Results, Aug 19, 2014
[4] More intermediate Results from our Fact Ranking Challenge, Sep. 2, 2014


...and here are the statistics:

Number of users who participated: 465
Sum of concepts done: 2410
485 unique concepts are covered (out of 541). 
Average concepts done per user: 5.183

295 times concepts were skipped (relevance in Step2 hasn't been changed for any of the facts).

CONCEPTS DONE:
0 concepts were done by 97 users. 
1 concepts were done by 93 users. 
2 concepts were done by 78 users. 
3 concepts were done by 49 users. 
4 concepts were done by 33 users. 
5 concepts were done by 26 users. 
6 concepts were done by 10 users. 
7 concepts were done by 7 users. 
8 concepts were done by 9 users. 
9 concepts were done by 8 users. 
10 concepts were done by 8 users. 
11 concepts were done by 6 users. 
12 concepts were done by 2 users. 
13 concepts were done by 1 users. 
14 concepts were done by 5 users. 
15 concepts were done by 3 users. 
16 concepts were done by 3 users. 
17 concepts were done by 1 users. 
18 concepts were done by 1 users. 
19 concepts were done by 2 users. 
20 concepts were done by 3 users. 
21 concepts were done by 2 users. 
23 concepts were done by 1 users. 
24 concepts were done by 2 users. 
25 concepts were done by 1 users. 
26 concepts were done by 1 users. 
31 concepts were done by 1 users. 
36 concepts were done by 1 users. 
40 concepts were done by 1 users. 
42 concepts were done by 2 users. 
56 concepts were done by 1 users. 
58 concepts were done by 1 users. 
60 concepts were done by 1 users. 
64 concepts were done by 1 users. 
68 concepts were done by 1 users. 
70 concepts were done by 1 users. 
88 concepts were done by 1 users. 
201 concepts were done by 1 users. 

EDUCATION:
highschool : 47 users.
phd : 71 users.
other : 26 users.
bachelors : 105 users.
masters : 213 users.

AGE:
33+ :  261 users.
19-25 :  72 users.
26-32 :  122 users.
0-18 :  7 users.

GENDER:
female :  111 users.
male :  351 users.

COUNTRY OF ORIGIN:
Angola :  1 users.
Belarus :  1 users.
Portugal :  4 users.
Philippines :  2 users.
Morocco :  3 users.
Greece :  5 users.
Ukraine :  3 users.
Indonesia :  4 users.
Afghanistan :  2 users.
Sri Lanka :  1 users.
Italy :  15 users.
Iraq :  2 users.
India :  44 users.
France :  14 users.
Denmark :  1 users.
Latvia :  1 users.
Pakistan :  4 users.
Syrian Arab Republic :  2 users.
Montenegro :  1 users.
Armenia :  1 users.
Mexico :  2 users.
Canada :  1 users.
Brazil :  10 users.
Venezuela :  1 users.
Croatia :  2 users.
Macedonia, The Former Yugoslav Republic of :  1 users.
Romania :  2 users.
Western Sahara :  1 users.
Algeria :  5 users.
Sweden :  2 users.
United States :  24 users.
Serbia :  10 users.
Nigeria :  3 users.
Estonia :  1 users.
Spain :  9 users.
Taiwan, Republic of China :  2 users.
Ireland :  1 users.
Russian Federation :  12 users.
Israel :  1 users.
Colombia :  3 users.
Switzerland :  2 users.
Azerbaijan :  2 users.
Kenya :  2 users.
Yemen :  1 users.
Malaysia :  2 users.
Viet Nam :  1 users.
Australia :  5 users.
Peru :  1 users.
Albania :  1 users.
South Africa :  2 users.
Tunisia :  1 users.
Netherlands :  9 users.
China :  3 users.
Somalia :  1 users.
Slovenia :  1 users.
Finland :  3 users.
Lithuania :  1 users.
Austria :  7 users.
Sudan :  1 users.
United Kingdom :  15 users.
Egypt :  2 users.
Bahamas :  1 users.
Hungary :  1 users.
Belgium :  1 users.
Poland :  6 users.
Iran, Islamic Republic of :  2 users.
Bulgaria :  3 users.
Norway :  1 users.
Germany :  177 users.
New Zealand :  3 users.

COUNTRY OF RESIDENCE:
United Arab Emirates :  1 users.
null :  0 users.
Belarus :  1 users.
Portugal :  2 users.
Philippines :  1 users.
Morocco :  2 users.
Greece :  3 users.
Ukraine :  1 users.
Indonesia :  3 users.
Luxembourg :  1 users.
Sri Lanka :  1 users.
Italy :  14 users.
Iraq :  1 users.
India :  32 users.
France :  17 users.
Jordan :  1 users.
Denmark :  1 users.
Latvia :  1 users.
Pakistan :  3 users.
Syrian Arab Republic :  1 users.
Oman :  1 users.
Turkey :  1 users.
Czech Republic :  1 users.
Armenia :  1 users.
Canada :  4 users.
Brazil :  9 users.
Croatia :  1 users.
Romania :  3 users.
Algeria :  6 users.
Sweden :  3 users.
United States :  25 users.
Serbia :  5 users.
Nigeria :  1 users.
Saudi Arabia :  1 users.
Estonia :  2 users.
Spain :  7 users.
Taiwan, Republic of China :  1 users.
Ireland :  3 users.
Russian Federation :  6 users.
Israel :  2 users.
Colombia :  2 users.
Switzerland :  8 users.
Azerbaijan :  2 users.
Kenya :  2 users.
Norfolk Island :  1 users.
Yemen :  1 users.
Malaysia :  2 users.
Australia :  7 users.
Peru :  1 users.
Albania :  1 users.
South Africa :  2 users.
Netherlands :  13 users.
China :  1 users.
Somalia :  1 users.
Slovenia :  1 users.
Gambia :  1 users.
Finland :  3 users.
Lithuania :  1 users.
Austria :  6 users.
United Kingdom :  17 users.
Egypt :  1 users.
Bahamas :  1 users.
Belgium :  3 users.
Poland :  7 users.
Singapore :  2 users.
Iran, Islamic Republic of :  1 users.
Bulgaria :  3 users.
Norway :  1 users.
Germany :  197 users.
New Zealand :  3 users.

Overall average confidence of users about the seen concepts: 2.616

NO. OF USERS PER CONCEPT:
1 users for 67 concepts
2 users for 123 concepts
3 users for 108 concepts
4 users for 86 concepts
5 users for 73 concepts
6 users for 37 concepts
7 users for 11 concepts
8 users for 7 concepts
9 users for 2 concepts
10 users for 2 concepts
18 users for 1 concepts
On AVERAGE there are 3.398 users per concept.
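
The average above is simply the weighted mean of the distribution listed before it. A minimal Python sketch to reproduce it (the variable and function names are only illustrative; this is not our evaluation script):

```python
# Distribution copied from the list above: {users per concept: number of concepts}
users_per_concept = {
    1: 67, 2: 123, 3: 108, 4: 86, 5: 73, 6: 37,
    7: 11, 8: 7, 9: 2, 10: 2, 18: 1,
}


def weighted_mean(distribution):
    """Return the mean of the keys, weighted by their counts."""
    total_weight = sum(distribution.values())
    total = sum(value * count for value, count in distribution.items())
    return total / total_weight


average_users = weighted_mean(users_per_concept)
print(f"{average_users:.3f}")
```

With the numbers above this gives 1757 / 517, i.e. the reported 3.398 users per concept.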

NO. OF ANSWERS PER CONCEPT (STEP 1):
On AVERAGE there are 5.002 answers per concept.

NONSENSE STATEMENTS:
TOTAL number of nonsense sentences = 1771



The Flipped Classroom Experiment - First Lessons Learned

Sailing in previously unknown seas...
As I had already announced in the previous post entitled "And Now for Something Completely Different...", this winter semester the Semantic Web Technologies lecture does not take place like an ordinary lecture. We follow the so-called Flipped Classroom concept, i.e. the students "prepare" for the lecture (by watching the video recordings from the previous semesters), and we do the lecture in a more interactive way, driven by the needs of the students. So much for the theory, but does it also hold in practice?
OK, now we have reached lecture No. 6, i.e. almost half of the entire course. Let me tell you what has happened so far.

  1. Lecture preparation: I have to admit, I was a little bit afraid whether it would really work. But, the students always seem to be very well prepared and I'm really happy about that!
  2. Interaction: Well, I have learned that it is me who has to take the initiative. Now, the game works as follows: every week, I publish so-called syllabus questions, i.e. essential questions that the students should be able to answer after they have understood the lecture. We now always start the lecture by going through these syllabus questions, which give us sufficient material for interactivity (=discussion).

    When I simply ask "Do you have any questions about the content of the lecture?", I rarely receive any answers. But, when I start to talk about some of the issues from the content, more and more people are contributing.
  3. Feedback: Well, according to the answers I receive for the syllabus questions, I can immediately recognize where potential problems might be and we can talk about them. For this, I usually keep the lecture slides at hand for visual support. I can then repeat what is important, dive deeper into explanations, and give more examples.

    Actually, I have asked the students whether they like this lecture format, and I have not received any negative answer so far.
What I have learned so far is that for me as a lecturer the effort is almost the same for the flipped classroom as for a 'traditional lecture'. Moreover, you have to be very well prepared, because you always encourage the students to ask all kinds of questions. The nice thing is the direct feedback that guides you to exactly those points where more explanation is needed. Personally, I like this kind of lecture very much. I have never had any other lecture so far (well, besides lab courses or seminars) with so much feedback from the students.

...and of course I would also like to thank all of my students for being willing to undergo this experiment together with me! :)

Tuesday, October 14, 2014

And now for something completely different...

As always in October, lectures are starting again. Like every year, I will give a lecture on Semantic Web Technologies. BTW, I have realized that I have now been giving courses on the Semantic Web for almost 10 years. It all started as a seminar at the Friedrich-Schiller Universität back in Jena and became a fully-fledged lecture here at the HPI in Potsdam. As with the lecture of last winter semester, almost all lectures have been recorded and are available online either at tele-Task or yovisto.

Moreover, we have also prepared two MOOCs: Semantic Web Technologies in Spring 2013, and Knowledge Engineering with Semantic Web Technologies in Spring 2014, both very successful with thousands of students.

This semester, I have decided not to do the very same all over again and to try out something completely different...

Have you ever heard of the Flipped Classroom concept? This semester, we are going to turn the lecture situation around for the students. All the lecture content has already been recorded. Thus, students can prepare for each lecture at home by watching the videos and studying the handouts as well as the course materials. Then, in the classroom, I will not present the content again, but we are going to discuss

  • everything which needs more attention according to the students,
  • everything that the students did not quite understand,
  • including all problems, errors, and complements that seem to be important.
Thus, to follow the (live) lecture, the students have to prepare accordingly. Of course, this will only work with the active participation of the students. On the other hand, it will also be more challenging for the lecturer and the tutors, because we have to be very well prepared to deal with all kinds of potential questions and problems. Of course, we will always work out solutions and answers together with the students. And it will also be the students who take the lead, of course under the lecturer's guidance.

I'm very curious whether this concept will work out well with my lecture here at HPI. Please keep your fingers crossed and I will keep you posted.

Tuesday, September 02, 2014

More intermediate Results from our Fact Ranking Challenge

Fact Ranking Challenge
Again, here is an update on the data gathered so far in our fact ranking experiment.

We started the original challenge about 5 weeks ago and now are able to present you some more intermediate results [1]. Nevertheless, the challenge is still running. Therefore, please distribute, participate, advertise, and help us to generate a fully fledged ground truth for fact ranking [2]. All data will be made publicly available for further research.

Determining the importance of a fact is essential if you want to properly understand the content of information. Usually, there is a rich variety of possible interpretations of information. To determine the proper interpretation, you use the context, i.e. further available information. So, the question develops from "what is important?" to "what is important with regard to this specific context?".

Current Intermediate Statistics (from Aug 28):

Number of users who participated: 388

Sum of concepts done: 1736

446 unique concepts are covered (out of 541). 

Average concepts done per user: 4.47

CONCEPTS DONE:
0 concepts were done by 79 users. 
1 concepts were done by 79 users. 
2 concepts were done by 68 users. 
3 concepts were done by 41 users. 
4 concepts were done by 25 users. 
5 concepts were done by 22 users. 
6 concepts were done by 8 users. 
7 concepts were done by 7 users. 
8 concepts were done by 9 users. 
9 concepts were done by 6 users. 
10 concepts were done by 8 users. 
11 concepts were done by 4 users. 
12 concepts were done by 2 users. 
13 concepts were done by 1 users. 
14 concepts were done by 4 users. 
15 concepts were done by 3 users. 
16 concepts were done by 2 users. 
19 concepts were done by 1 users. 
20 concepts were done by 3 users. 
21 concepts were done by 2 users. 
22 concepts were done by 1 users. 
23 concepts were done by 1 users. 
24 concepts were done by 2 users. 
25 concepts were done by 1 users. 
31 concepts were done by 1 users. 
38 concepts were done by 2 users. 
40 concepts were done by 1 users. 
42 concepts were done by 1 users. 
58 concepts were done by 1 users. 
59 concepts were done by 1 users. 
62 concepts were done by 1 users. 
64 concepts were done by 1 users. 

EDUCATION:
highschool : 41 users.
phd : 57 users.
other : 22 users.
bachelors : 93 users.
masters : 174 users.

AGE:
33+ :  216 users.
19-25 :  68 users.
26-32 :  98 users.
0-18 :  5 users.

GENDER:
female :  90 users.
male :  297 users.

COUNTRY OF ORIGIN:
Angola :  1 users.
Belarus :  1 users.
Portugal :  4 users.
Philippines :  2 users.
Morocco :  3 users.
Greece :  5 users.
Ukraine :  3 users.
Indonesia :  3 users.
Sri Lanka :  1 users.
Italy :  13 users.
Iraq :  1 users.
India :  40 users.
France :  11 users.
Latvia :  1 users.
Pakistan :  3 users.
Syrian Arab Republic :  1 users.
Montenegro :  1 users.
Armenia :  1 users.
Mexico :  2 users.
Brazil :  10 users.
Venezuela :  1 users.
Croatia :  2 users.
Macedonia, The Former Yugoslav Republic of :  1 users.
Romania :  2 users.
Western Sahara :  1 users.
Algeria :  4 users.
Sweden :  2 users.
United States :  21 users.
Serbia :  8 users.
Nigeria :  2 users.
Estonia :  1 users.
Spain :  8 users.
Taiwan, Republic of China :  2 users.
Ireland :  1 users.
Israel :  1 users.
Russian Federation :  9 users.
Colombia :  3 users.
Switzerland :  1 users.
Azerbaijan :  1 users.
Kenya :  2 users.
Yemen :  1 users.
Malaysia :  2 users.
Viet Nam :  1 users.
Australia :  4 users.
Peru :  1 users.
Albania :  1 users.
South Africa :  2 users.
Netherlands :  8 users.
China :  2 users.
Somalia :  1 users.
Slovenia :  1 users.
Finland :  3 users.
Lithuania :  1 users.
Austria :  6 users.
Sudan :  1 users.
United Kingdom :  15 users.
Egypt :  2 users.
Bahamas :  1 users.
Hungary :  1 users.
Poland :  4 users.
Iran, Islamic Republic of :  2 users.
Bulgaria :  3 users.
Norway :  1 users.
Germany :  140 users.
New Zealand :  3 users.

COUNTRY OF RESIDENCE:
United Arab Emirates :  1 users.
Belarus :  1 users.
Portugal :  2 users.
Philippines :  1 users.
Morocco :  2 users.
Greece :  3 users.
Ukraine :  1 users.
Indonesia :  2 users.
Luxembourg :  1 users.
Sri Lanka :  1 users.
Italy :  11 users.
India :  31 users.
France :  13 users.
Jordan :  1 users.
Denmark :  1 users.
Latvia :  1 users.
Pakistan :  3 users.
Oman :  1 users.
Turkey :  1 users.
Czech Republic :  1 users.
Armenia :  1 users.
Canada :  3 users.
Brazil :  9 users.
Croatia :  1 users.
Romania :  3 users.
Algeria :  4 users.
Sweden :  3 users.
United States :  23 users.
Serbia :  4 users.
Nigeria :  1 users.
Saudi Arabia :  1 users.
Estonia :  2 users.
Spain :  7 users.
Taiwan, Republic of China :  1 users.
Ireland :  3 users.
Israel :  2 users.
Russian Federation :  2 users.
Colombia :  2 users.
Switzerland :  7 users.
Azerbaijan :  2 users.
Kenya :  2 users.
Norfolk Island :  1 users.
Yemen :  1 users.
Malaysia :  2 users.
Australia :  6 users.
Peru :  1 users.
Albania :  1 users.
South Africa :  1 users.
Netherlands :  12 users.
Somalia :  1 users.
Slovenia :  1 users.
Gambia :  1 users.
Finland :  3 users.
Lithuania :  1 users.
Austria :  5 users.
United Kingdom :  17 users.
Egypt :  1 users.
Bahamas :  1 users.
Belgium :  2 users.
Poland :  5 users.
Singapore :  1 users.
Iran, Islamic Republic of :  1 users.
Bulgaria :  3 users.
Norway :  1 users.
Germany :  154 users.
New Zealand :  2 users.

Overall confidence of users about the seen concepts: 2.687

NO. OF USERS PER CONCEPT:
On AVERAGE there are 2.18 users per concept.

NO. OF ANSWERS PER CONCEPT (STEP 1):
On AVERAGE there are 4.71 answers per concept.

NONSENSE STATEMENTS:

TOTAL number of nonsense sentences = 1371


Hint: You might wonder about the impressive high scores on the top of the list? Well, actually points are given exponentially, i.e. the longer you play, the more points you will score per processed concept.

References:
[1] Help Us with a Research Problem, July 30, 2014
[2] Fact Ranking Web-Application, http://s16a.org/fr/

Tuesday, August 19, 2014

The Importance of Relevance - Intermediate Results

Current Highscore List of our fact ranking challenge
In my last post, we invited you to take part in our research challenge, which is about creating a ground truth for fact ranking algorithms. Determining the importance of a fact is essential if you want to properly understand the content of information. Usually, there is a rich variety of possible interpretations of information. To determine the proper interpretation, you use the context, i.e. further available information. So, the question develops from "what is important?" to "what is important with regard to this specific context?".

We started the original challenge about 3 weeks ago and now are able to present you first intermediate results [1]. Nevertheless, the challenge is still running. Therefore, please distribute, participate, advertise, and help us to generate a fully fledged ground truth for fact ranking [2]. All data will be made publicly available for further research.

Current Intermediate Statistics:
Number of users who participated: 110 (Thanks to all of you!!!)
Number of overall processed concepts: 509
Overall 200 unique concepts are covered (out of 541).

Average concepts processed per user: 4.63

Detailed number of processed concepts per user:
0 concepts were done by 15 users.
1 concepts were done by 27 users.
2 concepts were done by 21 users.
3 concepts were done by 12 users.
4 concepts were done by 7 users.
5 concepts were done by 8 users.
6 concepts were done by 2 users.
7 concepts were done by 1 users.
8 concepts were done by 4 users.
9 concepts were done by 1 users.
10 concepts were done by 2 users.
11 concepts were done by 2 users.
14 concepts were done by 1 users.
16 concepts were done by 1 users.
20 concepts were done by 1 users.
22 concepts were done by 1 users.
25 concepts were done by 1 users.
31 concepts were done by 1 users.
53 concepts were done by 2 users.

Participant statistics:

EDUCATION:
highschool : 7 users.
bachelors : 28 users.
masters : 47 users.
phd : 23 users.
other : 5 users. 

AGE:
33+ : 43 users.
26-32 : 34 users.
19-25 : 33 users.

GENDER:
female : 25 users.
male : 85 users.

COUNTRY OF ORIGIN:
United States : 9 users.
Serbia : 6 users.
Spain : 2 users.
Ukraine : 1 users.
Russian Federation : 3 users.
Colombia : 1 users.
Italy : 6 users.
India : 2 users.
France : 3 users.
Malaysia : 1 users.
Australia : 1 users.
Albania : 1 users.
China : 1 users.
Pakistan : 1 users.
Finland : 1 users.
Austria : 2 users.
Montenegro : 1 users.
United Kingdom : 8 users.
Brazil : 3 users.
Poland : 2 users.
Iran, Islamic Republic of : 1 users.
Macedonia, The Former Yugoslav Republic of : 1 users.
Croatia : 1 users.
Germany : 49 users.
Algeria : 1 users.
New Zealand : 1 users.
Sweden : 1 users.

Overall confidence of users about the seen concepts: 2.585

NO. OF USERS PER CONCEPT:
On AVERAGE there are 1.295 users per concept.

NO. OF ANSWERS PER CONCEPT (STEP 1):
On AVERAGE there are 4.825 answers per concept.

We will keep you posted about the results.
Please distribute, participate, advertise, and help us to generate a fully fledged ground truth for fact ranking.

Hint: You might wonder about the impressive high scores on the top of the list? Well, actually points are given exponentially, i.e. the longer you play, the more points you will score per processed concept.

References:
[1] Help Us with a Research Problem, July 30, 2014
[2] Fact Ranking Web-Application, http://s16a.org/fr/

Wednesday, July 30, 2014

Help Us with a Research Problem

As you might know, we have already tried to let the public participate in our research. Last time, we developed several games (with a purpose). This time, unfortunately, it is not a game, simply because the development of a good game is really expensive. But let's get to the point. What is the task all about, and where can you help us....?

You know, my research group is working on semantic technologies. Semantics in that sense means, we are trying to (automatically) understand what information (or data) is all about and what is the meaning of it. Sometimes, information is ambiguous. This makes it difficult to understand, because you have to solve ambiguities with the help of context.

On the other hand, sometimes you have various pieces of information about a subject. How do you determine which piece of information or fact is more important or relevant than another? Just a quick example. Let's assume we have the following two facts:

(1) Albert Einstein is a physicist.
(2) Albert Einstein is a Vegetarian.

Which of the two facts is more important or relevant? Yes, this is difficult to answer, simply because the truth often lies in the eye of the beholder. For a vegetarian, maybe the second fact is more important. But, what about the most common opinion? What would the mainstream think? Probably, most people would say that fact (1) in general is more important.

So, what we are doing is that we develop heuristics that determine the importance of facts (relative to other facts). To get an idea about the quality of our heuristics, we have to do an evaluation, i.e. somebody has to decide whether the decision of the heuristics was wrong or right. Unfortunately, there does not exist a ground truth for this task called "fact ranking". Therefore, we are about to create a new ground truth that later will be publicly available and open for all researchers.

This ground truth is achieved with the little 'voting' application that you will find here [1]. You just have to register with the tool and then the task will be explained to you in detail. We took 500 popular concepts from Wikipedia and you have (1) to think about the most important facts about these concepts that come to your mind and then (2) rate the (new) facts presented to you according to their relevance. There is no right or wrong answer. Just vote as you think it seems right for you. Afterwards, we will aggregate all votes from all participants to determine the general (mainstream) relevance of the presented facts.
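
To give a rough idea of what such an aggregation could look like, here is a small Python sketch. It assumes that every vote is a numeric relevance rating and that the votes for a fact are simply averaged. The participants, the rating scale, and the choice of the mean are assumptions made for illustration only, not a description of the actual application:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical votes: (participant, fact, relevance rating); higher means more relevant.
votes = [
    ("alice", "Albert Einstein is a physicist.", 5),
    ("alice", "Albert Einstein is a Vegetarian.", 2),
    ("bob", "Albert Einstein is a physicist.", 4),
    ("bob", "Albert Einstein is a Vegetarian.", 3),
    ("carol", "Albert Einstein is a physicist.", 5),
]


def rank_facts(votes):
    """Average the ratings per fact and sort the facts by descending mean rating."""
    ratings = defaultdict(list)
    for _participant, fact, rating in votes:
        ratings[fact].append(rating)
    averaged = {fact: mean(values) for fact, values in ratings.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


ranking = rank_facts(votes)
for fact, score in ranking:
    print(f"{score:.2f}  {fact}")
```

Other aggregation schemes (e.g. the median, or weighting votes by the confidence a participant states) would fit into the same structure by replacing the call to mean.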

You might interrupt your rating of the presented facts at any time you like and continue later. To make it a bit more interesting, you can also score points and of course there is a highscore list. We would really appreciate your help in this task. Please do also spread the word. The more participants, the more valid our ground truth will be.

We know that this is a difficult and sometimes rather boring task. All the more, we would be really grateful for your assistance!

[1] Fact Ranking Web-Application, http://s16a.org/fr/

...in Times of War

For us Western Europeans, war always seems to be far, far away in some other country (or some distant times in the past). Usually, we read about it in the newspapers or see the pictures in the media. But, we are not concerned directly. This also includes me as a researcher. Of course, we also have students in our institute who come from countries or regions in crisis. But, they are here and the crisis is there... elsewhere.

As you might know, I recently finished my OpenHPI online course 'Knowledge Engineering with Semantic Web Technologies'. The course was rather popular with a total of 4,623 enrolled students from all over the world. 611 students took part in the final course exam and 450 students have finished the course successfully (yeah!!).

Of course, the means of interaction with the students are limited in an online course. You follow the stream of discussions in the OpenHPI platform, answering a question here and there. Sometimes you also receive email from one of the course participants...

Today, I have received email from a course participant in Gaza, Palestine. He wrote me about his appreciation for the OpenHPI team to offer courses like this and about the projects he carried out during his University studies. Unfortunately, as he wrote, due to the current situation in Gaza, infrastructure has been destroyed including power outages as well as network failure. Of course this makes it difficult next to impossible to continue the course (not to speak about all the other major problems for the people that arise from this conflict). I am deeply impressed that in a situation like this, people still continue their efforts to invest in their education...and their future.

And yes, war has finally also knocked on the door of our small island of the fortunate...

Wednesday, July 16, 2014

New DBpedia Graph Statistics

Recently, we have been working on the DBpedia / Wikipedia page link dataset. We have considered the English and the German language versions for this project. In the current DBpedia 3.9 page links datasets, 18 million (English) and 6 million (German) entities are represented. The original DBpedia, however, contains only about 4 million and 1 million distinct entities for the English and German versions, respectively. This significant difference is mainly due to the fact that the current DBpedia pagelinks dataset includes redirect pages as well as page links to resources that are not considered entities (e.g. thumbnails and other images). So we decided to clean up the DBpedia pagelinks dataset for the computation of statistical parameters (e.g. PageRank or HITS). For the cleanup, we have removed all unnecessary and redundant RDF triples from the pagelinks dataset, i.e. all triples involving redirect pages (redirect pages are just URIs that automatically forward a user to another Wikipedia page, but do not represent entities) as well as RDF triples involving resources that do not have their own rdfs:label (according to the DBpedia documentation, every entity has an rdfs:label).
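
The cleanup rule can be sketched in a few lines of Python. This is only an illustration on toy data, assuming the page links, the redirect pages, and the labelled resources have already been loaded into memory; the function name and the example resources are made up, and our actual processing of the dump files is not shown here:

```python
def clean_page_links(page_links, redirects, labelled):
    """Keep only page links between labelled, non-redirect resources.

    page_links: iterable of (source, target) resource pairs
    redirects:  set of resources that are redirect pages
    labelled:   set of resources that have an rdfs:label
    """
    def is_entity(resource):
        return resource in labelled and resource not in redirects

    return [(s, t) for s, t in page_links if is_entity(s) and is_entity(t)]


# Toy data (hypothetical): one redirect page and one image without a label.
page_links = [
    ("dbr:Albert_Einstein", "dbr:Physics"),
    ("dbr:Einstein", "dbr:Albert_Einstein"),           # source is a redirect page
    ("dbr:Albert_Einstein", "dbr:File:Einstein.jpg"),  # target has no rdfs:label
]
redirects = {"dbr:Einstein"}
labelled = {"dbr:Albert_Einstein", "dbr:Physics", "dbr:Einstein"}

cleaned = clean_page_links(page_links, redirects, labelled)
print(cleaned)
```

For the full dumps one would stream the triples from disk instead of holding lists in memory, but the filter condition stays the same.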

One of the benefits of the cleaned-up pagelinks dataset is the faster computation of statistical graph measures. The cleanup does not influence the overall statistics, because redirect pages usually don't have incoming links and the other removed resources (e.g. images) don't have outgoing links. Based on this dataset we have computed PageRank, Hubs and Authorities (HITS), page in-link counts, and page out-link counts. Please find the details of the datasets here on our research group's webpage [1].

For the computation of the DBpedia graph statistics we have used JUNG, the Java Universal Network/Graph Framework. Please find the source code for the PageRank and HITS computation here via GitHub [2].
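
For readers who just want to see the principle behind PageRank, here is a plain power-iteration sketch in Python. It is not our JUNG-based Java code from [2]; the damping factor of 0.85, the fixed number of iterations, and the toy graph are assumptions chosen for illustration:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on a list of (source, target) links."""
    nodes = sorted({node for link in links for node in link})
    out_links = {node: [] for node in nodes}
    for source, target in links:
        out_links[source].append(target)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / count + damping * dangling / count
            for node in nodes
        }
        # Every node passes its rank in equal shares to the nodes it links to.
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Toy graph (hypothetical): A links to B and C, B links to C, C links back to A.
scores = pagerank([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")])
for node, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(node, round(score, 4))
```

On a graph with millions of nodes one would of course use a graph library and sparse data structures rather than Python dictionaries.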

References and further Reading:
[1] New PageRank Computations for DBpedia 3.9 (English/German) at SemanticMultimedia
[2] Source code for DBpedia Graph Statistics

Monday, July 14, 2014

Harald's Original Miscellany - More Truth about Football - Part 4

Finally, Germany has won the 2014 Soccer World Cup. Therefore, our little statistics series on soccer will also come to an end with today's post. You might ask yourself what kind of data is left for soccer players in Wikipedia and DBpedia. Well, unfortunately only a little. But we will try to make something out of it. Last time, we asked about the number of team changes and its correlation to popularity vs. scored goals for soccer players. What is left, if we look at the available data?





We have the data about the years in which the football players were active or have played in their national soccer team. Let's start with the national team years [1]:

nationalyears NumPlayers
8 3
7 20
6 85
5 371
4 1070
3 2479
2 6116
1 28211

Well, it was obvious that most players have only one or two years in the national team. But there are exceptional players who achieved as many as 8 years. Who are these long-term players? [2]:

nationalyears Player Team
8 Wojciech Łobodziński "Poland"@en
8 Wojciech Łobodziński "Poland Under 16"@en
8 Wojciech Łobodziński "Poland Under 17"@en
8 Wojciech Łobodziński "Poland Under 21"@en
8 Wojciech Łobodziński "Poland Under 18"@en
8 Santiago Cañizares "Spain Under-17"@en
8 Santiago Cañizares "Spain Under-16"@en
8 Santiago Cañizares "Spain Under-21"@en
8 Santiago Cañizares "Spain Under-18"@en
8 Santiago Cañizares "Spain Under-23"@en
8 Santiago Cañizares "Spain Under-19"@en
8 Santiago Cañizares "Spain Under-20"@en
7 Aydın Yılmaz "Turkey Under-21"@en
7 Aydın Yılmaz "Turkey"@en
7 Aydın Yılmaz "Turkey"@en
7 Aydın Yılmaz "Turkey Under-19s"@en
7 Aydın Yılmaz "Turkey Under-17"@en
7 Aydın Yılmaz "Turkey A2"@en
7 Ismael Urzaiz "Spain Under-17"@en
7 Ismael Urzaiz "Spain Under-16"@en

Possibly you have never heard of Poland's Wojciech Łobodziński or Spain's Santiago Cañizares. Here another flaw in the data becomes visible: there is no such thing as a single national team. We have "under 16", "under 17", "under 18", and so on... So you start your career at 15, and after 8 years you would be 23 and possibly be in the "real" national team.

Let's have a look at the active years of players. Unfortunately, here the data is rather messy [3]: 
person From To
http://dbpedia.org/resource/Marta_(footballer) 2000 9223372036854775807
http://dbpedia.org/resource/Carlos_Alberto_Gomes_de_Lima 2006 200720082008
http://dbpedia.org/resource/Blake_Camp 2008 200420052006
http://dbpedia.org/resource/Birgit_Prinz 1998 200320042005
http://dbpedia.org/resource/Dejan_Damjanovi%C4%87 1998 20112012
http://dbpedia.org/resource/Inka_Grings 1995 20092010
http://dbpedia.org/resource/Breno_Silva 2003 20092010
http://dbpedia.org/resource/Vin%C3%ADcius_Calamari 2007 20092010
http://dbpedia.org/resource/Samuel_Jos%C3%A9_da_Silva_Vieira 1994 20082009
http://dbpedia.org/resource/Chad_Marshall 2004 20082009
http://dbpedia.org/resource/Michael_Parkhurst 2003 20072008
http://dbpedia.org/resource/Geison_Rodrigues_Marrote 2004 20072008
http://dbpedia.org/resource/Dejan_Stankovi%C4%87 1994 20062010
http://dbpedia.org/resource/Glauber_Da_Silva 2001 20062007
http://dbpedia.org/resource/Hugo_Sarmiento 1999 20032007
http://dbpedia.org/resource/Obafemi_Martins 2000 20032004
http://dbpedia.org/resource/Heslley_Couto 2005 20032006
http://dbpedia.org/resource/Dalibor_Filipovi%C4%87 1992 20022003
http://dbpedia.org/resource/Dmytro_Zayko 2005 20022004
http://dbpedia.org/resource/Aviv_Volnerman 1998 20012004

Possibly this is because people write not only single years into the Wikipedia infoboxes, but also time spans and other things. The list above shows only the top 20. Just remove the LIMIT from the SPARQL query and further down you will find more valid data.
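
Many of the values in the "To" column look like several 4-digit years glued together. Assuming that this is really what happened (which is only a guess), such values could be split up again as in the following Python sketch. The function name and the range of plausible years are made up for illustration:

```python
def split_years(raw, first_year=1850, last_year=2014):
    """Split a run of digits such as '20092010' into plausible 4-digit years.

    Returns an empty list when the value cannot be read that way.
    """
    if not raw.isdigit() or len(raw) % 4 != 0:
        return []
    years = [int(raw[i:i + 4]) for i in range(0, len(raw), 4)]
    if all(first_year <= year <= last_year for year in years):
        return years
    return []


for value in ["200720082008", "20092010", "9223372036854775807"]:
    print(value, "->", split_years(value))
```

The first two values come apart into single years, while the 19-digit value from the first row is rejected because it cannot be cut into 4-digit pieces.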

OK, let's come to our last problem related to football. Is there a correlation between the height of a player (simply because we have that data) and the number of achieved goals [4]?
Height sumgoals
243.84 1
23.0 59
215.9 1
206.0 0
204.0 0
203.2 24
203.2 47
203.0 0
203.0 0
203.0 51
202.0 19
202.0 0
202.0 15
201.0 20
201.0 0
201.0 0
200.66 11
200.66 48
200.66 0
200.66 0
200.66 0
200.66 124
200.66 146
200.66 0
200.66 1
200.66 16
200.66 0
200.66 102
200.66 6
200.66 0
200.0 0
200.0 20
199.0 0
199.0 0
199.0 78
199.0 47
199.0 0
199.0 62
199.0 32
199.0 84

It is rather difficult to recognize anything in this data. You see heights (in centimeters) and the number of goals that a player of that height has scored. It is fascinating that there seems to be a significant number of players who are taller than 2 meters. I guess that the list leader at 2.44 meters (243.84 cm) is just incorrect data. Now, this is so much data that we have to visualize it to recognize anything...

Correlation between height of soccer players and scored goals
In the diagram to the left you see the soccer players' height (x-axis) vs. the number of goals scored (y-axis). Interestingly, the heights approximately follow a Gaussian distribution, i.e., most players have a "middle" height, and there are only a few at the extremes. Well, there seems to be an exception: on the far left you will notice a large group of players with a height of 1.52 m and goal counts ranging from 0 to 300. This is extraordinary, because I have no idea what kind of group this is, or whether there is simply an error in the data again. What I noticed is that this group seems to contain a larger fraction of female Asian soccer players. Maybe they are responsible for that large number of outliers, but this requires further investigation.
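Besides looking at a diagram, the question "is there a correlation?" can also be put into a single number. The following Python sketch computes the Pearson correlation coefficient by hand for a few rows from the table above. It is only an illustration: the sample is tiny, and the cut-off of 140 to 220 cm for "plausible" heights is an assumption of mine, chosen to drop obvious errors such as 243.84 and 23.0.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# (height in cm, goals): the first rows of the table above.
rows = [(243.84, 1), (23.0, 59), (215.9, 1), (206.0, 0),
        (204.0, 0), (203.2, 24), (203.2, 47), (203.0, 0)]

# Assumed plausibility filter; the thresholds are not from the original data.
plausible = [(h, g) for h, g in rows if 140.0 <= h <= 220.0]

r = pearson([h for h, _ in plausible], [g for _, g in plausible])
print(f"{len(plausible)} rows kept, r = {r:.3f}")
```

To say anything meaningful, the same function would have to be fed the complete query result instead of these few rows.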

Finally, I want to add one last table. For each player there is information on the position she or he plays. Of course, Wikipedia authors are far from any sort of agreement on how to name a player's position. Thus, there is a rather huge variety. Nevertheless, I will leave you with the table to make sense of it. Enjoy [5]:
position number avheight avgoals
midfielder 4450 176.822828644929299 18.077078651685393
defender 3429 181.37523034450506 8.058326042578011
striker 2643 180.169761559433703 65.213015512674991
goalkeeper 2086 186.186064557569709 0.174496644295302
forward 1787 178.865346269495465 46.388919977616116
centre back 786 185.7391350821381 10.651399491094148
winger 648 174.69354896725695 27.583333333333333
attacking midfielder 490 175.897877222177931 38.006122448979592
left back 461 177.661735253323698 6.995661605206074
defensive midfielder 427 179.741452037869347 11.978922716627635
right back 405 177.829456545982828 7.301234567901235
central defender 196 185.005408462213008 8.897959183673469
central midfielder 161 178.223105294363834 20.099378881987578
full back 158 174.659366173080242 6.759493670886076
centre forward 140 179.09685701642717 80.792857142857143
left winger 113 174.790176753449224 28.398230088495575
defender, midfielder 102 179.789412255380659 10.607843137254902
inside forward 96 172.094789346059145 64.083333333333333
defender/midfielder 82 179.178292995545911 11.463414634146341
defender / midfielder 79 180.188100887250282 15.050632911392405
left-back 65 175.642922504131603 7.692307692307692
right winger 62 175.247418803553424 30.096774193548387
centre-back 56 521.327500479561941 9.375
right-back 54 176.101110952871815 6.111111111111111
midfielder/forward 52 172.013075924836657 22.557692307692308
midfield 52 176.982691251314588 30.673076923076923
defender (retired) 50 181.37344055175781 14.48
second striker 48 174.673332850138346 63.854166666666667
full-back 44 175.908408771861666 2.681818181818182
striker / winger 41 178.235609845417295 49.219512195121951

Of course, I limited the table to 30 rows. Interestingly, the average height of goalkeepers is greater than that of midfielders or strikers. As expected, strikers score more goals on average than defenders or goalkeepers.
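The variety in the position names is partly just spelling: "left back" vs. "left-back", or "defender, midfielder" vs. "defender/midfielder" vs. "defender / midfielder". A rough Python sketch of how such rows could be merged is shown below. The normalization rules and the names normalize_position and merge_rows are my own assumptions. The merged averages are weighted by the "number" column, which is only correct if every counted player actually has a height and a goal value.

```python
import re


def normalize_position(label):
    """Map spelling variants of a position name to one common form."""
    text = label.lower().strip()
    text = re.sub(r"\s*\(.*?\)", "", text)   # drop notes such as "(retired)"
    text = re.sub(r"\s*[/,]\s*", "/", text)  # "a, b" and "a / b" become "a/b"
    text = text.replace("-", " ")            # "left-back" becomes "left back"
    return re.sub(r"\s+", " ", text).strip()


def merge_rows(rows):
    """Merge (position, number, avheight, avgoals) rows by normalized name."""
    sums = {}
    for position, number, avheight, avgoals in rows:
        key = normalize_position(position)
        n, h, g = sums.get(key, (0, 0.0, 0.0))
        sums[key] = (n + number, h + number * avheight, g + number * avgoals)
    # Turn the weighted sums back into weighted averages.
    return {key: (n, h / n, g / n) for key, (n, h, g) in sums.items()}


# Five rows from the table above.
rows = [
    ("left back", 461, 177.661735253323698, 6.995661605206074),
    ("left-back", 65, 175.642922504131603, 7.692307692307692),
    ("defender, midfielder", 102, 179.789412255380659, 10.607843137254902),
    ("defender/midfielder", 82, 179.178292995545911, 11.463414634146341),
    ("defender / midfielder", 79, 180.188100887250282, 15.050632911392405),
]

merged = merge_rows(rows)
for position, (number, avheight, avgoals) in merged.items():
    print(position, number, round(avheight, 2), round(avgoals, 2))
```

Real synonyms such as "striker" and "forward" are not touched by this. Merging those would need football knowledge, not string handling.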


SPARQL Queries and original data:

Thursday, July 03, 2014

Harald's Original Miscellany - More Truth about Football - Part 3

To change the team means to earn more money... What about the football millionaires? How often do they change the team?
Are you ready for more statistics on your favorite sport? Well, data is fun, and obviously Big Data means Big Fun. There are lots of interesting things to discover while exploring data, and Wikipedia (i.e. DBpedia for the insiders) provides all the necessary means.

Have you ever wondered about this kind of slave trade in professional football? Well, I wouldn't exactly call the transfer of a millionaire to a higher-paying job a 'slave trade'. But have you ever thought about the following question: Do the really good (and well-paid) players change teams more often, or is it the other way round, with teams trying to get rid of players who have a bad season or are on the decline? Who knows? Let's have a look at the data:
TeamChanges NumPlayers
16 2
15 26
14 84
13 287
12 792
11 2247
10 3848
9 5109
8 6464
7 8110
6 9790
5 11264
4 11837
3 11448
2 10961
1 6515
Here we have a table showing how many players (in Wikipedia) have changed their team how many times [1]. It seems to be a roughly bell-shaped distribution with a peak between 2 and 6 team switches. OK, what about the players? Where are the top players in this table? Well, David Beckham switched teams 11 times according to Wikipedia, Cristiano Ronaldo 8 times, Thierry Henry 9 times. At least these numbers are above the typical range, which we had identified to be between 2 and 6. This seems to support our original assumption.
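The "peak between 2 and 6" can be checked directly from the table. The following Python sketch treats the table as a histogram and computes its mode, its median and its mean. The variable and function names are mine, and the numbers are copied from the TeamChanges table above.

```python
# TeamChanges -> NumPlayers, copied from the table above.
team_changes = {
    16: 2, 15: 26, 14: 84, 13: 287, 12: 792, 11: 2247, 10: 3848, 9: 5109,
    8: 6464, 7: 8110, 6: 9790, 5: 11264, 4: 11837, 3: 11448, 2: 10961,
    1: 6515,
}


def histogram_median(hist):
    """Return the (lower) median of a value -> count histogram."""
    half = sum(hist.values()) / 2
    running = 0
    for value in sorted(hist):
        running += hist[value]
        if running >= half:
            return value
    raise ValueError("empty histogram")


total = sum(team_changes.values())
mode = max(team_changes, key=team_changes.get)  # most frequent value
median = histogram_median(team_changes)
mean = sum(k * n for k, n in team_changes.items()) / total

print(f"players: {total}, mode: {mode}, median: {median}, mean: {mean:.2f}")
```

Keep in mind that players with zero team changes do not appear in the table, so these figures describe only players with at least one change.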

person TeamChanges popularity
http://dbpedia.org/resource/Cristiano_Ronaldo 8 1794
http://dbpedia.org/resource/David_Beckham 11 1572
http://dbpedia.org/resource/Thierry_Henry 9 1414
http://dbpedia.org/resource/Lionel_Messi 7 1404
http://dbpedia.org/resource/Wayne_Rooney 5 1343
http://dbpedia.org/resource/Frank_Lampard 5 1188
http://dbpedia.org/resource/Pel%C3%A9 4 1111
http://dbpedia.org/resource/Didier_Drogba 8 1047
http://dbpedia.org/resource/Ronaldo 10 1037
http://dbpedia.org/resource/Michael_Owen 6 1011

But we get a better overview if we look at the average popularity of each switching group in the table [3]:
TeamChanges NumPlayers avgindegree
16 2 107.5
15 26 76.769230769230769
14 84 47.297619047619048
13 287 60.885017421602787
12 788 45.073604060913706
11 2235 37.206263982102908
10 3800 32.577631578947368
9 5023 30.939080230937687
8 6332 29.685881238155401
7 7935 25.499054820415879
6 9525 23.188346456692913
5 10886 18.682895462061363
4 11423 14.937844699290904
3 10842 11.131802250507286
2 10089 7.704628803647537
1 5534 5.19588001445609
As we had originally thought, the popularity of the players rises, on average, with the number of team changes, although even the top group with 16 changes is far from the highest individual popularity scores (e.g. 1572 for David Beckham). Hmm, maybe there is a correlation between the number of goals scored and the number of team changes? Is it more likely that a top goal scorer switches teams more often? Let's have a look [4]:
TeamChanges NumPlayers AvgGoals
16 2 22.5
15 26 28.807692307692308
14 83 37.855421686746988
13 287 38.062717770034843
12 781 39.939820742637644
11 2205 38.625850340136054
10 3784 35.646141649048626
9 4980 32.826907630522088
8 6296 29.489517153748412
7 7826 24.455788397648863
6 9314 21.937835516426884
5 10471 18.655142775284118
4 10627 16.317869577491296
3 9612 12.756450270495214
2 7193 9.911858751564021
1 4624 4.444204152249135

Looks interesting. The groups with the highest goal averages have 9 to 14 team switches. This is way above the average. Thus, the more goals you score, the more often you will get the chance of being transferred (and thus earn more money). Players who don't score goals will obviously not be transferred (that often).
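To see where exactly the goal average peaks, the last table can be scanned with a few lines of Python. This is just a sketch on the aggregated numbers from the table above. The names are mine, and it says nothing about individual players or about cause and effect.

```python
# TeamChanges -> (NumPlayers, AvgGoals), copied from the table above.
goals_by_changes = {
    16: (2, 22.5), 15: (26, 28.807692307692308),
    14: (83, 37.855421686746988), 13: (287, 38.062717770034843),
    12: (781, 39.939820742637644), 11: (2205, 38.625850340136054),
    10: (3784, 35.646141649048626), 9: (4980, 32.826907630522088),
    8: (6296, 29.489517153748412), 7: (7826, 24.455788397648863),
    6: (9314, 21.937835516426884), 5: (10471, 18.655142775284118),
    4: (10627, 16.317869577491296), 3: (9612, 12.756450270495214),
    2: (7193, 9.911858751564021), 1: (4624, 4.444204152249135),
}


def avg_goals(changes):
    return goals_by_changes[changes][1]


# Group with the highest average number of goals.
peak = max(goals_by_changes, key=avg_goals)

# Does the average rise steadily up to the peak and fall after it?
rising = all(avg_goals(k) < avg_goals(k + 1) for k in range(1, peak))
falling = all(avg_goals(k) > avg_goals(k + 1) for k in range(peak, 16))

print(f"peak at {peak} team changes, rising before: {rising}, "
      f"falling after: {falling}")
```

The groups beyond the peak are small (287 players or fewer), so their averages should be read with some caution.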

References: