....more semantic!

Thursday, December 04, 2014

The Fact Ranking Challenge continues....

Yes, you might remember our experiment with fact ranking [1,2,3,4]. After almost 4 months, this is the current state of intermediate results (cf. below).

To make it short: YES, there has been some progress. NO, it is not sufficient so far.

Therefore, PLEASE HELP!
If you have already started to work with the application [1], PLEASE CONTINUE!
If you don't know the application [1], PLEASE START!
PLEASE PARTICIPATE!

We know that it is not an easy task and we also know that it takes time. Therefore, WE REALLY DO APPRECIATE YOUR HELP VERY MUCH!!

Please keep on playing and help us to gather more data!
Please tell all your family and friends to support us!
Please tell all your colleagues and fellow students to support us!

THANK YOU VERY MUCH!

[1] The Fact Ranking Quiz Application, http://s16a.org/fr/
[2] Help us with a Research Problem, July 30, 2014
[3] The Importance of Relevance - Intermediate Results, Aug 19, 2014
[4] More intermediate Results from our Fact Ranking Challenge, Sep. 2, 2014

...and here are the statistics:

Number of users who participated: 465

Sum of concepts done: 2410

485 unique concepts are covered (out of 541).

Average concepts done per user: 5.183

295 times concepts were skipped (relevance in Step2 hasn't been changed for any of the facts).

CONCEPTS DONE:

0 concepts were done by 97 users.

1 concepts were done by 93 users.

2 concepts were done by 78 users.

3 concepts were done by 49 users.

4 concepts were done by 33 users.

5 concepts were done by 26 users.

6 concepts were done by 10 users.

7 concepts were done by 7 users.

8 concepts were done by 9 users.

9 concepts were done by 8 users.

10 concepts were done by 8 users.

11 concepts were done by 6 users.

12 concepts were done by 2 users.

13 concepts were done by 1 users.

14 concepts were done by 5 users.

15 concepts were done by 3 users.

16 concepts were done by 3 users.

17 concepts were done by 1 users.

18 concepts were done by 1 users.

19 concepts were done by 2 users.

20 concepts were done by 3 users.

21 concepts were done by 2 users.

23 concepts were done by 1 users.

24 concepts were done by 2 users.

25 concepts were done by 1 users.

26 concepts were done by 1 users.

31 concepts were done by 1 users.

36 concepts were done by 1 users.

40 concepts were done by 1 users.

42 concepts were done by 2 users.

56 concepts were done by 1 users.

58 concepts were done by 1 users.

60 concepts were done by 1 users.

64 concepts were done by 1 users.

68 concepts were done by 1 users.

70 concepts were done by 1 users.

88 concepts were done by 1 users.

201 concepts were done by 1 users.
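
The summary figures at the top of the statistics can be re-derived from this histogram. The following is a minimal Python sketch with the histogram transcribed from the list above; the variable names are illustrative and not taken from the application itself.

```python
# Histogram transcribed from the list above: {concepts done: number of users}.
concepts_done = {
    0: 97, 1: 93, 2: 78, 3: 49, 4: 33, 5: 26, 6: 10, 7: 7, 8: 9, 9: 8,
    10: 8, 11: 6, 12: 2, 13: 1, 14: 5, 15: 3, 16: 3, 17: 1, 18: 1, 19: 2,
    20: 3, 21: 2, 23: 1, 24: 2, 25: 1, 26: 1, 31: 1, 36: 1, 40: 1, 42: 2,
    56: 1, 58: 1, 60: 1, 64: 1, 68: 1, 70: 1, 88: 1, 201: 1,
}

# Every user appears in exactly one bucket.
total_users = sum(concepts_done.values())
# Each bucket contributes (concepts done) x (users in that bucket).
total_concepts = sum(n * users for n, users in concepts_done.items())
average = total_concepts / total_users

print(total_users, total_concepts, round(average, 3))
```

The totals agree with the 465 users, 2410 concepts and the average of 5.183 reported above.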

EDUCATION:

highschool : 47 users.

phd : 71 users.

other : 26 users.

bachelors : 105 users.

masters : 213 users.

AGE:

33+ : 261 users.

19-25 : 72 users.

26-32 : 122 users.

0-18 : 7 users.

GENDER:

female : 111 users.

male : 351 users.

COUNTRY OF ORIGIN:

Angola : 1 users.

Belarus : 1 users.

Portugal : 4 users.

Philippines : 2 users.

Morocco : 3 users.

Greece : 5 users.

Ukraine : 3 users.

Indonesia : 4 users.

Afghanistan : 2 users.

Sri Lanka : 1 users.

Italy : 15 users.

Iraq : 2 users.

India : 44 users.

France : 14 users.

Denmark : 1 users.

Latvia : 1 users.

Pakistan : 4 users.

Syrian Arab Republic : 2 users.

Montenegro : 1 users.

Armenia : 1 users.

Mexico : 2 users.

Canada : 1 users.

Brazil : 10 users.

Venezuela : 1 users.

Croatia : 2 users.

Macedonia, The Former Yugoslav Republic of : 1 users.

Romania : 2 users.

Western Sahara : 1 users.

Algeria : 5 users.

Sweden : 2 users.

United States : 24 users.

Serbia : 10 users.

Nigeria : 3 users.

Estonia : 1 users.

Spain : 9 users.

Taiwan, Republic of China : 2 users.

Ireland : 1 users.

Russian Federation : 12 users.

Israel : 1 users.

Colombia : 3 users.

Switzerland : 2 users.

Azerbaijan : 2 users.

Kenya : 2 users.

Yemen : 1 users.

Malaysia : 2 users.

Viet Nam : 1 users.

Australia : 5 users.

Peru : 1 users.

Albania : 1 users.

South Africa : 2 users.

Tunisia : 1 users.

Netherlands : 9 users.

China : 3 users.

Somalia : 1 users.

Slovenia : 1 users.

Finland : 3 users.

Lithuania : 1 users.

Austria : 7 users.

Sudan : 1 users.

United Kingdom : 15 users.

Egypt : 2 users.

Bahamas : 1 users.

Hungary : 1 users.

Belgium : 1 users.

Poland : 6 users.

Iran, Islamic Republic of : 2 users.

Bulgaria : 3 users.

Norway : 1 users.

Germany : 177 users.

New Zealand : 3 users.

COUNTRY OF RESIDENCE:

United Arab Emirates : 1 users.

null : 0 users.

Belarus : 1 users.

Portugal : 2 users.

Philippines : 1 users.

Morocco : 2 users.

Greece : 3 users.

Ukraine : 1 users.

Indonesia : 3 users.

Luxembourg : 1 users.

Sri Lanka : 1 users.

Italy : 14 users.

Iraq : 1 users.

India : 32 users.

France : 17 users.

Jordan : 1 users.

Denmark : 1 users.

Latvia : 1 users.

Pakistan : 3 users.

Syrian Arab Republic : 1 users.

Oman : 1 users.

Turkey : 1 users.

Czech Republic : 1 users.

Armenia : 1 users.

Canada : 4 users.

Brazil : 9 users.

Croatia : 1 users.

Romania : 3 users.

Algeria : 6 users.

Sweden : 3 users.

United States : 25 users.

Serbia : 5 users.

Nigeria : 1 users.

Saudi Arabia : 1 users.

Estonia : 2 users.

Spain : 7 users.

Taiwan, Republic of China : 1 users.

Ireland : 3 users.

Russian Federation : 6 users.

Israel : 2 users.

Colombia : 2 users.

Switzerland : 8 users.

Azerbaijan : 2 users.

Kenya : 2 users.

Norfolk Island : 1 users.

Yemen : 1 users.

Malaysia : 2 users.

Australia : 7 users.

Peru : 1 users.

Albania : 1 users.

South Africa : 2 users.

Netherlands : 13 users.

China : 1 users.

Somalia : 1 users.

Slovenia : 1 users.

Gambia : 1 users.

Finland : 3 users.

Lithuania : 1 users.

Austria : 6 users.

United Kingdom : 17 users.

Egypt : 1 users.

Bahamas : 1 users.

Belgium : 3 users.

Poland : 7 users.

Singapore : 2 users.

Iran, Islamic Republic of : 1 users.

Bulgaria : 3 users.

Norway : 1 users.

Germany : 197 users.

New Zealand : 3 users.

Overall average confidence of users about the seen concepts: 2.616

NO. OF USERS PER CONCEPT:

1 users for 67 concepts

2 users for 123 concepts

3 users for 108 concepts

4 users for 86 concepts

5 users for 73 concepts

6 users for 37 concepts

7 users for 11 concepts

8 users for 7 concepts

9 users for 2 concepts

10 users for 2 concepts

18 users for 1 concepts

On AVERAGE there are 3.398 users per concept.

NO. OF ANSWERS PER CONCEPT (STEP 1):

On AVERAGE there are 5.002 answers per concept.

NONSENSE STATEMENTS:

TOTAL number of nonsense sentences = 1771

The Flipped Classroom Experiment - First Lessons Learned

Sailing in previously unknown seas...

As I had already announced in the previous post entitled "And Now for Something Completely Different...", this winter semester the Semantic Web Technologies lecture does not take place like an ordinary lecture. We follow the so-called Flipped Classroom concept, i.e. the students "prepare" for the lecture (by watching the video recordings from previous semesters), and we do the lecture in a more interactive way, driven by the needs of the students. So much for the theory, but does it also hold in practice?
OK, we have now reached lecture No. 6, i.e. almost half of the entire course. Let me tell you what has happened so far.

Lecture preparation: I have to admit, I was a little bit worried whether it would really work. But the students always seem to be very well prepared and I'm really happy about that!
Interaction: Well, I have learned that it is me who has to take the initiative. Now, the game works as follows: every week, I publish so-called syllabus questions, i.e. essential questions that the students should be able to answer after they have understood the lecture. We now always start the lecture by going through these syllabus questions, which give us sufficient material for interactivity (=discussion).

When I simply ask "Do you have any questions about the content of the lecture?", I rarely receive any answers. But when I start to talk about some of the issues from the content, more and more people contribute.
Feedback: Well, from the answers I receive to the syllabus questions, I can immediately recognize where potential problems might be and we can talk about them. For this, I usually keep the lecture slides at hand for visual support. I can then repeat what is important, dive deeper into explanations, and give more examples.

Actually, I have asked the students whether they like this lecture format, and I have not received any negative answer so far.

What I have learned so far is that for me as a lecturer the effort for the flipped classroom is almost the same as for a 'traditional lecture'. Moreover, you have to be very well prepared, because you always encourage the students to ask all kinds of questions. The nice thing is the direct feedback that guides you to exactly those points where more explanation is needed. Personally, I like this kind of lecture very much. I have never had any other lecture so far (well, besides lab courses or seminars) with so much feedback from the students.

...and of course I would also like to thank all of my students for being willing to undergo this experiment together with me! :)

Tuesday, October 14, 2014

And now for something completely different...

As always in October, lectures are starting again. Like every year, I will give a lecture on Semantic Web Technologies. BTW, I have realized that I have now been giving courses on the Semantic Web for almost 10 years. It all started as a seminar at the Friedrich-Schiller Universität back in Jena and became a fully-fledged lecture here at the HPI in Potsdam. As with the lecture of last winter semester, almost all lectures have been recorded and are available online either at tele-Task or yovisto.

Moreover, we have also prepared two MOOCs, Semantic Web Technologies in Spring 2013 and Knowledge Engineering with Semantic Web Technologies in Spring 2014, both very successful with thousand(s) of students.

This semester, I have decided not to do the very same all over again and to try out something completely different...

Have you ever heard of the Flipped Classroom concept? This semester, we are going to turn the lecture situation around for the students. All the lecture content has already been recorded. Thus, students can prepare for each lecture at home by watching the videos and studying the handouts as well as the course materials. Then, in the classroom, I will not present the content again, but we are going to discuss

everything which needs more attention according to the students,
everything that the students did not quite understand,
including all problems, errors, and complements that seem to be important.

Thus, to follow the (live) lecture the students have to prepare accordingly. Of course this will only work with the active participation of the students. On the other hand, it will also be more challenging for the lecturer and the tutors, because we have to be very well prepared to deal with all kinds of potential questions and problems. Of course we will always work out solutions and answers together with the students. And it will also be the students who take the lead, of course under the lecturer's guidance.

I'm very curious whether this concept will work out well with my lecture here at HPI. Please keep your fingers crossed and I will keep you posted.

Additional Links:

The Blog for our lecture 'Semantic Web Technologies', winter semester 2014/15

Tuesday, September 02, 2014

More intermediate Results from our Fact Ranking Challenge

Fact Ranking Challenge

Again, here is an update on the data gathered so far in our fact ranking experiment.

We started the original challenge about 5 weeks ago and are now able to present some more intermediate results [1]. Nevertheless, the challenge is still running. Therefore, please distribute, participate, advertise, and help us to generate a fully fledged ground truth for fact ranking [2]. All data will be made publicly available for further research.

Determining the importance of a fact is crucial if you want to properly understand the content of information. Usually, you have a rich variety of possible interpretations of information. To determine the proper interpretation, you are going to use the context, i.e. further available information. So, the question develops from "what is important?" to "what is important with regard to this specific context?".

Current Intermediate Statistics (from Aug 28):

Number of users who participated: 388

Sum of concepts done: 1736

446 unique concepts are covered (out of 541).

Average concepts done per user: 4.47

CONCEPTS DONE:

0 concepts were done by 79 users.

1 concepts were done by 79 users.

2 concepts were done by 68 users.

3 concepts were done by 41 users.

4 concepts were done by 25 users.

5 concepts were done by 22 users.

6 concepts were done by 8 users.

7 concepts were done by 7 users.

8 concepts were done by 9 users.

9 concepts were done by 6 users.

10 concepts were done by 8 users.

11 concepts were done by 4 users.

12 concepts were done by 2 users.

13 concepts were done by 1 users.

14 concepts were done by 4 users.

15 concepts were done by 3 users.

16 concepts were done by 2 users.

19 concepts were done by 1 users.

20 concepts were done by 3 users.

21 concepts were done by 2 users.

22 concepts were done by 1 users.

23 concepts were done by 1 users.

24 concepts were done by 2 users.

25 concepts were done by 1 users.

31 concepts were done by 1 users.

38 concepts were done by 2 users.

40 concepts were done by 1 users.

42 concepts were done by 1 users.

58 concepts were done by 1 users.

59 concepts were done by 1 users.

62 concepts were done by 1 users.

64 concepts were done by 1 users.

EDUCATION:

highschool : 41 users.

phd : 57 users.

other : 22 users.

bachelors : 93 users.

masters : 174 users.

AGE:

33+ : 216 users.

19-25 : 68 users.

26-32 : 98 users.

0-18 : 5 users.

GENDER:

female : 90 users.

male : 297 users.

COUNTRY OF ORIGIN:

Angola : 1 users.

Belarus : 1 users.

Portugal : 4 users.

Philippines : 2 users.

Morocco : 3 users.

Greece : 5 users.

Ukraine : 3 users.

Indonesia : 3 users.

Sri Lanka : 1 users.

Italy : 13 users.

Iraq : 1 users.

India : 40 users.

France : 11 users.

Latvia : 1 users.

Pakistan : 3 users.

Syrian Arab Republic : 1 users.

Montenegro : 1 users.

Armenia : 1 users.

Mexico : 2 users.

Brazil : 10 users.

Venezuela : 1 users.

Croatia : 2 users.

Macedonia, The Former Yugoslav Republic of : 1 users.

Romania : 2 users.

Western Sahara : 1 users.

Algeria : 4 users.

Sweden : 2 users.

United States : 21 users.

Serbia : 8 users.

Nigeria : 2 users.

Estonia : 1 users.

Spain : 8 users.

Taiwan, Republic of China : 2 users.

Ireland : 1 users.

Israel : 1 users.

Russian Federation : 9 users.

Colombia : 3 users.

Switzerland : 1 users.

Azerbaijan : 1 users.

Kenya : 2 users.

Yemen : 1 users.

Malaysia : 2 users.

Viet Nam : 1 users.

Australia : 4 users.

Peru : 1 users.

Albania : 1 users.

South Africa : 2 users.

Netherlands : 8 users.

China : 2 users.

Somalia : 1 users.

Slovenia : 1 users.

Finland : 3 users.

Lithuania : 1 users.

Austria : 6 users.

Sudan : 1 users.

United Kingdom : 15 users.

Egypt : 2 users.

Bahamas : 1 users.

Hungary : 1 users.

Poland : 4 users.

Iran, Islamic Republic of : 2 users.

Bulgaria : 3 users.

Norway : 1 users.

Germany : 140 users.

New Zealand : 3 users.

COUNTRY OF RESIDENCE:

United Arab Emirates : 1 users.

Belarus : 1 users.

Portugal : 2 users.

Philippines : 1 users.

Morocco : 2 users.

Greece : 3 users.

Ukraine : 1 users.

Indonesia : 2 users.

Luxembourg : 1 users.

Sri Lanka : 1 users.

Italy : 11 users.

India : 31 users.

France : 13 users.

Jordan : 1 users.

Denmark : 1 users.

Latvia : 1 users.

Pakistan : 3 users.

Oman : 1 users.

Turkey : 1 users.

Czech Republic : 1 users.

Armenia : 1 users.

Canada : 3 users.

Brazil : 9 users.

Croatia : 1 users.

Romania : 3 users.

Algeria : 4 users.

Sweden : 3 users.

United States : 23 users.

Serbia : 4 users.

Nigeria : 1 users.

Saudi Arabia : 1 users.

Estonia : 2 users.

Spain : 7 users.

Taiwan, Republic of China : 1 users.

Ireland : 3 users.

Israel : 2 users.

Russian Federation : 2 users.

Colombia : 2 users.

Switzerland : 7 users.

Azerbaijan : 2 users.

Kenya : 2 users.

Norfolk Island : 1 users.

Yemen : 1 users.

Malaysia : 2 users.

Australia : 6 users.

Peru : 1 users.

Albania : 1 users.

South Africa : 1 users.

Netherlands : 12 users.

Somalia : 1 users.

Slovenia : 1 users.

Gambia : 1 users.

Finland : 3 users.

Lithuania : 1 users.

Austria : 5 users.

United Kingdom : 17 users.

Egypt : 1 users.

Bahamas : 1 users.

Belgium : 2 users.

Poland : 5 users.

Singapore : 1 users.

Iran, Islamic Republic of : 1 users.

Bulgaria : 3 users.

Norway : 1 users.

Germany : 154 users.

New Zealand : 2 users.

Overall confidence of users about the seen concepts: 2.687

NO. OF USERS PER CONCEPT:

On AVERAGE there are 2.18 users per concept.

NO. OF ANSWERS PER CONCEPT (STEP 1):

On AVERAGE there are 4.71 answers per concept.

NONSENSE STATEMENTS:

TOTAL number of nonsense sentences = 1371

Hint: You might wonder about the impressive high scores on the top of the list? Well, actually points are given exponentially, i.e. the longer you play, the more points you will score per processed concept.

References:
[1] Help Us with a Research Problem, July 30, 2014
[2] Fact Ranking Web-Application, http://s16a.org/fr/

Tuesday, August 19, 2014

The Importance of Relevance - Intermediate Results

Current Highscore List of our fact ranking challenge

In my last post, we invited you to take part in our research challenge, which is about creating a ground truth for fact ranking algorithms. Determining the importance of a fact is crucial if you want to properly understand the content of information. Usually, you have a rich variety of possible interpretations of information. To determine the proper interpretation, you are going to use the context, i.e. further available information. So, the question develops from "what is important?" to "what is important with regard to this specific context?".

We started the original challenge about 3 weeks ago and are now able to present first intermediate results [1]. Nevertheless, the challenge is still running. Therefore, please distribute, participate, advertise, and help us to generate a fully fledged ground truth for fact ranking [2]. All data will be made publicly available for further research.

Current Intermediate Statistics:
Number of users who participated: 110 (Thanks to all of you!!!)
Number of overall processed concepts: 509
Overall 200 unique concepts are covered (out of 541).

Average concepts processed per user: 4.63

Detailed number of processed concepts per user:
0 concepts were done by 15 users.
1 concepts were done by 27 users.
2 concepts were done by 21 users.
3 concepts were done by 12 users.
4 concepts were done by 7 users.
5 concepts were done by 8 users.
6 concepts were done by 2 users.
7 concepts were done by 1 users.
8 concepts were done by 4 users.
9 concepts were done by 1 users.
10 concepts were done by 2 users.
11 concepts were done by 2 users.
14 concepts were done by 1 users.
16 concepts were done by 1 users.
20 concepts were done by 1 users.
22 concepts were done by 1 users.
25 concepts were done by 1 users.
31 concepts were done by 1 users.
53 concepts were done by 2 users.

Participant statistics:

EDUCATION:
highschool : 7 users.
bachelors : 28 users.
masters : 47 users.
phd : 23 users.

other : 5 users.

AGE:
33+ : 43 users.
26-32 : 34 users.
19-25 : 33 users.

GENDER:
female : 25 users.
male : 85 users.

COUNTRY OF ORIGIN:
United States : 9 users.
Serbia : 6 users.
Spain : 2 users.
Ukraine : 1 users.
Russian Federation : 3 users.
Colombia : 1 users.
Italy : 6 users.
India : 2 users.
France : 3 users.
Malaysia : 1 users.
Australia : 1 users.
Albania : 1 users.
China : 1 users.
Pakistan : 1 users.
Finland : 1 users.
Austria : 2 users.
Montenegro : 1 users.
United Kingdom : 8 users.
Brazil : 3 users.
Poland : 2 users.
Iran, Islamic Republic of : 1 users.
Macedonia, The Former Yugoslav Republic of : 1 users.
Croatia : 1 users.
Germany : 49 users.
Algeria : 1 users.
New Zealand : 1 users.
Sweden : 1 users.

Overall confidence of users about the seen concepts: 2.585

NO. OF USERS PER CONCEPT:
On AVERAGE there are 1.295 users per concept.

NO. OF ANSWERS PER CONCEPT (STEP 1):
On AVERAGE there are 4.825 answers per concept.

We will keep you posted about the results.
Please distribute, participate, advertise, and help us to generate a fully fledged ground truth for fact ranking.

Hint: You might wonder about the impressive high scores on the top of the list? Well, actually points are given exponentially, i.e. the longer you play, the more points you will score per processed concept.

References:
[1] Help Us with a Research Problem, July 30, 2014
[2] Fact Ranking Web-Application, http://s16a.org/fr/

Wednesday, July 30, 2014

Help Us with a Research Problem

As you might know, we have previously tried to let the public participate in our research. Last time, we developed several games (with a purpose). This time, unfortunately, it is not a game, simply because the development of a good game is really expensive. But let's get to the point. What is the task you can help us with all about?

You know, my research group is working on semantic technologies. Semantics in that sense means that we are trying to (automatically) understand what information (or data) is all about and what it means. Sometimes, information is ambiguous. This makes it difficult to understand, because you have to resolve ambiguities with the help of context.

On the other hand, sometimes you have various pieces of information about a subject. How do you determine which piece of information or fact is more important or relevant than another? Just a quick example. Let's assume we have the following two facts:

(1) Albert Einstein is a physicist.
(2) Albert Einstein is a vegetarian.

Which of the two facts is more important or relevant? Yes, this is difficult to answer, simply because the truth often lies in the eye of the beholder. For a vegetarian, maybe the second fact is more important. But, what about the most common opinion? What would the mainstream think? Probably, most people would say that fact (1) in general is more important.

So, what we are doing is that we develop heuristics that determine the importance of facts (relative to other facts). To get an idea about the quality of our heuristics, we have to do an evaluation, i.e. somebody has to decide whether the decision of the heuristics was wrong or right. Unfortunately, there does not exist a ground truth for this task called "fact ranking". Therefore, we are about to create a new ground truth that later will be publicly available and open for all researchers.

This ground truth is created with the little 'voting' application that you will find here [1]. You just have to register with the tool and then the task will be explained to you in detail. We took 500 popular concepts from Wikipedia and you have (1) to think about the most important facts about these concepts that come to your mind and then (2) rate the (new) facts presented to you according to their relevance. There is no right or wrong answer. Just vote as seems right to you. Afterwards, we will aggregate all votes from all participants to determine the general (mainstream) relevance of the presented facts.
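
To make the aggregation step a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the votes are invented, the 1 to 5 rating scale is assumed, and averaging the ratings is just one simple way to aggregate, not necessarily the method we will finally use.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical votes: (participant, fact, relevance rating on an assumed 1-5 scale).
votes = [
    ("alice", "Albert Einstein is a physicist.", 5),
    ("bob", "Albert Einstein is a physicist.", 4),
    ("carol", "Albert Einstein is a physicist.", 5),
    ("alice", "Albert Einstein is a vegetarian.", 2),
    ("bob", "Albert Einstein is a vegetarian.", 1),
    ("carol", "Albert Einstein is a vegetarian.", 4),
]

def rank_facts(votes):
    """Return (mean rating, fact) pairs, most relevant fact first."""
    ratings = defaultdict(list)
    for _participant, fact, rating in votes:
        ratings[fact].append(rating)
    return sorted(((mean(r), fact) for fact, r in ratings.items()), reverse=True)

ranking = rank_facts(votes)
for score, fact in ranking:
    print(f"{score:.2f}  {fact}")
```

With these made-up votes, carol rates the vegetarian fact higher than the other two participants do, but the averaged ranking still puts the physicist fact first.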

You may interrupt your rating of the presented facts at any time you like and continue later. To make it a bit more interesting, you can also score points and of course there is a highscore list. We would really appreciate your help in this task. Please do also spread the word. The more participants, the more valid our ground truth will be.

We know that this is a difficult and sometimes rather boring task. All the more, we would be really grateful for your assistance!

[1] Fact Ranking Web-Application, http://s16a.org/fr/

...in Times of War

For us Western Europeans, war always seems to be far, far away in some other country (or some distant times in the past). Usually, we read about it in the newspapers or see the pictures in the media. But, we are not concerned directly. This also includes me as a researcher. Of course, we also have students in our institute who come from countries or regions in crisis. But, they are here and the crisis is there... elsewhere.

As you might know, I recently finished my OpenHPI online course 'Knowledge Engineering with Semantic Web Technologies'. The course was rather popular with a total of 4,623 enrolled students from all over the world. 611 students took part in the final course exam and 450 students have finished the course successfully (yeah!!).

Of course the means of interaction with the students are limited in an online course. You follow the stream of discussions on the OpenHPI platform, answering a question here and there. Sometimes you also receive email from one of the course participants...

Today, I received an email from a course participant in Gaza, Palestine. He wrote to me about his appreciation for the OpenHPI team offering courses like this and about the projects he carried out during his university studies. Unfortunately, as he wrote, due to the current situation in Gaza, infrastructure has been destroyed, resulting in power outages as well as network failures. Of course this makes it difficult, if not impossible, to continue the course (not to mention all the other major problems that arise for the people from this conflict). I am deeply impressed that in a situation like this, people still continue their efforts to invest in their education...and their future.

And yes, war has finally also knocked on the door of our small island of the fortunate...

Wednesday, July 16, 2014

New DBpedia Graph Statistics

Recently, we have been working on the DBpedia / Wikipedia page links dataset. We have considered the English and the German language versions for this project. The current DBpedia 3.9 page links datasets for English and German contain about 18 million and 6 million entities, respectively. However, the original DBpedia only contains about 4 million and 1 million distinct entities for the English and German versions. This significant difference arises mainly because the current DBpedia page links dataset includes redirect pages as well as page links to resources that are not considered entities (e.g. thumbnails and other images). So we decided to clean up the DBpedia page links dataset for the computation of statistical parameters (e.g. PageRank or HITS). For the cleanup, we removed all unnecessary and redundant RDF triples from the page links dataset, i.e. all redirect pages (redirect pages are just URIs that automatically forward a user to another Wikipedia page, but do not represent entities) as well as RDF triples representing resources that do not have their own rdfs:label (according to the DBpedia documentation, every entity has an rdfs:label).
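
The two cleanup rules can be sketched in a few lines of Python. This is a toy illustration, not our actual pipeline: the triples and resource names are invented, and the predicate strings are placeholder names standing in for the real page link, redirect and label properties.

```python
# Placeholder predicate names (assumptions for this sketch).
PAGE_LINK = "wikiPageWikiLink"
REDIRECT = "wikiPageRedirects"
LABEL = "rdfs:label"

# Toy triples (subject, predicate, object); all resource names are made up.
triples = [
    ("A", LABEL, "A"),
    ("B", LABEL, "B"),
    ("A", PAGE_LINK, "B"),
    ("A", PAGE_LINK, "B_redirect"),        # link to a redirect page
    ("B_redirect", REDIRECT, "B"),
    ("B_redirect", LABEL, "B redirect"),   # assumed label, so both rules are exercised
    ("A", PAGE_LINK, "Thumb.jpg"),         # resource without an rdfs:label
]

def clean_page_links(triples):
    """Keep only page links whose subject and object are labelled non-redirect resources."""
    redirects = {s for s, p, _ in triples if p == REDIRECT}
    labelled = {s for s, p, _ in triples if p == LABEL}
    entities = labelled - redirects
    return [
        (s, p, o) for s, p, o in triples
        if p == PAGE_LINK and s in entities and o in entities
    ]

cleaned = clean_page_links(triples)
print(cleaned)
```

In this toy example only the link from A to B survives; the link to the redirect page and the link to the unlabelled image are dropped.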

One of the benefits of the cleaned-up page links dataset is the faster computation of statistical graph measures (without influencing the overall statistics, since redirect pages usually don't have incoming links and the other removed resources (e.g. images) don't have outgoing links). Based on this dataset we have computed PageRank, Hubs and Authorities (HITS), page in-link counts and page out-link counts. Please find the details of the datasets on our research group's webpage [1].

For the computation of the DBpedia graph statistics we have used JUNG, the Java Universal Network/Graph Framework. Please find the source code for the PageRank and HITS computation on GitHub [2].
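
For readers who just want to see the idea behind PageRank, here is a small self-contained Python sketch using plain power iteration. It is not the JUNG-based code from [2]; the damping factor of 0.85, the fixed number of iterations and the toy graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [linked nodes]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every node passes its rank on in equal shares to the nodes it links to.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Toy graph with made-up page names.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

In the toy graph, page C receives links from all other pages and ends up with the highest rank, while D, which nobody links to, keeps only the baseline share.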

References and further Reading:
[1] New PageRank Computations for DBpedia 3.9 (English/German) at SemanticMultimedia
[2] Source code for DBpedia Graph Statistics

Monday, July 14, 2014

Harald's Original Miscellany - More Truth about Football - Part 4

Finally, Germany has won the Soccer World Cup 2014. Therefore, our little statistics series on soccer will also come to an end with today's post. You might ask yourself what kind of data is left for soccer players in Wikipedia and DBpedia. Well, unfortunately only a little. But we will try to make something out of it. Last time, we asked about the number of team changes and the correlation of popularity vs. scored goals for soccer players. What is left, if we look at the available data?

We have the data about the years in which the football players were active or have played in their national soccer team. Let's start with the national team years [1]:

nationalyears	NumPlayers
8	3
7	20
6	85
5	371
4	1070
3	2479
2	6116
1	28211

Well, it was to be expected that most players have only one or two years in the national team. But there are exceptional players who achieved even 8 years. Who are these long-term players? [2]:

nationalyears	Player	Team
8	Wojciech Łobodziński	"Poland"@en
8	Wojciech Łobodziński	"Poland Under 16"@en
8	Wojciech Łobodziński	"Poland Under 17"@en
8	Wojciech Łobodziński	"Poland Under 21"@en
8	Wojciech Łobodziński	"Poland Under 18"@en
8	Santiago Cañizares	"Spain Under-17"@en
8	Santiago Cañizares	"Spain Under-16"@en
8	Santiago Cañizares	"Spain Under-21"@en
8	Santiago Cañizares	"Spain Under-18"@en
8	Santiago Cañizares	"Spain Under-23"@en
8	Santiago Cañizares	"Spain Under-19"@en
8	Santiago Cañizares	"Spain Under-20"@en
7	Aydın Yılmaz	"Turkey Under-21"@en
7	Aydın Yılmaz	"Turkey"@en
7	Aydın Yılmaz	"Turkey"@en
7	Aydın Yılmaz	"Turkey Under-19s"@en
7	Aydın Yılmaz	"Turkey Under-17"@en
7	Aydın Yılmaz	"Turkey A2"@en
7	Ismael Urzaiz	"Spain Under-17"@en
7	Ismael Urzaiz	"Spain Under-16"@en

You may never have heard of Poland's Wojciech Łobodziński or Spain's Santiago Cañizares. Here another flaw in the data becomes visible. There is no such thing as a single national team. We have "under 16", "under 17", "under 18", and so on... So you can start your career at 15, and after 8 years you would be 23 and possibly in the "real" national team.
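
One way to separate youth teams from the senior team would be a simple pattern match on the team label. The following Python sketch is only a heuristic based on the labels visible in the table above; the pattern and the function name are my own assumptions.

```python
import re

# Matches labels such as "Poland Under 16", "Spain Under-17" or "Turkey Under-19s".
YOUTH_TEAM = re.compile(r"\bunder[ -]?\d+", re.IGNORECASE)

def is_youth_team(team_label):
    """Heuristic: a label containing 'Under <number>' denotes a youth team."""
    return bool(YOUTH_TEAM.search(team_label))

teams = ["Poland", "Poland Under 16", "Spain Under-17", "Turkey Under-19s", "Turkey A2"]
senior = [t for t in teams if not is_youth_team(t)]
print(senior)
```

Note that a label like "Turkey A2" is not caught by this pattern and would need separate handling.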

Let's have a look at the active years of players. Unfortunately, here the data is rather messy [3]:

person	From	To
http://dbpedia.org/resource/Marta_(footballer)	2000	9223372036854775807
http://dbpedia.org/resource/Carlos_Alberto_Gomes_de_Lima	2006	200720082008
http://dbpedia.org/resource/Blake_Camp	2008	200420052006
http://dbpedia.org/resource/Birgit_Prinz	1998	200320042005
http://dbpedia.org/resource/Dejan_Damjanovi%C4%87	1998	20112012
http://dbpedia.org/resource/Inka_Grings	1995	20092010
http://dbpedia.org/resource/Breno_Silva	2003	20092010
http://dbpedia.org/resource/Vin%C3%ADcius_Calamari	2007	20092010
http://dbpedia.org/resource/Samuel_Jos%C3%A9_da_Silva_Vieira	1994	20082009
http://dbpedia.org/resource/Chad_Marshall	2004	20082009
http://dbpedia.org/resource/Michael_Parkhurst	2003	20072008
http://dbpedia.org/resource/Geison_Rodrigues_Marrote	2004	20072008
http://dbpedia.org/resource/Dejan_Stankovi%C4%87	1994	20062010
http://dbpedia.org/resource/Glauber_Da_Silva	2001	20062007
http://dbpedia.org/resource/Hugo_Sarmiento	1999	20032007
http://dbpedia.org/resource/Obafemi_Martins	2000	20032004
http://dbpedia.org/resource/Heslley_Couto	2005	20032006
http://dbpedia.org/resource/Dalibor_Filipovi%C4%87	1992	20022003
http://dbpedia.org/resource/Dmytro_Zayko	2005	20022004
http://dbpedia.org/resource/Aviv_Volnerman	1998	20012004

Possibly this is because people write not only single years into the Wikipedia infoboxes, but also time spans and other things. The list above shows only the top 20. Just remove the LIMIT from the SPARQL query and further down you will find more valid data.
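
Many of the "To" values above look like several four-digit years glued together, and 9223372036854775807 is the largest signed 64-bit integer (2^63 - 1), so it is presumably an overflow or placeholder rather than a year. A minimal Python sketch of how such values could be untangled follows; the function name and the plausibility bounds are assumptions for illustration.

```python
INT64_MAX = 2**63 - 1  # 9223372036854775807, the "To" value listed for Marta above

def split_years(raw):
    """Split a run of concatenated four-digit years, e.g. '200720082008'."""
    text = str(raw)
    if not text.isdigit() or int(text) == INT64_MAX or len(text) % 4 != 0:
        return []  # not interpretable as a sequence of years
    years = [int(text[i:i + 4]) for i in range(0, len(text), 4)]
    # Keep the result only if every chunk looks like a plausible year (assumed bounds).
    return years if all(1850 <= y <= 2100 for y in years) else []

print(split_years("200720082008"))
print(split_years("20112012"))
print(split_years(9223372036854775807))
```

The first call yields the years 2007, 2008 and 2008, the second 2011 and 2012, and the last one an empty list. Whether the glued years are meant as a span or as separate entries still has to be checked against the original infobox.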

OK, let's come to our last problem related to football. Is there a correlation between the height of a player (simply because we have that data) and the number of goals scored [4]?

Height	sumgoals
243.84	1
23.0	59
215.9	1
206.0	0
204.0	0
203.2	24
203.2	47
203.0	0
203.0	0
203.0	51
202.0	19
202.0	0
202.0	15
201.0	20
201.0	0
201.0	0
200.66	11
200.66	48
200.66	0
200.66	0
200.66	0
200.66	124
200.66	146
200.66	0
200.66	1
200.66	16
200.66	0
200.66	102
200.66	6
200.66	0
200.0	0
200.0	20
199.0	0
199.0	0
199.0	78
199.0	47
199.0	0
199.0	62
199.0	32
199.0	84

It is rather difficult to recognize anything in this data. You see heights (in cm) and the number of goals that players of that height have scored. It is fascinating that there seems to be a significant number of players who are taller than 2 meters. I guess that the list leader with 2.44 meters (243.84 cm) is just incorrect data. Now, this is so much data that we have to visualize it to recognize something....

Correlation between height of soccer players and scored goals

In the diagram to the left you see the soccer players' height (x-axis) vs. the number of scored goals (y-axis). An interesting thing to notice is that the heights approximately show a Gaussian distribution, i.e., most players have a "middle" height, and at the extremes there are only a few. Well, there seems to be an exception. On the far left you will notice a large fraction of players with a height of 1.52m and goal scores ranging from 0 to 300. This is extraordinary, because I have no idea what kind of group this is or if there is simply an error in the data again. What I noticed is that among this group there seems to be a larger fraction of female Asian soccer players. Maybe they are responsible for that large number of outliers, but this requires further investigation.
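To put a number on what the diagram suggests, one could compute a Pearson correlation coefficient after dropping implausible heights. The following is only a sketch: the height window of 140 to 220 cm is my assumption, and the sample consists of just the first few rows of the table above (which is sorted by height), so the resulting value says nothing about the full data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# (height in cm, goals): a few rows copied from the table above.
sample = [
    (243.84, 1), (23.0, 59), (215.9, 1), (206.0, 0), (203.2, 24),
    (203.2, 47), (203.0, 51), (202.0, 19), (201.0, 20), (200.66, 124),
]

# Assumed plausibility window; it removes the 243.84 and 23.0 entries.
clean = [(h, g) for h, g in sample if 140.0 <= h <= 220.0]

heights = [h for h, _ in clean]
goals = [g for _, g in clean]
print(len(clean), "rows kept, r =", round(pearson(heights, goals), 3))
```

Run against the complete result of query [4], the same function would give a single value between -1 and 1 to set beside the visual impression.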

Finally, I want to add one last table. For each player there is the information about which position she or he plays. Of course, Wikipedia authors are far from any sort of agreement on how to name the player's position. Thus, there is a rather huge variety. Nevertheless, I will leave you with the table to make sense of it. Enjoy [5]:

position	number	avheight	avgoals
midfielder	4450	176.822828644929299	18.077078651685393
defender	3429	181.37523034450506	8.058326042578011
striker	2643	180.169761559433703	65.213015512674991
goalkeeper	2086	186.186064557569709	0.174496644295302
forward	1787	178.865346269495465	46.388919977616116
centre back	786	185.7391350821381	10.651399491094148
winger	648	174.69354896725695	27.583333333333333
attacking midfielder	490	175.897877222177931	38.006122448979592
left back	461	177.661735253323698	6.995661605206074
defensive midfielder	427	179.741452037869347	11.978922716627635
right back	405	177.829456545982828	7.301234567901235
central defender	196	185.005408462213008	8.897959183673469
central midfielder	161	178.223105294363834	20.099378881987578
full back	158	174.659366173080242	6.759493670886076
centre forward	140	179.09685701642717	80.792857142857143
left winger	113	174.790176753449224	28.398230088495575
defender, midfielder	102	179.789412255380659	10.607843137254902
inside forward	96	172.094789346059145	64.083333333333333
defender/midfielder	82	179.178292995545911	11.463414634146341
defender / midfielder	79	180.188100887250282	15.050632911392405
left-back	65	175.642922504131603	7.692307692307692
right winger	62	175.247418803553424	30.096774193548387
centre-back	56	521.327500479561941	9.375
right-back	54	176.101110952871815	6.111111111111111
midfielder/forward	52	172.013075924836657	22.557692307692308
midfield	52	176.982691251314588	30.673076923076923
defender (retired)	50	181.37344055175781	14.48
second striker	48	174.673332850138346	63.854166666666667
full-back	44	175.908408771861666	2.681818181818182
striker / winger	41	178.235609845417295	49.219512195121951

Of course I limited the table to 30 rows. Interestingly, the average height of goalkeepers is larger than that of midfielders or strikers. As expected, strikers on average score more goals than defenders or goalkeepers.
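Since many rows differ only in spelling ("left back" vs. "left-back", or the three variants of defender/midfielder), the table could be condensed by normalizing the labels and merging rows with a count-weighted average. This is a sketch under my own assumptions: the normalization rules are a guess at what counts as "the same" position, and the numbers are rounded copies of a few rows from the table above.

```python
import re
from collections import defaultdict


def normalize(label: str) -> str:
    """Map spelling variants of a position label to one form."""
    label = label.lower().strip()
    label = re.sub(r"\s*\(.*?\)", "", label)    # drop notes like "(retired)"
    label = label.replace("-", " ")             # "left-back" -> "left back"
    label = re.sub(r"\s*[/,]\s*", "/", label)   # unify "a, b" and "a / b"
    return re.sub(r"\s+", " ", label)


def merge(rows):
    """Merge (label, count, avg_height, avg_goals) rows by normalized label."""
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for label, count, avg_height, avg_goals in rows:
        entry = sums[normalize(label)]
        entry[0] += count
        entry[1] += count * avg_height   # weight each average by its count
        entry[2] += count * avg_goals
    return {
        key: (count, height_sum / count, goal_sum / count)
        for key, (count, height_sum, goal_sum) in sums.items()
    }


rows = [
    ("left back", 461, 177.66, 7.00),
    ("left-back", 65, 175.64, 7.69),
    ("defender, midfielder", 102, 179.79, 10.61),
    ("defender/midfielder", 82, 179.18, 11.46),
    ("defender / midfielder", 79, 180.19, 15.05),
]

merged = merge(rows)
for position, (count, avg_height, avg_goals) in merged.items():
    print(position, count, round(avg_height, 2), round(avg_goals, 2))
```

Whether "centre back" and "central defender" should also be merged is a football question rather than a string question, so the sketch leaves such synonyms alone.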

SPARQL Queries and original data:

[1] number of players with respect to years in the national team
[2] player name and team name of national team players with the longest playing terms
[3] players ordered by number of active years
[4] is there a correlation between soccer player height and scored goals?
[5] is there any correlation between player position, height, and scored goals?

Thursday, July 03, 2014

Harald's Original Miscellany - More Truth about Football - Part 3

To change the team means to earn more money... what about the football millionaires? How often do they change the team?

Are you ready for more statistics on your favorite kind of sport? Well, data is fun, and obviously Big Data means Big Fun. There are lots of interesting things to discover while exploring data, and Wikipedia (i.e. DBpedia for the insiders) provides all the necessary means.

Have you ever wondered about this kind of slave trade in professional football? Well, I wouldn't exactly call the transfer of a millionaire to a higher paying job a 'slave trade'. But have you ever thought about the following question: Do the really good (and well paid) players change teams more often, or is it the other way round, with teams trying to get rid of players who have a bad season or are on the decline? Who knows? Let's have a look at the data:

TeamChanges	NumPlayers
16	2
15	26
14	84
13	287
12	792
11	2247
10	3848
9	5109
8	6464
7	8110
6	9790
5	11264
4	11837
3	11448
2	10961
1	6515

Here, we have a table giving an overview of how many players (in Wikipedia) have changed their team how many times [1]. The distribution looks roughly bell-shaped, though skewed towards the low end, with a peak between 2 and 6 team switches. OK, what about the players? Where are the top players in this table? Well, David Beckham switched teams 11 times according to Wikipedia, Cristiano Ronaldo 8 times, Thierry Henry 9 times. At least these numbers are above the typical range of 2 to 6 that we had identified. This seems to support our original assumption (cf. the Top 10 most popular players [2]).
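The "peak between 2 and 6" can be made a bit more precise with a few lines of arithmetic on the TeamChanges/NumPlayers table. This is just a sketch; the variable names are mine, and the numbers are copied from the table at the top of this post.

```python
# TeamChanges -> NumPlayers, copied from the first table of this post.
team_changes = {
    16: 2, 15: 26, 14: 84, 13: 287, 12: 792, 11: 2247, 10: 3848, 9: 5109,
    8: 6464, 7: 8110, 6: 9790, 5: 11264, 4: 11837, 3: 11448, 2: 10961,
    1: 6515,
}

total = sum(team_changes.values())

# Count-weighted mean number of team changes per player.
mean_changes = sum(k * n for k, n in team_changes.items()) / total

# Most frequent number of team changes.
mode_changes = max(team_changes, key=team_changes.get)

# Share of all players with 2 to 6 team changes.
share_2_to_6 = sum(n for k, n in team_changes.items() if 2 <= k <= 6) / total

print("players:", total)
print("mean:", round(mean_changes, 2))
print("mode:", mode_changes)
print("share with 2-6 changes:", round(share_2_to_6, 3))
```

The most frequent value is 4 changes, and the weighted mean is slightly above 5, so 8 to 11 changes for the top players is indeed above average.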

person	TeamChanges	popularity
http://dbpedia.org/resource/Cristiano_Ronaldo	8	1794
http://dbpedia.org/resource/David_Beckham	11	1572
http://dbpedia.org/resource/Thierry_Henry	9	1414
http://dbpedia.org/resource/Lionel_Messi	7	1404
http://dbpedia.org/resource/Wayne_Rooney	5	1343
http://dbpedia.org/resource/Frank_Lampard	5	1188
http://dbpedia.org/resource/Pel%C3%A9	4	1111
http://dbpedia.org/resource/Didier_Drogba	8	1047
http://dbpedia.org/resource/Ronaldo	10	1037
http://dbpedia.org/resource/Michael_Owen	6	1011

But we get a better overview if we look at the average popularity of each switching group in the table [3]:

TeamChanges	NumPlayers	avgindegree
16	2	107.5
15	26	76.769230769230769
14	84	47.297619047619048
13	287	60.885017421602787
12	788	45.073604060913706
11	2235	37.206263982102908
10	3800	32.577631578947368
9	5023	30.939080230937687
8	6332	29.685881238155401
7	7935	25.499054820415879
6	9525	23.188346456692913
5	10886	18.682895462061363
4	11423	14.937844699290904
3	10842	11.131802250507286
2	10089	7.704628803647537
1	5534	5.19588001445609

As we had originally thought, on average the popularity of the players rises with the number of team changes, although the top group with 16 changes is far from the highest individual popularity scores (e.g. 1572 for David Beckham). Hmm, maybe there is a correlation between the number of goals scored and the number of team changes? Is it more likely that a top goal scorer switches teams more often? Let's have a look [4]:

TeamChanges	NumPlayers	AvgGoals
16	2	22.5
15	26	28.807692307692308
14	83	37.855421686746988
13	287	38.062717770034843
12	781	39.939820742637644
11	2205	38.625850340136054
10	3784	35.646141649048626
9	4980	32.826907630522088
8	6296	29.489517153748412
7	7826	24.455788397648863
6	9314	21.937835516426884
5	10471	18.655142775284118
4	10627	16.317869577491296
3	9612	12.756450270495214
2	7193	9.911858751564021
1	4624	4.444204152249135

Looks interesting. The groups with the highest average number of goals have 9 to 14 team switches. This is way above the average. So players who score more goals seem to get transferred more often (and thus earn more money), while players who don't score goals are apparently not transferred that often.
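The "9 to 14" observation can be read off the AvgGoals table directly. A small sketch, with values rounded to two decimals and a cut-off of 30 goals that I picked only for illustration:

```python
# TeamChanges -> AvgGoals, rounded from the table above.
avg_goals = {
    16: 22.50, 15: 28.81, 14: 37.86, 13: 38.06, 12: 39.94, 11: 38.63,
    10: 35.65, 9: 32.83, 8: 29.49, 7: 24.46, 6: 21.94, 5: 18.66,
    4: 16.32, 3: 12.76, 2: 9.91, 1: 4.44,
}

# Group with the highest average number of goals.
peak = max(avg_goals, key=avg_goals.get)

# Assumed cut-off for "high scoring" groups (illustrative only).
THRESHOLD = 30.0
high_groups = sorted(k for k, goals in avg_goals.items() if goals > THRESHOLD)

print("peak group:", peak, "changes with", avg_goals[peak], "goals on average")
print("groups above", THRESHOLD, "goals:", high_groups)
```

With this cut-off the high scoring groups are exactly those with 9 to 14 changes, and the peak lies at 12. Keep in mind that the groups with 15 and 16 changes contain only 26 and 2 players, so their averages are not very robust.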

References:

[1] How many players have how many team changes overall?
[2] The Top10 popular soccer players and their number of team changes
[3] The average popularity of football players regarding the number of team switches
[4] Number of average goals per soccer player with respect to the number of team changes