Wednesday, June 25, 2014

Harald's Original Miscellany - The Truth about Football - Part 2

John Terry Celebration Meme, read on and you will understand...
Of course you have always wanted to know who the best football player of all time is. Sure, this is a question that real football aficionados might argue about forever, and Wikipedia will not be able to give you the definitive answer either. But we can play around with the available data, and maybe we will find out something interesting about football players again ...

But first of all, I want to say thank you to Kingsley Idehen, who gave me the hint to use the parameter "qtxt=" instead of "query=" in my SPARQL query links. This enables others to see the original query and to use it for further data exploration. Thus, all SPARQL query links will be given in this form.
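For those who want to build such links themselves, here is a minimal Python sketch. The endpoint address, the example query and the helper name query_link are my assumptions for illustration; only the two parameter names come from the hint above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the public DBpedia SPARQL endpoint.
ENDPOINT = "http://dbpedia.org/sparql"

# Example query (illustrative): count all members of the class SoccerPlayer.
QUERY = """
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT (COUNT(DISTINCT ?player) AS ?players)
WHERE { ?player a dbpedia-owl:SoccerPlayer . }
"""

def query_link(query: str, show_query: bool = True) -> str:
    """Build a link to the endpoint (hypothetical helper).

    With show_query=True the query is passed as 'qtxt', so the reader
    gets to see the query text; otherwise it is passed as 'query'.
    """
    param = "qtxt" if show_query else "query"
    return ENDPOINT + "?" + urlencode({param: query.strip()})

link = query_link(QUERY)
print(link)
```

urlencode takes care of escaping the spaces, braces and line breaks in the query, so the link can be pasted anywhere.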

So let's start with the simplest query: select all football players and their popularity (indegree) in descending order, starting with the most popular player. We must be a little careful, because the class SoccerPlayer contains not only "real persons" but also popular roles of football players, e.g. "Captain". Therefore, we filter the results for entities that have a name (via foaf:name). Here are the Top 50 football players according to Wikipedia. For the entire list, please refer to the references [1].
Name Popularity
Cristiano Ronaldo 1794
David Beckham 1572
Thierry Henry 1414
Lionel Messi 1404
Wayne Rooney 1343
Frank Lampard 1188
Pelé 1111
Didier Drogba 1047
Ronaldo 1037
Michael Owen 1011
Steven Gerrard 1002
Zlatan Ibrahimović 964
Alessandro Del Piero 926
Ronaldinho 914
Raúl (footballer) 903
Ryan Giggs 894
Fernando Torres 889
Zinedine Zidane 867
Ruud van Nistelrooy 861
Robbie Keane 861
Samuel Eto'o 859
Landon Donovan 835
Andriy Shevchenko 823
Kaká 804
Francesco Totti 730
Robin van Persie 720
Paul Scholes 692
Hernán Crespo 680
David Villa 669
John Terry 669
Cesc Fàbregas 669
George Best 667
Carlos Tévez 666
Robinho 643
Gary Lineker 641
Teddy Sheringham 633
Andrew Cole 620
Dwayne De Rosario 617
Xavi 616
Jermain Defoe 613
Craig Bellamy 609
Dimitar Berbatov 587
David Trezeguet 587
Luis Suárez 581
Peter Crouch 577
Michael Ballack 572
Miroslav Klose 568
Luís Figo 567
Lee Dong-Gook 558
Filippo Inzaghi 557
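The logic behind this ranking can be sketched on toy data in a few lines of Python. This is not the actual query, just an illustration under assumptions: the link list, the entity names and the function popularity_ranking are made up, and "has a name" stands in for the foaf:name filter.

```python
from collections import Counter

# Toy page links (linking article, linked article); made-up data.
page_links = [
    ("Club X", "Player A"),
    ("Team Y", "Player A"),
    ("Club X", "Player B"),
    ("Team Y", "Player B"),
    ("Club Z", "Player B"),
    ("Club X", "Captain"),
    ("Team Y", "Captain"),
    ("Club Z", "Captain"),
]

# Members of the footballer class; "Captain" is a role, not a person.
soccer_players = {"Player A", "Player B", "Captain"}

# Only real persons carry a name (stand-in for foaf:name).
names = {"Player A": "Player A", "Player B": "Player B"}

def popularity_ranking(links, players, names):
    """Count incoming links per player, keep named entities, sort descending."""
    indegree = Counter(target for _, target in links if target in players)
    ranked = [(names[p], indegree[p]) for p in players if p in names]
    return sorted(ranked, key=lambda row: (-row[1], row[0]))

ranking = popularity_ranking(page_links, soccer_players, names)
print(ranking)
```

Note that "Captain" has as many incoming links as the top player here and would share first place without the name filter.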
Yes, it was obvious to everybody that names such as Ronaldo, Beckham, Thierry Henry, or Pelé would occur among the most popular players. Unfortunately, I'm not enough of a football expert to comment further on that. Let's have a look at whether popularity corresponds with the number of goals scored. However, this information is not easy to extract. For some of the football players, there's a property dbprop:totalGoals, while most of them have dbprop:goals. But the latter sometimes exists multiple times, for single years or periods. Thus, we have to sum up all dbprop:goals, while keeping in mind not to count any number more than once (because an entry might be reproduced in our result list for several reasons).
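The "sum up, but count every number only once" idea can be sketched like this in Python. The rows and the function total_goals are invented for illustration; the real result list of course comes from the SPARQL endpoint.

```python
from collections import defaultdict

# (player, goals) rows as a query might return them; made-up data.
rows = [
    ("Player A", 10),
    ("Player A", 10),  # the same entry reproduced a second time
    ("Player A", 7),
    ("Player B", 3),
    ("Player B", 5),
]

def total_goals(rows):
    """Sum the distinct goal values per player."""
    distinct = defaultdict(set)
    for player, goals in rows:
        distinct[player].add(goals)
    return {player: sum(values) for player, values in distinct.items()}

totals = total_goals(rows)
print(totals)
```

One caveat of this approach: if a player really scored the same number of goals in two different periods, the two values collapse into one and the total comes out too low. Avoiding that would require a key for the period, which I assume is not reliably available.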
Name Goals Popularity
David Schofield (footballer) 76543210 9
Alcindo Sartori 5019110 160
Oh Seung-Bum 1842256 32
Marei Al Ramly 6037 11
Darío Espínola 1715 6
Kim Andersson 1537 23
Stefan Lövgren 1328 18
Nikola Karabatić 1318 84
Elias Ribeiro de Oliveira 1187 26
Mohd Amar Rohidan 1020 38
Slaviša Žungul 856 113
John Bartley (footballer) 762 1
Zoran Karić 759 11
Jimmy Greaves 748 342
Ernest Spiteri Gonzi 704 11
Pierre van Hooijdonk 670 238
Reg Date 664 3
Trevor Phillips (footballer) 655 4
Joan Linares 645 12
Domenic Mobilio 625 58
Pelé 620 1111
Harry Johnson (footballer born 1899) 610 25
Ernst Stojaspal 602 23
Max Morlock 588 66
Ernie Hine 574 72
Branko Šegota 561 38
Serhiy Koridze 557 4
Salvinu Schembri 538 6
Konstantin Yeryomenko 537 13
Tony Brown (English footballer) 498 47
Waldo Machado 497 36
Nguyen Minh Phuong 496 50
Tony Cascarino 496 141
Leônidas da Silva 484 98
Zeki Rıza Sporel 470 75
Ángeles Parejo 469 9
Tommy Dickson 457 13
Elisabetta Vignotto 454 22
Alberto Spencer 445 99
Stefan Schwoch 435 7
Peter Kitchen 429 16
Edgar Kail 427 7
Eusébio 423 433
Giorgos Sideris 415 51
Tommy Browell 414 88
Patricio Margetic 412 14
Arsénio Trindade Duarte 409 19
Uwe Seeler 406 153
Hughie Gallacher 406 131
Dragan Džajić 401 150
Again we see that DBpedia data (resp. Wikipedia data) is somewhat 'noisy'. The first 3 ranks are obviously wrong concerning the number of goals: if David Schofield had really scored 76,543,210 goals, he would have scored about 5 goals per minute for all the 32 years of his life so far. This must be some kind of extraction error. If we look at the players with more than 1000 goals, a closer inspection reveals some handball players who are either also footballers or wrongly declared to be footballers. In handball it is easier to score a high number of goals than in football. Trevor Phillips and John Bartley really did score more than 600 goals, but their popularity scores signal that they did not necessarily achieve this in a major league. The first prominent football player in this list is definitely Pelé with 620 goals. The only other two in this Top 50 list I have already heard of are Eusébio and Uwe Seeler, but don't take me as a reference :)
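The plausibility check for David Schofield is simple arithmetic. As a small sketch, assuming years of 365 days (leap days ignored) and a player scoring around the clock:

```python
goals = 76_543_210             # goal count from the table above
years = 32                     # age used in the text
minutes = years * 365 * 24 * 60  # assumption: 365-day years

rate = goals / minutes
print(f"{rate:.2f} goals per minute")
```

This comes out at roughly 4.55 goals per minute, which rounds to the 5 goals per minute quoted above.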

Let's order the list the other way around, by the most popular players, to investigate their goal scores:
Name Goals Popularity
Cristiano Ronaldo 227 1794
David Beckham 95 1572
Thierry Henry 265 1414
Lionel Messi 223 1404
Wayne Rooney 156 1343
Frank Lampard 163 1188
Pelé 620 1111
Didier Drogba 160 1047
Ronaldo 217 1037
Michael Owen 163 1011
Steven Gerrard 98 1002
Zlatan Ibrahimović 198 964
Alessandro Del Piero 223 926
Ronaldinho 157 914
Raúl (footballer) 280 903
Ryan Giggs 114 894
Fernando Torres 161 889
Zinedine Zidane 95 867
Ruud van Nistelrooy 249 861
Robbie Keane 179 861
Samuel Eto'o 219 859
Landon Donovan 135 835
Andriy Shevchenko 219 823
Kaká 114 804
Francesco Totti 226 730
Robin van Persie 130 720
Paul Scholes 107 692
Hernán Crespo 198 680
John Terry 30 669
David Villa 234 669
Cesc Fàbregas 50 669
George Best 238 667
Carlos Tévez 135 666
Robinho 122 643
Gary Lineker 243 641
Teddy Sheringham 289 633
Andrew Cole 226 620
Dwayne De Rosario 94 617
Xavi 57 616
Jermain Defoe 151 613
Craig Bellamy 113 609
Dimitar Berbatov 189 587
David Trezeguet 218 587
Luis Suárez 128 581
Peter Crouch 102 577
Michael Ballack 117 572
Miroslav Klose 181 568
Luís Figo 91 567
Filippo Inzaghi 184 557
Patrick Vieira 45 551
As we would expect, most of the popular football players are also good goal scorers. Well, there are a few exceptions. Take John Terry, with a popularity score of 669 and only 30 goals. Why might he be so popular then? A closer look at Wikipedia reveals that Terry plays at centre back and is the captain of Chelsea in the Premier League. Well, that's already something for popularity. But if you look even closer, you will find more: under the heading 'Controversies' you will find charges of assault and affray, a £60 fine for parking his Bentley in a disabled bay, extramarital affair allegations, as well as racial abuse allegations. But none of these is directly responsible for Terry's popularity. In fact, it's an internet meme (cf. the introductory picture of this article). John Terry was suspended for the UEFA Champions League Final and had to watch his team in a suit and tie from the sidelines. He looked quite miserable as he sat there, watching his team defend for their lives and then miraculously pull out the victory. However, as soon as Chelsea had won, it was party time for Terry! He immediately threw off his suit like Superman and revealed the full Chelsea kit underneath. The internet community enjoyed his dedication to his club and to football so much that a popular internet meme lampooning his behaviour immediately appeared on the web, becoming one of the most popular online jokes of 2012. Terry has been pictured taking part in great moments in history and fiction, including the fall of the Berlin Wall, the freeing of Nelson Mandela, the triumph of Rocky Balboa, and the first landing on the Moon [3]. Well, this should be reason for some popularity :)

Monday, June 23, 2014

Harald's Original Miscellany - The Truth about Football

The England National football team, 1893, photo: wikipedia
Well, it's the time of the World Cup 2014. Why should I bother you with the peculiarities of authors and writers, when we can also have a look at football! As you might remember, we had about 15,328 individuals in Wikipedia classified as authors [1]. How many footballers do you think there are compared to authors? Well, there's a huge difference: 162,597 referenced football players, i.e. about 10 times as many as authors [2]. Maybe you now think there might be an overlap. How many football players are also listed as authors? I am sorry to disappoint you, but there is no overlap. No football player is also listed as being an author.

Fact No. 1: Footballers are not authors, and vice versa.

So you might wonder what other categories these football players are in, to get a better overview of what we are talking about. Interestingly, when looking at the most popular categories, you will soon notice the large number of expatriates among those players. The Top 5 expatriate nationalities among football players are: Brazil, Argentina, Russia, France, Serbia [3]. If you look at the bottom of the list, you will find the more or less exotic combinations, e.g. Hungarian expatriates in Uzbekistan, Cameroonian expatriates in Venezuela, or even German expatriates in the Netherlands ;-)

Fact No. 2: Only a few Hungarian football players emigrate to Uzbekistan.

And to which countries do they prefer to emigrate? The Top 5 countries for football players to emigrate to are: England, Germany, Spain, Italy, France [4]. At least for France the statistics seem to be somewhat balanced, while England and Germany are the leading nations in attracting foreign football players from all around the world. Also very interesting is the bottom of the list, where Gambia, Guam, Nepal, and Antigua and Barbuda are listed as the "least attractive" countries.

Fact No. 3: French football players seem to be undecided whether to stay or leave the country.

But what about the categories that do not have a direct relationship to football in the first place? Let's filter out the football categories and have a look at what else football players are up to.

Alternative professions of football players, #players (counts only): 279, 271, 250, 237, 234, 234, 134, 73, 64, 56, 55, 54, 51, 51, 43, 33, 32, 28, 28, 27, 26, 26, 25, 25, 25, 25, 23, 23, 21, 21, 20, 18, 18, 16, 16, 16, 16, 16, 16, 15, 15, 15, 15, 14, 14, 14, 13, 13, 12, 12
Please find the entire list in the references [5].

Fact No. 4: There are more intellectuals among football players than bad persons.

Wow, 237 football players are also categorized as being intellectuals, while 23 football players are listed as "bad persons". But in these statistics, we will also find 21 writers(!) among the football players, as well as 15 artists, 14 journalists, and 13 criminals.
Further down the list, you will also find
  • 12 politicians,
  • 8 identical twins,
  • 8(!) heads of state,
  • 7 musicians,
  • 4 comedians (I'll bet there are more...),
  • 4 scientists,
  • 3 singers,
  • 2 mammals,
  • 2 gentleman cricketers,
  • 2 gambling addicts,
  • 2 aviators,
  • 2 painters,
  • 1 UFO conspiracy theorist,
  • 1 bank robber,
  • 1 rapper,
  • 1 plumber


So what does this tell us about football players?

Fact No. 5: There are more politicians among football players than comedians.

Well, in general, and according to Wikipedia, football players mostly stick to their original profession. While they emigrate here and there sometimes, only a few among them actually have a second career outside of their original profession. Please note that we did not follow categories like football manager, football trainer, football coach, etc. Anyway, in the end we did find some authors among them....

Fact No. 6: There are writers among the football players...although they are not listed as authors.

to be continued....

Please find the full tables with all the results listed here in the References:

[1] total number of authors (?author rdf:type dbpedia-owl:Writer .)
[2] total number of football players (?player rdf:type dbpedia-owl:SoccerPlayer .)
[3] expatriate football players by home country
[4] expatriate football players by emigration country
[5] occupations of football players other than football

Sunday, June 22, 2014

Harald's Original Miscellany - Prolificacy vs. Popularity - Part 3

photo: wikipedia
So we were stranded with the problem that The Lord of the Rings is (of course) a notable work of J.R.R. Tolkien, but according to DBpedia it is not a "book"; rather, it consists of 3 books [1,2]. The problem is that filtering "notable works" by "books" cuts out book series. If we drop the "books" filter, our result list will also contain paintings, photographs, sculptures, etc.

Let's find out about books in general in DBpedia. How many are there anyway? If we simply ask for all entities of the type "book", we currently end up with 28,128 books [3]. If we ask for all things that have an author, we end up with 63,071 [4] or 71,046 [5], depending on the way we ask. OK, not everything that is authored by somebody is also a book. There might be short stories, essays, and articles, but also series of books. OK, say we don't care and try it with all these kinds of written works. Moreover, let's also consider the overall impact of all the written works of an author (by simply summing up the indegrees (=popularity) of all her works). The next table shows the Top 40 authors based on all written works of the author, ordered by overall impact (GrandTotal) [6]:

name numOfWorks popularityOfWorks GrandTotal authDegree
"Charles Dickens"@en 8 398.7 3190 4026
"J. R. R. Tolkien"@en 3 819.2 2964 2859
"Elizabeth Sarnoff"@en 1 2457.0 2457 99
"Robin Green"@en 1 2277.0 2277 130
"Lewis Carroll"@en 2 939.0 1878 1576
"J. K. Rowling"@en 1 1829.0 1829 983
"Michael Stewart (playwright)"@en 6 233.6 1402 120
"Robert Louis Stevenson"@en 3 448.0 1344 1457
"George Orwell"@en 4 363.7 1309 1687
"Arthur Miller"@en 4 319.2 1277 1077
"Miguel de Cervantes"@en 2 611.0 1222 992
"Henrik Ibsen"@en 4 299.7 1199 1630
"Bram Stoker"@en 1 1145.0 1145 717
"Stephen King"@en 7 146.4 1105 2906
"Oscar Wilde"@en 2 486.3 1032 2324
"Samuel Beckett"@en 8 83.2 889 1414
"C. S. Lewis"@en 6 105.4 881 1530
"Naoko Takeuchi"@en 1 875.0 875 133
"Alexandre Dumas"@en 2 427.5 855 1125
"Roald Dahl"@en 16 53.8 834 855
"Tony Barwick"@en 4 202.2 809 114
"Terry Pratchett"@en 2 292.6 809 1032
"Jeremy Lloyd"@en 6 129.0 774 251
"Jimmy Perry"@en 2 382.5 765 84
"Victoria Morrow"@en 1 760.0 760 12
"Leo Tolstoy"@en 2 374.5 749 1927
"Dan Brown"@en 3 224.3 673 387
"John Steinbeck"@en 3 218.0 654 984
"Mark Twain"@en 2 333.6 615 2426
"Isaac Asimov"@en 4 123.7 607 2026
"Jonathan Swift"@en 2 303.5 607 1190
"Joseph Conrad"@en 11 59.4 603 909
"Tsugumi Ohba"@en 2 283.0 566 62
"Vladimir Nabokov"@en 4 157.1 535 1015
"Yoshihiro Togashi"@en 2 266.5 533 70
"Ryukishi07"@en 2 264.5 529 57
"Aldous Huxley"@en 3 175.3 526 968
"Dustin Lance Black"@en 2 262.5 525 159
"Rudyard Kipling"@en 3 196.2 520 1829
"Charlotte Brontë"@en 2 258.0 516 510

The column "authDegree" denotes the popularity of the author alone (measured by the indegree of the author's article in Wikipedia). We notice that the situation has changed since our last experiment. Tolkien is now reported with 3 works (meaning that The Lord of the Rings is now included in his notable works list). Interestingly, he now leads the list together with Charles Dickens, followed by Elizabeth Sarnoff and Robin Green, who are each mentioned with only one notable work. Never heard of the latter two? Well, Elizabeth Sarnoff is a writer for TV series such as Lost, while Robin Green was a writer and producer of The Sopranos. By opening up the category "books" we now also have screenwriters in our list, and the popularity of TV series seems to be rather significant, at least compared to literature.
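How much the picture depends on the column we sort by can be tried out with a few rows copied from the table above. This is only a small Python sketch; the selection of the four rows is mine.

```python
# Rows copied from the table above: (name, numOfWorks, GrandTotal, authDegree)
rows = [
    ("Charles Dickens", 8, 3190, 4026),
    ("Elizabeth Sarnoff", 1, 2457, 99),
    ("J. K. Rowling", 1, 1829, 983),
    ("Stephen King", 7, 1105, 2906),
]

# Rank by the summed popularity of the works (GrandTotal) ...
by_works = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)]
# ... and by the popularity of the author's own article (authDegree).
by_author = [r[0] for r in sorted(rows, key=lambda r: r[3], reverse=True)]

print(by_works)
print(by_author)
```

Charles Dickens stays on top either way, but Elizabeth Sarnoff drops from second place to last as soon as the author's own article is what counts.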

The same holds for Naoko Takeuchi. Ever heard of her? Well, maybe you haven't. But then you will probably know Sailor Moon, a very popular Japanese manga series. Yes, now we also include what belongs to our contemporary literary culture: TV series and comics. Naoko Takeuchi sits between C.S. Lewis (The Chronicles of Narnia) and Alexandre Dumas (The Three Musketeers). That fits well, doesn't it? If you look at the last column (authDegree), you will notice that this value is significantly lower for Elizabeth Sarnoff, Robin Green, and Naoko Takeuchi than for C.S. Lewis or Alexandre Dumas. So maybe their works are currently rather popular, but in total the cultural influence of the established writers of the past seems to have more impact.

As a last question - and then I won't bother you with these statistics again - let's reorder the current table according to the author's general impact (authDegree) [7].

name numOfWorks popularityOfWorks GrandTotal authDegree
"Charles Dickens"@en 8 398.7 3190 4026
"Stephen King"@en 7 146.4 1105 2906
"J. R. R. Tolkien"@en 3 819.2 2964 2859
"Johann Wolfgang von Goethe"@en 5 73.8 363 2734
"Cicero"@en 1 23.0 23 2469
"Mark Twain"@en 2 333.6 615 2426
"Oscar Wilde"@en 2 486.3 1032 2324
"Isaac Asimov"@en 4 123.7 607 2026
"H. P. Lovecraft"@en 3 101.3 304 2024
"T. S. Eliot"@en 3 169.8 477 1995
"Leo Tolstoy"@en 2 374.5 749 1927
"Arthur Conan Doyle"@en 1 130.0 130 1882
"Rudyard Kipling"@en 3 196.2 520 1829
"George Orwell"@en 4 363.7 1309 1687
"Henrik Ibsen"@en 4 299.7 1199 1630
"Neil Gaiman"@en 5 89.1 424 1589
"Lewis Carroll"@en 2 939.0 1878 1576
"C. S. Lewis"@en 6 105.4 881 1530
"Rabindranath Tagore"@en 4 67.8 276 1530
"Alan Moore"@en 1 17.0 17 1480
"Robert Louis Stevenson"@en 3 448.0 1344 1457
"Samuel Beckett"@en 8 83.2 889 1414
"Alexander Pushkin"@en 2 131.0 262 1372
"Philip K. Dick"@en 4 95.0 380 1259
"Arthur C. Clarke"@en 2 72.5 145 1252
"Ray Bradbury"@en 3 129.0 387 1249
"Robert E. Howard"@en 3 35.0 90 1223
"Jonathan Swift"@en 2 303.5 607 1190
"William Wordsworth"@en 1 79.0 79 1164
"Henry James"@en 6 74.0 444 1128
"Alexandre Dumas"@en 2 427.5 855 1125
"William S. Burroughs"@en 1 159.0 159 1119
"Arthur Miller"@en 4 319.2 1277 1077
"Jack Kerouac"@en 3 123.0 369 1034
"Terry Pratchett"@en 2 292.6 809 1032
"Virginia Woolf"@en 3 89.6 269 1026
"Vladimir Nabokov"@en 4 157.1 535 1015
"Miguel de Cervantes"@en 2 611.0 1222 992
"John Steinbeck"@en 3 218.0 654 984
"J. K. Rowling"@en 1 1829.0 1829 983

Now we also see authors that haven't shown up in the first place because their single works were somehow too insignificant, but their overall impact wasn't. Goethe, Mark Twain, but also Stephen King, Cicero, or Arthur Conan Doyle are then among the top ranked authors. But no writers of tv shows or any mangas.

Well, you might ask why it should make sense to compile statistics like these when you get so many different and, in the end, confusing results. Which one should we trust? If you ask me, trust none of them. All data is insufficient and doesn't reflect the "whole story", esp. in DBpedia (which is again only a fraction of all information available in Wikipedia, which in turn only reflects one (or some) specific viewpoints of reality). Thus the old saying is reinforced again: Don't ever trust statistics that you haven't falsified yourself.

[3] Number of all books in DBpedia -> ?book rdf:type dbpedia-owl:Book .
[4] Number of books by counting what is authored -> ?book dbpedia-owl:author ?author .
[5] Number of books by counting what is authored, using dbpedia-owl:author or dbprop:author

Saturday, June 21, 2014

Harald's Original Miscellany - Prolificacy vs. Popularity in Literature, Part 2

It might have been a surprise for you that, according to Wikipedia, John Simpson, editor in chief of the Oxford English Dictionary, is the most popular author [1]. Thus, we have to take a look at the notion of "popularity". In scientific publishing, an "important" author is an author whose works are cited (referenced) by many other authors. In this way, a ranking among the most important authors has been established. In Wikipedia, we can follow this approach and simply count how many articles reference the article of an author or the articles of the author's books. In our approach, this was simply the number of incoming pageLinks of the referenced articles. Currently, we are adopting the PageRank algorithm to achieve a better measure of the "importance" of an article. You all know the PageRank algorithm, named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of website pages [2]. We will get back to this when we have finished the PageRank integration into DBpedia.
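To get a feeling for the difference between counting incoming links and PageRank, here is a textbook-style sketch in Python. It is not our DBpedia integration, just the plain iterative algorithm on a made-up four-page graph; the damping factor of 0.85 and the even redistribution for pages without outgoing links are the usual textbook assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass the rank on in equal parts to all linked pages.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new[p] += share
        rank = new
    return rank

# Toy graph: A and B link to C, C links to D, D links nowhere.
links = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

By indegree, C (2 incoming links) beats D (1 incoming link). With PageRank, D ends up ahead of C, because its single link comes from an important page. This is exactly the kind of difference a plain link count cannot capture.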

But let's discuss the results we achieved in our last blog post. We stated that there are 15,328 authors [3] referenced in DBpedia. Well, of course there might be more authors, but this is the number of individuals that belong to the DBpedia class of authors. The question is: when we compiled the popularity vs. prolificacy statistics, did we include the works of all 15,328 authors? Surely not. Let's first check for how many authors there are books referenced in DBpedia: it's only 3,763 authors with referenced books [4]. Thus, we don't have any idea about the remaining roughly 11,500 authors who don't have any of their books referenced in DBpedia. They must have written some book, otherwise they would not be an "author"....

OK, so we are dealing with insufficient information. If we look at the available data, we could also use another property, called dbpedia-owl:notableWork denoting only the "important" works of an author (if we relate it with authors and not with other artists). Now it would be interesting, to repeat our statistics based on this property and look, if there is a significant difference. First let's look at the most prolific authors with respect to referenced "notable works":

name numOfNotableWorks popularityOfNotableWorks
"Roald Dahl"@en 16 53.8
"Joseph Conrad"@en 9 68.2
"Charles Dickens"@en 8 398.7
"Roger Zelazny"@en 7 15.7
"Henryk Sienkiewicz"@en 6 53.8
"Tom Wolfe"@en 6 44.3
"Margaret Atwood"@en 6 41.8
"Henry James"@en 6 74.0
"Ismail Kadare"@en 6 7.8
"Alfred Döblin"@en 6 13.3
"Stephen King"@en 6 133.1
"Robert Girardi"@en 6 3.0
"Hunter S. Thompson"@en 5 56.4
"Joanne Harris"@en 5 16.4
"David Mitchell (author)"@en 5 25.6
"Chris Kuzneski"@en 5 7.0
"Jackie French"@en 5 8.2
"Sita Ram Goel"@en 5 7.0
"Peter Hitchens"@en 5 9.2
"Chinua Achebe"@en 5 31.6
"Colm Tóibín"@en 5 17.4
"E. L. Doctorow"@en 5 26.4
"Don DeLillo"@en 5 27.4
"Johann Wolfgang von Goethe"@en 5 72.6
"Samuel Beckett"@en 5 25.4
"Cormac McCarthy"@en 5 55.6
"Brian O'Nolan"@en 5 29.8
"Poppy Z. Brite"@en 5 11.2
"William Trevor"@en 5 20.2
"Neil Gaiman"@en 5 84.8
"Dean Koontz"@en 5 10.4
"George MacDonald"@en 5 21.6
"Ann Bannon"@en 5 9.0
"William Faulkner"@en 4 70.7
"Will Self"@en 4 10.0
"Alan Hollinghurst"@en 4 23.5
"George Eliot"@en 4 81.0
"Malcolm Gladwell"@en 4 39.0
"C. S. Lewis"@en 4 43.2
"Hermann Hesse"@en 4 53.2

There is a huge difference. Fewer science fiction authors, more international authors, and the average "popularity" of the top 40 authors' works has also increased significantly. We also see that for Charles Dickens, the popularity of his 8 notable works (398.7) differs less from that of his 30 works overall (159.5) than it does for Stephen King, whose 6 notable works (133.1) are far more popular than the overall average from last time (44.0) for his 75 books. It is also interesting that Roald Dahl is mentioned with 16 notable works. Either the Wikipedia authors could not decide which of Roald Dahl's works really are notable, or the article must have been written by a huge fan. OK, so we might learn that it is in the eye of the beholder which works of an author are referenced as being "notable".

Let's order the list now again according to the popularity measure and see what happens. This is the Top 40 list of authors with the most popular works (on average) with respect to "notable works" only:

name numOfNotableWorks popularityOfNotableWorks
"Bram Stoker"@en 1 1145.0
"Lewis Carroll"@en 2 939.0
"Lucifer Chu"@en 1 612.0
"Miguel de Cervantes"@en 2 611.0
"J. R. R. Tolkien"@en 2 566.0
"Robert Louis Stevenson"@en 3 448.0
"Alexandre Dumas"@en 2 427.5
"Oscar Wilde"@en 1 427.0
"Kenneth Grahame"@en 1 423.0
"Walter Prescott Webb"@en 1 423.0
"James Vincent Murphy"@en 1 421.0
"George Orwell"@en 3 412.3
"Charles Dickens"@en 8 398.7
"Emily Brontë"@en 1 387.0
"Leo Tolstoy"@en 2 374.5
"John Bunyan"@en 1 341.0
"Louisa May Alcott"@en 1 314.0
"Mark Twain"@en 2 307.5
"Jonathan Swift"@en 2 303.5
"Emma Orczy"@en 1 282.0
"Charlotte Brontë"@en 2 258.0
"Ian McFarlane"@en 1 255.0
"William Golding"@en 1 251.0
"Anne Frank"@en 1 246.0
"James Fenimore Cooper"@en 1 246.0
"Margaret Mitchell"@en 2 242.0
"Frans Sammut"@en 2 233.5
"Rudyard Kipling"@en 2 230.5
"Dan Brown"@en 3 224.3
"John Steinbeck"@en 3 218.0
"William Gibson"@en 1 216.0
"Ayn Rand"@en 2 210.5
"Harriet McDougal"@en 1 207.0
"Jim Butcher"@en 1 200.0
"Richard Adams"@en 1 192.0
"William Makepeace Thackeray"@en 1 191.0
"T. S. Eliot"@en 2 186.0
"Aldous Huxley"@en 3 175.3
"Hitoshi Igarashi"@en 1 175.0
"Alice Walker"@en 1 174.0

This looks a lot more like what we would have expected in the first place. Bram Stoker is in first place with only 1 notable work (we all know which book is meant by that :) with an impressive score. Lewis Carroll in second place with his Alice in Wonderland stories is also no wonder. But who is Lucifer Chu? This is something of a surprise for me. You won't find Lucifer Chu in the German version of DBpedia, so why does our result suggest that he is a popular author? Well, if you look at the very short Wikipedia article in the English version, you will find out that he is the author of the Chinese translation of J.R.R. Tolkien's The Hobbit and The Lord of the Rings. Tolkien's books follow at rank number 5 with a slightly lower popularity. How can that be? Well, we must examine the available data a little more closely.

Chu is referenced in our list with only 1 book, although we know that he has translated The Hobbit as well as The Lord of the Rings. And J.R.R. Tolkien is referenced with 2 books. Obviously, for Chu only The Hobbit, the most popular of these books, has been counted. But if we look at the DBpedia page of Lucifer Chu, we realize that there are 2 books referenced as notable works for Chu, just as there are 3 books for Tolkien (including The Silmarillion). The reason for this is that The Lord of the Rings is not a member of the class dbpedia-owl:Book. Why is that so? Maybe because The Lord of the Rings consists of 3 different books. Thus, to get a more complete result, we should think about how to get all notable works into that list. But at the same time we should take care to include only books or series of books and to exclude other pieces of work, e.g. paintings, sculptures, movies, photographs, etc.
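The effect of the class filter can be reproduced with a tiny Python sketch. The three titles are Tolkien's notable works mentioned above; the class name "ex:BookSeries" is purely hypothetical (I don't claim that DBpedia uses it), and the function count_notable is made up for this illustration.

```python
BOOK = "dbpedia-owl:Book"
SERIES = "ex:BookSeries"  # hypothetical class, for illustration only

# Type assertions per work; The Lord of the Rings is not typed as a Book.
types = {
    "The Hobbit": {BOOK},
    "The Silmarillion": {BOOK},
    "The Lord of the Rings": {SERIES},
}

notable_works = {
    "J. R. R. Tolkien": [
        "The Hobbit",
        "The Lord of the Rings",
        "The Silmarillion",
    ],
}

def count_notable(author, allowed_types):
    """Count the author's notable works that carry at least one allowed type."""
    return sum(1 for work in notable_works[author] if types[work] & allowed_types)

books_only = count_notable("J. R. R. Tolkien", {BOOK})
books_and_series = count_notable("J. R. R. Tolkien", {BOOK, SERIES})
print(books_only, books_and_series)
```

With the strict filter only 2 of the 3 notable works survive, which matches what we saw in the table. Widening the set of allowed types brings the series back without opening the door to paintings or movies.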

We will take care of this next time... :)

[1] Harald's Original Miscellany - Prolificacy vs. Popularity in Literature, Part 1, 2014/06/20
[2] Brin, S.; Page, L. (1998). "The anatomy of a large-scale hypertextual Web search engine". Computer Networks and ISDN Systems 30: 107–117.
[3] live SPARQL query for the number of authors
[4] live SPARQL query for the number of distinct authors who have written a book, referenced in DBpedia
[5] live SPARQL query for the Top 40 most prolific authors (based on notable works)
[6] live SPARQL query for the Top 40 most popular authors (based on notable works)

Friday, June 20, 2014

Harald's Original Miscellany - Prolificacy vs. Popularity in Literature

Oh wow, it's been quite a while since I wrote my last post here in the blog... But while preparing exercises for the OpenHPI MOOC 'Knowledge Engineering with Semantic Technologies', I was playing around a little with SPARQL to come up with new exercises for the students of the course. To make it short, our current lecture examples all deal with writers and books. Thus, to teach how to query RDF knowledge bases with the SPARQL query language, I chose DBpedia. While trying to think of some interesting toy examples, I started to play around, and the facts that I discovered by chance were so interesting that I totally forgot about my lunch break :)

So here are some interesting facts about books and authors that will be continued in later posts. All presented statistics are based on the (English) Wikipedia (of course for the SPARQL queries we use DBpedia)...but nevertheless, it is Wikipedia knowledge.

There are currently 15,328 authors listed (i.e. they are members of the class dbpedia-owl:Writer). The first thing I wanted to find out was who the most prolific authors are according to Wikipedia (at least this means authors whose works also exist as Wikipedia pages and are connected to them via dbpedia-owl:author).

Well, here are the Top 40 Most Prolific Writers:

name numOfWorks popularityOfWorks
"L. Sprague de Camp"@en 128 10.9
"Agatha Christie"@en 103 32.7
"Isaac Asimov"@en 75 28.3
"Stephen King"@en 75 44.0
"Philip K. Dick"@en 74 14.4
"Edgar Rice Burroughs"@en 73 18.7
"Ruth Rendell"@en 70 5.3
"Dean Koontz"@en 67 6.2
"Lin Carter"@en 64 9.0
"Terry Pratchett"@en 63 41.7
"Jules Verne"@en 63 31.2
"P. G. Wodehouse"@en 63 20.9
"Robert E. Howard"@en 62 10.5
"Gary Paulsen"@en 61 4.1
"K. A. Applegate"@en 61 18.5
"August Derleth"@en 60 5.2
"John Dickson Carr"@en 59 5.9
"H. G. Wells"@en 57 30.0
"James Patterson"@en 56 12.0
"Leslie Charteris"@en 55 8.8
"Robert A. Heinlein"@en 52 36.5
"Rex Stout"@en 52 21.4
"Arthur C. Clarke"@en 50 20.8
"Harry Turtledove"@en 49 11.3
"Ray Bradbury"@en 49 15.3
"David Weber"@en 49 30.9
"Danielle Steel"@en 48 3.3
"J. M. G. Le Clézio"@en 48 3.7
"Henry James"@en 48 18.5
"Piers Anthony"@en 48 14.0
"Clive Cussler"@en 47 10.1
"Roger Zelazny"@en 45 10.5
"Alan Dean Foster"@en 44 6.7
"Joe R. Lansdale"@en 43 4.7
"Gordon R. Dickson"@en 41 5.0
"Marion Zimmer Bradley"@en 41 7.6
"Samuel R. Delany"@en 41 9.0
"Bernard Cornwell"@en 40 16.6
"Enid Blyton"@en 40 6.8
"Dr. Seuss"@en 40 26.5
The average popularity score that you see in the third column corresponds to the number of references (links) from other Wikipedia articles to these books. Interestingly, Agatha Christie as well as Isaac Asimov are rather prolific authors whose books also have above-average popularity. On the other hand, Ruth Rendell and Dean Koontz are rather prolific, but not very popular (at least according to Wikipedia). Most popular in this list are Stephen King and Terry Pratchett.
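How the two columns come about can be sketched in Python on toy data. The authors, books and link counts below are invented, and the function summarize is a stand-in for the grouping that the SPARQL query does.

```python
from collections import defaultdict
from statistics import mean

# (author, book, incoming links of the book's article); made-up data.
works = [
    ("Author A", "Book 1", 10),
    ("Author A", "Book 2", 20),
    ("Author A", "Book 3", 30),
    ("Author B", "Book 4", 100),
]

def summarize(works):
    """Return (author, numOfWorks, popularityOfWorks) rows."""
    per_author = defaultdict(list)
    for author, _book, indegree in works:
        per_author[author].append(indegree)
    return [(a, len(d), mean(d)) for a, d in per_author.items()]

table = summarize(works)
most_prolific = sorted(table, key=lambda r: r[1], reverse=True)
most_popular = sorted(table, key=lambda r: r[2], reverse=True)
print(most_prolific)
print(most_popular)
```

The same rows lead to different winners: the author with three moderately linked books is the most prolific one, while the author with a single, heavily linked book has the highest average popularity.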

Well, let's turn it the other way around. Let's sort this list by the average popularity of the books of these authors....

Here is the Top 40 list of the authors with the most popular books (on average):
name numOfWorks popularityOfWorks
"John Simpson (lexicographer)"@en 1 1627.0
"John Milton"@en 1 669.0
"Kenneth Grahame"@en 1 423.0
"Emily Brontë"@en 1 387.0
"Wilhelm Grimm"@en 1 343.0
"Jacob Grimm"@en 1 343.0
"Harper Lee"@en 1 334.0
"Lewis Carroll"@en 6 319.8
"Miguel de Cervantes"@en 4 310.5
"Jonathan Swift"@en 2 303.5
"Wilbert Awdry"@en 1 296.0
"Cao Xueqin"@en 1 291.0
"Giovanni Boccaccio"@en 1 287.0
"William Shakespeare"@en 3 264.6
"Ian McFarlane"@en 1 255.0
"Antoine de Saint-Exupéry"@en 1 253.0
"Margaret Mitchell"@en 2 242.0
"Suetonius"@en 1 229.0
"Roger Hargreaves"@en 2 214.0
"Joseph O'Neill (writer)"@en 1 211.0
"T. S. Eliot"@en 2 186.0
"George Bernard Shaw"@en 2 184.5
"Harriet Beecher Stowe"@en 3 173.6
"Petronius"@en 1 162.0
"Charles Dickens"@en 30 159.5
"Johanna Spyri"@en 1 156.0
"Herman Melville"@en 7 154.7
"Jaroslav Hašek"@en 1 152.0
"Pierre Choderlos de Laclos"@en 1 152.0
"Oscar Wilde"@en 3 148.0
"Dave Arneson"@en 3 146.0
"Ngô Sĩ Liên"@en 1 145.0
"John Eric Holmes"@en 2 143.5
"Carlo Collodi"@en 1 142.0
"George Orwell"@en 11 135.0
"Erik Mona"@en 3 134.6
"Daniel Defoe"@en 5 132.8
"Monte Cook"@en 3 132.3
"Apuleius"@en 1 132.0
"Dan Brown"@en 6 126.6
Possibly you have never heard of John Simpson? But you will have heard of the Oxford English Dictionary. Well, John Simpson was its Chief Editor...that makes sense, doesn't it? What about Kenneth Grahame? Maybe you know his novel The Wind in the Willows, published in 1908...

This list seems to be more about literary excellence. Only one author with a rather prolific output is found, namely Charles Dickens with 30 listed books. But finding Dan Brown on this list as well tells me that popularity does not imply literary excellence or quality. At least he is last among the Top 40, after Herman Melville, George Orwell, Daniel Defoe, Lewis Carroll, and Jonathan Swift. On the other hand, John Milton did not become rich with his one-shot Paradise Lost, although it is rather popular.

Here are the links to the online queries to get the most recent and complete results:
Enjoy....I'll be back when I find something interesting again ;-)