Saturday, June 21, 2014

Harald's Original Miscellany - Prolificacy vs. Popularity in Literature, Part 2

It might have come as a surprise to you that, according to Wikipedia, John Simpson, editor in chief of the Oxford English Dictionary, is the most popular author [1]. Thus, we have to take a closer look at the notion of "popularity". In scientific publishing, an "important" author is an author whose works are cited (referenced) by many other authors. In this way, a ranking of the most important authors can be established. In Wikipedia, we can follow this approach and simply count how many articles reference the article of an author or the articles of the author's books. In our approach, this was simply the number of incoming pageLinks of the referenced articles. Currently, we are adopting the PageRank algorithm to achieve a better measure of the "importance" of an article. You all know the PageRank algorithm, named after Larry Page, one of the founders of Google. PageRank is a way of measuring the importance of web pages [2]. We will get back to this when we have finished integrating PageRank into DBpedia.
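To make the difference between the two measures concrete, here is a minimal Python sketch. It is not the actual DBpedia integration: the link graph, the page names, the damping factor of 0.85, and the fixed number of iterations are all assumptions chosen for illustration. It contrasts plain counting of incoming links with a basic power-iteration PageRank.

```python
# Toy link graph (all page names are made up): page -> pages it links to.
links = {
    "A": ["X"], "B": ["X"],
    "C": ["Hub"], "D": ["Hub"], "E": ["Hub"], "F": ["Hub"],
    "Hub": ["Y"],
    "X": [], "Y": [],
}


def inlink_counts(links):
    """Count incoming links per page (the simple popularity measure)."""
    counts = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(links, damping=0.85, iterations=50):
    """Basic power-iteration PageRank; damping and iterations are assumed values."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


counts = inlink_counts(links)
ranks = pagerank(links)
```

In this toy graph X has more incoming links than Y, but Y ends up with the higher PageRank, because its single link comes from the well-linked Hub page.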

But let's discuss the results we achieved in our last blog post. We stated that there are 15,328 authors [3] referenced in DBpedia. Of course there might be more authors, but this is the number of individuals that belong to the DBpedia class of authors. The question is: when we computed the popularity vs. prolificacy statistics, did we include the works of all 15,328 authors? Surely not. Let's first check for how many of these authors there are books referenced in DBpedia: it's only 3,763 authors with referenced books [4]. Thus, we don't know anything about the remaining 11,565 authors who don't have any of their books referenced in DBpedia. They must have written some book, otherwise they would not be an "author"...
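The size of this gap is quickly worked out from the two counts above. The following small Python sketch only does the arithmetic; the function name `coverage` is made up for illustration.

```python
def coverage(total_authors, authors_with_books):
    """Return (authors without referenced books, share of authors with books)."""
    missing = total_authors - authors_with_books
    share = authors_with_books / total_authors
    return missing, share


# The two counts reported in the post: all authors [3], authors with books [4].
missing, share = coverage(15328, 3763)
print(missing)         # 11565
print(f"{share:.1%}")  # 24.5%
```

So our statistics from last time covered only about a quarter of the authors known to DBpedia.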

OK, so we are dealing with insufficient information. If we look at the available data, we could also use another property, dbpedia-owl:notableWork, which denotes only the "important" works of an author (if we restrict it to authors and not to other artists). Now it would be interesting to repeat our statistics based on this property and see whether there is a significant difference. First, let's look at the most prolific authors with respect to referenced "notable works" [5]:

name numOfNotableWorks popularityOfNotableWorks
"Roald Dahl"@en 16 53.8
"Joseph Conrad"@en 9 68.2
"Charles Dickens"@en 8 398.7
"Roger Zelazny"@en 7 15.7
"Henryk Sienkiewicz"@en 6 53.8
"Tom Wolfe"@en 6 44.3
"Margaret Atwood"@en 6 41.8
"Henry James"@en 6 74.0
"Ismail Kadare"@en 6 7.8
"Alfred Döblin"@en 6 13.3
"Stephen King"@en 6 133.1
"Robert Girardi"@en 6 3.0
"Hunter S. Thompson"@en 5 56.4
"Joanne Harris"@en 5 16.4
"David Mitchell (author)"@en 5 25.6
"Chris Kuzneski"@en 5 7.0
"Jackie French"@en 5 8.2
"Sita Ram Goel"@en 5 7.0
"Peter Hitchens"@en 5 9.2
"Chinua Achebe"@en 5 31.6
"Colm Tóibín"@en 5 17.4
"E. L. Doctorow"@en 5 26.4
"Don DeLillo"@en 5 27.4
"Johann Wolfgang von Goethe"@en 5 72.6
"Samuel Beckett"@en 5 25.4
"Cormac McCarthy"@en 5 55.6
"Brian O'Nolan"@en 5 29.8
"Poppy Z. Brite"@en 5 11.2
"William Trevor"@en 5 20.2
"Neil Gaiman"@en 5 84.8
"Dean Koontz"@en 5 10.4
"George MacDonald"@en 5 21.6
"Ann Bannon"@en 5 9.0
"William Faulkner"@en 4 70.7
"Will Self"@en 4 10.0
"Alan Hollinghurst"@en 4 23.5
"George Eliot"@en 4 81.0
"Malcolm Gladwell"@en 4 39.0
"C. S. Lewis"@en 4 43.2
"Hermann Hesse"@en 4 53.2

There is a huge difference: fewer science fiction authors, more international authors, and the average "popularity" of the top 40 authors' works has also increased significantly. We also see that for Charles Dickens, the popularity of his 8 notable works (398.7) does not differ as much from the popularity of his overall 30 works (159.5) as it does for Stephen King, where the popularity of his 6 notable works (133.1) is much higher than last time's overall average (44) for his 75 books. It is also interesting that Roald Dahl is listed with 16 notable works. Either the Wikipedia authors could not decide which of Roald Dahl's works really are notable, or the article must have been written by a huge fan. OK, so we learn that it is in the eye of the beholder which works of an author are referenced as being "notable".
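For readers who want to see how the two columns of the table above come about, here is a minimal Python sketch. It assumes that "popularity" is the mean number of incoming links over an author's notable works, as described above; the sample rows and the name `author_stats` are made up, and the real numbers come from the SPARQL queries, not from this code.

```python
from collections import defaultdict

# Made-up sample rows: (author, notable work, incoming links of the work's article).
sample_rows = [
    ("Author A", "Work 1", 100),
    ("Author A", "Work 2", 50),
    ("Author B", "Work 3", 300),
]


def author_stats(rows):
    """Map each author to (number of notable works, mean incoming links per work)."""
    links_per_author = defaultdict(list)
    for author, _work, inlinks in rows:
        links_per_author[author].append(inlinks)
    return {
        author: (len(links), round(sum(links) / len(links), 1))
        for author, links in links_per_author.items()
    }


stats = author_stats(sample_rows)
# Same data, two orderings: by number of works and by average popularity.
by_prolificacy = sorted(stats, key=lambda a: stats[a][0], reverse=True)
by_popularity = sorted(stats, key=lambda a: stats[a][1], reverse=True)
```

Note that an author with a single, heavily linked work can easily top the popularity ordering, since the average is then just the score of that one work.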

Let's now reorder the list according to the popularity measure and see what happens. This is the top 40 list of authors with the most popular works (on average) with respect to "notable works" only [6]:

name numOfNotableWorks popularityOfNotableWorks
"Bram Stoker"@en 1 1145.0
"Lewis Carroll"@en 2 939.0
"Lucifer Chu"@en 1 612.0
"Miguel de Cervantes"@en 2 611.0
"J. R. R. Tolkien"@en 2 566.0
"Robert Louis Stevenson"@en 3 448.0
"Alexandre Dumas"@en 2 427.5
"Oscar Wilde"@en 1 427.0
"Kenneth Grahame"@en 1 423.0
"Walter Prescott Webb"@en 1 423.0
"James Vincent Murphy"@en 1 421.0
"George Orwell"@en 3 412.3
"Charles Dickens"@en 8 398.7
"Emily Brontë"@en 1 387.0
"Leo Tolstoy"@en 2 374.5
"John Bunyan"@en 1 341.0
"Louisa May Alcott"@en 1 314.0
"Mark Twain"@en 2 307.5
"Jonathan Swift"@en 2 303.5
"Emma Orczy"@en 1 282.0
"Charlotte Brontë"@en 2 258.0
"Ian McFarlane"@en 1 255.0
"William Golding"@en 1 251.0
"Anne Frank"@en 1 246.0
"James Fenimore Cooper"@en 1 246.0
"Margaret Mitchell"@en 2 242.0
"Frans Sammut"@en 2 233.5
"Rudyard Kipling"@en 2 230.5
"Dan Brown"@en 3 224.3
"John Steinbeck"@en 3 218.0
"William Gibson"@en 1 216.0
"Ayn Rand"@en 2 210.5
"Harriet McDougal"@en 1 207.0
"Jim Butcher"@en 1 200.0
"Richard Adams"@en 1 192.0
"William Makepeace Thackeray"@en 1 191.0
"T. S. Eliot"@en 2 186.0
"Aldous Huxley"@en 3 175.3
"Hitoshi Igarashi"@en 1 175.0
"Alice Walker"@en 1 174.0

This looks a lot more like what we would have expected in the first place. Bram Stoker is in first place with only 1 notable work, and we all know which book that is :) An impressive score. Lewis Carroll in second place with his Alice in Wonderland stories is also no wonder. But who is Lucifer Chu? This is something of a surprise for me. You won't find Lucifer Chu in the German version of DBpedia, so why does our result suggest that he is a popular author? Well, if you look at the very short article in the English Wikipedia, you will find out that he is the author of the Chinese translation of J. R. R. Tolkien's The Hobbit and The Lord of the Rings. Tolkien's books follow at rank 5 with a slightly lower popularity. How can that be? Well, we must examine the available data a little more closely.

Chu is listed with only 1 book, although we know that he has translated The Hobbit as well as The Lord of the Rings, and J. R. R. Tolkien is listed with 2 books. Obviously, for Chu only The Hobbit, being the most popular of these books, has been counted. But if we look at the DBpedia page of Lucifer Chu, we see that 2 books are referenced as notable works for Chu, and 3 books for Tolkien (including The Silmarillion). The reason is that The Lord of the Rings is not a member of the class dbpedia-owl:Book. Why is that so? Maybe because The Lord of the Rings consists of 3 different books. Thus, to get a more complete result, we should think about how to get all notable works into that list. At the same time, we should take care to include only books or series of books and to exclude other kinds of work such as paintings, sculptures, movies, photographs, etc.
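The effect of the class restriction can be reproduced with a small Python sketch. Everything in it is an assumption for illustration: the type labels are simplified stand-ins (only "Book" corresponds to the class dbpedia-owl:Book named above, while "BookSeries" and "Artwork" are invented labels), and the painter and the painting are hypothetical.

```python
# Illustrative type assignments, not taken from DBpedia.
work_types = {
    "The Hobbit": {"Book"},
    "The Silmarillion": {"Book"},
    "The Lord of the Rings": {"BookSeries"},  # invented label for a series of books
    "Some Painting": {"Artwork"},             # hypothetical non-literary work
}

# Notable works per author as described in the post, plus one hypothetical painter.
notable_works = {
    "J. R. R. Tolkien": ["The Hobbit", "The Lord of the Rings", "The Silmarillion"],
    "Lucifer Chu": ["The Hobbit", "The Lord of the Rings"],
    "Hypothetical Painter": ["Some Painting"],
}


def count_notable(notable_works, work_types, allowed_types):
    """Count, per author, the notable works having at least one allowed type."""
    counts = {}
    for author, works in notable_works.items():
        n = sum(1 for work in works if work_types.get(work, set()) & allowed_types)
        if n:
            counts[author] = n
    return counts


books_only = count_notable(notable_works, work_types, {"Book"})
books_and_series = count_notable(notable_works, work_types, {"Book", "BookSeries"})
```

With the Book-only filter, Tolkien is counted with 2 works and Chu with 1, as in our list. Allowing series as well gives 3 and 2, while the painter stays excluded in both cases.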

We will take care of this next time... :)

[1] Harald's Original Miscellany - Prolificacy vs. Popularity in Literature, Part 1, 2014/06/20
[2] Brin, S.; Page, L. (1998). "The anatomy of a large-scale hypertextual Web search engine". Computer Networks and ISDN Systems 30: 107–117.
[3] live SPARQL query for the number of authors
[4] live SPARQL query for the number of distinct authors who have written a book, referenced in DBpedia
[5] live SPARQL query for the Top 40 most prolific authors (based on notable works)
[6] live SPARQL query for the Top 40 most popular authors (based on notable works)