I'm making a crossword-style word game, and I'm trying to automate the process 
of creating the puzzles, at least somewhat.

I am hoping to find or create a list of English Wikipedia page titles, sorted 
roughly by how "recognizable" they are, where by recognizable I mean something 
like, "how likely it is that the average American on the street will be 
familiar with the name/phrase/subject".


For instance, on a recognizability scale from 0 to 100, I might score (just 
guessing here):


    Lady_Gaga = 90
    Lady_Jane_Grey = 10
    Lady_and_the_Tramp = 90
    Lady_Antebellum = 5
    Lady-in-waiting = 70
    Lady_Bird_Johnson = 65
    Lady_Marmalade = 10
    Ladysmith_Black_Mambazo = 10


One suggestion would just be to use page length (either the number of 
characters or the rendered length of the page) as a proxy for recognizability. 
That might work, but it feels kind of crude, and it would certainly produce 
many false positives: long, detailed articles on subjects most people have 
never heard of, such as Bose-Einstein_condensation.
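
If I did go the page-length route, here is roughly what I imagine it would 
involve, as an untested Python sketch: streaming the page table dump 
(enwiki-latest-page.sql.gz) and reading its page_len column. The column 
positions below are my assumption and would need to be checked against the 
CREATE TABLE statement at the top of the file.

    # Rough sketch (untested): stream enwiki-latest-page.sql.gz and print
    # (page_len, page_title) for main-namespace pages. The column indexes below
    # (0 = page_id, 1 = page_namespace, 2 = page_title, page_len near the end
    # of the tuple) differ between MediaWiki versions, so verify them against
    # the CREATE TABLE statement at the top of the dump.
    import gzip

    NS_COL, TITLE_COL, LEN_COL = 1, 2, 10   # LEN_COL is a guess; check the schema

    def split_rows(values):
        """Split the VALUES part of an INSERT statement into lists of raw fields."""
        rows, row, field = [], [], []
        in_string = escaped = in_tuple = False
        for ch in values:
            if in_string:
                field.append(ch)
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == "'":
                    in_string = False
            elif ch == "'":
                in_string = True
                field.append(ch)
            elif ch == "(":
                in_tuple, row, field = True, [], []
            elif ch == ")":
                row.append("".join(field))
                rows.append(row)
                in_tuple, field = False, []
            elif ch == "," and in_tuple:
                row.append("".join(field))
                field = []
            elif in_tuple:
                field.append(ch)
            # commas, whitespace and ';' between tuples are ignored
        return rows

    def iter_page_lengths(path="enwiki-latest-page.sql.gz"):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if not line.startswith("INSERT INTO"):
                    continue
                values = line.split(" VALUES ", 1)[1]
                for row in split_rows(values):
                    if row[NS_COL] == "0":
                        # titles keep underscores and MySQL escaping (e.g. \')
                        yield row[TITLE_COL].strip("'"), int(row[LEN_COL])

    if __name__ == "__main__":
        # Print the 50 longest articles as a sanity check.
        longest = sorted(iter_page_lengths(), key=lambda p: p[1], reverse=True)[:50]
        for title, length in longest:
            print("%d\t%s" % (length, title))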

Someone suggested to me that I might count incoming page links, and referred me 
to http://dumps.wikimedia.org/enwiki/latest/ and in particular the file 
enwiki-latest-pagelinks.sql.gz. I downloaded and looked at that file but 
couldn't understand whether/how the linking structure was represented.
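
For what it's worth, here is the kind of counting I have in mind for the link 
approach, as a rough, untested Python sketch. It assumes each row tuple in the 
pagelinks dump begins with (pl_from, pl_namespace, pl_title), so that the 
incoming-link count for a page is just the number of rows naming it as the 
target; I haven't confirmed that column order, so it would need to be checked 
against the CREATE TABLE statement near the top of the file.

    # Rough sketch (untested): count incoming links per article title by
    # streaming enwiki-latest-pagelinks.sql.gz. Assumes each row tuple starts
    # with (pl_from, pl_namespace, pl_title); confirm the column order against
    # the CREATE TABLE statement near the top of the dump.
    import gzip
    import re
    from collections import Counter

    # Matches the leading (pl_from, pl_namespace, 'pl_title' of each row tuple;
    # the title pattern tolerates backslash-escaped quotes inside the title.
    ROW = re.compile(rb"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'")

    def count_incoming_links(path="enwiki-latest-pagelinks.sql.gz"):
        counts = Counter()
        with gzip.open(path, "rb") as f:
            for line in f:
                if not line.startswith(b"INSERT INTO"):
                    continue
                for _source_id, ns, title in ROW.findall(line):
                    if ns == b"0":  # only links targeting the article namespace
                        counts[title.decode("utf-8", "replace")] += 1
        return counts

    if __name__ == "__main__":
        # Titles come out with underscores and MySQL escaping (e.g. \') intact,
        # and holding every distinct title in memory is heavy; fine for a sketch.
        counts = count_incoming_links()
        for title, n in counts.most_common(50):
            print("%d\t%s" % (n, title))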

So my questions are:

(1) Do you know if a list like the one I'm trying to make already exists?

(2) If you were going to make a list like this, how would you do it? If it 
were based on page length, which files would you download, and how would you 
process them efficiently? If it were based on incoming links, which files 
specifically would you use, and how would you determine the link count?

Thanks for any help.