You might also consider http://buzzlog.yahoo.com/overall/, which lists the topics the world is searching for.
Bob

On 9/30/2011 1:24 PM, Ian Woollard wrote:
> The raw dumps are here:
>
> http://dammit.lt/wikistats/
>
> IIRC the compressed files consist of the list of the articles that were
> accessed, in the order they were retrieved. You have to process them to
> count how often each article was read.
>
> Of course:
>
> http://stats.grok.se/
>
> has done that heavy lifting already, and they keep lists of the most
> popular articles.
>
> On 30 September 2011 18:53, Michael Katz <[email protected]> wrote:
>
>> Thanks for the reply. Can you tell me exactly which dump files you'd
>> look in to find the number of page views, plus any information about
>> finding the page views within those files, if it's not obvious? Is
>> there a way to distinguish between editor page views and reader page
>> views? (Perhaps subtract the number of edits made? If so, how can I
>> find the number of edits made?)
>>
>> Something about page views seems a little funny, because it seems like
>> there are some very recognizable things that just aren't looked up
>> much. But perhaps it's my best hope...
>>
>> ________________________________
>> From: WereSpielChequers <[email protected]>
>> To: Michael Katz <[email protected]>; English Wikipedia
>> <[email protected]>
>> Sent: Friday, September 30, 2011 2:55 AM
>> Subject: Re: [WikiEN-l] finding the "most recognizable" page names
>>
>> Hi Michael,
>>
>> I don't know if such a list exists, other than lists by largest numbers
>> of views.
>>
>> The size of an article probably reflects the interest of one or a few
>> editors and the complexity of the information; I doubt it would closely
>> relate to recognisability. Incoming links is probably a better measure,
>> but it can get awfully skewed by templates, and some links are more
>> meaningful than others.
>>
>> Recognisable in the USA is not necessarily the same as recognisable
>> globally.
>> Ideally, if you want a US-specific list you need US-specific data; if
>> you use a global list you could wind up asking Americans about Johnny
>> Vegas, Abi Titmuss, Jack Straw and Kevin Pietersen. You might also
>> consider the generation you are targeting: Lady_Bird_Johnson would be
>> better known among Americans and older people.
>>
>> I'd suggest using page views per article as the metric, and if you want
>> a specifically US product, screen out articles that don't use American
>> English spelling. Better still would be to get page views from the USA,
>> or at least page views ignoring the six hours when the US is most
>> likely to be asleep.
>>
>> WereSpielChequers
>>
>> On 30 September 2011 04:17, Michael Katz <[email protected]>
>> wrote:
>>
>>> I'm making a crossword-style word game, and I'm trying to automate the
>>> process of creating the puzzles, at least somewhat.
>>>
>>> I am hoping to find or create a list of English Wikipedia page titles,
>>> sorted roughly by how "recognizable" they are, where by recognizable I
>>> mean something like "how likely it is that the average American on the
>>> street will be familiar with the name/phrase/subject".
>>>
>>> For instance, just to take a random example, on a recognizability
>>> scale from 0 to 100, I might score (just guessing here):
>>>
>>> Lady_Gaga = 90
>>> Lady_Jane_Grey = 10
>>> Lady_and_the_Tramp = 90
>>> Lady_Antebellum = 5
>>> Lady-in-waiting = 70
>>> Lady_Bird_Johnson = 65
>>> Lady_Marmalade = 10
>>> Ladysmith_Black_Mambazo = 10
>>>
>>> One suggestion would be to use the page length (either the number of
>>> characters or the physical rendered page length) as a proxy for
>>> recognizability. That might work, but it feels kind of crude, and it
>>> would certainly produce many false positives, such as
>>> Bose-Einstein_condensation.
>>> Someone suggested to me that I might count incoming page links, and
>>> referred me to http://dumps.wikimedia.org/enwiki/latest/ and in
>>> particular the file enwiki-latest-pagelinks.sql.gz. I downloaded and
>>> looked at that file but couldn't understand whether/how the linking
>>> structure was represented.
>>>
>>> So my questions are:
>>>
>>> (1) Do you know if a list like the one I'm trying to make already
>>> exists?
>>>
>>> (2) If you were going to make a list like this, how would you do it?
>>> If it were based on page length, which files would you download and
>>> process to make it as efficient as possible? If it were based on
>>> incoming links, which files specifically would you use, and how would
>>> you determine the link count?
>>>
>>> Thanks for any help.
>>>
>>> _______________________________________________
>>> WikiEN-l mailing list
>>> [email protected]
>>> To unsubscribe from this mailing list, visit:
>>> https://lists.wikimedia.org/mailman/listinfo/wikien-l
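The tallying Ian describes — reading an hourly dump and counting how often each article was viewed — can be sketched as follows. This is a minimal sketch, assuming the hourly pagecounts format used by the dammit.lt/wikistats files, where each line is `project page_title view_count bytes_transferred`; the function name `top_articles` is just an illustrative choice.

```python
import gzip
from collections import Counter

def top_articles(path, project="en", limit=20):
    """Tally per-article view counts from one gzipped hourly pagecounts file.

    Assumes each line has the form: <project> <page_title> <views> <bytes>.
    Lines that don't fit that shape (or belong to other projects) are skipped.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 4 or parts[0] != project:
                continue
            title, views = parts[1], parts[2]
            if views.isdigit():
                counts[title] += int(views)  # same title may repeat across files
    return counts.most_common(limit)
```

To cover a longer period you would run this over many hourly files and merge the `Counter` objects — which is essentially the aggregation that stats.grok.se already does.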
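For Michael's incoming-links question, the linking structure in enwiki-latest-pagelinks.sql.gz is stored as huge SQL `INSERT` statements rather than one link per line, which is likely why it looked opaque. A rough sketch of counting incoming links per title is below; it assumes the pagelinks row layout of that era, `(pl_from, pl_namespace, pl_title)`, which should be verified against the `CREATE TABLE` statement at the top of the actual dump.

```python
import gzip
import re
from collections import Counter

# Assumed row layout: (pl_from, pl_namespace, pl_title) -- check the dump's
# CREATE TABLE header before relying on this. Titles are single-quoted SQL
# strings with backslash escapes.
ROW = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")

def count_incoming_links(path, namespace=0):
    """Count incoming links per title from a gzipped pagelinks SQL dump.

    Only links pointing into the given namespace (0 = articles) are counted.
    """
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.startswith("INSERT INTO"):
                continue
            for _from_page, ns, title in ROW.findall(line):
                if int(ns) == namespace:
                    counts[title] += 1
    return counts
```

As WereSpielChequers notes, raw counts like these get badly skewed by templates, so they would need further filtering before serving as a recognizability score.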
