Re: [Wiki-research-l] 2012 top pageview list

2013-01-03 Thread Federico Leva (Nemo)

Kerry Raymond, 02/01/2013 22:46:

The problem (as always) is that there is a difference between pages served
(by the web server) and pages actually wanted and read by the user.

It would be interesting to have referrer statistics. I'm guessing that
many Wikipedia pages are being referred by Google (and other general
search engines).


See http://stats.wikimedia.org/wikimedia/squids/SquidReportGoogle.htm

Nemo



Re: [Wiki-research-l] 2012 top pageview list

2013-01-03 Thread Kerry Raymond
Sorry, I meant the referrer stats for the top pages of 2012 in the hope
that some unusual patterns might shed some light on why some of these pages
are so popular (contrary to what common sense might suggest).

Kerry

-----Original Message-----
From: Federico Leva (Nemo) [mailto:nemow...@gmail.com] 
Sent: Thursday, 3 January 2013 10:26 PM
To: kerry.raym...@gmail.com; Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] 2012 top pageview list

Kerry Raymond, 02/01/2013 22:46:
 The problem (as always) is that there is a difference between pages served
 (by the web server) and pages actually wanted and read by the user.

 It would be interesting to have referrer statistics. I'm guessing that
 many Wikipedia pages are being referred by Google (and other general
 search engines).

See http://stats.wikimedia.org/wikimedia/squids/SquidReportGoogle.htm

Nemo




Re: [Wiki-research-l] 2012 top pageview list

2013-01-03 Thread Andrew G. West

The Google Doodle often explains some of the most unusual entries:

http://en.wikipedia.org/wiki/List_of_Google_Doodles_in_2012

Thanks, -AW


On 01/03/2013 04:06 PM, Kerry Raymond wrote:

Sorry, I meant the referrer stats for the top pages of 2012 in the hope
that some unusual patterns might shed some light on why some of these pages
are so popular (contrary to what common sense might suggest).

Kerry

-----Original Message-----
From: Federico Leva (Nemo) [mailto:nemow...@gmail.com]
Sent: Thursday, 3 January 2013 10:26 PM
To: kerry.raym...@gmail.com; Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] 2012 top pageview list

Kerry Raymond, 02/01/2013 22:46:

The problem (as always) is that there is a difference between pages served
(by the web server) and pages actually wanted and read by the user.

It would be interesting to have referrer statistics. I'm guessing that
many Wikipedia pages are being referred by Google (and other general
search engines).


See http://stats.wikimedia.org/wikimedia/squids/SquidReportGoogle.htm

Nemo





--
Andrew G. West, Doctoral Candidate
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Email:   west...@cis.upenn.edu
Website: http://www.andrew-g-west.com



Re: [Wiki-research-l] 2012 top pageview list

2013-01-02 Thread Kerry Raymond
The problem (as always) is that there is a difference between pages served
(by the web server) and pages actually wanted and read by the user. 

It would be interesting to have referrer statistics. I'm guessing that many
Wikipedia pages are being referred by Google (and other general search
engines). If so, people may just be clicking through a list of search
results, which causes them to download a WP page but then immediately move
on to the next search result because it isn't what they are looking for. I
rather suspect the prominence of Facebook in the English Wikipedia results
is due to this effect, as I often find myself on the Wikipedia page for
Facebook instead of Facebook itself following a Google search. I think the
use of mobile devices (with small screens) probably encourages this sort of
behaviour.

Kerry


-----Original Message-----
From: wiki-research-l-boun...@lists.wikimedia.org
[mailto:wiki-research-l-boun...@lists.wikimedia.org] On Behalf Of Andrew G.
West
Sent: Sunday, 30 December 2012 2:06 PM
To: wiki-research-l@lists.wikimedia.org
Subject: Re: [Wiki-research-l] 2012 top pageview list

The WMF aggregates them as (page,views) pairs on an hourly basis:

http://dumps.wikimedia.org/other/pagecounts-raw/

I've been parsing these and storing them in a queryable DB format (for
en.wp exclusively; though the files are available for all projects I
think) for about two years. If you want to maintain such a fine
granularity, it can quickly become a terabyte-scale task that eats up a
lot of processing time.

If you're looking for coarser-granularity reports (like top views for a
day, week, or month), a lot of efficient aggregation can be done.

See also: http://en.wikipedia.org/wiki/Wikipedia:5000

Thanks, -AW


On 12/28/2012 07:28 PM, John Vandenberg wrote:
 There is a steady stream of blogs and 'news' about these lists


https://encrypted.google.com/search?client=ubuntu&channel=fs&q=%22Sean+hoyland%22&ie=utf-8&oe=utf-8#q=wikipedia+top+2012&hl=en&safe=off&client=ubuntu&tbo=d&channel=fs&tbm=nws&source=lnt&tbs=qdr:w&sa=X&psj=1&ei=GzjeUOPpAsfnrAeQk4DgCg&ved=0CB4QpwUoAw&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.1355534169,d.aWM&fp=4e60e761ee133369&bpcl=40096503&biw=1024&bih=539

 How does a researcher go about obtaining access logs with useragents
 in order to answer some of these questions?


-- 
Andrew G. West, Doctoral Candidate
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Website: http://www.andrew-g-west.com





Re: [Wiki-research-l] 2012 top pageview list

2013-01-01 Thread Andrew G. West
I got a couple of private replies to this thread, so I figured I would 
just answer them publicly for the benefit of the list:



(1) Do I only parse/store English Wikipedia?

Yes; for scalability reasons and because that is my research focus. I'd
consider opening my database to users with specific academic uses, but
it's probably not the most efficient way to do a lot of computations (see
below). Plus, I transfer the older tables to offline drives, so I
probably only have ~6 months of the most recent data online.



(2) Can you provide some insights into your parsing?

First, I began collecting this data for the purposes of:

http://repository.upenn.edu/cis_papers/470/

where I knew the revision IDs of damaging revisions and wanted to reason
about how many people saw that article/RID in its damaged state. This
involved storing data on EVERY article at the finest granularity
possible (hourly) and then assuming uniform intra-hour distributions.
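
To make the intra-hour assumption concrete: if a damaged revision was live
from 14:30 to 16:15, it gets credited with half of hour 14's hits, all of
hour 15's, and a quarter of hour 16's. Below is a toy Python sketch of that
pro-rating -- illustrative only, this is not the paper's code, and the
helper name is mine:

def views_in_window(hourly_hits, start, end):
    # hourly_hits: dict mapping hour index -> hits in that hour.
    # start, end: fractional hours (14.5 = 30 minutes into hour 14).
    total = 0.0
    for hour, hits in hourly_hits.items():
        overlap = min(end, hour + 1) - max(start, hour)
        if overlap > 0:
            total += hits * overlap  # uniform share of that hour's views
    return total

# Damage visible 14:30-16:15: 1200*0.5 + 900*1.0 + 800*0.25 = 1700.0
print(views_in_window({14: 1200, 15: 900, 16: 800}, 14.5, 16.25))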


See the URL below for my code (with the SQL server credentials blanked 
out) that does this work. A nightly [cron] task fires the Java code. It 
goes and downloads an entire day's worth of files (24) and parses them. 
These files contain data for ALL WMF projects and languages, but I use a 
simple string match to only handle en.wp lines. Each column in the 
database represents a single day and contains a binary object wrapping 
(hour, hits) pairs. Each table contains 10 consecutive days of data. 
Much of this design was chosen to accommodate the extremely long tail 
and sparseness of the view distribution; filling a DB with billions of 
NULL values didn't prove to be too efficient in my first attempts. I 
think I use ~1TB yearly for the English Wikipedia data.
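
For a flavor of the blob packing in Python (the actual encoding in the
Java code linked below may differ; the 5-bytes-per-pair format and the
helper names here are illustrative assumptions): store only the hours
that had traffic, so a sparsely viewed article-day costs a few bytes
instead of 24 mostly-NULL columns.

import struct

def pack_day(hour_hits):
    # hour_hits: dict of hour (0-23) -> view count for one article-day.
    # ">BI" = unsigned byte (hour) + unsigned 32-bit int (hits).
    return b"".join(struct.pack(">BI", h, c)
                    for h, c in sorted(hour_hits.items()) if c > 0)

def unpack_day(blob):
    return {h: c for h, c in struct.iter_unpack(">BI", blob)}

blob = pack_day({0: 3, 13: 41})   # two active hours -> 10 bytes
assert unpack_day(blob) == {0: 3, 13: 41}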


If anyone ends up using this code, I would appreciate a cite/acknowledgement
of my original work above. However, I imagine most will want to do a bit
more aggregation, and hopefully this can provide a baseline for doing that.


Thanks, -AW


CODE LINK:
http://www.cis.upenn.edu/~westand/docs/wp_stats.zip



On 12/29/2012 11:06 PM, Andrew G. West wrote:

The WMF aggregates them as (page,views) pairs on an hourly basis:

http://dumps.wikimedia.org/other/pagecounts-raw/

I've been parsing these and storing them in a queryable DB format (for
en.wp exclusively; though the files are available for all projects I
think) for about two years. If you want to maintain such a fine
granularity, it can quickly become a terabyte-scale task that eats up a
lot of processing time.

If you're looking for coarser-granularity reports (like top views for a
day, week, or month), a lot of efficient aggregation can be done.

See also: http://en.wikipedia.org/wiki/Wikipedia:5000

Thanks, -AW


On 12/28/2012 07:28 PM, John Vandenberg wrote:

There is a steady stream of blogs and 'news' about these lists

https://encrypted.google.com/search?client=ubuntu&channel=fs&q=%22Sean+hoyland%22&ie=utf-8&oe=utf-8#q=wikipedia+top+2012&hl=en&safe=off&client=ubuntu&tbo=d&channel=fs&tbm=nws&source=lnt&tbs=qdr:w&sa=X&psj=1&ei=GzjeUOPpAsfnrAeQk4DgCg&ved=0CB4QpwUoAw&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.1355534169,d.aWM&fp=4e60e761ee133369&bpcl=40096503&biw=1024&bih=539


How does a researcher go about obtaining access logs with useragents
in order to answer some of these questions?





--
Andrew G. West, Doctoral Candidate
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Email:   west...@cis.upenn.edu
Website: http://www.andrew-g-west.com



Re: [Wiki-research-l] 2012 top pageview list

2012-12-29 Thread Andrew G. West

The WMF aggregates them as (page,views) pairs on an hourly basis:

http://dumps.wikimedia.org/other/pagecounts-raw/

I've been parsing these and storing them in a queryable DB format (for
en.wp exclusively; though the files are available for all projects I
think) for about two years. If you want to maintain such a fine
granularity, it can quickly become a terabyte-scale task that eats up a
lot of processing time.

If you're looking for coarser-granularity reports (like top views for a
day, week, or month), a lot of efficient aggregation can be done.
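
For example, a daily top-N over one day's 24 hourly files is only a few
lines of Python. A sketch -- the helper name, the file naming, and the
"project title count bytes" line layout are my assumptions from the dumps'
documented format, so double-check against the files you actually download:

import gzip
from collections import Counter

def daily_top(paths, n=25):
    totals = Counter()
    for path in paths:  # the 24 hourly pagecounts-*.gz files for one day
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                # keep desktop English Wikipedia lines only ("en" project)
                if len(parts) == 4 and parts[0] == "en":
                    totals[parts[1]] += int(parts[2])
    return totals.most_common(n)

hourly = ["pagecounts-20121228-%02d0000.gz" % h for h in range(24)]
for title, views in daily_top(hourly):
    print(views, title)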


See also: http://en.wikipedia.org/wiki/Wikipedia:5000

Thanks, -AW


On 12/28/2012 07:28 PM, John Vandenberg wrote:

There is a steady stream of blogs and 'news' about these lists

https://encrypted.google.com/search?client=ubuntu&channel=fs&q=%22Sean+hoyland%22&ie=utf-8&oe=utf-8#q=wikipedia+top+2012&hl=en&safe=off&client=ubuntu&tbo=d&channel=fs&tbm=nws&source=lnt&tbs=qdr:w&sa=X&psj=1&ei=GzjeUOPpAsfnrAeQk4DgCg&ved=0CB4QpwUoAw&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.1355534169,d.aWM&fp=4e60e761ee133369&bpcl=40096503&biw=1024&bih=539

How does a researcher go about obtaining access logs with useragents
in order to answer some of these questions?



--
Andrew G. West, Doctoral Candidate
Dept. of Computer and Information Science
University of Pennsylvania, Philadelphia PA
Website: http://www.andrew-g-west.com



Re: [Wiki-research-l] 2012 top pageview list

2012-12-28 Thread John Vandenberg
Is favicon only in the Chinese Wikipedia top 100?

It seems so, which is odd if the problem is a web browser bug.

John Vandenberg.
sent from Galaxy Note
On Dec 28, 2012 4:07 PM, Johan Gunnarsson johan.gunnars...@gmail.com
wrote:

 On Fri, Dec 28, 2012 at 5:33 AM, John Vandenberg jay...@gmail.com wrote:
  Hi Johan,
 
  Thank you for the lovely data at
 
  https://toolserver.org/~johang/2012.html
 
  I posted that link to my Facebook (below if you want to join in
  there), and a few language-specific Facebook groups, and there have
  been some concerns raised about the results, which I'll list below.

  These lists are getting some traction in the press so it would be good
  to understand them better.
 
  http://guardian.co.uk/technology/blog/2012/dec/27/wikipedia-most-viewed

 Cool, cool.

 
  Why is [[zh:Favicon]] #2?
 
  The data doesn't appear to support that
 
  http://stats.grok.se/zh/201201/Favicon
  http://stats.grok.se/zh/latest90/Favicon

 My post-processing filtering follows redirects to find the true
 title. In this case the page Favicon.ico redirects to Favicon. This is
 probably due to broken browsers trying to load the icon.

 
  Number 1 in French is a plant native to Asia.  The stats for December
 disagree
  https://en.wikipedia.org/wiki/Ilex_crenata
  http://stats.grok.se/fr/201212/Houx_cr%C3%A9nel%C3%A9

 On French Wikipedia, Ilex_crenata redirects to Houx_crénelé.

 Ilex_crenata had huge traffic in April:
 http://stats.grok.se/fr/201204/Ilex_crenata

 There are a bunch of spikes like this. I can't really explain it. I
 talked to Domas Mituzas (the maintainer of the original dumps I use)
 yesterday and he suggested it might be bots going crazy for whatever
 reason. I'd love to filter all these false positives, but haven't been
 able to come up with an easy way to do it.

 Might be possible with access to logs with the user-agent string, but
 that would probably inflate the dataset size even more. It's already
 past the terabyte mark. However, that could probably be solved by sampling
 (for example) 1/100 of the entries.

 Comments and ideas are welcome!

 
  Number 1 in German is Cul de sac. This is odd, but matches the stats
  http://stats.grok.se/de/201207/Sackgasse

 Right. This one is funny. It has huge traffic on weekdays only.
 Deserted on weekends.

 
  Number 1 in Dutch is a Chinese mountain.  The stats for December disagree
  http://stats.grok.se/nl/201212/Hua_Shan

 July/August agree: http://stats.grok.se/nl/201208/Hua_Shan

 
  Number 4 in Hebrew is zipper.  The stats for December disagree
  http://stats.grok.se/he/201212/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F

 April agrees:
 http://stats.grok.se/he/201204/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F

 
  Number 2 in Spanish is '@'.  This is odd, but matches the stats
  http://stats.grok.se/es/201212/Arroba_%28s%C3%ADmbolo%29
 
  --
  John Vandenberg
  https://www.facebook.com/johnmark.vandenberg



Re: [Wiki-research-l] 2012 top pageview list

2012-12-28 Thread Tilman Bayer
On Fri, Dec 28, 2012 at 10:24 AM, John Vandenberg jay...@gmail.com wrote:

 Is favicon only in the Chinese Wikipedia top 100?

 It seems so, and is odd if the problem is a web browser bug.

 John Vandenberg.
 sent from Galaxy Note
 On Dec 28, 2012 4:07 PM, Johan Gunnarsson johan.gunnars...@gmail.com
 wrote:

  On Fri, Dec 28, 2012 at 5:33 AM, John Vandenberg jay...@gmail.com
 wrote:
  Hi Johan,
 
  Thank you for the lovely data at
 
  https://toolserver.org/~johang/2012.html
 
  I posted that link to my Facebook (below if you want to join in
  there), and a few language-specific Facebook groups, and there have
  been some concerns raised about the results, which I'll list below.

  These lists are getting some traction in the press so it would be good
  to understand them better.
 
  http://guardian.co.uk/technology/blog/2012/dec/27/wikipedia-most-viewed

 Cool, cool.


 
  Why is [[zh:Favicon]] #2?
 
  The data doesn't appear to support that
 
  http://stats.grok.se/zh/201201/Favicon
  http://stats.grok.se/zh/latest90/Favicon

 My post-processing filtering follows redirects to find the true
 title. In this case the page Favicon.ico redirects to Favicon. This is
 probably due to broken browsers trying to load the icon.


 
  Number 1 in French is a plant native to Asia.  The stats for December
 disagree
  https://en.wikipedia.org/wiki/Ilex_crenata
  http://stats.grok.se/fr/201212/Houx_cr%C3%A9nel%C3%A9

 On French Wikipedia, Ilex_crenata redirects to Houx_crénelé.

 Ilex_crenata had huge traffic in April:
 http://stats.grok.se/fr/201204/Ilex_crenata

 There are a bunch of spikes like this. I can't really explain it. I
 talked to Domas Mituzas (the maintainer of the original dumps I use)
 yesterday and he suggested it might be bots going crazy for whatever
 reason. I'd love to filter all these false positives, but haven't been
 able to come up with an easy way to do it.

 Might be possible with access to logs with the user-agent string, but
 that would probably inflate the dataset size even more. It's already
 past the terabyte mark. However, that could probably be solved by sampling
 (for example) 1/100 of the entries.

 Comments and ideas are welcome!


 
  Number 1 in German is Cul de sac. This is odd, but matches the stats
  http://stats.grok.se/de/201207/Sackgasse

 Right. This one is funny. It has huge traffic on weekdays only.
 Deserted on weekends.

This has been noted on the dewiki village pump before. The most interesting
guess there (by Benutzer:YMS,
https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Sackgasse_als_Top_Artikel_.3F.21):
there might be web filtering software installed on workplace PCs in
companies which redirects all prohibited URLs to the German Wikipedia
article on cul-de-sac. This would explain the weekly pattern, and also
http://stats.grok.se/de/201112/Sackgasse (December 25-26 are holidays in
Germany, and many employees take the rest of the year off).




 
  Number 1 in Dutch is a Chinese mountain.  The stats for December
 disagree
  http://stats.grok.se/nl/201212/Hua_Shan

 July/August agree: http://stats.grok.se/nl/201208/Hua_Shan


 
  Number 4 in Hebrew is zipper.  The stats for December disagree
  http://stats.grok.se/he/201212/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F

 April agrees:
 http://stats.grok.se/he/201204/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F


 
  Number 2 in Spanish is '@'.  This is odd, but matches the stats
  http://stats.grok.se/es/201212/Arroba_%28s%C3%ADmbolo%29
 
  --
  John Vandenberg
  https://www.facebook.com/johnmark.vandenberg
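
On the 1-in-100 sampling idea quoted above: hashing each log line and
keeping the lines whose hash falls in 1/100 of the space gives a
deterministic sample -- re-running over the same logs selects the same
subset, unlike random sampling. A Python sketch; the file name and the
helper are hypothetical:

import hashlib

def keep_line(line, rate=100):
    # Keep roughly 1 in `rate` lines, deterministically.
    digest = hashlib.md5(line.encode("utf-8")).hexdigest()
    return int(digest, 16) % rate == 0

sample = [line for line in open("access.log", encoding="utf-8")
          if keep_line(line)]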






-- 
Tilman Bayer
Senior Operations Analyst (Movement Communications)
Wikimedia Foundation
IRC (Freenode): HaeB


Re: [Wiki-research-l] 2012 top pageview list

2012-12-28 Thread John Vandenberg
There is a steady stream of blogs and 'news' about these lists

https://encrypted.google.com/search?client=ubuntu&channel=fs&q=%22Sean+hoyland%22&ie=utf-8&oe=utf-8#q=wikipedia+top+2012&hl=en&safe=off&client=ubuntu&tbo=d&channel=fs&tbm=nws&source=lnt&tbs=qdr:w&sa=X&psj=1&ei=GzjeUOPpAsfnrAeQk4DgCg&ved=0CB4QpwUoAw&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&bvm=bv.1355534169,d.aWM&fp=4e60e761ee133369&bpcl=40096503&biw=1024&bih=539

How does a researcher go about obtaining access logs with useragents
in order to answer some of these questions?

-- 
John Vandenberg



[Wiki-research-l] 2012 top pageview list

2012-12-27 Thread John Vandenberg
Hi Johan,

Thank you for the lovely data at

https://toolserver.org/~johang/2012.html

I posted that link to my Facebook (below if you want to join in
there), and a few language-specific Facebook groups, and there have
been some concerns raised about the results, which I'll list below.

These lists are getting some traction in the press so it would be good
to understand them better.

http://guardian.co.uk/technology/blog/2012/dec/27/wikipedia-most-viewed

Why is [[zh:Favicon]] #2?

The data doesn't appear to support that

http://stats.grok.se/zh/201201/Favicon
http://stats.grok.se/zh/latest90/Favicon

Number 1 in French is a plant native to Asia.  The stats for December disagree
https://en.wikipedia.org/wiki/Ilex_crenata
http://stats.grok.se/fr/201212/Houx_cr%C3%A9nel%C3%A9

Number 1 in German is Cul de sac. This is odd, but matches the stats
http://stats.grok.se/de/201207/Sackgasse

Number 1 in Dutch is a Chinese mountain.  The stats for December disagree
http://stats.grok.se/nl/201212/Hua_Shan

Number 4 in Hebrew is zipper.  The stats for December disagree
http://stats.grok.se/he/201212/%D7%A8%D7%95%D7%9B%D7%A1%D7%9F

Number 2 in Spanish is '@'.  This is odd, but matches the stats
http://stats.grok.se/es/201212/Arroba_%28s%C3%ADmbolo%29

-- 
John Vandenberg
https://www.facebook.com/johnmark.vandenberg

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l