Re: [Wiki-research-l] diffdb formatted Wikipedia dump

2013-10-11 Thread Diederik van Liere
 From: Susan Biancani inacn...@gmail.com
 Subject: [Wiki-research-l] diffdb formatted Wikipedia dump
 Date: October 3, 2013 10:06:44 PM PDT
 To: wiki-research-l@lists.wikimedia.org
 Reply-To: Research into Wikimedia content and communities
 wiki-research-l@lists.wikimedia.org

 I'm looking for a dump from English Wikipedia in diff format (i.e. each
 entry is the text that was added/deleted since the last edit, rather than
 each entry is the current state of the page).

 The Summer of Research folks provided a handy guide to how to create such
 a dataset from the standard complete dumps here:
 http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
 But the time estimate they give is prohibitive for me (20-24 hours for
 each dump file--there are currently 158--running on 24 cores). I'm a grad
 student in a social science department, and don't have access to extensive
 computing power. I've been paying out of pocket for AWS, but this would get
 expensive.

 There is a diff-format dataset available, but only through April 2011
 (here: http://dumps.wikimedia.org/other/diffdb/). I'd like a diff-format
 dataset covering January 2010 through March 2013 (or everything up to
 March 2013).

 Does anyone know if such a dataset exists somewhere? Any leads or
 suggestions would be much appreciated!

 Hi Susan,

There is no newer version of the dataset than the one you found; that's the
bad news. The good news is that the dataset was generated on really slow
commodity hardware -- what you could do is run the pipeline on AWS against a
smaller wiki, for example the Dutch Wikipedia, and see how long that takes. An
alternative would be to start thinking (with other researchers and Wikimedia
community members) about setting up a small Hadoop cluster in Labs with only
public data. That way you wouldn't have to pay, but it will obviously be less
performant. The Analytics team has Puppet manifests ready that will build an
entire Hadoop cluster.
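
If you do try a smaller wiki first, a rough benchmark does not need the full
wikihadoop setup. The sketch below is only an illustration of the general
approach (stream a pages-meta-history dump and diff consecutive revisions with
difflib); it is not the diffdb tooling itself, and the dump filename and XML
namespace are assumptions that depend on the dump you actually download.

    import bz2
    import difflib
    import time
    import xml.etree.ElementTree as ET

    NS = '{http://www.mediawiki.org/xml/export-0.8/}'  # namespace varies with the dump's schema version
    DUMP = 'nlwiki-latest-pages-meta-history.xml.bz2'  # example filename only

    def revision_diffs(path):
        """Yield (revision id, unified diff against the previous revision) for every revision."""
        prev_lines = []
        with bz2.open(path, 'rt', encoding='utf-8') as f:
            for _, elem in ET.iterparse(f):
                if elem.tag == NS + 'revision':
                    lines = (elem.findtext(NS + 'text') or '').splitlines()
                    rev_id = elem.findtext(NS + 'id')
                    yield rev_id, list(difflib.unified_diff(prev_lines, lines, lineterm=''))
                    prev_lines = lines
                    elem.clear()                 # free the revision we just processed
                elif elem.tag == NS + 'page':
                    prev_lines = []              # next page's first revision diffs against nothing
                    elem.clear()

    if __name__ == '__main__':
        start = time.time()
        for n, (rev_id, diff) in enumerate(revision_diffs(DUMP), 1):
            if n % 10000 == 0:
                print(n, 'revisions diffed in', round(time.time() - start), 'seconds')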

The wikimedia-analytics mailing list is a good place for such a conversation,
or if you need more hands-on help with the diffdb then please come to IRC:
#wikimedia-analytics.

Best,
Diederik
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] diffdb formatted Wikipedia dump

2013-10-08 Thread Klein,Max
Susan,

Hmm, it seems like that is a funny middle ground, where it's too long to fetch
live -- although it's probably less than 158 days. I once read and edited
400,000 pages with pywikibot (3 network IO calls per page: read, external API,
write) in about 20 days. You would have to make two IO calls (read,
getHistory) per user page. I don't know how many user pages there are, but
that might be enough variables to satisfy the system of inequalities you need.
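
To put rough numbers on that, the arithmetic fits in a few lines. The per-call
rate below is derived from the 400,000-pages-in-20-days figure above; the
count of user and user talk pages is a placeholder I have not looked up, so
treat the result as an order-of-magnitude estimate only.

    # Back-of-envelope estimate for fetching histories live via the API.
    SECONDS_PER_CALL = 20 * 86400 / (400_000 * 3)  # ~1.4 s per network call, from the figure above
    CALLS_PER_PAGE = 2                             # read + getHistory
    N_USER_PAGES = 4_000_000                       # placeholder -- look up the real count

    days = N_USER_PAGES * CALLS_PER_PAGE * SECONDS_PER_CALL / 86400
    print(f"roughly {days:.0f} days at that rate")  # about 133 days with these placeholder numbers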

If you are dead set on using Hadoop, maybe you could use the Wikimedia Labs
grid (https://wikitech.wikimedia.org/wiki/Main_Page).
They have some monster power and it's free for bot operators and other tool
runners. Maybe it's also worth asking there whether someone already has
wikihadoop set up.


Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023


From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Susan Biancani
inacn...@gmail.com
Sent: Tuesday, October 08, 2013 3:28 PM
To: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] diffdb formatted Wikipedia dump

Right now, I want all the edits to user pages and user talk pages, 2010-2013. 
But as I keep going with this project, I may want to expand a bit, so I figured 
if I was going to run the wikihadoop software, I might as well only do it once.

I'm hesitant to do this via web scraping, because I think it'll take much 
longer than working with the dump files. However, if you have suggestions on 
how to get the diffs (or a similar format) efficiently from the dump files, I 
would definitely love to hear them.

I appreciate the help and advice!
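
For what it's worth, one way to keep the dump-based route tractable is to
restrict the pages-meta-history dump to the User (namespace 2) and User talk
(namespace 3) pages before doing any diffing. A minimal sketch of that filter,
assuming Python 3; the dump filename is whatever you downloaded and the XML
namespace depends on its schema version:

    import bz2
    import xml.etree.ElementTree as ET

    NS = '{http://www.mediawiki.org/xml/export-0.8/}'  # check your dump's schema version
    WANTED = {'2', '3'}                                # User and User talk namespaces

    def user_and_user_talk_pages(path):
        """Yield titles of pages in the User / User talk namespaces from a history dump."""
        with bz2.open(path, 'rt', encoding='utf-8') as f:
            for _, elem in ET.iterparse(f):
                if elem.tag == NS + 'page':
                    if elem.findtext(NS + 'ns') in WANTED:
                        yield elem.findtext(NS + 'title')
                    elem.clear()  # free the finished page (for huge pages, also clear revisions as you go)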


On Mon, Oct 7, 2013 at 10:44 AM, Pierre-Carl Langlais 
pierrecarl.langl...@gmail.com wrote:
I agree with Klein. If you do not need to exploit the entire Wikipedia
database, requests through a Python scraping library (like Beautiful Soup) are
certainly sufficient and easy to set up. With a random algorithm to select the
page IDs, you can build a good sample.
PCL

On 07/10/13 19:31, Klein, Max wrote:
Hi Susan,

Do you need the entire database diff'd? I.e. all edits ever. Or are you 
interested in a particular subset of the diffs? It would help to know your 
purpose.

For instance I am interested in diffs around specific articles for specific 
dates to study news events. So I calculate the diffs myself using python on 
page histories rather than the entire database.

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023


From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Susan Biancani
inacn...@gmail.com
Sent: Thursday, October 03, 2013 10:06 PM
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] diffdb formatted Wikipedia dump

I'm looking for a dump from English Wikipedia in diff format (i.e. each entry 
is the text that was added/deleted since the last edit, rather than each entry 
is the current state of the page).

The Summer of Research folks provided a handy guide to how to create such a 
dataset from the standard complete dumps here: 
http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
But the time estimate they give is prohibitive for me (20-24 hours for each 
dump file--there are currently 158--running on 24 cores). I'm a grad student in 
a social science department, and don't have access to extensive computing 
power. I've been paying out of pocket for AWS, but this would get expensive.

There is a diff-format dataset available, but only through April 2011 (here:
http://dumps.wikimedia.org/other/diffdb/). I'd like a diff-format dataset
covering January 2010 through March 2013 (or everything up to March 2013).

Does anyone know if such a dataset exists somewhere? Any leads or suggestions 
would be much appreciated!

Susan



___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] diffdb formatted Wikipedia dump

2013-10-07 Thread Pierre-Carl Langlais
I agree with Klein. If you do not need to exploit the entire Wikipedia
database, requests through a Python scraping library (like Beautiful
Soup) are certainly sufficient and easy to set up. With a random
algorithm to select the page IDs, you can build a good sample.

PCL
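
As a sketch of what that random sampling might look like in practice (my
reading of the suggestion, not a description of an existing setup): draw
random pages from the User and User talk namespaces via the MediaWiki API,
then pull each page's revision history. The sample size, politeness delay,
and User-Agent string are assumptions.

    import time
    import requests

    API = 'https://en.wikipedia.org/w/api.php'
    HEADERS = {'User-Agent': 'diff-sampling-sketch/0.1 (research use)'}

    def random_user_pages(n=10):
        """Return titles of n pages drawn at random from the User / User talk namespaces."""
        r = requests.get(API, headers=HEADERS, params={
            'action': 'query', 'list': 'random',
            'rnnamespace': '2|3', 'rnlimit': n, 'format': 'json',
        })
        return [p['title'] for p in r.json()['query']['random']]

    def revisions(title, limit=50):
        """Fetch up to `limit` revisions (ids, timestamps, text) for one page."""
        r = requests.get(API, headers=HEADERS, params={
            'action': 'query', 'prop': 'revisions', 'titles': title,
            'rvprop': 'ids|timestamp|content', 'rvlimit': limit, 'format': 'json',
        })
        page = next(iter(r.json()['query']['pages'].values()))
        return page.get('revisions', [])

    if __name__ == '__main__':
        for title in random_user_pages():
            print(title, len(revisions(title)))
            time.sleep(1)  # be polite to the API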

On 07/10/13 19:31, Klein, Max wrote:

Hi Susan,

Do you need the entire database diff'd? I.e. all edits ever. Or are 
you interested in a particular subset of the diffs? It would help to 
know your purpose.


For instance I am interested in diffs around specific articles for 
specific dates to study news events. So I calculate the diffs myself 
using python on page histories rather than the entire database.
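
A minimal version of that calculate-the-diffs-yourself step, assuming the
revision texts for one page are already in hand in chronological order (for
example from the API call sketched above), is just difflib on consecutive
pairs:

    import difflib

    def edit_diffs(revision_texts):
        """Yield a unified diff (list of lines) for each consecutive pair of revision texts."""
        for old, new in zip(revision_texts, revision_texts[1:]):
            yield list(difflib.unified_diff(old.splitlines(), new.splitlines(),
                                            fromfile='before', tofile='after', lineterm=''))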


Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023


From: wiki-research-l-boun...@lists.wikimedia.org on behalf of Susan
Biancani inacn...@gmail.com

Sent: Thursday, October 03, 2013 10:06 PM
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] diffdb formatted Wikipedia dump
I'm looking for a dump from English Wikipedia in diff format (i.e. 
each entry is the text that was added/deleted since the last edit, 
rather than each entry is the current state of the page).


The Summer of Research folks provided a handy guide to how to create 
such a dataset from the standard complete dumps here: 
http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
But the time estimate they give is prohibitive for me (20-24 hours for 
each dump file--there are currently 158--running on 24 cores). I'm a 
grad student in a social science department, and don't have access to 
extensive computing power. I've been paying out of pocket for AWS, but 
this would get expensive.


There is a diff-format dataset available, but only through April 2011
(here: http://dumps.wikimedia.org/other/diffdb/). I'd like a
diff-format dataset covering January 2010 through March 2013 (or
everything up to March 2013).


Does anyone know if such a dataset exists somewhere? Any leads or 
suggestions would be much appreciated!


Susan


___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l