Susan,

Hmm, it seems like that is a funny middle ground, where it's too long to fetch 
live - although it's probably less than the 158 days implied by one dump file 
per day. I once read and edited 400,000 pages with pywikibot (3 network IO 
calls per page: read, external API, write) in about 20 days. You would only 
have to make two IO calls (read, getHistory) per user page. I don't know how 
many user pages there are, but that might be enough variables to solve the 
system of inequalities you need.
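
Something like this rough pywikibot sketch is what I mean by two calls per 
user page (untested; the namespace loop, error handling, and date filtering 
are my assumptions, adjust to your setup):

    # Sketch: iterate user and user-talk pages and pull their revision
    # histories with pywikibot. Assumes a working user-config.py for enwiki.
    import pywikibot

    site = pywikibot.Site('en', 'wikipedia')

    # Namespace 2 = User:, namespace 3 = User talk:
    for ns in (2, 3):
        for page in site.allpages(namespace=ns):
            try:
                # Pull the full revision history, including revision text.
                revisions = list(page.revisions(content=True, reverse=True))
            except pywikibot.exceptions.Error:
                continue
            for rev in revisions:
                # rev.timestamp and rev.user are available here (plus the
                # text, since content=True); filter to 2010-2013 downstream.
                pass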

If you are dead set on using Hadoop, maybe you could use the Wikimedia Labs 
XGrid: https://wikitech.wikimedia.org/wiki/Main_Page.
They have some monster power, and it's free for bot operators and other tool 
runners. It may also be worth asking there whether someone already has 
wikihadoop set up.


Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

________________________________
From: wiki-research-l-boun...@lists.wikimedia.org 
<wiki-research-l-boun...@lists.wikimedia.org> on behalf of Susan Biancani 
<inacn...@gmail.com>
Sent: Tuesday, October 08, 2013 3:28 PM
To: Research into Wikimedia content and communities
Subject: Re: [Wiki-research-l] diffdb formatted Wikipedia dump

Right now, I want all the edits to user pages and user talk pages, 2010-2013. 
But as I keep going with this project, I may want to expand a bit, so I figured 
if I was going to run the wikihadoop software, I might as well only do it once.

I'm hesitant to do this via web scraping, because I think it'll take much 
longer than working with the dump files. However, if you have suggestions on 
how to get the diffs (or a similar format) efficiently from the dump files, I 
would definitely love to hear them.

I appreciate the help and advice!


On Mon, Oct 7, 2013 at 10:44 AM, Pierre-Carl Langlais 
<pierrecarl.langl...@gmail.com> wrote:
I agree with Klein. If you do not need to exploit the entire Wikipedia 
database, requests handled through a Python scraping library (like Beautiful 
Soup) are certainly sufficient and easy to set up. With a random algorithm to 
select the "ids" you can create a fine sample.
PCL

On 07/10/13 19:31, Klein, Max wrote:
Hi Susan,

Do you need the entire database diff'd, i.e. all edits ever, or are you 
interested in a particular subset of the diffs? It would help to know your 
purpose.

For instance, I am interested in diffs around specific articles on specific 
dates to study news events, so I calculate the diffs myself in Python from 
page histories rather than from the entire database.
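
In case it is useful, the core of that is just something like this minimal 
sketch (standard-library difflib; the list of revision texts is assumed to 
come from the page history):

    # Sketch: unified diffs between consecutive revisions of one page.
    import difflib

    def revision_diffs(revision_texts):
        """Yield a unified diff for each pair of consecutive revisions."""
        for old, new in zip(revision_texts, revision_texts[1:]):
            diff = difflib.unified_diff(
                old.splitlines(), new.splitlines(),
                fromfile='previous', tofile='current', lineterm='')
            yield '\n'.join(diff)

    # Toy example with two revision texts
    for d in revision_diffs(["Hello world.", "Hello world.\nA new sentence."]):
        print(d)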

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

________________________________
From: wiki-research-l-boun...@lists.wikimedia.org 
<wiki-research-l-boun...@lists.wikimedia.org> on behalf of Susan Biancani 
<inacn...@gmail.com>
Sent: Thursday, October 03, 2013 10:06 PM
To: wiki-research-l@lists.wikimedia.org
Subject: [Wiki-research-l] diffdb formatted Wikipedia dump

I'm looking for a dump from English Wikipedia in diff format (i.e., each entry 
is the text that was added or deleted since the previous edit, rather than the 
current state of the page).

The Summer of Research folks provided a handy guide on how to create such a 
dataset from the standard complete dumps here: 
http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
But the time estimate they give is prohibitive for me (20-24 hours for each 
dump file--there are currently 158--running on 24 cores). I'm a grad student in 
a social science department, and don't have access to extensive computing 
power. I've been paying out of pocket for AWS, but this would get expensive.

There is a diff-format dataset available, but only through April 2011 (here: 
http://dumps.wikimedia.org/other/diffdb/). I'd like to get a diff-format 
dataset for January 2010 - March 2013 (or for everything up to March 2013).

Does anyone know if such a dataset exists somewhere? Any leads or suggestions 
would be much appreciated!

Susan



_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
