Hi Susan,

Do you need the entire database diffed, i.e. all edits ever, or are you 
interested in a particular subset of the diffs? It would help to know your 
purpose.

For instance, I am interested in diffs around specific articles on specific 
dates, in order to study news events. So I calculate the diffs myself, using 
Python on page histories rather than on the entire database.
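
A minimal sketch of what per-article diffing of this kind could look like, 
using only difflib from the Python standard library. It assumes the revision 
texts of one page have already been fetched as (rev_id, text) pairs, oldest 
first. The function name revision_diffs and the line-level granularity are 
illustrative assumptions, not the actual script referred to above.

```python
import difflib


def revision_diffs(revisions):
    """Return the lines added and removed by each revision of one page.

    `revisions` is a list of (rev_id, text) pairs, oldest first.
    (Hypothetical input shape; fetching the page history is not shown.)
    """
    diffs = []
    prev_lines = []  # the first revision is compared against an empty page
    for rev_id, text in revisions:
        lines = text.splitlines()
        added, removed = [], []
        # ndiff prefixes each output line with "+ ", "- ", "  " or "? "
        for line in difflib.ndiff(prev_lines, lines):
            if line.startswith("+ "):
                added.append(line[2:])
            elif line.startswith("- "):
                removed.append(line[2:])
        diffs.append({"rev_id": rev_id, "added": added, "removed": removed})
        prev_lines = lines
    return diffs


# Toy page history with two revisions (made-up text, for illustration only)
history = [
    (1, "Hello world\nSecond line"),
    (2, "Hello world\nSecond line edited\nThird line"),
]
for entry in revision_diffs(history):
    print(entry["rev_id"], entry["added"], entry["removed"])
```

Because only the chosen pages and date ranges are processed, the cost grows 
with the number of revisions of those pages rather than with the size of the 
full dump.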

Maximilian Klein
Wikipedian in Residence, OCLC
+17074787023

________________________________
From: [email protected] 
<[email protected]> on behalf of Susan Biancani 
<[email protected]>
Sent: Thursday, October 03, 2013 10:06 PM
To: [email protected]
Subject: [Wiki-research-l] diffdb formatted Wikipedia dump

I'm looking for a dump from English Wikipedia in diff format (i.e. each entry 
is the text that was added/deleted since the last edit, rather than the full 
text of the page at that revision).

The Summer of Research folks provided a handy guide to how to create such a 
dataset from the standard complete dumps here: 
http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
But the time estimate they give is prohibitive for me (20-24 hours for each 
dump file--there are currently 158--running on 24 cores). I'm a grad student in 
a social science department, and don't have access to extensive computing 
power. I've been paying out of pocket for AWS, but this would get expensive.

There is a diff-format dataset available, but only through April 2011 (here: 
http://dumps.wikimedia.org/other/diffdb/). I'd like to get a diff-format 
dataset for January 2010 to March 2013 (or for everything up to March 2013).

Does anyone know if such a dataset exists somewhere? Any leads or suggestions 
would be much appreciated!

Susan
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
