Right now, I want all the edits to user pages and user talk pages, 2010-2013. But as I keep going with this project, I may want to expand a bit, so I figured if I was going to run the wikihadoop software, I might as well only do it once.
I'm hesitant to do this via web scraping, because I think it'll take much longer than working with the dump files. However, if you have suggestions on how to get the diffs (or a similar format) efficiently from the dump files, I would definitely love to hear them.

I appreciate the help and advice!

On Mon, Oct 7, 2013 at 10:44 AM, Pierre-Carl Langlais <[email protected]> wrote:

> I agree with Klein. If you do not need to exploit the entire Wikipedia
> database, requests through a Python scraping library (like Beautiful Soup)
> are certainly sufficient and easy to set up. With an aleatory algorithm to
> select the "ids" you can create a fine sample.
>
> PCL
>
> On 07/10/13 19:31, Klein, Max wrote:
>
> Hi Susan,
>
> Do you need the entire database diff'd? I.e. all edits ever. Or are you
> interested in a particular subset of the diffs? It would help to know your
> purpose.
>
> For instance, I am interested in diffs around specific articles for
> specific dates to study news events. So I calculate the diffs myself using
> Python on page histories rather than the entire database.
>
> Maximilian Klein
> Wikipedian in Residence, OCLC
> +17074787023
>
> ------------------------------
> *From:* [email protected] on behalf of Susan Biancani <[email protected]>
> *Sent:* Thursday, October 03, 2013 10:06 PM
> *To:* [email protected]
> *Subject:* [Wiki-research-l] diffdb formatted Wikipedia dump
>
> I'm looking for a dump from English Wikipedia in diff format (i.e.
> each entry is the text that was added/deleted since the last edit, rather
> than each entry is the current state of the page).
>
> The Summer of Research folks provided a handy guide to how to create such
> a dataset from the standard complete dumps here:
> http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
> But the time estimate they give is prohibitive for me (20-24 hours for
> each dump file--there are currently 158--running on 24 cores). I'm a grad
> student in a social science department, and don't have access to extensive
> computing power. I've been paying out of pocket for AWS, but this would get
> expensive.
>
> There is a diff-format dataset available, but only through April, 2011
> (here: http://dumps.wikimedia.org/other/diffdb/). I'd like to get a
> diff-format dataset for January, 2010 - March, 2013 (or, for everything up
> to March, 2013).
>
> Does anyone know if such a dataset exists somewhere? Any leads or
> suggestions would be much appreciated!
>
> Susan
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
