Re: [Wiki-research-l] diffdb formatted Wikipedia dump

Susan Biancani Tue, 08 Oct 2013 15:29:01 -0700

Right now, I want all the edits to user pages and user talk pages,
2010-2013. But as I keep going with this project, I may want to expand a
bit, so I figured if I was going to run the wikihadoop software, I might as
well only do it once.


I'm hesitant to do this via web scraping, because I think it'll take much
longer than working with the dump files. However, if you have suggestions
on how to get the diffs (or a similar format) efficiently from the dump
files, I would definitely love to hear them.

I appreciate the help and advice!


On Mon, Oct 7, 2013 at 10:44 AM, Pierre-Carl Langlais <
[email protected]> wrote:

>  I agree with Klein. If you do not need to use the entire Wikipedia
> database, requests through a Python scraping library (like Beautiful Soup)
> are certainly sufficient and easy to set up. With a random sampling
> algorithm to select the "ids", you can create a fine sample.
> PCL
>
> On 07/10/13 19:31, Klein,Max wrote:
>
>  Hi Susan,
>
> Do you need the entire database diff'd? I.e. all edits ever. Or are you
> interested in a particular subset of the diffs? It would help to know your
> purpose.
>
> For instance I am interested in diffs around specific articles for
> specific dates to study news events. So I calculate the diffs myself using
> python on page histories rather than the entire database.
>
>  Maximilian Klein
> Wikipedian in Residence, OCLC
> +17074787023
>
>  ------------------------------
> *From:* [email protected] on behalf of Susan Biancani
> <[email protected]>
> *Sent:* Thursday, October 03, 2013 10:06 PM
> *To:* [email protected]
> *Subject:* [Wiki-research-l] diffdb formatted Wikipedia dump
>
>  I'm looking for a dump from English Wikipedia in diff format (i.e.
> each entry is the text that was added/deleted since the last edit, rather
> than the full text of the page at that revision).
>
>  The Summer of Research folks provided a handy guide to how to create such
> a dataset from the standard complete dumps here:
> http://meta.wikimedia.org/wiki/WSoR_datasets/revision_diff
>  But the time estimate they give is prohibitive for me (20-24 hours for
> each dump file--there are currently 158--running on 24 cores). I'm a grad
> student in a social science department, and don't have access to extensive
> computing power. I've been paying out of pocket for AWS, but this would get
> expensive.
>
>  There is a diff-format dataset available, but only through April 2011
> (here: http://dumps.wikimedia.org/other/diffdb/). I'd like to get a
> diff-format dataset for January 2010 to March 2013 (or for everything up
> to March 2013).
>
>  Does anyone know if such a dataset exists somewhere? Any leads or
> suggestions would be much appreciated!
>
>  Susan
>
>
> _______________________________________________
> Wiki-research-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
>
>
>

