The deltas library implements a rough version of the WikiWho strategy behind
a difflib-style interface called "SegmentMatcher".
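
For example, here is a minimal sketch of using SegmentMatcher through the
deltas API (adapted from memory of the deltas README -- the sample strings
and operation attributes are illustrative, so double-check against the
current docs before relying on them):

    from deltas import segment_matcher, text_split

    # Tokenize both revision texts into word/whitespace tokens.
    a = text_split.tokenize("This is some text.  This is some other text.")
    b = text_split.tokenize("This is some other text.  This is some text.")

    # Diff the token sequences; each operation carries token-offset
    # ranges (a1:a2 into a, b1:b2 into b).
    for op in segment_matcher.diff(a, b):
        print(type(op).__name__,
              repr("".join(a[op.a1:op.a2])),
              repr("".join(b[op.b1:op.b2])))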

Re. diffs, I have some datasets that I have generated and can share.  Would
enwiki-20150602 be recent enough for your uses?

If not, then I'd also like to point you to http://pythonhosted.org/mwdiffs/,
which provides some nice utilities for parallel processing of diffs from
MediaWiki dumps using the `deltas` library.  See
http://pythonhosted.org/mwdiffs/utilities.html.  Those utilities natively
parallelize the computation, so you can divide the total runtime (100 days)
by the number of CPUs you run with, e.g. 100 days / 16 CPUs ≈ 6.3 days.  On
a Hadoop streaming setup (Altiscale), I've been able to process the whole
English Wikipedia history in 48 hours, so Hadoop isn't a massive benefit
over plain multi-core parallelization -- yet.
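
If you roll your own pipeline instead of using the mwdiffs utilities, the
divide-by-CPUs idea is just a worker pool over dump files.  A minimal sketch
with the standard library (the glob pattern and the body of process_dump are
placeholders, not part of mwdiffs):

    import glob
    from multiprocessing import Pool, cpu_count

    def process_dump(path):
        # Placeholder: parse one dump file and diff consecutive revisions
        # (e.g. with mwxml + deltas), writing the results out as you go.
        ...

    if __name__ == "__main__":
        paths = glob.glob("enwiki-20150602-pages-meta-history*.xml*.bz2")
        # One worker per CPU; wall-clock time divides roughly by cpu_count(),
        # assuming the dump files are similar in size.
        with Pool(cpu_count()) as pool:
            pool.map(process_dump, paths)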

-Aaron

On Wed, Jan 20, 2016 at 8:49 AM, Flöck, Fabian <[email protected]>
wrote:

> Hi, you can also look at our WikiWho code; we have tested it and it
> extracts the changes between revisions considerably faster than a simple
> diff.  See here: https://github.com/maribelacosta/wikiwho .  You would have
> to adapt the code a bit to give you the pure diffs, though.  Let me know if
> you need help.
>
> best,
> fabian
>
>
>
> On 20.01.2016, at 13:15, Scott Hale <[email protected]> wrote:
>
> Hi Bowen,
>
> You might compare the performance of Aaron Halfaker's deltas library:
> https://github.com/halfak/deltas
> (You might have already done so, I guess, but just in case)
>
> In either case, I suspect the tasks will need to be parallelized to finish
> in a reasonable amount of time. How many editions are you working with?
>
> Cheers,
> Scott
>
>
> On Wed, Jan 20, 2016 at 10:44 AM, Bowen Yu <[email protected]> wrote:
>
>> Hello all,
>>
>> I am a 2nd-year PhD student working in the GroupLens research group at the
>> University of Minnesota - Twin Cities. I am currently working on a project
>> studying how identity-based and bond-based theories can help explain
>> editors' behavior in WikiProjects within a group context, but I am having a
>> technical problem and need some help and advice.
>>
>> I am trying to extract each editor's revision content from the XML
>> dumps - the content they added or deleted in each revision. I used the
>> compare function in difflib to obtain the added or deleted content by
>> comparing two string objects, which runs extremely slowly when the strings
>> are huge, as they are for Wikipedia revision content. Without any parallel
>> processing, the expected runtime to download and parse the 201 dumps would
>> be ~100+ days. I was pointed to Altiscale, but I am not yet sure exactly
>> how to use it for my problem.
>>
>> It would be really great if anyone could give me some suggestions to help
>> me make progress. Thanks in advance!
>>
>> Sincerely,
>> Bowen
>>
>
>
> --
> Dr Scott Hale
> Data Scientist
> Oxford Internet Institute
> University of Oxford
> http://www.scotthale.net/
> [email protected]
>
> Regards,
> Fabian
>
> --
> Fabian Flöck
> Research Associate
> Computational Social Science department @GESIS
> Unter Sachsenhausen 6-8, 50667 Cologne, Germany
> Tel: + 49 (0) 221-47694-208
> [email protected]
>
> www.gesis.org
> www.facebook.com/gesis.org
>
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
