Hi Bowen,

You might compare the performance of Aaron Halfaker's deltas library:
https://github.com/halfak/deltas
(You might have already done so, I guess, but just in case)
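
Roughly, the usage looks something like this (from memory of the deltas
README, so please double-check the exact imports and attribute names
against the repo before relying on them):

    # Sketch: extract added/removed content between two revision texts
    # with deltas' segment matcher. API details are from memory of the
    # README, so verify them against the repo.
    from deltas import segment_matcher, text_split

    old_rev = "This is some text.  This is some other text."
    new_rev = "This is some other text.  This is some text."

    a = text_split.tokenize(old_rev)
    b = text_split.tokenize(new_rev)

    for op in segment_matcher.diff(a, b):
        # Each operation covers a token range in the old (a) and new (b)
        # revisions; insert/delete operations give the changed content.
        print(op, "".join(a[op.a1:op.a2]), "|", "".join(b[op.b1:op.b2]))

It may well be faster than difflib on long revision texts, but I would
benchmark it on a sample of your data first.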

In either case, I suspect the tasks will need to be parallelized to finish
in a reasonable amount of time. How many editions are you working with?
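
To give a rough sense of what I mean by parallelizing, something along
these lines might work (just a sketch; iter_revision_pairs() is a made-up
placeholder for however you pull consecutive revision texts out of a dump):

    # Sketch: fan the pairwise diffs out across CPU cores.
    from multiprocessing import Pool

    from deltas import segment_matcher, text_split


    def diff_pair(pair):
        old_text, new_text = pair
        a = text_split.tokenize(old_text)
        b = text_split.tokenize(new_text)
        # Return the raw operations; pull out the inserted/deleted
        # tokens in whatever form suits your analysis.
        return list(segment_matcher.diff(a, b))


    if __name__ == "__main__":
        with Pool() as pool:  # defaults to one worker per CPU core
            for ops in pool.imap_unordered(diff_pair, iter_revision_pairs(),
                                           chunksize=50):
                pass  # aggregate or store the added/deleted content here

Splitting the work by dump file (one process per edition) would be even
simpler if that matches how your data is organized.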

Cheers,
Scott


On Wed, Jan 20, 2016 at 10:44 AM, Bowen Yu <yuxxx...@umn.edu> wrote:

> Hello all,
>
> I am a 2nd-year PhD student in the GroupLens Research group at the
> University of Minnesota - Twin Cities. I am currently working on a project
> to study how identity-based and bond-based theories can help us understand
> editors' behavior in WikiProjects within the group context, but I have run
> into a technical problem and would appreciate some help and advice.
>
> I am trying to parse each editor's revision content from the XML dumps -
> the content they added or deleted in each revision. I used the compare
> function in difflib to obtain the added or deleted content by comparing
> two string objects, but it runs extremely slowly when the strings are
> huge, as they are for Wikipedia revision contents. Without any parallel
> processing, the expected runtime to download and parse the 201 dumps
> would be ~100+ days. I was pointed to Altiscale, but I am not yet sure
> exactly how to use it for my problem.
>
> It would be really great if anyone could give me some suggestions to help
> me make progress. Thanks in advance!
>
> Sincerely,
> Bowen
>


-- 
Dr Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.h...@oii.ox.ac.uk
