Hi Bowen,

You might compare the performance of Aaron Halfaker's deltas library: https://github.com/halfak/deltas (you might have already done so, but just in case).
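Roughly the usage I have in mind is below (a sketch from memory of the deltas README, so double-check the exact names there). The key point is that deltas diffs token sequences rather than raw strings, which is usually where difflib gets slow on full revision texts:

    # Sketch only - imports and operation fields follow the deltas README as I remember it.
    from deltas import segment_matcher, text_split

    old_text = "This is revision one of the page."
    new_text = "This is revision two of the page, with an extra sentence."

    # Tokenize first: the matcher compares word/punctuation tokens, not characters.
    a = text_split.tokenize(old_text)
    b = text_split.tokenize(new_text)

    for op in segment_matcher.diff(a, b):
        # Operations carry a name plus (a1, a2, b1, b2) spans into the token lists.
        if op.name == "insert":
            print("added:", "".join(map(str, b[op.b1:op.b2])))
        elif op.name == "delete":
            print("removed:", "".join(map(str, a[op.a1:op.a2])))

Even if you stay with difflib, diffing token lists instead of whole strings is typically far faster than character-level comparison of entire revisions.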
In either case, I suspect the tasks will need to be parallelized to finish in a reasonable amount of time - a rough sketch of what I mean is at the bottom of this message. How many editions are you working with?

Cheers,
Scott

On Wed, Jan 20, 2016 at 10:44 AM, Bowen Yu <yuxxx...@umn.edu> wrote:
> Hello all,
>
> I am a 2nd-year PhD student in the GroupLens Research group at the
> University of Minnesota - Twin Cities. Recently, I have been working on a
> project studying how identity-based and bond-based theories help explain
> editors' behavior in WikiProjects within the group context, but I have run
> into a technical problem and need advice.
>
> I am trying to parse each editor's revision content from the XML dumps -
> the content they added or deleted in each revision. I used the compare
> function in difflib to obtain the added or deleted content by comparing
> two string objects, which runs extremely slowly when the strings are huge,
> as they are for Wikipedia revision contents. Without any parallel
> processing, the expected runtime to download and parse the 201 dumps would
> be ~100+ days. I was pointed to Altiscale, but I am not yet sure exactly
> how to use it for my problem.
>
> It would be really great if anyone could give me some suggestions to help
> me make more progress. Thanks in advance!
>
> Sincerely,
> Bowen

--
Dr Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.h...@oii.ox.ac.uk
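On the parallelization point: a minimal sketch of the file-level split I mean, using Python's multiprocessing. The dumps/*.xml.bz2 layout and the iter_pages() helper are placeholders for illustration (fill that in with mwxml, a SAX parser, or whatever you already use to read the XML); the diff itself is stdlib difflib run on word lists rather than whole strings.

    # Sketch only: one worker per dump file; iter_pages() is a placeholder for
    # your own XML-reading code.
    import difflib
    import glob
    from multiprocessing import Pool

    def iter_pages(dump_path):
        """Placeholder: yield (page_title, [revision texts in order]) from one dump file."""
        raise NotImplementedError

    def diff_revisions(old_text, new_text):
        """Return (added, removed) word lists between two consecutive revisions."""
        a, b = old_text.split(), new_text.split()
        added, removed = [], []
        matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag in ("replace", "delete"):
                removed.extend(a[i1:i2])
            if tag in ("replace", "insert"):
                added.extend(b[j1:j2])
        return added, removed

    def process_dump(dump_path):
        """Diff each revision of each page against the previous revision."""
        counts = []
        for title, revision_texts in iter_pages(dump_path):
            prev = ""
            for text in revision_texts:
                added, removed = diff_revisions(prev, text or "")
                counts.append((title, len(added), len(removed)))
                prev = text or ""
        return dump_path, counts

    if __name__ == "__main__":
        paths = sorted(glob.glob("dumps/*.xml.bz2"))  # hypothetical file layout
        with Pool(processes=8) as pool:
            for path, counts in pool.imap_unordered(process_dump, paths):
                print(path, "revisions processed:", len(counts))

Whether you split across editions, across the per-edition dump chunks, or across page ranges mostly depends on how evenly sized the files are; the per-file split above is the simplest and keeps each worker streaming a single file.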
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l