The deltas library implements a rough version of the WikiWho strategy, with a difflib-style interface, as "SegmentMatcher".
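Usage looks roughly like this -- a minimal sketch from memory, so check the deltas docs for the exact module layout and operation fields:

    from deltas import segment_matcher, text_split

    # Tokenize two revisions of wikitext
    a = text_split.tokenize("This is some text.  This is some other text.")
    b = text_split.tokenize("This is some other text.  This is some text.")

    # diff() yields Equal/Insert/Delete operations; each one carries token
    # offsets into the old revision (a1:a2) and the new one (b1:b2).
    for op in segment_matcher.diff(a, b):
        print(op, "".join(a[op.a1:op.a2]), "|", "".join(b[op.b1:op.b2]))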
Re. diffs, I have some datasets that I have generated and can share. Would enwiki-20150602 be recent enough for your purposes?

If not, I'd also like to point you to http://pythonhosted.org/mwdiffs/, which provides some nice utilities for extracting diffs from MediaWiki dumps in parallel using the `deltas` library. See http://pythonhosted.org/mwdiffs/utilities.html. Those utilities natively parallelize the computation, so you can divide the total runtime (100 days) by the number of CPUs you have to run with, e.g. 100 days / 16 CPUs ≈ 6.3 days. (A rough sketch of doing the same thing yourself with multiprocessing is at the bottom of this mail, after the quoted thread.) On a Hadoop Streaming setup (Altiscale), I've been able to get the whole English Wikipedia history processed in 48 hours, so it's not a massive benefit -- yet.

-Aaron

On Wed, Jan 20, 2016 at 8:49 AM, Flöck, Fabian <[email protected]> wrote:

> Hi, you can also look at our WikiWho code; in our tests it extracts the
> changes between revisions considerably faster than a simple diff. See here:
> https://github.com/maribelacosta/wikiwho . You would have to adapt the
> code a bit to give you the pure diffs, though. Let me know if you need
> help.
>
> Best,
> Fabian
>
> On 20.01.2016, at 13:15, Scott Hale <[email protected]> wrote:
>
> Hi Bowen,
>
> You might compare the performance of Aaron Halfaker's deltas library:
> https://github.com/halfak/deltas
> (You might have already done so, I guess, but just in case.)
>
> In either case, I suspect the tasks will need to be parallelized to be
> achieved in a reasonable time scale. How many editions are you working
> with?
>
> Cheers,
> Scott
>
> On Wed, Jan 20, 2016 at 10:44 AM, Bowen Yu <[email protected]> wrote:
>
>> Hello all,
>>
>> I am a 2nd-year PhD student working in the GroupLens Research group at
>> the University of Minnesota - Twin Cities. Recently, I have been working
>> on a project to study how identity-based and bond-based theories would
>> help understand editors' behavior in WikiProjects within the group
>> context, but I am running into a technical problem and need some advice.
>>
>> I am trying to parse each editor's revision content from the XML dumps --
>> the content they added or deleted in each revision. I used the compare
>> function in difflib to obtain the added or deleted content by comparing
>> two string objects, which runs extremely slowly when the strings are
>> huge, as is the case with Wikipedia revision content. Without any
>> parallel processing, the expected runtime to download and parse the 201
>> dumps would be ~100+ days. I was pointed to Altiscale, but I am not yet
>> sure exactly how to use it for my problem.
>>
>> It would be really great if anyone could give me some suggestions to help
>> me make more progress. Thanks in advance!
>>
>> Sincerely,
>> Bowen
>
> --
> Dr Scott Hale
> Data Scientist
> Oxford Internet Institute
> University of Oxford
> http://www.scotthale.net/
> [email protected]
>
> Regards,
> Fabian
>
> --
> Fabian Flöck
> Research Associate
> Computational Social Science department @GESIS
> Unter Sachsenhausen 6-8, 50667 Cologne, Germany
> Tel: + 49 (0) 221-47694-208
> [email protected]
> www.gesis.org
> www.facebook.com/gesis.org
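P.S. The multiprocessing sketch mentioned above: if you roll your own rather than using the mwdiffs utilities, the shape of it is just a process pool over pages. This is a sketch under assumptions, not mwdiffs' actual API -- diff_page and the pages list are hypothetical stand-ins for whatever your dump reader produces:

    from multiprocessing import Pool
    from deltas import segment_matcher, text_split

    def diff_page(revision_texts):
        """Diff consecutive revisions of a single page.

        revision_texts is a list of wikitext strings in chronological order;
        this helper is a hypothetical stand-in for what mwdiffs does for you.
        """
        operations = []
        last = text_split.tokenize("")
        for text in revision_texts:
            current = text_split.tokenize(text)
            operations.append(list(segment_matcher.diff(last, current)))
            last = current
        return operations

    if __name__ == "__main__":
        # `pages` would come from your dump reader; each item is the list of
        # revision texts for one page.  Placeholder data for illustration:
        pages = [
            ["Apples are red.", "Apples are red.  Apples are tasty."],
        ]

        # One worker per CPU; each page is diffed independently, so the total
        # runtime divides roughly by the number of workers.
        with Pool(16) as pool:
            for page_operations in pool.imap_unordered(diff_page, pages):
                pass  # write the operations out, e.g. one JSON line per revision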
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
