Hi, you can also look at our WikiWho code. In our tests it extracts the 
changes between revisions considerably faster than a simple diff. See here: 
https://github.com/maribelacosta/wikiwho . You would have to adapt the code a 
bit to give you the pure diffs, though. Let me know if you need help.

best,
fabian



On 20.01.2016, at 13:15, Scott Hale 
<[email protected]> wrote:

Hi Bowen,

You might compare the performance of Aaron Halfaker's deltas library: 
https://github.com/halfak/deltas
(You might have already done so, I guess, but just in case)

In either case, I suspect the tasks will need to be parallelized to be achieved 
in a reasonable time scale. How many editions are you working with?
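
As a rough illustration of the parallelization point, here is a minimal sketch that treats each dump file as one independent task and spreads the files over worker processes with Python's standard multiprocessing module. The names `local_name`, `process_dump` and `run_parallel` are hypothetical, and the per-file work shown (counting revisions and characters of text) is only a placeholder for the real diff computation. It assumes each dump is an uncompressed or bz2-compressed MediaWiki XML export.

```python
import bz2
import xml.etree.ElementTree as ET
from multiprocessing import Pool


def local_name(tag):
    # Drop the "{namespace}" prefix that ElementTree puts on tags.
    return tag.rsplit("}", 1)[-1]


def process_dump(path):
    """Placeholder per-file task: count revisions and characters of text."""
    opener = bz2.open if path.endswith(".bz2") else open
    n_revisions = 0
    n_chars = 0
    with opener(path, "rb") as handle:
        # Stream the file so a whole dump never has to fit in memory.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if local_name(elem.tag) != "revision":
                continue
            for child in elem:
                if local_name(child.tag) == "text":
                    n_chars += len(child.text or "")
            n_revisions += 1
            elem.clear()  # release the revision text once it has been used
    return path, n_revisions, n_chars


def run_parallel(paths, processes):
    """Apply process_dump to every path, one dump file per task."""
    if processes <= 1:
        # Serial fallback, convenient for debugging a single file.
        return [process_dump(p) for p in paths]
    with Pool(processes=processes) as pool:
        # Results arrive in completion order, not input order.
        return list(pool.imap_unordered(process_dump, paths))
```

When run as a script, the call to `run_parallel` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Splitting by file keeps the workers independent, so the speed-up is bounded by the number of cores and by disk throughput.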

Cheers,
Scott


On Wed, Jan 20, 2016 at 10:44 AM, Bowen Yu 
<[email protected]> wrote:
Hello all,

I am a 2nd-year PhD student working in the GroupLens Research group at the 
University of Minnesota - Twin Cities. I am currently working on a project to 
study how identity-based and bond-based theories can help explain editors' 
behavior in WikiProjects within the group context, but I have run into a 
technical problem and need help and advice.

I am trying to extract, from the XML dumps, the content each editor added or 
deleted in each revision. I used the compare function in difflib to obtain the 
added or deleted content by comparing two string objects, which runs extremely 
slowly when the strings are huge, as is the case with Wikipedia revision 
contents. Without any parallel processing, the expected runtime to download and 
parse the 201 dumps would be 100+ days. I was pointed to Altiscale, but I am 
not yet sure exactly how to use it for my problem.
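
For illustration, here is a minimal sketch of one alternative to a line-based compare, assuming the function meant above is `difflib.Differ.compare`. It runs `difflib.SequenceMatcher` once over word tokens and collects the inserted and deleted tokens. The names `tokenize` and `added_removed` and the tokenizing regular expression are hypothetical choices, not taken from any library mentioned in this thread, and whether this is fast enough on full revision histories would need to be measured.

```python
import difflib
import re


def tokenize(text):
    # Words and single punctuation marks; whitespace is dropped.
    return re.findall(r"\w+|[^\w\s]", text)


def added_removed(old_text, new_text):
    """Return (added_tokens, removed_tokens) between two revision texts."""
    old_tokens = tokenize(old_text)
    new_tokens = tokenize(new_text)
    # autojunk=True (the default) lets difflib ignore very frequent tokens
    # in long sequences, trading some accuracy for speed.
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=True)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_tokens[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_tokens[j1:j2])
    return added, removed


if __name__ == "__main__":
    print(added_removed("The cat sat on the mat.", "The cat sat on the red mat."))
```

Unlike `Differ.compare`, this produces no human-readable diff lines and does no second, character-level pass over changed lines. It only yields the token lists, which may be all that is needed for counting added and deleted content.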

It would be really great if anyone could give me some suggestions to help me 
make more progress. Thanks in advance!

Sincerely,
Bowen

_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l




--
Dr Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
[email protected]




Regards,
Fabian

--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
[email protected]

www.gesis.org
www.facebook.com/gesis.org






