Re: [Wiki-research-l] Parsing editor's each revision contents from wiki XML dumps

2016-01-20 Thread Flöck, Fabian
Hi, you can also look at our WikiWho code; in our tests it extracts the changes between revisions considerably faster than a simple diff. See https://github.com/maribelacosta/wikiwho for the code. You would have to adapt it a bit to give you the pure diffs, though. Let me know if you need

Re: [Wiki-research-l] Parsing editor's each revision contents from wiki XML dumps

2016-01-20 Thread Aaron Halfaker
The deltas library implements roughly the WikiWho strategy, in a difflib sort of way, as "SegmentMatcher". Regarding diffs, I have some datasets that I have generated and can share. Would enwiki-20150602 be recent enough for your uses? If not, then I'd also like to point you to

Re: [Wiki-research-l] Parsing editor's each revision contents from wiki XML dumps

2016-01-20 Thread Bowen Yu
Thanks for all the suggestions you shared! @Aaron, it would be great if you could share the dataset you have with me. I think 20150602 is recent enough. In the meantime, I will explore the utilities you mentioned. They look like good tools to learn and practice with. Thanks! On Wed, Jan 20, 2016 at 9:20 AM,

[Wiki-research-l] Parsing editor's each revision contents from wiki XML dumps

2016-01-20 Thread Bowen Yu
Hello all, I am a second-year PhD student in the GroupLens Research group at the University of Minnesota - Twin Cities. I have recently been working on a project studying how identity-based and bond-based theories can help explain editors' behavior in WikiProjects within the group context, but I am

Re: [Wiki-research-l] Parsing editor's each revision contents from wiki XML dumps

2016-01-20 Thread Scott Hale
Hi Bowen, you might compare the performance of Aaron Halfaker's deltas library: https://github.com/halfak/deltas (you might have already done so, I guess, but just in case). In either case, I suspect the tasks will need to be parallelized to finish in a reasonable amount of time. How many
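
For reference, the "simple diff" baseline that the replies compare WikiWho and deltas against can be sketched with Python's standard difflib alone. This is only a minimal illustration, not the WikiWho or deltas algorithm: the tokenizer and the helper names (tokenize, revision_delta, page_deltas) are assumptions made up for this sketch, and the input is assumed to be a list of (editor, text) pairs already extracted from the XML dump in chronological order.

```python
import difflib
import re


def tokenize(text):
    # Hypothetical tokenizer: words, plus each non-space symbol on its own.
    return re.findall(r"\w+|[^\w\s]", text)


def revision_delta(old_text, new_text):
    """Return (added, removed) token lists between two revision texts."""
    old_tokens = tokenize(old_text)
    new_tokens = tokenize(new_text)
    # autojunk=False keeps difflib from discarding frequent tokens on long pages.
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_tokens[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_tokens[j1:j2])
    return added, removed


def page_deltas(revisions):
    """Diff each revision against the previous one on the same page.

    revisions: list of (editor, text) pairs in chronological order.
    Returns a list of (editor, added_tokens, removed_tokens).
    """
    previous = ""
    deltas = []
    for editor, text in revisions:
        added, removed = revision_delta(previous, text)
        deltas.append((editor, added, removed))
        previous = text
    return deltas
```

Because each page's history is diffed independently of every other page, page_deltas could be mapped over pages with a process pool, which is one way to get the parallelization suggested above.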