Re: [Wiki-research-l] Parsing each editor's revision contents from wiki XML dumps

2016-01-20 Thread Flöck, Fabian
Hi, you can also look at our WikiWho code; in our tests it extracts the 
changes between revisions considerably faster than a simple diff. See here: 
https://github.com/maribelacosta/wikiwho . You would have to adapt the code a 
bit to give you the pure diffs, though. Let me know if you need help.

best,
fabian



Gruß,
Fabian

--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
fabian.flo...@gesis.org

www.gesis.org
www.facebook.com/gesis.org






___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Parsing each editor's revision contents from wiki XML dumps

2016-01-20 Thread Aaron Halfaker
The deltas library implements a rough version of the WikiWho strategy behind a
difflib-style interface, under the name "SegmentMatcher".

Re. diffs, I have some datasets that I have generated and can share.  Would
enwiki-20150602 be recent enough for your uses?

If not, I'd also like to point you to http://pythonhosted.org/mwdiffs/,
which provides some nice utilities for computing diffs from MediaWiki dumps
in parallel using the `deltas` library. See
http://pythonhosted.org/mwdiffs/utilities.html. Those utilities natively
parallelize the computation, so you can divide the total runtime
(100 days) by the number of CPUs you run with, e.g. 100 days / 16 CPUs
≈ 6.3 days. On a Hadoop Streaming setup (Altiscale), I've been able to
process the whole English Wikipedia history in 48 hours, so it's not
a massive benefit over a multi-core machine -- yet.
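
If you end up rolling your own pipeline instead, the same "divide by the
number of CPUs" idea falls out of a plain process pool. A rough sketch, in
which the toy page data and the difflib-based diff are only stand-ins for a
real dump reader and diff routine:

    import difflib
    from multiprocessing import Pool

    def extract_diffs(old, new):
        # Placeholder diff: keep only the added (+) and removed (-) lines.
        return [line for line in difflib.ndiff(old.splitlines(), new.splitlines())
                if line.startswith(("+ ", "- "))]

    def process_page(page):
        # page = (title, [revision_text, ...]) as a dump reader would yield it.
        title, revisions = page
        return title, [extract_diffs(a, b)
                       for a, b in zip(revisions, revisions[1:])]

    if __name__ == "__main__":
        pages = [("Example", ["one line\n", "one line\nand another\n"])]  # toy input
        with Pool(processes=16) as pool:  # wall time roughly divides by 16
            for title, diffs in pool.imap_unordered(process_page, pages):
                print(title, diffs)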

-Aaron

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Parsing each editor's revision contents from wiki XML dumps

2016-01-20 Thread Bowen Yu
Thanks for all the suggestions you shared!

@Aaron, it would be great if you could share your dataset with me. I
think 20150602 is fairly recent. In the meantime, I will explore the
utilities you mentioned; they seem like good things to learn and practice with.
Thanks!

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


[Wiki-research-l] Parsing each editor's revision contents from wiki XML dumps

2016-01-20 Thread Bowen Yu
Hello all,

I am a 2nd-year PhD student in the GroupLens Research group at the
University of Minnesota - Twin Cities. I am currently working on a project
to study how identity-based and bond-based theories can help explain
editors' behavior in WikiProjects within a group context, but I have run
into a technical problem and need some advice.

I am trying to parse each editor's revision content from the XML
dumps - the content they added or deleted in each revision. I used the
compare function in difflib to obtain the added or deleted content by
comparing two string objects, but this runs extremely slowly when the strings
are huge, as they are for Wikipedia revision content.
Without any parallel processing, the expected runtime to
download and parse the 201 dumps would be ~100+ days. I was pointed to
Altiscale, but I am not yet sure exactly how to use it for my problem.
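
Concretely, for each pair of consecutive revisions I am doing roughly the
following (a simplified sketch; the toy strings at the bottom stand in for
full revision texts from the dump):

    import difflib

    def added_and_deleted(old_text, new_text):
        """Collect the lines added and deleted between two revision texts."""
        added, deleted = [], []
        # Differ.compare() prefixes every output line with '+ ', '- ', '  ' or '? '.
        for line in difflib.Differ().compare(old_text.splitlines(),
                                             new_text.splitlines()):
            if line.startswith("+ "):
                added.append(line[2:])
            elif line.startswith("- "):
                deleted.append(line[2:])
        return added, deleted

    # Toy strings; the real inputs are consecutive revisions of the same page,
    # which is where this becomes extremely slow.
    print(added_and_deleted("a\nb\nc", "a\nc\nd"))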

It would be really great if anyone could give me some suggestions to help me
make progress. Thanks in advance!

Sincerely,
Bowen
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] Parsing each editor's revision contents from wiki XML dumps

2016-01-20 Thread Scott Hale
Hi Bowen,

You might compare the performance of Aaron Halfaker's deltas library:
https://github.com/halfak/deltas
(You might have already done so, I guess, but just in case)

In either case, I suspect the tasks will need to be parallelized to complete
in a reasonable time. How many editions are you working with?
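
For the comparison itself, a small timing harness should be enough, e.g.
something like the sketch below (the generated toy revisions and the repeat
count are placeholders; wrap the deltas call in the same way and compare the
numbers on a few real revision pairs):

    import difflib
    import time

    def time_diff(diff_fn, old_text, new_text, repeats=5):
        """Average wall-clock time of one diff implementation on one revision pair."""
        start = time.perf_counter()
        for _ in range(repeats):
            diff_fn(old_text, new_text)
        return (time.perf_counter() - start) / repeats

    def difflib_diff(old_text, new_text):
        return list(difflib.Differ().compare(old_text.splitlines(),
                                             new_text.splitlines()))

    # Generated toy revisions; substitute a few real pairs of consecutive
    # revisions, and add a second wrapper around the deltas diff to compare.
    old_text = "\n".join("line %d" % i for i in range(2000))
    new_text = "\n".join("line %d" % i for i in range(2000) if i % 50)
    print("difflib:", time_diff(difflib_diff, old_text, new_text), "seconds per diff")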

Cheers,
Scott




-- 
Dr Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.h...@oii.ox.ac.uk
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l