Hi Aaron,

Neat LimitedQueue class. It looks like this revert code wouldn't handle some
corner cases. For example, I don't see logic that would distinguish between
blanking (which produces duplicate checksums) and reverts.
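
To make the corner case concrete, here is a minimal Python sketch. It is not code from the repository: the names find_identity_revert and skip_blank are made up for illustration, and treating whitespace-only text as "blanking" is my assumption. It shows how a checksum-only comparison reports a second blanking as a revert to the first one, and one possible way to keep the two apart.

```python
import hashlib


def checksum(text):
    """MD5 hex digest of a revision's text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def find_identity_revert(history, text, skip_blank=True):
    """Return the index of the newest earlier revision with identical text.

    history: checksums of the earlier revisions, oldest first.
    Returns None when no earlier revision matches.
    skip_blank (hypothetical flag): do not treat blanked pages as reverts.
    """
    if skip_blank and text.strip() == "":
        return None
    digest = checksum(text)
    # Assumption: the immediately preceding revision is skipped, because
    # identical text there means nothing was reverted.
    for i in range(len(history) - 2, -1, -1):
        if history[i] == digest:
            return i
    return None


# A page that is blanked twice: the second blanking has the same checksum
# as the first one.
blanked = [checksum(t) for t in ["foo", "", "foo again"]]
naive = find_identity_revert(blanked, "", skip_blank=False)
guarded = find_identity_revert(blanked, "", skip_blank=True)

# A genuine identity revert is still found with the guard on.
normal = [checksum(t) for t in ["foo", "bar", "foobar"]]
real = find_identity_revert(normal, "bar")
```

Whether a repeated blanking should count as a revert is a judgment call for the analysis; the flag only makes that choice explicit.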

-- Best, Dmitry

On Sun, Aug 21, 2011 at 3:15 PM, Aaron Halfaker <[email protected]> wrote:

> I've updated my dump-processing Python project to include code for quickly
> detecting identity reverts from XML dumps.  See
> https://bitbucket.org/halfak/wikimedia-utilities for the project and the
> process() function at the bottom of
> https://bitbucket.org/halfak/wikimedia-utilities/src/f1c8fe7224f3/wmf/dump/processors/reverts.py
> for the algorithm.  The actual function with the revert detection logic is
> about 50 lines long.
>
> The resulting dump.map function using this revert processor() will emit
> "revert" revisions and "reverted" revisions with the following fields:
>
> Revert revision:
>
>    - "revert" - denotes that this row is a reverting edit
>    - revision_id - the rev_id of the reverting edit
>    - reverted_to_id - the rev_id of the reverted-to edit
>    - for_vandalism - using D_LOOSE/D_STRICT regular expression on the
>    reverting comment (See Priedhorsky et al. "Creating, Destroying and
>    Restoring Value in Wikipedia" GROUP 2007)
>    - reverted_revs - number of revisions that were reverted (this is the
>    number of revisions between the reverting edit and reverted to edit)
>
>
> Reverted revision:
>
>    - "reverted" - denotes that this row is a reverted edit
>    - revision_id - the rev_id of the reverted edit
>    - reverting_id - the rev_id of the reverting edit
>    - reverted_to_id - the rev_id of the reverted-to edit
>    - for_vandalism - using D_LOOSE/D_STRICT regular expression on the
>    reverting comment (See Priedhorsky et al. "Creating, Destroying and
>    Restoring Value in Wikipedia" GROUP 2007)
>    - reverted_revs - number of revisions that were reverted (this is the
>    number of revisions between the reverting edit and reverted to edit)
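
As a rough illustration of the two row types listed above, here is a Python sketch. It assumes the fields appear in the order given in the lists; the names RevertRow, RevertedRow and rows_for_revert are hypothetical and not taken from the wikimedia-utilities code. The for_vandalism value is passed in rather than computed, since the D_LOOSE/D_STRICT expressions are not reproduced in this thread.

```python
from collections import namedtuple

# Field order follows the two lists above (an assumption about the real output).
RevertRow = namedtuple(
    "RevertRow",
    ["type", "revision_id", "reverted_to_id", "for_vandalism", "reverted_revs"],
)
RevertedRow = namedtuple(
    "RevertedRow",
    ["type", "revision_id", "reverting_id", "reverted_to_id",
     "for_vandalism", "reverted_revs"],
)


def rows_for_revert(reverting_id, reverted_to_id, reverted_ids, for_vandalism):
    """Yield one "revert" row, then one "reverted" row per reverted revision."""
    count = len(reverted_ids)
    yield RevertRow("revert", reverting_id, reverted_to_id, for_vandalism, count)
    for rev_id in reverted_ids:
        yield RevertedRow("reverted", rev_id, reverting_id, reverted_to_id,
                          for_vandalism, count)


# Revision 4 restores revision 2, so revision 3 is the only reverted edit.
rows = list(rows_for_revert(4, 2, [3], for_vandalism=False))
```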
>
> I hope this is helpful.
>
> -Aaron
>
> On Fri, Aug 19, 2011 at 3:08 PM, Aaron Halfaker <[email protected]> wrote:
>
>> An identity revert is one which changes the article to an absolutely
>> identical previous state.  This is a common operation in the English
>> Wikipedia.
>>
>> There is a paper by Kittur & Kraut (and others), whose title I can't recall,
>> that found the vast majority of reverts of any sort were identity reverts.
>> Some other types they define are:
>>
>>    - "Partial reverts": Part of an edit is discarded
>>    - "Effective reverts": Looks to be an identity revert, but not
>>    *exactly* the same as a previous revision.  Often a few white-space
>>    characters were out of place.
>>
>> See http://www.grouplens.org/node/427 for a discussion of the difficulty
>> of detecting reverts in better ways.
>>
>> My code detects identity reverts.  For example, suppose the following is
>> the content of a sequence of revisions:
>>
>>
>>    1. "foo"
>>    2. "bar"
>>    3. "foobar"
>>    4. "bar"
>>    5. "barbar"
>>
>> Revision #4 reverts back to revision #2 and revision #3 is reverted.  When
>> looking for identity reverts, I have found that limiting the number of
>> revisions that can be reverted to ~15 produces the highest quality of
>> results.  This is discussed in http://www.grouplens.org/node/416 (see
>> http://www-users.cs.umn.edu/~halfak/summaries/A_Jury_of_Your_Peers.html for
>> a quick/dirty summary of the work).
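
The example above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the process() function from the repository: detect_identity_reverts and max_reverted are names chosen here, MD5 is used only because it comes up elsewhere in this thread, and skipping the immediately preceding revision is my own assumption.

```python
import hashlib
from collections import deque


def detect_identity_reverts(revisions, max_reverted=15):
    """Yield (reverting_id, reverted_to_id, reverted_ids) per identity revert.

    revisions: iterable of (rev_id, text) pairs in chronological order.
    max_reverted: largest number of revisions a single revert may undo.
    """
    # Holding max_reverted + 1 earlier revisions means at most max_reverted
    # revisions can sit between a reverting edit and its target.
    recent = deque(maxlen=max_reverted + 1)
    for rev_id, text in revisions:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        window = list(recent)
        # Search newest to oldest. The immediately preceding revision is
        # skipped (assumption): identical text there reverts nothing.
        for i in range(len(window) - 2, -1, -1):
            if window[i][1] == digest:
                reverted_ids = [r for r, _ in window[i + 1:]]
                yield rev_id, window[i][0], reverted_ids
                break
        recent.append((rev_id, digest))


# The five revisions from the example above.
example = [(1, "foo"), (2, "bar"), (3, "foobar"), (4, "bar"), (5, "barbar")]
found = list(detect_identity_reverts(example))
```

With the five example revisions this yields a single revert: revision 4 reverts to revision 2, and revision 3 is the reverted edit. Lowering max_reverted to 0 suppresses it, because one revision lies between the two.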
>>
>> This subject deserves a long conversation, but I think the bit you might
>> be interested in is that the identity revert (described above, with an
>> example) seems to be the accepted approach for identifying reverts in most
>> types of analyses.
>>
>> -Aaron
>>
>> On Fri, Aug 19, 2011 at 4:39 PM, Flöck, Fabian <[email protected]> wrote:
>>
>>> Hi Aaron,
>>>
>>> Thanks, that would be awesome :) We built something ourselves, but I'm
>>> not quite content with it.
>>>
>>> Could you also tell me how you defined a revert (and maybe how you
>>> determine who is the reverter)? Because this is a crucial issue for me.
>>> Is it the complete deletion of all the characters entered by an editor in
>>> an edit? What about editors that revert others or delete content? Do you
>>> treat their edits as being reverted if the deleted content gets
>>> reintroduced? Did you take into account the location of the words in the
>>> text, or did you use a bag-of-words model?
>>> I read many papers and tool documentation that use "reverts", and some
>>> mention their method (while many don't), but it seems almost no one
>>> describes their definition of what a "revert" actually is.
>>>
>>> But maybe I will get the answers to this from your code as well :)
>>>
>>> Anyway, thanks for the help!
>>>
>>> Best,
>>> Fabian
>>>
>>>
>>> On 19 Aug 2011, at 18:31, Aaron Halfaker wrote:
>>>
>>> Fabian,
>>>
>>> I actually have some software for quickly producing reverts from a
>>> database dump.  The framework for doing it is available here:
>>> https://bitbucket.org/halfak/wikimedia-utilities.  I still have to
>>> package up the code that actually generates the reverts though.  It's just a
>>> matter of finding time to sit down with it and figure out the dependencies!
>>> I expect that I can have it ready by Monday.  I hope to actually package up
>>> the revert detecting code into the above Python project as an example.
>>>
>>> I just wanted to let you know that I have a response for you on the way.
>>>
>>> -Aaron
>>>
>>> On Thu, Aug 18, 2011 at 4:40 AM, Flöck, Fabian <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to detect reverts in Wikipedia for my research, right now
>>>> with a self-built script using MD5 hashes and diffs between revisions. I
>>>> always read about people taking reverts into account in their data, but
>>>> it's seldom described HOW exactly a revert is determined or what tool they
>>>> use to do that. Can you point me to any research or tools, or maybe tell me
>>>> what you used in your own research to identify which edits were reverted
>>>> and/or who reverted them?
>>>>
>>>> Best,
>>>>
>>>> Fabian
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Karlsruhe Institute of Technology (KIT)
>>>> Institute of Applied Informatics and Formal Description Methods
>>>>
>>>> Dipl.-Medwiss. Fabian Flöck
>>>> Research Associate
>>>>
>>>> Building 11.40, Room 222
>>>> KIT-Campus South
>>>> D-76128 Karlsruhe
>>>>
>>>> Phone: +49 721 608 4 6584
>>>> Skype: f.floeck_work
>>>> E-Mail: [email protected]
>>>> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
>>>>
>>>> KIT – University of the State of Baden-Wuerttemberg and
>>>> National Research Center of the Helmholtz Association
>>>>
>>>>
>>>> _______________________________________________
>>>> Wiki-research-l mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>>
>>>
>>
>
