Hi Aaron,

Neat LimitedQueue class. It looks like this revert code wouldn't handle some corner cases. For example, I don't see logic that would distinguish between blanking (which produces duplicate checksums) and reverts.
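To make the corner case concrete, here is a minimal Python sketch of checksum-based identity revert detection along the lines described in the quoted messages below. It is not the code from reverts.py: the function name detect_identity_reverts, the window parameter, the use of MD5 over the revision text, and the choice to skip identical adjacent revisions are all assumptions made for illustration.

```python
import hashlib
from collections import deque


def detect_identity_reverts(revisions, window=15):
    """Yield (reverting_id, reverted_to_id, reverted_ids) tuples.

    revisions: iterable of (rev_id, text) pairs in chronological order.
    window:    how many preceding revisions are searched for a match
               (hypothetical default, following the ~15 from the thread).
    """
    # Bounded history of (rev_id, checksum); old entries fall off the left.
    recent = deque(maxlen=window)
    for rev_id, text in revisions:
        checksum = hashlib.md5(text.encode("utf-8")).hexdigest()
        history = list(recent)
        # Search from the newest revision backwards for identical content.
        for i in range(len(history) - 1, -1, -1):
            if history[i][1] == checksum:
                reverted = [rid for rid, _ in history[i + 1:]]
                # Assumption: an identical adjacent revision is a null edit,
                # not a revert, so nothing is emitted for it.
                if reverted:
                    yield rev_id, history[i][0], reverted
                break
        recent.append((rev_id, checksum))


# The five-revision example from the thread.
thread_example = [(1, "foo"), (2, "bar"), (3, "foobar"), (4, "bar"), (5, "barbar")]

# A page that is blanked twice: revision 5 is a new blanking, not a restore.
blanking = [(1, "text"), (2, ""), (3, "text"), (4, "more text"), (5, "")]

if __name__ == "__main__":
    print(list(detect_identity_reverts(thread_example)))
    print(list(detect_identity_reverts(blanking)))
```

On the thread's example this reports revision 4 reverting to revision 2, with revision 3 reverted. On the blanking history it reports revision 3 as a revert to revision 1, but it also reports revision 5 as a revert to revision 2 with revisions 3 and 4 reverted, because both blank revisions share a checksum. A purely checksum-based detector cannot tell those two situations apart without some extra rule, for example treating empty text specially.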
--
Best,
Dmitry

On Sun, Aug 21, 2011 at 3:15 PM, Aaron Halfaker <[email protected]> wrote:

> I've updated my dump processing Python project to include code for quickly
> detecting identity reverts from XML dumps. See
> https://bitbucket.org/halfak/wikimedia-utilities for the project and the
> process() function at the bottom of
> https://bitbucket.org/halfak/wikimedia-utilities/src/f1c8fe7224f3/wmf/dump/processors/reverts.py
> for the algorithm. The actual function with the revert detection logic is
> about 50 lines long.
>
> The resulting dump.map function using this revert processor() will emit
> "revert" revisions and "reverted" revisions with the following fields:
>
> Revert revision:
>
> - "revert" - denotes that this row is a reverting edit
> - revision_id - the rev_id of the reverting edit
> - reverted_to_id - the rev_id of the reverted-to edit
> - for_vandalism - using D_LOOSE/D_STRICT regular expression on the
>   reverting comment (see Priedhorsky et al. "Creating, Destroying and
>   Restoring Value in Wikipedia" GROUP 2007)
> - reverted_revs - number of revisions that were reverted (this is the
>   number of revisions between the reverting edit and the reverted-to edit)
>
> Reverted revision:
>
> - "reverted" - denotes that this row is a reverted edit
> - revision_id - the rev_id of the reverted edit
> - reverting_id - the rev_id of the reverting edit
> - reverted_to_id - the rev_id of the reverted-to edit
> - for_vandalism - using D_LOOSE/D_STRICT regular expression on the
>   reverting comment (see Priedhorsky et al. "Creating, Destroying and
>   Restoring Value in Wikipedia" GROUP 2007)
> - reverted_revs - number of revisions that were reverted (this is the
>   number of revisions between the reverting edit and the reverted-to edit)
>
> I hope this is helpful.
>
> -Aaron
>
> On Fri, Aug 19, 2011 at 3:08 PM, Aaron Halfaker <[email protected]> wrote:
>
>> An identity revert is one which changes the article to an absolutely
>> identical previous state. This is a common operation in the English
>> Wikipedia.
>>
>> There is a Kittur & Kraut (and others) paper, which I can't recall, that
>> found the vast majority of reverts of any sort were identity reverts.
>> Some other types they define are:
>>
>> - "Partial reverts": Part of an edit is discarded
>> - "Effective reverts": Looks to be an identity revert, but not
>>   *exactly* the same as a previous revision. Often a few white-space
>>   characters were out of place.
>>
>> See http://www.grouplens.org/node/427 for a discussion of the difficulty
>> of detecting reverts in better ways.
>>
>> My code detects identity reverts. For example, suppose the following is
>> the content of a sequence of revisions.
>>
>> 1. "foo"
>> 2. "bar"
>> 3. "foobar"
>> 4. "bar"
>> 5. "barbar"
>>
>> Revision #4 reverts back to revision #2 and revision #3 is reverted. When
>> looking for identity reverts, I have found that limiting the number of
>> revisions that can be reverted to ~15 produces the highest quality of
>> results. This is discussed in http://www.grouplens.org/node/416 (see
>> http://www-users.cs.umn.edu/~halfak/summaries/A_Jury_of_Your_Peers.html
>> for a quick/dirty summary of the work).
>>
>> This subject deserves a long conversation, but I think the bit you might
>> be interested in is that the identity revert (described above, with an
>> example) seems to be the accepted approach for identifying reverts for
>> most types of analyses.
>>
>> -Aaron
>>
>> On Fri, Aug 19, 2011 at 4:39 PM, Flöck, Fabian <[email protected]> wrote:
>>
>>> Hi Aaron,
>>>
>>> thanks, that would be awesome :) We built something ourselves, but I'm
>>> not quite content with it.
>>>
>>> Could you also tell me how you defined a revert (and maybe how you
>>> determine who is the reverter)? Because this is a crucial issue for me.
>>> Is it the complete deletion of all the characters entered by an editor
>>> in an edit? What about editors that revert others or delete content? Do
>>> you treat their edits as being reverted if the deleted content gets
>>> reintroduced? Did you take into account the location of the words in the
>>> text or did you use a bag-of-words model?
>>> I read many papers and tool documentation that use "reverts", and some
>>> mention their method (while many don't), while it seems almost no-one
>>> describes their definition of what a "revert" actually is.
>>>
>>> But maybe I will get the answers to this from your code as well :)
>>>
>>> Anyway, thanks for the help!
>>>
>>> Best,
>>> Fabian
>>>
>>> On 19 Aug 2011, at 18:31, Aaron Halfaker wrote:
>>>
>>> Fabian,
>>>
>>> I actually have some software for quickly producing reverts from a
>>> database dump. The framework for doing it is available here:
>>> https://bitbucket.org/halfak/wikimedia-utilities. I still have to
>>> package up the code that actually generates the reverts though. It's
>>> just a matter of finding time to sit down with it and figure out the
>>> dependencies! I expect that I can have it ready by Monday. I hope to
>>> actually package up the revert detecting code into the above Python
>>> project as an example.
>>>
>>> I just wanted to let you know that I have a response for you on the way.
>>>
>>> -Aaron
>>>
>>> On Thu, Aug 18, 2011 at 4:40 AM, Flöck, Fabian <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to detect reverts in Wikipedia for my research, right now
>>>> with a self-built script using MD5 hashes and DIFFs between revisions.
>>>> I always read about people taking reverts into account in their data,
>>>> but it's seldom described HOW exactly a revert is determined or what
>>>> tool they use to do that. Can you point me to any research or tools, or
>>>> tell me maybe what you used in your own research to identify which
>>>> edits were reverted and/or who reverted them?
>>>>
>>>> Best,
>>>>
>>>> Fabian
>>>>
>>>> --
>>>> Karlsruhe Institute of Technology (KIT)
>>>> Institute of Applied Informatics and Formal Description Methods
>>>>
>>>> Dipl.-Medwiss. Fabian Flöck
>>>> Research Associate
>>>>
>>>> Building 11.40, Room 222
>>>> KIT-Campus South
>>>> D-76128 Karlsruhe
>>>>
>>>> Phone: +49 721 608 4 6584
>>>> Skype: f.floeck_work
>>>> E-Mail: [email protected]
>>>> WWW: http://www.aifb.kit.edu/web/Fabian_Flöck
>>>>
>>>> KIT – University of the State of Baden-Wuerttemberg and
>>>> National Research Center of the Helmholtz Association
>>>>
>>>> _______________________________________________
>>>> Wiki-research-l mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
