2009/1/8 Brion Vibber <[email protected]>:
> Definitely of interest! If you haven't already, I'd love to see some
> documentation on the format on mediawiki.org, and it'd be great if we

I did some similar work a while ago using Python's difflib[1] as the
diffing engine. Since difflib was much too slow when fed lists of
single characters, I broke up the articles into sequences of words,
which improved the speed dramatically (but it's still not as fast as
Robert's).
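
To illustrate the word-level approach, here is a minimal sketch that
assumes plain whitespace tokenization. The function name
changed_words is made up for this example and this is not the
original script, only one way to feed difflib words instead of
characters.

```python
import difflib


def changed_words(old_text, new_text):
    """Return the non-equal opcodes of a word-level diff.

    Hypothetical helper: splits on whitespace, so punctuation stays
    attached to the neighbouring word.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    # autojunk=False turns off difflib's "popular element" heuristic,
    # which would otherwise ignore very common words in long articles.
    matcher = difflib.SequenceMatcher(None, old_words, new_words,
                                      autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, old_words[i1:i2], new_words[j1:j2]))
    return changes


if __name__ == "__main__":
    print(changed_words("the quick brown fox", "the slow brown fox"))
```

Comparing words rather than characters makes the sequences much
shorter, which is where the speedup comes from.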

My goal was slightly different: rather than producing exact
revision deltas, I was looking for "blame" information[2]. I also
built a graph of reverts by matching the SHA1 hashes of revision
texts, and used it to find the shortest path between the first
revision and the current one, which skips over page-blanking events
in most cases.
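
The revert-graph idea can be sketched roughly as follows. This is an
assumption about how such a graph might be built, not the original
code: revisions with identical SHA1 digests are treated as one node,
consecutive revisions are joined by an edge, and a breadth-first
search finds the shortest path from the first revision to the last.
The name shortest_revision_path is hypothetical.

```python
import hashlib
from collections import deque


def shortest_revision_path(revisions):
    """Return indices of revisions on a shortest first-to-last path.

    `revisions` is a non-empty list of revision texts, oldest first.
    Each node is represented by the index where its text first appeared.
    """
    digests = [hashlib.sha1(text.encode("utf-8")).hexdigest()
               for text in revisions]

    # Remember the earliest revision index for every distinct digest.
    first_seen = {}
    for index, digest in enumerate(digests):
        first_seen.setdefault(digest, index)

    # Edge from each revision's digest to the next revision's digest.
    edges = {}
    for current, following in zip(digests, digests[1:]):
        if current != following:
            edges.setdefault(current, []).append(following)

    # Breadth-first search from the first digest to the last one.
    start, goal = digests[0], digests[-1]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbour in edges.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)

    # Walk the parent links back from the goal to rebuild the path.
    path = []
    node = goal
    while node is not None:
        path.append(first_seen[node])
        node = parent[node]
    path.reverse()
    return path


if __name__ == "__main__":
    history = ["a", "a b", "", "a b", "a b c"]
    print(shortest_revision_path(history))
```

In the sample history the blanked revision at index 2 is restored by
a revert at index 3, so the path goes straight from index 1 to
index 4 and the blanking never appears on it.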

The output for the first 1400 or so articles in enwiki can be found
here: http://hewgill.com/~greg/wikiblame/

I would be interested in adapting my blame processor to use a faster
diffing algorithm, since it took my machine many hours to process
those 1400 articles.

  [1]: http://python.org/doc/2.5/lib/module-difflib.html
  [2]: http://hewgill.com/journal/entries/461-wikipedia-blame

Greg Hewgill
http://hewgill.com

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l