On 09/20/2012 07:40 PM, MZMcBride wrote:
> Scanning dumps (or really dealing with them in any form) is pretty awful.
> There's been some brainstorming in the past for how to set up a system where
> users (or operators) could run arbitrary regular expressions on all of the
> current wikitext regularly, but such a setup requires _a lot_ of anything
> involved (disk space, RAM, bandwidth, processing power, etc.). Maybe one day
> Labs will have something like this.
We have a dump grepper tool in the Parsoid codebase (see js/tests/dumpGrepper.js) that takes about 25 minutes to grep an XML dump of the English Wikipedia. The memory involved is minimal and constant; the tool is mostly CPU-bound. It should not be hard to hook this up to a web service. Our parser web service in js/api could serve as a template for that.

Gabriel

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
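The constant-memory, CPU-bound behavior Gabriel describes comes from processing the dump one page at a time rather than loading it whole. A minimal sketch of the idea (this is an illustrative assumption, not the actual js/tests/dumpGrepper.js code; the function name grepPages and the synthetic dump fragment are hypothetical, and a real implementation would use a streaming SAX-style parser over the compressed dump):

```javascript
// Sketch of a dump grepper: extract each <page> of a MediaWiki XML
// dump, apply a caller-supplied regex to the page's wikitext, and
// collect the titles of matching pages. Only one page's text is
// examined at a time, so memory use stays bounded per page.
function grepPages(xmlChunk, regex) {
  const matches = [];
  // A production version would stream the dump through a SAX parser;
  // here we scan an in-memory chunk for <page>...</page> blocks.
  const pageRe = /<page>([\s\S]*?)<\/page>/g;
  let m;
  while ((m = pageRe.exec(xmlChunk)) !== null) {
    const page = m[1];
    const title = (page.match(/<title>([\s\S]*?)<\/title>/) || [])[1];
    const text = (page.match(/<text[^>]*>([\s\S]*?)<\/text>/) || [])[1] || '';
    if (regex.test(text)) {
      matches.push(title);
    }
  }
  return matches;
}

// Usage on a tiny synthetic dump fragment:
const dump = `
<mediawiki>
  <page><title>Foo</title><revision><text>{{Infobox person}} hello</text></revision></page>
  <page><title>Bar</title><revision><text>plain wikitext</text></revision></page>
</mediawiki>`;
console.log(grepPages(dump, /\{\{Infobox/));
```

Hooking this up as a web service, as suggested above, would mostly mean accepting the regex as a request parameter and streaming matching titles back to the client.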