Very cool! --tomasz
On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere <[email protected]> wrote:
> Hello!
>
> Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and
> Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard
> on a customized stream-based InputFormatReader that allows parsing of both
> bz2-compressed and uncompressed files of the full Wikipedia dump (the dump
> file with the complete edit histories) using Hadoop. Prior to WikiHadoop and
> the accompanying InputFormatReader, it was not possible to use Hadoop to
> analyze the full Wikipedia dump files (see the detailed tutorial /
> background for an explanation of why that was not possible).
>
> This means:
> 1) We can now harness Hadoop's distributed computing capabilities in
> analyzing the full dump files.
> 2) You can send either one or two revisions to a single mapper, so it's
> possible to diff two revisions and see what content has been added /
> removed.
> 3) You can exclude namespaces by supplying a regular expression.
> 4) We are using Hadoop's Streaming interface, which means people can use
> this InputFormatReader with different languages such as Java, Python, Ruby
> and PHP.
>
> The source code is available at: https://github.com/whym/wikihadoop
> A more detailed tutorial and installation guide is available at:
> https://github.com/whym/wikihadoop/wiki
>
> (Apologies for cross-posting to wikitech-l and wiki-research-l)
>
> [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>
> Best,
>
> Diederik
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
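For anyone curious what point 4 looks like in practice, below is a minimal, hypothetical sketch of a Hadoop Streaming mapper in Python. It is not taken from WikiHadoop itself: it only assumes, per the announcement, that each mapper receives revision XML on stdin and that output follows Hadoop Streaming's usual tab-separated key/value convention. The `TITLE_RE` pattern and the page-title word count it computes are illustrative inventions; consult the WikiHadoop wiki for the actual record format.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper sketch (not from the WikiHadoop repo).
# Assumption: each input record is a chunk of dump XML containing a <title>
# element, as produced by WikiHadoop's stream-based InputFormatReader.
import re
import sys

TITLE_RE = re.compile(r"<title>(.*?)</title>")


def mapper(records):
    """Emit one 'title<TAB>1' line per record whose title can be found."""
    out = []
    for record in records:
        m = TITLE_RE.search(record)
        if m:
            # key<TAB>value is the convention Hadoop Streaming shuffles on
            out.append("%s\t1" % m.group(1))
    return out


if __name__ == "__main__":
    for line in mapper(sys.stdin):
        print(line)
```

A reducer written the same way (reading sorted key/value lines from stdin) would then aggregate the counts; the same stdin/stdout contract is what lets Ruby or PHP mappers plug in unchanged.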
