So the 14-day task included XML parsing and creating diffs. We might gain performance by fine-tuning the Hadoop configuration, although that seems to be more of an art than a science.

Diederik

On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov <[email protected]> wrote:

> Hello,
>
> This is excellent news!
>
> Have you tried running it on Amazon EC2? It would be really nice to know
> how well WikiHadoop scales with the number of nodes.
> Also, this timing - "3 x Quad Core / 14 days / full Wikipedia dump" - on
> what kind of task (XML parsing, diffs, MD5, etc.) was it obtained?
>
> -- Best, Dmitry
>
> On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere
> <[email protected]> wrote:
>
>> Hello!
>>
>> Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker
>> and Fabian Kaelin (who are all Summer of Research fellows) [0] have worked
>> hard on a customized stream-based InputFormatReader that allows parsing of
>> both bz2-compressed and uncompressed files of the full Wikipedia dump (the
>> dump file with the complete edit histories) using Hadoop. Prior to
>> WikiHadoop and the accompanying InputFormatReader it was not possible to
>> use Hadoop to analyze the full Wikipedia dump files (see the detailed
>> tutorial / background for an explanation of why that was not possible).
>>
>> This means:
>> 1) We can now harness Hadoop's distributed computing capabilities in
>> analyzing the full dump files.
>> 2) You can send either one or two revisions to a single mapper, so it's
>> possible to diff two revisions and see what content has been added /
>> removed.
>> 3) You can exclude namespaces by supplying a regular expression.
>> 4) We are using Hadoop's Streaming interface, which means people can use
>> this InputFormatReader from different languages such as Java, Python, Ruby
>> and PHP.
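
To make point 2 above a little more concrete, here is a minimal sketch of the diffing step a Python mapper might perform once it has the text of two revisions in hand. It does not reproduce WikiHadoop's actual record format or any of its code: the function name `diff_revisions` and the sample strings are hypothetical, and the sketch assumes the two revision texts have already been extracted from whatever the mapper receives on standard input. It only uses `difflib` from the Python standard library.

```python
import difflib


def diff_revisions(old_text, new_text):
    """Return (added, removed) lines between two revision texts.

    Hypothetical helper for illustration only; it assumes the caller has
    already pulled the two revision texts out of the mapper's input.
    """
    added, removed = [], []
    # ndiff prefixes each line with "+ ", "- ", "  " or "? ".
    # Only the first two mark lines that differ between the revisions.
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


if __name__ == "__main__":
    # Made-up revision texts, not taken from a real dump.
    older = "Hadoop is a framework.\nIt runs on clusters."
    newer = "Hadoop is a framework.\nIt runs on commodity clusters."
    added, removed = diff_revisions(older, newer)
    print("added:", added)
    print("removed:", removed)
```

A line-based diff is the simplest choice here; a word- or character-level diff would probably be more informative for wiki text, at a higher cost per revision pair.
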
>>
>> The source code is available at: https://github.com/whym/wikihadoop
>> A more detailed tutorial and installation guide is available at:
>> https://github.com/whym/wikihadoop/wiki
>>
>> (Apologies for cross-posting to wikitech-l and wiki-research-l)
>>
>> [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>>
>> Best,
>>
>> Diederik
>>
>> _______________________________________________
>> Wiki-research-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

-- 
Check out my about.me profile: http://about.me/diederik
