Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can see how this could be very useful. Otherwise, it seems that Hadoop adds a lot of overhead, and parsing the dump this way is just not practical.
With a straightforward implementation in Python, on a single Core2 Duo you can parse the dump (7z), compute diffs, md5, etc. and store everything in binary form in about 6-7 days. For example, the implementation here: http://code.google.com/p/pymwdat/ can do exactly that. I imagine that with faster C++ code and a modern i7 box it could be done within a day. After that, this precomputed binary form (diffs+metadata+stats take up several times the size of the .7z dump, ~100 GB) can be serialized very efficiently (about an hour on a single box).

That said, I still think using Hadoop/EC2 could be really nice, particularly if the dump can be made available on S3/EC2.

-- Best, Dmitry

On Wed, Aug 17, 2011 at 3:07 PM, Diederik van Liere <[email protected]> wrote:

> So the 14 day task included xml parsing and creating diffs. We might gain
> performance improvements by fine-tuning the Hadoop configuration, although
> that seems to be more of an art than a science.
>
> Diederik
>
> On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov <[email protected]> wrote:
>
>> Hello,
>>
>> This is excellent news!
>>
>> Have you tried running it on Amazon EC2? It would be really nice to know
>> how well WikiHadoop scales up with the number of nodes.
>> Also, this timing, "3 x Quad Core / 14 days / full wikipedia dump": on
>> what kind of task (xml parsing, diffs, md5, etc.) was it obtained?
>>
>> -- Best, Dmitry
>>
>> On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere
>> <[email protected]> wrote:
>>
>>> Hello!
>>>
>>> Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker
>>> and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked
>>> hard on a customized stream-based InputFormatReader that allows parsing of
>>> both bz2 compressed and uncompressed files of the full Wikipedia dump (the
>>> dump file with the complete edit histories) using Hadoop. Prior to
>>> WikiHadoop and the accompanying InputFormatReader it was not possible to
>>> use Hadoop to analyze the full Wikipedia dump files (see the detailed
>>> tutorial / background for an explanation of why that was not possible).
>>>
>>> This means:
>>> 1) We can now harness Hadoop's distributed computing capabilities in
>>> analyzing the full dump files.
>>> 2) You can send either one or two revisions to a single mapper, so it's
>>> possible to diff two revisions and see what content has been added /
>>> removed.
>>> 3) You can exclude namespaces by supplying a regular expression.
>>> 4) We are using Hadoop's Streaming interface, which means people can use
>>> this InputFormatReader from different languages such as Java, Python, Ruby
>>> and PHP.
>>>
>>> The source code is available at: https://github.com/whym/wikihadoop
>>> A more detailed tutorial and installation guide is available at:
>>> https://github.com/whym/wikihadoop/wiki
>>>
>>> (Apologies for cross-posting to wikitech-l and wiki-research-l)
>>>
>>> [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>>>
>>> Best,
>>>
>>> Diederik
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
> --
> Check out my about.me profile: http://about.me/diederik
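
For reference, the single-machine pipeline described in the reply at the top (stream the dump, hash each revision, diff consecutive revisions) could look roughly like the sketch below. This is not the pymwdat code; it is a minimal illustration using only the Python standard library. The names `local_name`, `iter_revision_stats` and `SAMPLE` are made up for this example, the tiny inline XML only assumes the usual page / revision / text layout of a MediaWiki export, and line-based diffing with difflib is just one possible choice of diff.

```python
import difflib
import hashlib
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_revision_stats(stream):
    """Yield (title, rev_id, md5_hex, lines_added, lines_removed) per revision.

    The XML is parsed incrementally, so a whole dump never has to fit in memory.
    Each revision is diffed (line by line) against the previous one on the page.
    """
    title, prev_lines = None, []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id, text = None, ""
            for child in elem:  # direct children only, so contributor ids are ignored
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = child.text
                elif child_name == "text":
                    text = child.text or ""
            lines = text.splitlines()
            added = removed = 0
            for entry in difflib.ndiff(prev_lines, lines):
                if entry.startswith("+ "):
                    added += 1
                elif entry.startswith("- "):
                    removed += 1
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            yield title, rev_id, digest, added, removed
            prev_lines = lines
            elem.clear()  # free the revision text once it has been processed
        elif name == "page":
            prev_lines = []  # the next page starts from an empty baseline
            elem.clear()


# Hypothetical two-revision page, standing in for a real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.5/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision><id>10</id><text>first line</text></revision>
    <revision><id>11</id><text>first line
second line</text></revision>
  </page>
</mediawiki>"""

if __name__ == "__main__":
    for row in iter_revision_stats(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(row)
```

For a bz2-compressed dump, the same generator should work on a file object from `bz2.open(path, "rb")`; a .7z dump would need an external tool or a third-party library to decompress. Storing the results in a compact binary form, as described in the reply, is left out of this sketch.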
