Perhaps fine-tuning it for EC2, maybe even hosting the dataset there? I can
see how that could be very useful! Otherwise... well... it seems like Hadoop
adds a lot of overhead, and it is just not practical to do the parsing this
way.

With a straightforward implementation in Python, on a single Core2 Duo you
can parse the dump (7z), compute diffs, md5, etc. and store everything in a
binary form in about 6-7 days.
For example, the implementation here: http://code.google.com/p/pymwdat/ can
do exactly that. I imagine that with faster C++ code and a modern i7 box
it could be done within a day.
And after that, this precomputed binary form (diffs+metadata+stats take up
several times the size of the .7z dump, ~100 GB) can be serialized very
efficiently (in just about an hour on a single box).
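
As a rough illustration of the single-box approach described above, here is a
minimal Python sketch that streams a MediaWiki XML dump, computes an md5 of
each revision's text, and counts lines added/removed relative to the previous
revision of the same page. This is not the pymwdat code; the function names
(local, iter_revision_stats) are made up for this example, and it assumes the
input is an already-decompressed XML stream with the usual
page/title/revision/id/text layout.

```python
import difflib
import hashlib
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_revision_stats(stream):
    """Yield (title, rev_id, md5_hex, lines_added, lines_removed).

    Each revision is diffed line by line against the previous revision
    of the same page; the first revision is diffed against empty text.
    """
    title, prev_lines = None, []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id, text = None, ""
            # Only direct children: the contributor's <id> is nested deeper.
            for child in elem:
                child_name = local(child.tag)
                if child_name == "id":
                    rev_id = child.text
                elif child_name == "text":
                    text = child.text or ""
            lines = text.splitlines()
            added = removed = 0
            matcher = difflib.SequenceMatcher(None, prev_lines, lines,
                                              autojunk=False)
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op != "equal":
                    removed += i2 - i1
                    added += j2 - j1
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            yield title, rev_id, digest, added, removed
            prev_lines = lines
            elem.clear()  # release the revision text we no longer need
        elif name == "page":
            prev_lines = []  # the next page starts from empty text
            elem.clear()


if __name__ == "__main__":
    # Tiny hand-written sample standing in for a real dump.
    sample = (b"<mediawiki><page><title>Example</title><id>1</id>"
              b"<revision><id>10</id><text>a\nb</text></revision>"
              b"<revision><id>11</id><text>a\nc\nd</text></revision>"
              b"</page></mediawiki>")
    for row in iter_revision_stats(io.BytesIO(sample)):
        print(row)
```

For a real dump you would pass the decompressed stream instead of the
in-memory sample and write the yielded tuples to whatever binary format you
prefer. Line-level diffing is only one choice; a word- or character-level
diff would change the cost considerably.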

That said, I still think using Hadoop/EC2 could be really nice,
particularly if the dump can be made available on S3/EC2.

-- Best, Dmitry


On Wed, Aug 17, 2011 at 3:07 PM, Diederik van Liere <[email protected]> wrote:

> So the 14-day task included XML parsing and creating diffs. We might gain
> performance improvements by fine-tuning the Hadoop configuration, although
> that seems to be more of an art than a science.
> Diederik
>
>
> On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov <[email protected]> wrote:
>
>> Hello,
>>
>> This is excellent news!
>>
>> Have you tried running it on Amazon EC2? It would be really nice to know
>> how well WikiHadoop scales up with the number of nodes.
>> Also, this timing, '3 x Quad Core / 14 days / full Wikipedia dump': on
>> what kind of task (XML parsing, diffs, md5, etc.) was it obtained?
>>
>> -- Best, Dmitry
>>
>> On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere
>> <[email protected]> wrote:
>>
>>> Hello!
>>>
>>> Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker
>>> and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked
>>> hard on a customized stream-based InputFormatReader that allows parsing of
>>> both bz2-compressed and uncompressed files of the full Wikipedia dump (the
>>> dump file with the complete edit histories) using Hadoop. Prior to
>>> WikiHadoop and the accompanying InputFormatReader, it was not possible to
>>> use Hadoop to analyze the full Wikipedia dump files (see the detailed
>>> tutorial / background for an explanation of why that was not possible).
>>>
>>> This means:
>>> 1) We can now harness Hadoop's distributed computing capabilities in
>>> analyzing the full dump files.
>>> 2) You can send either one or two revisions to a single mapper, so it's
>>> possible to diff two revisions and see what content has been added /
>>> removed.
>>> 3) You can exclude namespaces by supplying a regular expression.
>>> 4) We are using Hadoop's Streaming interface, which means people can use
>>> this InputFormatReader from different languages such as Java, Python, Ruby
>>> and PHP.
>>>
>>> The source code is available at: https://github.com/whym/wikihadoop
>>> A more detailed tutorial and installation guide is available at:
>>> https://github.com/whym/wikihadoop/wiki
>>>
>>>
>>> (Apologies for cross-posting to wikitech-l and wiki-research-l)
>>>
>>> [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>>>
>>>
>>> Best,
>>>
>>> Diederik
>>>
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>>
>>>
>>
>
>