So the 14-day task included XML parsing and creating diffs. We might gain
performance improvements by fine-tuning the Hadoop configuration, although
that seems to be more of an art than a science.
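
For anyone curious what the diff part of such a task can look like, here is
a minimal sketch. It assumes the mapper has already extracted the text of two
consecutive revisions as plain strings; how WikiHadoop actually hands them to
the mapper is described in the tutorial. The function name diff_revisions is
hypothetical, and the sketch uses Python's standard difflib, which is not
necessarily what our own jobs use.

```python
import difflib


def diff_revisions(old_text, new_text):
    """Return (added, removed) lines between two revision texts.

    Hypothetical helper: assumes both revisions are already available
    as plain strings inside the mapper.
    """
    added, removed = [], []
    # ndiff prefixes each line with "+ ", "- ", "  " or "? " (hint lines).
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return added, removed


if __name__ == "__main__":
    old = "first line\nsecond line\nthird line"
    new = "first line\nthird line\nfourth line"
    added, removed = diff_revisions(old, new)
    print("added:", added)
    print("removed:", removed)
```

A line-based diff is cheap but coarse for long wiki paragraphs; a word-level
diff would give finer results at a higher cost per revision pair.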
Diederik


On Wed, Aug 17, 2011 at 5:28 PM, Dmitry Chichkov <[email protected]> wrote:

> Hello,
>
> This is excellent news!
>
> Have you tried running it on Amazon EC2? It would be really nice to know
> how well WikiHadoop scales up with the number of nodes.
> Also, on what kind of task (XML parsing, diffs, MD5, etc.) was the timing
> "3 x Quad Core / 14 days / full Wikipedia dump" obtained?
>
> -- Best, Dmitry
>
> On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere
> <[email protected]> wrote:
>
>> Hello!
>>
>> Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker
>> and Fabian Kaelin (who are all Summer of Research fellows)[0] have worked
>> hard on a customized stream-based InputFormatReader that allows parsing of
>> both bz2-compressed and uncompressed files of the full Wikipedia dump (the
>> dump file with the complete edit histories) using Hadoop. Prior to
>> WikiHadoop and the accompanying InputFormatReader, it was not possible to
>> use Hadoop to analyze the full Wikipedia dump files (see the detailed
>> tutorial / background for an explanation of why that was not possible).
>>
>> This means:
>> 1) We can now harness Hadoop's distributed computing capabilities in
>> analyzing the full dump files.
>> 2) You can send either one or two revisions to a single mapper, so it's
>> possible to diff two revisions and see what content has been added /
>> removed.
>> 3) You can exclude namespaces by supplying a regular expression.
>> 4) We are using Hadoop's Streaming interface, which means people can use
>> this InputFormatReader from different languages such as Java, Python, Ruby
>> and PHP.
>>
>> The source code is available at: https://github.com/whym/wikihadoop
>> A more detailed tutorial and installation guide is available at:
>> https://github.com/whym/wikihadoop/wiki
>>
>>
>> (Apologies for cross-posting to wikitech-l and wiki-research-l)
>>
>> [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>>
>>
>> Best,
>>
>> Diederik
>>
>>
>> _______________________________________________
>> Wiki-research-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>


-- 
Check out my about.me profile: http://about.me/diederik