Way cool - Look forward to a brown bag on this project - Diederik? :-)

-Alolita

On Wed, Aug 17, 2011 at 10:05 AM, Tomasz Finc <tf...@wikimedia.org> wrote:
> Very cool!
>
> --tomasz
>
>
>
> On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere <dvanli...@gmail.com> wrote:
>> Hello!
>>
>> Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and
>> Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard
>> on a customized stream-based InputFormatReader that allows parsing of both
>> bz2-compressed and uncompressed files of the full Wikipedia dump (the dump
>> file with the complete edit histories) using Hadoop. Prior to WikiHadoop and
>> the accompanying InputFormatReader, it was not possible to use Hadoop to
>> analyze the full Wikipedia dump files (see the detailed tutorial / background
>> for an explanation of why that was not possible).
>>
>> This means:
>> 1) We can now harness Hadoop's distributed computing capabilities in
>> analyzing the full dump files.
>> 2) You can send either one or two revisions to a single mapper, so it's
>> possible to diff two revisions and see what content has been added /
>> removed.
>> 3) You can exclude namespaces by supplying a regular expression.
>> 4) We are using Hadoop's Streaming interface, which means people can use
>> this InputFormatReader from different languages such as Java, Python, Ruby
>> and PHP.
>>
>> The source code is available at: https://github.com/whym/wikihadoop
>> A more detailed tutorial and installation guide is available at:
>> https://github.com/whym/wikihadoop/wiki
>>
>>
>> (Apologies for cross-posting to wikitech-l and wiki-research-l)
>>
>> [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>>
>>
>> Best,
>>
>> Diederik
>> _______________________________________________
>> Wikitech-l mailing list
>> Wikitech-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>>

