Very cool! --tomasz
On Wed, Aug 17, 2011 at 9:58 AM, Diederik van Liere <[email protected]> wrote:
> Hello!
>
> Over the last few weeks, Yusuke Matsubara, Shawn Walker, Aaron Halfaker and
> Fabian Kaelin (who are all Summer of Research fellows)[0] have worked hard
> on a customized stream-based InputFormatReader that allows parsing of both
> bz2-compressed and uncompressed files of the full Wikipedia dump (the dump
> file with the complete edit histories) using Hadoop. Prior to WikiHadoop and
> the accompanying InputFormatReader, it was not possible to use Hadoop to
> analyze the full Wikipedia dump files (see the detailed tutorial /
> background for an explanation of why that was not possible).
>
> This means:
> 1) We can now harness Hadoop's distributed computing capabilities in
> analyzing the full dump files.
> 2) You can send either one or two revisions to a single mapper, so it's
> possible to diff two revisions and see what content has been added /
> removed.
> 3) You can exclude namespaces by supplying a regular expression.
> 4) We are using Hadoop's Streaming interface, which means people can use
> this InputFormatReader with different languages such as Java, Python, Ruby
> and PHP.
>
> The source code is available at: https://github.com/whym/wikihadoop
> A more detailed tutorial and installation guide is available at:
> https://github.com/whym/wikihadoop/wiki
>
> (Apologies for cross-posting to wikitech-l and wiki-research-l)
>
> [0] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>
> Best,
>
> Diederik
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
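For anyone curious what point 4 looks like in practice, below is a minimal, hypothetical sketch of a Hadoop Streaming mapper in Python. It is not taken from WikiHadoop itself: it only assumes, per the announcement, that each mapper receives revision XML on stdin and that output follows Hadoop Streaming's usual tab-separated key/value convention. The `TITLE_RE` pattern and the page-title word count it computes are illustrative inventions; consult the WikiHadoop wiki for the actual record format.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper sketch (not from the WikiHadoop repo).
# Assumption: each input record is a chunk of dump XML containing a <title>
# element, as produced by WikiHadoop's stream-based InputFormatReader.
import re
import sys

TITLE_RE = re.compile(r"<title>(.*?)</title>")


def mapper(records):
    """Emit one 'title<TAB>1' line per record whose title can be found."""
    out = []
    for record in records:
        m = TITLE_RE.search(record)
        if m:
            # key<TAB>value is the convention Hadoop Streaming shuffles on
            out.append("%s\t1" % m.group(1))
    return out


if __name__ == "__main__":
    for line in mapper(sys.stdin):
        print(line)
```

A reducer written the same way (reading sorted key/value lines from stdin) would then aggregate the counts; the same stdin/stdout contract is what lets Ruby or PHP mappers plug in unchanged.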
