Re: Small files produced by a map-only job

Josh Wills Thu, 06 Jun 2013 05:06:50 -0700

Hey Chao,

It had dropped off my radar, but I'm happy to throw together a patch to do
it this AM.


J



On Thu, Jun 6, 2013 at 4:06 AM, Chao Shi <[email protected]> wrote:

> Hey guys,
>
> I'm writing MR jobs using Crunch. Crunch optimizes some very simple
> pipelines, e.g. sample or grep, into map-only jobs.
>
> Because the MR framework splits the input data by HDFS block, the map phase
> produces one output file per split, i.e. plenty of small files on HDFS,
> which is unpleasant and makes downstream processing inefficient. When I
> write raw MR, I typically add an identity reducer after the map phase,
> which simply merges the map outputs.
>
> I think CRUNCH-162 <https://issues.apache.org/jira/browse/CRUNCH-162> is
> related to this. Is there anyone still working on it?
>
> Thanks,
> Chao
>
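
To make the file-count argument in the quoted message concrete, here is a
toy model in plain Java. It is only a sketch under stated assumptions: it
uses neither the Hadoop nor the Crunch API, the class and method names
(SmallFilesSketch, mapOnly, identityReduce) are hypothetical, each "file" is
just an in-memory list, and the sort that a real shuffle performs is left
out. It shows that a map-only job yields one output per input split, while
an identity reduce step regroups the same records into exactly as many
outputs as there are reducers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical toy model; not Hadoop or Crunch code.
final class SmallFilesSketch {

    // Map-only job: each input split is filtered independently and
    // written out as its own output "file" (one list per split).
    static List<List<String>> mapOnly(List<List<String>> splits, Predicate<String> keep) {
        List<List<String>> outputs = new ArrayList<>();
        for (List<String> split : splits) {
            List<String> out = new ArrayList<>();
            for (String record : split) {
                if (keep.test(record)) {
                    out.add(record);
                }
            }
            outputs.add(out);
        }
        return outputs;
    }

    // Identity reduce: records pass through unchanged, but are regrouped
    // into exactly numReducers output "files" by hash partitioning.
    static List<List<String>> identityReduce(List<List<String>> mapOutputs, int numReducers) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            outputs.add(new ArrayList<>());
        }
        for (List<String> mapOutput : mapOutputs) {
            for (String record : mapOutput) {
                int partition = Math.floorMod(record.hashCode(), numReducers);
                outputs.get(partition).add(record);
            }
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Four splits stand in for four HDFS blocks of input.
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("error a", "info b"),
            Arrays.asList("info c", "error d"),
            Arrays.asList("error e"),
            Arrays.asList("info f"));

        // A grep-like filter, run as a map-only step.
        List<List<String>> mapped = mapOnly(splits, r -> r.startsWith("error"));
        // The same records after a single identity reducer.
        List<List<String>> merged = identityReduce(mapped, 1);

        System.out.println("map-only output files: " + mapped.size());
        System.out.println("after identity reduce: " + merged.size());
    }
}
```

Note that the map-only step still emits an (empty) output for the split that
contains no matching records, which is part of why selective filters such as
grep tend to leave many tiny files behind.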



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
