Small files produced by a map-only job

Chao Shi Thu, 06 Jun 2013 04:07:43 -0700

Hey guys,

I'm writing MR jobs using Crunch. Crunch compiles some very simple
pipelines, e.g. sample or grep, into map-only jobs.


Because the MR framework splits the input data by HDFS block, the map phase
of such a job writes one output file per split. That leaves plenty of small
files on HDFS, which is unpleasant and makes downstream processing
inefficient. When I write raw MR, I typically add an identity reducer to
such a job, which simply merges the map outputs into fewer files.
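
To make the file counts concrete, here is a toy sketch in plain Java. It
does not use the Hadoop or Crunch APIs; the class and method names are
made up for illustration, the input and block sizes are example values,
and it assumes exactly one input split per HDFS block.

```java
import java.util.ArrayList;
import java.util.List;

public class SmallFilesSketch {

    // Number of map tasks, assuming exactly one input split per HDFS block.
    static long numSplits(long inputBytes, long blockBytes) {
        return (inputBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Map-only job: every map task writes its own output file.
    static long mapOnlyOutputFiles(long inputBytes, long blockBytes) {
        return numSplits(inputBytes, blockBytes);
    }

    // Toy identity reduce: records pass through unchanged and are only
    // regrouped into numReducers partitions, one output file per partition.
    static List<List<String>> identityReduce(List<List<String>> mapOutputs,
                                             int numReducers) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<>());
        }
        for (List<String> mapOutput : mapOutputs) {
            for (String record : mapOutput) {
                // Hash partitioning on the record itself (illustrative choice).
                int p = (record.hashCode() & Integer.MAX_VALUE) % numReducers;
                partitions.get(p).add(record);
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        long input = 10L * 1024 * 1024 * 1024; // 10 GiB, example value
        long block = 64L * 1024 * 1024;        // 64 MiB, example value
        System.out.println("map-only output files: "
                + mapOnlyOutputFiles(input, block));

        // Three tiny map outputs, as a grep-like job might produce.
        List<List<String>> mapOutputs = new ArrayList<>();
        mapOutputs.add(List.of("match-1"));
        mapOutputs.add(List.of());
        mapOutputs.add(List.of("match-2", "match-3"));
        System.out.println("files after identity reduce: "
                + identityReduce(mapOutputs, 1).size());
    }
}
```

In raw MR the same effect comes from the base Reducer class, whose default
reduce() passes records through, together with Job.setNumReduceTasks(n) to
choose the number of output files. The price is an extra shuffle and sort.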

I think CRUNCH-162 <https://issues.apache.org/jira/browse/CRUNCH-162> is
related to this. Is anyone still working on it?

Thanks,
Chao
