Hey Chao,

It had dropped off my radar, but I'm happy to throw together a patch to do it this AM.

J

On Thu, Jun 6, 2013 at 4:06 AM, Chao Shi <[email protected]> wrote:

> Hey guys,
>
> I'm writing MR jobs using crunch. Crunch optimizes some very simple
> pipeline into map-only jobs, e.g. sample or grep.
>
> As MR framework splits the input data by HDFS block, the map phase will
> produce plenty of small files on HDFS, which is unpleasant and makes the
> following data processing inefficient. When I write raw MR, I typically
> append this with an identity reducer, which simply merges outputs from map.
>
> I think CRUNCH-162 <https://issues.apache.org/jira/browse/CRUNCH-162> is
> related to this. Is there anyone still working on it?
>
> Thanks,
> Chao
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
