Hey guys, I'm writing MR jobs using Crunch. Crunch optimizes some very simple pipelines, e.g. sample or grep, into map-only jobs.
Because the MR framework splits the input data by HDFS block and each map task writes its own output file, the map phase produces plenty of small files on HDFS, which is unpleasant and makes downstream data processing inefficient. When I write raw MR, I typically append an identity reducer to such a job, which simply merges the map outputs into fewer files. I think CRUNCH-162 <https://issues.apache.org/jira/browse/CRUNCH-162> is related to this. Is anyone still working on it?

Thanks,
Chao
