How big are your 50 files? How long are the reducers taking?

On Jul 30, 2013, at 10:26 PM, Something Something <[email protected]> wrote:
> Hello,
>
> One of our Pig scripts creates over 500 small part files. To save on namespace, we need to cut down the number of files, so instead of saving 500 small files we need to merge them into 50. We tried the following:
>
> 1) When we set the parallel number to 50, the Pig script takes a long time, for obvious reasons.
> 2) If we use Hadoop Streaming, it puts some garbage values into the key field.
> 3) We wrote our own MapReduce program that reads these 500 small part files and uses 50 reducers. The mappers simply write each line, and the reducers loop through the values and write them out. We set job.setOutputKeyClass(NullWritable.class) so that the key is not written to the output file. This is performing better than Pig. The mappers run very fast; the reducers take some time to complete, but the approach seems to be working well.
>
> Is there a better way to do this? What strategy can you think of to increase the speed of the reducers?
>
> Any help in this regard will be greatly appreciated. Thanks.
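
If it helps the discussion, here is a rough sketch of how I read your approach (3), assuming the part files are plain text and the new mapreduce API; the class and path names are just placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeSmallFiles {

        // Mapper: pass each input line through as the key, with a null value.
        public static class LineMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(line, NullWritable.get());
            }
        }

        // Reducer: write the line once per occurrence; the NullWritable output
        // key keeps TextOutputFormat from printing anything before the line.
        public static class LineReducer
                extends Reducer<Text, NullWritable, NullWritable, Text> {
            @Override
            protected void reduce(Text line, Iterable<NullWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                for (NullWritable ignored : values) {
                    ctx.write(NullWritable.get(), line);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "merge-small-files");
            job.setJarByClass(MergeSmallFiles.class);
            job.setMapperClass(LineMapper.class);
            job.setReducerClass(LineReducer.class);
            job.setNumReduceTasks(50);                      // 50 output part files
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(NullWritable.class);
            job.setOutputKeyClass(NullWritable.class);      // suppress the key in the output
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

If that matches what you have, note that the line itself is the map output key, so the shuffle sorts every line by content; that total sort is probably a good part of why the reducers take a while, since all you actually need is to spread the lines across 50 outputs.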
