How big are your 50 files?  How long are the reducers taking?

- HC

On Jul 30, 2013, at 10:26 PM, Something Something <[email protected]> 
wrote:

> Hello,
> 
> One of our Pig scripts creates over 500 small part files.  To save on
> namespace, we need to cut down the # of files, so instead of saving 500
> small files we need to merge them into 50.  We tried the following:
> 
> 1)  When we set the parallel number to 50, the Pig script takes a long time -
> for obvious reasons.
> 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> field.
> 3)  We wrote our own MapReduce program that reads these 500 small part
> files & uses 50 reducers.  Basically, the Mappers simply write out each
> line & the Reducers loop through the values & write them out.  We set
> job.setOutputKeyClass(NullWritable.class) so that the key is not written to
> the output file.  This performs better than Pig: the Mappers run very fast,
> and although the Reducers take some time to complete, the approach seems to
> be working well.
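> 
> Roughly, the job looks like the sketch below (the class names and the
> random bucket key are illustrative rather than our exact code; it assumes
> plain-text part files read with the default TextInputFormat):
> 
> import java.io.IOException;
> import java.util.Random;
> 
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> 
> public class MergeSmallFiles {
> 
>   // Mapper: tag each line with a random bucket in [0, 50) so the lines
>   // spread evenly across the 50 reducers (the exact key scheme does not
>   // matter, only that it distributes lines across reducers).
>   public static class LineMapper
>       extends Mapper<LongWritable, Text, IntWritable, Text> {
>     private final Random rand = new Random();
>     private final IntWritable bucket = new IntWritable();
> 
>     @Override
>     protected void map(LongWritable offset, Text line, Context context)
>         throws IOException, InterruptedException {
>       bucket.set(rand.nextInt(50));
>       context.write(bucket, line);
>     }
>   }
> 
>   // Reducer: emit each line with a NullWritable key so only the value
>   // (the original line) is written to the output file.
>   public static class LineReducer
>       extends Reducer<IntWritable, Text, NullWritable, Text> {
>     @Override
>     protected void reduce(IntWritable bucket, Iterable<Text> lines,
>         Context context) throws IOException, InterruptedException {
>       for (Text line : lines) {
>         context.write(NullWritable.get(), line);
>       }
>     }
>   }
> 
>   public static void main(String[] args) throws Exception {
>     Job job = Job.getInstance(new Configuration(), "merge-small-files");
>     job.setJarByClass(MergeSmallFiles.class);
>     job.setMapperClass(LineMapper.class);
>     job.setReducerClass(LineReducer.class);
>     job.setNumReduceTasks(50);                  // 50 output part files
>     job.setMapOutputKeyClass(IntWritable.class);
>     job.setMapOutputValueClass(Text.class);
>     job.setOutputKeyClass(NullWritable.class);  // key is not written out
>     job.setOutputValueClass(Text.class);
>     FileInputFormat.addInputPath(job, new Path(args[0]));
>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>   }
> }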
> 
> Is there a better way to do this?  What strategy can you think of to
> increase the speed of the reducers?
> 
> Any help in this regard will be greatly appreciated.  Thanks.
