Hi, I was comparing the performance of a Hadoop job that I wrote in Java to one that I wrote in Pig, over ~106,000 small (<1 MB) input files. My Java job gets one split per file, which is really inefficient; Pig processes the same input in 49 splits, which is much faster.
How does Pig do this? Is there a piece of the source code you can point me to? I've been banging my head against how to combine multiple S3 objects into a single split. Thanks, Brian
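P.S. For context, this is roughly what I've been trying on the Java side: a minimal sketch of a CombineFileInputFormat subclass that packs many small text files into each split. The class and wrapper names are mine, and I'm assuming the new-API classes from org.apache.hadoop.mapreduce.lib.input, so treat it as a sketch rather than working code:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Packs many small files into each InputSplit instead of one split per file.
public class CombinedTextInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

    public CombinedTextInputFormat() {
        // Cap each combined split at 128 MB; without a max size,
        // files can collapse into very few, very large splits.
        setMaxSplitSize(128L * 1024 * 1024);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader iterates over the files in the combined
        // split, opening a fresh per-file reader for each one.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, TextWrapper.class);
    }

    // Adapts the stock line-oriented TextInputFormat reader to the
    // (split, context, index) constructor CombineFileRecordReader expects.
    public static class TextWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
        public TextWrapper(CombineFileSplit split, TaskAttemptContext context,
                Integer idx) throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, idx);
        }
    }
}
```

In the driver I then set job.setInputFormatClass(CombinedTextInputFormat.class). I believe the cap can also come from mapreduce.input.fileinputformat.split.maxsize instead of being hard-coded, but I haven't verified that on S3 inputs.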

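P.S. From what I can tell, Pig's side of this is its split-combination feature, controlled by a couple of job properties. This is just my reading of the docs, so corrections welcome:

```properties
# Pig combines small input splits by default.
pig.splitCombination=true
# Upper bound, in bytes, on each combined split (128 MB here).
pig.maxCombinedSplitSize=134217728
```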