Hi, I was comparing the performance of a Hadoop job that I wrote in Java to one that I wrote in Pig, over ~106,000 small (<1 MB) input files. My Java job gets one split per file, which is really inefficient; Pig processes the same input in 49 splits, which is much faster.
How does Pig do this? Is there a piece of the source code you can point me to? I've been banging my head against how to combine multiple S3 objects into a single split. Thanks, Brian
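P.S. For context, this is roughly what I've been trying on the Java side: a minimal sketch of a CombineFileInputFormat subclass that packs many small text files into each split. The class and wrapper names are mine, and I'm assuming the new-API classes from org.apache.hadoop.mapreduce.lib.input, so treat it as a sketch rather than working code:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReaderWrapper;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Packs many small files into each InputSplit instead of one split per file.
public class CombinedTextInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

    public CombinedTextInputFormat() {
        // Cap each combined split at 128 MB; without a max size,
        // files can collapse into very few, very large splits.
        setMaxSplitSize(128L * 1024 * 1024);
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader iterates over the files in the combined
        // split, opening a fresh per-file reader for each one.
        return new CombineFileRecordReader<LongWritable, Text>(
                (CombineFileSplit) split, context, TextWrapper.class);
    }

    // Adapts the stock line-oriented TextInputFormat reader to the
    // (split, context, index) constructor CombineFileRecordReader expects.
    public static class TextWrapper
            extends CombineFileRecordReaderWrapper<LongWritable, Text> {
        public TextWrapper(CombineFileSplit split, TaskAttemptContext context,
                Integer idx) throws IOException, InterruptedException {
            super(new TextInputFormat(), split, context, idx);
        }
    }
}
```

In the driver I then set job.setInputFormatClass(CombinedTextInputFormat.class). I believe the cap can also come from mapreduce.input.fileinputformat.split.maxsize instead of being hard-coded, but I haven't verified that on S3 inputs.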

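P.S. From what I can tell, Pig's side of this is its split-combination feature, controlled by a couple of job properties. This is just my reading of the docs, so corrections welcome:

```properties
# Pig combines small input splits by default.
pig.splitCombination=true
# Upper bound, in bytes, on each combined split (128 MB here).
pig.maxCombinedSplitSize=134217728
```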