Adding ORDER BY is what I have done: I order each output by the same field that I am splitting by. Within each split output that field has the same value on every row, so essentially there's nothing to order! But this feels kludgy, which is why I asked. Thanks.
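
For reference, a minimal sketch of the workaround under discussion. The paths, relation names, schema, split-size value, and PARALLEL value are hypothetical placeholders, not taken from the actual script.

```pig
-- Script-wide setting: keep combined splits small so many mappers start at LOAD.
-- (Value in bytes; 64 MB here is only an illustrative assumption.)
SET pig.maxCombinedSplitSize 67108864;

-- Hypothetical input path and schema.
records = LOAD 'input/path' USING PigStorage('\t')
          AS (category:chararray, payload:chararray);

-- SPLIT on its own is map-only, so every mapper writes its own part file per branch.
SPLIT records INTO cat_a IF category == 'a',
                   cat_b IF category == 'b';

-- ORDER BY forces a reduce phase; PARALLEL sets the number of reducers,
-- which caps the number of part files written by the STORE that follows.
cat_a_sorted = ORDER cat_a BY category PARALLEL 10;
cat_b_sorted = ORDER cat_b BY category PARALLEL 10;

STORE cat_a_sorted INTO 'output/cat_a';
STORE cat_b_sorted INTO 'output/cat_b';
```

The trade-off is the one mentioned in the quoted reply: each ORDER BY adds extra MR jobs in exchange for fewer, larger output files.
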
On Sun, Dec 1, 2013 at 8:31 PM, Cheolsoo Park <[email protected]> wrote:

> Unfortunately, no. The settings are script-wide. Can you add an order-by
> before storing your output and set its parallel to a smaller number? That
> will force a reduce phase and combine small files. Of course, it will add
> extra MR jobs.
>
> On Sat, Nov 30, 2013 at 9:20 AM, Something Something <[email protected]> wrote:
>
> > Is there a way in Pig to change this configuration
> > (pig.maxCombinedSplitSize) at different steps inside the *same* Pig script?
> >
> > For example, when I am LOADing the data I want this value to be low so that
> > we use the block size effectively & many mappers get triggered. (Otherwise,
> > the job takes too long).
> >
> > But later when I SPLIT my output, I want split size to be large so we don't
> > create 4000 small output files. (SPLIT is a mapper only task).
> >
> > Is there a way to accomplish this?
