We used to take the first character of the partition field and run MultiStorage on that.
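
To make the idea concrete, here is a minimal sketch of the bucketing logic in plain Python, not Pig. It only shows how keying on the first character collapses many distinct partition values into a small number of output buckets. The function names and the sample values are hypothetical, and falling back to "_" for an empty value is an assumption rather than anything MultiStorage does.

```python
from collections import defaultdict


def bucket_key(value: str) -> str:
    """Return the first character of a partition value.

    Empty values fall back to "_" (an assumption made for this sketch).
    """
    return value[0] if value else "_"


def group_by_bucket(values):
    """Map each bucket key to the set of partition values that share it."""
    buckets = defaultdict(set)
    for v in values:
        buckets[bucket_key(v)].add(v)
    return dict(buckets)


# Hypothetical partition values; the real field had about 50,000 of them.
values = ["apple", "avocado", "banana", "blueberry", "cherry"]
buckets = group_by_bucket(values)

print(len(set(values)), "distinct values ->", len(buckets), "buckets")
```

Each bucket still mixes several partition values, so a later pass over each bucket would be needed if one output directory per value is required.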

Shawn

On Fri, Jun 17, 2011 at 4:18 AM, Thomas Kappler <[email protected]> wrote:
> On Thu, Jun 16, 2011 at 20:00, Daniel Dai <[email protected]> wrote:
>> Try custom partitioner:
>> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby
>
> AFAIK "partition by" maps to the Hadoop partitioning, which is about
> what keys go to which reducer, which is a different problem.
>
> Hadoop In Action chapter 7.2 addresses partitioning into multiple
> output files, and highlights this difference. The book shows a custom
> implementation of MultipleOutputFormat as a solution.
>
> Thomas
>
>
>> On 06/16/2011 12:38 AM, Thomas Kappler wrote:
>>>
>>> Hi all,
>>>
>>> piggybank.storage.MultiStorage allows storing the Pig output into
>>> different directories, taken from a given field in a relation, so that
>>> the output is partitioned by the unique values of that field.
>>>
>>> This is just what I need for my use-case. However, I have about 50,000
>>> unique values in the partitioning field. It seems that MultiStorage
>>> will run one reducer per unique value, i.e., per output directory.
>>> Obviously, this takes a long time.
>>>
>>> Is there a better way of doing it?
>>>
>>> I could group by the partitioning field and write a post-processing
>>> script to go through the Pig output and write each line to a different
>>> line. It would be simple, but I'd prefer to do it all in Pig for
>>> consistency.
>>>
>>> Thanks,
>>> Thomas
>>
>>
>
