On Thu, Jun 16, 2011 at 20:00, Daniel Dai <[email protected]> wrote:
> Try custom partitioner:
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby

AFAIK "partition by" maps to Hadoop's partitioning, which decides which
keys go to which reducer. That is a different problem from splitting the
output into multiple files. Hadoop In Action chapter 7.2 addresses
partitioning into multiple output files and highlights this difference.
The book shows a custom implementation of MultipleOutputFormat as a
solution.

Thomas

> On 06/16/2011 12:38 AM, Thomas Kappler wrote:
>>
>> Hi all,
>>
>> piggybank.storage.MultiStorage allows storing the Pig output into
>> different directories, taken from a given field in a relation, so that
>> the output is partitioned by the unique values of that field.
>>
>> This is just what I need for my use-case. However, I have about 50,000
>> unique values in the partitioning field. It seems that MultiStorage
>> will run one reducer per unique value, i.e., per output directory.
>> Obviously, this takes a long time.
>>
>> Is there a better way of doing it?
>>
>> I could group by the partitioning field and write a post-processing
>> script to go through the Pig output and write each line to a different
>> file. It would be simple, but I'd prefer to do it all in Pig for
>> consistency.
>>
>> Thanks,
>> Thomas
>
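
For reference, the post-processing fallback mentioned in the quoted message could look roughly like the sketch below. It is Python rather than Pig, and it rests on several assumptions not stated in the thread: the Pig output is tab-separated text, the partitioning value is the first field, and its values are safe to use as directory names. The function name `split_by_field` and the `part-00000` file name are made up for illustration; none of this is part of Pig or piggybank.

```python
import itertools
import os


def split_by_field(lines, out_dir, field_index=0, sep="\t"):
    """Write each line to out_dir/<key>/part-00000, keyed on one field.

    Assumes the lines arrive mostly grouped by key (e.g. after grouping
    on the partitioning field in Pig), so only one output file is open
    at a time even with tens of thousands of distinct keys. Files are
    opened in append mode, so a key that reappears later still ends up
    in the same file. Keys are assumed to be valid directory names.
    Returns the number of lines written.
    """
    def key_of(line):
        return line.rstrip("\n").split(sep)[field_index]

    written = 0
    # groupby yields one run of consecutive lines per key.
    for key, group in itertools.groupby(lines, key=key_of):
        key_dir = os.path.join(out_dir, key)
        os.makedirs(key_dir, exist_ok=True)
        with open(os.path.join(key_dir, "part-00000"), "a") as out:
            for line in group:
                # Normalize the line ending so files stay line-oriented.
                out.write(line if line.endswith("\n") else line + "\n")
                written += 1
    return written
```

It takes any iterable of lines, so an open file object over a local copy of the Pig output works as input. Because it appends, it should be run against an empty output directory to avoid duplicating lines on a rerun.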
