I have the same question as Thomas Kappler. It would be kind if someone could explain in more detail the 'custom partitioner' that Daniel Dai mentioned. I think the docs at 'piglatin_ref2.html#partitionby' are too brief.
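
As far as I can tell, a custom partitioner only decides which reducer each key is sent to, while the number of reducers is set separately. Below is a rough standalone sketch of that idea, not Pig's own code. The class and method names (FieldHashPartitionSketch, keysPerPartition) are made up for illustration. getPartition mirrors the method of the same name on Hadoop's Partitioner class, but it takes a plain String key here so the example runs without Hadoop or Pig on the classpath.

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch: no Hadoop or Pig classes are used here. In a real job the
// same logic would live in a subclass of org.apache.hadoop.mapreduce.Partitioner.
class FieldHashPartitionSketch {

    // Map a key to a reducer index in [0, numPartitions).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Count how many of the given keys land on each reducer.
    static int[] keysPerPartition(List<String> keys, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String key : keys) {
            counts[getPartition(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical values of the partitioning field.
        List<String> keys = Arrays.asList("alpha", "beta", "gamma", "delta");
        int numPartitions = 2;
        for (String key : keys) {
            System.out.println(key + " -> reducer " + getPartition(key, numPartitions));
        }
        System.out.println(Arrays.toString(keysPerPartition(keys, numPartitions)));
    }
}
```

With a scheme like this, many distinct field values share a fixed number of reducers instead of each value needing its own. In Pig the class would be named in the PARTITION BY clause, together with PARALLEL to set the reducer count, for example: B = GROUP A BY field PARTITION BY com.example.MyPartitioner PARALLEL 20; (com.example.MyPartitioner is a hypothetical name). Whether this removes the one-reducer-per-value behaviour described for MultiStorage is something I have not verified.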

2011/6/17 Daniel Dai <[email protected]>

> Try custom partitioner:
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby
>
> Daniel
>
> On 06/16/2011 12:38 AM, Thomas Kappler wrote:
>
>> Hi all,
>>
>> piggybank.storage.MultiStorage allows storing the Pig output into
>> different directories, taken from a given field in a relation, so that
>> the output is partitioned by the unique values of that field.
>>
>> This is just what I need for my use-case. However, I have about 50,000
>> unique values in the partitioning field. It seems that MultiStorage
>> will run one reducer per unique value, i.e., per output directory.
>> Obviously, this takes a long time.
>>
>> Is there a better way of doing it?
>>
>> I could group by the partitioning field and write a post-processing
>> script to go through the Pig output and write each line to a different
>> line. It would be simple, but I'd prefer to do it all in Pig for
>> consistency.
>>
>> Thanks,
>> Thomas
