Re: MultiStorage for many key values

Thomas Kappler Fri, 17 Jun 2011 03:19:49 -0700

On Thu, Jun 16, 2011 at 20:00, Daniel Dai <[email protected]> wrote:
> Try custom partitioner:
> http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby


AFAIK "partition by" maps to Hadoop partitioning, which decides which
keys go to which reducer. That is a different problem from choosing
which output file each record is written to.

Hadoop In Action chapter 7.2 addresses partitioning into multiple
output files, and highlights this difference. The book shows a custom
implementation of MultipleOutputFormat as a solution.
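
To make the distinction concrete, here is a minimal sketch in plain Java
with no Hadoop dependency. The class and method names are made up for
illustration and are not taken from the book or from piggybank. The
reducer formula is the one Hadoop's default HashPartitioner uses; the
path layout (one directory per key) is only an assumption about what a
MultipleOutputFormat subclass might return from
generateFileNameForKeyValue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionVsOutput {

    // Partitioning: decides which reducer receives a key. Same formula as
    // Hadoop's default HashPartitioner.
    static int reducerFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Output naming: decides which file a record is written to.
    // Hypothetical layout with one directory per key.
    static String outputPathFor(String key, String leafName) {
        return key + "/" + leafName;
    }

    // Groups the output paths by the reducer that would write them.
    static Map<Integer, List<String>> pathsPerReducer(List<String> keys, int numReducers) {
        Map<Integer, List<String>> result = new TreeMap<>();
        for (String key : keys) {
            int reducer = reducerFor(key, numReducers);
            String leaf = String.format("part-%05d", reducer);
            result.computeIfAbsent(reducer, r -> new ArrayList<>())
                  .add(outputPathFor(key, leaf));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e");
        pathsPerReducer(keys, 2).forEach((reducer, paths) ->
            System.out.println("reducer " + reducer + " writes " + paths));
    }
}
```

With five keys and two reducers, each reducer ends up writing to several
key directories, so the number of output directories does not have to
match the number of reducers.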

Thomas


> On 06/16/2011 12:38 AM, Thomas Kappler wrote:
>>
>> Hi all,
>>
>> piggybank.storage.MultiStorage allows storing the Pig output into
>> different directories, taken from a given field in a relation, so that
>> the output is partitioned by the unique values of that field.
>>
>> This is just what I need for my use-case. However, I have about 50,000
>> unique values in the partitioning field. It seems that MultiStorage
>> will run one reducer per unique value, i.e., per output directory.
>> Obviously, this takes a long time.
>>
>> Is there a better way of doing it?
>>
>> I could group by the partitioning field and write a post-processing
>> script to go through the Pig output and write each line to a different
>> file. It would be simple, but I'd prefer to do it all in Pig for
>> consistency.
>>
>> Thanks,
>> Thomas
>
>
