Re: Joining more than two PTables in a single MR job

Suyash Agarwal Thu, 30 Aug 2018 22:13:18 -0700

So, if there are a lot of values for a key, will they all be loaded into
memory in the collection? If that's the case, I'll run into container
OOM issues. Or will the largest table be sharded?


On Thu, Aug 30, 2018 at 6:22 PM David Ortiz <[email protected]> wrote:

> CoGroup is your best bet to join multiple tables.  It is also handy if
> you expect a lot of values from a table for the same key and don't want to
> blow up your collection size.  The Collections are simply all the values
> from each table that matched the given key.
>
> On Thu, Aug 30, 2018 at 2:33 AM Suyash Agarwal <[email protected]>
> wrote:
>
>> Hi,
>>
>> Is there a way to join more than two PTables in a single MR job in Apache
>> Crunch?
>> I am unable to find an API that does that, and using multiple join
>> strategies to chain two join statements results in separate MR jobs.
>> The Cogroup API seems to take arbitrary PTables, but I am not sure if that
>> is the way to go, since it produces a Collection<> of values for each of
>> the joined tables. I am not sure how these Collections differ from Iterables.
>>
>> Thanks.
>>
>
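
To make the description of the Collections in the quoted reply concrete, here
is a minimal in-memory sketch of what a three-way cogroup produces: for each
key, one collection of values per input table. It is plain Java and does not
use the Crunch API; the class CogroupSketch, the holder Grouped, and the
method cogroup are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: mimics the shape of a three-way cogroup result in memory.
// None of these names come from Apache Crunch.
class CogroupSketch {

    // Hypothetical holder: the values found under one key in each of three tables.
    static final class Grouped {
        final Collection<String> first = new ArrayList<>();
        final Collection<String> second = new ArrayList<>();
        final Collection<String> third = new ArrayList<>();
    }

    // Groups the (key, value) pairs of three tables by key.
    // A key that is missing from a table simply gets an empty collection for it.
    static Map<String, Grouped> cogroup(
            List<Map.Entry<String, String>> t1,
            List<Map.Entry<String, String>> t2,
            List<Map.Entry<String, String>> t3) {
        Map<String, Grouped> out = new TreeMap<>();
        for (Map.Entry<String, String> e : t1) {
            out.computeIfAbsent(e.getKey(), k -> new Grouped()).first.add(e.getValue());
        }
        for (Map.Entry<String, String> e : t2) {
            out.computeIfAbsent(e.getKey(), k -> new Grouped()).second.add(e.getValue());
        }
        for (Map.Entry<String, String> e : t3) {
            out.computeIfAbsent(e.getKey(), k -> new Grouped()).third.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Grouped> result = cogroup(
                List.of(Map.entry("k1", "a1"), Map.entry("k1", "a2")),
                List.of(Map.entry("k1", "b1")),
                List.of(Map.entry("k2", "c1")));
        // Print each key with its three per-table collections.
        for (Map.Entry<String, Grouped> e : result.entrySet()) {
            Grouped g = e.getValue();
            System.out.println(e.getKey() + " -> " + g.first + " " + g.second + " " + g.third);
        }
    }
}
```

In this sketch every collection is fully built before the caller sees it, so
memory use grows with the number of values per key. That is the shape of the
OOM concern raised above; whether Crunch itself holds the values this way is
not confirmed in this thread.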
