That is correct: the Cogroup will load all of the values for the key into
memory. Is this not a situation where a combination of a
MapsideJoinStrategy plus another JoinStrategy will do what you want?
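
To make the memory concern concrete, here is a rough sketch of what a
cogroup produces for each key. It is plain Java (9+) with no Crunch
dependency, and CogroupSketch, Grouped and cogroup are names made up for
this illustration, not the Crunch API. The point is only that every value
for a key ends up in a materialized collection, one per input table:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of cogroup semantics; not Crunch code.
public class CogroupSketch {

    /** Holds every value seen for one key, one collection per input table. */
    public static final class Grouped<U, V> {
        public final Collection<U> left = new ArrayList<>();
        public final Collection<V> right = new ArrayList<>();
    }

    /** Groups both tables by key; a key missing from one side gets an empty collection. */
    public static <K extends Comparable<K>, U, V> Map<K, Grouped<U, V>> cogroup(
            List<Map.Entry<K, U>> leftTable, List<Map.Entry<K, V>> rightTable) {
        Map<K, Grouped<U, V>> out = new TreeMap<>();
        for (Map.Entry<K, U> e : leftTable) {
            // All left-side values for a key are held in memory together.
            out.computeIfAbsent(e.getKey(), k -> new Grouped<U, V>()).left.add(e.getValue());
        }
        for (Map.Entry<K, V> e : rightTable) {
            out.computeIfAbsent(e.getKey(), k -> new Grouped<U, V>()).right.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> orders =
                List.of(Map.entry("a", 1), Map.entry("a", 2), Map.entry("b", 3));
        List<Map.Entry<String, String>> names =
                List.of(Map.entry("a", "alice"), Map.entry("c", "carol"));

        Map<String, Grouped<Integer, String>> grouped = cogroup(orders, names);
        // Prints:
        // a -> [1, 2] | [alice]
        // b -> [3] | []
        // c -> [] | [carol]
        grouped.forEach((k, v) -> System.out.println(k + " -> " + v.left + " | " + v.right));
    }
}
```

A key with millions of values on one side would fill the corresponding
list in this sketch, which is the OOM risk being asked about.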

J

On Thu, Aug 30, 2018 at 10:12 PM Suyash Agarwal <[email protected]>
wrote:

> So, if there are a lot of values for a key, will they all be loaded into
> memory in the collection? If that's the case, then I'll be running into
> container OOM issues. Or will the largest table be sharded?
>
> On Thu, Aug 30, 2018 at 6:22 PM David Ortiz <[email protected]> wrote:
>
>> CoGroup is your best bet for joining multiple tables.  It is also handy if
>> you expect a lot of values from a table for the same key and don't want to
>> blow up your collection size.  The Collections are simply all the values
>> from each table that matched the given key.
>>
>> On Thu, Aug 30, 2018 at 2:33 AM Suyash Agarwal <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to join more than two PTables in a single MR job in
>>> Apache Crunch?
>>> I am unable to find an API which does that. And, using multiple Join
>>> Strategies to have two join statements results in different MR jobs.
>>> The Cogroup API seems to take arbitrary PTables, but I am not sure if that
>>> is the way to go since it results in collection<> of the values of the
>>> joined tables. I am not sure how these collections are different from
>>> iterables.
>>>
>>> Thanks.
>>>
>>
