Re: Joining more than two PTables in a single MR job

Suyash Agarwal Fri, 31 Aug 2018 01:06:02 -0700

All the tables I want to join are too large for MapSideJoin to be an
option. As David mentioned, cogroup may help when there are a lot of values
from a table for the same key. Is there a way to shard the values, or to
iteratively read the values from the largest table, for this to work?
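
One general way to bound the number of values held per key is to salt the
key of the largest table, so that a hot key is spread over several reduce
groups, and to copy the matching rows of the other tables to every shard.
The sketch below is plain Java and only illustrates that idea. It is not
Crunch API, and NUM_SHARDS, saltLarge, saltSmall and groupLarge are
hypothetical names. Replicating the other tables NUM_SHARDS times has its
own cost when they are also large.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of key salting; no Crunch classes are used. */
final class SaltedJoinSketch {
    /** Hypothetical shard count; a real job would tune this to the skew. */
    static final int NUM_SHARDS = 4;

    /** Largest table: each record goes to exactly one shard of its key. */
    static String saltLarge(String key, String value) {
        int shard = Math.floorMod(value.hashCode(), NUM_SHARDS);
        return key + "#" + shard;
    }

    /** Other tables: each record is copied to every shard of its key. */
    static List<String> saltSmall(String key) {
        List<String> keys = new ArrayList<>();
        for (int shard = 0; shard < NUM_SHARDS; shard++) {
            keys.add(key + "#" + shard);
        }
        return keys;
    }

    /** Groups one key's values by salted key, as the shuffle would. */
    static Map<String, List<String>> groupLarge(String key, List<String> values) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String v : values) {
            groups.computeIfAbsent(saltLarge(key, v), k -> new ArrayList<>()).add(v);
        }
        return groups;
    }

    public static void main(String[] args) {
        // 1000 values for a single hot key "k".
        List<String> hot = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            hot.add("v" + i);
        }
        // Each salted group holds only part of the hot key's values.
        Map<String, List<String>> groups = groupLarge("k", hot);
        groups.forEach((k, vs) -> System.out.println(k + " -> " + vs.size() + " values"));
    }
}
```

Each salted key can then be joined independently, and the "#shard" suffix is
dropped from the key afterwards.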



On Fri, Aug 31, 2018 at 12:16 PM Josh Wills <[email protected]> wrote:

> That is correct, the Cogroup will load all of the values for the key into
> memory-- is this not a situation where a combination of a
> MapSideJoinStrategy plus another JoinStrategy will do what you want?
>
> J
>
> On Thu, Aug 30, 2018 at 10:12 PM Suyash Agarwal <[email protected]>
> wrote:
>
>> So, if there are a lot of values for a key, will they all be loaded into
>> memory in the collection? If that's the case, then I'll run into
>> container OOM issues. Or will the largest table be sharded?
>>
>> On Thu, Aug 30, 2018 at 6:22 PM David Ortiz <[email protected]> wrote:
>>
>>> CoGroup is your best bet for joining multiple tables.  It is also handy
>>> if you expect a lot of values from a table for the same key and don't want
>>> to blow up your collection size.  The Collections are simply all the values
>>> from each table that matched the given key.
>>>
>>> On Thu, Aug 30, 2018 at 2:33 AM Suyash Agarwal <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is there a way to join more than two PTables in a single MR job in
>>>> Apache Crunch?
>>>> I am unable to find an API which does that. And, using multiple Join
>>>> Strategies to have two join statements results in different MR jobs.
>>>> The Cogroup API seems to take an arbitrary number of PTables, but I am
>>>> not sure whether that is the way to go, since it produces a Collection<>
>>>> of the values from each joined table. I am not sure how these
>>>> collections differ from iterables.
>>>>
>>>> Thanks.
>>>>
>>>
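
For readers following the thread, the cogroup semantics under discussion can
be modelled in plain Java as below. This is only an illustration, assuming
each table is a list of key/value pairs. CogroupSketch, Grouped, cogroup and
innerJoin are made-up names, not Crunch classes. It shows why all values for
a key end up in memory together: the join is a cross product of the per-key
collections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of a three-way cogroup; it does not use Crunch classes. */
final class CogroupSketch {
    /** Holds, for one key, the values contributed by each of three tables. */
    static final class Grouped {
        final List<String> first = new ArrayList<>();
        final List<String> second = new ArrayList<>();
        final List<String> third = new ArrayList<>();
    }

    /** Groups three tables of {key, value} pairs by key. */
    static Map<String, Grouped> cogroup(List<String[]> a, List<String[]> b, List<String[]> c) {
        Map<String, Grouped> out = new TreeMap<>();
        for (String[] kv : a) {
            out.computeIfAbsent(kv[0], k -> new Grouped()).first.add(kv[1]);
        }
        for (String[] kv : b) {
            out.computeIfAbsent(kv[0], k -> new Grouped()).second.add(kv[1]);
        }
        for (String[] kv : c) {
            out.computeIfAbsent(kv[0], k -> new Grouped()).third.add(kv[1]);
        }
        return out;
    }

    /** Inner join: cross product of the three per-key collections. */
    static List<String> innerJoin(Map<String, Grouped> grouped) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Grouped> e : grouped.entrySet()) {
            Grouped g = e.getValue();
            for (String x : g.first) {
                for (String y : g.second) {
                    for (String z : g.third) {
                        rows.add(e.getKey() + ":" + x + "," + y + "," + z);
                    }
                }
            }
        }
        return rows;
    }
}
```

A key that is missing from any one table yields no rows in this inner join,
because one of the three collections is empty.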
