This is a slightly crazy idea, however…

Do all 3 of your tables have different schemas?
If you can somehow cast your values to a generic type, then you can try to 
“union all” the 3 PTables into a single one and then join it to itself.
There will be some overhead because each table is also joined to itself. 
However, it’s possible to sort all these records out later using a single map 
operation.
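
A minimal sketch of the idea above, in plain Java with in-memory lists rather than Crunch PTables. `TaggedUnionJoin`, `Tagged`, `unionAll` and `join` are hypothetical names, not part of the Crunch API, and rows are assumed to be `{key, value}` string pairs. The sketch also uses one group-by-key over the union in place of the self-join, which is an assumption about how the idea could be applied. Each value is wrapped in a common type that remembers its source table, the three tables are concatenated, and one pass per key sorts the records back out:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaggedUnionJoin {
    /** Generic wrapper (hypothetical): key, value, and index (0..2) of the source table. */
    static final class Tagged {
        final String key;
        final int source;
        final String value;

        Tagged(String key, int source, String value) {
            this.key = key;
            this.source = source;
            this.value = value;
        }
    }

    /** "Union all": wrap every {key, value} row of every table in the common Tagged type. */
    static List<Tagged> unionAll(List<List<String[]>> tables) {
        List<Tagged> union = new ArrayList<>();
        for (int i = 0; i < tables.size(); i++) {
            for (String[] row : tables.get(i)) {
                union.add(new Tagged(row[0], i, row[1]));
            }
        }
        return union;
    }

    /** Inner join of three tables using a single group-by-key over the union. */
    static List<List<String>> join(List<String[]> a, List<String[]> b, List<String[]> c) {
        // One grouping step stands in for the single shuffle.
        Map<String, List<Tagged>> grouped = new TreeMap<>();
        for (Tagged t : unionAll(Arrays.asList(a, b, c))) {
            grouped.computeIfAbsent(t.key, k -> new ArrayList<>()).add(t);
        }

        List<List<String>> result = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : grouped.entrySet()) {
            // Sort the mixed records back into one bucket per source table.
            List<List<String>> buckets = new ArrayList<>();
            for (int i = 0; i < 3; i++) {
                buckets.add(new ArrayList<>());
            }
            for (Tagged t : e.getValue()) {
                buckets.get(t.source).add(t.value);
            }
            // Emit one row per combination; keys missing from any table emit nothing.
            for (String x : buckets.get(0)) {
                for (String y : buckets.get(1)) {
                    for (String z : buckets.get(2)) {
                        result.add(Arrays.asList(e.getKey(), x, y, z));
                    }
                }
            }
        }
        return result;
    }
}
```

The per-key buckets are still held in memory here, so this shows the shape of the approach, not a fix for very large value lists.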

Thanks,
Dmitry.
From: Suyash Agarwal <[email protected]>
Reply-To: "[email protected]"
Date: Friday, August 31, 2018 at 11:04
To: "[email protected]"
Subject: Re: Joining more than two PTables in a single MR job

All the tables I want to join are large enough to rule out MapSideJoin as an 
option. As David mentioned, cogroup may help in cases where we have a lot of 
values from a table for the same key. Is there a way to shard the values or 
iteratively read the values from the largest table for this to work?


On Fri, Aug 31, 2018 at 12:16 PM Josh Wills <[email protected]> wrote:
That is correct, the Cogroup will load all of the values for the key into 
memory. Is this not a situation where a combination of a MapSideJoinStrategy 
plus another JoinStrategy will do what you want?

J

On Thu, Aug 30, 2018 at 10:12 PM Suyash Agarwal <[email protected]> wrote:
So, if there are a lot of values for a key, will they all be loaded into memory 
in the collection? If that's the case, then I'll run into container OOM 
issues. Or will the largest table be sharded?

On Thu, Aug 30, 2018 at 6:22 PM David Ortiz <[email protected]> wrote:
CoGroup is your best bet to join multiple tables.  It is also handy if you 
expect a lot of values from a table for the same key and don't want to blow up 
your collection size.  The Collections are simply all the values from each 
table that matched the given key.
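
To illustrate the shape of that result, here is a plain-Java model of a three-way cogroup over in-memory lists. It is not Crunch code: `CogroupModel` and its methods are hypothetical, and rows are assumed to be `{key, value}` string pairs. For every key it builds one collection per input table, leaving a collection empty when that table has no match:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CogroupModel {
    /** Create one empty value bucket per input table. */
    static List<List<String>> emptyBuckets() {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            buckets.add(new ArrayList<>());
        }
        return buckets;
    }

    /** Group three keyed tables at once: key -> [values from a, values from b, values from c]. */
    static Map<String, List<List<String>>> cogroup(
            List<String[]> a, List<String[]> b, List<String[]> c) {
        Map<String, List<List<String>>> out = new TreeMap<>();
        List<List<String[]>> tables = Arrays.asList(a, b, c);
        for (int i = 0; i < tables.size(); i++) {
            for (String[] row : tables.get(i)) {
                // row[0] is the key, row[1] the value; bucket i belongs to table i.
                out.computeIfAbsent(row[0], k -> emptyBuckets()).get(i).add(row[1]);
            }
        }
        return out;
    }
}
```

In this model every bucket for a key is fully materialized, so a key with very many values needs memory proportional to all of them.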

On Thu, Aug 30, 2018 at 2:33 AM Suyash Agarwal <[email protected]> wrote:
Hi,

Is there a way to join more than two PTables in a single MR job in Apache 
Crunch?
I am unable to find an API which does that, and using multiple Join Strategies 
to have two join statements results in different MR jobs. The Cogroup API seems 
to take arbitrary PTables, but I am not sure if that is the way to go since it 
results in a collection<> of the values of the joined tables. I am not sure how 
these collections are different from iterables.

Thanks.
