A slightly crazy idea, however… Do all 3 of your tables have different schemas? If you can somehow cast your values to a generic type, you can try to “union all” the 3 PTables into a single one and then join that table to itself. There will be some overhead, because each table's records also end up joined to themselves. However, it’s possible to sort all these records out later with a single map operation.
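To make the idea concrete, here is a minimal plain-Java sketch of the tag-and-union approach, with no Crunch calls at all. Everything in it is an assumption for illustration: the `Tagged` class, the `unionAll` and `sortOut` helpers, the sample tables, and the choice of `String` as the generic type. Note also that it groups the unioned records by key instead of self-joining them, which is a simplification of the approach described above, and that the in-memory grouping only stands in for the shuffle, so it says nothing about memory use in a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaggedUnionJoin {

    /** Hypothetical generic record: key, source-table index, and the value cast to String. */
    static final class Tagged {
        final String key;
        final int table;
        final String value;

        Tagged(String key, int table, String value) {
            this.key = key;
            this.table = table;
            this.value = value;
        }
    }

    /** Step 1: tag every (key, value) row with its table index and "union all" the tables. */
    static List<Tagged> unionAll(List<List<String[]>> tables) {
        List<Tagged> all = new ArrayList<>();
        for (int t = 0; t < tables.size(); t++) {
            for (String[] row : tables.get(t)) {
                all.add(new Tagged(row[0], t, row[1]));
            }
        }
        return all;
    }

    /** Step 2: group by key, then sort each value back into its per-table bucket in one pass. */
    static Map<String, List<List<String>>> sortOut(List<Tagged> all, int tableCount) {
        Map<String, List<List<String>>> byKey = new TreeMap<>();
        for (Tagged r : all) {
            List<List<String>> buckets = byKey.computeIfAbsent(r.key, k -> {
                List<List<String>> b = new ArrayList<>();
                for (int i = 0; i < tableCount; i++) {
                    b.add(new ArrayList<>());
                }
                return b;
            });
            buckets.get(r.table).add(r.value);
        }
        // Inner-join semantics: keep only the keys that appear in every table.
        byKey.values().removeIf(b -> b.stream().anyMatch(List::isEmpty));
        return byKey;
    }

    public static void main(String[] args) {
        // Made-up sample tables; each row is {key, value}.
        List<String[]> users = Arrays.asList(new String[]{"k1", "alice"}, new String[]{"k2", "bob"});
        List<String[]> orders = Arrays.asList(new String[]{"k1", "order-7"}, new String[]{"k1", "order-9"});
        List<String[]> cities = Arrays.asList(new String[]{"k1", "Pune"}, new String[]{"k2", "Delhi"});

        Map<String, List<List<String>>> joined =
                sortOut(unionAll(Arrays.asList(users, orders, cities)), 3);
        System.out.println(joined); // {k1=[[alice], [order-7, order-9], [Pune]]}
    }
}
```

The table index carried in each record is what lets one pass over the grouped values rebuild the per-table lists, so the three schemas only need to agree on the key and on some common value encoding.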
Thanks,
Dmitry.

From: Suyash Agarwal <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, August 31, 2018 at 11:04
To: "[email protected]" <[email protected]>
Subject: Re: Joining more than two PTables in a single MR job

All the tables I want to join are huge enough to discard MapSideJoin as an option. As David mentioned, cogroup may help in cases when we have lot of values from a table for the same key, is there a way to shard the values or iteratively read the values from the largest table for this to work?

On Fri, Aug 31, 2018 at 12:16 PM Josh Wills <[email protected]> wrote:

That is correct, the Cogroup will load all of the values for the key into memory-- is this not a situation where a combination of a MapSideJoinStrategy plus another JoinStrategy will do what you want?

J

On Thu, Aug 30, 2018 at 10:12 PM Suyash Agarwal <[email protected]> wrote:

So, if there are a lot of values for a key, will they all be loaded in memory in the collection? If that's the case then I'll be running in container OOM issues. Or, will the largest table be sharded?

On Thu, Aug 30, 2018 at 6:22 PM David Ortiz <[email protected]> wrote:

CoGroup is your best bet to join multiple tables. They also are handy if you expect a lot of values from a table for the same key and don't want to blow up your collection size. The Collections are simply all the values from each table that matched the given key.

On Thu, Aug 30, 2018 at 2:33 AM Suyash Agarwal <[email protected]> wrote:

Hi,

Is there a way to join more than two PTables in a single MR job in Apache Crunch? I am unable to find an API which does that. And, using multiple Join Strategies to have two join statements results in different MR jobs.
The Cogroup API seems to take arbitrary PTables, but I am not sure if that is the way to go, since it results in a collection<> of the values from each of the joined tables. I am not sure how these collections are different from iterables.

Thanks.
