All the tables I want to join are too large for MapSideJoin to be an option. As David mentioned, cogroup may help in cases where there are a lot of values from a table for the same key. Is there a way to shard the values, or to read the values from the largest table iteratively, for this to work?

On Fri, Aug 31, 2018 at 12:16 PM Josh Wills <[email protected]> wrote:

> That is correct, the Cogroup will load all of the values for the key into
> memory-- is this not a situation where a combination of a
> MapSideJoinStrategy plus another JoinStrategy will do what you want?
>
> J
>
> On Thu, Aug 30, 2018 at 10:12 PM Suyash Agarwal <[email protected]>
> wrote:
>
>> So, if there are a lot of values for a key, will they all be loaded in
>> memory in the collection? If that's the case then I'll be running in
>> container OOM issues. Or, will the largest table be sharded?
>>
>> On Thu, Aug 30, 2018 at 6:22 PM David Ortiz <[email protected]> wrote:
>>
>>> CoGroup is your best bet to join multiple tables. They also are handy
>>> if you expect a lot of values from a table for the same key and don't want
>>> to blow up your collection size. The Collections are simply all the values
>>> from each table that matched the given key.
>>>
>>> On Thu, Aug 30, 2018 at 2:33 AM Suyash Agarwal <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Is there a way to join more than two PTables in a single MR job in
>>>> Apache Crunch?
>>>> I am unable to find an API which does that. And, using multiple Join
>>>> Strategies to have two join statements results in different MR jobs.
>>>> Cogroup API seems to take arbitrary PTables but I am not sure if that is
>>>> the way to go since they result in collection<> of the values of the joined
>>>> tables. I am not sure how these collections are different from iterables.
>>>>
>>>> Thanks.
>>>>
>>>
