That is correct: the Cogroup will load all of the values for the key into memory. Is this not a situation where a combination of a MapsideJoinStrategy plus another JoinStrategy will do what you want?
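To make the memory point concrete, here is a minimal plain-Java sketch of what a three-way cogroup produces for each key. It deliberately does not use the Crunch API: `CogroupSketch`, `Row`, `Grouped` and the sample tables are hypothetical names made up for illustration, and a real job would do this grouping in the reducer rather than in a single map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CogroupSketch {

    /** One (key, value) row of a hypothetical input table. */
    static final class Row {
        final String key;
        final String value;

        Row(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Everything that matched one key: one in-memory list per input table. */
    static final class Grouped {
        final List<String> first = new ArrayList<>();
        final List<String> second = new ArrayList<>();
        final List<String> third = new ArrayList<>();
    }

    /** Groups three tables by key; every value for a key is held in a list. */
    static Map<String, Grouped> cogroup(List<Row> a, List<Row> b, List<Row> c) {
        Map<String, Grouped> out = new TreeMap<>();
        for (Row r : a) {
            out.computeIfAbsent(r.key, k -> new Grouped()).first.add(r.value);
        }
        for (Row r : b) {
            out.computeIfAbsent(r.key, k -> new Grouped()).second.add(r.value);
        }
        for (Row r : c) {
            out.computeIfAbsent(r.key, k -> new Grouped()).third.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data; "u1" has two rows in the second table.
        List<Row> users = Arrays.asList(new Row("u1", "alice"), new Row("u2", "bob"));
        List<Row> orders = Arrays.asList(new Row("u1", "order-1"), new Row("u1", "order-2"));
        List<Row> visits = Arrays.asList(new Row("u2", "visit-1"));

        for (Map.Entry<String, Grouped> e : cogroup(users, orders, visits).entrySet()) {
            Grouped g = e.getValue();
            System.out.println(e.getKey() + " -> " + g.first + " " + g.second + " " + g.third);
        }
    }
}
```

In this model, a key with a very large number of rows in one table makes that key's list grow with the row count, which is the container OOM concern raised in the quoted message.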
J

On Thu, Aug 30, 2018 at 10:12 PM Suyash Agarwal <[email protected]> wrote:

> So, if there are a lot of values for a key, will they all be loaded in
> memory in the collection? If that's the case then I'll be running in
> container OOM issues. Or, will the largest table be sharded?
>
> On Thu, Aug 30, 2018 at 6:22 PM David Ortiz <[email protected]> wrote:
>
>> CoGroup is your best bet to join multiple tables. They also are handy if
>> you expect a lot of values from a table for the same key and don't want to
>> blow up your collection size. The Collections are simply all the values
>> from each table that matched the given key.
>>
>> On Thu, Aug 30, 2018 at 2:33 AM Suyash Agarwal <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to join more than two PTables in a single MR job in
>>> Apache Crunch?
>>> I am unable to find an API which does that. And, using multiple Join
>>> Strategies to have two join statements results in different MR jobs.
>>> Cogroup API seems to take arbitrary PTables but I am not sure if that is
>>> the way to go since they result in collection<> of the values of the joined
>>> tables. I am not sure how these collections are different from iterables.
>>>
>>> Thanks.
