From what I understand, 'replicated' should not be used with a full outer join, since full outer means that records from both tables appear in the output whether or not they have a match in the other table. In your case you only need to keep every session, which is a left outer join, not a full outer join.
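To make the difference concrete, here is a rough Python sketch of a map-side join. This is a toy model, not Pig's actual code: the function names, the two "splits" and the sample data are all made up for illustration. It assumes each mapper receives one split of the big relation (sessions) plus a full copy of the small one (locations).

```python
# Toy simulation of a replicated (map-side) join. Hypothetical names and data;
# this only illustrates the idea and is not how Pig is implemented internally.

def map_side_left_join(split, small_table):
    """One mapper: join its split of the big relation against the whole small table."""
    # Build an in-memory lookup from the small (replicated) table.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    out = []
    for key, value in split:
        matches = lookup.get(key)
        if matches:
            out.extend((key, value, m) for m in matches)
        else:
            # Left outer: the mapper can decide this locally, record by record.
            out.append((key, value, None))
    return out


def locally_unmatched_keys(split, small_table):
    """Small-table keys that THIS mapper never saw in its own split."""
    seen = {key for key, _ in split}
    return {key for key, _ in small_table if key not in seen}


# The big relation is divided between two mappers.
sessions_split_1 = [(1, "s1"), (2, "s2")]
sessions_split_2 = [(3, "s3"), (9, "s4")]
# The small relation is copied in full to every mapper.
locations = [(1, "Paris"), (2, "Lyon"), (3, "Nice"), (4, "Lille")]

result = (map_side_left_join(sessions_split_1, locations)
          + map_side_left_join(sessions_split_2, locations))

# What each mapper would believe is "unmatched" on the small side.
unmatched_1 = locally_unmatched_keys(sessions_split_1, locations)
unmatched_2 = locally_unmatched_keys(sessions_split_2, locations)
```

Session 9 has no location and is still emitted with a null, which is all a left outer join needs. Only location 4 is really unmatched, but mapper 1 would also report 3 and mapper 2 would also report 1 and 2, so a right or full outer join cannot be decided inside a single mapper.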
The reason is the semantics of this join in Pig and Hadoop: the "small" table is loaded into memory in each mapper and is only used for lookups, so it is not meant to contribute records to the output on its own.

Alex

On Jan 27, 2012, at 8:15 AM, Vincent Barat <[email protected]> wrote:

> Hi folks,
>
> I use replicated joins, and recently I encountered an issue: my rightmost
> relation seems to have become too big and, even though I don't get any "Java
> heap space" error, the time it takes to finish the maps becomes exponentially
> long (I cannot figure out exactly why).
>
> Removing "replicated" fixes the issue, but it raises several questions.
>
> In Alan's book, "Figure 8.1. Choosing a Join Implementation" says that
> replicated joins should NOT BE USED for outer joins.
>
> Nevertheless, it seems to work in the following case, and is faster than
> regular joins. So why?
>
> sessions = JOIN sessions BY locid LEFT, locations BY locid USING 'replicated';
>
> (not all sessions have a location in this case)
>
> Thanks for your advice.
