I have been using a replicated join to join a very large dataset with
another one that is about 1000x smaller, and I have generally seen large
performance gains.

However, they scale together, so that now, even though the RHS table is
still 1000x smaller, it is too large to fit into memory.  This happens
on only every 20th or so dataset that the join is performed on, but I'd
like to have something robust built to handle it.

Is there any way to set up the replicated join to fall back to a regular
join only on memory issues?  Or any type of conditional I could set to
check the dataset size first?  I'm willing to even dig into the Pig code
and implement this if anyone has ideas.
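
For the size-check approach, the kind of thing I had in mind is a small
driver that asks HDFS for the size of the small relation and then picks
one of two Pig scripts.  This is just a sketch to show the idea -- the
paths, threshold, and script names are placeholders, not anything from
an existing setup:

    #!/usr/bin/env python
    # Sketch of a driver: check the on-disk size of the small relation,
    # then run a Pig script that either does or does not use the
    # fragment-replicate join.
    import subprocess

    SMALL_INPUT = '/data/small_relation'    # placeholder HDFS path
    THRESHOLD_BYTES = 512 * 1024 * 1024     # rough fit-in-memory cutoff

    def hdfs_size(path):
        # `hadoop fs -du -s <path>` prints the total byte count first
        out = subprocess.check_output(['hadoop', 'fs', '-du', '-s', path])
        return int(out.split()[0])

    if hdfs_size(SMALL_INPUT) < THRESHOLD_BYTES:
        # small side fits: script uses JOIN ... USING 'replicated'
        script = 'join_replicated.pig'
    else:
        # too big: script uses a plain JOIN with no memory constraint
        script = 'join_regular.pig'

    subprocess.check_call(['pig', script])

The two .pig scripts would be identical except for the USING
'replicated' clause on the JOIN.  The on-disk size is only a proxy for
in-memory size, though, which is why I'd rather have a real fallback
inside Pig itself if one is possible.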

Thanks

Arun
