I have been using a replicated join to join a very large dataset with another one that is about 1000x smaller, and have generally seen large performance gains.
However, the two scale together, so even though the RHS table is still 1000x smaller, it is now sometimes too large to fit in memory. This happens on only every 20th or so dataset the join runs on, but I'd like to have something robust built to handle it. Is there any way to set up the replicated join to fall back to a regular join only when it hits memory issues? Or any kind of conditional I could set to check the dataset size first? I'm even willing to dig into the Pig code and implement this myself if anyone has ideas.
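The best idea I've had so far for the size check is a wrapper script that measures the RHS on HDFS and picks the join strategy via parameter substitution. A rough, untested sketch is below; RHS_PATH, join.pig, and the 1 GB cutoff are all placeholders, and the cutoff would need tuning against the actual task heap:

    # check RHS size on HDFS before deciding on the join strategy
    SIZE=$(hadoop fs -du -s "$RHS_PATH" | awk '{print $1}')   # -dus on older Hadoop
    THRESHOLD=$((1024 * 1024 * 1024))   # ~1 GB, placeholder; tune to task heap
    if [ "$SIZE" -lt "$THRESHOLD" ]; then
        pig -param JOIN_CLAUSE="USING 'replicated'" join.pig
    else
        pig -param JOIN_CLAUSE=" " join.pig   # fall back to a plain hash join
    fi

and in join.pig, since -param substitution is purely textual (the replicated relation has to be the last one listed):

    joined = JOIN big BY key, small BY key $JOIN_CLAUSE;

I haven't verified that Pig accepts an effectively empty parameter like that; if it doesn't, keeping two variants of the script would do the same job. It also doesn't answer the original question of failing over only on an actual memory error, which I suspect would need changes inside Pig itself.

Thanks,
Arun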
