Hi,

I am trying to do some work with an in-memory join implementation in MapReduce. It can be summarized as a join between two data sets, R and S: S is too large to fit into memory, while R fits into memory reasonably well (size of R << size of S). The typical implementation:

1) distributes or broadcasts R to all map tasks (each mapper loads R into memory, hashed by join key),
2) maps (streams) over S, dividing S into chunks and using each chunk as the input to one map task,
3) within each map task, for every tuple in S, looks up the join key in R,
4) has a trivial reduce computation.

If anyone could point me to a good implementation that I could use as a reference, that would be great. I do plan to write my own implementation, but it would be helpful to see whether there are established implementations out there.

Thanks,
Yunming
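
P.S. For concreteness, here is a minimal sketch of steps 1) to 4) in plain Python. It is not Hadoop code: the "map tasks" are just a loop in one process, the function names (build_hash_table, map_join, run_join) are made up for illustration, and tuples are assumed to carry the join key in their first field.

```python
from collections import defaultdict


def build_hash_table(r_tuples, key_index=0):
    """Step 1: load the small relation R into memory, hashed by join key."""
    table = defaultdict(list)
    for row in r_tuples:
        table[row[key_index]].append(row)
    return table


def map_join(s_partition, r_table, key_index=0):
    """Step 3: for every tuple in one chunk of S, probe the in-memory R table."""
    for s_row in s_partition:
        # .get() does not create entries for missing keys in a defaultdict.
        for r_row in r_table.get(s_row[key_index], ()):
            yield r_row + s_row


def run_join(r_tuples, s_partitions):
    """Steps 2 and 4: stream over the chunks of S; the 'reduce' is just
    collecting the map output, since no shuffle or grouping is needed."""
    r_table = build_hash_table(r_tuples)  # in a real job, each mapper does this
    joined = []
    for partition in s_partitions:
        joined.extend(map_join(partition, r_table))
    return joined


# Toy data (hypothetical): R is small, S arrives already split into chunks.
R = [(1, "alice"), (2, "bob")]
S_PARTITIONS = [
    [(1, "order-a"), (3, "order-b")],
    [(2, "order-c"), (1, "order-d")],
]
RESULT = run_join(R, S_PARTITIONS)
```

In this sketch the tuple (3, "order-b") has no partner in R and is dropped, i.e. the join is an inner join; an outer join would need to emit unmatched S tuples as well.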