Why not look at Hive? It already implements the join you are looking for and has a MAPJOIN feature, i.e. it loads the small data set into memory on the map side.
On Fri, Feb 15, 2013 at 1:25 PM, Yunming Zhang <[email protected]> wrote:
> Hi,
>
> I am trying to do some work with an in-memory join MapReduce implementation.
>
> It can be summarized as a join between two data sets, R and S. One of
> them is too large to fit into memory; the other fits into memory
> reasonably well (size of R << size of S). The typical implementation:
> 1) distributes or broadcasts R to all map tasks (each mapper loads R into
>    memory, hashed by join key),
> 2) maps (streams) over S, dividing S into chunks and using each chunk as
>    input to a map task,
> 3) within each map task, for every tuple in S, looks up the join key in R,
> 4) has a trivial reduce computation.
>
> If anyone could point me to a good implementation that I could use as a
> reference, that would be great.
> I do plan to write my own implementation, but it would be helpful to
> see whether there are established implementations out there.
>
> Thanks,
> Yunming
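
For reference, the four steps in the quoted message can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, not Hive's or Hadoop's actual code: the function names (build_hash_table, map_side_join), the tuple layout, and the sample data are all made up for the example, and in a real job R would be shipped to every mapper while each mapper sees only its own chunk of S.

```python
from collections import defaultdict


def build_hash_table(small_rows, key_index=0):
    """Step 1: load the small data set R into memory, hashed by join key."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key_index]].append(row)
    return table


def map_side_join(large_rows, table, key_index=0):
    """Steps 2-3: stream over S and probe the in-memory table for each tuple."""
    for s_row in large_rows:
        # .get() avoids inserting empty entries for keys that are missing in R
        for r_row in table.get(s_row[key_index], ()):
            yield r_row + s_row


# Hypothetical sample data: R is the small side, S is the large side.
R = [("u1", "alice"), ("u2", "bob")]
S = [("u1", "click"), ("u3", "view"), ("u2", "buy"), ("u1", "view")]

# Step 4: no reduce phase is needed; the map output is already the join result.
joined = list(map_side_join(iter(S), build_hash_table(R)))
for row in joined:
    print(row)
```

Because S is only iterated once and never held in memory, the memory cost is bounded by the size of R, which is the whole point of the approach when size of R << size of S.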
