MultipleInputs would be ideal, but that seems pretty complicated. MultiTableInputFormat seems like a simpler change: the getSplits() method of TableInputFormat, plus support for a collection of tables and their matching scanners instead of a single table and scanner. That doesn't sound too complicated. Any other suggestions?
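To make it concrete, here's roughly what I have in mind for the getSplits() part. This is only an untested sketch: the class name and the way the (table, scan) pairs are handed in are placeholders (a real version would serialize them into the job Configuration the way TableMapReduceUtil does, since the framework instantiates the InputFormat itself), and the record-reader side still needs a way to match each split back to its scan:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Pair;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class MultiTableInputFormat
        extends InputFormat<ImmutableBytesWritable, Result> {

      // one (table, scan) pair per table we want to feed into the job;
      // placeholder wiring -- a real version would read this from the Configuration
      private List<Pair<HTable, Scan>> tablesAndScans =
          new ArrayList<Pair<HTable, Scan>>();

      public void addTable(HTable table, Scan scan) {
        tablesAndScans.add(new Pair<HTable, Scan>(table, scan));
      }

      @Override
      public List<InputSplit> getSplits(JobContext context)
          throws IOException, InterruptedException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        // same idea as TableInputFormatBase.getSplits(), but looped over all
        // tables: one split per region of each table (clipping the splits to
        // each scan's start/stop rows is skipped here)
        for (Pair<HTable, Scan> entry : tablesAndScans) {
          HTable table = entry.getFirst();
          Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
          for (int i = 0; i < keys.getFirst().length; i++) {
            String regionLocation = table.getRegionLocation(keys.getFirst()[i])
                .getServerAddress().getHostname();
            splits.add(new TableSplit(table.getTableName(),
                keys.getFirst()[i], keys.getSecond()[i], regionLocation));
          }
        }
        return splits;
      }

      @Override
      public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
          InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        // here the split's table name would be used to pick the matching
        // table/scan and hand them to a TableRecordReader; left out of the sketch
        throw new UnsupportedOperationException("sketch only");
      }
    }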
-eran

On Tue, May 31, 2011 at 15:31, Ferdy Galema <[email protected]> wrote:

> As far as I can tell there is not yet a built-in mechanism you can use for
> this. You could implement your own InputFormat, something like
> MultiTableInputFormat. If you need different map functions for the two
> tables, perhaps something similar to Hadoop's MultipleInputs should do the
> trick.
>
> On 05/31/2011 02:06 PM, Eran Kutner wrote:
>
>> Hi,
>> I need to join two HBase tables. The obvious way is to use a M/R job for
>> that. The problem is that the few references to that question I found
>> recommend pulling one table to the mapper and then doing a lookup for the
>> referred row in the second table.
>> This sounds like a very inefficient way to do a join with map reduce. I
>> believe it would be much better to feed the rows of both tables to the
>> mapper and let it emit a key based on the join fields. Since all the rows
>> with the same join field values will have the same key, the reducer will
>> be able to easily generate the result of the join.
>> The problem with this is that I couldn't find a way to feed two tables to
>> a single map reduce job. I could probably dump the tables to files in a
>> single directory and then run the join on the files, but that really makes
>> no sense.
>>
>> Am I missing something? Any other ideas?
>>
>> -eran
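And just to make the join approach from the quoted message concrete, the map/reduce side could look roughly like this. Again an untested sketch; it assumes two tables named "table_a" and "table_b" that both keep the join field in f:join_key, and that the splits are TableSplits so the mapper can tell which table the current row came from:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableSplit;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TwoTableJoin {

      public static class JoinMapper extends TableMapper<Text, Text> {
        private String tableName;

        @Override
        protected void setup(Context context)
            throws IOException, InterruptedException {
          // figure out which table this split belongs to
          TableSplit split = (TableSplit) context.getInputSplit();
          tableName = Bytes.toString(split.getTableName());
        }

        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          byte[] joinValue =
              value.getValue(Bytes.toBytes("f"), Bytes.toBytes("join_key"));
          if (joinValue == null) {
            return; // no join field in this row, nothing to emit
          }
          // key = join field, value = originating table + row key, so rows from
          // both tables that share the join value meet in a single reduce call
          context.write(new Text(joinValue),
              new Text(tableName + "\t" + Bytes.toString(row.get())));
        }
      }

      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text joinKey, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
          // split the incoming rows by originating table; this buffers
          // everything for the key in memory, which is fine as long as no
          // single join value is shared by a huge number of rows
          List<String> left = new ArrayList<String>();
          List<String> right = new ArrayList<String>();
          for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if ("table_a".equals(parts[0])) {
              left.add(parts[1]);
            } else {
              right.add(parts[1]);
            }
          }
          // every (left, right) pair with this join value is one joined record
          for (String l : left) {
            for (String r : right) {
              context.write(joinKey, new Text(l + "\t" + r));
            }
          }
        }
      }
    }

The values could just as well carry the columns needed in the output instead of only the row key; I kept it to the row key to keep the sketch short.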
