Re:  "The problem is that the few references to that question I found recommend 
pulling one table to the mapper and then do a lookup for the referred row in 
the second table."

With multi-get in 0.90.x you could do some reasonably clever processing and 
perform the lookups in batches rather than one by one.
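
To make the batching idea concrete, here is a rough sketch in plain Python. It is not HBase client code: `FakeTable` and its `multi_get` method are hypothetical stand-ins for the second table and for a batched get, and `join_in_batches` is a name made up for this example. The only point is that the mapper buffers rows and issues one request per batch instead of one per row.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, str]


class FakeTable:
    """Hypothetical in-memory stand-in for the second table.

    It only counts how many batched requests were issued.
    """

    def __init__(self, rows: Dict[str, Row]) -> None:
        self.rows = rows
        self.round_trips = 0

    def multi_get(self, keys: List[str]) -> List[Optional[Row]]:
        # One call represents one batched request; results line up with keys,
        # with None for keys that have no row.
        self.round_trips += 1
        return [self.rows.get(key) for key in keys]


def join_in_batches(
    left_rows: Iterable[Tuple[str, Row]],
    table: FakeTable,
    batch_size: int,
) -> List[Row]:
    """Inner-join (foreign_key, row) pairs against `table`, one request per batch."""
    joined: List[Row] = []
    buffer: List[Tuple[str, Row]] = []

    def flush() -> None:
        # Deduplicate keys so each referenced row is fetched once per batch.
        keys = sorted({foreign_key for foreign_key, _ in buffer})
        found = dict(zip(keys, table.multi_get(keys)))
        for foreign_key, row in buffer:
            match = found[foreign_key]
            if match is not None:
                joined.append({**row, **match})
        buffer.clear()

    for item in left_rows:
        buffer.append(item)
        if len(buffer) >= batch_size:
            flush()
    if buffer:
        flush()  # last, possibly partial, batch
    return joined
```

With five input rows and a batch size of 2, this issues three requests instead of five. In a real job the batch size would be a trade-off between mapper memory and the number of round trips.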

Also, if the other table is "small" (i.e., if it's a domain/lookup table), you 
could leverage the block cache on the lookups.



-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Eran Kutner
Sent: Tuesday, May 31, 2011 8:06 AM
To: [email protected]
Subject: How to efficiently join HBase tables?

Hi,
I need to join two HBase tables. The obvious way is to use a M/R job for that. 
The problem is that the few references to that question I found recommend 
pulling one table to the mapper and then do a lookup for the referred row in 
the second table.
This sounds like a very inefficient way to do a join with map reduce. I believe 
it would be much better to feed the rows of both tables to the mapper and let 
it emit a key based on the join fields. Since all the rows with the same join 
field values will have the same key, the reducer will be able to easily 
generate the result of the join.
The problem with this is that I couldn't find a way to feed two tables to a 
single map reduce job. I could probably dump the tables to files in a single 
directory and then run the join on the files but that really makes no sense.
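
For reference, the reduce-side join described above can be sketched in plain Python. This simulates the map, shuffle and reduce steps in memory and says nothing about how the two tables would actually be fed to one job. All function names and the "L"/"R" tags are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]
Tagged = Tuple[str, Row]  # (source table tag, row)


def map_rows(tag: str, rows: Iterable[Row], join_field: str) -> Iterator[Tuple[str, Tagged]]:
    """Map step: emit (join key, tagged row) for every row that has the join field."""
    for row in rows:
        if join_field in row:
            yield row[join_field], (tag, row)


def shuffle(pairs: Iterable[Tuple[str, Tagged]]) -> Dict[str, List[Tagged]]:
    """Stand-in for the framework's shuffle: group values by key."""
    groups: Dict[str, List[Tagged]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(values: List[Tagged]) -> List[Row]:
    """Reduce step: pair every left row with every right row sharing the key."""
    left = [row for tag, row in values if tag == "L"]
    right = [row for tag, row in values if tag == "R"]
    return [{**l, **r} for l in left for r in right]


def reduce_side_join(left_rows: Iterable[Row], right_rows: Iterable[Row], join_field: str) -> List[Row]:
    """Inner join of two row sets on `join_field`."""
    pairs = list(map_rows("L", left_rows, join_field))
    pairs += list(map_rows("R", right_rows, join_field))
    joined: List[Row] = []
    # Keys are sorted only to make the output order deterministic.
    for _, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_join(values))
    return joined
```

Tagging each row with its source table is what lets the reducer tell the two sides apart once they arrive under the same key.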

Am I missing something? Any other ideas?

-eran
