RE: How to efficiently join HBase tables?

Michael Segel Tue, 31 May 2011 07:22:05 -0700


Eran,

You want to join two tables? The short answer is to use a relational database 
to solve that problem.

Longer answer:

You're using HBase, so you don't need to think in terms of a reducer.
You can create a temp table for your query.
You can then run one map job to scan and filter table A, dumping the result set 
into the temp table.
In parallel, you run a map job to scan and filter table B, dumping the result 
set into the temp table.

Voila! You're done. Just remember to clean up and drop the temp table when 
you're finished with it.
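
A rough sketch of the temp-table idea, using plain Python dicts as stand-ins 
for HBase tables rather than the real HBase client API. It assumes the temp 
table's row key is the join value, which is what lets rows from both scans 
land on the same row. The table contents, the user_id join field, and the 
map_into_temp helper are all made up for illustration.

```python
# In-memory stand-in for HBase tables: row key -> {column name: value}.
# All table contents and names below are hypothetical.
table_a = {
    "a1": {"user_id": "u1", "name": "Ann"},
    "a2": {"user_id": "u2", "name": "Bob"},
}
table_b = {
    "b1": {"user_id": "u1", "city": "Oslo"},
    "b2": {"user_id": "u3", "city": "Rome"},
}


def map_into_temp(source, join_field, keep, temp):
    """Simulate one map-only job: scan, filter, and put into the temp table."""
    for row in source.values():
        if not keep(row):
            continue  # the "filter" part of scan-and-filter
        join_key = row[join_field]  # assumption: temp row key is the join value
        cells = {c: v for c, v in row.items() if c != join_field}
        # Puts with the same row key end up as columns of the same temp row.
        temp.setdefault(join_key, {}).update(cells)


temp_table = {}
map_into_temp(table_a, "user_id", lambda r: True, temp_table)
map_into_temp(table_b, "user_id", lambda r: True, temp_table)

# Temp rows that received columns from both sources form the inner join.
joined = {k: v for k, v in temp_table.items() if "name" in v and "city" in v}
print(joined)
```

In this sketch the temp table holds every key seen in either scan, so the 
final filter is what turns it into an inner join. If several source rows 
share one join value, later puts overwrite earlier ones here.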

But there may be a problem if both tables use the same column name for data 
that means different things. Say both tables have a column named 'Tim' (and why 
you would name something Tim is beyond me... ;-) ), but this column means one 
thing in table A and something else in table B, and you want to retain both 
values. You just need to create a column whose name is based on 
${tablename}+'|'+${column name}, so it would be TableA|Tim and TableB|Tim.
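
The naming scheme could look something like this, again with a Python dict 
standing in for the temp table. The qualified and put_row helpers, the row 
key "row1", and the cell values are hypothetical; only the 
${tablename}+'|'+${column name} pattern comes from the text above.

```python
def qualified(table_name, column_name):
    """Build the temp-table column name: ${tablename}+'|'+${column name}."""
    return f"{table_name}|{column_name}"


def put_row(temp, join_key, table_name, cells):
    """Write one source row into the temp table under prefixed column names."""
    target = temp.setdefault(join_key, {})
    for column_name, value in cells.items():
        target[qualified(table_name, column_name)] = value


temp_table = {}
# Both source tables have a column named 'Tim' with different meanings.
put_row(temp_table, "row1", "TableA", {"Tim": "value from A"})
put_row(temp_table, "row1", "TableB", {"Tim": "value from B"})

# Both values survive because the column names no longer collide.
print(temp_table["row1"])
```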

HTH 

-Mike

> From: [email protected]
> Date: Tue, 31 May 2011 15:43:43 +0300
> Subject: Re: How to efficiently join HBase tables?
> To: [email protected]
> CC: [email protected]
> 
> MultipleInputs would be ideal, but that seems pretty complicated.
> MultiTableInputFormat seems like a simple change in the getSplits() method
> of TableInputFormat, plus support for a collection of tables and their
> matching scanners instead of a single table and scanner. That doesn't sound
> too complicated.
> Any other suggestions?
> 
> -eran
> 
> 
> 
> On Tue, May 31, 2011 at 15:31, Ferdy Galema <[email protected]> wrote:
> 
> > As far as I can tell there is not yet a built-in mechanism you can use for
> > this. You could implement your own InputFormat, something like
> > MultiTableInputFormat. If you need different map functions for the two
> > tables, perhaps something similar to Hadoop's MultipleInputs should do the
> > trick.
> >
> >
> > On 05/31/2011 02:06 PM, Eran Kutner wrote:
> >
> >> Hi,
> > >> I need to join two HBase tables. The obvious way is to use an M/R job for
> > >> that. The problem is that the few references to that question I found
> > >> recommend pulling one table to the mapper and then doing a lookup for the
> > >> referred row in the second table.
> > >> This sounds like a very inefficient way to do a join with map reduce. I
> > >> believe it would be much better to feed the rows of both tables to the
> > >> mapper and let it emit a key based on the join fields. Since all the rows
> > >> with the same join field values will have the same key, the reducer will
> > >> be able to easily generate the result of the join.
> > >> The problem with this is that I couldn't find a way to feed two tables to
> > >> a single map reduce job. I could probably dump the tables to files in a
> > >> single directory and then run the join on the files, but that really
> > >> makes no sense.
> >>
> >> Am I missing something? Any other ideas?
> >>
> >> -eran
> >>
> >>
