Doug, 

I read the OP's post as the following:
"> Hi,
> I need to join two HBase tables. 
The obvious way is to use a M/R job for that. 
The problem is that the few references to that question I found recommend 
pulling one table to the mapper 
and then do a lookup for the referred row in the second table.
"

So the underlying problem that the OP was trying to solve was how to join two 
tables from HBase.
Unfortunately I goofed.
I gave a quick and dirty solution that is a bit incomplete. They row key in the 
temp table has to be unique and I forgot about the Cartesian
product. So my solution wouldn't work in the general case.

If you wanted to do this in the general case...

Map job 1 selects data from the first table and then writes it to tempA where 
the key is 
the join column(s) in order, followed by all of the rows. (This is a wide key)
Also the map job returns a count of all of the rows in tempA

Map job 2 selects data from the second table again writing the entire return 
set as the key so that each row is also unique.

Now take the smaller of two sets and run a map reduce job. 
For each row, you'd want to create a scan of the other table, using a start and 
stop row based on the value in the joined columns.
Write out that row and the corresponding row from the scan in to your result 
set... but here I think you just need a unique key if you're just
dumping this to file, otherwise you will want to write the entire record as 
your key and each column as your value.

This will account for the Cartesian product in the join.

So it will look something like this:
Tab A
K, 1,2,3
   a,b,c
   a,a,b
   b,b,a
   b,c,a
   c,b,a

Tab B
K, 1,2,3
   a,e,f
   a,c,d
   b,z,z

SELECT *
FROM A, B
WHERE A.1 = B.1
AND   A.1 = a


Tab A
K       1,2,3
a|b|c   a,b,c
a|a|b   a,a,b

Tab B 
a|e|f   a,e,f
a|c|d   a,c,d

So output = 
K               1,2,3,4,5,6  
a|b|c|e|f       a,b,c,a,e,f
a|b|c|c|d       a,b,c,a,c,d
a|a|b|e|f       a,a,b,a,e,f
a|a|b|c|d       a,a,b,a,c,d


I'm looking at the http://hbase.apache.org/docs/current/api/
Didn't see a multi-get... not sure it would buy you much. My suggestion takes 
advantage of the fact that HBase puts things in sort order for you.

I think the above solution will work in the general sense, however what's the 
max width of the row key?


-Mike





> From: doug.m...@explorysmedical.com
> To: user@hbase.apache.org
> Date: Tue, 31 May 2011 11:42:27 -0400
> Subject: RE: How to efficiently join HBase tables?
> 
> Eran's observation was that a join is solvable in a Mapper via lookups on a 
> 2nd HBase table, but it might not be that efficient if the lookups are 1 by 
> 1.  I agree with that.
> 
> My suggestion was to use multi-Get for the lookups instead.  So you'd hold 
> onto a batch of records in the Mapper and then the batch size is filled, then 
> you do the lookups (and then any required emitting, etc.).  
> 
> 
> 
> -----Original Message-----
> From: Michael Segel [mailto:michael_se...@hotmail.com] 
> Sent: Tuesday, May 31, 2011 10:56 AM
> To: user@hbase.apache.org
> Subject: RE: How to efficiently join HBase tables?
> 
> 
> Maybe I'm missing something... but this isn't a hard problem to solve.
> 
> Eran wants to join two tables.
> If we look at an SQL Statement...
> 
> SELECT A.*, B.*
> FROM A, B
> WHERE A.1 = B.1
> AND  A.2 = B.2
> AND  A.3 = xxx
> AND A.4 = yyy
> AND B.45 = zzz
> 
> Or something along those lines.
> 
> So what you're essentially doing is saying I want to take a subset of data 
> from table A, and a subset of data from table B and join them on the values 
> in columns 1 and 2.
> Table A's data will be filtered on columns 3 and 4 and B's data will be 
> filtered on column 45. NOTE: since you don't know the relationship of the 
> column names to either table, you're safer in writing tableA|column_name and 
> tableB|column_name to your temp table.
> 
> So if you create a temp table FOO where the key is column 1 and column 2 
> (column1|column2) then when you walk through the subsets adding them to the 
> temp table, you will get the end result automatically.
> 
> Then you can output your hbase temp table and then truncate the table.
> 
> So what am I missing?
> 
> -Mike
> 
> 
> > From: doug.m...@explorysmedical.com
> > To: user@hbase.apache.org
> > Date: Tue, 31 May 2011 10:22:34 -0400
> > Subject: RE: How to efficiently join HBase tables?
> > 
> > 
> > Re:  "The problem is that the few references to that question I found 
> > recommend pulling one table to the mapper and then do a lookup for the 
> > referred row in the second table."
> > 
> > With multi-get in .90.x you could perform some reasonably clever processing 
> > and not do the lookups one-by-one but in batches.
> > 
> > Also, if the other table is "small" you could have the leverage the block 
> > cache on the lookups (i.e., if it's a domain/lookup table).  
> > 
> > 
> > 
> > -----Original Message-----
> > From: e...@gigya-inc.com [mailto:e...@gigya-inc.com] On Behalf Of Eran 
> > Kutner
> > Sent: Tuesday, May 31, 2011 8:06 AM
> > To: user@hbase.apache.org
> > Subject: How to efficiently join HBase tables?
> > 
> > Hi,
> > I need to join two HBase tables. The obvious way is to use a M/R job for 
> > that. The problem is that the few references to that question I found 
> > recommend pulling one table to the mapper and then do a lookup for the 
> > referred row in the second table.
> > This sounds like a very inefficient way to do  join with map reduce. I 
> > believe it would be much better to feed the rows of both tables to the 
> > mapper and let it emit a key based on the join fields. Since all the rows 
> > with the same join fields values will have the same key the reducer will be 
> > able to easily generate the result of the join.
> > The problem with this is that I couldn't find a way to feed two tables to a 
> > single map reduce job. I could probably dump the tables to files in a 
> > single directory and then run the join on the files but that really makes 
> > no sense.
> > 
> > Am I missing something? Any other ideas?
> > 
> > -eran
>                                         
                                          

Reply via email to