Depends on a couple of things. If your LIST is a permanent feature of your documents, then it might make sense to store list membership in the doc record itself (a Boolean, or the list index if the list has a particular sort order).

Otherwise, a little simple programming can get you the results you want:

1) Sort the list by DOCID (if it is big, a map reduce job with an identity map / single identity reducer would do the job). If you require the original order of the list to be maintained, then you need to add another field to the list indicating order, so that you can recover it after the join.
2) Output a list of DOCID / UUID pairs from DOCS, sorted on DOCID.
3) Use a double iterator through your two sorted outputs to find the UUIDs for the DOCIDs in the list (and optionally each one's order in the list).
4) Optionally re-sort the resulting UUID list by the list order index.
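Steps 1) to 4) can be sketched in plain Python as follows. This is only an illustration of the sort-and-merge idea on in-memory lists, not HBase or MapReduce code. The function name `merge_join`, the tuple layouts, and the assumption that each DOCID appears at most once in both DOCS and the list are mine, not part of the original question.

```python
def merge_join(doc_list, docs):
    """Return the UUIDs whose DOCID is in doc_list, in doc_list order.

    doc_list: list of DOCIDs in the order the caller cares about.
    docs:     iterable of (uuid, docid) rows, standing in for the DOCS table.
    Assumes each DOCID appears at most once on each side.
    """
    # Step 1: sort the list by DOCID, remembering each entry's original position.
    wanted = sorted((docid, pos) for pos, docid in enumerate(doc_list))
    # Step 2: DOCID / UUID pairs sorted on DOCID.
    table = sorted((docid, uuid) for uuid, docid in docs)

    # Step 3: walk both sorted sequences in lockstep (the "double iterator").
    matches = []
    i = j = 0
    while i < len(wanted) and j < len(table):
        docid, pos = wanted[i]
        doc_docid, uuid = table[j]
        if docid == doc_docid:
            matches.append((pos, uuid))
            i += 1
            j += 1
        elif docid < doc_docid:
            i += 1  # DOCID from the list is not present in DOCS
        else:
            j += 1  # this DOCS row is not in the list

    # Step 4: re-sort by the original list position.
    matches.sort()
    return [uuid for _, uuid in matches]


docs = [("sdsd", 1), ("hdhs", 3), ("gdhg", 7), ("shdg", 9)]
print(merge_join([3, 1, 7], docs))
```

With the sample data from the question, the list 3, 1, 7 yields hdhs, sdsd, gdhg. DOCIDs in the list that have no row in DOCS are silently skipped.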
This will not be particularly fast, but it should be robust to large list sizes.

If your list can fit into the memory of a map task, then load it into a hash map in each map task. While you iterate over your DOCS table, you need only output the UUIDs and sort order of the rows whose DOCID is in the map, and let your reducer reorder them according to your list order.

Dave

-----Original Message-----
From: Florin P [mailto:[email protected]]
Sent: Thursday, June 16, 2011 5:44 AM
To: [email protected]
Subject: Re: How to efficiently join HBase tables?

Hello!
Regarding the same subject of joining, I have the following scenario:

1. I have a big table DOCS that contains the columns

UUID  DOCID
sdsd  1
hdhs  3
gdhg  7
shdg  9

and so on (hope you got the idea)

2. I have an external list of docIDs (LIST)

3
1
7

against which I have to query ("join") the DOCS DOCID column, so that the result should be hdhs, sdsd, gdhg.

How can I implement such a request? Could this be a possible solution:
1. add a new column LIST (in the same column family) to DOCS
2. add a new record in it that contains my LIST of docIDs
3. "join" column LIST with the DOCID column? (perhaps a weird idea)

Thank you.
Regards,
Florin
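For reference, the in-memory variant from Dave's reply (the list held in a hash map, a single pass over DOCS, then a reorder by list position) might look like the sketch below. It is plain Python standing in for the map and reduce phases, not actual HBase or Hadoop code. The name `hash_join` and the assumption that DOCIDs in the list are unique are hypothetical.

```python
def hash_join(doc_list, docs):
    """Return the UUIDs whose DOCID is in doc_list, in doc_list order.

    doc_list: list of DOCIDs small enough to hold in memory.
    docs:     iterable of (uuid, docid) rows, standing in for a scan of DOCS.
    Assumes DOCIDs in doc_list are unique.
    """
    # The hash map each map task would build: DOCID -> position in the list.
    position_of = {docid: pos for pos, docid in enumerate(doc_list)}

    # "Map" phase: scan DOCS once, emit (position, uuid) for rows in the list.
    emitted = [
        (position_of[docid], uuid)
        for uuid, docid in docs
        if docid in position_of
    ]

    # "Reduce" phase: reorder the emitted pairs by list position.
    emitted.sort()
    return [uuid for _, uuid in emitted]


docs = [("sdsd", 1), ("hdhs", 3), ("gdhg", 7), ("shdg", 9)]
print(hash_join([3, 1, 7], docs))
```

This avoids sorting DOCS at all, at the cost of requiring the whole list to fit in memory.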
