On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <[email protected]> wrote:
> Hi Anil,
>
> >1. In your presentation you mentioned that region of Primary Table and
> >Region of Secondary Table are always located on the same region server. How
> >do you achieve it? By using the Primary table rowkey as prefix of Rowkey
> >of Secondary Table? Will your implementation work if the rowkey of primary
> >table cannot be used as prefix in rowkey of Secondary table (I have this
> >limitation in my use case)?
> First of all, there will be the same number of regions in both the primary
> and index tables. All the start/stop keys of the regions will also be the same.
> Suppose there are 2 regions on the main table, say for keys 0-10 and 10-20.
> Then we will create 2 regions in the index table also, with the same key ranges.
> At the master balancing level it is easy to collocate these regions by
> looking at the start and end keys.
> The selection of the rowkey that will be used in the index table is
> the key here.
> What we will do is prefix all the rowkeys in the index table
> with the start key of the region.
> When an entry is added to the main table with rowkey 5, it will go to
> the 1st region (0-10).
> Now there will be an index region with range 0-10. We will select this
> region to store this index data.
> The row getting added into the index region for this entry will have a
> rowkey 0_x_5
> I am just using '_' as a separator here to show this. Actually we
> won't be having any separator.
> So the rowkeys (in the index region) will always have a static begin part.
> At scan time also we know this part, and so the startrow and endrow
> creation for the scan will be possible. Note that we will store the actual
> table row key as the last part of the index rowkey itself, not as a value.
> This is the better option in our case for handling the scan index usage at
> the server side as well. There is no index data fetch to the client side.
>
Anil: My primary table rowkey is customerId+event_id, and my secondary
table rowkey is timestamp+customerid.
In your implementation it seems that, to use the secondary index, the
application needs to know the "start_key" of the region (static begin part)
it wants to query. Right? Do you separately manage the logic of determining
the region "start_key" (static begin part) for a scan? Also, it's possible
that while using the secondary index the customerId is not provided. So, I
won't have the customer id for all the queries. Hence I cannot use
customer_id as a prefix in the rowkey of my Secondary Table.
>
> I feel your use case perfectly fits with our model
>
Anil: Somehow I am unable to fit your implementation into my use case due to
the constraint of the static begin part of the rowkey in the Secondary table.
There seems to be a disconnect. Can you tell me how my use case fits into
your implementation?
>
> >2. Are you using an Endpoint or Observer for building the secondary index
> >table?
> Observer
>
> >3. "Custom balancer do collocation". Is it a custom load balancer of HBase
> >Master or something else?
> It is a balancer implementation which will be plugged into Master
>
> >4. Your region split looks interesting. I dont have much info about it.
> >Can you point to some docs on IndexHalfStoreFileReader?
> Sorry, I am not able to publish any design doc or code as the company has
> not decided to open source the solution yet.
> If you come across any particular query, please feel free to ask me :)
> You can look at the HalfStoreFileReader class first.
>
> -Anoop-
> ________________________________________
> From: anil gupta [[email protected]]
> Sent: Friday, December 14, 2012 2:11 PM
> To: [email protected]
> Subject: Re: HBase - Secondary Index
>
> Hi Anoop,
>
> Nice presentation and seems like a smart implementation. Since the
> presentation only covered bullet points, I have a couple of questions on
> your implementation.
> :)
>
> Here is a recap of my implementation and our previous discussion on
> Secondary index:
>
> Here is the link to previous email thread:
> http://search-hadoop.com/m/1zWPMaaRtr .
>
> The secondary index is stored in table "B" as rowkey B --> family:<rowkey A> .
> "<rowkey A>" is the column qualifier. Every row in B will only
> have one column "k" and the value of that column is the rowkey of A.
>
> Suppose I am storing customer events in table A. I have two requirements for
> data query:
> 1. Query customer events on basis of customer_Id and event_ID.
> 2. Query customer events on basis of event_timestamp and customer_ID.
>
> 70% of querying is done by query#1, so I will create
> <customer_Id><event_ID> as row key of Table A.
> Now, in order to support fast results for query#2, I need to create a
> secondary index on A. I store that secondary index in B; rowkey of B is
> <event_timestamp><customer_ID>. Every row stores the corresponding rowkey of
> A.
>
> HBase Querying approach:
> 1. Scan the secondary table by using prefix filter and startRow to get the
> list of Rowkeys of Primary table.
> 2. Do a batch get on primary table by using HTable.get(List<Get>) method
> using the list of Rowkeys obtained in step1.
>
> The only issue is that in my solution I have at least two RPC calls, one
> each in step1 and step2 above. I want to reduce the number of RPCs to 1 if
> possible.
>
>
> ******Questions on your implementation:*********
>
> 1. In your presentation you mentioned that region of Primary Table and
> Region of Secondary Table are always located on the same region server. How
> do you achieve it? By using the Primary table rowkey as prefix of Rowkey
> of Secondary Table? Will your implementation work if the rowkey of primary
> table cannot be used as prefix in rowkey of Secondary table (I have this
> limitation in my use case)?
> 2. Are you using an Endpoint or Observer for building the secondary index
> table?
> 3. "Custom balancer do collocation".
> Is it a custom load balancer of HBase
> Master or something else?
> 4. Your region split looks interesting. I dont have much info about it. Can
> you point to some docs on IndexHalfStoreFileReader?
>
> Thanks,
> Anil Gupta
>
>
>
> On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <[email protected]>
> wrote:
>
> > Hi All
> >
> > Last week I got a chance to present the secondary indexing
> > solution what we have done in Huawei at the China Hadoop Conference. You
> > can see the presentation from
> > http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
> >
> >
> >
> > I would like to hear what others think on this. :)
> >
> >
> >
> > -Anoop-
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>

--
Thanks & Regards,
Anil Gupta
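
For readers following the thread, the index rowkey layout Anoop describes (region start key, then the indexed value, then the main-table rowkey) can be sketched roughly as below. This is only an illustration in plain Python, not the Huawei implementation, which was not published. The zero-padded string keys, the `REGION_START_KEYS` list, the function names, and the '_' separator are all assumptions made for readability (Anoop notes that the real keys have no separator).

```python
from bisect import bisect_right

# Hypothetical start keys shared by the main table and the index table,
# zero-padded so that string order matches numeric order.
REGION_START_KEYS = ["00", "10"]  # regions [00, 10) and [10, end)


def region_start_for(main_rowkey):
    """Return the start key of the region that would hold main_rowkey."""
    idx = bisect_right(REGION_START_KEYS, main_rowkey) - 1
    if idx < 0:
        raise ValueError("rowkey sorts before the first region")
    return REGION_START_KEYS[idx]


def index_rowkey(main_rowkey, indexed_value):
    """Compose <region start key>_<indexed value>_<main rowkey>."""
    return "_".join([region_start_for(main_rowkey), indexed_value, main_rowkey])


def scan_range(region_start, indexed_value):
    """Start row (inclusive) and stop row (exclusive) for one index region."""
    start = region_start + "_" + indexed_value + "_"
    # Stop row: the same prefix with its last character incremented.
    stop = start[:-1] + chr(ord(start[-1]) + 1)
    return start, stop


if __name__ == "__main__":
    print(index_rowkey("05", "x"))
    print(scan_range("00", "x"))
```

Because the static begin part is the region's own start key, each index region can build its scan range locally; the main-table rowkey is read back from the tail of the index rowkey rather than from a cell value.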

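
Anil's two-step query (prefix scan on index table B, then a batch get on primary table A) can be illustrated in the same spirit. The sketch below uses in-memory stand-ins rather than the HBase client API; the sample rows, `scan_prefix`, `batch_get`, and `query_by_timestamp` are hypothetical names chosen for this example. In a real deployment each of the two steps would be at least one RPC, which is the cost Anil wants to avoid.

```python
from bisect import bisect_left

# Table A: rowkey <customer_id><event_id> -> event payload (sample data).
table_a = {"c1e1": "login", "c1e2": "purchase", "c2e1": "login"}

# Table B: rowkey <event_timestamp><customer_id> -> rowkey of A, kept sorted
# by rowkey the way a scan would see it.
index_b = sorted([
    ("20121213c1", "c1e1"),
    ("20121214c1", "c1e2"),
    ("20121214c2", "c2e1"),
])


def scan_prefix(sorted_rows, prefix):
    """Step 1: seek to the prefix and collect primary rowkeys while it matches."""
    keys = [key for key, _ in sorted_rows]
    i = bisect_left(keys, prefix)
    found = []
    while i < len(sorted_rows) and sorted_rows[i][0].startswith(prefix):
        found.append(sorted_rows[i][1])
        i += 1
    return found


def batch_get(table, rowkeys):
    """Step 2: fetch all primary rows in one batch, skipping missing keys."""
    return [table[key] for key in rowkeys if key in table]


def query_by_timestamp(prefix):
    """Run both steps; against HBase this would mean two round trips."""
    return batch_get(table_a, scan_prefix(index_b, prefix))


if __name__ == "__main__":
    print(query_by_timestamp("20121214"))
```

Note that a timestamp-only prefix works here because the timestamp leads the index rowkey, which is why Anil's queries do not need the customer id.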