Hi Anoop,

For my use case, scans will never have a primary table rowkey range when I query using the secondary index. IMHO, if I send the request to all the RSs of the table, then I am concerned about too many unnecessary RPCs across the cluster for every single query based on the secondary index. Essentially, every time it will look like a full table scan, but under the hood the CPs will do the magic using the secondary table. Your solution works well when a rowkey range on the primary table can be specified. Unfortunately, I don't have the luxury of using a "primary table rowkey range" for now. It seems like I will have to stick to my current solution. However, it's always good to have a healthy discussion on different approaches. :)

PS: My current secondary index implementation is not yet in production. I did some preliminary testing and it seems to work fine, but I think I need to do some more testing.

Thanks,
Anil Gupta

On Tue, Dec 18, 2012 at 1:27 AM, Anoop Sam John <[email protected]> wrote:

> Anil:
> If the scan from the client side does not specify any rowkey range but only the filter condition, yes, it will go to all the primary table regions for the scan. There it will first scan the index table region and seek to the exact rows in the main table region. If that region does not have any data at all corresponding to the filter condition, the entire region will simply get skipped.
>
> In a normal scan also, if there is a rowkey range that we can specify, then the request will go only to specific regions. In our secondary index case it is the same.
>
> Simply put, for the scan there is no change at all in what happens at the client side: from the meta data it finds out which regions and RSs to contact, then contacts those regions one by one and gets data from each region. The only difference is what happens at the server side. Without an index, the whole data from all the HFiles will get fetched at the server side and the filter will get applied to every row. Only those rows which pass the filter will get back to the client side. With an index, when the scanning happens at the server side, the index data will get scanned first from the index region. This region will be in the same RS, so no extra RPCs. The data to be scanned from the index table will be limited; we can create the start key and stop key for that. Based on the result of the index scan, we will know the rowkeys where all the data we are interested in resides. So a reseek will happen to those rows and only those rows will be read. So the time spent at the server side for scanning a region will get reduced to a very large extent.
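The server-side difference described in the paragraph above can be illustrated with a small in-memory sketch. This is plain Python, not HBase code: the sorted lists stand in for one main-table region and its co-located index region, and the names `full_scan`, `indexed_scan`, `region_rows` and `index_entries` are hypothetical. It only shows why fewer rows are read per region when the index is consulted first.

```python
from bisect import bisect_left

def full_scan(region_rows, predicate):
    """No index: read every row of the region and apply the filter to each."""
    rows_read = 0
    matches = []
    for rowkey, row in region_rows:
        rows_read += 1
        if predicate(row):
            matches.append(rowkey)
    return matches, rows_read

def indexed_scan(region_rows, index_entries, start_key, stop_key):
    """With index: range-scan the index region, then seek to the exact main rows."""
    main_keys = [k for k, _ in region_rows]
    # Bounded scan over the sorted index entries, covering [start_key, stop_key).
    lo = bisect_left(index_entries, (start_key,))
    hi = bisect_left(index_entries, (stop_key,))
    targets = sorted(main_rowkey for _, main_rowkey in index_entries[lo:hi])
    rows_read = 0
    matches = []
    for target in targets:
        pos = bisect_left(main_keys, target)  # stands in for a reseek
        if pos < len(main_keys) and main_keys[pos] == target:
            rows_read += 1
            matches.append(target)
    return matches, rows_read

# Hypothetical region contents: rowkey -> row, sorted by rowkey.
region_rows = [
    ("c1e1", {"ts": 100}),
    ("c1e2", {"ts": 205}),
    ("c2e1", {"ts": 200}),
    ("c3e1", {"ts": 300}),
    ("c3e2", {"ts": 201}),
]
# Index entries sorted by (indexed value, main rowkey).
index_entries = sorted((row["ts"], key) for key, row in region_rows)

scan_matches, scan_reads = full_scan(region_rows, lambda r: 200 <= r["ts"] < 206)
idx_matches, idx_reads = indexed_scan(region_rows, index_entries, 200, 206)
```

Both calls return the same rowkeys, but the indexed variant touches only the matching main-table rows. The client still contacts every region either way, which is the cost raised at the top of this thread.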
>
> Yes, but still there will be calls from the client side to the RS for each region...
>
> Now I think u might be clear. The ppt that I have shared says the same thing. It shows what is happening at the server side.
>
> -Anoop-
>
> ________________________________________
> From: anil gupta [[email protected]]
> Sent: Tuesday, December 18, 2012 1:58 PM
> To: [email protected]
> Subject: Re: HBase - Secondary Index
>
> Hi Anoop,
>
> Please find my reply inline.
>
> Thanks,
> Anil Gupta
>
> On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[email protected]> wrote:
>
> > Hi Anil
> > During the scan, there is no need to fetch any index data to the client side. So there is no need to create any scanner on the index table at the client side. This happens at the server side.
> >
> > For the scan on the main table with a condition on timestamp and customer id, a scanner is to be created with Filters, just as normal when there is no secondary index. So this scan from the client will go through all the regions in the main table.
>
> Anil: Do you mean that if the table is spread across 50 region servers in a 60 node cluster, then we need to send a scan request to all 50 RSs? Right? Doesn't it sound expensive? IMHO you were not doing this in your solution. Your solution looked cleaner than this since you knew exactly which node you need to go to for querying while using the secondary index, due to co-location (due to the static begin part of the secondary table rowkey) of the regions of the primary table and the secondary index table. My problem is a little more complicated due to the constraint that I cannot have a "static begin part" in the rowkey of my secondary table.
>
> > When it scans one particular region, say [x,y), on the main table, using the CP we can get the index table region object corresponding to this main table region from the RS. There is no issue in creating the static part of the rowkey.
> > You know 'x' is the region start key. Then at the server side we will create a scanner on the index region directly, and here we can specify the startkey: 'x' + <timestamp value> + <customer id>. Using the results from the index scan, we will reseek on the main region to the exact rows where the data we are interested in is available. So there won't be a full region data scan happening.
> >
> > In the cases where only the timestamp is there but no customer id, it will be simple again. Create a scanner on the main table with only one filter. At the CP side the scanner on the index region will get created with startkey 'x' + <timestamp value>. When you create the scan object and set startRow on it, it need not be the full rowkey. It can be part of the rowkey also, like a prefix.
> >
> > Hope u got it now :)
>
> Anil: I hope now we are on the same page. Thanks a lot for your valuable time to discuss this stuff.
>
> > -Anoop-
> > ________________________________________
> > From: anil gupta [[email protected]]
> > Sent: Friday, December 14, 2012 11:31 PM
> > To: [email protected]
> > Subject: Re: HBase - Secondary Index
> >
> > On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <[email protected]> wrote:
> >
> > > Hi Anil,
> > >
> > > >1. In your presentation you mentioned that region of Primary Table and Region of Secondary Table are always located on the same region server. How do you achieve it? By using the Primary table rowkey as prefix of Rowkey of Secondary Table? Will your implementation work if the rowkey of primary table cannot be used as prefix in rowkey of Secondary table (I have this limitation in my use case)?
> > > First of all, there will be the same number of regions in both the primary and index tables. All the start/stop keys of the regions will also be the same.
> > > Suppose there are 2 regions on the main table, say for keys 0-10 and 10-20.
> > > Then we will create 2 regions in the index table also with the same key ranges.
> > > At the master balancing level it is easy to collocate these regions by looking at the start and end keys.
> > > The selection of the rowkey that will be used in the index table is the key here.
> > > What we will do is prefix all the rowkeys in the index table with the start key of the region.
> > > When an entry is added to the main table with rowkey 5, it will go to the 1st region (0-10).
> > > Now there will be an index region with range 0-10. We will select this region to store this index data.
> > > The row getting added into the index region for this entry will have a rowkey 0_x_5.
> > > I am just using '_' as a separator here to show this. Actually we won't be having any separator.
> > > So the rowkeys (in the index region) will always have a static begin part. At scan time also we know this part, and so creating the startrow and endrow for the scan will be possible. Note that we will store the actual table row key as the last part of the index rowkey itself, not as a value. This is the better option in our case for handling the index usage during the scan at the server side as well. There is no index data fetch to the client side.
> >
> > Anil: My primary table rowkey is customerId+event_id, and my secondary table rowkey is timestamp+customerid. In your implementation it seems like, for using the secondary index, the application needs to know the "start_key" of the region (static begin part) it wants to query. Right? Do you separately manage the logic of determining the region "start_key" (static begin part) for a scan?
> > Also, it's possible that while using the secondary index the customerId is not provided. So, I won't have a customer id for all the queries. Hence I cannot use customer_id as a prefix in the rowkey of my Secondary Table.
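The index rowkey layout described above (region start key, then the indexed value, then the main-table rowkey, with no separator) can be sketched as follows. This is a plain Python illustration under stated assumptions, not the unpublished implementation: the fixed field widths, the reading of the middle part of "0_x_5" as the indexed column value, and the names `REGION_STARTS`, `index_rowkey`, `prefix_range` and `primary_rowkey_of` are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical fixed-width region start keys, mirroring regions 0-10 and 10-20.
REGION_STARTS = [b"00", b"10"]

def region_start_for(primary_rowkey):
    """Start key of the region whose range contains primary_rowkey."""
    return REGION_STARTS[bisect_right(REGION_STARTS, primary_rowkey) - 1]

def index_rowkey(indexed_value, primary_rowkey):
    """Static begin part + indexed value + main-table rowkey, no separator."""
    return region_start_for(primary_rowkey) + indexed_value + primary_rowkey

def prefix_range(prefix):
    """Start row and exclusive stop row covering every key that begins with prefix."""
    stop = bytearray(prefix)
    while stop and stop[-1] == 0xFF:
        stop.pop()            # a trailing 0xff byte cannot be incremented
    if stop:
        stop[-1] += 1
    return prefix, bytes(stop)  # an empty stop row means "scan to the end"

def primary_rowkey_of(index_key, width=2):
    """The main-table rowkey is the last fixed-width part of the index rowkey."""
    return index_key[-width:]

# Main rows "05" and "15" share the indexed value "1355" but live in different regions.
key_05 = index_rowkey(b"1355", b"05")
key_15 = index_rowkey(b"1355", b"15")

# Scan bounds for indexed value "1355" inside the region that starts at "00".
start_row, stop_row = prefix_range(b"00" + b"1355")
```

Because the region start key is known on the server while it scans that region, the bounds can be built there without the client supplying any main-table rowkey range. Without separators, the fields need fixed widths (or some other decodable encoding) so the trailing main-table rowkey can be recovered.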
> > >
> > > I feel your use case perfectly fits our model
> >
> > Anil: Somehow I am unable to fit your implementation into my use case due to the constraint of the static begin part of the rowkey in the Secondary table. There seems to be a disconnect. Can you tell me how my use case fits into your implementation?
> >
> > > >2. Are you using an Endpoint or Observer for building the secondary index table?
> > > Observer
> > >
> > > >3. "Custom balancer do collocation". Is it a custom load balancer of HBase Master or something else?
> > > It is a balancer implementation which will be plugged into the Master
> > >
> > > >4. Your region split looks interesting. I don't have much info about it. Can you point to some docs on IndexHalfStoreFileReader?
> > > Sorry, I am not able to publish any design doc or code as the company has not decided to open source the solution yet.
> > > If you come across any particular query, pls feel free to ask me :)
> > > You can look at the HalfStoreFileReader class first.
> > >
> > > -Anoop-
> > > ________________________________________
> > > From: anil gupta [[email protected]]
> > > Sent: Friday, December 14, 2012 2:11 PM
> > > To: [email protected]
> > > Subject: Re: HBase - Secondary Index
> > >
> > > Hi Anoop,
> > >
> > > Nice presentation, and it seems like a smart implementation. Since the presentation only covered bullet points, I have a couple of questions on your implementation. :)
> > >
> > > Here is a recap of my implementation and our previous discussion on Secondary index:
> > >
> > > Here is the link to the previous email thread:
> > > http://search-hadoop.com/m/1zWPMaaRtr .
> > >
> > > The secondary index is stored in table "B" as rowkey B --> family:<rowkey A> . "<rowkey A>" is the column qualifier. Every row in B will have only one column "k", and the value of that column is the rowkey of A.
> > >
> > > Suppose I am storing customer events in table A. I have two requirements for data query:
> > > 1. Query customer events on the basis of customer_Id and event_ID.
> > > 2. Query customer events on the basis of event_timestamp and customer_ID.
> > >
> > > 70% of querying is done by query#1, so I will create <customer_Id><event_ID> as the row key of Table A.
> > > Now, in order to support fast results for query#2, I need to create a secondary index on A. I store that secondary index in B; the rowkey of B is <event_timestamp><customer_ID>. Every row stores the corresponding rowkey of A.
> > >
> > > HBase Querying approach:
> > > 1. Scan the secondary table by using a prefix filter and startRow to get the list of rowkeys of the Primary table.
> > > 2. Do a batch get on the primary table by using the HTable.get(List<Get>) method with the list of rowkeys obtained in step 1.
> > >
> > > The only issue is that in my solution I have at least two RPC calls, one each in step 1 and step 2 above. I want to reduce the number of RPCs to 1 if possible.
> > >
> > >
> > > ******Questions on your implementation:*********
> > >
> > > 1. In your presentation you mentioned that region of Primary Table and Region of Secondary Table are always located on the same region server. How do you achieve it? By using the Primary table rowkey as prefix of Rowkey of Secondary Table? Will your implementation work if the rowkey of primary table cannot be used as prefix in rowkey of Secondary table (I have this limitation in my use case)?
> > > 2. Are you using an Endpoint or Observer for building the secondary index table?
> > > 3. "Custom balancer do collocation". Is it a custom load balancer of HBase Master or something else?
> > > 4. Your region split looks interesting. I don't have much info about it. Can you point to some docs on IndexHalfStoreFileReader?
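The two-step querying approach recapped above (prefix scan on table B, then a batch get on table A) can be sketched in plain Python. This is an in-memory stand-in, not the HBase client API: `FakeTable`, `scan_prefix`, `batch_get` and the sample rows are hypothetical, and counting exactly one round trip per call is a simplification of what a real scan or multi-get costs.

```python
class FakeTable:
    """In-memory stand-in for a sorted key-value table that counts client calls."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.rpc_calls = 0

    def scan_prefix(self, prefix):
        """Return (rowkey, value) pairs whose rowkey starts with prefix, in key order."""
        self.rpc_calls += 1  # simplification: one round trip per scan
        return [(k, v) for k, v in sorted(self.rows.items()) if k.startswith(prefix)]

    def batch_get(self, keys):
        """Return the values for the given rowkeys, skipping missing ones."""
        self.rpc_calls += 1  # simplification: one round trip per batch
        return [self.rows[k] for k in keys if k in self.rows]

def query_by_timestamp(table_a, table_b, timestamp_prefix):
    # Step 1: scan the secondary table B to collect rowkeys of primary table A.
    primary_keys = [value for _, value in table_b.scan_prefix(timestamp_prefix)]
    # Step 2: fetch those rows from A in one batch.
    return table_a.batch_get(primary_keys)

# Table A: rowkey <customer_Id><event_ID> -> event data (sample values).
table_a = FakeTable({"c1e1": "login", "c1e2": "purchase", "c2e1": "login"})
# Table B: rowkey <event_timestamp><customer_ID> -> rowkey of A.
table_b = FakeTable({"20121213c1": "c1e1", "20121214c1": "c1e2", "20121214c2": "c2e1"})

events = query_by_timestamp(table_a, table_b, "20121214")
total_rpcs = table_a.rpc_calls + table_b.rpc_calls
```

The query touches only the two tables involved and needs no knowledge of region boundaries, at the price of the second round trip that the recap wants to eliminate.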
> > >
> > > Thanks,
> > > Anil Gupta
> > >
> > > On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <[email protected]> wrote:
> > >
> > > > Hi All
> > > >
> > > > Last week I got a chance to present the secondary indexing solution what we have done in Huawei at the China Hadoop Conference. You can see the presentation from
> > > > http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
> > > >
> > > > I would like to hear what others think on this. :)
> > > >
> > > > -Anoop-
> > >
> > > --
> > > Thanks & Regards,
> > > Anil Gupta
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>
> --
> Thanks & Regards,
> Anil Gupta

--
Thanks & Regards,
Anil Gupta
