Hi Michael,

Please find my replies inline.

Thanks,
Anil

On Tue, Dec 18, 2012 at 1:02 AM, Michel Segel <[email protected]> wrote:

> Just a couple of questions...
>
> First, since you don't have any natural secondary indices, you can create
> one from a couple of choices. Keeping it simple, you choose an inverted
> table as your index.
>
Reasons for not creating an inverted table:
1. There can be millions of columns corresponding to a rowkey in my secondary index. In the future it can grow even more.
2. While using the secondary index, we are also planning to filter on the basis of other non-rowkey columns. For example, one row of the secondary table might look like this:
Rowkey: cf:PrimarytableRowKey=x, cf:customerFirstName=xyz, cf:customerAddress=123, Union Sq, LA
My primary table has around 50 columns, and in the secondary table I duplicate two columns to be used along with the secondary index for filtering.

> In doing so, you have one column containing all of the row ids for a given
> value.
> This means that it is a simple get().
>
> My question is that since you don't have any formal SQL syntax, how are
> you doing this all server side?
>
As Anoop said, I am not doing the index data scan at the server side. I scan the index table data back to the client, and from the client I do gets to fetch the main table data.

> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Dec 18, 2012, at 2:28 AM, anil gupta <[email protected]> wrote:
>
> > Hi Anoop,
> >
> > Please find my reply inline.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[email protected]> wrote:
> >
> >> Hi Anil
> >> During the scan, there is no need to fetch any index data
> >> to client side. So there is no need to create any scanner on the index
> >> table at the client side. This happens at the server side.
> >>
> >> For the Scan on the main table with condition on timestamp and customer
> >> id, a scanner is to be created with Filters. Yes, like normal when there is no
> >> secondary index.
> >> So this scan from the client will go through all the
> >> regions in the main table.
> >
> > Anil: Do you mean that if the table is spread across 50 region servers in a
> > 60 node cluster then we need to send a scan request to all the 50 RS?
> > Right? Doesn't it sound expensive? IMHO you were not doing this in your
> > solution. Your solution looked cleaner than this since you knew exactly
> > which node you need to go to for querying while using the secondary index, due
> > to co-location (due to the static begin part of the secondary table rowkey) of
> > the regions of the primary table and the secondary index table. My problem is a
> > little more complicated due to the constraint that I cannot have a "static
> > begin part" in the rowkey of my secondary table.
> >
> >> When it scans one particular region say (x,y] on the main table, using the
> >> CP we can get the index table region object corresponding to this main
> >> table region from the RS. There is no issue in creating the static part of
> >> the rowkey. You know 'x' is the region start key. Then at the server side
> >> we will create a scanner on the index region directly and here we can specify
> >> the startkey: 'x' + <timestamp value> + <customer id>. Using the results
> >> from the index scan we will reseek on the main region to the exact
> >> rows where the data we are interested in is available. So there won't
> >> be a full region data scan happening.
> >>
> >> In the cases where only the timestamp is there but no customer id, it
> >> will be simple again. Create a scanner on the main table with only one
> >> filter. At the CP side the scanner on the index region will get created
> >> with startkey as 'x' + <timestamp value>. When you create the scan
> >> object and set startRow on it, it need not be the full rowkey. It can be
> >> part of the rowkey also. Yes, like a prefix.
> >>
> >> Hope u got it now :)
> >
> > Anil: I hope now we are on the same page.
> > Thanks a lot for your valuable time to discuss this stuff.
> >
> >> -Anoop-
> >> ________________________________________
> >> From: anil gupta [[email protected]]
> >> Sent: Friday, December 14, 2012 11:31 PM
> >> To: [email protected]
> >> Subject: Re: HBase - Secondary Index
> >>
> >> On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <[email protected]> wrote:
> >>
> >>> Hi Anil,
> >>>
> >>>> 1. In your presentation you mentioned that region of Primary Table and
> >>>> Region of Secondary Table are always located on the same region server. How
> >>>> do you achieve it? By using the Primary table rowkey as prefix of Rowkey
> >>>> of Secondary Table? Will your implementation work if the rowkey of primary
> >>>> table cannot be used as prefix in rowkey of Secondary table (I have this
> >>>> limitation in my use case)?
> >>> First of all, there will be the same number of regions in both primary and index
> >>> tables. All the start/stop keys of the regions will also be the same.
> >>> Suppose there are 2 regions on the main table, say for keys 0-10 and 10-20.
> >>> Then we will create 2 regions in the index table also with the same key ranges.
> >>> At the master balancing level it is easy to collocate these regions seeing
> >>> the start and end keys.
> >>> The selection of the rowkey that will be used in the index table is
> >>> the key here.
> >>> What we will do is all the rowkeys in the index table will be prefixed
> >>> with the start key of the region.
> >>> When an entry is added to the main table with rowkey as 5 it will go to
> >>> the 1st region (0-10).
> >>> Now there will be an index region with range as 0-10. We will select this
> >>> region to store this index data.
> >>> The row getting added into the index region for this entry will have a
> >>> rowkey 0_x_5
> >>> I am just using '_' as a separator here just to show this. Actually we
> >>> won't be having any separator.
> >>> So the rowkeys (in index region) will have a static begin part always.
> >>> At scan time also we know this part and so the startrow and endrow
> >>> creation for the scan will be possible. Note that we will store the actual
> >>> table row key as the last part of the index rowkey itself, not as a value.
> >>> This is the better option in our case for handling the scan index usage also at
> >>> server side. There is no index data fetch to client side.
> >>
> >> Anil: My primary table rowkey is customerId+event_id, and my secondary
> >> table rowkey is timestamp+customerid. In your implementation it seems like
> >> for using the secondary index the application needs to know about the
> >> "start_key" of the region (static begin part) it wants to query. Right? Do
> >> you separately manage the logic of determining the region
> >> "start_key" (static begin part) for a scan?
> >> Also, it's possible that while using the secondary index the customerId is not
> >> provided. So, I won't have the customer id for all the queries. Hence I
> >> cannot use customer_id as a prefix in the rowkey of my Secondary Table.
> >>
> >>> I feel your use case perfectly fits with our model
> >> Anil: Somehow I am unable to fit your implementation into my use case due
> >> to the constraint of the static begin part of the rowkey in the Secondary table. There
> >> seems to be a disconnect. Can you tell me how my use case fits into
> >> your implementation?
> >>
> >>>> 2. Are you using an Endpoint or Observer for building the secondary index
> >>>> table?
> >>> Observer
> >>>
> >>>> 3. "Custom balancer do collocation". Is it a custom load balancer of HBase
> >>>> Master or something else?
> >>> It is a balancer implementation which will be plugged into Master
> >>>
> >>>> 4. Your region split looks interesting. I don't have much info about it. Can
> >>>> you point to some docs on IndexHalfStoreFileReader?
> >>> Sorry I am not able to publish any design doc or code as the company has
> >>> not decided to open source the solution yet.
> >>> Any particular query you come across pls feel free to ask me :)
> >>> You can see the HalfStoreFileReader class 1st..
> >>>
> >>> -Anoop-
> >>> ________________________________________
> >>> From: anil gupta [[email protected]]
> >>> Sent: Friday, December 14, 2012 2:11 PM
> >>> To: [email protected]
> >>> Subject: Re: HBase - Secondary Index
> >>>
> >>> Hi Anoop,
> >>>
> >>> Nice presentation and seems like a smart implementation. Since the
> >>> presentation only covered bullet points, I have a couple of questions on
> >>> your implementation. :)
> >>>
> >>> Here is a recap of my implementation and our previous discussion on
> >>> Secondary index:
> >>>
> >>> Here is the link to previous email thread:
> >>> http://search-hadoop.com/m/1zWPMaaRtr .
> >>>
> >>> The secondary index is stored in table "B" as rowkey B --> family:<rowkey A>.
> >>> "<rowkey A>" is the column qualifier. Every row in B will only
> >>> have one column "k" and the value of that column is the rowkey of A.
> >>>
> >>> Suppose I am storing customer events in table A. I have two requirements for
> >>> data query:
> >>> 1. Query customer events on basis of customer_Id and event_ID.
> >>> 2. Query customer events on basis of event_timestamp and customer_ID.
> >>>
> >>> 70% of querying is done by query#1, so I will create
> >>> <customer_Id><event_ID> as row key of Table A.
> >>> Now, in order to support fast results for query#2, I need to create a
> >>> secondary index on A. I store that secondary index in B; the rowkey of B is
> >>> <event_timestamp><customer_ID>. Every row stores the corresponding rowkey of
> >>> A.
> >>>
> >>> HBase Querying approach:
> >>> 1. Scan the secondary table by using prefix filter and startRow to get the
> >>> list of Rowkeys of Primary table.
> >>> 2. Do a batch get on primary table by using HTable.get(List<Get>) method
> >>> using the list of Rowkeys obtained in step1.
> >>>
> >>> The only issue is that in my solution I have at least two RPC calls, one
> >>> each in step1 and step2 above. I want to reduce the number of RPCs to 1 if
> >>> possible.
> >>>
> >>> ******Questions on your implementation:*********
> >>>
> >>> 1. In your presentation you mentioned that region of Primary Table and
> >>> Region of Secondary Table are always located on the same region server. How
> >>> do you achieve it? By using the Primary table rowkey as prefix of Rowkey
> >>> of Secondary Table? Will your implementation work if the rowkey of primary
> >>> table cannot be used as prefix in rowkey of Secondary table (I have this
> >>> limitation in my use case)?
> >>> 2. Are you using an Endpoint or Observer for building the secondary index
> >>> table?
> >>> 3. "Custom balancer do collocation". Is it a custom load balancer of HBase
> >>> Master or something else?
> >>> 4. Your region split looks interesting. I don't have much info about it. Can
> >>> you point to some docs on IndexHalfStoreFileReader?
> >>>
> >>> Thanks,
> >>> Anil Gupta
> >>>
> >>> On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <[email protected]> wrote:
> >>>
> >>>> Hi All
> >>>>
> >>>> Last week I got a chance to present the secondary indexing
> >>>> solution that we have done at Huawei at the China Hadoop Conference. You
> >>>> can see the presentation from
> >>>> http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
> >>>>
> >>>> I would like to hear what others think on this. :)
> >>>>
> >>>> -Anoop-
> >>>
> >>> --
> >>> Thanks & Regards,
> >>> Anil Gupta
> >>
> >> --
> >> Thanks & Regards,
> >> Anil Gupta
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>

--
Thanks & Regards,
Anil Gupta
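P.S. For anyone reading this thread later: the index rowkey layout Anoop describes (region start key as a static prefix, then the indexed value, then the main table rowkey as the tail) can be sketched as a toy in-memory model. This is only an illustration under assumptions of mine, not Huawei's implementation and not the HBase API: the fixed field widths, the sample region start keys and the function names are all hypothetical, and one sorted Python list stands in for the index regions.

```python
from bisect import bisect_left, bisect_right

# Assumed fixed-width layout (hypothetical, not from the thread):
# customer id = 4 bytes, event id = 4 bytes, timestamp = 10 bytes.
MAIN_KEY_LEN = 8                          # <customer_id><event_id>
REGION_START_KEYS = [b"c000", b"c500"]    # two main-table regions (made up)


def region_start_for(main_rowkey):
    """Start key of the region whose key range contains main_rowkey."""
    i = bisect_right(REGION_START_KEYS, main_rowkey) - 1
    return REGION_START_KEYS[max(i, 0)]


def index_rowkey(main_rowkey, timestamp):
    """<region start key><timestamp><customer id><main rowkey>, no separator."""
    customer_id = main_rowkey[:4]
    return region_start_for(main_rowkey) + timestamp + customer_id + main_rowkey


def scan_index_region(index_keys, region_start, timestamp, customer_id=b""):
    """Prefix scan within one index region; returns main-table rowkeys."""
    prefix = region_start + timestamp + customer_id
    i = bisect_left(index_keys, prefix)   # plays the role of startRow
    hits = []
    while i < len(index_keys) and index_keys[i].startswith(prefix):
        hits.append(index_keys[i][-MAIN_KEY_LEN:])  # main rowkey is the tail
        i += 1
    return hits


def query_all_regions(index_keys, timestamp, customer_id=b""):
    """The scan visits every region; each region knows its own start key."""
    hits = []
    for start in REGION_START_KEYS:
        hits.extend(scan_index_region(index_keys, start, timestamp, customer_id))
    return hits


# Sample rows: (main rowkey, event timestamp). Data is invented.
rows = [
    (b"c123e001", b"1355800000"),
    (b"c123e002", b"1355900000"),
    (b"c777e001", b"1355800000"),
    (b"c200e005", b"1355800000"),
]
index_keys = sorted(index_rowkey(key, ts) for key, ts in rows)

print(query_all_regions(index_keys, b"1355800000"))
print(query_all_regions(index_keys, b"1355800000", b"c123"))
```

The point of the sketch is that the caller never supplies the static prefix. Each region contributes its own start key, so a query with only the timestamp still works, at the cost of touching every region.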
