Just a couple of questions...

First, since you don't have any natural secondary indices, you can create one from a couple of choices. Keeping it simple, you choose an inverted table as your index.
In doing so, you have one column containing all of the row ids for a given value. This means that it is a simple get().

My question is: since you don't have any formal SQL syntax, how are you doing this all server side?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 18, 2012, at 2:28 AM, anil gupta <[email protected]> wrote:

> Hi Anoop,
>
> Please find my reply inline.
>
> Thanks,
> Anil Gupta
>
> On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[email protected]> wrote:
>
>> Hi Anil
>> During the scan, there is no need to fetch any index data to the
>> client side. So there is no need to create any scanner on the index
>> table at the client side. This happens at the server side.
>>
>> For the Scan on the main table with a condition on timestamp and
>> customer id, a scanner is to be created with Filters. Yes, like normal
>> when there is no secondary index. So this scan from the client will go
>> through all the regions in the main table.
>
> Anil: Do you mean that if the table is spread across 50 region servers
> in a 60 node cluster then we need to send a scan request to all the
> 50 RS. Right? Doesn't it sound expensive? IMHO you were not doing this
> in your solution. Your solution looked cleaner than this since you
> knew exactly which node you need to go to for querying while using the
> secondary index, due to co-location (due to the static begin part for
> the secondary table rowkey) of the region of the primary table and the
> secondary index table. My problem is a little more complicated due to
> the constraint that I cannot have a "static begin part" in the rowkey
> of my secondary table.
>
>> When it scans one particular region, say (x,y], on the main table,
>> using the CP we can get the index table region object corresponding to
>> this main table region from the RS. There is no issue in creating the
>> static part of the rowkey. You know 'x' is the region start key.
>> Then at the server side we will create a scanner on the index region
>> directly, and here we can specify the startkey:
>> 'x' + <timestamp value> + <customer id>. Using the results from the
>> index scan we will reseek on the main region to the exact rows where
>> the data we are interested in is available. So there won't be a full
>> region data scan happening.
>>
>> In the cases where only the timestamp is there but no customer id, it
>> will be simple again. Create a scanner on the main table with only one
>> filter. At the CP side the scanner on the index region will get created
>> with startkey as 'x' + <timestamp value>. When you create the scan
>> object and set startRow on it, it need not be the full rowkey. It can
>> be part of the rowkey also. Yes, like a prefix.
>>
>> Hope u got it now :)
>
> Anil: I hope now we are on the same page. Thanks a lot for your
> valuable time to discuss this stuff.
>
>> -Anoop-
>> ________________________________________
>> From: anil gupta [[email protected]]
>> Sent: Friday, December 14, 2012 11:31 PM
>> To: [email protected]
>> Subject: Re: HBase - Secondary Index
>>
>> On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <[email protected]>
>> wrote:
>>
>>> Hi Anil,
>>>
>>>> 1. In your presentation you mentioned that region of Primary Table
>>>> and Region of Secondary Table are always located on the same region
>>>> server. How do you achieve it? By using the Primary table rowkey as
>>>> prefix of Rowkey of Secondary Table? Will your implementation work
>>>> if the rowkey of primary table cannot be used as prefix in rowkey of
>>>> Secondary table (i have this limitation in my use case)?
>>>
>>> First of all, there will be the same number of regions in both the
>>> primary and index tables. All the start/stop keys of the regions will
>>> also be the same.
>>> Suppose there are 2 regions on the main table, say for keys 0-10 and
>>> 10-20.
>>> Then we will create 2 regions in the index table also, with the same
>>> key ranges.
>>> At the master balancing level it is easy to collocate these regions
>>> seeing the start and end keys.
>>> The selection of the rowkey that will be used in the index table is
>>> the key here.
>>> What we will do is all the rowkeys in the index table will be prefixed
>>> with the start key of the region.
>>> When an entry is added to the main table with rowkey as 5 it will go
>>> to the 1st region (0-10).
>>> Now there will be an index region with range 0-10. We will select this
>>> region to store this index data.
>>> The row getting added into the index region for this entry will have a
>>> rowkey 0_x_5.
>>> I am just using '_' as a separator here just to show this. Actually we
>>> won't be having any separator.
>>> So the rowkeys (in the index region) will always have a static begin
>>> part. At scan time also we know this part, and so the startrow and
>>> endrow creation for the scan will be possible. Note that we will store
>>> the actual table row key as the last part of the index rowkey itself,
>>> not as a value. This is the better option in our case of handling the
>>> scan index usage also at the server side. There is no index data fetch
>>> to the client side.
>>
>> Anil: My primary table rowkey is customerId+event_id, and my secondary
>> table rowkey is timestamp+customerid. In your implementation it seems
>> like for using the secondary index the application needs to know about
>> the "start_key" of the region (static begin part) it wants to query.
>> Right? Do you separately manage the logic of determining the region
>> "start_key" (static begin part) for a scan?
>> Also, it's possible that while using the secondary index the customerId
>> is not provided. So, i won't be having customer id for all the queries.
>> Hence i cannot use customer_id as a prefix in the rowkey of my
>> Secondary Table.
>>
>>> I feel your use case perfectly fits with our model.
>>
>> Anil: Somehow i am unable to fit your implementation into my use case
>> due to the constraint of the static begin part of the rowkey in the
>> Secondary table. There seems to be a disconnect. Can you tell me how my
>> use case fits into your implementation?
>>
>>>> 2. Are you using an Endpoint or Observer for building the secondary
>>>> index table?
>>>
>>> Observer
>>>
>>>> 3. "Custom balancer do collocation". Is it a custom load balancer of
>>>> HBase Master or something else?
>>>
>>> It is a balancer implementation which will be plugged into the Master.
>>>
>>>> 4. Your region split looks interesting. I don't have much info about
>>>> it. Can you point to some docs on IndexHalfStoreFileReader?
>>>
>>> Sorry, I am not able to publish any design doc or code as the company
>>> has not decided to open source the solution yet.
>>> Any particular query you come across, pls feel free to ask me :)
>>> You can see the HalfStoreFileReader class first.
>>>
>>> -Anoop-
>>> ________________________________________
>>> From: anil gupta [[email protected]]
>>> Sent: Friday, December 14, 2012 2:11 PM
>>> To: [email protected]
>>> Subject: Re: HBase - Secondary Index
>>>
>>> Hi Anoop,
>>>
>>> Nice presentation, and it seems like a smart implementation. Since the
>>> presentation only covered bullet points, i have a couple of questions
>>> on your implementation. :)
>>>
>>> Here is a recap of my implementation and our previous discussion on
>>> Secondary index:
>>>
>>> Here is the link to the previous email thread:
>>> http://search-hadoop.com/m/1zWPMaaRtr .
>>>
>>> The secondary index is stored in table "B" as
>>> rowkey B --> family:<rowkey A> . "<rowkey A>" is the column qualifier.
>>> Every row in B will have only one column "k", and the value of that
>>> column is the rowkey of A.
>>>
>>> Suppose i am storing customer events in table A. I have two
>>> requirements for data query:
>>> 1. Query customer events on basis of customer_Id and event_ID.
>>> 2. Query customer events on basis of event_timestamp and customer_ID.
>>>
>>> 70% of querying is done by query#1, so i will create
>>> <customer_Id><event_ID> as the row key of Table A.
>>> Now, in order to support fast results for query#2, i need to create a
>>> secondary index on A. I store that secondary index in B; the rowkey of
>>> B is <event_timestamp><customer_ID>. Every row stores the
>>> corresponding rowkey of A.
>>>
>>> HBase Querying approach:
>>> 1. Scan the secondary table by using prefix filter and startRow to get
>>> the list of Rowkeys of the Primary table.
>>> 2. Do a batch get on the primary table by using the
>>> HTable.get(List<Get>) method with the list of Rowkeys obtained in
>>> step 1.
>>>
>>> The only issue is that in my solution i have at least two RPC calls,
>>> one each in step 1 and step 2 above. I want to reduce the number of
>>> RPCs to 1 if possible.
>>>
>>> ******Questions on your implementation:*********
>>>
>>> 1. In your presentation you mentioned that region of Primary Table and
>>> Region of Secondary Table are always located on the same region
>>> server. How do you achieve it? By using the Primary table rowkey as
>>> prefix of Rowkey of Secondary Table? Will your implementation work if
>>> the rowkey of primary table cannot be used as prefix in rowkey of
>>> Secondary table (i have this limitation in my use case)?
>>> 2. Are you using an Endpoint or Observer for building the secondary
>>> index table?
>>> 3. "Custom balancer do collocation". Is it a custom load balancer of
>>> HBase Master or something else?
>>> 4. Your region split looks interesting. I don't have much info about
>>> it. Can you point to some docs on IndexHalfStoreFileReader?
>>>
>>> Thanks,
>>> Anil Gupta
>>>
>>> On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <[email protected]>
>>> wrote:
>>>
>>>> Hi All
>>>>
>>>> Last week I got a chance to present the secondary indexing
>>>> solution what we have done in Huawei at the China Hadoop Conference.
>>>> You can see the presentation from
>>>> http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
>>>>
>>>> I would like to hear what others think on this. :)
>>>>
>>>> -Anoop-
>>>
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>
> --
> Thanks & Regards,
> Anil Gupta
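
For readers following the rowkey discussion in this thread, here is a rough sketch of the index key layout Anoop describes: region start key, then timestamp, then customer id, with the primary table rowkey stored as the last part of the index rowkey. It is only an illustration, not the Huawei implementation (which was not published). It uses no HBase API: a sorted in-memory set stands in for one index region, and all names (IndexKeySketch, indexKey, scan) are made up for this example. The fixed 8-byte width for timestamp and customer id is an assumption, chosen so the primary rowkey can be cut off the end without a separator.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch; a TreeSet stands in for one index region.
final class IndexKeySketch {
    // Unsigned lexicographic byte order, the way HBase sorts rowkeys.
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    };

    static byte[] concat(byte[]... parts) {
        int len = 0;
        for (byte[] p : parts) len += p.length;
        ByteBuffer buf = ByteBuffer.allocate(len);
        for (byte[] p : parts) buf.put(p);
        return buf.array();
    }

    // Big-endian encoding keeps non-negative longs in numeric order.
    static byte[] longBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // Assumed layout: regionStart + timestamp(8) + customerId(8) + primaryRowKey.
    static byte[] indexKey(byte[] regionStart, long ts, long customerId, byte[] primaryKey) {
        return concat(regionStart, longBytes(ts), longBytes(customerId), primaryKey);
    }

    static boolean startsWith(byte[] key, byte[] prefix) {
        if (key.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (key[i] != prefix[i]) return false;
        }
        return true;
    }

    // Prefix scan: seek to the prefix, read while keys still match it, and
    // return the primary rowkeys found after the fixed-width leading part.
    static List<String> scan(NavigableSet<byte[]> indexRegion, byte[] prefix, int fixedLen) {
        List<String> primaryKeys = new ArrayList<>();
        for (byte[] key : indexRegion.tailSet(prefix, true)) {
            if (!startsWith(key, prefix)) break;
            byte[] pk = Arrays.copyOfRange(key, fixedLen, key.length);
            primaryKeys.add(new String(pk, StandardCharsets.UTF_8));
        }
        return primaryKeys;
    }

    public static void main(String[] args) {
        byte[] start = "r0".getBytes(StandardCharsets.UTF_8); // region start key 'x'
        int fixedLen = start.length + 2 * Long.BYTES;
        NavigableSet<byte[]> region = new TreeSet<>(BYTE_ORDER);
        region.add(indexKey(start, 200L, 7L, "7|e3".getBytes(StandardCharsets.UTF_8)));
        region.add(indexKey(start, 100L, 9L, "9|e2".getBytes(StandardCharsets.UTF_8)));
        region.add(indexKey(start, 100L, 7L, "7|e1".getBytes(StandardCharsets.UTF_8)));

        // Timestamp and customer id both known: 'x' + <timestamp> + <customer id>
        byte[] full = concat(start, longBytes(100L), longBytes(7L));
        System.out.println(scan(region, full, fixedLen)); // [7|e1]

        // Only timestamp known: 'x' + <timestamp> still works as a prefix
        byte[] tsOnly = concat(start, longBytes(100L));
        System.out.println(scan(region, tsOnly, fixedLen)); // [7|e1, 9|e2]
    }
}
```

The second scan shows the point made to Anil above: the start row does not need to be the full rowkey, so a query without a customer id can still use the same index with a shorter prefix. In a real deployment the region start key would be supplied by the coprocessor for each region it runs in, rather than by the client.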
