Re: HBase - Secondary Index

Michel Segel Tue, 18 Dec 2012 01:03:14 -0800

Just a couple of questions...

First, since you don't have any natural secondary indices, you can create one 
from a couple of choices. Keeping it simple, you choose an inverted table as 
your index.


In doing so, you have one column containing all of the row ids for a given 
value.
This means that it is a simple get(). 
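
To make that concrete, here is a minimal in-memory sketch of an inverted index table, assuming each indexed value maps to the list of main-table row ids that carry it. The class name, method names and sample keys are hypothetical, and this is not the HBase client API; it only shows why a lookup reduces to a single get.

```python
from collections import defaultdict


class InvertedIndex:
    """In-memory stand-in for an inverted index table (not the HBase API)."""

    def __init__(self):
        # indexed value -> list of main-table row ids holding that value
        self._rows = defaultdict(list)

    def put(self, value, row_id):
        # Record that main-table row `row_id` carries `value`.
        if row_id not in self._rows[value]:
            self._rows[value].append(row_id)

    def get(self, value):
        # A single lookup returns every row id for the value.
        return list(self._rows.get(value, []))


index = InvertedIndex()
index.put("2012-12-18", "cust42|evt1")
index.put("2012-12-18", "cust7|evt9")
index.put("2012-12-19", "cust42|evt2")
row_ids = index.get("2012-12-18")
```

The row ids returned by get() would still have to be fetched from the main table, which is the second round trip discussed further down the thread.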

My question is that since you don't have any formal SQL syntax, how are you 
doing this all server side?


Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 18, 2012, at 2:28 AM, anil gupta <[email protected]> wrote:

> Hi Anoop,
> 
> Please find my reply inline.
> 
> Thanks,
> Anil Gupta
> 
> On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[email protected]> wrote:
> 
>> Hi Anil
>> During the scan, there is no need to fetch any index data
>> to client side. So there is no need to create any scanner on the index
>> table at the client side. This happens at the server side.
> 
> 
>> 
>> For the scan on the main table with conditions on timestamp and customer
>> id, a scanner is created with Filters, just as you would when there is no
>> secondary index. So this scan from the client will go through all the
>> regions in the main table.
> 
> 
> Anil: Do you mean that if the table is spread across 50 region servers in
> a 60-node cluster, then we need to send a scan request to all 50 RS?
> Doesn't that sound expensive? IMHO you were not doing this in your
> solution. Your solution looked cleaner than this, since you knew exactly
> which node to go to when querying via the secondary index, thanks to the
> co-location (due to the static begin part of the secondary table rowkey) of
> the regions of the primary table and the secondary index table. My problem
> is a little more complicated because of one constraint: I cannot have a
> "static begin part" in the rowkey of my secondary table.
> 
>> When it scans one particular region, say (x,y], on the main table, using the
>> CP we can get the index table region object corresponding to this main
>> table region from the RS.  There is no issue in creating the static part of
>> the rowkey. You know 'x' is the region start key. Then at the server side
>> we will create a scanner on the index region directly, and here we can specify
>> the startkey: 'x' + <timestamp value> + <customer id>.  Using the results
>> from the index scan we will reseek on the main region to the exact
>> rows where the data we are interested in is available. So there won't
>> be a full region data scan happening.
> 
>> In cases where only the timestamp is given and no customer id, it
>> is simple again. Create a scanner on the main table with only one
>> filter. At the CP side the scanner on the index region will get created
>> with startkey 'x' + <timestamp value>.  When you create the scan
>> object and set startRow on it, it need not be the full rowkey. It can be
>> a part of the rowkey, i.e. a prefix.
>> 
>> Hope u got it now :)
> Anil: I hope we are on the same page now. Thanks a lot for your valuable time
> discussing this.
> 
>> 
>> -Anoop-
>> ________________________________________
>> From: anil gupta [[email protected]]
>> Sent: Friday, December 14, 2012 11:31 PM
>> To: [email protected]
>> Subject: Re: HBase - Secondary Index
>> 
>> On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <[email protected]>
>> wrote:
>> 
>>> Hi Anil,
>>> 
>>>> 1. In your presentation you mentioned that region of Primary Table and
>>>> Region of Secondary Table are always located on the same region server. How
>>>> do you achieve it? By using the Primary table rowkey as prefix of Rowkey
>>>> of Secondary Table? Will your implementation work if the rowkey of primary
>>>> table cannot be used as prefix in rowkey of Secondary table (I have this
>>>> limitation in my use case)?
>>> First of all, there will be the same number of regions in both the primary
>>> and index tables. All the start/stop keys of the regions will also be the same.
>>> Suppose there are 2 regions on the main table, say for keys 0-10 and 10-20.
>>> Then we will create 2 regions in the index table with the same key ranges.
>>> At the master balancing level it is easy to collocate these regions by
>>> looking at the start and end keys.
>>> The selection of the rowkey that will be used in the index table is
>>> the key here.
>>> What we do is prefix all the rowkeys in the index table
>>> with the start key of the region.
>>> When an entry is added to the main table with rowkey 5, it will go to
>>> the 1st region (0-10).
>>> There will be an index region with range 0-10 as well.  We select this
>>> region to store the index data.
>>> The row added to the index region for this entry will have the
>>> rowkey 0_x_5.
>>> I am using '_' as a separator here just to illustrate. Actually we
>>> won't have any separator.
>>> So the rowkeys (in the index region) will always have a static begin part.
>>> At scan time we also know this part, so creating the startrow and endrow
>>> for the scan is possible. Note that we store the actual
>>> table row key as the last part of the index rowkey itself, not as a value.
>>> This is the better option in our case, since the index is also used
>>> at the server side during the scan.  There is no index data fetch to the
>>> client side.
>> 
>> Anil: My primary table rowkey is customerId+event_id, and my secondary
>> table rowkey is timestamp+customerId. In your implementation it seems that,
>> to use the secondary index, the application needs to know the
>> "start_key" of the region (static begin part) it wants to query. Right? Do
>> you separately manage the logic of determining the region
>> "start_key" (static begin part) for a scan?
>> Also, it's possible that the customerId is not provided when using the
>> secondary index. So I won't have the customer id for all queries. Hence I
>> cannot use customer_id as a prefix in the rowkey of my secondary table.
>> 
>>> 
>>> I feel your use case fits our model perfectly.
>> Anil: Somehow I am unable to fit your implementation to my use case, due
>> to the constraint of a static begin part in the rowkey of the secondary table.
>> There seems to be a disconnect. Can you tell me how my use case fits into
>> your implementation?
>> 
>>> 
>>>> 2. Are you using an Endpoint or Observer for building the secondary index
>>>> table?
>>> Observer
>>> 
>>>> 3. "Custom balancer do collocation". Is it a custom load balancer of HBase
>>>> Master or something else?
>>> It is a balancer implementation which will be plugged into the Master.
>>> 
>>>> 4. Your region split looks interesting. I don't have much info about it. Can
>>>> you point to some docs on IndexHalfStoreFileReader?
>>> Sorry, I am not able to publish any design doc or code, as the company has
>>> not yet decided to open source the solution.
>>> If you come across any particular query, please feel free to ask me :)
>>> You can look at the HalfStoreFileReader class first.
>>> 
>>> -Anoop-
>>> ________________________________________
>>> From: anil gupta [[email protected]]
>>> Sent: Friday, December 14, 2012 2:11 PM
>>> To: [email protected]
>>> Subject: Re: HBase - Secondary Index
>>> 
>>> Hi Anoop,
>>> 
>>> Nice presentation, and it seems like a smart implementation. Since the
>>> presentation only covered bullet points, I have a couple of questions on
>>> your implementation. :)
>>> 
>>> Here is a recap of my implementation and our previous discussion on
>>> Secondary index:
>>> 
>>> Here is the link to previous email thread:
>>> http://search-hadoop.com/m/1zWPMaaRtr .
>>> 
>>> The secondary index is stored in table "B" as rowkey B --> family:<rowkey
>>> A>  . "<rowkey A>" is the column qualifier. Every row in B will have
>>> only one column "k" and the value of that column is the rowkey of A.
>>> 
>>> Suppose I am storing customer events in table A. I have two requirements for
>>> data query:
>>> 1. Query customer events on the basis of customer_Id and event_ID.
>>> 2. Query customer events on the basis of event_timestamp and customer_ID.
>>> 
>>> 70% of querying is done by query#1, so I will create
>>> <customer_Id><event_ID> as the row key of Table A.
>>> Now, in order to support fast results for query#2, I need to create a
>>> secondary index on A. I store that secondary index in B; the rowkey of B is
>>> <event_timestamp><customer_ID>. Every row stores the corresponding rowkey of
>>> A.
>>> 
>>> HBase Querying approach:
>>> 1. Scan the secondary table by using prefix filter and startRow to get the
>>> list of Rowkeys of Primary table.
>>> 2. Do a batch get on primary table by using HTable.get(List<Get>) method
>>> using the list of Rowkeys obtained in step1.
>>> 
>>> The only issue is that my solution needs at least two RPC calls, one
>>> each in step1 and step2 above. I want to reduce the number of RPCs to 1 if
>>> possible.
>>> 
>>> 
>>> ******Questions on your implementation:*********
>>> 
>>> 1. In your presentation you mentioned that region of Primary Table and
>>> Region of Secondary Table are always located on the same region server. How
>>> do you achieve it? By using the Primary table rowkey as prefix of Rowkey
>>> of Secondary Table? Will your implementation work if the rowkey of primary
>>> table cannot be used as prefix in rowkey of Secondary table (I have this
>>> limitation in my use case)?
>>> 2. Are you using an Endpoint or Observer for building the secondary index
>>> table?
>>> 3. "Custom balancer do collocation". Is it a custom load balancer of HBase
>>> Master or something else?
>>> 4. Your region split looks interesting. I don't have much info about it. Can
>>> you point to some docs on IndexHalfStoreFileReader?
>>> 
>>> Thanks,
>>> Anil Gupta
>>> 
>>> 
>>> 
>>> On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <[email protected]>
>>> wrote:
>>> 
>>>> Hi All
>>>> 
>>>> Last week I got a chance to present the secondary indexing
>>>> solution that we have built at Huawei at the China Hadoop Conference. You
>>>> can see the presentation at
>>>> http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
>>>> 
>>>> 
>>>> 
>>>> I would like to hear what others think on this. :)
>>>> 
>>>> 
>>>> 
>>>> -Anoop-
>>> 
>>> 
>>> 
>>> --
>>> Thanks & Regards,
>>> Anil Gupta
>> 
>> 
>> 
>> --
>> Thanks & Regards,
>> Anil Gupta
> 
> 
> 
> -- 
> Thanks & Regards,
> Anil Gupta
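
The region-local index scheme described in the quoted messages (index rowkey = region start key + indexed columns + main-table rowkey, scanned per region by prefix) can be sketched roughly as below. This is an in-memory illustration only, not the Huawei implementation and not HBase coprocessor code: the Region class, the helper functions, the "\x00" separator (the thread says the real keys use no separator) and the sample keys are all assumptions made for the example.

```python
import bisect

SEP = "\x00"  # hypothetical separator, used only to keep the sketch readable


class Region:
    """One main-table region plus its co-located index region (in memory)."""

    def __init__(self, start_key, end_key):
        self.start_key = start_key
        self.end_key = end_key    # exclusive; None means open-ended
        self.main = {}            # main rowkey -> row data
        self.index_keys = []      # sorted index rowkeys

    def contains(self, rowkey):
        return self.start_key <= rowkey and (
            self.end_key is None or rowkey < self.end_key)

    def put(self, rowkey, timestamp, customer_id):
        self.main[rowkey] = {"ts": timestamp, "cust": customer_id}
        # Static begin part (region start key), then the indexed columns,
        # then the main rowkey as the last part of the index rowkey.
        index_key = SEP.join([self.start_key, timestamp, customer_id, rowkey])
        bisect.insort(self.index_keys, index_key)

    def scan_by_index(self, timestamp, customer_id=None):
        # The start row may be a partial key: start key + timestamp,
        # optionally followed by the customer id.
        parts = [self.start_key, timestamp]
        if customer_id is not None:
            parts.append(customer_id)
        prefix = SEP.join(parts) + SEP
        pos = bisect.bisect_left(self.index_keys, prefix)
        hits = []
        while (pos < len(self.index_keys)
               and self.index_keys[pos].startswith(prefix)):
            # The main rowkey is the last component; "reseek" to that row.
            hits.append(self.index_keys[pos].split(SEP)[-1])
            pos += 1
        return hits


def put(regions, rowkey, timestamp, customer_id):
    for region in regions:
        if region.contains(rowkey):
            region.put(rowkey, timestamp, customer_id)
            return
    raise KeyError(rowkey)


def query(regions, timestamp, customer_id=None):
    results = []
    for region in regions:  # the scan still visits every region
        for rowkey in region.scan_by_index(timestamp, customer_id):
            results.append((rowkey, region.main[rowkey]))
    return results


regions = [Region("", "m"), Region("m", None)]
put(regions, "alice-e1", "20121214", "alice")
put(regions, "alice-e2", "20121215", "alice")
put(regions, "zed-e1", "20121214", "zed")
by_ts = [rowkey for rowkey, _ in query(regions, "20121214")]
by_ts_and_cust = [rowkey for rowkey, _ in query(regions, "20121214", "zed")]
```

Note that query() still visits every region, which is the cost Anil raises above; what the scheme avoids is shipping index data to the client and scanning each region in full.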
