Re: HBase - Secondary Index

anil gupta Fri, 14 Dec 2012 10:02:23 -0800

On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <[email protected]> wrote:


> Hi Anil,
>
> >1. In your presentation you mentioned that the region of the Primary
> >table and the region of the Secondary table are always located on the
> >same region server. How do you achieve it? By using the Primary table
> >rowkey as a prefix of the rowkey of the Secondary table? Will your
> >implementation work if the rowkey of the primary table cannot be used as
> >a prefix in the rowkey of the Secondary table (I have this limitation in
> >my use case)?
>
> First of all, there will be the same number of regions in both the
> primary and index tables. All the start/stop keys of the regions will
> also be the same.
> Suppose there are 2 regions in the main table, say for keys 0-10 and
> 10-20. Then we will create 2 regions in the index table with the same
> key ranges.
> At the master balancing level it is easy to collocate these regions by
> looking at the start and end keys.
> The selection of the rowkey that will be used in the index table is the
> key here.
> What we do is prefix all the rowkeys in the index table with the start
> key of the region.
> When an entry is added to the main table with rowkey 5, it will go to
> the 1st region (0-10).
> There will be an index region with the range 0-10 as well. We will select
> this region to store the index data.
> The row added to the index region for this entry will have the rowkey
> 0_x_5.
> I am using '_' as a separator here just to illustrate. Actually we won't
> have any separator.
> So the rowkeys (in an index region) will always have a static beginning
> part. At scan time we also know this part, so creating the startrow and
> endrow for the scan is possible. Note that we store the actual table
> rowkey as the last part of the index rowkey itself, not as a value.
> This is the better option in our case, since the scan's index usage is
> also handled at the server side. No index data is fetched to the client
> side.
>

Anil: My primary table rowkey is customerId+event_id, and my secondary
table rowkey is timestamp+customerId. In your implementation it seems that,
to use the secondary index, the application needs to know the "start_key"
of the region (the static beginning part) it wants to query. Right? Do
you separately manage the logic of determining the region "start_key"
(static beginning part) for a scan?
Also, it's possible that the customerId is not provided when the secondary
index is used, so I won't have a customer id for all queries. Hence I
cannot use customer_id as a prefix in the rowkey of my Secondary table.
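
A minimal sketch of the index rowkey layout as described in the quoted answer above may make the question concrete. This is not Anoop's implementation (which is not public): the region boundaries, the two-digit keys, the function names, and the '_' separator are all assumptions for illustration, and the per-region scan ranges are only one plausible reading of how a query without a known region could be served.

```python
import bisect

# Hypothetical region start keys shared by the main and index tables.
# Keys are fixed-width strings and are assumed never to sort before b"00".
REGION_START_KEYS = [b"00", b"10"]

def region_start_for(main_rowkey: bytes) -> bytes:
    """Return the start key of the region that holds main_rowkey."""
    i = bisect.bisect_right(REGION_START_KEYS, main_rowkey) - 1
    return REGION_START_KEYS[i]

def index_rowkey(main_rowkey: bytes, indexed_value: bytes) -> bytes:
    """Build <region start key>_<indexed value>_<main rowkey>.

    The '_' is only for readability; the thread says no separator is used.
    """
    return b"_".join([region_start_for(main_rowkey), indexed_value, main_rowkey])

def scan_ranges(indexed_value: bytes):
    """Return one (startrow, stoprow) pair per region for an indexed value.

    The query does not say which region holds the matches, so this sketch
    builds a range for every region's static prefix.
    """
    ranges = []
    for start in REGION_START_KEYS:
        prefix = start + b"_" + indexed_value + b"_"
        # Exclusive stop row: the prefix with its last byte incremented.
        stop = prefix[:-1] + bytes([prefix[-1] + 1])
        ranges.append((prefix, stop))
    return ranges
```

In this sketch the static prefix is derived from the main table rowkey when the index row is written, so the indexed value itself (for example a timestamp) does not have to start with the customer id.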

>
> I feel your use case fits perfectly with our model
>
Anil: Somehow I am unable to fit your implementation to my use case because
of the constraint of a static beginning part of the rowkey in the Secondary
table. There seems to be a disconnect. Can you tell me how my use case fits
into your implementation?
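
For reference, the two-step query approach recapped in the quoted message below can be sketched with in-memory stand-ins for tables A and B. The sample rows, the '|' separator, and the function names are assumptions for illustration; a real client would issue an HBase scan for step 1 and a batch get for step 2.

```python
# Stand-in for table A; rowkey is <customer_id>|<event_id>.
TABLE_A = {
    b"c1|e1": {"event": "login"},
    b"c1|e2": {"event": "purchase"},
    b"c2|e7": {"event": "login"},
}

# Stand-in for index table B; rowkey is <event_timestamp>|<customer_id>,
# and the stored value is the rowkey of the matching row in A.
TABLE_B = {
    b"20121213|c1": b"c1|e1",
    b"20121214|c1": b"c1|e2",
    b"20121214|c2": b"c2|e7",
}

def scan_index(prefix: bytes):
    """Step 1: prefix scan over B in rowkey order, returning rowkeys of A."""
    return [TABLE_B[k] for k in sorted(TABLE_B) if k.startswith(prefix)]

def batch_get(rowkeys):
    """Step 2: batch lookup of the collected rowkeys in A."""
    return [TABLE_A[k] for k in rowkeys if k in TABLE_A]

def query_by_timestamp(prefix: bytes):
    """Both steps together; against HBase this costs at least two RPCs."""
    return batch_get(scan_index(prefix))
```

Because the timestamp leads the rowkey of B, a prefix such as `b"20121214"` works without a customer id, while `b"20121214|c2"` narrows the scan to one customer.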

>
> >2. Are you using an Endpoint or Observer for building the secondary index
> >table?
> Observer
>
> >3. "Custom balancer do collocation". Is it a custom load balancer of HBase
> >Master or something else?
> It is a balancer implementation which will be plugged into the Master.
>
> >4. Your region split looks interesting. I don't have much info about it.
> >Can you point to some docs on IndexHalfStoreFileReader?
> Sorry, I am not able to publish any design doc or code, as the company has
> not yet decided to open source the solution.
> If you come across any particular query, please feel free to ask me :)
> You can look at the HalfStoreFileReader class first.
>
> -Anoop-
> ________________________________________
> From: anil gupta [[email protected]]
> Sent: Friday, December 14, 2012 2:11 PM
> To: [email protected]
> Subject: Re: HBase - Secondary Index
>
> Hi Anoop,
>
> Nice presentation, and it seems like a smart implementation. Since the
> presentation only covered bullet points, I have a couple of questions on
> your implementation. :)
>
> Here is a recap of my implementation and our previous discussion on
> secondary indexes:
>
> Here is the link to previous email thread:
> http://search-hadoop.com/m/1zWPMaaRtr .
>
> The secondary index is stored in table "B" as rowkey B --> family:<rowkey
> A>. "<rowkey A>" is the column qualifier. Every row in B will have only
> one column "k", and the value of that column is the rowkey of A.
>
> Suppose I am storing customer events in table A. I have two requirements
> for data queries:
> 1. Query customer events on the basis of customer_Id and event_ID.
> 2. Query customer events on the basis of event_timestamp and customer_ID.
>
> 70% of querying is done by query#1, so I will use
> <customer_Id><event_ID> as the rowkey of table A.
> Now, in order to support fast results for query#2, I need to create a
> secondary index on A. I store that secondary index in B; the rowkey of B
> is <event_timestamp><customer_ID>. Every row stores the corresponding
> rowkey of A.
>
> HBase Querying approach:
> 1. Scan the secondary table using a prefix filter and startRow to get the
> list of rowkeys of the Primary table.
> 2. Do a batch get on the primary table with the HTable.get(List<Get>)
> method, using the list of rowkeys obtained in step 1.
>
> The only issue is that my solution needs at least two RPC calls, one
> each for step 1 and step 2 above. I want to reduce the number of RPCs to
> 1 if possible.
>
>
> ******Questions on your implementation:*********
>
> 1. In your presentation you mentioned that the region of the Primary table
> and the region of the Secondary table are always located on the same
> region server. How do you achieve it? By using the Primary table rowkey as
> a prefix of the rowkey of the Secondary table? Will your implementation
> work if the rowkey of the primary table cannot be used as a prefix in the
> rowkey of the Secondary table (I have this limitation in my use case)?
> 2. Are you using an Endpoint or Observer for building the secondary index
> table?
> 3. "Custom balancer do collocation". Is it a custom load balancer of HBase
> Master or something else?
> 4. Your region split looks interesting. I don't have much info about it.
> Can you point to some docs on IndexHalfStoreFileReader?
>
> Thanks,
> Anil Gupta
>
>
>
> On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <[email protected]>
> wrote:
>
> > Hi All
> >
> > Last week I got a chance to present the secondary indexing solution
> > that we have built at Huawei at the China Hadoop Conference. You can
> > see the presentation at
> > http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
> >
> >
> >
> > I would like to hear what others think on this. :)
> >
> >
> >
> > -Anoop-
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>



-- 
Thanks & Regards,
Anil Gupta
