Re: HBase - Secondary Index

anil gupta Wed, 19 Dec 2012 00:40:15 -0800

Hi Michael,

Please find my replies inline.


Thanks,
Anil

On Tue, Dec 18, 2012 at 1:02 AM, Michel Segel <[email protected]> wrote:

> Just a couple of questions...
>
> First, since you don't have any natural secondary indices, you can create
> one from a couple of choices. Keeping it simple, you choose an inverted
> table as your index.
>
Reasons for not creating an inverted table:
1. There can be millions of columns corresponding to a rowkey in my
secondary index, and that number may grow further in future.
2. While using the secondary index, we are also planning to filter on
other non-rowkey columns.
For example, one row of the secondary table might look like this:
Rowkey: cf:PrimarytableRowKey=x, cf:customerFirstName=xyz,
cf:customerAddress=123, Union Sq, LA
My primary table has around 50 columns, and in the secondary table I
duplicate two of them so they can be used along with the secondary index
for filtering.
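
To make the row layout above concrete, here is a rough sketch. It is a
plain-Python toy model, not HBase client code: the sample rows, the '|'
separator in the rowkeys and the function name scan_prefix are all made up
for illustration. It only shows the idea of a prefix scan on the secondary
table combined with a filter on a duplicated column.

```python
from bisect import bisect_left

# Toy secondary table, kept sorted by rowkey the way HBase keeps rows sorted.
# Assumed rowkey layout: <event_timestamp>|<customer_id> ('|' is illustrative).
SECONDARY = sorted([
    ("20121218|c1", {"PrimarytableRowKey": "c1|e7", "customerFirstName": "xyz"}),
    ("20121218|c2", {"PrimarytableRowKey": "c2|e3", "customerFirstName": "abc"}),
    ("20121219|c1", {"PrimarytableRowKey": "c1|e9", "customerFirstName": "xyz"}),
], key=lambda row: row[0])


def scan_prefix(table, prefix, column_filter=None):
    """Return primary rowkeys of index rows whose rowkey starts with prefix.

    column_filter is an optional {column: value} dict applied to the
    duplicated (non-rowkey) columns stored in the secondary table.
    """
    keys = [rowkey for rowkey, _ in table]
    i = bisect_left(keys, prefix)  # seek to the first candidate row
    matches = []
    while i < len(table) and table[i][0].startswith(prefix):
        _, cols = table[i]
        if column_filter is None or all(
            cols.get(col) == val for col, val in column_filter.items()
        ):
            matches.append(cols["PrimarytableRowKey"])
        i += 1
    return matches
```

For instance, scan_prefix(SECONDARY, "20121218") walks only the rows for
that timestamp, and passing {"customerFirstName": "xyz"} as column_filter
narrows the result without touching the primary table.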

>
> In doing so, you have one column containing all of the row ids for a given
> value.
> This means that it is a simple get().
>
> My question is that since you don't have any formal SQL syntax, how are
> you doing this all server side?
>
As Anoop said, I am not doing the index data scan on the server side. I
scan the index table data back to the client, and from the client I do gets
to fetch the main table data.
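
The client-side flow can be sketched roughly as below. Again this is a
plain-Python toy model under stated assumptions, not HBase code: the two
dicts stand in for the primary and index tables, the '|' separator and the
function names are hypothetical, and batch_get only mimics the role that
HTable.get(List<Get>) plays in the real client.

```python
# Toy primary table: rowkey <customer_id>|<event_id> -> row data.
PRIMARY = {
    "c1|e7": {"event": "login"},
    "c2|e3": {"event": "purchase"},
    "c1|e9": {"event": "logout"},
}

# Toy index table: rowkey <event_timestamp>|<customer_id> -> primary rowkey.
INDEX = {
    "20121218|c1": "c1|e7",
    "20121218|c2": "c2|e3",
    "20121219|c1": "c1|e9",
}


def scan_index(prefix):
    """Round trip 1: prefix scan over the index table, in rowkey order."""
    return [INDEX[key] for key in sorted(INDEX) if key.startswith(prefix)]


def batch_get(rowkeys):
    """Round trip 2: one batched get against the primary table."""
    return [PRIMARY[key] for key in rowkeys if key in PRIMARY]


def query_by_timestamp(prefix):
    """Two round trips in total: index scan, then batched get."""
    return batch_get(scan_index(prefix))
```

The two separate calls inside query_by_timestamp are the two RPCs I would
like to reduce to one.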

>
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Dec 18, 2012, at 2:28 AM, anil gupta <[email protected]> wrote:
>
> > Hi Anoop,
> >
> > Please find my reply inline.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[email protected]> wrote:
> >
> >> Hi Anil
> >>                During the scan, there is no need to fetch any index data
> >> to client side. So there is no need to create any scanner on the index
> >> table at the client side. This happens at the server side.
> >
> >
> >>
> >> For the Scan on the main table with a condition on timestamp and customer
> >> id, a scanner is to be created with Filters, just as when there is no
> >> secondary index. So this scan from the client will go through all the
> >> regions in the main table.
> >
> >
> > Anil: Do you mean that if the table is spread across 50 region servers in
> > a 60-node cluster, then we need to send a scan request to all 50 RS?
> > Doesn't that sound expensive? IMHO you were not doing this in your
> > solution. Your solution looked cleaner than this, since you knew exactly
> > which node to query when using the secondary index, due to co-location
> > (via the static begin part of the secondary table rowkey) of the regions
> > of the primary table and the secondary index table. My problem is a little
> > more complicated because of one constraint: I cannot have a "static begin
> > part" in the rowkey of my secondary table.
> >
> >> When it scans one particular region, say [x,y), on the main table, using
> >> the CP we can get the index table region object corresponding to this
> >> main table region from the RS. There is no issue in creating the static
> >> part of the rowkey. You know 'x' is the region start key. Then at the
> >> server side we will create a scanner on the index region directly, and
> >> here we can specify the startkey: 'x' + <timestamp value> + <customer id>.
> >> Using the results from the index scan we will reseek on the main region
> >> to the exact rows where the data we are interested in is available. So
> >> there won't be a full region data scan happening.
> >
> >> In the cases where only the timestamp is there but no customer id, it
> >> will be simple again. Create a scanner on the main table with only one
> >> filter. At the CP side the scanner on the index region will get created
> >> with startkey as 'x' + <timestamp value>. When you create the scan
> >> object and set startRow on it, it need not be the full rowkey. It can
> >> also be a part of the rowkey, like a prefix.
> >>
> >> Hope u got it now :)
> > Anil: I hope we are now on the same page. Thanks a lot for your valuable
> > time to discuss this stuff.
> >
> >>
> >> -Anoop-
> >> ________________________________________
> >> From: anil gupta [[email protected]]
> >> Sent: Friday, December 14, 2012 11:31 PM
> >> To: [email protected]
> >> Subject: Re: HBase - Secondary Index
> >>
> >> On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <[email protected]> wrote:
> >>
> >>> Hi Anil,
> >>>
> >>>> 1. In your presentation you mentioned that region of Primary Table and
> >>>> Region of Secondary Table are always located on the same region server.
> >>>> How do you achieve it? By using the Primary table rowkey as prefix of
> >>>> Rowkey of Secondary Table? Will your implementation work if the rowkey
> >>>> of primary table cannot be used as prefix in rowkey of Secondary table
> >>>> (I have this limitation in my use case)?
> >>> First of all, there will be the same number of regions in both the
> >>> primary and index tables. All the start/stop keys of the regions will
> >>> also be the same.
> >>> Suppose there are 2 regions on the main table, say for keys 0-10 and
> >>> 10-20. Then we will create 2 regions in the index table also, with the
> >>> same key ranges.
> >>> At the master balancing level it is easy to collocate these regions by
> >>> looking at the start and end keys.
> >>> The selection of the rowkey that will be used in the index table is the
> >>> key here.
> >>> What we will do is prefix all the rowkeys in the index table with the
> >>> start key of the region.
> >>> When an entry is added to the main table with rowkey 5, it will go to
> >>> the 1st region (0-10).
> >>> Now there will be an index region with range 0-10. We will select this
> >>> region to store this index data.
> >>> The row getting added into the index region for this entry will have the
> >>> rowkey 0_x_5.
> >>> I am using '_' as a separator here just to show this. Actually we
> >>> won't be having any separator.
> >>> So the rowkeys (in the index region) will always have a static begin
> >>> part. At scan time we also know this part, so creating the startrow and
> >>> endrow for the scan will be possible. Note that we will store the actual
> >>> table row key as the last part of the index rowkey itself, not as a
> >>> value. This is the better option in our case, since the index usage
> >>> during the scan is also handled at the server side. There is no index
> >>> data fetch to the client side.
> >>
> >> Anil: My primary table rowkey is customerId+event_id, and my secondary
> >> table rowkey is timestamp+customerId. In your implementation it seems
> >> that, to use the secondary index, the application needs to know the
> >> "start_key" of the region (static begin part) it wants to query. Right?
> >> Do you separately manage the logic of determining the region
> >> "start_key" (static begin part) for a scan?
> >> Also, it is possible that the customerId is not provided while using the
> >> secondary index. So I won't have the customer id for all queries. Hence
> >> I cannot use customer_id as a prefix in the rowkey of my secondary table.
> >>
> >>>
> >>> I feel your use case perfectly fits with our model
> >> Anil: Somehow I am unable to fit your implementation into my use case,
> >> due to the constraint of the static begin part of the rowkey in the
> >> secondary table. There seems to be a disconnect. Can you tell me how my
> >> use case fits into your implementation?
> >>
> >>>
> >>>> 2. Are you using an Endpoint or Observer for building the secondary
> >>>> index table?
> >>> Observer
> >>>
> >>>> 3. "Custom balancer do collocation". Is it a custom load balancer of
> >>>> HBase Master or something else?
> >>> It is a balancer implementation which will be plugged into the Master.
> >>>
> >>>> 4. Your region split looks interesting. I don't have much info about
> >>>> it. Can you point to some docs on IndexHalfStoreFileReader?
> >>> Sorry, I am not able to publish any design doc or code, as the company
> >>> has not yet decided to open source the solution.
> >>> If you come across any particular query, please feel free to ask me :)
> >>> You can look at the HalfStoreFileReader class first.
> >>>
> >>> -Anoop-
> >>> ________________________________________
> >>> From: anil gupta [[email protected]]
> >>> Sent: Friday, December 14, 2012 2:11 PM
> >>> To: [email protected]
> >>> Subject: Re: HBase - Secondary Index
> >>>
> >>> Hi Anoop,
> >>>
> >>> Nice presentation, and it seems like a smart implementation. Since the
> >>> presentation only covered bullet points, I have a couple of questions on
> >>> your implementation. :)
> >>>
> >>> Here is a recap of my implementation and our previous discussion on
> >>> Secondary index:
> >>>
> >>> Here is the link to previous email thread:
> >>> http://search-hadoop.com/m/1zWPMaaRtr .
> >>>
> >>> The secondary index is stored in table "B" as rowkey B -->
> >>> family:<rowkey A>. "<rowkey A>" is the column qualifier. Every row in B
> >>> will have only one column "k", and the value of that column is the
> >>> rowkey of A.
> >>>
> >>> Suppose I am storing customer events in table A. I have two requirements
> >>> for data query:
> >>> 1. Query customer events on the basis of customer_Id and event_ID.
> >>> 2. Query customer events on the basis of event_timestamp and customer_ID.
> >>>
> >>> 70% of querying is done by query#1, so I will create
> >>> <customer_Id><event_ID> as the row key of Table A.
> >>> Now, in order to support fast results for query#2, I need to create a
> >>> secondary index on A. I store that secondary index in B; the rowkey of B
> >>> is <event_timestamp><customer_ID>. Every row stores the corresponding
> >>> rowkey of A.
> >>>
> >>> HBase Querying approach:
> >>> 1. Scan the secondary table by using prefix filter and startRow to get
> >>> the list of Rowkeys of Primary table.
> >>> 2. Do a batch get on primary table by using HTable.get(List<Get>) method
> >>> using the list of Rowkeys obtained in step1.
> >>>
> >>> The only issue is that in my solution I have at least two RPC calls, one
> >>> each in step1 and step2 above. I want to reduce the number of RPCs to 1
> >>> if possible.
> >>>
> >>>
> >>> ******Questions on your implementation:*********
> >>>
> >>> 1. In your presentation you mentioned that region of Primary Table and
> >>> Region of Secondary Table are always located on the same region server.
> >>> How do you achieve it? By using the Primary table rowkey as prefix of
> >>> Rowkey of Secondary Table? Will your implementation work if the rowkey of
> >>> primary table cannot be used as prefix in rowkey of Secondary table (I
> >>> have this limitation in my use case)?
> >>> 2. Are you using an Endpoint or Observer for building the secondary index
> >>> table?
> >>> 3. "Custom balancer do collocation". Is it a custom load balancer of
> >>> HBase Master or something else?
> >>> 4. Your region split looks interesting. I don't have much info about it.
> >>> Can you point to some docs on IndexHalfStoreFileReader?
> >>>
> >>> Thanks,
> >>> Anil Gupta
> >>>
> >>>
> >>>
> >>> On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <[email protected]> wrote:
> >>>
> >>>> Hi All
> >>>>
> >>>> Last week I got a chance to present the secondary indexing solution
> >>>> that we have built at Huawei at the China Hadoop Conference. You can
> >>>> see the presentation at
> >>>> http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
> >>>>
> >>>>
> >>>>
> >>>> I would like to hear what others think on this. :)
> >>>>
> >>>>
> >>>>
> >>>> -Anoop-
> >>>
> >>>
> >>>
> >>> --
> >>> Thanks & Regards,
> >>> Anil Gupta
> >>
> >>
> >>
> >> --
> >> Thanks & Regards,
> >> Anil Gupta
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>



-- 
Thanks & Regards,
Anil Gupta
