Re: HBase - Secondary Index

anil gupta Wed, 19 Dec 2012 00:25:34 -0800

Hi Anoop,

For my use case, scans will never have a primary table rowkey range whenever
I query using the secondary index. IMHO, if I send the request to all the
RSs of the table, I am concerned about too many unnecessary RPCs across
the cluster for every single query based on the secondary index. Essentially,
every time it will look like a full table scan, but under the hood the CPs
will do the magic using the secondary table. Your solution works well when
a rowkey range on the primary table can be specified.
Unfortunately, I don't have the luxury of using a "primary table rowkey
range" for now. It seems like I will have to stick to my current solution.
However, it's always good to have a healthy discussion on different
approaches. :)
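
To make the fan-out concern concrete, here is a minimal sketch of how a scan maps to regions. It is a toy model in plain Python, not HBase client code: the function regions_for_scan and the 50-region layout are made up for illustration, and the only behaviour assumed from HBase is that a region covers the keys from its start key (inclusive) up to the next region's start key (exclusive).

```python
from bisect import bisect_left, bisect_right

def regions_for_scan(region_start_keys, start_row=b"", stop_row=b""):
    """Return the indexes of the regions a scan has to contact.

    region_start_keys is the sorted list of region start keys (the first
    region starts at b""). An empty start_row or stop_row means unbounded.
    """
    first = bisect_right(region_start_keys, start_row) - 1 if start_row else 0
    if stop_row:
        # stop_row is exclusive, so a region starting exactly there is skipped.
        last = bisect_left(region_start_keys, stop_row) - 1
    else:
        last = len(region_start_keys) - 1
    return list(range(first, last + 1))

# Hypothetical table with 50 regions, split on zero-padded two-digit prefixes.
region_starts = [b""] + [b"%02d" % i for i in range(1, 50)]

# No rowkey range: every region is contacted.
print(len(regions_for_scan(region_starts)))
# Narrow rowkey range: only the regions overlapping the range are contacted.
print(regions_for_scan(region_starts, b"1234", b"1299"))
```

With no start or stop row the scan touches every region, which is the case I am worried about; with a narrow range it touches only the overlapping ones.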



PS: My current secondary index implementation is not yet in production. I
did some preliminary testing and it seems to work fine, but I think I need
to do some more testing.

Thanks,
Anil Gupta


On Tue, Dec 18, 2012 at 1:27 AM, Anoop Sam John <[email protected]> wrote:

> Anil:
>     If the scan from the client side does not specify any rowkey range but
> only the filter condition, then yes, it will go to all the primary table
> regions. In each region, it will first scan the index table region and then
> seek to the exact rows in the main table region. If that region has no data
> at all matching the filter condition, the entire region will simply get
> skipped.
>
> In a normal scan, too, the request goes only to specific regions if we can
> specify a rowkey range. It is the same in our secondary index case.
>
> Put simply, for the scan there is no change at all in what happens at the
> client side: it uses the meta data to find out which regions and RSs to
> contact, contacts those regions one by one, and gets the data from each
> region. The only difference is what happens at the server side. Without an
> index, all the data from all the HFiles is fetched at the server side and
> the filter is applied to every row. Only the rows that pass the filter are
> returned to the client. With an index, when the scan happens at the server
> side, the index data is scanned first from the index region. This region is
> on the same RS, so there are no extra RPCs. The data to be scanned from the
> index table is limited, because we can create a start key and a stop key
> for it. Based on the result of the index scan, we know the rowkeys where the
> data we are interested in resides. So a reseek happens to those rows and
> only those rows are read. The time spent at the server side scanning a
> region is therefore greatly reduced.
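
As I understand the description above, the per-region work can be sketched roughly as follows. This is a toy model in plain Python, not the actual coprocessor code (which has not been published): scan_region_with_index, the tuple layout of the index entries, and the sample data are all assumptions made for illustration.

```python
from bisect import bisect_left

def scan_region_with_index(index_region, main_region, region_start, value):
    """Toy model of one region's server-side scan using a co-located index.

    index_region: sorted list of (region_start, indexed_value, main_rowkey)
    main_region:  dict mapping main_rowkey -> row data
    """
    prefix = (region_start, value)
    # Start key for the index scan: the first entry at or after the prefix.
    i = bisect_left(index_region, prefix)
    matching_rowkeys = []
    # Leaving the prefix plays the role of the stop key.
    while i < len(index_region) and index_region[i][:2] == prefix:
        matching_rowkeys.append(index_region[i][2])
        i += 1
    # "Reseek": read only the matching rows instead of filtering every row.
    return [(k, main_region[k]) for k in sorted(matching_rowkeys)]

# Hypothetical region starting at b"00" with three rows and their index entries.
index_region = sorted([
    (b"00", b"t1", b"05"),
    (b"00", b"t2", b"03"),
    (b"00", b"t1", b"07"),
])
main_region = {b"03": "row-03", b"05": "row-05", b"07": "row-07"}

print(scan_region_with_index(index_region, main_region, b"00", b"t1"))
print(scan_region_with_index(index_region, main_region, b"00", b"t9"))
```

A region with no matching index entries returns an empty result without touching the main rows, which matches the "entire region will get skipped" case.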
>
> Yes but still there will be calls from the client side to the RS for each
> region...
>
> I think it should be clear now. The ppt that I shared says the same
> thing; it shows what is happening at the server side.
>
> -Anoop-
>
> ________________________________________
> From: anil gupta [[email protected]]
> Sent: Tuesday, December 18, 2012 1:58 PM
> To: [email protected]
> Subject: Re: HBase - Secondary Index
>
> Hi Anoop,
>
> Please find my reply inline.
>
> Thanks,
> Anil Gupta
>
> On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[email protected]>
> wrote:
>
> > Hi Anil
> >                 During the scan, there is no need to fetch any index data
> > to client side. So there is no need to create any scanner on the index
> > table at the client side. This happens at the server side.
> >
>
>
> >
> > For the Scan on the main table with a condition on timestamp and customer
> > id, a scanner is to be created with Filters, just as normal when there is
> > no secondary index. So this scan from the client will go through all the
> > regions in the main table.
>
>
> Anil: Do you mean that if the table is spread across 50 region servers in
> a 60-node cluster, then we need to send a scan request to all 50 RSs?
> Doesn't that sound expensive? IMHO, you were not doing this in your
> solution. Your solution looked cleaner than this, since you knew exactly
> which node you needed to go to when querying with the secondary index, due
> to the co-location (via the static begin part of the secondary table rowkey)
> of the primary table region and the secondary index table region. My problem
> is a little more complicated because of one constraint: I cannot have a
> "static begin part" in the rowkey of my secondary table.
>
> > When it scans one particular region, say [x,y), on the main table, using
> > the CP we can get the index table region object corresponding to this main
> > table region from the RS. There is no issue in creating the static part of
> > the rowkey. You know 'x' is the region start key. Then the server side
> > will create a scanner on the index region directly, and here we can specify
> > the startkey: 'x' + <timestamp value> + <customer id>. Using the results
> > from the index scan, we will reseek on the main region to the exact
> > rows where the data we are interested in is available. So there won't
> > be a full region data scan happening.
> >
>
> > In cases where only the timestamp is given but no customer id, it
> > will be simple again. Create a scanner on the main table with only one
> > filter. At the CP side, the scanner on the index region will get created
> > with startkey 'x' + <timestamp value>. When you create the scan
> > object and set startRow on it, it need not be the full rowkey. It can be
> > part of the rowkey also, like a prefix.
> >
> > Hope u got it now :)
> >
> Anil: I hope we are on the same page now. Thanks a lot for your valuable
> time in discussing this.
>
> >
> > -Anoop-
> > ________________________________________
> > From: anil gupta [[email protected]]
> > Sent: Friday, December 14, 2012 11:31 PM
> > To: [email protected]
> > Subject: Re: HBase - Secondary Index
> >
> > On Fri, Dec 14, 2012 at 12:54 AM, Anoop Sam John <[email protected]>
> > wrote:
> >
> > > Hi Anil,
> > >
> > > >1. In your presentation you mentioned that region of Primary Table and
> > > >Region of Secondary Table are always located on the same region server.
> > > >How do you achieve it? By using the Primary table rowkey as prefix of
> > > >Rowkey of Secondary Table? Will your implementation work if the rowkey of
> > > >primary table cannot be used as prefix in rowkey of Secondary table (i
> > > >have this limitation in my use case)?
> > > First of all, there will be the same number of regions in both the primary
> > > and index tables. All the start/stop keys of the regions will also be the
> > > same. Suppose there are 2 regions on the main table, say for keys 0-10 and
> > > 10-20. Then we will create 2 regions in the index table with the same key
> > > ranges. At the master balancing level it is easy to collocate these
> > > regions by looking at the start and end keys.
> > > The selection of the rowkey that will be used in the index table is
> > > the key here.
> > > What we do is prefix all the rowkeys in the index table with the start
> > > key of the region.
> > > When an entry is added to the main table with rowkey 5, it will go to
> > > the 1st region (0-10).
> > > Now there will be an index region with range 0-10. We will select this
> > > region to store this index data.
> > > The row getting added into the index region for this entry will have the
> > > rowkey 0_x_5.
> > > I am using '_' as a separator here just to show this. Actually we
> > > won't have any separator.
> > > So the rowkeys (in the index region) will always have a static begin part.
> > > At scan time we also know this part, so creating the startrow and endrow
> > > for the scan will be possible. Note that we store the actual table
> > > rowkey as the last part of the index rowkey itself, not as a value.
> > > This is the better option in our case, since the index is also used for
> > > the scan at the server side. There is no index data fetch to the client
> > > side.
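
A minimal sketch of the index rowkey layout described above, as I read it. This is plain Python for illustration only: index_rowkey and index_scan_start_row are hypothetical helpers, the '_' separator is kept purely for readability (the real scheme is said to have none), and the keys are zero-padded so that byte order matches numeric order.

```python
from bisect import bisect_right

SEP = b"_"  # illustration only; the scheme described above uses no separator

def index_rowkey(region_start_keys, indexed_value, main_rowkey):
    """Build <region start key>_<indexed value>_<main table rowkey>.

    Assumes region_start_keys is sorted and main_rowkey is not smaller
    than the first start key.
    """
    # The region holding main_rowkey is the last one whose start key
    # is less than or equal to it.
    pos = bisect_right(region_start_keys, main_rowkey) - 1
    return SEP.join([region_start_keys[pos], indexed_value, main_rowkey])

def index_scan_start_row(region_start, indexed_value):
    """Start row for the index scan: the static begin part plus the value."""
    return SEP.join([region_start, indexed_value])

# Two regions, 00-10 and 10-20, as in the example above.
region_starts = [b"00", b"10"]

key = index_rowkey(region_starts, b"x", b"05")
print(key)
print(key.startswith(index_scan_start_row(b"00", b"x")))
```

Because the main table rowkey is the last part of the index rowkey, the index scan alone yields the rows to seek to, with no value lookup needed.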
> > >
> >
> > Anil: My primary table rowkey is customerId+event_id, and my secondary
> > table rowkey is timestamp+customerid. In your implementation, it seems
> > that to use the secondary index the application needs to know the
> > "start_key" of the region (static begin part) it wants to query. Right? Do
> > you separately manage the logic of determining the region
> > "start_key" (static begin part) for a scan?
> > Also, it's possible that the customerId is not provided when using the
> > secondary index, so I won't have the customer id for all queries. Hence I
> > cannot use customer_id as a prefix in the rowkey of my secondary table.
> >
> > >
> > > I feel your use case perfectly fit with our model
> > >
> > Anil: Somehow I am unable to fit your implementation to my use case due
> > to the constraint of the static begin part of the rowkey in the secondary
> > table. There seems to be a disconnect. Can you tell me how my use case fits
> > into your implementation?
> >
> > >
> > > >2. Are you using an Endpoint or Observer for building the secondary
> > > >index table?
> > > Observer
> > >
> > > >3. "Custom balancer do collocation". Is it a custom load balancer of
> > > >HBase Master or something else?
> > > It is a balancer implementation which will be plugged into the Master.
> > >
> > > >4. Your region split looks interesting. I dont have much info about it.
> > > >Can you point to some docs on IndexHalfStoreFileReader?
> > > Sorry, I am not able to publish any design doc or code, as the company
> > > has not yet decided to open source the solution.
> > > If you come across any particular query, please feel free to ask me :)
> > > You can look at the HalfStoreFileReader class first.
> > >
> > > -Anoop-
> > > ________________________________________
> > > From: anil gupta [[email protected]]
> > > Sent: Friday, December 14, 2012 2:11 PM
> > > To: [email protected]
> > > Subject: Re: HBase - Secondary Index
> > >
> > > Hi Anoop,
> > >
> > > Nice presentation, and it seems like a smart implementation. Since the
> > > presentation only covered bullet points, I have a couple of questions on
> > > your implementation. :)
> > >
> > > Here is a recap of my implementation and our previous discussion on
> > > secondary indexes:
> > >
> > > Here is the link to previous email thread:
> > > http://search-hadoop.com/m/1zWPMaaRtr .
> > >
> > > The secondary index is stored in table "B" as rowkey B -->
> > > family:<rowkey A>. "<rowkey A>" is the column qualifier. Every row in B
> > > will have only one column, "k", and the value of that column is the rowkey
> > > of A.
> > >
> > > Suppose I am storing customer events in table A. I have two requirements
> > > for data queries:
> > > 1. Query customer events on the basis of customer_Id and event_ID.
> > > 2. Query customer events on the basis of event_timestamp and customer_ID.
> > >
> > > 70% of querying is done by query#1, so I will use
> > > <customer_Id><event_ID> as the rowkey of table A.
> > > Now, in order to support fast results for query#2, I need to create a
> > > secondary index on A. I store that secondary index in B; the rowkey of B is
> > > <event_timestamp><customer_ID>. Every row stores the corresponding rowkey
> > > of A.
> > >
> > > HBase Querying approach:
> > > 1. Scan the secondary table by using a prefix filter and startRow to get
> > > the list of rowkeys of the primary table.
> > > 2. Do a batch get on the primary table by using the HTable.get(List<Get>)
> > > method with the list of rowkeys obtained in step 1.
> > >
> > > The only issue is that in my solution I have at least two RPC calls, one
> > > each in step 1 and step 2 above. I want to reduce the number of RPCs to 1
> > > if possible.
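
The two steps above can be sketched as follows. This is a toy model in plain Python rather than HBase client code: the tables are ordinary in-memory structures, the function names scan_index_by_prefix and batch_get are made up, and the "|" separator and sample rows are assumptions for illustration. Each function stands in for one client round trip.

```python
from bisect import bisect_left

def scan_index_by_prefix(table_b, prefix):
    """Step 1: scan B from `prefix` and stop at the first rowkey without it.

    table_b is a sorted list of (rowkey_of_B, rowkey_of_A) pairs.
    """
    i = bisect_left(table_b, (prefix,))
    rowkeys_a = []
    while i < len(table_b) and table_b[i][0].startswith(prefix):
        rowkeys_a.append(table_b[i][1])
        i += 1
    return rowkeys_a

def batch_get(table_a, rowkeys):
    """Step 2: one batched lookup on A for all rowkeys found in step 1."""
    return [table_a[k] for k in rowkeys]

# Table A: <customer_Id><event_ID> -> event
table_a = {"cust1|ev1": "login", "cust1|ev2": "purchase", "cust2|ev9": "logout"}
# Table B: <event_timestamp><customer_ID> -> rowkey of A
table_b = sorted([
    ("20121214|cust1", "cust1|ev1"),
    ("20121214|cust2", "cust2|ev9"),
    ("20121215|cust1", "cust1|ev2"),
])

# Query by timestamp only, then by timestamp and customer.
print(batch_get(table_a, scan_index_by_prefix(table_b, "20121214")))
print(batch_get(table_a, scan_index_by_prefix(table_b, "20121214|cust1")))
```

The same prefix scan serves both query shapes, with or without the customer id, which is why the timestamp has to come first in the rowkey of B.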
> > >
> > >
> > > ******Questions on your implementation:*********
> > >
> > > 1. In your presentation you mentioned that the region of the primary table
> > > and the region of the secondary table are always located on the same region
> > > server. How do you achieve it? By using the primary table rowkey as a prefix
> > > of the rowkey of the secondary table? Will your implementation work if the
> > > rowkey of the primary table cannot be used as a prefix in the rowkey of the
> > > secondary table (I have this limitation in my use case)?
> > > 2. Are you using an Endpoint or an Observer for building the secondary
> > > index table?
> > > 3. "Custom balancer do collocation". Is it a custom load balancer of the
> > > HBase Master or something else?
> > > 4. Your region split looks interesting. I don't have much info about it.
> > > Can you point to some docs on IndexHalfStoreFileReader?
> > >
> > > Thanks,
> > > Anil Gupta
> > >
> > >
> > >
> > > On Tue, Dec 4, 2012 at 12:10 AM, Anoop Sam John <[email protected]>
> > > wrote:
> > >
> > > > Hi All
> > > >
> > > >             Last week I got a chance to present the secondary indexing
> > > > solution that we have built at Huawei at the China Hadoop Conference.
> > > > You can see the presentation at
> > > > http://hbtc2012.hadooper.cn/subject/track4Anoop%20Sam%20John2.pdf
> > > >
> > > >
> > > >
> > > > I would like to hear what others think on this. :)
> > > >
> > > >
> > > >
> > > > -Anoop-
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Regards,
> > > Anil Gupta
> > >
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>



-- 
Thanks & Regards,
Anil Gupta
