Re: HBase - Secondary Index

anil gupta Tue, 08 Jan 2013 17:29:13 -0800

+1 on Lars's comment.

Either the client gets the rowkey from the secondary table and then fetches
the real data from the primary table, ** OR ** it sends the request to every
RS (or region) hosting a region of the primary table.


Anoop is using the latter mechanism. Both mechanisms have their pros and
cons. IMO, there is no outright winner.
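
To make the tradeoff concrete, here is a minimal in-memory sketch of the two
mechanisms. It is plain Python, not the HBase client API: the Region class,
build_global_index, query_global and query_local are all hypothetical names
invented for illustration, and the "regions" are just dictionaries. It only
shows how many regions each mechanism has to contact for the same query.

```python
from collections import defaultdict


class Region:
    """Hypothetical stand-in for one region of the primary table."""

    def __init__(self, rows):
        self.rows = dict(rows)  # rowkey -> indexed column value
        # Region-local index: column value -> rowkeys within this region only.
        self.local_index = defaultdict(list)
        for rowkey, value in self.rows.items():
            self.local_index[value].append(rowkey)


def build_global_index(regions):
    """Separate index 'table': column value -> (region id, rowkey) pairs."""
    index = defaultdict(list)
    for region_id, region in enumerate(regions):
        for rowkey, value in region.rows.items():
            index[value].append((region_id, rowkey))
    return index


def query_global(global_index, regions, value):
    """Mechanism 1: look up rowkeys in the index table, then fetch each row.

    Only regions that actually hold a match are contacted (the index
    lookup itself is not counted here).
    """
    hits = global_index.get(value, [])
    rows = {rowkey: regions[rid].rows[rowkey] for rid, rowkey in hits}
    regions_contacted = len({rid for rid, _ in hits})
    return sorted(rows), regions_contacted


def query_local(regions, value):
    """Mechanism 2: send the request to every region; each one answers
    from its own local index, so no separate index lookup is needed."""
    rowkeys = []
    for region in regions:
        rowkeys.extend(region.local_index.get(value, []))
    return sorted(rowkeys), len(regions)


regions = [
    Region({"a1": "volvo", "a2": "saab"}),
    Region({"b1": "saab"}),
    Region({"c1": "audi"}),
]
global_index = build_global_index(regions)
print(query_global(global_index, regions, "volvo"))  # (['a1'], 1)
print(query_local(regions, "volvo"))                 # (['a1'], 3)
```

Both calls return the same rowkeys; what differs is the fan-out. The first
contacts one region after an extra index lookup, the second contacts all
three. Which cost dominates depends on the query, as Lars says below.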

~Anil Gupta

On Tue, Jan 8, 2013 at 4:30 PM, lars hofhansl <[email protected]> wrote:

> Different use cases.
>
>
> For global point queries you want exactly what you said below.
> For range scans across many rows you want Anoop's design. As usual, it
> depends.
>
>
> The tradeoff is bringing a lot of unnecessary data to the client vs having
> to contact each region (or at least each region server).
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Michael Segel <[email protected]>
> To: [email protected]
> Sent: Tuesday, January 8, 2013 6:33 AM
> Subject: Re: HBase - Secondary Index
>
> So if you're using an inverted table / index why on earth are you doing it
> at the region level?
>
> I tried to explain this to others over 6 months ago, and it's not really
> a good idea.
>
> You're overcomplicating this, and you will end up creating performance
> bottlenecks when your secondary index is completely orthogonal to your row
> key.
>
> To give you an example...
>
> Suppose you're CCCIS and you have a large database of auto insurance
> claims that you've acquired over the years from your Pathways product.
>
> Your primary key would be a combination of the Insurance Company's ID and
> their internal claim ID for the individual claim.
> Your row would be all of the data associated with that claim.
>
> So now let's say you want to find the average cost to repair a front end
> collision of an S80 Volvo.
> The make and model of the car would be orthogonal to the initial key. This
> means that the result set containing insurance records for front end
> collisions of S80 Volvos would most likely be evenly distributed across the
> cluster's regions.
>
> If you used a series of inverted tables, you would be able to use a series
> of get()s to get the result set from each index and then find their
> intersections. (Note that you could also put them in sort order so that the
> intersections would be fairly straightforward to find.)
>
> Doing this at the region level isn't so simple.
>
> So I have to ask again: why go through all this and overcomplicate things?
>
> Just saying...
>
> On Jan 7, 2013, at 7:49 AM, Anoop Sam John <[email protected]> wrote:
>
> > Hi,
> > It is an inverted index based on column value(s).
> > It will be region-wise indexing. It can work whether or not someone knows
> > the rowkey range.
> >
> > -Anoop-
> > ________________________________________
> > From: Mohit Anchlia [[email protected]]
> > Sent: Monday, January 07, 2013 9:47 AM
> > To: [email protected]
> > Subject: Re: HBase - Secondary Index
> >
> > Hi Anoop,
> >
> > Am I correct in understanding that this indexing mechanism is only
> > applicable when you know the row key? It's not an inverted index truly
> > based on the column value.
> >
> > Mohit
> > On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <[email protected]> wrote:
> >
> >> Hi Adrien,
> >> We are ensuring the consistency between the main table and the index
> >> table, and the rollback mentioned below, using the CP (coprocessor)
> >> hooks. The current hooks were not enough for those, though. I am in the
> >> process of trying to contribute those new hooks, core changes, etc. now.
> >> Once all of that is done I will be able to explain in detail.
> >>
> >> -Anoop-
> >> ________________________________________
> >> From: Adrien Mogenet [[email protected]]
> >> Sent: Monday, January 07, 2013 2:00 AM
> >> To: [email protected]
> >> Subject: Re: HBase - Secondary Index
> >>
> >> Nice topic, perhaps one of the most important for 2013 :-)
> >> I still don't get how you're ensuring consistency between the index table
> >> and the main table without an external component (such as
> >> bookkeeper/zookeeper).
> >> What's the exact write path in your situation when inserting data?
> >> (WAL/RegionObserver, pre/post put/WALEdit...)
> >>
> >> The underlying question is how you're ensuring that the WALEdits in the
> >> index and main tables are perfectly in sync, and how you're able to roll
> >> back in case of an issue in both WALs?
> >>
> >>
> >> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <[email protected]>
> >> wrote:
> >>
> >>>> Yes as you say when the no of rows to be returned is becoming more and
> >>>> more the latency will be becoming more. seeks within an HFile block is
> >>>> some what expensive op now. (Not much but still) The new encoding
> >>>> prefix trie will be a huge bonus here. There the seeks will be flying..
> >>>> [Ted also presented this in the Hadoop China] Thanks to Matt... :) I
> >>>> am trying to measure the scan performance with this new encoding.
> >>>> Trying to back port a simple patch for 94 version just for testing...
> >>>> Yes when the no of results to be returned is more and more any index
> >>>> will become less performing as per my study :)
> >>>
> >>> yes, you are right, I guess it's just a drawback of any index approach.
> >>> Thanks for the explanation.
> >>>
> >>> Shengjie
> >>>
> >>> On 28 December 2012 04:14, Anoop Sam John <[email protected]> wrote:
> >>>
> >>>>> Do you have link to that presentation?
> >>>>
> >>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
> >>>>
> >>>> -Anoop-
> >>>>
> >>>> ________________________________________
> >>>> From: Mohit Anchlia [[email protected]]
> >>>> Sent: Friday, December 28, 2012 9:12 AM
> >>>> To: [email protected]
> >>>> Subject: Re: HBase - Secondary Index
> >>>>
> >>>> On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Yes as you say when the no of rows to be returned is becoming more
> >>>>> and more the latency will be becoming more. seeks within an HFile
> >>>>> block is some what expensive op now. (Not much but still) The new
> >>>>> encoding prefix trie will be a huge bonus here. There the seeks will
> >>>>> be flying.. [Ted also presented this in the Hadoop China] Thanks to
> >>>>> Matt... :) I am trying to measure the scan performance with this new
> >>>>> encoding. Trying to back port a simple patch for 94 version just for
> >>>>> testing... Yes when the no of results to be returned is more and more
> >>>>> any index will become less performing as per my study :)
> >>>>>
> >>>>> Do you have link to that presentation?
> >>>>
> >>>>
> >>>>>> btw, quick question - in your presentation, is the scale there
> >>>>>> seconds or milliseconds? :)
> >>>>>
> >>>>> It is seconds. Don't consider the exact values; what matters is the %
> >>>>> increase in latency :) Those were not high end machines.
> >>>>>
> >>>>> -Anoop-
> >>>>> ________________________________________
> >>>>> From: Shengjie Min [[email protected]]
> >>>>> Sent: Thursday, December 27, 2012 9:59 PM
> >>>>> To: [email protected]
> >>>>> Subject: Re: HBase - Secondary Index
> >>>>>
> >>>>>> Didn't follow you completely here. There won't be any get() happening.
> >>>>>> Since we get the exact rowkey in a region from the index table, we can
> >>>>>> seek to the exact position and return that row.
> >>>>>
> >>>>> Sorry, I misused "get()" here; I meant seeking. Yes, if only a small
> >>>>> number of rows is returned, this works perfectly. As you said, you get
> >>>>> the exact rowkey positions per region and simply seek to them. I was
> >>>>> trying to work out the case where the number of result rows increases
> >>>>> massively. Like in Anil's case: he wants to do a scan query against
> >>>>> the secondary index (timestamp), "select all rows from timestamp1 to
> >>>>> timestamp2", with no customerId provided. During that time period he
> >>>>> might have a big chunk of rows from different customerIds. The index
> >>>>> table returns a lot of rowkey positions for different customerIds (I
> >>>>> believe they are scattered across different regions), so you end up
> >>>>> seeking to all those positions in different regions to return all the
> >>>>> rows needed. According to your presentation, page 14 - Performance
> >>>>> Test Results (Scan): without the index, the time increases linearly as
> >>>>> the number of result rows increases; with the index, on the other
> >>>>> hand, the time spent climbs much more quickly.
> >>>>>
> >>>>> btw, quick question - in your presentation, is the scale there seconds
> >>>>> or milliseconds? :)
> >>>>>
> >>>>> - Shengjie
> >>>>>
> >>>>>
> >>>>> On 27 December 2012 15:54, Anoop John <[email protected]> wrote:
> >>>>>
> >>>>>>> how the massive number of get() is going to
> >>>>>>> perform against the main table
> >>>>>>
> >>>>>> Didn't follow you completely here. There won't be any get() happening.
> >>>>>> Since we get the exact rowkey in a region from the index table, we can
> >>>>>> seek to the exact position and return that row.
> >>>>>>
> >>>>>> -Anoop-
> >>>>>>
> >>>>>> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> how the massive number of get() is going to
> >>>>>>> perform against the main table
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> All the best,
> >>>>> Shengjie Min
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> All the best,
> >>> Shengjie Min
> >>>
> >>
> >>
> >>
> >> --
> >> Adrien Mogenet
> >> 06.59.16.64.22
> >> http://www.mogenet.me
> >>




-- 
Thanks & Regards,
Anil Gupta
