Re: HBase - Secondary Index

Mohit Anchlia Sun, 06 Jan 2013 12:36:56 -0800

Does anyone have any links to, or information on, the new prefix encoding
feature in HBase that's being referred to in this mail?


On Sun, Jan 6, 2013 at 12:30 PM, Adrien Mogenet <[email protected]> wrote:

> Nice topic, perhaps one of the most important for 2013 :-)
> I still don't get how you're ensuring consistency between the index table and
> the main table without an external component (such as bookkeeper/zookeeper).
> What's the exact write path in your situation when inserting data?
> (WAL/RegionObserver, pre/post put/WALEdit...)
>
> The underlying question is how you're ensuring that the WALEdits in the index
> and main tables are perfectly in sync, and how you're able to roll back in
> case of an issue in both WALs?
>
>
> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <[email protected]>
> wrote:
>
> > > Yes, as you say, when the no of rows to be returned is becoming more and
> > > more the latency will be becoming more. Seeks within an HFile block are
> > > a somewhat expensive op now. (Not much but still)  The new encoding
> > > prefix trie will be a huge bonus here. There the seeks will be flying..
> > > [Ted also presented this in the Hadoop China]  Thanks to Matt... :)  I am
> > > trying to measure the scan performance with this new encoding. Trying to
> > > back port a simple patch for 94 version just for testing...   Yes, when
> > > the no of results to be returned is more and more any index will become
> > > less performing as per my study  :)
> >
> > yes, you are right, I guess it's just a drawback of any index approach.
> > Thanks for the explanation.
> >
> > Shengjie
> >
> > On 28 December 2012 04:14, Anoop Sam John <[email protected]> wrote:
> >
> > > > Do you have link to that presentation?
> > >
> > > http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
> > >
> > > -Anoop-
> > >
> > > ________________________________________
> > > From: Mohit Anchlia [[email protected]]
> > > Sent: Friday, December 28, 2012 9:12 AM
> > > To: [email protected]
> > > Subject: Re: HBase - Secondary Index
> > >
> > > On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <[email protected]>
> > > wrote:
> > >
> > > > > Yes, as you say, when the no of rows to be returned is becoming more
> > > > > and more the latency will be becoming more. Seeks within an HFile
> > > > > block are a somewhat expensive op now. (Not much but still)  The new
> > > > > encoding prefix trie will be a huge bonus here. There the seeks will
> > > > > be flying.. [Ted also presented this in the Hadoop China]  Thanks to
> > > > > Matt... :)  I am trying to measure the scan performance with this new
> > > > > encoding. Trying to back port a simple patch for 94 version just for
> > > > > testing...   Yes, when the no of results to be returned is more and
> > > > > more any index will become less performing as per my study  :)
> > > >
> > > > Do you have link to that presentation?
> > >
> > >
> > > > > >btw, quick question- in your presentation, the scale there is
> > > > > >seconds or milliseconds :)
> > > > >
> > > > > It is seconds.  Don't consider the exact values. What matters is the %
> > > > > increase in latency :) Those were not high end machines.
> > > >
> > > > -Anoop-
> > > > ________________________________________
> > > > From: Shengjie Min [[email protected]]
> > > > Sent: Thursday, December 27, 2012 9:59 PM
> > > > To: [email protected]
> > > > Subject: Re: HBase - Secondary Index
> > > >
> > > > > >Didn't follow you completely here. There won't be any get() happening..
> > > > > >As the exact rowkey in a region we get from the index table, we can
> > > > > >seek to the exact position and return that row.
> > > >
> > > > > Sorry, I misused "get()" here; I meant seeking. Yes, if it's just a
> > > > > small number of rows returned, this works perfectly. As you said, you
> > > > > will get the exact rowkey positions per region, and simply seek them.
> > > > > I was trying to work out the case where the number of result rows
> > > > > increases massively. Like in Anil's case, he wants to do a scan query
> > > > > against the 2ndary index (timestamp): "select all rows from timestamp1
> > > > > to timestamp2" given no customerId provided. During that time period,
> > > > > he might have a big chunk of rows from different customerIds. The index
> > > > > table returns a lot of rowkey positions for different customerIds (I
> > > > > believe they are scattered in different regions), then you end up
> > > > > seeking all different positions in different regions and returning all
> > > > > the rows needed. According to your presentation page 14 - Performance
> > > > > Test Results (Scan), without index, it's a linear increase as the
> > > > > number of result rows increases. On the other hand, with index, time
> > > > > spent climbs up way quicker than the case without index.
> > > > >
> > > > > btw, quick question- in your presentation, the scale there is seconds
> > > > > or milliseconds :)
> > > >
> > > > - Shengjie
> > > >
> > > >
> > > > On 27 December 2012 15:54, Anoop John <[email protected]> wrote:
> > > >
> > > > > > >how the massive number of get() is going to
> > > > > > >perform against the main table
> > > > > >
> > > > > > Didn't follow you completely here. There won't be any get() happening..
> > > > > > As the exact rowkey in a region we get from the index table, we can
> > > > > > seek to the exact position and return that row.
> > > > >
> > > > > -Anoop-
> > > > >
> > > > > > On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > how the massive number of get() is going to
> > > > > > > perform against the main table
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > All the best,
> > > > Shengjie Min
> > > >
> > >
> >
> >
> >
> > --
> > All the best,
> > Shengjie Min
> >
>
>
>
> --
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me
>
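
To make the scan case discussed in the quoted thread concrete, here is a toy, in-memory sketch in plain Python. It is not the HBase API and not the implementation discussed in the thread; the region boundaries, row keys, and function names are all made up for illustration. It shows how a timestamp range scan on an index table can return main-table row keys that land in several regions, so each hit becomes a separate seek.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Hypothetical main table: row key = customerId + "|" + event id, split into
# three regions by start key (boundaries are invented for this sketch).
REGION_START_KEYS = ["", "cust2", "cust4"]

# Hypothetical index table: sorted (timestamp, main-table row key) pairs.
INDEX = sorted([
    (100, "cust1|a"),
    (105, "cust3|b"),
    (110, "cust5|c"),
    (120, "cust1|d"),
    (130, "cust4|e"),
    (200, "cust2|f"),
])


def region_for(rowkey):
    """Return the index of the region whose key range contains rowkey."""
    return bisect_right(REGION_START_KEYS, rowkey) - 1


def index_range_scan(ts_from, ts_to):
    """Return main-table row keys whose timestamp is in [ts_from, ts_to]."""
    lo = bisect_left(INDEX, (ts_from, ""))
    hi = bisect_right(INDEX, (ts_to, "\uffff"))
    return [rowkey for _, rowkey in INDEX[lo:hi]]


def seeks_per_region(rowkeys):
    """Group the row keys to seek by the region that would serve them."""
    groups = defaultdict(list)
    for rowkey in rowkeys:
        groups[region_for(rowkey)].append(rowkey)
    return dict(groups)


if __name__ == "__main__":
    hits = index_range_scan(100, 130)
    print(hits)
    print(seeks_per_region(hits))
```

In this toy data set, five index hits spread over all three regions, which is the situation described above: the more rows the index returns, the more scattered seeks the main table has to serve.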
