Does anyone have any links or information on the new prefix encoding feature in HBase that's being referred to in this mail?

On Sun, Jan 6, 2013 at 12:30 PM, Adrien Mogenet <[email protected]> wrote:

> Nice topic, perhaps one of the most important for 2013 :-)
> I still don't get how you're ensuring consistency between index table and
> main table, without an external component (such as bookkeeper/zookeeper).
> What's the exact write path in your situation when inserting data?
> (WAL/RegionObserver, pre/post put/WALedit...)
>
> The underlying question is about how you're ensuring that WALEdit in Index
> and Main tables are perfectly sync'ed, and how you're able to rollback in
> case of issue in both WAL?
>
> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <[email protected]> wrote:
>
> > >Yes as you say when the no. of rows to be returned is becoming more and
> > >more the latency will be becoming more. Seeks within an HFile block is
> > >somewhat expensive op now. (Not much but still) The new encoding prefix
> > >trie will be a huge bonus here. There the seeks will be flying.. [Ted
> > >also presented this in the Hadoop China] Thanks to Matt... :) I am
> > >trying to measure the scan performance with this new encoding. Trying
> > >to back port a simple patch for 94 version just for testing... Yes
> > >when the no. of results to be returned is more and more any index will
> > >become less performing as per my study :)
> >
> > yes, you are right, I guess it's just a drawback of any index approach.
> > Thanks for the explanation.
> >
> > Shengjie
> >
> > On 28 December 2012 04:14, Anoop Sam John <[email protected]> wrote:
> >
> > > > Do you have link to that presentation?
> > >
> > > http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
> > >
> > > -Anoop-
> > >
> > > ________________________________________
> > > From: Mohit Anchlia [[email protected]]
> > > Sent: Friday, December 28, 2012 9:12 AM
> > > To: [email protected]
> > > Subject: Re: HBase - Secondary Index
> > >
> > > On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <[email protected]>
> > > wrote:
> > >
> > > > Yes as you say when the no. of rows to be returned is becoming more
> > > > and more the latency will be becoming more. Seeks within an HFile
> > > > block is somewhat expensive op now. (Not much but still) The new
> > > > encoding prefix trie will be a huge bonus here. There the seeks will
> > > > be flying.. [Ted also presented this in the Hadoop China] Thanks to
> > > > Matt... :) I am trying to measure the scan performance with this new
> > > > encoding. Trying to back port a simple patch for 94 version just for
> > > > testing... Yes when the no. of results to be returned is more and
> > > > more any index will become less performing as per my study :)
> > > >
> > > > Do you have link to that presentation?
> > > >
> > > > >btw, quick question- in your presentation, the scale there is
> > > > >seconds or milliseconds:)
> > > >
> > > > It is seconds. Don't consider the exact values. What is the % of
> > > > increase in latency is important :) Those were not high end machines.
> > > >
> > > > -Anoop-
> > > > ________________________________________
> > > > From: Shengjie Min [[email protected]]
> > > > Sent: Thursday, December 27, 2012 9:59 PM
> > > > To: [email protected]
> > > > Subject: Re: HBase - Secondary Index
> > > >
> > > > >Didn't follow you completely here. There won't be any get()
> > > > >happening.. As the exact rowkey in a region we get from the index
> > > > >table, we can seek to the exact position and return that row.
> > > >
> > > > Sorry, when I misused "get()" here, I meant seeking. Yes, if it's
> > > > just a small number of rows returned, this works perfectly. As you
> > > > said you will get the exact rowkey positions per region, and simply
> > > > seek them. I was trying to work out the case when the number of
> > > > result rows increases massively. Like in Anil's case, he wants to do
> > > > a scan query against the secondary index (timestamp): "select all
> > > > rows from timestamp1 to timestamp2" given no customerId provided.
> > > > During that time period, he might have a big chunk of rows from
> > > > different customerIds. The index table returns a lot of rowkey
> > > > positions for different customerIds (I believe they are scattered in
> > > > different regions), then you end up seeking all different positions
> > > > in different regions and return all the rows needed. According to
> > > > your presentation page 14 - Performance Test Results (Scan), without
> > > > index, it's a linear increase as result rows # increases. On the
> > > > other hand, with index, time spent climbs up way quicker than the
> > > > case without index.
> > > >
> > > > btw, quick question- in your presentation, the scale there is
> > > > seconds or milliseconds:)
> > > >
> > > > - Shengjie
> > > >
> > > > On 27 December 2012 15:54, Anoop John <[email protected]> wrote:
> > > >
> > > > > >how the massive number of get() is going to
> > > > > >perform against the main table
> > > > >
> > > > > Didn't follow you completely here. There won't be any get()
> > > > > happening.. As the exact rowkey in a region we get from the index
> > > > > table, we can seek to the exact position and return that row.
> > > > >
> > > > > -Anoop-
> > > > >
> > > > > On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > how the massive number of get() is going to
> > > > > > perform against the main table
> > > >
> > > > --
> > > > All the best,
> > > > Shengjie Min
> >
> > --
> > All the best,
> > Shengjie Min
>
> --
> Adrien Mogenet
> 06.59.16.64.22
> http://www.mogenet.me
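
For readers following the quoted discussion, here is a toy model of the access pattern Shengjie and Anoop describe: a timestamp range on the index table yields rowkeys, and each rowkey then costs one seek into the main table. This is only a sketch under stated assumptions. The names `main_table`, `index_table` and `scan_by_timestamp`, the `custNNN-TTTTT` rowkey layout, and the use of sorted Python lists in place of regions and HFiles are all hypothetical and are not HBase API.

```python
import bisect

# Hypothetical stand-in for the main table: rows sorted by rowkey,
# where the rowkey is "<customerId>-<timestamp>" (an assumed layout).
main_table = sorted(
    (f"cust{c:03d}-{ts:05d}", f"row for customer {c} at {ts}")
    for c in range(5)
    for ts in range(0, 100, 10)
)
main_keys = [key for key, _ in main_table]

# Hypothetical stand-in for the index table: sorted by timestamp,
# each entry pointing back at a main-table rowkey.
index_table = sorted((int(key.split("-")[1]), key) for key in main_keys)
index_ts = [ts for ts, _ in index_table]


def scan_by_timestamp(ts_start, ts_end):
    """Return (rows, seeks) for ts_start <= timestamp < ts_end."""
    # One range lookup on the index, which is contiguous in timestamp order.
    lo = bisect.bisect_left(index_ts, ts_start)
    hi = bisect.bisect_left(index_ts, ts_end)
    rows, seeks = [], 0
    for _, rowkey in index_table[lo:hi]:
        # One seek into the main table per index hit; the hits are
        # scattered across customers, so they are not contiguous there.
        pos = bisect.bisect_left(main_keys, rowkey)
        seeks += 1
        rows.append(main_table[pos])
    return rows, seeks


rows, seeks = scan_by_timestamp(10, 30)
print(len(rows), "rows returned using", seeks, "main-table seeks")
```

In this model the number of main-table seeks always equals the number of rows returned, which is the behaviour the thread points at: cheap for a handful of rows, increasingly costly as the result set grows.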
