Yes, as you say, latency grows as the number of rows to be returned grows.
Seeking within an HFile block is a somewhat expensive operation at present
(not hugely, but still). The new prefix trie encoding will be a huge bonus
here; seeks should be much faster with it. [Ted also presented this at
Hadoop China.] Thanks to Matt... :)  I am trying to measure scan performance
with this new encoding, and am trying to backport a simple patch to the 0.94
version just for testing. And yes, as per my study, any index performs worse
as the number of results to be returned grows :)

>btw, quick question- in your presentation, the scale there is seconds or
>milliseconds :)

It is seconds. Don't focus on the exact values; what matters is the
percentage increase in latency :) Those were not high-end machines.

-Anoop-
________________________________________
From: Shengjie Min [[email protected]]
Sent: Thursday, December 27, 2012 9:59 PM
To: [email protected]
Subject: Re: HBase - Secondary Index

>Didnt follow u completely here. There wont be any get() happening.. As the
>exact rowkey in a region we get from the index table, we can seek to the
>exact position and return that row.

Sorry, I misused "get()" here; I meant seeking. Yes, if only a small
number of rows is returned, this works perfectly. As you said, you get the
exact rowkey positions per region and simply seek to them. I was trying to
work out the case where the number of result rows increases massively. In
Anil's case, he wants to run a scan query against the secondary index
(timestamp): "select all rows from timestamp1 to timestamp2", with no
customerId provided. During that time period, he might have a big chunk of
rows from different customerIds. The index table returns a lot of rowkey
positions for different customerIds (I believe they are scattered across
different regions), so you end up seeking to all those positions in
different regions and returning all the rows needed. According to page 14 of
your presentation, Performance Test Results (Scan), without the index the
time increases linearly as the number of result rows increases. With the
index, on the other hand, the time spent climbs much more quickly.
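
To make the concern concrete, here is a toy sketch in Python. It is not the
implementation from the presentation, only a model of the access pattern
described above: every rowkey returned by the index costs one seek in
whichever region of the main table holds it. The region boundaries, the
rowkeys, and the helper names (region_for, plan_seeks) are all made up for
illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def region_for(rowkey, region_start_keys):
    """Return the index of the region whose key range contains rowkey.

    region_start_keys must be sorted; region i covers keys from
    region_start_keys[i] up to (but excluding) region_start_keys[i + 1].
    """
    return bisect_right(region_start_keys, rowkey) - 1


def plan_seeks(index_hits, region_start_keys):
    """Group main-table rowkeys returned by the index by region.

    Keys are sorted within each region so seeks can proceed in key order.
    """
    per_region = defaultdict(list)
    for rowkey in index_hits:
        per_region[region_for(rowkey, region_start_keys)].append(rowkey)
    return {region: sorted(keys) for region, keys in per_region.items()}


# Hypothetical layout: main table keyed by customerId, split into 4 regions.
region_start_keys = ["", "cust250", "cust500", "cust750"]

# Hypothetical index result for "timestamp1 to timestamp2": rowkeys that
# belong to different customers and therefore land in different regions.
index_hits = ["cust900_r1", "cust010_r7", "cust510_r2",
              "cust260_r3", "cust020_r4"]

plan = plan_seeks(index_hits, region_start_keys)
total_seeks = sum(len(keys) for keys in plan.values())
print(len(plan), total_seeks)  # 4 regions touched, 5 seeks in total
```

In this model the number of seeks equals the number of index hits, and the
hits spread over every region, which is the pattern that makes the indexed
scan's cost grow with the result size.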

btw, quick question: in your presentation, is the scale there seconds or
milliseconds? :)

- Shengjie


On 27 December 2012 15:54, Anoop John <[email protected]> wrote:

> >how the massive number of get() is going to
> perform againt the main table
>
> Didnt follow u completely here. There wont be any get() happening.. As the
> exact rowkey in a region we get from the index table, we can seek to the
> exact position and return that row.
>
> -Anoop-
>
> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[email protected]>
> wrote:
>
> > how the massive number of get() is going to
> > perform againt the main table
> >
>



--
All the best,
Shengjie Min
