Re: HBase - Secondary Index

Michel Segel Wed, 09 Jan 2013 02:27:57 -0800

Sorry, this makes no sense...

You are doing a range scan, I get that...


Consider that in an inverted table used as your index, each column would be a 
rowkey from the main table, and those columns are stored in sort order.

Extend get() to take in a range pair as parameters and limit the result set 
returned to those columns which fall within your range... 

Problem solved. Right?

The RPC and network traffic are kept to a minimum and you are still solving the 
underlying use case with cleaner code.
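
A rough sketch of what I mean, in plain Python rather than against the HBase 
client. The inverted table is modelled as a dict whose key is the indexed value 
and whose "columns" are main-table rowkeys kept in sort order. The names 
build_index and get_range are made up for illustration; they are not HBase 
APIs, and the sample rowkeys and values are invented.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for an inverted index table:
# index row key = indexed value, columns = main-table rowkeys kept sorted.
def build_index(rows):
    index = {}
    for rowkey, value in rows:
        index.setdefault(value, []).append(rowkey)
    for cols in index.values():
        cols.sort()
    return index

# Hypothetical range-limited get(): fetch one index row and return only
# the columns (main-table rowkeys) that fall in [start, stop).
def get_range(index, value, start, stop):
    cols = index.get(value, [])
    lo = bisect_left(cols, start)
    hi = bisect_left(cols, stop)
    return cols[lo:hi]

# Invented sample data: (main-table rowkey, indexed column value).
rows = [
    ("acme|0001", "volvo-s80"),
    ("acme|0002", "ford-f150"),
    ("bolt|0007", "volvo-s80"),
    ("zeta|0003", "volvo-s80"),
]
index = build_index(rows)

# One lookup on the index row, trimmed to the rowkey range of interest.
hits = get_range(index, "volvo-s80", "acme|", "c")
```

Only the rowkeys inside the requested range come back from the single index 
row, so nothing outside the range crosses the wire.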

Just saying...


Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 8, 2013, at 6:30 PM, lars hofhansl <[email protected]> wrote:

> Different use cases.
> 
> 
> For global point queries you want exactly what you said below.
> For range scans across many rows you want Anoop's design. As usual, it 
> depends.
> 
> 
> The tradeoff is bringing a lot of unnecessary data to the client vs having to 
> contact each region (or at least each region server).
> 
> 
> -- Lars
> 
> 
> 
> ________________________________
> From: Michael Segel <[email protected]>
> To: [email protected] 
> Sent: Tuesday, January 8, 2013 6:33 AM
> Subject: Re: HBase - Secondary Index
> 
> So if you're using an inverted table / index why on earth are you doing it at 
> the region level? 
> 
> I tried to explain this to others over 6 months ago, and it's not really a 
> good idea. 
> 
> You're overcomplicating this and you will end up creating performance 
> bottlenecks when your secondary index is completely orthogonal to your row 
> key. 
> 
> To give you an example... 
> 
> Suppose you're CCCIS and you have a large database of auto insurance claims 
> that you've acquired over the years from your Pathways product. 
> 
> Your primary key would be a combination of the Insurance Company's ID and 
> their internal claim ID for the individual claim. 
> Your row would be all of the data associated to that claim.
> 
> So now let's say you want to find the average cost to repair a front end 
> collision of an S80 Volvo. 
> The make and model of the car would be orthogonal to the initial key. This 
> means that the result set containing insurance records for Front End 
> collisions of S80 Volvos would be most likely evenly distributed across the 
> cluster's regions. 
> 
> If you used a series of inverted tables, you would be able to use a series of 
> get()s to get the result set from each index and then find their 
> intersections. (Note that you could also put them in sort order so that the 
> intersections would be fairly straightforward to find.) 
> 
> Doing this at the region level isn't so simple. 
> 
> So I have to ask again: why go through and overcomplicate things? 
> 
> Just saying... 
> 
> On Jan 7, 2013, at 7:49 AM, Anoop Sam John <[email protected]> wrote:
> 
>> Hi,
>> It is an inverted index based on column value(s).
>> The indexing is region-wise. It works whether or not someone knows the
>> rowkey range.
>> 
>> -Anoop-
>> ________________________________________
>> From: Mohit Anchlia [[email protected]]
>> Sent: Monday, January 07, 2013 9:47 AM
>> To: [email protected]
>> Subject: Re: HBase - Secondary Index
>> 
>> Hi Anoop,
>> 
>> Am I correct in understanding that this indexing mechanism is only
>> applicable when you know the row key? It's not an inverted index truly
>> based on the column value.
>> 
>> Mohit
>> On Sun, Jan 6, 2013 at 7:48 PM, Anoop Sam John <[email protected]> wrote:
>> 
>>> Hi Adrien
>>> We ensure consistency between the main table and the index table, and
>>> handle the rollback mentioned below, using the CP (coprocessor) hooks. The
>>> current hooks were not enough for that, though. I am in the process of
>>> trying to contribute those new hooks, core changes etc now. Once all of
>>> that is done I will be able to explain in detail.
>>> 
>>> -Anoop-
>>> ________________________________________
>>> From: Adrien Mogenet [[email protected]]
>>> Sent: Monday, January 07, 2013 2:00 AM
>>> To: [email protected]
>>> Subject: Re: HBase - Secondary Index
>>> 
>>> Nice topic, perhaps one of the most important for 2013 :-)
>>> I still don't get how you're ensuring consistency between index table and
>>> main table, without an external component (such as bookkeeper/zookeeper).
>>> What's the exact write path in your situation when inserting data ?
>>> (WAL/RegionObserver, pre/post put/WALedit...)
>>> 
>>> The underlying question is about how you're ensuring that the WALEdits in
>>> the Index and Main tables are perfectly sync'ed, and how you're able to
>>> roll back in case of an issue in both WALs?
>>> 
>>> 
>>> On Fri, Dec 28, 2012 at 11:55 AM, Shengjie Min <[email protected]>
>>> wrote:
>>> 
>>>>> Yes as you say when the no of rows to be returned is becoming more and
>>>>> more the latency will be becoming more.  seeks within an HFile block is
>>>>> some what expensive op now. (Not much but still)  The new encoding prefix
>>>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted also
>>>>> presented this in the Hadoop China]  Thanks to Matt... :)  I am trying to
>>>>> measure the scan performance with this new encoding . Trying to back port
>>>>> a simple patch for 94 version just for testing...   Yes when the no of
>>>>> results to be returned is more and more any index will become less
>>>>> performing as per my study  :)
>>>> 
>>>> yes, you are right, I guess it's just a drawback of any index approach.
>>>> Thanks for the explanation.
>>>> 
>>>> Shengjie
>>>> 
>>>> On 28 December 2012 04:14, Anoop Sam John <[email protected]> wrote:
>>>> 
>>>>>> Do you have link to that presentation?
>>>>> 
>>>>> http://hbtc2012.hadooper.cn/subject/track4TedYu4.pdf
>>>>> 
>>>>> -Anoop-
>>>>> 
>>>>> ________________________________________
>>>>> From: Mohit Anchlia [[email protected]]
>>>>> Sent: Friday, December 28, 2012 9:12 AM
>>>>> To: [email protected]
>>>>> Subject: Re: HBase - Secondary Index
>>>>> 
>>>>> On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Yes as you say when the no of rows to be returned is becoming more and
>>>>>> more the latency will be becoming more.  seeks within an HFile block is
>>>>>> some what expensive op now. (Not much but still)  The new encoding prefix
>>>>>> trie will be a huge bonus here. There the seeks will be flying.. [Ted also
>>>>>> presented this in the Hadoop China]  Thanks to Matt... :)  I am trying to
>>>>>> measure the scan performance with this new encoding . Trying to back port
>>>>>> a simple patch for 94 version just for testing...   Yes when the no of
>>>>>> results to be returned is more and more any index will become less
>>>>>> performing as per my study  :)
>>>>>> 
>>>>>> Do you have link to that presentation?
>>>>> 
>>>>> 
>>>>>>> btw, quick question- in your presentation, the scale there is seconds
>>>>>>> or mill-seconds:)
>>>>>> 
>>>>>> It is seconds.  Dont consider the exact values. What is the % of increase
>>>>>> in latency is important :) Those were not high end machines.
>>>>>> 
>>>>>> -Anoop-
>>>>>> ________________________________________
>>>>>> From: Shengjie Min [[email protected]]
>>>>>> Sent: Thursday, December 27, 2012 9:59 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: HBase - Secondary Index
>>>>>> 
>>>>>>> Didnt follow u completely here. There wont be any get() happening.. As
>>>>>>> the exact rowkey in a region we get from the index table, we can seek to
>>>>>>> the exact position and return that row.
>>>>>> 
>>>>>> Sorry, When I misused "get()" here, I meant seeking. Yes, if it's just
>>>>>> small number of rows returned, this works perfect. As you said you will
>>>>>> get the exact rowkey positions per region, and simply seek them. I was
>>>>>> trying to work out the case that when the number of result rows increases
>>>>>> massively. Like in Anil's case, he wants to do a scan query against the
>>>>>> 2ndary index(timestamp): "select all rows from timestamp1 to timestamp2"
>>>>>> given no customerId provided. During that time period, he might have a
>>>>>> big chunk of rows from different customerIds. The index table returns a
>>>>>> lot of rowkey positions for different customerIds (I believe they are
>>>>>> scattered in different regions), then you end up seeking all different
>>>>>> positions in different regions and return all the rows needed. According
>>>>>> to your presentation page14 - Performance Test Results (Scan), without
>>>>>> index, it's a linear increase as result rows # increases. on the other
>>>>>> hand, with index, time spent climbs up way quicker than the case without
>>>>>> index.
>>>>>> 
>>>>>> btw, quick question- in your presentation, the scale there is seconds or
>>>>>> mill-seconds:)
>>>>>> 
>>>>>> - Shengjie
>>>>>> 
>>>>>> 
>>>>>> On 27 December 2012 15:54, Anoop John <[email protected]> wrote:
>>>>>> 
>>>>>>>> how the massive number of get() is going to
>>>>>>>> perform against the main table
>>>>>>> 
>>>>>>> Didnt follow u completely here. There wont be any get() happening.. As
>>>>>>> the exact rowkey in a region we get from the index table, we can seek to
>>>>>>> the exact position and return that row.
>>>>>>> 
>>>>>>> -Anoop-
>>>>>>> 
>>>>>>> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[email protected]>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> how the massive number of get() is going to
>>>>>>>> perform against the main table
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> All the best,
>>>>>> Shengjie Min
>>>> 
>>>> 
>>>> 
>>>> --
>>>> All the best,
>>>> Shengjie Min
>>> 
>>> 
>>> 
>>> --
>>> Adrien Mogenet
>>> 06.59.16.64.22
>>> http://www.mogenet.me
