Re: Of hbase key distribution and query scalability, again.

Michael Segel Sat, 26 May 2012 09:26:00 -0700

Hi,

Jumping in on this late...


>>>> To cut a long story short, is region size the only current HBase
>>>> technique to balance load, esp. w.r.t. query load? Or perhaps there are
>>>> some more advanced techniques to do that?

So maybe I'm missing something but I don't see the problem.

To get writes evenly/randomly distributed, you would hash the
key (MD5 or SHA-1, for example).
This works well if you're doing get()s and not a lot of scan()s, since
hashing destroys the natural key order that range scans rely on.
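
A minimal sketch of the hashed-key idea, using only the JDK so it runs
without a cluster. The class and method names (HashedRowKey, hashedRowKey)
are made up for illustration and are not part of the HBase API; prefixing
the natural key with its full MD5 digest is just one possible layout.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedRowKey {
    // Hypothetical helper (not an HBase API): prefix the natural key with its
    // MD5 digest so that adjacent natural keys land far apart in the key space.
    static byte[] hashedRowKey(String naturalKey) {
        try {
            byte[] raw = naturalKey.getBytes(StandardCharsets.UTF_8);
            byte[] digest = MessageDigest.getInstance("MD5").digest(raw); // 16 bytes
            byte[] rowKey = new byte[digest.length + raw.length];
            System.arraycopy(digest, 0, rowKey, 0, digest.length);
            System.arraycopy(raw, 0, rowKey, digest.length, raw.length);
            return rowKey;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Sequential user ids end up with unrelated leading bytes.
        for (String user : new String[] {"user0001", "user0002", "user0003"}) {
            byte[] key = hashedRowKey(user);
            System.out.printf("%s -> first byte 0x%02x, length %d%n",
                    user, key[0] & 0xff, key.length);
        }
    }
}
```

Because the prefix is derived from the key itself, a get() can recompute
it from the natural key alone. Note that this only evens out where the
keys live; it says nothing about how often each key is read.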

But on reads, how do you get 'hot spotting'?

Should those rows be cached in memory? 

So what am I missing? Besides another cup of coffee?  

-Mike

On May 25, 2012, at 1:23 PM, Ian Varley wrote:

> Yeah, I think you're right Dmitriy; there's nothing like that in HBase today 
> as far as I know. If it'd be useful for you, maybe it would be for others, 
> too; work up a rough patch and see what people think on the dev list.
> 
> Ian
> 
> On May 25, 2012, at 1:02 PM, Dmitriy Lyubimov wrote:
> 
>> Thanks, Ian.
>> 
>> I am talking about the situation where, even when we have uniform keys,
>> the query distribution over them is still non-uniform and impossible to
>> predict without sampling query skewness, and the skewness is surprisingly
>> large (as in, the least active and most active users may differ in
>> activity by a factor of 100, and there is no way one could know which
>> users are going to be active and which are not). Assuming there are a few
>> very active users but many low-activity users, if two active users land
>> in the same region, it creates a hotspot which could have been
>> avoided if the region balancer took into account the number of hits the
>> regions have been getting recently.
>> 
>> Like I pointed out before, such a skewness balancer could be fairly
>> easily implemented externally to HBase (as in TotalOrderPartitioner),
>> except that it would interfere with HBase's own balancer, so in that
>> case it must be integrated with the balancer.
>> 
>> Another distinct problem is the time parameters of such a balance
>> controller. The load may change quickly or slowly, so the sampling
>> itself must be time-weighted.
>> 
>> All these technicalities make it difficult to implement this outside
>> HBase or to use key manipulation (the dynamic nature makes it difficult
>> to re-assign keys to match a newly discovered load
>> distribution).
>> 
>> OK, I guess there's nothing in HBase like that right now; otherwise I
>> would've seen it in the book, I suppose...
>> 
>> Thanks.
>> -d
>> 
>> On Fri, May 25, 2012 at 10:42 AM, Ian Varley <[email protected]> wrote:
>>> Dmitriy,
>>> 
>>> If I understand you right, what you're asking about might be called "Read 
>>> Hotspotting". For an obvious example, if I distribute my data nicely over 
>>> the cluster but then say:
>>> 
>>> // long counter: 10000000000 does not fit in an int literal
>>> for (long x = 0; x < 10000000000L; x++) {
>>>   htable.get(new Get(Bytes.toBytes("row1")));
>>> }
>>> 
>>> Then naturally I'm only putting read load on the region server that hosts 
>>> "row1". That's contrived, of course; you'd never really do that. But I can 
>>> imagine plenty of situations where there's an imbalance in query load w/r/t 
>>> the leading part of the row key of a table. It's not fundamentally 
>>> different from "write hotspotting", except that it's probably less common 
>>> (it happens frequently in writes because ascending data in a time series or 
>>> number sequence is a common thing to insert into a database).
>>> 
>>> I guess the simple answer is, if you know of non-even distribution of read 
>>> patterns, it might be something to consider in a custom partitioning of the 
>>> data into regions. I don't know of any other technique (short of some 
>>> external caching mechanism) that'd alleviate this; at base, you still have 
>>> to ask exactly one RS for any given piece of data.
>>> 
>>> Ian
>>> 
>>> On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I'd like to collect opinions from HBase experts on query
>>>> uniformity, and on whether any advanced technique currently exists
>>>> in HBase to cope with the problems of query uniformity beyond just
>>>> maintaining a uniform key distribution.
>>>> 
>>>> I know we start with the statement that in order to scale queries, we
>>>> need them uniformly distributed over the key space. The next advice
>>>> people get is to use a uniformly distributed key. Then, the thinking
>>>> goes, the query load will also be uniformly distributed among regions.
>>>> 
>>>> For what seems to be an embarrassingly long time, however, I was
>>>> missing the point that using uniformly distributed keys does not equate
>>>> to a uniform distribution of queries, since it doesn't account for
>>>> the skewness of queries over the key space itself. This skewness can be
>>>> bad enough under some circumstances to create query hot spots in the
>>>> cluster, which could have been avoided had region splits been
>>>> balanced based on query load rather than on data size per se (a sort
>>>> of dynamic query distribution sampling in order to equalize the load,
>>>> similar to how TotalOrderPartitioner does random data sampling to
>>>> build a distribution of the key skewness in the incoming data).
>>>> 
>>>> To cut a long story short, is region size the only current HBase
>>>> technique to balance load, esp. w.r.t. query load? Or perhaps there are
>>>> some more advanced techniques to do that?
>>>> 
>>>> Thank you very much.
>>>> -Dmitriy
>>> 
> 
>
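
The time-weighted sampling Dmitriy describes in the quoted thread could be
sketched as an exponentially decayed hit counter per region. This is only
an illustration under stated assumptions: the class DecayedRegionHits, its
methods, and the half-life parameter are hypothetical and do not exist in
HBase, and a real balancer would also need to decide when a score
difference justifies moving or splitting a region.

```java
import java.util.HashMap;
import java.util.Map;

public class DecayedRegionHits {
    private final double halfLifeMillis;
    // region name -> {decayed score, timestamp of last update in millis}
    private final Map<String, double[]> state = new HashMap<>();

    DecayedRegionHits(double halfLifeMillis) {
        this.halfLifeMillis = halfLifeMillis;
    }

    // Record one query hit; older hits are discounted before adding the new one.
    void recordHit(String region, long nowMillis) {
        double[] s = state.computeIfAbsent(region, r -> new double[] {0.0, nowMillis});
        s[0] = decay(s[0], nowMillis - s[1]) + 1.0;
        s[1] = nowMillis;
    }

    // Current time-weighted score; a hit loses half its weight per half-life.
    double score(String region, long nowMillis) {
        double[] s = state.get(region);
        return s == null ? 0.0 : decay(s[0], nowMillis - s[1]);
    }

    private double decay(double value, double elapsedMillis) {
        return value * Math.pow(0.5, elapsedMillis / halfLifeMillis);
    }

    public static void main(String[] args) {
        DecayedRegionHits hits = new DecayedRegionHits(60_000); // 1 minute half-life
        for (int i = 0; i < 100; i++) hits.recordHit("regionA", 0);      // old burst
        for (int i = 0; i < 20; i++) hits.recordHit("regionB", 300_000); // recent load
        System.out.println("regionA: " + hits.score("regionA", 300_000));
        System.out.println("regionB: " + hits.score("regionB", 300_000));
    }
}
```

A short half-life makes the score follow fast-changing load, while a long
one smooths out bursts; that is the tuning problem raised in the thread.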
