Re: Of hbase key distribution and query scalability, again.

Michael Segel Sat, 26 May 2012 10:06:57 -0700

Ian

Understood.


Dmitriy,
Could you show a use case where you see this happening? 

If you have records that are being read that frequently, they would be cached 
in memory. 

I think you could keep some kind of system table and use coprocessors to 
update that table with the observed read patterns. 

There'd be a performance hit. (I don't know how much of one, but it will 
exist...) But it's possible...
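
A rough sketch of the bookkeeping side of that idea, in plain Java. It only
shows a time-weighted read counter per region; the class name, the method
names and the half-life parameter are all made up for illustration, and the
coprocessor hook that would call recordRead(), as well as the write to the
system table, are deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not an HBase API: exponentially decayed read counts per region. */
final class ReadHeatTracker {
    private final double halfLifeMillis;
    // region name -> {decayed score, time of last update in millis}
    private final Map<String, double[]> heat = new HashMap<>();

    ReadHeatTracker(double halfLifeMillis) {
        this.halfLifeMillis = halfLifeMillis;
    }

    /** Record one read against a region; older reads lose weight over time. */
    synchronized void recordRead(String region, long nowMillis) {
        double[] entry = heat.computeIfAbsent(region, r -> new double[] {0.0, nowMillis});
        entry[0] = decayed(entry, nowMillis) + 1.0;
        entry[1] = nowMillis;
    }

    /** Current decayed score of a region, or 0.0 if it was never read. */
    synchronized double score(String region, long nowMillis) {
        double[] entry = heat.get(region);
        return entry == null ? 0.0 : decayed(entry, nowMillis);
    }

    /** Region with the highest decayed score, or null if nothing was recorded. */
    synchronized String hottest(long nowMillis) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, double[]> e : heat.entrySet()) {
            double s = decayed(e.getValue(), nowMillis);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    // The score halves every halfLifeMillis since the last update.
    private double decayed(double[] entry, long nowMillis) {
        double age = Math.max(0.0, nowMillis - entry[1]);
        return entry[0] * Math.pow(0.5, age / halfLifeMillis);
    }
}
```

The half-life is the knob for how quickly old traffic is forgotten: a short
one reacts to bursts, a long one smooths them out. Whatever hottest() returns
would only be a candidate for a manual split, nothing more.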

On May 26, 2012, at 11:45 AM, Ian Varley wrote:

> Mike,
> 
> I gather that Dmitriy is asking whether there are any smarts in the region 
> balancer based on heavy *read* traffic (i.e. if it turns out that your read 
> load is heavily skewed towards a small subset of regions). Which there 
> aren't, but could be if someone wanted to write the infrastructure for it 
> (which would likely be complex, as you'd have to persist information about 
> read traffic somewhere other than the logs). Then read-hot regions would be 
> candidates for splitting, not just based on their size but also based on 
> their read traffic.
> 
> Caching is relevant to help read performance, for sure, but there could still 
> be scenarios where your read traffic is all stuck in one region, and even 
> after all other optimizations, it still leaves one region hot and the rest 
> cold.
> 
> To be totally clear, Dmitriy: I think this is a pretty advanced feature 
> that's not high on the overall priority list, because in such a rare 
> situation you could always manually split that region.
> 
> Ian
> 
> On May 26, 2012, at 11:25 AM, Michael Segel wrote:
> 
>> Hi,
>> 
>> Jumping in on this late...
>> 
>>>>>> To cut a long story short, is the region size the only current HBase
>>>>>> technique to balance load, esp. w.r.t. query load? Or perhaps there are
>>>>>> some more advanced techniques to do that?
>> 
>> So maybe I'm missing something but I don't see the problem.
>> 
>> In terms of writing data to be evenly/randomly distributed, you would hash 
>> the key (md5 or SHA-1 as examples). 
>> This works well if you're doing get()s and not a lot of scan()s. 
>> 
>> But on reads, how do you get 'hot spotting'? 
>> 
>> Shouldn't those rows be cached in memory? 
>> 
>> So what am I missing? Besides another cup of coffee?  
>> 
>> -Mike
>> 
>> On May 25, 2012, at 1:23 PM, Ian Varley wrote:
>> 
>>> Yeah, I think you're right Dmitriy; there's nothing like that in HBase 
>>> today as far as I know. If it'd be useful for you, maybe it would be for 
>>> others, too; work up a rough patch and see what people think on the dev 
>>> list.
>>> 
>>> Ian
>>> 
>>> On May 25, 2012, at 1:02 PM, Dmitriy Lyubimov wrote:
>>> 
>>>> Thanks, Ian.
>>>> 
>>>> I am talking about the situation where, even when we have uniform keys,
>>>> the query distribution over them is still non-uniform and impossible to
>>>> predict without sampling query skewness, and the skewness is surprisingly
>>>> great (as in, the least active and most active users may differ in
>>>> activity by a factor of 100, and there is no way one could know which
>>>> users are going to be active and which are not). Assuming there are a few
>>>> very active users but many low-activity users, if two active users land
>>>> in the same region, it creates a hotspot which could have been
>>>> avoided if the region balancer took into account the number of hits the
>>>> regions have been getting recently.
>>>> 
>>>> As I pointed out before, such a skewness balancer could be fairly
>>>> easily implemented externally to HBase (as in TotalOrderPartitioner),
>>>> with the exception that it would interfere with HBase's own balancer,
>>>> so it must be integrated with the balancer in that case.
>>>> 
>>>> Another distinct problem is the time parameters of such a balance
>>>> controller. The load may be changing fast enough or slow enough
>>>> that the sampling must itself be time-weighted.
>>>> 
>>>> All these technicalities make it difficult to implement outside
>>>> HBase or with key manipulation (the dynamic nature makes it difficult to
>>>> re-assign keys to match a newly discovered load
>>>> distribution).
>>>> 
>>>> OK, I guess there's nothing in HBase like that right now, otherwise I
>>>> would've seen it in the book, I suppose...
>>>> 
>>>> Thanks.
>>>> -d
>>>> 
>>>> On Fri, May 25, 2012 at 10:42 AM, Ian Varley <[email protected]> 
>>>> wrote:
>>>>> Dmitriy,
>>>>> 
>>>>> If I understand you right, what you're asking about might be called "Read 
>>>>> Hotspotting". For an obvious example, if I distribute my data nicely over 
>>>>> the cluster but then say:
>>>>> 
>>>>> // long counter: 10000000000 does not fit in an int
>>>>> for (long x = 0; x < 10000000000L; x++) {
>>>>>   htable.get(new Get(Bytes.toBytes("row1")));
>>>>> }
>>>>> 
>>>>> Then naturally I'm only putting read load on the region server that hosts 
>>>>> "row1". That's contrived, of course, you'd never really do that. But I 
>>>>> can imagine plenty of situations where there's an imbalance in query load 
>>>>> w/r/t the leading part of the row key of a table. It's not fundamentally 
>>>>> different from "write hotspotting", except that it's probably less common 
>>>>> (it happens frequently in writes because ascending data in a time series 
>>>>> or number sequence is a common thing to insert into a database).
>>>>> 
>>>>> I guess the simple answer is, if you know of non-even distribution of 
>>>>> read patterns, it might be something to consider in a custom partitioning 
>>>>> of the data into regions. I don't know of any other technique (short of 
>>>>> some external caching mechanism) that'd alleviate this; at base, you 
>>>>> still have to ask exactly one RS for any given piece of data.
>>>>> 
>>>>> Ian
>>>>> 
>>>>> On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:
>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I'd like to collect opinions from HBase experts on query
>>>>>> uniformity, and on whether any advanced technique currently exists
>>>>>> in HBase to cope with the problems of query uniformity beyond just
>>>>>> maintaining a uniform key distribution.
>>>>>> 
>>>>>> I know we start with the statement that in order to scale queries, we
>>>>>> need them uniformly distributed over the key space. The next advice
>>>>>> people get is to use uniformly distributed keys. Then, the thinking
>>>>>> goes, the query load will also be uniformly distributed among regions.
>>>>>> 
>>>>>> For what seems to be an embarrassingly long time I was missing the
>>>>>> point, however, that using uniformly distributed keys does not equate
>>>>>> to a uniform distribution of the queries, since it doesn't account for
>>>>>> skewness of queries over the key space itself. This skewness can be
>>>>>> bad enough under some circumstances to create query hot spots in the
>>>>>> cluster which could have been avoided had region splits been
>>>>>> balanced based on query load rather than on data size per se (a sort
>>>>>> of dynamic query distribution sampling in order to equalize the load,
>>>>>> similar to how TotalOrderPartitioner does random data sampling to
>>>>>> build a distribution of the key skewness in the incoming data).
>>>>>> 
>>>>>> To cut a long story short, is the region size the only current HBase
>>>>>> technique to balance load, esp. w.r.t. query load? Or perhaps there are
>>>>>> some more advanced techniques to do that?
>>>>>> 
>>>>>> Thank you very much.
>>>>>> -Dmitriy
>>>>> 
>>> 
>>> 
>> 
> 
>
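
For reference, the hashed-key approach mentioned in the quoted thread (hash
the key with MD5 so that writes spread evenly) could look roughly like the
sketch below. The class and method names are hypothetical, and keeping the
original key after the digest is an assumption made here so the natural key
can still be recovered from the row key.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: builds a row key of the form MD5(naturalKey) + naturalKey. */
final class HashedRowKey {
    private HashedRowKey() {}

    static byte[] of(String naturalKey) {
        try {
            byte[] raw = naturalKey.getBytes(StandardCharsets.UTF_8);
            // The 16-byte digest prefix spreads keys roughly evenly over the key space.
            byte[] digest = MessageDigest.getInstance("MD5").digest(raw);
            byte[] rowKey = new byte[digest.length + raw.length];
            System.arraycopy(digest, 0, rowKey, 0, digest.length);
            System.arraycopy(raw, 0, rowKey, digest.length, raw.length);
            return rowKey;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

This evens out where rows land, which suits get()s but not range scan()s, and
it does nothing about skew in which rows are actually read.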
