Re: Of hbase key distribution and query scalability, again.

Ian Varley Fri, 25 May 2012 10:43:10 -0700

Dmitriy,

If I understand you right, what you're asking about might be called "Read 
Hotspotting". For an obvious example, if I distribute my data nicely over the 
cluster but then say:

// Contrived: every read hits the same row, and so the same region server.
for (long x = 0; x < 10000000000L; x++) {
   htable.get(new Get(Bytes.toBytes("row1")));
}

Then naturally I'm only putting read load on the region server that hosts 
"row1". That's contrived, of course; you'd never really do that. But I can 
imagine plenty of situations where there's an imbalance in query load w/r/t the 
leading part of the row key of a table. It's not fundamentally different from 
"write hotspotting", except that it's probably less common (it happens 
frequently in writes because ascending data in a time series or number sequence 
is a common thing to insert into a database).

I guess the simple answer is: if you know your reads are unevenly distributed 
over the key space, that might be something to factor into a custom 
partitioning of the data into regions. I don't know of any other technique 
(short of some external caching mechanism) that'd alleviate this; at base, you 
still have to ask exactly one RS for any given piece of data.
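
As a rough illustration of that kind of custom partitioning, here is a minimal 
sketch that picks split keys so each region gets a similar share of sampled 
reads rather than a similar share of data. Everything in it is an assumption 
for illustration: the class name ReadLoadSplitter, the idea that the 
application has already collected per-row read counts, and the use of ASCII 
row keys (so that Java String order matches HBase's byte order). It is not 
something HBase does for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: chooses split keys from sampled read counts. */
final class ReadLoadSplitter {

    /**
     * @param readCounts sampled reads per row key (assumed collected by the application)
     * @param numRegions desired number of regions
     * @return at most numRegions - 1 split keys, ascending; each is the first key of a new region
     */
    static List<String> splitKeys(Map<String, Long> readCounts, int numRegions) {
        // TreeMap iterates in sorted key order (equals byte order only for ASCII keys).
        TreeMap<String, Long> sorted = new TreeMap<>(readCounts);
        long total = 0;
        for (long count : sorted.values()) {
            total += count;
        }

        List<String> splits = new ArrayList<>();
        long seen = 0; // reads attributed to keys before the current one
        for (Map.Entry<String, Long> entry : sorted.entrySet()) {
            // Begin a new region at this key once the reads seen so far
            // reach the next equal-share boundary of the total.
            boolean moreSplitsAllowed = splits.size() < numRegions - 1;
            if (moreSplitsAllowed && seen > 0
                    && seen * numRegions >= total * (splits.size() + 1)) {
                splits.add(entry.getKey());
            }
            seen += entry.getValue();
        }
        return splits;
    }

    /** Converts split keys to the byte[][] form that table-creation calls expect. */
    static byte[][] toBytes(List<String> keys) {
        byte[][] out = new byte[keys.size()][];
        for (int i = 0; i < keys.size(); i++) {
            out[i] = keys.get(i).getBytes(StandardCharsets.UTF_8);
        }
        return out;
    }
}
```

The resulting byte[][] could then be handed to the admin API's createTable 
variant that takes split keys. Note the limit this can't get around: a single 
hot row still lands in exactly one region, so the best this does is isolate 
it from its neighbors.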

Ian

On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:

> Hello,
> 
> I'd like to collect opinions from HBase experts on query
> uniformity and whether any advanced technique currently exists
> in HBase to cope with the problems of query uniformity beyond just
> maintaining a uniform key distribution.
> 
> I know we start with the statement that in order to scale queries, we
> need them uniformly distributed over the key space. The next advice people
> get is to use a uniformly distributed key. Then, the thinking goes, the
> query load will also be uniformly distributed among regions.
> 
> For what seems to be an embarrassingly long time I was missing the
> point, however, that using uniformly distributed keys does not equate to
> uniform distribution of the queries, since it doesn't account for
> skewness of queries over the key space itself. This skewness can be
> bad enough under some circumstances to create query hot spots in the
> cluster which could have been avoided had region splits been
> balanced based on query load rather than on data size per se (a sort
> of dynamic query distribution sampling in order to equalize the load,
> similar to how TotalOrderPartitioner does random data sampling to
> build a distribution of the key skewness in the incoming data).
> 
> To cut a long story short, is the region size the only current HBase
> technique to balance load, esp. w.r.t. query load? Or perhaps there are
> some more advanced techniques to do that?
> 
> Thank you very much.
> -Dmitriy
