Dmitriy,
If I understand you right, what you're asking about might be called "Read
Hotspotting". For an obvious example, if I distribute my data nicely over the
cluster but then say:
    for (long x = 0; x < 10000000000L; x++) {
      htable.get(new Get(Bytes.toBytes("row1")));
    }
Then naturally I'm only putting read load on the region server that hosts
"row1". That's contrived, of course; you'd never really do that. But I can
imagine plenty of situations where there's an imbalance in query load w/r/t the
leading part of the row key of a table. It's not fundamentally different from
"write hotspotting", except that it's probably less common (it happens
frequently in writes because ascending data in a time series or number sequence
is a common thing to insert into a database).
I guess the simple answer is: if you know your reads are unevenly distributed
over the key space, that's something to take into account when
custom-partitioning the data into regions. I don't know of any other technique
(short of some external caching mechanism) that'd alleviate this; at base, you
still have to ask exactly one region server (RS) for any given piece of data.
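To make the custom-partitioning idea a bit more concrete, here's a rough sketch in plain Java (no HBase calls) of picking region boundaries from read counts instead of data size. Everything in it is an assumption for illustration: the class and method names (ReadAwareSplits, splitKeys) are made up, the per-row read counts are presumed to come from some sampling of your own query traffic, and row keys are treated as ASCII strings so that Java's string ordering matches HBase's byte ordering. It walks the keys in order and starts a new region whenever the running read total is as close as it is going to get to the next equal-share threshold.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, not part of HBase.
class ReadAwareSplits {

    /**
     * Picks at most (numRegions - 1) split keys so that each region gets
     * roughly the same share of the sampled reads. Each returned key is the
     * start key of a new region.
     */
    static List<String> splitKeys(SortedMap<String, Long> readCounts, int numRegions) {
        List<String> splits = new ArrayList<>();
        long total = 0;
        for (long c : readCounts.values()) {
            total += c;
        }
        if (numRegions < 2 || total == 0) {
            return splits; // nothing to balance
        }

        double target = (double) total / numRegions; // ideal reads per region
        double threshold = target;                   // next cumulative boundary
        long seen = 0;                               // reads on keys before the current one

        for (Map.Entry<String, Long> e : readCounts.entrySet()) {
            long count = e.getValue();
            // Split in front of this key if stopping here lands at least as
            // close to the threshold as including the key would.
            boolean closerBefore =
                Math.abs(seen - threshold) <= Math.abs(seen + count - threshold);
            if (seen > 0 && closerBefore && splits.size() < numRegions - 1) {
                splits.add(e.getKey());
                threshold += target;
            }
            seen += count;
        }
        return splits;
    }

    public static void main(String[] args) {
        // Made-up sample: rows "c" and "d" take 80% of the reads.
        SortedMap<String, Long> sample = new TreeMap<>();
        sample.put("a", 5L);
        sample.put("b", 5L);
        sample.put("c", 40L);
        sample.put("d", 40L);
        sample.put("e", 5L);
        sample.put("f", 5L);
        System.out.println(splitKeys(sample, 4)); // prints [c, d, e]
    }
}
```

With four regions the two hot rows each end up in a region of their own, which is about the best a split can do, since a single row can't be spread over more than one RS. Keys chosen this way could then be converted with Bytes.toBytes and handed to HBaseAdmin.createTable(HTableDescriptor, byte[][] splitKeys) to pre-split a table.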
Ian
On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:
> Hello,
>
> I'd like to collect opinions from HBase experts on query
> uniformity and whether any advanced technique currently exists
> in HBase to cope with the problems of query uniformity beyond just
> maintaining a uniform key distribution.
>
> I know we start with the statement that in order to scale queries, we
> need them uniformly distributed over key space. The next advice people
> get is to use a uniformly distributed key. Then, the thinking goes, the
> query load will also be uniformly distributed among regions.
>
> For what seems to be an embarrassingly long time I was missing the
> point, however, that using uniformly distributed keys does not equate to
> uniform distribution of the queries, since it doesn't account for
> skewness of queries over the key space itself. This skewness can be
> bad enough under some circumstances to create query hot spots in the
> cluster, which could have been avoided had region splits been
> balanced based on query load rather than on data size per se (a sort
> of dynamic query distribution sampling in order to equalize the load,
> similar to how TotalOrderPartitioner does random data sampling to
> build a distribution of the key skewness in the incoming data).
>
> To cut a long story short, is the region size the only current HBase
> technique to balance load, esp. w.r.t. query load? Or perhaps there are
> some more advanced techniques to do that?
>
> Thank you very much.
> -Dmitriy