Yeah, I think you're right Dmitriy; there's nothing like that in HBase today as far as I know. If it'd be useful for you, maybe it would be for others, too; work up a rough patch and see what people think on the dev list.
Ian

On May 25, 2012, at 1:02 PM, Dmitriy Lyubimov wrote:

> Thanks, Ian.
>
> I am talking about the situation where, even when we have uniform keys,
> the query distribution over them is still non-uniform and impossible to
> predict without sampling query skewness, yet the skewness is surprisingly
> great (as in, the least active and most active users may differ in
> activity by a factor of 100, and there is no way one could know which
> users are going to be active and which are not). Assuming there are a few
> very active users but many low-activity users, if two active users land
> in the same region, it creates a hotspot which could have been avoided
> if the region balancer took into account the number of hits the regions
> have been getting recently.
>
> Like I pointed out before, such a skewness balancer could be fairly
> easily implemented externally to HBase (as in TotalOrderPartitioner),
> with the exception that it would interfere with HBase's own balancer,
> so it must be integrated with the balancer in that case.
>
> Another distinct problem is the time parameters of such a balance
> controller. The load may be changing fast enough or slow enough that
> the sampling must itself be time-weighted.
>
> All these technicalities make it difficult to implement outside HBase
> or to use key manipulation (the dynamic nature makes it difficult to
> deal with re-assigning keys to match a newly discovered load
> distribution).
>
> OK, I guess there's nothing in HBase like that right now, otherwise I
> would've seen it in the book, I suppose...
>
> Thanks.
> -d
>
> On Fri, May 25, 2012 at 10:42 AM, Ian Varley <[email protected]> wrote:
>> Dmitriy,
>>
>> If I understand you right, what you're asking about might be called "Read
>> Hotspotting".
>> For an obvious example, if I distribute my data nicely over
>> the cluster but then say:
>>
>> for (long x = 0; x < 10000000000L; x++) {
>>   htable.get(new Get(Bytes.toBytes("row1")));
>> }
>>
>> Then naturally I'm only putting read load on the region server that hosts
>> "row1". That's contrived, of course; you'd never really do that. But I can
>> imagine plenty of situations where there's an imbalance in query load w/r/t
>> the leading part of the row key of a table. It's not fundamentally different
>> from "write hotspotting", except that it's probably less common (it happens
>> frequently in writes because ascending data in a time series or number
>> sequence is a common thing to insert into a database).
>>
>> I guess the simple answer is, if you know of non-even distribution of read
>> patterns, it might be something to consider in a custom partitioning of the
>> data into regions. I don't know of any other technique (short of some
>> external caching mechanism) that'd alleviate this; at base, you still have
>> to ask exactly one RS for any given piece of data.
>>
>> Ian
>>
>> On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:
>>
>>> Hello,
>>>
>>> I'd like to collect opinions from HBase experts on query
>>> uniformity and whether any advanced technique currently exists
>>> in HBase to cope with the problems of query uniformity beyond just
>>> maintaining a uniform key distribution.
>>>
>>> I know we start with the statement that in order to scale queries, we
>>> need them uniformly distributed over the key space. The next advice people
>>> get is to use uniformly distributed keys. Then, the thinking goes, the
>>> query load will also be uniformly distributed among regions.
>>>
>>> For what seems to be an embarrassingly long time I was missing the
>>> point, however, that using uniformly distributed keys does not equate to
>>> uniform distribution of the queries, since it doesn't account for
>>> skewness of queries over the key space itself.
>>> This skewness can be
>>> bad enough under some circumstances to create query hot spots in the
>>> cluster which could have been avoided had region splits been
>>> balanced based on query load rather than on data size per se (a sort
>>> of dynamic query distribution sampling in order to equalize the load,
>>> similar to how TotalOrderPartitioner does random data sampling to
>>> build a distribution of the key skewness in the incoming data).
>>>
>>> To cut a long story short, is region size the only current HBase
>>> technique to balance load, esp. w.r.t. query load? Or perhaps there are
>>> some more advanced techniques to do that?
>>>
>>> Thank you very much.
>>> -Dmitriy
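
The hit-aware, time-weighted balancing discussed in this thread could be sketched roughly as follows. This is an illustration only, under stated assumptions: `DecayedHitBalancer`, its methods, and the half-life parameter are hypothetical names and not HBase APIs, and the greedy "hottest region to the least-loaded server" placement is just one possible strategy, not how HBase's balancer works. Region and server names are plain strings here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: per-region hit counts with exponential time decay. */
final class DecayedHitBalancer {
    private final double halfLifeMillis;                       // assumed tuning knob
    private final Map<String, Double> score = new HashMap<>(); // score as of lastUpdate
    private final Map<String, Long> lastUpdate = new HashMap<>();

    DecayedHitBalancer(double halfLifeMillis) {
        this.halfLifeMillis = halfLifeMillis;
    }

    /** Record one read against a region at the given timestamp. */
    void recordHit(String region, long nowMillis) {
        score.put(region, decayedScore(region, nowMillis) + 1.0);
        lastUpdate.put(region, nowMillis);
    }

    /** Score decayed to nowMillis; it halves every halfLifeMillis without hits. */
    double decayedScore(String region, long nowMillis) {
        Double s = score.get(region);
        if (s == null) {
            return 0.0;
        }
        long elapsed = Math.max(0L, nowMillis - lastUpdate.get(region));
        return s * Math.pow(0.5, elapsed / halfLifeMillis);
    }

    /**
     * Greedy placement: take regions from hottest to coldest and put each one
     * on the server with the lowest accumulated score so far.
     * Assumes servers is non-empty.
     */
    Map<String, List<String>> assign(List<String> regions, List<String> servers,
                                     long nowMillis) {
        Map<String, List<String>> plan = new LinkedHashMap<>();
        Map<String, Double> load = new HashMap<>();
        for (String s : servers) {
            plan.put(s, new ArrayList<>());
            load.put(s, 0.0);
        }
        List<String> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator
                .comparingDouble((String r) -> decayedScore(r, nowMillis))
                .reversed());
        for (String r : sorted) {
            String target = servers.get(0);
            for (String s : servers) {
                if (load.get(s) < load.get(target)) {
                    target = s;
                }
            }
            plan.get(target).add(r);
            load.merge(target, decayedScore(r, nowMillis), Double::sum);
        }
        return plan;
    }
}
```

With two very active regions and many quiet ones, this placement puts the two active regions on different servers, which is the hotspot case described above. The half-life controls how quickly old traffic stops counting, so a short value reacts to fast-changing load and a long value smooths it out.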
