On May 26, 2012, at 9:25 AM, Michael Segel <[email protected]> wrote:
> Hi,
>
> Jumping in on this late...
>
>>>>> To cut a long story, is the region size the only current HBase
>>>>> technique to balance load, esp. w.r.t query load? Or perhaps there are
>>>>> some more advanced techniques to do that ?
>
> So maybe I'm missing something but I don't see the problem.
>
> In terms of writing data to be evenly/randomly distributed, you would hash
> the key (md5 or SHA-1 as examples).
> This works well if you're doing get()s and not a lot of scan()s.
>
> But on reads, how do you get 'hot spotting'?
>
> Should those rows be cached in memory?
>
> So what am I missing? Besides another cup of coffee?
>
> -Mike
>
> On May 25, 2012, at 1:23 PM, Ian Varley wrote:
>
>> Yeah, I think you're right Dmitriy; there's nothing like that in HBase today
>> as far as I know. If it'd be useful for you, maybe it would be for others,
>> too; work up a rough patch and see what people think on the dev list.
>>
>> Ian
>>
>> On May 25, 2012, at 1:02 PM, Dmitriy Lyubimov wrote:
>>
>>> Thanks, Ian.
>>>
>>> I am talking about the situation where, even when we have uniform keys,
>>> the query distribution over them is still non-uniform and impossible to
>>> predict without sampling query skewness, but skewness is surprisingly
>>> great (as in, the least active and most active users may differ in
>>> activity by a factor of 100, and there is no way one could know which
>>> users are going to be active and which are not). Assuming there are a
>>> few very active users but many low-activity users, if two active users
>>> land in the same region, it creates a hotspot which could have been
>>> avoided if the region balancer took into account the number of hits the
>>> regions have been getting recently.
>>>
>>> Like I pointed out before, such a skewness balancer could be fairly
>>> easily implemented externally to HBase (as in TotalOrderPartitioner),
>>> with the exception that it would interfere with HBase's own balancer,
>>> so it must be integrated with the balancer in that case.
>>>
>>> Another distinct problem is the time parameters of such a balance
>>> controller. The load may be changing fast enough or slow enough that
>>> the sampling must itself be time-weighted.
>>>
>>> All these technicalities make it difficult to implement outside
>>> HBase or by key manipulation (the dynamic nature makes it difficult to
>>> deal with re-assigning keys to match a newly discovered load
>>> distribution).
>>>
>>> OK, I guess there's nothing in HBase like that right now, otherwise I
>>> would've seen it in the book, I suppose...
>>>
>>> Thanks.
>>> -d
>>>
>>> On Fri, May 25, 2012 at 10:42 AM, Ian Varley <[email protected]> wrote:
>>>> Dmitriy,
>>>>
>>>> If I understand you right, what you're asking about might be called "Read
>>>> Hotspotting". For an obvious example, if I distribute my data nicely over
>>>> the cluster but then say:
>>>>
>>>> // long counter and L suffix: 10000000000 does not fit in an int
>>>> for (long x = 0; x < 10000000000L; x++) {
>>>>   htable.get(new Get(Bytes.toBytes("row1")));
>>>> }
>>>>
>>>> Then naturally I'm only putting read load on the region server that hosts
>>>> "row1". That's contrived, of course; you'd never really do that. But I can
>>>> imagine plenty of situations where there's an imbalance in query load
>>>> w/r/t the leading part of the row key of a table. It's not fundamentally
>>>> different from "write hotspotting", except that it's probably less common
>>>> (it happens frequently in writes because ascending data in a time series
>>>> or number sequence is a common thing to insert into a database).
>>>>
>>>> I guess the simple answer is, if you know of non-even distribution of read
>>>> patterns, it might be something to consider in a custom partitioning of
>>>> the data into regions. I don't know of any other technique (short of some
>>>> external caching mechanism) that'd alleviate this; at base, you still have
>>>> to ask exactly one RS for any given piece of data.
>>>>
>>>> Ian
>>>>
>>>> On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'd like to collect opinions from HBase experts on query uniformity,
>>>>> and on whether any advanced technique currently exists in HBase to
>>>>> cope with the problems of query uniformity beyond just maintaining a
>>>>> uniform key distribution.
>>>>>
>>>>> I know we start with the statement that in order to scale queries, we
>>>>> need them uniformly distributed over the key space. The next advice
>>>>> people get is to use a uniformly distributed key. Then, the thinking
>>>>> goes, the query load will also be uniformly distributed among regions.
>>>>>
>>>>> For what seems to be an embarrassingly long time I was missing the
>>>>> point, however, that using uniformly distributed keys does not equate
>>>>> to a uniform distribution of the queries, since it doesn't account for
>>>>> skewness of queries over the key space itself. This skewness can be
>>>>> bad enough under some circumstances to create query hot spots in the
>>>>> cluster, which could have been avoided had region splits been
>>>>> balanced based on query load rather than on data size per se (a sort
>>>>> of dynamic query distribution sampling in order to equalize the load,
>>>>> similar to how TotalOrderPartitioner does random data sampling to
>>>>> build a distribution of the key skewness in the incoming data).
>>>>>
>>>>> To cut a long story, is the region size the only current HBase
>>>>> technique to balance load, esp. w.r.t query load? Or perhaps there are
>>>>> some more advanced techniques to do that ?
>>>>>
>>>>> Thank you very much.
>>>>> -Dmitriy
>>>>
>>
>>
>
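
For reference, the key hashing Mike mentions could look roughly like the sketch below. It assumes plain string row keys and uses only the JDK. The class `HashedRowKey`, the method `hashedKey`, and the 4-byte hex prefix are illustrative choices, not part of the HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not an HBase class.
final class HashedRowKey {

    // Prefix the natural key with the first 4 bytes of its MD5 digest (hex),
    // so lexicographically adjacent keys scatter across the key space.
    static String hashedKey(String naturalKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(naturalKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                sb.append(String.format("%02x", digest[i] & 0xff));
            }
            // Keep the natural key as a suffix so it can be recovered on read.
            return sb.append('-').append(naturalKey).toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        for (String user : new String[] {"user0001", "user0002", "user0003"}) {
            System.out.println(hashedKey(user));
        }
    }
}
```

The resulting string would be passed through `Bytes.toBytes(...)` when building a `Get` or `Put`. Note that this only evens out where rows land. As Dmitriy points out, it does nothing for a handful of hot keys that happen to share a region, and it makes range scan()s over the natural key order impractical.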

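
The time-weighted sampling Dmitriy describes could be approximated with an exponentially decayed hit counter per region, sketched below. This is only an illustration of the idea, not something HBase provides. `DecayedHitCounter`, its methods, and the half-life parameter are all hypothetical, and feeding such scores into the balancer is exactly the integration work the thread says is missing.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-region hit score where older hits count for less.
final class DecayedHitCounter {
    private final double halfLifeMillis;
    // region name -> {decayed score, timestamp of last update in millis}
    private final Map<String, double[]> state = new HashMap<>();

    DecayedHitCounter(double halfLifeMillis) {
        this.halfLifeMillis = halfLifeMillis;
    }

    // Record one read against a region at the given time.
    void recordHit(String region, long nowMillis) {
        double[] s = state.computeIfAbsent(region, r -> new double[] {0.0, nowMillis});
        s[0] = decay(s[0], nowMillis - s[1]) + 1.0;
        s[1] = nowMillis;
    }

    // Current score: a hit loses half its weight every halfLifeMillis.
    double score(String region, long nowMillis) {
        double[] s = state.get(region);
        return s == null ? 0.0 : decay(s[0], nowMillis - s[1]);
    }

    private double decay(double score, double elapsedMillis) {
        // Clamp so out-of-order timestamps never inflate the score.
        double elapsed = Math.max(0.0, elapsedMillis);
        return score * Math.pow(0.5, elapsed / halfLifeMillis);
    }
}
```

A short half-life tracks fast-changing load at the cost of noisier scores; a long one smooths the signal but reacts slowly. That is the trade-off behind the "time parameters" concern raised in the thread.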