Ian,

Understood.

Dmitriy,

Could you show a use case where you see this happening? If you have records that are being read that frequently, they would be cached in memory.

I think you could use some concept of a systems table and then, using coprocessors, you could update the table with the read patterns. There'd be a performance hit. (I don't know how much of one, but it will exist...) But it's possible...

On May 26, 2012, at 11:45 AM, Ian Varley wrote:

> Mike,
>
> I gather that Dmitriy is asking whether there are any smarts in the region
> balancer based on heavy *read* traffic (i.e. if it turns out that your read
> load is heavily skewed towards a small subset of regions). Which there
> aren't, but could be if someone wanted to write the infrastructure for it
> (which would likely be complex, as you'd have to persist information about
> read traffic somewhere other than the logs). Then read-hot regions would be
> candidates for splitting, not just based on their size but also based on
> their read traffic.
>
> Caching is relevant to help read performance, for sure, but there could still
> be scenarios where your read traffic is all stuck in one region, and even
> after all other optimizations, it still leaves one region hot and the rest
> cold.
>
> To be totally clear, Dmitriy: I think this is a pretty advanced feature
> that's not high on the overall priority list, because in such a rare
> situation you could always manually split that region.
>
> Ian
>
> On May 26, 2012, at 11:25 AM, Michael Segel wrote:
>
>> Hi,
>>
>> Jumping in on this late...
>>
>>>>>> To cut a long story, is the region size the only current HBase
>>>>>> technique to balance load, esp. w.r.t query load? Or perhaps there are
>>>>>> some more advanced techniques to do that ?
>>
>> So maybe I'm missing something but I don't see the problem.
>>
>> In terms of writing data to be evenly/randomly distributed, you would hash
>> the key (md5 or SHA-1 as examples).
>> This works well if you're doing get()s and not a lot of scan()s.
>>
>> But on reads, how do you get 'hot spotting' ?
>>
>> Should those rows be cached in memory?
>>
>> So what am I missing? Besides another cup of coffee?
>>
>> -Mike
>>
>> On May 25, 2012, at 1:23 PM, Ian Varley wrote:
>>
>>> Yeah, I think you're right Dmitriy; there's nothing like that in HBase
>>> today as far as I know. If it'd be useful for you, maybe it would be for
>>> others, too; work up a rough patch and see what people think on the dev
>>> list.
>>>
>>> Ian
>>>
>>> On May 25, 2012, at 1:02 PM, Dmitriy Lyubimov wrote:
>>>
>>>> Thanks, Ian.
>>>>
>>>> I am talking about the situation where, even when we have uniform keys, the
>>>> query distribution over them is still non-uniform and impossible to
>>>> predict without sampling query skewness, but skewness is surprisingly
>>>> great. (as in least active/most active user may differ in activity 100
>>>> times and there is no way one could know which users are going to be
>>>> active and which are going to be not active). Assuming there are few
>>>> very active users, but many low active users, if two active users get
>>>> into the same region, it creates a hotspot which could have been
>>>> avoided if the region balancer took into account the number of hits the
>>>> regions are getting recently.
>>>>
>>>> Like I pointed out before, such a skewness balancer could be fairly
>>>> easily implemented externally to hbase (as in TotalOrderPartitioner),
>>>> with the exception that it would be interfering with HBase's balancer
>>>> itself so it must be integrated with the balancer in that case.
>>>>
>>>> Also another distinct problem is the time parameters of such a balance
>>>> controller. The load may be changing fast enough or slow enough so
>>>> that sampling must be time-weighted itself.
>>>>
>>>> All these technicalities make it difficult to implement it outside
>>>> hbase or use key manipulation (as dynamic nature makes it difficult to
>>>> deal with key re-assigning to match newly discovered load
>>>> distribution).
>>>>
>>>> Ok I guess there's nothing in HBase like that right now otherwise I
>>>> would've seen it in the book I suppose...
>>>>
>>>> Thanks.
>>>> -d
>>>>
>>>> On Fri, May 25, 2012 at 10:42 AM, Ian Varley <[email protected]>
>>>> wrote:
>>>>> Dmitriy,
>>>>>
>>>>> If I understand you right, what you're asking about might be called "Read
>>>>> Hotspotting". For an obvious example, if I distribute my data nicely over
>>>>> the cluster but then say:
>>>>>
>>>>> for (long x = 0; x < 10000000000L; x++) {
>>>>>   htable.get(new Get(Bytes.toBytes("row1")));
>>>>> }
>>>>>
>>>>> Then naturally I'm only putting read load on the region server that hosts
>>>>> "row1". That's contrived, of course, you'd never really do that. But I
>>>>> can imagine plenty of situations where there's an imbalance in query load
>>>>> w/r/t the leading part of the row key of a table. It's not fundamentally
>>>>> different from "write hotspotting", except that it's probably less common
>>>>> (it happens frequently in writes because ascending data in a time series
>>>>> or number sequence is a common thing to insert into a database).
>>>>>
>>>>> I guess the simple answer is, if you know of non-even distribution of
>>>>> read patterns, it might be something to consider in a custom partitioning
>>>>> of the data into regions. I don't know of any other technique (short of
>>>>> some external caching mechanism) that'd alleviate this; at base, you
>>>>> still have to ask exactly one RS for any given piece of data.
>>>>>
>>>>> Ian
>>>>>
>>>>> On May 25, 2012, at 12:31 PM, Dmitriy Lyubimov wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'd like to collect opinions from HBase experts on the query
>>>>>> uniformity and whether any advanced technique currently exists
>>>>>> in HBase to cope with the problems of query uniformity beyond just
>>>>>> maintaining the key uniform distribution.
>>>>>>
>>>>>> I know we start with the statement that in order to scale queries, we
>>>>>> need them uniformly distributed over key space. The next advice people
>>>>>> get is to use uniformly distributed key. Then, the thinking goes, the
>>>>>> query load will also be uniformly distributed among regions.
>>>>>>
>>>>>> For what seems to be an embarrassingly long time I was missing the
>>>>>> point however that using uniformly distributed keys does not equate to
>>>>>> uniform distribution of the queries since it doesn't account for
>>>>>> skewness of queries over the key space itself. This skewness can be
>>>>>> bad enough under some circumstances to create query hot spots in the
>>>>>> cluster which could have been avoided had region splits been
>>>>>> balanced based on query loads rather than on a data size per se. (sort
>>>>>> of dynamic query distribution sampling in order to equalize the load
>>>>>> similar to how TotalOrderPartitioner does random data sampling to
>>>>>> build distribution of the key skewness in the incoming data).
>>>>>>
>>>>>> To cut a long story, is the region size the only current HBase
>>>>>> technique to balance load, esp. w.r.t query load? Or perhaps there are
>>>>>> some more advanced techniques to do that ?
>>>>>>
>>>>>> Thank you very much.
>>>>>> -Dmitriy
>>>>>
>>>
>>>
>>
>
>
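
For what it's worth, the time-weighted sampling Dmitriy describes can be sketched outside HBase in a few lines. This is only a rough illustration under stated assumptions: DecayingReadCounter and its methods are hypothetical names invented here (nothing like this exists in HBase as far as this thread establishes), region names are plain strings, and the half-life is an arbitrary tuning knob. It keeps an exponentially decayed read count per region, so recent reads weigh more than old ones, and reports the currently hottest region as a candidate for the manual split Ian mentions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not an HBase API: time-weighted read counts per region. */
final class DecayingReadCounter {
    private final double halfLifeMillis;
    private final Map<String, Double> score = new HashMap<>();
    private final Map<String, Long> lastUpdate = new HashMap<>();

    DecayingReadCounter(double halfLifeMillis) {
        if (halfLifeMillis <= 0) {
            throw new IllegalArgumentException("halfLifeMillis must be positive");
        }
        this.halfLifeMillis = halfLifeMillis;
    }

    /** Record one read against a region at the given timestamp. */
    void recordRead(String region, long nowMillis) {
        // Decay the old score to "now", then add the new hit.
        score.put(region, scoreAt(region, nowMillis) + 1.0);
        // Never move the last-update time backwards on out-of-order events.
        lastUpdate.merge(region, nowMillis, Long::max);
    }

    /** Decayed score of a region as seen at nowMillis (0 if never read). */
    double scoreAt(String region, long nowMillis) {
        Double s = score.get(region);
        if (s == null) {
            return 0.0;
        }
        long elapsed = Math.max(0L, nowMillis - lastUpdate.get(region));
        // The score halves once per elapsed half-life.
        return s * Math.pow(0.5, elapsed / halfLifeMillis);
    }

    /** Region with the highest decayed score, or null if nothing was recorded. */
    String hottest(long nowMillis) {
        String best = null;
        double bestScore = -1.0;
        for (String region : score.keySet()) {
            double s = scoreAt(region, nowMillis);
            if (s > bestScore) {
                bestScore = s;
                best = region;
            }
        }
        return best;
    }
}
```

A balancer-like process could feed this from whatever read statistics it manages to collect and compare hottest() against the other regions' scores before deciding to split. Where those statistics would come from (coprocessors, a systems table, client-side sampling) is exactly the open question in the thread, and this sketch does not answer it.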
