Please note that if you have a fast-growing datastore and you limit the number of regions, you may end up with very large region files. If that happens (and you can tell by simply examining your HDFS), your compactions (which you can't avoid) will end up rewriting a lot of data. In our case (a 600TB HBase cluster for storing photos), we keep our region size at about 10-15GB and no more. Anything larger will overdrive your disk IO during compactions.
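(The 10-15GB cap above maps to HBase's split threshold. A minimal hbase-site.xml sketch, assuming the standard `hbase.hregion.max.filesize` property; the 10 GB value is illustrative, not a tuned recommendation:)

```xml
<!-- Illustrative sketch for hbase-site.xml, not a tuned recommendation.
     A region is split once its largest store file exceeds this size in
     bytes; 10 GB matches the low end of the 10-15 GB range above. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>10737418240</value>
</property>
```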
On Fri, Jul 6, 2012 at 2:22 PM, Jean-Daniel Cryans <[email protected]> wrote:
> Hey Nick,
>
> I'd say it has nothing to do with spindles/CPUs. The reason is that locking happens
> only at the row level, so having 1 or 100 regions doesn't change anything; and you
> only have 1 HLog to write to (currently at least), so spindles don't really count
> either. It might influence the number of threads that can compact, though.
>
> I gave some explanations at the HBase BoF we did after the Hadoop Summit [1]. From
> the writes POV, you want to have as much MemStore potential as you have dedicated
> heap for it, and the HLog needs to match that too. One region can make sense for
> that, except for the issues I demonstrated, and it's also harder to balance.
> Instead, having a few regions will give you more agility. For the sake of giving a
> number I'd say 10-20, but then you need to tune your memstore size accordingly.
> This really is over-optimization, though, and it's not even something we put in
> practice here since we have so many different use cases.
>
> The cluster size, in my opinion, doesn't influence the number of regions you should
> have per machine; as long as you keep it low it will be fine.
>
> Hope this helps,
>
> J-D
>
> 1. http://www.slideshare.net/jdcryans/performance-bof12
>
> On Fri, Jul 6, 2012 at 1:53 PM, Nick Dimiduk <[email protected]> wrote:
>> Heya,
>>
>> I'm looking for more detailed advice about how many regions a table should run.
>> Disabling automatic splits (often hand-in-hand with disabling automatic
>> compactions) is often described as advanced practice, at least when guaranteeing
>> latency SLAs. Which raises the question: how many regions should I have? Surely
>> this depends on both the shape of your data and the expected workload. I've seen
>> "10-20 regions per RS" thrown around as a stock answer. My question is: why?
>> Presumably that's 10-20 regions per RS across all tables rather than per table.
>> That advice is centered around a regular region size, but surely the distribution
>> of ops/load matters more. Still, where does 10-20 come from? Is it a calculation
>> against the number of cores on the RS, like the advice given around parallelizing
>> builds? If so, how many cores are we assuming the RS has? Is it a calculation
>> against the amount of RAM available? Is 20 regions based on a trade-off between
>> static allocations and per-region memory overhead? Does 10-20 become 5-15 in a
>> memory-restricted environment and bump to 20-40 when more RAM is available? Does
>> it have to do with the number of spindles available on the machine? Threads like
>> this one [0] give some hint about how the big players work. However, that advice
>> looks heavily influenced by concerns that arise when there are 1000s of regions
>> to manage. How does advice for larger clusters (>50 nodes) differ from smaller
>> clusters (<20 nodes)?
>>
>> Thanks,
>> -n
>>
>> [0]: http://thread.gmane.org/gmane.comp.java.hadoop.hbase.user/22451
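(J-D's "match MemStore potential to dedicated heap" reasoning can be made concrete with back-of-the-envelope arithmetic. A minimal sketch, assuming the stock defaults of that era — roughly 40% of heap reserved for MemStores and a 128 MB memstore flush size per region; the numbers are illustrative, not recommendations:)

```python
# Rough estimate of how many actively-written regions a region server can
# sustain before global MemStore pressure forces premature flushes.
# All parameter defaults below are assumptions for illustration.

def max_active_regions(heap_gb, memstore_fraction=0.4,
                       flush_size_mb=128, column_families=1):
    """Heap budget reserved for MemStores, divided by how large each
    region's MemStore(s) may grow before a normal flush triggers."""
    memstore_budget_mb = heap_gb * 1024 * memstore_fraction
    return int(memstore_budget_mb // (flush_size_mb * column_families))

print(max_active_regions(10))                      # 10 GB heap -> 32
print(max_active_regions(10, column_families=2))   # two CFs -> 16
```

(With a 10 GB heap this lands in the tens of regions, which is roughly where the 10-20 figure comes from once you leave headroom; raising the flush size, as J-D suggests, is how you tune for fewer regions.)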
