Region size is one of those tricky things. There are a few factors to consider:
- Regions are the basic element of availability and distribution.
- HBase scales by having regions across many servers. Thus if you have
  2 regions for 16GB of data on a 20 node cluster, you are at a net
  loss there.
- High region count has been known to make things slow. This is getting
  better, but it is probably better to have 700 regions than 3000 for
  the same amount of data.
- Low region count prevents parallel scalability, as per point #2. This
  really can't be stressed enough, since a common problem is loading
  200MB of data into HBase and then wondering why your awesome 10 node
  cluster is mostly idle.
- There is not much memory footprint difference between 1 region and 10
  in terms of indexes, etc., held by the regionserver.

Generally speaking I stick to the default, go smaller for hot tables
(or manually split them), and go with a 1GB region size on our largest
900 GB table.

-ryan

On Wed, Sep 22, 2010 at 12:01 AM, Jack Levin <[email protected]> wrote:
> Yes, I am thinking to put 10 to 15 million files on each regionserver
> (well, not literally, but controlled by a regionserver). So that's
> close to 4 TB worth of regions, which is about 4GB per region should
> we target 1000 regions per server. Note, not all files are 'hot', and
> I only expect to keep about 1% super hot and 5% relatively hot; the
> rest are cold. So in terms of keeping hbase blocks in RAM, that
> should be adequate; for the rest we can afford a trip into hdfs.
>
> If servers are running 8 GB of RAM, and are shared by regionservers
> and datanodes, how much heap should I allocate to each? 6GB for RS
> and 1GB for DN?
>
> Also, on the question of whether 8 core x 16G RAM helps a Master server
> bring up the cluster faster, the answer is definitely yes. It
> took only 90 seconds to load 5000 regions across 13 servers, where
> the same task on Dual Core 8G RAM took nearly 10 minutes.
>
> -Jack
>
> On Tue, Sep 21, 2010 at 11:38 PM, Stack <[email protected]> wrote:
>> On Tue, Sep 21, 2010 at 11:11 PM, Jack Levin <[email protected]> wrote:
>>> It's definitely binary, and I can even load it in my browser by
>>> setting appropriate headers. So I guess for PUT and GET via Accept:
>>> application/octet-stream there is no base64 encoding at all.
>>>
>>
>> OK. Good. If it were base64'd, you'd see it.
>>
>>> Btw, out of curiosity I have region max file size set to 1GB now, but
>>> what if I set it to say 10G or 50G? Is there significant overhead in
>>> address seeking via HDFS?
>>>
>>
>> You could do that. We don't have much experience running regions of
>> that size. You should for sure pre-split your table on creation if
>> you go this route (see HBaseAdmin API [1]). This method is not
>> available in shell so you'd have to script it or write a little Java
>> to do it.
>>
>> St.Ack
>>
>> 1.
>> http://hbase.apache.org/docs/r0.89.20100726/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
>>
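
For anyone who wants to play with the numbers in this thread, here is a rough sketch in plain Java of the region-count arithmetic and of building split keys for a pre-split table. It is only an illustration under some assumptions: the class and method names (RegionSplitPlan, regionCount, evenSplitKeys) are made up for this example and are not part of HBase, sizes use binary units (1 GB = 2^30 bytes), and the split keys assume row keys are spread roughly evenly over their first two bytes. The resulting byte[][] has the shape expected by the createTable(HTableDescriptor, byte[][]) method linked as [1] in the quoted mail; the actual call is left out so the sketch runs without HBase on the classpath.

```java
/** Hypothetical helper for illustration only; not part of the HBase API. */
final class RegionSplitPlan {

    /** Number of regions needed to hold dataBytes at the given max region size (rounded up). */
    static long regionCount(long dataBytes, long maxRegionBytes) {
        if (dataBytes < 0 || maxRegionBytes <= 0) {
            throw new IllegalArgumentException("sizes must be non-negative / positive");
        }
        return (dataBytes + maxRegionBytes - 1) / maxRegionBytes;
    }

    /**
     * Builds numRegions - 1 two-byte split keys, evenly spaced over the
     * 0x0000..0xFFFF key-prefix space. Assumes (hypothetically) that row keys
     * are uniformly distributed on their first two bytes.
     */
    static byte[][] evenSplitKeys(int numRegions) {
        if (numRegions < 1 || numRegions > 65536) {
            throw new IllegalArgumentException("numRegions must be in 1..65536");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            int boundary = (int) ((65536L * i) / numRegions);
            splits[i - 1] = new byte[] { (byte) (boundary >>> 8), (byte) boundary };
        }
        return splits;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        long tb = 1L << 40;

        // Jack's plan: about 4 TB per regionserver at 4 GB per region.
        System.out.println("regions per server: " + regionCount(4 * tb, 4 * gb));
        // Ryan's largest table: 900 GB at a 1 GB region size.
        System.out.println("regions for 900 GB table: " + regionCount(900 * gb, gb));

        // Split keys for a table pre-split into 4 regions. With HBase available,
        // this array would be passed as the second argument of createTable.
        for (byte[] key : evenSplitKeys(4)) {
            System.out.println(String.format("split key: %02x%02x", key[0] & 0xff, key[1] & 0xff));
        }
    }
}
```

The point of the first method is just to make the tradeoff visible: bumping the max region size cuts the region count per server, but a small table still ends up with too few regions to spread across the cluster unless it is split up front.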
