Our cells will average about 350 KB, with 5-10 TB per server. I just
want to keep the region count under 1000 per server.
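
To put rough numbers on that, here is a minimal sketch of the
arithmetic. The class and method names are made up for illustration,
and it assumes binary units (1 TB = 1024 GB) and data spread evenly
across regions, which real tables will only approximate.

```java
public class RegionSizing {
    public static final long KB = 1024L;
    public static final long GB = 1024L * 1024L * 1024L;
    public static final long TB = 1024L * GB;

    /** Smallest region size (bytes) that keeps the count at or below maxRegions. */
    public static long minRegionSizeBytes(long dataBytesPerServer, long maxRegions) {
        // Ceiling division, so the resulting region count never exceeds maxRegions.
        return (dataBytesPerServer + maxRegions - 1) / maxRegions;
    }

    /** Regions needed to hold the data at a given maximum region size. */
    public static long regionCount(long dataBytesPerServer, long regionSizeBytes) {
        return (dataBytesPerServer + regionSizeBytes - 1) / regionSizeBytes;
    }

    public static void main(String[] args) {
        long cellBytes = 350 * KB; // average cell size from this thread
        for (long tb : new long[] {5, 10}) {
            long data = tb * TB;
            long size = minRegionSizeBytes(data, 1000);
            System.out.printf(
                "%d TB/server: region size >= %.2f GB, about %d cells per region%n",
                tb, (double) size / GB, size / cellBytes);
        }
        // For comparison: the same 5 TB at a 1 GB region size.
        System.out.println("5 TB at 1 GB regions: " + regionCount(5 * TB, GB) + " regions");
    }
}
```

So staying under 1000 regions at 5-10 TB per server means a region
size of roughly 5 to 10 GB, well above the 1GB I have set now.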

-Jack
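
P.S. On the pre-split suggestion further down the thread, this is a
rough sketch of how the split keys might be generated. It assumes row
keys are spread uniformly over their first two bytes (for example,
hashed keys), which is an assumption on my part and not something
settled here. PreSplitKeys and evenSplits are hypothetical names. Only
the key generation is shown; the resulting array is the byte[][]
argument of the createTable(HTableDescriptor, byte[][]) method linked
below.

```java
import java.util.Arrays;

public class PreSplitKeys {
    /**
     * Returns numRegions - 1 split keys that divide the two-byte prefix
     * space 0x0000..0xFFFF into numRegions roughly equal ranges.
     */
    public static byte[][] evenSplits(int numRegions) {
        if (numRegions < 2 || numRegions > 65536) {
            throw new IllegalArgumentException("numRegions must be in [2, 65536]");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            // Boundary i of numRegions, scaled to the 16-bit prefix space.
            int boundary = (int) ((65536L * i) / numRegions);
            splits[i - 1] = new byte[] {(byte) (boundary >>> 8), (byte) boundary};
        }
        return splits;
    }

    public static void main(String[] args) {
        // Java prints bytes as signed values; HBase compares row keys as unsigned bytes.
        for (byte[] key : evenSplits(4)) {
            System.out.println(Arrays.toString(key));
        }
        // With the HBase client on the classpath, the array would be passed as
        // the second argument of HBaseAdmin.createTable(HTableDescriptor, byte[][]).
    }
}
```

If the keys are not uniformly distributed, evenly spaced boundaries
would leave some regions much larger than others, so the split points
would have to come from the real key distribution instead.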


On Sep 22, 2010, at 2:44 AM, Ryan Rawson <[email protected]> wrote:

> Region size is one of those tricky things. There are a few factors to
> consider:
> 
> - regions are the basic element of availability and distribution.
> - HBase scales by having regions across many servers.  Thus if you
> have 2 regions for 16GB of data on a 20 node cluster, you are at a
> net loss there.
> - High region count has been known to make things slow. This is
> getting better, but it is probably better to have 700 regions than
> 3000 for the same amount of data.
> - Low region count prevents parallel scalability as per point #2.
> This really can't be stressed enough, since a common problem is
> loading 200MB of data into HBase then wondering why your awesome 10
> node cluster is mostly idle.
> - There is not much memory footprint difference between 1 region and
> 10 in terms of indexes, etc, held by the regionserver.
> 
> Generally speaking I stick to the default, go smaller for hot tables,
> or manually split them, and go with a 1GB region size on our largest
> 900 GB table.
> 
> -ryan
> 
> On Wed, Sep 22, 2010 at 12:01 AM, Jack Levin <[email protected]> wrote:
>> Yes, I am thinking of putting 10 to 15 million files on each
>> regionserver (well, not literally, but controlled by the
>> regionserver).  So that's close to 4 TB worth of regions, which is
>> about 4GB per region if we target 1000 regions per server.  Note, not
>> all files are 'hot': I only expect about 1% to be super hot and 5%
>> relatively hot; the rest are cold.  So in terms of keeping hbase
>> blocks in RAM, that should be adequate, and for the rest we can
>> afford a trip into hdfs.
>> 
>> If servers are running 8 GB of RAM, and are shared between
>> regionservers and datanodes, how much heap should I allocate to each?
>> 6GB for RS and 1GB for DN?
>> 
>> Also, on the question of whether 8 cores x 16GB RAM helps a Master
>> server bring up the cluster faster, the answer is definitely yes.  It
>> took only 90 seconds to load 5000 regions across 13 servers, where
>> the same task on a dual core with 8GB RAM took nearly 10 minutes.
>> 
>> -Jack
>> 
>> 
>> 
>> On Tue, Sep 21, 2010 at 11:38 PM, Stack <[email protected]> wrote:
>>> On Tue, Sep 21, 2010 at 11:11 PM, Jack Levin <[email protected]> wrote:
>>>> It's definitely binary, and I can even load it in my browser by
>>>> setting appropriate headers.  So I guess for PUT and GET via Accept:
>>>> application/octet-stream there is no base64 encoding at all.
>>>> 
>>> 
>>> OK.  Good.  If it were base64'd, you'd see it.
>>> 
>>>> Btw, out of curiosity I have region max file size set to 1GB now, but
>>>> what if I set it to say 10G or 50G?  Is there significant overhead in
>>>> address seeking via HDFS?
>>>> 
>>> 
>>> You could do that.  We don't have much experience running regions of
>>> that size.  You should for sure pre-split your table on creation if
>>> you go this route (see HBaseAdmin API [1]).  This method is not
>>> available in the shell, so you'd have to script it or write a little
>>> Java to do it.
>>> 
>>> St.Ack
>>> 
>>> 1. http://hbase.apache.org/docs/r0.89.20100726/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
>>> 
>> 
