Fair enough.
St.Ack
On Wed, Sep 22, 2010 at 11:09 AM, Jack Levin <[email protected]> wrote:
> LZO of image data, which is already JPEG? Probably not a great idea, yes?
>
> -Jack
>
> On Wed, Sep 22, 2010 at 11:06 AM, Stack <[email protected]> wrote:
>> Are you lzo'ing, Jack? If not, you probably should.
>> St.Ack
>>
>> On Wed, Sep 22, 2010 at 3:17 AM, Jack Levin <[email protected]> wrote:
>>> So our cell sizes will be 350KB on average, with 5-10 terabytes per
>>> server; I just want to keep the count of regions under 1000 per server.
>>>
>>> -Jack
>>>
>>> On Sep 22, 2010, at 2:44 AM, Ryan Rawson <[email protected]> wrote:
>>>
>>>> Region size is one of those tricky things; there are a few factors to
>>>> consider:
>>>>
>>>> - Regions are the basic element of availability and distribution.
>>>> - HBase scales by having regions on many servers. Thus if you have
>>>> 2 regions for 16GB of data on a 20-node cluster, you are a net loss
>>>> there.
>>>> - A high region count has been known to make things slow. This is
>>>> getting better, but it is probably better to have 700 regions than
>>>> 3000 for the same amount of data.
>>>> - A low region count prevents parallel scalability, as per point #2.
>>>> This really can't be stressed enough, since a common problem is
>>>> loading 200MB of data into HBase and then wondering why your awesome
>>>> 10-node cluster is mostly idle.
>>>> - There is not much memory-footprint difference between 1 region and
>>>> 10 in terms of indexes, etc., held by the regionserver.
>>>>
>>>> Generally speaking, I stick to the default, go smaller for hot tables
>>>> (or manually split them), and go with a 1GB region size on our
>>>> largest, 900GB table.
>>>>
>>>> -ryan
>>>>
>>>> On Wed, Sep 22, 2010 at 12:01 AM, Jack Levin <[email protected]> wrote:
>>>>> Yes, I am thinking to put 10 to 15 million files on each regionserver
>>>>> (well, not literally, but controlled by the regionserver). So that's
>>>>> close to 4 TB worth of regions, which is about 4GB per region should
>>>>> we target 1000 regions per server. Note, not all files are 'hot';
>>>>> I only expect to keep about 1% super hot and 5% relatively hot, and
>>>>> the rest are cold. So in terms of keeping HBase blocks in RAM, that
>>>>> should be adequate; for the rest we can afford a trip into HDFS.
>>>>>
>>>>> If servers are running 8 GB of RAM, and are shared between
>>>>> regionservers and datanodes, how much heap should I allocate to each?
>>>>> 6GB for RS and 1GB for DN?
>>>>>
>>>>> Also, on the question of whether 8 cores x 16GB of RAM help a Master
>>>>> server bring up the cluster faster, the answer is definitely yes. It
>>>>> took only 90 seconds to load 5000 regions across 13 servers, where
>>>>> the same task took nearly 10 minutes on a dual-core, 8GB RAM machine.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Tue, Sep 21, 2010 at 11:38 PM, Stack <[email protected]> wrote:
>>>>>> On Tue, Sep 21, 2010 at 11:11 PM, Jack Levin <[email protected]> wrote:
>>>>>>> It's definitely binary, and I can even load it in my browser by
>>>>>>> setting appropriate headers. So I guess for PUT and GET via Accept:
>>>>>>> application/octet-stream there is no base64 encoding at all.
>>>>>>
>>>>>> OK. Good. If it were base64'd, you'd see it.
>>>>>>
>>>>>>> Btw, out of curiosity, I have the region max file size set to 1GB
>>>>>>> now, but what if I set it to, say, 10G or 50G? Is there significant
>>>>>>> overhead in address seeking via HDFS?
>>>>>>
>>>>>> You could do that. We don't have much experience running regions of
>>>>>> that size. You should for sure pre-split your table on creation if
>>>>>> you go this route (see the HBaseAdmin API [1]; this method is not
>>>>>> available in the shell, so you'd have to script it or write a little
>>>>>> Java to do it).
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> 1. http://hbase.apache.org/docs/r0.89.20100726/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
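
As a concrete illustration of the pre-split Stack recommends above, here is a
minimal sketch against the 0.89/0.90-era Java client, using the
createTable(HTableDescriptor, byte[][]) overload from [1]. The table name
("images"), family ("img"), region size, and split keys are all hypothetical;
real split points should match your row-key distribution.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Hypothetical table layout; adjust names to taste.
    HTableDescriptor desc = new HTableDescriptor("images");
    // Going the big-region route: raise the split threshold to ~10G
    // (the per-table equivalent of hbase.hregion.max.filesize).
    desc.setMaxFileSize(10L * 1024 * 1024 * 1024);
    desc.addFamily(new HColumnDescriptor("img"));

    // Seven split keys yield eight initial regions. These keys assume
    // rows with an evenly distributed hex-ish prefix; pick keys that
    // match your own key space.
    byte[][] splits = new byte[][] {
        Bytes.toBytes("2"), Bytes.toBytes("4"), Bytes.toBytes("6"),
        Bytes.toBytes("8"), Bytes.toBytes("a"), Bytes.toBytes("c"),
        Bytes.toBytes("e"),
    };
    admin.createTable(desc, splits);
  }
}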

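Similarly, since the thread touches on fetching binary cells through the REST
gateway with Accept: application/octet-stream, here is a quick sketch of what
that looks like from Java. The gateway host, port, table, row, and column
names are made up for illustration.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestBinaryGet {
  public static void main(String[] args) throws Exception {
    // Hypothetical REST gateway host/port and table/row/column.
    URL url = new URL("http://gateway.example.com:8080/images/row-00042/img:data");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    // Asking for octet-stream returns the raw cell value; the XML/JSON
    // representations base64-encode it, which is how you would spot
    // the difference Jack describes.
    conn.setRequestProperty("Accept", "application/octet-stream");

    InputStream in = conn.getInputStream();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    for (int n; (n = in.read(buf)) != -1; ) {
      out.write(buf, 0, n);
    }
    in.close();
    System.out.println("Fetched " + out.size() + " bytes, Content-Type: "
        + conn.getContentType());
  }
}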