It'll lower it. Remember that each regionserver has a single block cache of a given size, shared by all of its regions. If you increase the region size, then you lower the cachesize/regionsize ratio, i.e. a smaller fraction of each region's data can stay cached.
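To make that ratio concrete, a back-of-the-envelope sketch (the sizes and region counts here are illustrative assumptions, not numbers from this thread):

```python
# Illustrative only: a regionserver's block cache is a fixed size,
# so growing regions shrinks the fraction of each region it can hold.
def cache_to_region_ratio(block_cache_gb, region_size_gb, regions_per_server):
    # Fraction of the server's total region data that fits in the cache.
    return block_cache_gb / (region_size_gb * regions_per_server)

# Same hypothetical server (4 GB block cache, 20 regions), two region sizes:
small = cache_to_region_ratio(4, 1, 20)   # 1 GB regions -> 0.2
large = cache_to_region_ratio(4, 5, 20)   # 5 GB regions -> 0.04
print(small, large)
```

With the same cache and region count, 5x bigger regions means only 4% of the hosted data fits in cache instead of 20%, which is why the hit ratio suffers.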
On Tue, Oct 25, 2011 at 1:53 AM, Sam Seigal <[email protected]> wrote:
> On Mon, Oct 24, 2011 at 9:22 PM, Karthik Ranganathan
> <[email protected]> wrote:
>>
>> << ...mod the hash with the number of machines I have...
>>
>> This means that the data will change with the number of machines - so all
>> your data will map to different regions if you add a new machine to your
>> cluster.
>>
>> << What I do not understand is the advantages/disadvantages of having
>> regions that are too big vs regions that are too thin.
>>
>> The disadvantage is that some regions (and consequently nodes) will have a
>> lot of data which will adversely affect things like storage (if dfs is
>> local to that node), block cache hit ratio, etc.
>
> Can you please explain a bit more on how a bigger region size will
> affect the block cache hit ratio ?
>
>> In general - per our experience using HBase, it's much more desirable to
>> disperse data up-front. If you are building indexes using MR, then you
>> probably don't need range scan ability on your keys.
>>
>> Thanks
>> Karthik
>>
>> On 10/24/11 4:48 PM, "Sam Seigal" <[email protected]> wrote:
>>
>>> According to my understanding, the way that HBase works is that on a
>>> brand new system, all keys will start going to a single region, i.e. a
>>> single region server. Once that region reaches a max region size, it
>>> will split and then move to another region server, and so on and so
>>> forth.
>>>
>>> Initially hooking up HBase to a prod system, I am concerned about this
>>> behaviour, since a clean HBase cluster is going to experience a surge
>>> of traffic all going into one region server initially. This is the
>>> motivation behind pre-defining the regions, so the initial surge of
>>> traffic is distributed evenly.
>>>
>>> My strategy is to take the incoming data, calculate the hash and then
>>> mod the hash with the number of machines I have. I will split the
>>> regions according to the prefix #.
>>> This should, I think, provide for better data distribution when the
>>> cluster first starts up with one region / region server.
>>>
>>> These regions should then grow fairly uniformly. Once they reach a
>>> size like ~5G, I can do a rolling split.
>>>
>>> Also, I want to make sure my regions do not grow so much in size that
>>> when I end up adding more machines, it takes a very long time to
>>> perform a rolling split.
>>>
>>> What I do not understand is the advantages/disadvantages of having
>>> regions that are too big vs regions that are too thin. What does this
>>> impact? Compaction time? Split time? What is the concern when it
>>> comes to how the architecture works? I think if I understand this
>>> better, I can manage my regions more efficiently.
>>>
>>> On Mon, Oct 24, 2011 at 3:23 PM, Nicolas Spiegelberg
>>> <[email protected]> wrote:
>>>> Isn't a better strategy to create the HBase keys as
>>>>
>>>> Key = hash(MySQL_key) + MySQL_key
>>>>
>>>> That way you'll know your key distribution and can add new machines
>>>> seamlessly. I'm assuming that your rows don't overlap between any 2
>>>> machines. If so, you could append the MACHINE_ID to the key (not
>>>> prepend). I don't think you want the machine # as the first dimension
>>>> on your rows, because you want the data from new machines to be
>>>> evenly spread out across the existing regions.
>>>>
>>>> On 10/24/11 9:07 AM, "Stack" <[email protected]> wrote:
>>>>
>>>>> On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal <[email protected]> wrote:
>>>>>> According to the HBase book, pre-splitting tables and doing manual
>>>>>> splits is a better long-term strategy than letting HBase handle it.
>>>>>
>>>>> It's good for getting a table off the ground, yes.
>>>>>
>>>>>> Since I do not know what the keys from the prod system are going to
>>>>>> look like, I am adding a machine number prefix to the row keys
>>>>>> and pre-splitting the tables based on the prefix (prefix 0 goes to
>>>>>> machine A, prefix 1 goes to machine B, etc.).
>>>>>
>>>>> You don't need to do an in-order scan of the data? What's the rest
>>>>> of your row key look like?
>>>>>
>>>>>> Once I decide to add more machines, I can always do a rolling split
>>>>>> and add more prefixes.
>>>>>
>>>>> Yes.
>>>>>
>>>>>> Is this a good strategy for pre-splitting the tables ?
>>>>>
>>>>> So, you'll start out with one region per server?
>>>>>
>>>>> What do you think the rate of splitting will be like? Are you using
>>>>> the default region size or have you bumped this up?
>>>>>
>>>>> St.Ack
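A minimal sketch of the prefixing scheme Sam describes in the thread — hash the incoming key, mod by the machine count, prepend that bucket number, and pre-split with one region per bucket. The helper names, hash choice (MD5), and one-byte prefix layout are my assumptions, not anything from the thread:

```python
import hashlib

NUM_BUCKETS = 8  # assumed number of machines / initial regions

def salted_key(raw_key: bytes) -> bytes:
    # Bucket = hash(key) mod number of machines, prepended as a one-byte prefix.
    bucket = int.from_bytes(hashlib.md5(raw_key).digest()[:4], "big") % NUM_BUCKETS
    return bytes([bucket]) + raw_key

def split_keys(num_buckets: int) -> list[bytes]:
    # Region boundaries between buckets 0..n-1; these would be handed to the
    # table-creation call when pre-splitting.
    return [bytes([b]) for b in range(1, num_buckets)]

print(split_keys(4))  # [b'\x01', b'\x02', b'\x03']
```

Note Karthik's caveat above: because the bucket depends on NUM_BUCKETS, changing the machine count remaps every key to a different region.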
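Nicolas's alternative — prefix with the hash itself rather than a machine-derived number — might look like the sketch below. The hash function and 4-byte prefix width are assumptions for illustration; the point is only that a key's placement never depends on cluster size:

```python
import hashlib

def hbase_key(mysql_key: bytes) -> bytes:
    # hash(MySQL_key) + MySQL_key: the fixed-width hash prefix spreads writes
    # uniformly over the keyspace, while the original key is still recoverable
    # as the suffix (and a point get can recompute the prefix from the key).
    return hashlib.md5(mysql_key).digest()[:4] + mysql_key

key = hbase_key(b"row-00042")
```

Because the prefix distribution is known and uniform up front, split points can be fixed hash-range boundaries, and data from newly added source machines spreads evenly across the existing regions, as Nicolas notes.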
