<< ...mod the hash with the number of machines I have... >>

This means that a key's prefix (and hence its region) changes with the number of machines - so all your data will map to different regions if you add a new machine to your cluster.
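Karthik's point can be illustrated with a short sketch (Python, with made-up key names; md5 stands in for whatever hash Sam would use): when the machine count changes, most keys hash to a different prefix, so their rows would land in different regions.

```python
import hashlib

def prefix(key: str, num_machines: int) -> int:
    """Hash the key and mod by the machine count (the scheme under discussion)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_machines

keys = ["order-%d" % i for i in range(1000)]

# Prefixes with 4 machines vs. after adding a 5th.
before = {k: prefix(k, 4) for k in keys}
after = {k: prefix(k, 5) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print("%d of %d keys changed prefix" % (moved, len(keys)))
```

Going from N to N+1 machines, a key keeps its prefix only when the two mods happen to agree, which in expectation is about 1 key in N+1 - so roughly 80% of rows would be remapped here.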
<< What I do not understand is the advantages/disadvantages of having regions that are too big vs regions that are too thin. >>

The disadvantage is that some regions (and consequently nodes) will have a lot of data, which will adversely affect things like storage (if DFS is local to that node), block cache hit ratio, etc. In general - per our experience using HBase, it's much more desirable to disperse data up-front. If you are building indexes using MR, then you probably don't need range scan ability on your keys.

Thanks,
Karthik

On 10/24/11 4:48 PM, "Sam Seigal" <[email protected]> wrote:

> According to my understanding, the way that HBase works is that on a
> brand new system, all keys will start going to a single region, i.e. a
> single region server. Once that region reaches a max region size, it
> will split and then move to another region server, and so on and so
> forth.
>
> Initially hooking up HBase to a prod system, I am concerned about this
> behaviour, since a clean HBase cluster is going to experience a surge
> of traffic all going into one region server initially. This is the
> motivation behind pre-defining the regions, so the initial surge of
> traffic is distributed evenly.
>
> My strategy is to take the incoming data, calculate the hash, and then
> mod the hash with the number of machines I have. I will split the
> regions according to the prefix #. This should, I think, provide for
> better data distribution than when the cluster first starts up with
> one region / region server.
>
> These regions should then grow fairly uniformly. Once they reach a
> size like ~5G, I can do a rolling split.
>
> Also, I want to make sure my regions do not grow so much in size that
> when I end up adding more machines, it takes a very long time to
> perform a rolling split.
>
> What I do not understand is the advantages/disadvantages of having
> regions that are too big vs regions that are too thin. What does this
> impact? Compaction time? Split time?
> What is the concern when it comes to how the architecture works? I
> think if I understand this better, I can manage my regions more
> efficiently.
>
> On Mon, Oct 24, 2011 at 3:23 PM, Nicolas Spiegelberg
> <[email protected]> wrote:
>> Isn't a better strategy to create the HBase keys as
>>
>> Key = hash(MySQL_key) + MySQL_key
>>
>> That way you'll know your key distribution and can add new machines
>> seamlessly. I'm assuming that your rows don't overlap between any 2
>> machines. If so, you could append the MACHINE_ID to the key (not
>> prepend). I don't think you want the machine # as the first dimension
>> on your rows, because you want the data from new machines to be
>> evenly spread out across the existing regions.
>>
>> On 10/24/11 9:07 AM, "Stack" <[email protected]> wrote:
>>
>>> On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal <[email protected]> wrote:
>>>> According to the HBase book, pre-splitting tables and doing manual
>>>> splits is a better long-term strategy than letting HBase handle it.
>>>
>>> It's good for getting a table off the ground, yes.
>>>
>>>> Since I do not know what the keys from the prod system are going to
>>>> look like, I am adding a machine number prefix to the row keys and
>>>> pre-splitting the tables based on the prefix (prefix 0 goes to
>>>> machine A, prefix 1 goes to machine B, etc.).
>>>
>>> You don't need to do an in-order scan of the data? What does the
>>> rest of your row key look like?
>>>
>>>> Once I decide to add more machines, I can always do a rolling split
>>>> and add more prefixes.
>>>
>>> Yes.
>>>
>>>> Is this a good strategy for pre-splitting the tables?
>>>
>>> So, you'll start out with one region per server?
>>>
>>> What do you think the rate of splitting will be like? Are you using
>>> the default region size or have you bumped this up?
>>>
>>> St.Ack
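For the pre-split Sam and Stack discuss, the split points can be computed up front from the prefix set; a minimal sketch, assuming single-digit prefixes (`split_keys` is a hypothetical helper, not from the thread - in the Java API the resulting list would correspond to the `splitKeys` argument of `HBaseAdmin.createTable`):

```python
def split_keys(num_prefixes: int) -> list:
    """Split points for a table whose row keys start with a single-digit
    prefix: n prefixes need n-1 split points ('1', '2', ..., str(n-1))."""
    return [str(i).encode() for i in range(1, num_prefixes)]

# 4 prefixes -> 3 split points -> 4 regions:
# (-inf, '1'), ['1', '2'), ['2', '3'), ['3', +inf)
print(split_keys(4))  # [b'1', b'2', b'3']
```

One region per prefix means the table starts with even write distribution instead of a single hot region, which is exactly the cold-start surge Sam is worried about.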

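Nicolas's `Key = hash(MySQL_key) + MySQL_key` scheme can be sketched as follows (md5 and the two-character salt width are illustrative choices, not from the thread): the prefix depends only on the key itself, so adding machines never remaps existing rows, unlike the mod-by-machine-count scheme.

```python
import hashlib

def salted_key(mysql_key: str, salt_width: int = 2) -> str:
    """Prepend a fixed-width slice of the key's own hash, so prefixes are
    uniformly distributed and independent of cluster size."""
    digest = hashlib.md5(mysql_key.encode()).hexdigest()
    return digest[:salt_width] + mysql_key

# The prefix is a pure function of the key: re-hashing never moves a row.
```

The trade-off, as Stack's question hints, is that salting scrambles key order, so you give up in-order range scans over the original keys.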