As I understand it, on a brand new HBase system all keys initially go to a single region, i.e. a single region server. Once that region reaches the maximum region size, it splits, and one of the resulting regions may then move to another region server, and so on.
When first hooking HBase up to a prod system, I am concerned about this behaviour, since a clean HBase cluster is going to see a surge of traffic that all goes to one region server. This is the motivation behind pre-defining the regions, so that the initial surge of traffic is distributed evenly.

My strategy is to take the incoming data, calculate a hash of it, and then mod the hash by the number of machines I have. I will use the result as a prefix on the row key and split the regions according to that prefix number. This should, I think, provide better data distribution when the cluster first starts up with one region per region server. These regions should then grow fairly uniformly. Once they reach a size of about 5 GB, I can do a rolling split.

I also want to make sure my regions do not grow so large that a rolling split takes a very long time when I end up adding more machines.

What I do not understand is the advantages/disadvantages of having regions that are too big vs. regions that are too small. What does this impact? Compaction time? Split time? What is the concern in terms of how the architecture works? I think if I understand this better, I can manage my regions more efficiently.

On Mon, Oct 24, 2011 at 3:23 PM, Nicolas Spiegelberg <[email protected]> wrote:
> Isn't a better strategy to create the HBase keys as
>
> Key = hash(MySQL_key) + MySQL_key
>
> That way you'll know your key distribution and can add new machines
> seamlessly. I'm assuming that your rows don't overlap between any 2
> machines. If so, you could append the MACHINE_ID to the key (not
> prepend). I don't think you want the machine # as the first dimension on
> your rows, because you want the data from new machines to be evenly spread
> out across the existing regions.
>
>
> On 10/24/11 9:07 AM, "Stack" <[email protected]> wrote:
>
>>On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal <[email protected]> wrote:
>>> According to the HBase book , pre splitting tables and doing manual
>>> splits is a better long term strategy than letting HBase handle it.
>>>
>>
>>Its good for getting a table off the ground, yes.
>>
>>
>>> Since I do not know what the keys from the prod system are going to
>>> look like , I am adding a machine number prefix to the the row keys
>>> and pre splitting the tables based on the prefix (prefix 0 goes to
>>> machine A, prefix 1 goes to machine b etc).
>>>
>>
>>You don't need to do inorder scan of the data? Whats the rest of your
>>row key look like?
>>
>>
>>> Once I decide to add more machines, I can always do a rolling split
>>> and add more prefixes.
>>>
>>
>>Yes.
>>
>>> Is this a good strategy for pre splitting the tables ?
>>>
>>
>>So, you'll start out with one region per server?
>>
>>What do you think the rate of splitting will be like? Are you using
>>default region size or have you bumped this up?
>>
>>St.Ack
>
>
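
To make the hash-mod prefix scheme discussed in this thread concrete, here is a minimal sketch in plain Python (no HBase client involved). Everything named here is an assumption for illustration only: the bucket count of 4, MD5 as the hash, the two-digit "NN-" prefix format (which limits this sketch to at most 100 buckets), and the helper names `bucket_for`, `salted_key`, `split_points` and `region_index`. It only shows how prefixed keys and the matching pre-split boundaries line up.

```python
import bisect
import hashlib

# Assumption: one bucket (and one initial region) per region server.
NUM_BUCKETS = 4


def bucket_for(key: bytes, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key to a bucket with a stable hash (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


def salted_key(key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prepend a fixed-width, two-digit bucket prefix to the original key."""
    return b"%02d-" % bucket_for(key, num_buckets) + key


def split_points(num_buckets: int = NUM_BUCKETS) -> list:
    """Boundaries between buckets: N buckets need N - 1 split keys."""
    return [b"%02d-" % i for i in range(1, num_buckets)]


def region_index(row_key: bytes, splits: list) -> int:
    """Index of the pre-split region whose key range contains row_key."""
    return bisect.bisect_right(splits, row_key)


if __name__ == "__main__":
    splits = split_points()
    for original in (b"user-1001", b"user-1002", b"order-77"):
        row = salted_key(original)
        print(row, "-> region", region_index(row, splits))
```

The list returned by `split_points` is the kind of value that would be handed to table creation as the split keys. Two trade-offs from the thread show up directly: the prefix scatters adjacent keys, so in-order scans over the original keys are lost, and changing the bucket count changes the prefix computed for existing keys, which is one reason a prefix derived from the hash itself rather than from the machine count was suggested.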
