Re: pre splitting tables

Sam Seigal Mon, 24 Oct 2011 16:49:23 -0700

As I understand it, on a brand new HBase table all keys initially go
to a single region, i.e. a single region server. Once that region
reaches the max region size, it splits, and one of the resulting
regions can then move to another region server, and so on.

When first hooking HBase up to a prod system, I am concerned about this
behaviour, since a clean HBase cluster is going to experience a surge
of traffic that all goes into one region server.
This is the motivation behind pre-defining the regions, so that the
initial surge of traffic is distributed evenly.

My strategy is to take the incoming data, calculate a hash of the key,
and then mod the hash by the number of machines I have. I will prefix
each row key with the result and split the regions according to that
prefix number.
This should, I think, provide better data distribution when the
cluster first starts up with one region per region server.
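
To make the idea concrete, here is a rough sketch in Python of what I
mean. It is only an illustration: the function names, the use of MD5,
the "-" separator and the zero-padded decimal prefix are all my own
assumptions, not anything HBase prescribes. The split points it
produces would be passed to whatever call creates the table.

```python
import hashlib

def salt_prefix(row_key, num_buckets):
    """Hash the original key and mod by the number of buckets (machines)."""
    digest = hashlib.md5(row_key).digest()  # MD5 is an arbitrary choice here
    return int.from_bytes(digest[:4], "big") % num_buckets

def salted_key(row_key, num_buckets):
    """Prepend the zero-padded bucket number to the original key."""
    width = len(str(num_buckets - 1))  # fixed width keeps prefixes sortable
    prefix = str(salt_prefix(row_key, num_buckets)).zfill(width)
    return prefix.encode("ascii") + b"-" + row_key

def split_points(num_buckets):
    """Region boundaries: one split key per bucket after the first."""
    width = len(str(num_buckets - 1))
    return [str(i).zfill(width).encode("ascii") for i in range(1, num_buckets)]

if __name__ == "__main__":
    # Example with 4 machines (hypothetical): 3 split keys give 4 regions.
    print(split_points(4))
    print(salted_key(b"some-row-key", 4))
```

With 4 buckets this yields three split keys, i.e. four initial regions,
and every salted key sorts into the region that matches its prefix.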

These regions should then grow fairly uniformly. Once they reach a
size like ~ 5G, I can do a rolling split.

Also, I want to make sure my regions do not grow so large that, when I
end up adding more machines, a rolling split takes a very long time
to perform.

What I do not understand is the advantages/disadvantages of having
regions that are too big vs. regions that are too small. What does
region size impact? Compaction time? Split time? What is the
concern when it comes to how the architecture works? I think if I
understand this better, I can manage my regions more efficiently.
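
For comparison, here is how I read the alternative suggested in the
quoted reply below (Key = hash(MySQL_key) + MySQL_key), again as a rough
Python sketch. The MD5 hex digest, the function names and the choice of
splitting on the first 8 hex characters are my assumptions for
illustration only.

```python
import bisect
import hashlib

def hashed_key(source_key):
    """Key = hash(source_key) + source_key, with an MD5 hex digest as the hash."""
    return hashlib.md5(source_key).hexdigest().encode("ascii") + source_key

def uniform_split_points(num_regions):
    """Evenly spaced boundaries over the first 32 bits of the hash space."""
    step = 2**32 // num_regions
    return [format(i * step, "08x").encode("ascii") for i in range(1, num_regions)]

def region_index(key, splits):
    """Index of the region a key falls into, given sorted split keys."""
    return bisect.bisect_right(splits, key)

if __name__ == "__main__":
    splits = uniform_split_points(4)
    key = hashed_key(b"12345")
    print(splits, key, region_index(key, splits))
```

Because the split points depend only on the hash space and not on the
number of machines, adding a machine later would not change how
existing keys are laid out.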

On Mon, Oct 24, 2011 at 3:23 PM, Nicolas Spiegelberg
<[email protected]> wrote:
> Isn't a better strategy to create the HBase keys as
>
> Key = hash(MySQL_key) + MySQL_key
>
> That way you'll know your key distribution and can add new machines
> seamlessly.  I'm assuming that your rows don't overlap between any 2
> machines.  If so, you could append the MACHINE_ID to the key (not
> prepend).  I don't think you want the machine # as the first dimension on
> your rows, because you want the data from new machines to be evenly spread
> out across the existing regions.
>
>
> On 10/24/11 9:07 AM, "Stack" <[email protected]> wrote:
>
>>On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal <[email protected]> wrote:
>>> According to the HBase book, pre-splitting tables and doing manual
>>> splits is a better long-term strategy than letting HBase handle it.
>>>
>>
>>It's good for getting a table off the ground, yes.
>>
>>
>>> Since I do not know what the keys from the prod system are going to
>>> look like, I am adding a machine number prefix to the row keys
>>> and pre-splitting the tables based on the prefix (prefix 0 goes to
>>> machine A, prefix 1 goes to machine B, etc.).
>>>
>>
>>You don't need to do an in-order scan of the data?  What does the rest
>>of your row key look like?
>>
>>
>>> Once I decide to add more machines, I can always do a rolling split
>>> and add more prefixes.
>>>
>>
>>Yes.
>>
>>> Is this a good strategy for pre-splitting the tables?
>>>
>>
>>So, you'll start out with one region per server?
>>
>>What do you think the rate of splitting will be like?  Are you using
>>default region size or have you bumped this up?
>>
>>St.Ack
>
>
