It'll lower it. Remember that each regionserver has a single block cache of a given size, shared by all of its regions. If you increase the region size, then you lower the cachesize/regionsize ratio, i.e. a smaller fraction of each region's data can stay cached.
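To make that ratio concrete, a back-of-the-envelope sketch (the sizes and region counts here are illustrative assumptions, not numbers from this thread):

```python
# Illustrative only: a regionserver's block cache is a fixed size,
# so growing regions shrinks the fraction of each region it can hold.
def cache_to_region_ratio(block_cache_gb, region_size_gb, regions_per_server):
    # Fraction of the server's total region data that fits in the cache.
    return block_cache_gb / (region_size_gb * regions_per_server)

# Same hypothetical server (4 GB block cache, 20 regions), two region sizes:
small = cache_to_region_ratio(4, 1, 20)   # 1 GB regions -> 0.2
large = cache_to_region_ratio(4, 5, 20)   # 5 GB regions -> 0.04
print(small, large)
```

With the same cache and region count, 5x bigger regions means only 4% of the hosted data fits in cache instead of 20%, which is why the hit ratio suffers.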
On Tue, Oct 25, 2011 at 1:53 AM, Sam Seigal <[email protected]> wrote:
> On Mon, Oct 24, 2011 at 9:22 PM, Karthik Ranganathan
> <[email protected]> wrote:
>>
>> << ...mod the hash with the number of machines I have...
>>
>> This means that the data will change with the number of machines - so all
>> your data will map to different regions if you add a new machine to your
>> cluster.
>>
>> << What I do not understand is the advantages/disadvantages of having
>> regions that are too big vs regions that are too thin.
>>
>> The disadvantage is that some regions (and consequently nodes) will have a
>> lot of data which will adversely affect things like storage (if dfs is
>> local to that node), block cache hit ratio, etc.
>
> Can you please explain a bit more on how a bigger region size will
> affect the block cache hit ratio ?
>
>> In general - per our experience using HBase, it's much more desirable to
>> disperse data up-front. If you are building indexes using MR, then you
>> probably don't need range scan ability on your keys.
>>
>> Thanks
>> Karthik
>>
>> On 10/24/11 4:48 PM, "Sam Seigal" <[email protected]> wrote:
>>
>>> According to my understanding, the way that HBase works is that on a
>>> brand new system, all keys will start going to a single region, i.e. a
>>> single region server. Once that region reaches a max region size, it
>>> will split and then move to another region server, and so on and so
>>> forth.
>>>
>>> Initially hooking up HBase to a prod system, I am concerned about this
>>> behaviour, since a clean HBase cluster is going to experience a surge
>>> of traffic all going into one region server initially. This is the
>>> motivation behind pre-defining the regions, so the initial surge of
>>> traffic is distributed evenly.
>>>
>>> My strategy is to take the incoming data, calculate the hash and then
>>> mod the hash with the number of machines I have. I will split the
>>> regions according to the prefix #.
>>> This should, I think, provide for better data distribution when the
>>> cluster first starts up with one region / region server.
>>>
>>> These regions should then grow fairly uniformly. Once they reach a
>>> size like ~5G, I can do a rolling split.
>>>
>>> Also, I want to make sure my regions do not grow so much in size that
>>> when I end up adding more machines, it takes a very long time to
>>> perform a rolling split.
>>>
>>> What I do not understand is the advantages/disadvantages of having
>>> regions that are too big vs regions that are too thin. What does this
>>> impact? Compaction time? Split time? What is the concern when it
>>> comes to how the architecture works? I think if I understand this
>>> better, I can manage my regions more efficiently.
>>>
>>> On Mon, Oct 24, 2011 at 3:23 PM, Nicolas Spiegelberg
>>> <[email protected]> wrote:
>>>> Isn't a better strategy to create the HBase keys as
>>>>
>>>> Key = hash(MySQL_key) + MySQL_key
>>>>
>>>> That way you'll know your key distribution and can add new machines
>>>> seamlessly. I'm assuming that your rows don't overlap between any 2
>>>> machines. If so, you could append the MACHINE_ID to the key (not
>>>> prepend). I don't think you want the machine # as the first dimension
>>>> on your rows, because you want the data from new machines to be
>>>> evenly spread out across the existing regions.
>>>>
>>>> On 10/24/11 9:07 AM, "Stack" <[email protected]> wrote:
>>>>
>>>>> On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal <[email protected]> wrote:
>>>>>> According to the HBase book, pre-splitting tables and doing manual
>>>>>> splits is a better long-term strategy than letting HBase handle it.
>>>>>
>>>>> It's good for getting a table off the ground, yes.
>>>>>
>>>>>> Since I do not know what the keys from the prod system are going to
>>>>>> look like, I am adding a machine number prefix to the row keys
>>>>>> and pre-splitting the tables based on the prefix (prefix 0 goes to
>>>>>> machine A, prefix 1 goes to machine B, etc.).
>>>>>
>>>>> You don't need to do an in-order scan of the data? What's the rest
>>>>> of your row key look like?
>>>>>
>>>>>> Once I decide to add more machines, I can always do a rolling split
>>>>>> and add more prefixes.
>>>>>
>>>>> Yes.
>>>>>
>>>>>> Is this a good strategy for pre-splitting the tables ?
>>>>>
>>>>> So, you'll start out with one region per server?
>>>>>
>>>>> What do you think the rate of splitting will be like? Are you using
>>>>> the default region size or have you bumped this up?
>>>>>
>>>>> St.Ack
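A minimal sketch of the prefixing scheme Sam describes in the thread — hash the incoming key, mod by the machine count, prepend that bucket number, and pre-split with one region per bucket. The helper names, hash choice (MD5), and one-byte prefix layout are my assumptions, not anything from the thread:

```python
import hashlib

NUM_BUCKETS = 8  # assumed number of machines / initial regions

def salted_key(raw_key: bytes) -> bytes:
    # Bucket = hash(key) mod number of machines, prepended as a one-byte prefix.
    bucket = int.from_bytes(hashlib.md5(raw_key).digest()[:4], "big") % NUM_BUCKETS
    return bytes([bucket]) + raw_key

def split_keys(num_buckets: int) -> list[bytes]:
    # Region boundaries between buckets 0..n-1; these would be handed to the
    # table-creation call when pre-splitting.
    return [bytes([b]) for b in range(1, num_buckets)]

print(split_keys(4))  # [b'\x01', b'\x02', b'\x03']
```

Note Karthik's caveat above: because the bucket depends on NUM_BUCKETS, changing the machine count remaps every key to a different region.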
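Nicolas's alternative — prefix with the hash itself rather than a machine-derived number — might look like the sketch below. The hash function and 4-byte prefix width are assumptions for illustration; the point is only that a key's placement never depends on cluster size:

```python
import hashlib

def hbase_key(mysql_key: bytes) -> bytes:
    # hash(MySQL_key) + MySQL_key: the fixed-width hash prefix spreads writes
    # uniformly over the keyspace, while the original key is still recoverable
    # as the suffix (and a point get can recompute the prefix from the key).
    return hashlib.md5(mysql_key).digest()[:4] + mysql_key

key = hbase_key(b"row-00042")
```

Because the prefix distribution is known and uniform up front, split points can be fixed hash-range boundaries, and data from newly added source machines spreads evenly across the existing regions, as Nicolas notes.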
