Yes, what you described, mod(hash(rowkey), n) where n is the number of regions,
will remove the hotspotting issue.
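
To make that concrete, here is a minimal sketch of the bucketing in Python. The function name bucketed_key, the '_' separator, and the zero-padding of the prefix are my own assumptions for illustration; nothing in HBase requires them. Because the prefix is derived from the key itself, a reader can recompute it and still fetch the row with a single get().

```python
import hashlib

def bucketed_key(rowkey: str, n_buckets: int) -> str:
    """Prefix rowkey with mod(hash(rowkey), n_buckets).

    Hypothetical helper: the '_' separator and the zero-padded prefix
    are illustrative choices, not an HBase convention.
    """
    digest = hashlib.md5(rowkey.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    bucket = int.from_bytes(digest[:8], "big") % n_buckets
    # Pad so that all prefixes have the same width and sort consistently.
    width = len(str(n_buckets - 1))
    return f"{bucket:0{width}d}_{rowkey}"

print(bucketed_key("1234foobar", 2000))
```

The same input always yields the same prefix, which is what separates this from a random salt.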

However, if your key is sequential, your regions will be only half full after
a region split.

Look at it this way… 

If I have a key that is a sequential count 1,2,3,4,5 … I am always adding a new
row to the last region and it's always being added to the right (reading left
to right). Always at the end of the line…

So if I have 10,000 rows and I split the region… region 1 has 1 to 5,000 and
region 2 has 5,001 to 10,000.

Now my next row is 10001, the following is 10002 … so they will be added at the 
tail end of region 2 until it splits.  (And so on, and so on…) 
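
Here is a toy simulation of that, written in Python. It is not HBase code. The function simulate and the "split in half at a fixed row count" rule are simplifying assumptions (real splits are driven by store file size), but the pattern is the same.

```python
def simulate(num_rows, max_rows_per_region):
    """Insert keys 1..num_rows in order; split a region in half when full.

    Toy model: a region is a sorted list of keys and splits at a fixed
    row count. Returns the number of rows in each region, in key order.
    """
    regions = [[]]
    for key in range(1, num_rows + 1):
        # A sequential key is larger than every existing key,
        # so it always lands in the last region.
        regions[-1].append(key)
        if len(regions[-1]) >= max_rows_per_region:
            full = regions.pop()
            mid = len(full) // 2
            regions.extend([full[:mid], full[mid:]])
    return [len(r) for r in regions]

print(simulate(32000, 10000))
# [5000, 5000, 5000, 5000, 5000, 7000]
```

Every region except the last one is stuck at half of the split threshold, because nothing is ever written to it again.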

If you take a modulus of the hash, you create n buckets. Again for each bucket… 
I will still be adding a new larger number so it will be added to the right 
hand side or tail of the list.

Once a region is split… that’s it. The lower half never receives another row,
so it stays half full.

Bucketing will solve the hotspotting issue by creating n lists of rows, but
you’re still always adding to the end of each list.
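
A rough Python sketch of the bucketed case, under the same toy assumptions: splits happen at a fixed row count, the table is pre-split on the bucket prefixes, and the sequential part of the key is fixed width so it sorts numerically inside a bucket. The names bucket_of and simulate_bucketed are made up for this example.

```python
import hashlib
from collections import Counter

def bucket_of(counter, n_buckets):
    """mod(hash(rowkey), n) for a sequential counter used as the row key."""
    digest = hashlib.md5(str(counter).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def simulate_bucketed(num_rows, n_buckets, max_rows_per_region):
    """Return (region sizes per bucket, write count per bucket)."""
    # One list of region sizes per bucket; each bucket starts as one region.
    buckets = [[0] for _ in range(n_buckets)]
    writes = Counter()
    for counter in range(1, num_rows + 1):
        b = bucket_of(counter, n_buckets)
        writes[b] += 1
        # Inside a bucket the new counter is still the largest key so far,
        # so it goes to the tail region of that bucket.
        buckets[b][-1] += 1
        if buckets[b][-1] >= max_rows_per_region:
            half = max_rows_per_region // 2
            buckets[b][-1] = half
            buckets[b].append(max_rows_per_region - half)
    return buckets, writes

buckets, writes = simulate_bucketed(40000, 4, 1000)
for b, sizes in enumerate(buckets):
    print(b, writes[b], sizes)
```

The writes are spread over all n buckets, so the hot spot is gone, but within each bucket every region except the tail is frozen at half of the split threshold.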

Does that make sense? 


> On May 5, 2015, at 10:04 AM, jeremy p <[email protected]> wrote:
> 
> Thank you for your response!
> 
> So I guess 'salt' is a bit of a misnomer.  What I used to do is this :
> 
> 1) Say that my key value is something like '1234foobar'
> 2) I obtain the hash of '1234foobar'.  Let's say that's '54824923'
> 3) I mod the hash by my number of regions.  Let's say I have 2000 regions.
> 54824923 % 2000 = 923
> 4) I prepend that value to my original key value, so my new key is
> '923_1234foobar'
> 
> Is this the same thing you were talking about?
> 
> A couple questions :
> 
> * Why would my regions only be 1/2 full?
> * Why would I only use this for sequential keys?  I would think this would
> give better performance in any situation where I don't need range scans.
> For example, let's say my key value is a person's last name.  That will
> naturally cluster around certain letters, giving me an uneven distribution.
> 
> --Jeremy
> 
> 
> 
> On Sun, May 3, 2015 at 11:46 AM, Michael Segel <[email protected]>
> wrote:
> 
>> Yes, don’t use a salt. Salt implies that your seed is orthogonal (read
>> random) to the base table row key.
>> You’re better off using a truncated hash (md5 is fastest) so that at least
>> you can use a single get().
>> 
>> Common?
>> 
>> Only if your row key is mostly sequential.
>> 
>> Note that even with bucketing, you will still end up with regions only 1/2
>> full with the only exception being the last region.
>> 
>>> On May 1, 2015, at 11:09 AM, jeremy p <[email protected]> wrote:
>>> 
>>> Hello all,
>>> 
>>> I've been out of the HBase world for a while, and I'm just now jumping
>>> back in.
>>> 
>>> As of HBase .94, it was still common to take a hash of your RowKey and
>>> use that to "salt" the beginning of your RowKey to obtain an even
>>> distribution among your region servers.  Is this still a common practice,
>>> or is there a better way to do this in HBase 1.0?
>>> 
>>> --Jeremy
>> 
>> The opinions expressed here are mine, while they may reflect a cognitive
>> thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com




