You may be missing the point. The primary reason for the salt prefix
pattern is to avoid hotspotting when inserting time series data AND at
the same time provide a way to perform range scans.
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

>NOTE:  Many people worry about hot spotting when they really don’t have to do 
>so. Hot spotting that occurs on the initial load of a table is OK. It’s when 
>you have a sequential row key that you run into problems with hot spotting 
>and regions being only half filled.

The data being inserted will be a constant stream of time-ordered data,
so yes, hotspotting will be an issue.

>  Adding a random value to give you a bit of randomness now means that you 
> can’t do a range scan..

That's not accurate. To perform a range scan you just need to open
N scanners, where N is the number of buckets (salt prefixes) in use,
and merge their results.
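
As a rough sketch of the N-scanner idea (plain Python with a sorted list
standing in for the table, not the HBase client API; NUM_BUCKETS = 4, the
"|" separator, and the names salt_for, salted_key, scan_bucket and
range_scan are all made up for illustration, and the salt is assumed to be
derived from a hash of the key so it can be recomputed):

```python
import hashlib
import heapq
from bisect import bisect_left

NUM_BUCKETS = 4  # assumed bucket count, chosen only for illustration


def salt_for(key: str) -> int:
    """Deterministic salt: first byte of SHA-1(key) modulo the bucket count."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS


def salted_key(key: str) -> str:
    """Prefix the original key with its bucket, e.g. '2|201405180236'."""
    return f"{salt_for(key)}|{key}"


def scan_bucket(sorted_rows, bucket, start, stop):
    """One 'scanner': walk a single bucket's key range [start, stop)."""
    lo = bisect_left(sorted_rows, f"{bucket}|{start}")
    hi = bisect_left(sorted_rows, f"{bucket}|{stop}")
    # Strip the salt so callers see the original keys.
    return [row.split("|", 1)[1] for row in sorted_rows[lo:hi]]


def range_scan(sorted_rows, start, stop):
    """Run one scan per bucket and merge the already-sorted results."""
    per_bucket = [
        scan_bucket(sorted_rows, bucket, start, stop)
        for bucket in range(NUM_BUCKETS)
    ]
    return list(heapq.merge(*per_bucket))


# Stand-in for the table: salted row keys kept in sorted order.
timestamps = [f"2014051800{minute:02d}" for minute in range(60)]
table = sorted(salted_key(ts) for ts in timestamps)

# Ten consecutive timestamps come back in order despite the salting.
result = range_scan(table, "201405180010", "201405180020")
```

The cost of a range scan grows with the bucket count, since every bucket
has to be scanned for the same [start, stop) range.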

> Don’t take the modulo, just truncate to the first byte.  Taking the modulo is 
> again a dumb idea, but not as dumb as using a salt.

Well, the only reason I can see for using a salt is to limit the
number of scanners needed for a range scan. See my comment above. And
yes, range scans will be our primary read pattern.

On Sun, May 18, 2014 at 2:36 AM, Michael Segel
<[email protected]> wrote:
> I think I should dust off my schema design talk… clearly the talks given by 
> some of the vendors don’t really explain things …
> (Hmmm. Strata London?)
>
> See my reply below…. Note I used SHA-1. MD-5 should also give you roughly the 
> same results.
>
> On May 18, 2014, at 4:28 AM, Software Dev <[email protected]> wrote:
>
>> I recently came across the pattern of adding a salting prefix to the
>> row keys to prevent hotspotting. Still trying to wrap my head around
>> it and I have a few questions.
>>
>
> If you add a salt, you’re prepending a random number to a row in order to 
> avoid hot spotting.  It amazes me that Sematext never went back and either 
> removed the blog or fixed it and now the bad idea is getting propagated.  
> Adding a random value to give you a bit of randomness now means that you 
> can’t do a range scan, or fetch the specific row with a single get()  so 
> you’re going to end up boiling the ocean to get your data. You’re better off 
> using hive/spark/shark than hbase.
>
> As James tries to point out, you take the hash of the row so that you can 
> easily retrieve the value. But rather than prepend a 160 bit hash, you can 
> easily achieve the same thing by just truncating the hash to the first byte 
> in order to get enough randomness to avoid hot spotting. Of course, the one 
> question you should ask is why don’t you just take the hash as the row key 
> and then have a 160 bit row key (40 bytes in length)? Then store the actual 
> key as a column in the table.
>
> And then there’s a bigger question… why are you worried about hot spotting? 
> Are you adding rows where the row key is sequential?  Or are you worried 
> about when you first start loading rows, that you are hot spotting, but the 
> underlying row key is random enough that once the first set of rows are 
> added, HBase splitting regions will be enough?
>
>> - Is there ever a reason to salt to more buckets than there are region
>> servers? The only reason why I think that may be beneficial is to
>> anticipate future growth???
>>
> Doesn’t matter.
> Think about how HBase splits regions.
> Don’t take the modulo, just truncate to the first byte.  Taking the modulo is 
> again a dumb idea, but not as dumb as using a salt.
>
> Keep in mind that the first byte of the hash is going to be 0-f in a 
> character representation. (4 bits of the 160bit key)  So you have 16 values 
> to start with.
> That should be enough.
>
>> - Is it beneficial to always hash against a known number of buckets
>> (ie never change the size) that way for any individual row key you can
>> always determine the prefix?
>>
> Your question doesn’t make sense.
>
>> - Are there any good use cases of this pattern out in the wild?
>>
> Yup.
> Deduping data sets.
>
>> Thanks
>>
> NOTE:  Many people worry about hot spotting when they really don’t have to do 
> so. Hot spotting that occurs on the initial load of a table is OK. It’s when 
> you have a sequential row key that you run into problems with hot spotting 
> and regions being only half filled.
>
