I think I should dust off my schema design talk… clearly the talks given by 
some of the vendors don’t really explain things … 
(Hmmm. Strata London?) 

See my reply below. Note that I used SHA-1; MD5 should give you roughly the 
same results.

On May 18, 2014, at 4:28 AM, Software Dev <[email protected]> wrote:

> I recently came across the pattern of adding a salting prefix to the
> row keys to prevent hotspotting. Still trying to wrap my head around
> it and I have a few questions.
> 

If you add a salt, you’re prepending a random number to the row key in order to 
avoid hot spotting.  It amazes me that Sematext never went back and either 
removed the blog post or fixed it, and now the bad idea is getting propagated. 
Prepending a random value means that you can no longer do a range scan or fetch 
a specific row with a single get(), because the reader can’t recompute the 
prefix, so you’re going to end up boiling the ocean to get your data. At that 
point you’re better off using Hive/Spark/Shark than HBase.
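
To make the read-side cost concrete, here is a minimal sketch in plain Python 
(no HBase client involved). The names salted_key, candidate_keys and 
NUM_BUCKETS are hypothetical, and the 16-bucket count is an assumption for 
illustration only.

```python
import random

NUM_BUCKETS = 16  # assumed bucket count, for illustration only


def salted_key(row_key: str) -> str:
    """Write path: prepend a random one-character hex salt to the row key."""
    salt = random.randrange(NUM_BUCKETS)
    return f"{salt:x}{row_key}"


def candidate_keys(row_key: str) -> list[str]:
    """Read path: the salt is unknown, so every possible prefix must be tried."""
    return [f"{salt:x}{row_key}" for salt in range(NUM_BUCKETS)]


stored = salted_key("user42")
# A reader cannot rebuild `stored` from "user42" alone; it has to probe
# all NUM_BUCKETS candidates to find the one row that was written.
print(stored in candidate_keys("user42"), len(candidate_keys("user42")))
```

One logical get() turns into NUM_BUCKETS lookups, and a range scan turns into 
NUM_BUCKETS scans.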

As James tries to point out, you take the hash of the row key so that you can 
easily retrieve the value. But rather than prepend the full 160-bit hash, you 
can achieve the same thing by truncating the hash to the first byte of its hex 
representation, which gives you enough randomness to avoid hot spotting. Of 
course, the one question you should ask is why you don’t just take the hash as 
the row key, giving you a 160-bit row key (20 bytes raw, or 40 bytes as a hex 
string), and then store the actual key as a column in the table.
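
A minimal sketch of both options in Python, assuming the hash is hex-encoded 
so that "the first byte" is the first hex character. The function names 
prefixed_key and hashed_key are hypothetical, not part of any HBase API.

```python
import hashlib


def prefixed_key(row_key: str) -> str:
    """Prepend the first hex character of SHA-1(row_key) to the row key."""
    digest = hashlib.sha1(row_key.encode("utf-8")).hexdigest()
    return digest[0] + row_key


def hashed_key(row_key: str) -> str:
    """Use the full hex-encoded SHA-1 digest (40 characters) as the row key."""
    return hashlib.sha1(row_key.encode("utf-8")).hexdigest()


# The prefix is a pure function of the key, so a reader can recompute it
# and fetch the row with a single lookup.
print(prefixed_key("user42") == prefixed_key("user42"))
print(len(hashed_key("user42")))
```

Unlike a random salt, the same input always produces the same stored key, 
which is what keeps a single get() possible.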

And then there’s a bigger question… why are you worried about hot spotting? Are 
you adding rows where the row key is sequential?  Or are you worried that you 
will hot spot when you first start loading rows, even though the underlying row 
key is random enough that, once the first set of rows has been added, HBase 
splitting regions will be enough? 

> - Is there ever a reason to salt to more buckets than there are region
> servers? The only reason why I think that may be beneficial is to
> anticipate future growth???
> 
Doesn’t matter. 
Think about how HBase splits regions. 
Don’t take the modulo, just truncate to the first byte.  Taking the modulo is 
again a dumb idea, but not as dumb as using a salt.

Keep in mind that the first byte of the hex-encoded hash is going to be 0-f in 
a character representation (4 bits of the 160-bit hash), so you have 16 values 
to start with. 
That should be enough.
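
As a rough check on the 16-value claim, the sketch below buckets 10,000 
sequential keys by the first hex character of their SHA-1 hash. The key format 
"event-NNNNNNNN" and the helper name bucket are made up for illustration.

```python
import hashlib
from collections import Counter


def bucket(row_key: str) -> str:
    """Return the first hex character (0-f) of SHA-1(row_key)."""
    return hashlib.sha1(row_key.encode("utf-8")).hexdigest()[0]


# Hypothetical sequential keys, the worst case for hot spotting.
keys = [f"event-{i:08d}" for i in range(10_000)]
counts = Counter(bucket(k) for k in keys)

# Print how many keys landed under each of the possible prefixes.
for prefix in sorted(counts):
    print(prefix, counts[prefix])
```

With a hash like SHA-1 the counts should come out roughly even across the 
prefixes, even though the input keys are strictly sequential.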

> - Is it beneficial to always hash against a known number of buckets
> (ie never change the size) that way for any individual row key you can
> always determine the prefix?
> 
Your question doesn’t make sense. 

> - Are there any good use cases of this pattern out in the wild?
> 
Yup.
Deduping data sets.

> Thanks
> 
NOTE:  Many people worry about hot spotting when they really don’t have to. 
Hot spotting that occurs on the initial load of a table is OK. It’s when you 
have a sequential row key that you run into problems with hot spotting and 
regions being only half filled. 
