No, you’re missing the point.
It’s not a good idea or a good design.

Is your data mutable or static? 

To your point: every time you want to do a simple get(), you have to issue n 
get() calls. For your range scans you will have to run n range scans, then 
merge and sort the result sets. The fact that each result set is already in 
sort order will help a little, but it’s still not that clean. 
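To make that concrete, here is a toy Python sketch of the random-salt pattern (in-memory dicts stand in for HBase buckets; the helper names are mine, not an HBase API). Because a random salt can’t be recomputed from the key, one logical get() fans out into n gets, and a range scan becomes n per-bucket scans whose already-sorted results are merged:

```python
import heapq
import random

N_BUCKETS = 16

# In-memory stand-in for a salted HBase table: one key/value map per bucket.
table = {b: {} for b in range(N_BUCKETS)}

def put_row(key, value):
    # Random salt (the pattern under discussion): the bucket cannot be
    # recomputed from the key at read time.
    bucket = random.randrange(N_BUCKETS)
    table[bucket][key] = value

def get_row(key):
    # The salt is unknown, so one logical get() becomes N_BUCKETS gets.
    for bucket in range(N_BUCKETS):
        if key in table[bucket]:
            return table[bucket][key]
    return None

def range_scan(start, stop):
    # One scan per bucket; each per-bucket result is already sorted,
    # so merge the n sorted streams rather than re-sorting everything.
    per_bucket = (
        sorted((k, v) for k, v in table[b].items() if start <= k < stop)
        for b in range(N_BUCKETS)
    )
    return list(heapq.merge(*per_bucket))
```

With a hash-derived prefix instead of a random one, get_row() could compute the bucket directly and issue a single get; the n-way fan-out above is the cost specific to a truly random salt.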

On May 18, 2014, at 4:58 PM, Software Dev <[email protected]> wrote:

> You may be missing the point. The primary reason for the salt prefix
> pattern is to avoid hotspotting when inserting time series data AND at
> the same time provide a way to perform range scans.
> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> 
>> NOTE:  Many people worry about hot spotting when they really don’t have to. 
>> Hot spotting that occurs on the initial load of a table is OK. It’s when 
>> you have a sequential row key that you run into problems with hot 
>> spotting and regions being only half filled.
> 
> The data being inserted will be a constant stream of time-ordered data,
> so yes, hotspotting will be an issue.
> 
>> Adding a random value to give you a bit of randomness now means that you 
>> can’t do a range scan..
> 
> That's not accurate. To perform a range scan you would just need to
> open up N scanners where N is the size of the buckets/random prefixes
> used.
> 
>> Don’t take the modulo, just truncate to the first byte.  Taking the modulo 
>> is again a dumb idea, but not as dumb as using a salt.
> 
> Well, the only reason I would think using a salt is beneficial is to
> limit the number of scanners needed when performing a range scan. See
> the comment above. And yes, range scans will be our primary read
> pattern.
> 
> On Sun, May 18, 2014 at 2:36 AM, Michael Segel
> <[email protected]> wrote:
>> I think I should dust off my schema design talk… clearly the talks given by 
>> some of the vendors don’t really explain things …
>> (Hmmm. Strata London?)
>> 
>> See my reply below…. Note I used SHA-1. MD-5 should also give you roughly 
>> the same results.
>> 
>> On May 18, 2014, at 4:28 AM, Software Dev <[email protected]> wrote:
>> 
>>> I recently came across the pattern of adding a salting prefix to the
>>> row keys to prevent hotspotting. Still trying to wrap my head around
>>> it and I have a few questions.
>>> 
>> 
>> If you add a salt, you’re prepending a random number to a row key in order 
>> to avoid hot spotting.  It amazes me that Sematext never went back and 
>> either removed the blog post or fixed it, and now the bad idea keeps getting 
>> propagated.  Adding a random value to give you a bit of randomness now means 
>> that you can’t do a range scan or fetch a specific row with a single get(), 
>> so you’re going to end up boiling the ocean to get your data. You’re better 
>> off using Hive/Spark/Shark than HBase.
>> 
>> As James tries to point out, you take the hash of the row key so that you 
>> can easily retrieve the value. But rather than prepend a 160-bit hash, you 
>> can achieve the same thing by truncating the hash to its first byte, which 
>> gives you enough randomness to avoid hot spotting. Of course, the one 
>> question you should ask is: why not just take the hash as the row key, 
>> giving you a 160-bit row key (20 bytes, or 40 hex characters), and then 
>> store the actual key as a column in the table?
>> 
>> And then there’s a bigger question… why are you worried about hot spotting? 
>> Are you adding rows where the row key is sequential?  Or are you worried 
>> about when you first start loading rows, that you are hot spotting, but the 
>> underlying row key is random enough that once the first set of rows are 
>> added, HBase splitting regions will be enough?
>> 
>>> - Is there ever a reason to salt to more buckets than there are region
>>> servers? The only reason why I think that may be beneficial is to
>>> anticipate future growth???
>>> 
>> Doesn’t matter.
>> Think about how HBase splits regions.
>> Don’t take the modulo, just truncate to the first byte.  Taking the modulo 
>> is again a dumb idea, but not as dumb as using a salt.
>> 
>> Keep in mind that the first byte of the hash, in its hex character 
>> representation, is going to be 0-f (4 bits of the 160-bit key). So you have 
>> 16 values to start with.
>> That should be enough.
>> 
>>> - Is it beneficial to always hash against a known number of buckets
>>> (ie never change the size) that way for any individual row key you can
>>> always determine the prefix?
>>> 
>> Your question doesn’t make sense.
>> 
>>> - Are there any good use cases of this pattern out in the wild?
>>> 
>> Yup.
>> Deduping data sets.
>> 
>>> Thanks
>>> 
>> NOTE:  Many people worry about hot spotting when they really don’t have to. 
>> Hot spotting that occurs on the initial load of a table is OK. It’s when 
>> you have a sequential row key that you run into problems with hot 
>> spotting and regions being only half filled.
>> 
> 
