In our measurements, scan throughput improves when you run n
range scans rather than 1 (since you are effectively striping the
reads). This is even better when you don't necessarily care about the
order of every row, but want every row in a given range (then you can
just take whatever row is available from a buffer in the client).
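
A quick sketch of what that striping looks like (all names here are mine, and a plain dict stands in for the HBase table; a real client would issue these as n concurrent scans and drain whichever buffer has rows):

```python
import hashlib
from itertools import chain

NUM_BUCKETS = 16  # one stripe per salt prefix (first hex char of a hash)

def salt(key):
    # Deterministic one-character prefix derived from the key itself,
    # so it can be recomputed without a lookup.
    return hashlib.sha1(key.encode()).hexdigest()[0]

def striped_scan(table, start, stop):
    # One "scanner" per salt prefix, each covering [prefix+start, prefix+stop).
    stripes = []
    for prefix in "0123456789abcdef"[:NUM_BUCKETS]:
        lo, hi = prefix + start, prefix + stop
        stripes.append(sorted((k, v) for k, v in table.items() if lo <= k < hi))
    # If global order doesn't matter, just drain every stripe's buffer.
    return list(chain.from_iterable(stripes))

# Time-ordered keys, salted on write so sequential days spread across stripes.
table = {salt(d) + d: d for d in (f"2014-05-{i:02d}" for i in range(1, 29))}
rows = striped_scan(table, "2014-05-10", "2014-05-20")
```

Every row in the requested range comes back, just not in global key order.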

-Mike

On Sun, May 18, 2014 at 1:07 PM, Michael Segel
<[email protected]> wrote:
> No, you’re missing the point.
> It’s not a good idea or design.
>
> Is your data mutable or static?
>
> To your point: every time you want to do a simple get() you have to issue n 
> get() calls. For your range scans you will have to run n range scans, then 
> merge and sort the result sets. The fact that each result set is already in 
> sort order will help a little, but it's still not that clean.
>
>
>
> On May 18, 2014, at 4:58 PM, Software Dev <[email protected]> wrote:
>
>> You may be missing the point. The primary reason for the salt prefix
>> pattern is to avoid hotspotting when inserting time series data AND at
>> the same time provide a way to perform range scans.
>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
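
The core of that pattern is just a deterministic salt: hash the key, take it modulo the number of buckets, and prepend the bucket. A minimal sketch (helper names are mine, not HBaseWD's API):

```python
import hashlib

NUM_BUCKETS = 8  # fixed, application-chosen bucket count

def salted_key(key):
    # Hash-derived, so the same key always gets the same prefix and a
    # single get() can still find it; writes spread over NUM_BUCKETS
    # region ranges instead of all landing on one "hot" region.
    bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket}-{key}"

# A burst of sequential time-series keys now fans out across 8 prefixes.
keys = [salted_key(f"20140518-{i:04d}") for i in range(1000)]
```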
>>
>>> NOTE:  Many people worry about hot spotting when they really don’t have to. 
>>> Hot spotting that occurs on the initial load of a table is OK. 
>>> It’s when you have a sequential row key that you run into problems with hot 
>>> spotting and regions being only half filled.
>>
>> The data being inserted will be a constant stream of time-ordered data,
>> so yes, hotspotting will be an issue.
>>
>>> Adding a random value to give you a bit of randomness now means that you 
>>> can’t do a range scan.
>>
>> That's not accurate. To perform a range scan you would just need to
>> open N scanners, where N is the number of buckets/random prefixes
>> used.
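
For example, since each of the N scanners yields rows already sorted within its bucket, the client can restore global order with an N-way merge rather than a full re-sort (a sketch; the salt scheme and data here are made up):

```python
import heapq

# Pretend results from N = 4 scanners, one per salt prefix; each list is
# already sorted, as a real HBase scanner's output would be.
scanner_results = [
    [("0-20140518-0003", b"c"), ("0-20140518-0007", b"g")],
    [("1-20140518-0001", b"a"), ("1-20140518-0005", b"e")],
    [("2-20140518-0002", b"b"), ("2-20140518-0006", b"f")],
    [("3-20140518-0004", b"d")],
]

def unsalted(item):
    key, _ = item
    return key.split("-", 1)[1]  # drop the "<bucket>-" salt before comparing

# N-way merge on the unsalted key restores global time order.
merged = list(heapq.merge(*scanner_results, key=unsalted))
```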
>>
>>> Don’t take the modulo, just truncate the hash to its first byte. Taking 
>>> the modulo is again a dumb idea, but not as dumb as using a salt.
>>
>> Well, the only reason I would think using a salt is beneficial is to
>> limit the number of scanners needed when performing a range scan. See
>> the comment above. And yes, range scans will be our primary read
>> pattern.
>>
>> On Sun, May 18, 2014 at 2:36 AM, Michael Segel
>> <[email protected]> wrote:
>>> I think I should dust off my schema design talk… clearly the talks given by 
>>> some of the vendors don’t really explain things …
>>> (Hmmm. Strata London?)
>>>
>>> See my reply below…. Note I used SHA-1. MD5 should also give you roughly 
>>> the same results.
>>>
>>> On May 18, 2014, at 4:28 AM, Software Dev <[email protected]> wrote:
>>>
>>>> I recently came across the pattern of adding a salting prefix to the
>>>> row keys to prevent hotspotting. Still trying to wrap my head around
>>>> it and I have a few questions.
>>>>
>>>
>>> If you add a salt, you’re prepending a random number to a row key in order 
>>> to avoid hot spotting.  It amazes me that Sematext never went back and 
>>> either removed the blog post or fixed it, and now the bad idea is getting 
>>> propagated. Adding a random value to give you a bit of randomness means 
>>> that you can’t do a range scan, or fetch a specific row with a single 
>>> get(), so you’re going to end up boiling the ocean to get your data. 
>>> You’d be better off using Hive/Spark/Shark than HBase.
>>>
>>> As James tries to point out, you take the hash of the row key so that you 
>>> can easily retrieve the value. But rather than prepend a 160-bit hash, you 
>>> can achieve the same thing by truncating the hash to its first byte, which 
>>> gives enough randomness to avoid hot spotting. Of course, the one question 
>>> you should ask is: why not just take the hash as the row key and have a 
>>> 160-bit row key (40 hex characters in length)? Then store the actual key 
>>> as a column in the table.
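
A sketch of both variants Michael describes (SHA-1 per his note; function names are mine):

```python
import hashlib

def prefixed_key(key):
    # Variant 1: prepend just the first byte (two hex chars) of the hash --
    # enough spread to avoid hot spotting, and recomputable for a get().
    return hashlib.sha1(key.encode()).hexdigest()[:2] + key

def hashed_key(key):
    # Variant 2: the full 160-bit hash (40 hex chars) becomes the row key;
    # the original key would then be stored as a column in the row.
    return hashlib.sha1(key.encode()).hexdigest()

k = "20140518-0001"
```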
>>>
>>> And then there’s a bigger question… why are you worried about hot spotting? 
>>> Are you adding rows where the row key is sequential? Or are you only hot 
>>> spotting when you first start loading rows, while the underlying row key is 
>>> random enough that once the first set of rows is added, HBase’s region 
>>> splitting will be enough?
>>>
>>>> - Is there ever a reason to salt to more buckets than there are region
>>>> servers? The only reason why I think that may be beneficial is to
>>>> anticipate future growth???
>>>>
>>> Doesn’t matter.
>>> Think about how HBase splits regions.
>>> Don’t take the modulo, just truncate the hash to its first byte. Taking 
>>> the modulo is again a dumb idea, but not as dumb as using a salt.
>>>
>>> Keep in mind that the first hex character of the hash is going to be 0–f 
>>> (the first 4 bits of the 160-bit key), so you have 16 values to start 
>>> with. That should be enough.
>>>
>>>> - Is it beneficial to always hash against a known number of buckets
>>>> (ie never change the size) that way for any individual row key you can
>>>> always determine the prefix?
>>>>
>>> Your question doesn’t make sense.
>>>
>>>> - Are there any good use cases of this pattern out in the wild?
>>>>
>>> Yup.
>>> Deduping data sets.
>>>
>>>> Thanks
>>>>
>>> NOTE:  Many people worry about hot spotting when they really don’t have to. 
>>> Hot spotting that occurs on the initial load of a table is OK. It’s when 
>>> you have a sequential row key that you run into problems with hot spotting 
>>> and regions being only half filled.
>>>
>>
>
