@James… You’re not listening. The word ‘salt’ has a specific meaning.

On May 18, 2014, at 7:16 PM, James Taylor <[email protected]> wrote:

> @Mike,
>
> The biggest problem is you're not listening. Please actually read my
> response (and you'll understand that what we're calling "salting" is not
> a random seed).
>
> Phoenix already has secondary indexes in two flavors: one optimized for
> write-once data and one more general for fully mutable data. Soon we'll
> have a third for local indexing.
>
> James
>
>
> On Sun, May 18, 2014 at 10:27 AM, Michael Segel
> <[email protected]> wrote:
>
>> @James,
>>
>> I know, and that’s the biggest problem.
>> Salts by definition are random seeds.
>>
>> Now I have two new phrases.
>>
>> 1) We want to remain on a sodium-free diet.
>> 2) Learn to kick the bucket.
>>
>> When you have data that is coming in as a time series, is the data
>> mutable or not?
>>
>> A better approach would be to design a second type of storage to handle
>> serial data and how the regions are split and managed.
>> Or just not use HBase to store the underlying data in the first place
>> and just store the index… ;-)
>> (Yes, I thought about this too.)
>>
>> -Mike
>>
>> On May 16, 2014, at 7:50 PM, James Taylor <[email protected]> wrote:
>>
>>> Hi Mike,
>>> I agree with you - the way you've outlined it is exactly the way
>>> Phoenix has implemented it. It's a bit of a problem with terminology,
>>> though. We call it salting:
>>> http://phoenix.incubator.apache.org/salted.html. We hash the key, mod
>>> the hash with the SALT_BUCKET value you provide, and prepend the row
>>> key with this single byte value. Maybe you can coin a good term for
>>> this technique?
>>>
>>> FWIW, you don't lose the ability to do a range scan when you salt (or
>>> hash the key and mod by the number of "buckets"), but you do need to
>>> run a scan for each possible value of your salt byte (0 to
>>> SALT_BUCKET-1). Then the client does a merge sort among these scans.
>>> It performs well.
>>>
>>> Thanks,
>>> James
>>>
>>>
>>> On Fri, May 9, 2014 at 11:57 PM, Michael Segel
>>> <[email protected]> wrote:
>>>
>>>> 3+ years on and a bad idea is being propagated again.
>>>>
>>>> Now repeat after me… DO NOT USE A SALT.
>>>>
>>>> Having a low-sodium diet, especially for HBase, is really good for
>>>> your health and sanity.
>>>>
>>>> The salt is going to be orthogonal to the row key (Key).
>>>> There is no relationship to the specific Key.
>>>>
>>>> Using a salt means you gain the ability to randomly spread the
>>>> distribution of data to avoid HOT SPOTTING.
>>>> However, you lose the ability to seek to a specific row.
>>>>
>>>> YOU HASH THE KEY.
>>>>
>>>> The hash, whether you use SHA-1 or MD5, is going to yield the same
>>>> result each and every time you provide the key.
>>>>
>>>> But wait, the generated hash is 160 bits long. We don’t need that!
>>>> Absolutely true, if you just want to randomize the key to avoid hot
>>>> spotting. There’s this concept called truncating the hash to the
>>>> desired length.
>>>> So to Adrien’s point, you can truncate it to a single byte, which
>>>> would be sufficient…
>>>> Now when you want to seek to a specific row, you can find it.
>>>>
>>>> The downside to either approach is that you lose the ability to do a
>>>> simple range scan.
>>>> BUT BY USING A HASH AND NOT A SALT, YOU DON’T LOSE THE ABILITY TO
>>>> FETCH A SINGLE ROW VIA A get() CALL.
>>>>
>>>> <rant>
>>>> This simple fact was pointed out several years ago, yet for some
>>>> reason the use of a salt persists.
>>>> I’ve actually made that part of the HBase course I wrote, and I use
>>>> it in my presentation(s) on HBase.
>>>>
>>>> It amazes me that the committers and regulars who post here still
>>>> don’t grok the fact that if you’re going to ‘SALT’ a row, you might
>>>> as well not use HBase and stick with Hive.
>>>> I remember Ed C’s rant about how preferential treatment on Hive
>>>> patches was given to vendors’ committers… that preferential treatment
>>>> seems to also be extended to speakers at conferences. It wouldn’t be
>>>> a problem if said speakers actually knew the topic… ;-)
>>>>
>>>> Propagation of bad ideas means that you’re leaving a lot of
>>>> performance on the table, and it can kill or cripple projects.
>>>>
>>>> </rant>
>>>>
>>>> Sorry for the rant…
>>>>
>>>> -Mike
>>>>
>>>>
>>>> On May 3, 2014, at 4:39 PM, Software Dev <[email protected]>
>>>> wrote:
>>>>
>>>>> Ok, so there is no way around the FuzzyRowFilter checking every
>>>>> single row in the table, correct? If so, what is a valid use case
>>>>> for that filter?
>>>>>
>>>>> Ok, so salt to a low enough prefix that makes scanning reasonable.
>>>>> Our client for accessing these tables is a Rails (not JRuby)
>>>>> application, so we are stuck with either the Thrift or REST client.
>>>>> Can either of these perform multiple gets/scans?
>>>>>
>>>>>
>>>>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data
>>>>>> can be spread well enough among all the possible regions, but you
>>>>>> won't be able to easily benefit from distributed scans to gather
>>>>>> what you want.
>>>>>>
>>>>>> Let's say you want to split (time+login) with a salted key and you
>>>>>> expect to be able to retrieve events from 20140429 pretty fast.
>>>>>> Then I would split input data among 10 "spans", spread over 10
>>>>>> regions and 10 RS (i.e. `$random % 10`). To retrieve ordered data,
>>>>>> I would parallelize Scans over the 10 span groups (<00>-20140429,
>>>>>> <01>-20140429...) and merge-sort everything until I've got all the
>>>>>> expected results.
>>>>>>
>>>>>> So in terms of performance this looks "a little bit" faster than
>>>>>> your 2^32 randomization.
>>>>>>
>>>>>>
>>>>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of
>>>>>>> our time series data (20140501, 20140502...). We can prefix all of
>>>>>>> the keys with 4 random bytes and then just skip these during
>>>>>>> scanning. Is that correct? This *seems* like it will work, but I'm
>>>>>>> questioning the performance of it even if it does.
>>>>>>>
>>>>>>> Also, is this available via the REST client, shell and/or Thrift
>>>>>>> client?
>>>>>>>
>>>>>>> Also, is there a FuzzyColumn equivalent of this feature?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Adrien Mogenet
>>>>>> http://www.borntosegfault.com
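
To make the hash-versus-salt distinction concrete, here is a minimal
sketch of the scheme Mike describes and that Phoenix's "salting" actually
implements: derive a one-byte prefix from a hash of the key itself,
truncated and modded into a fixed number of buckets. The class name, the
bucket count of 16, and the choice of MD5 are illustrative assumptions,
not anything specified in the thread; the point is only that the prefix
is deterministic, so a get() can recompute it.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BucketedKey {

        // Illustrative bucket count; must stay fixed once data is written.
        static final int NUM_BUCKETS = 16;

        // Prepend one byte derived from an MD5 hash of the key itself.
        // The same logical key always yields the same stored key.
        static byte[] toStoredKey(byte[] logicalKey) {
            try {
                byte[] digest =
                        MessageDigest.getInstance("MD5").digest(logicalKey);
                // Truncate the 128-bit hash to one byte, mod into a bucket.
                byte bucket = (byte) ((digest[0] & 0xFF) % NUM_BUCKETS);
                byte[] stored = new byte[logicalKey.length + 1];
                stored[0] = bucket;
                System.arraycopy(logicalKey, 0, stored, 1, logicalKey.length);
                return stored;
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("MD5 is always available in the JDK", e);
            }
        }

        // Point lookup still works: recompute the prefix from the key.
        static Get pointGet(String logicalKey) {
            return new Get(toStoredKey(Bytes.toBytes(logicalKey)));
        }
    }

A random salt, by contrast, cannot be recomputed at read time, which is
exactly the get() problem Mike is pointing at.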
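
The range-scan side of the trade-off is the merge James describes for
Phoenix and Adrien describes for his 10 "spans": run one scan per
possible bucket byte and merge-sort the streams client-side on the
logical key. Below is a sketch, assuming the pre-2.0 HBase Java client
(Table, Scan with explicit start/stop rows) and the hypothetical
BucketedKey layout above; scanner cleanup on failure and result batching
are omitted for brevity.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BucketedScan {

        // One open scanner plus its current head row, for the merge heap.
        private static final class Head {
            final ResultScanner scanner;
            Result current;

            Head(ResultScanner scanner, Result current) {
                this.scanner = scanner;
                this.current = current;
            }

            byte[] logicalKey() {
                byte[] row = current.getRow();
                return Bytes.copy(row, 1, row.length - 1); // strip bucket byte
            }
        }

        static List<Result> scanRange(Table table, byte[] start, byte[] stop,
                                      int numBuckets) throws IOException {
            PriorityQueue<Head> heap = new PriorityQueue<>(numBuckets,
                    (a, b) -> Bytes.compareTo(a.logicalKey(), b.logicalKey()));

            // One scan per bucket: <00>start..<00>stop, <01>start..<01>stop, ...
            for (int b = 0; b < numBuckets; b++) {
                Scan scan = new Scan(prefixed((byte) b, start),
                                     prefixed((byte) b, stop));
                ResultScanner scanner = table.getScanner(scan);
                Result first = scanner.next();
                if (first != null) {
                    heap.add(new Head(scanner, first));
                } else {
                    scanner.close();
                }
            }

            // Merge sort: repeatedly emit the smallest head row, then
            // advance the scanner it came from.
            List<Result> out = new ArrayList<>();
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                out.add(h.current);
                h.current = h.scanner.next();
                if (h.current != null) {
                    heap.add(h);
                } else {
                    h.scanner.close();
                }
            }
            return out;
        }

        private static byte[] prefixed(byte bucket, byte[] key) {
            byte[] k = new byte[key.length + 1];
            k[0] = bucket;
            System.arraycopy(key, 0, k, 1, key.length);
            return k;
        }
    }

The scans here run sequentially for simplicity; Adrien's point about
parallelizing them across region servers applies unchanged, since each
bucket's scan is independent of the others.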
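
For the FuzzyRowFilter question that opened the thread, here is a sketch
assuming the 0.98-era filter API, a row key shaped as 4 random bytes
followed by a yyyyMMdd date, and the mask convention where 0 means "this
byte must match" and 1 means "any value":

    import java.util.Arrays;
    import java.util.Collections;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.Pair;

    public class FuzzyDayScan {

        // Match rows shaped <4 arbitrary bytes> + "yyyyMMdd".
        static Scan forDay(String day) {
            byte[] pattern = Bytes.add(new byte[4], Bytes.toBytes(day));
            byte[] mask = new byte[pattern.length];
            Arrays.fill(mask, 0, 4, (byte) 1); // first 4 bytes: wildcards
            // Remaining mask bytes stay 0: they must equal the date.
            Scan scan = new Scan();
            scan.setFilter(new FuzzyRowFilter(
                    Collections.singletonList(new Pair<>(pattern, mask))));
            return scan;
        }
    }

This only partially answers the "checking every single row" worry:
FuzzyRowFilter can seek forward past non-matching key ranges rather than
examining each row individually, but because the random prefix leads the
key, matching rows are scattered across the entire key space, so the
scan still fans out to every region. That is the cost Adrien's
fixed-bucket layout avoids.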
