Re: row filter - binary comparator at certain range

Michael Segel Mon, 21 Oct 2013 17:59:15 -0700

James, 

It's evenly distributed, however... because it's a timestamp, it's a 'tail end
charlie' addition.
So when you split a region, the top half is never added to again, so you end up
with all regions half filled except for the last region in each 'modded' value.


I wouldn't say it's a bad thing if you plan for it.
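
A rough illustration of the half-filled effect described above, as a minimal sketch. It assumes an idealized model: one salt bucket, rows arriving in ascending key order, and a region that splits exactly in half once it reaches a fixed row count. Real HBase splits on store size at a mid key, so the names and numbers here (`simulate_bucket`, `max_rows_per_region`) are hypothetical.

```python
def simulate_bucket(num_rows, max_rows_per_region):
    """Return per-region row counts for one salt bucket fed ascending keys."""
    regions = [0]  # row counts, ordered by key range
    for _ in range(num_rows):
        # Ascending keys always land in the last (highest-keyed) region.
        regions[-1] += 1
        if regions[-1] >= max_rows_per_region:
            # Idealized split at the midpoint: the lower half keeps its rows
            # and never receives another write.
            half = regions[-1] // 2
            upper = regions[-1] - half
            regions[-1] = half
            regions.append(upper)
    return regions


if __name__ == "__main__":
    counts = simulate_bucket(1030, 100)
    print(len(counts), "regions; last region holds", counts[-1], "rows")
```

Under these assumptions every region except the last one settles at half the split threshold, which is the pattern to plan region sizes around.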

On Oct 21, 2013, at 5:07 PM, James Taylor <[email protected]> wrote:

> We don't truncate the hash, we mod it. Why would you expect that data
> wouldn't be evenly distributed? We've not seen this to be the case.
> 
> 
> 
> On Mon, Oct 21, 2013 at 1:48 PM, Michael Segel <[email protected]> wrote:
> 
>> What do you call hashing the row key?
>> Or hashing the row key and then appending the row key to the hash?
>> Or hashing the row key, truncating the hash value to some subset and then
>> appending the row key to the value?
>> 
>> The problem is that there is specific meaning to the term salt. Re-using
>> it here will cause confusion because you're implying something you don't
>> mean to imply.
>> 
>> you could say prepend a truncated hash of the key, however… is prepend a
>> real word? ;-) (I am sorry, I am not a grammar nazi, nor an English major. )
>> 
>> So even outside of Phoenix, the concept is the same.
>> Even with a truncated hash, you will find that over time, all but the tail
>> N regions will only be half full.
>> This could be both good and bad.
>> 
>> (Where N is your number of allowable hash values, 8 or 16.)
>> 
>> You've solved potentially one problem… but still have other issues that
>> you need to address.
>> I guess the simple answer is to double the region sizes and not care that
>> most of your regions will be 1/2 the max size, which is the size you really
>> want, while the 8-16 tail regions will be up to twice as big.
>> 
>> 
>> 
>> On Oct 21, 2013, at 3:26 PM, James Taylor <[email protected]> wrote:
>> 
>>> What do you think it should be called, because
>>> "prepending-row-key-with-single-hashed-byte" doesn't have a very good ring
>>> to it. :-)
>>> 
>>> Agree that getting the row key design right is crucial.
>>> 
>>> The range of "prepending-row-key-with-single-hashed-byte" is declarative
>>> when you create your table in Phoenix, so you typically declare an upper
>>> bound based on your cluster size (not 255, but maybe 8 or 16). We've run
>>> the numbers and it's typically faster, but as with most things, not always.
>>> 
>>> HTH,
>>> James
>>> 
>>> 
>>> On Mon, Oct 21, 2013 at 1:05 PM, Michael Segel <[email protected]> wrote:
>>> 
>>>> Then it's not a SALT. And please don't use the term 'salt' because it has a
>>>> specific meaning outside of what you want it to mean.  Just like saying
>>>> HBase has ACID because you write the entire row as an atomic element. But
>>>> I digress….
>>>> 
>>>> Ok so to your point…
>>>> 
>>>> 1 byte == 256 possible values.
>>>> 
>>>> So which will be faster:
>>>> 
>>>> creating a list of the 1-byte truncated hash of each possible timestamp in
>>>> your range, or doing 256 separate range scans with the start and stop range
>>>> keys set?
>>>> 
>>>> That will give you the results you want, however… I'd go back and have
>>>> them possibly rethink the row key if they can … assuming this is the base
>>>> access pattern.
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Oct 21, 2013, at 11:37 AM, James Taylor <[email protected]> wrote:
>>>> 
>>>>> Phoenix restricts salting to a single byte.
>>>>> Salting perhaps is misnamed, as the salt byte is a stable hash based on the
>>>>> row key.
>>>>> Phoenix's skip scan supports sub-key ranges.
>>>>> We've found salting in general to be faster (though there are cases where
>>>>> it's not), as it ensures better parallelization.
>>>>> 
>>>>> Regards,
>>>>> James
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Oct 21, 2013 at 9:14 AM, Vladimir Rodionov <[email protected]> wrote:
>>>>> 
>>>>>> FuzzyRowFilter does not work on sub-key ranges.
>>>>>> Salting is bad for any scan operation, unfortunately. When salt prefix
>>>>>> cardinality is small (1-2 bytes),
>>>>>> one can try something similar to FuzzyRowFilter but with additional
>>>>>> sub-key range support.
>>>>>> If salt prefix cardinality is high (> 2 bytes) - do a full scan with your
>>>>>> own Filter (for timestamp ranges).
>>>>>> 
>>>>>> Best regards,
>>>>>> Vladimir Rodionov
>>>>>> Principal Platform Engineer
>>>>>> Carrier IQ, www.carrieriq.com
>>>>>> e-mail: [email protected]
>>>>>> 
>>>>>> ________________________________________
>>>>>> From: Premal Shah [[email protected]]
>>>>>> Sent: Sunday, October 20, 2013 10:42 PM
>>>>>> To: user
>>>>>> Subject: Re: row filter - binary comparator at certain range
>>>>>> 
>>>>>> Have you looked at FuzzyRowFilter? Seems to me that it might satisfy your
>>>>>> use-case.
>>>>>> 
>>>>>> http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
>>>>>> 
>>>>>> 
>>>>>> On Sun, Oct 20, 2013 at 9:31 PM, Tony Duan <[email protected]> wrote:
>>>>>> 
>>>>>>> Alex Vasilenko <aa.vasilenko@...> writes:
>>>>>>> 
>>>>>>>> 
>>>>>>>> Lars,
>>>>>>>> 
>>>>>>>> But how will it behave when I have a salt at the beginning of the key to
>>>>>>>> properly shard the table across regions? Imagine a row key of the format
>>>>>>>> salt:timestamp, with rows like this:
>>>>>>>> ...
>>>>>>>> 1:15
>>>>>>>> 1:16
>>>>>>>> 1:17
>>>>>>>> 1:23
>>>>>>>> 2:3
>>>>>>>> 2:5
>>>>>>>> 2:12
>>>>>>>> 2:15
>>>>>>>> 2:19
>>>>>>>> 2:25
>>>>>>>> ...
>>>>>>>> 
>>>>>>>> And I want to find all rows that have the second part (timestamp) in the
>>>>>>>> range 15-25. What startKey and endKey should be used?
>>>>>>>> 
>>>>>>>> Alexandr Vasilenko
>>>>>>>> Web Developer
>>>>>>>> Skype:menterr
>>>>>>>> mob: +38097-611-45-99
>>>>>>>> 
>>>>>>>> 2012/2/9 lars hofhansl <lhofhansl@...>
>>>>>>> Hi Alexandr Vasilenko,
>>>>>>> Have you ever resolved this issue? I am also facing it and
>>>>>>> want to implement this functionality.
>>>>>>> Imagine a row key of the format
>>>>>>> salt:timestamp, with rows like this:
>>>>>>> ...
>>>>>>> 1:15
>>>>>>> 1:16
>>>>>>> 1:17
>>>>>>> 1:23
>>>>>>> 2:3
>>>>>>> 2:5
>>>>>>> 2:12
>>>>>>> 2:15
>>>>>>> 2:19
>>>>>>> 2:25
>>>>>>> ...
>>>>>>> 
>>>>>>> And I want to find all rows that have the second part (timestamp) in the
>>>>>>> range 15-25.
>>>>>>> 
>>>>>>> Could you please tell me how you resolved this?
>>>>>>> Thanks in advance.
>>>>>>> 
>>>>>>> 
>>>>>>> Tony duan
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Regards,
>>>>>> Premal Shah.
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
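
For the original question at the bottom of the thread (rows keyed salt:timestamp, timestamps 15-25), the "one range scan per salt value" approach discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the salt is a single byte computed as a stable hash of the key mod the bucket count, `zlib.crc32` is a stand-in and not the hash Phoenix actually uses, timestamps are encoded as 8-byte big-endian values, and the stop row is exclusive. The helper names (`salt_byte`, `salted_key`, `scan_ranges`) are hypothetical.

```python
import zlib


def salt_byte(row_key: bytes, buckets: int) -> int:
    """Stable hash of the row key, modded into [0, buckets)."""
    # crc32 is only a placeholder for whatever stable hash is really used.
    return zlib.crc32(row_key) % buckets


def salted_key(timestamp: int, buckets: int) -> bytes:
    """Prefix the big-endian timestamp with its one-byte salt."""
    key = timestamp.to_bytes(8, "big")
    return bytes([salt_byte(key, buckets)]) + key


def scan_ranges(start_ts: int, stop_ts: int, buckets: int):
    """One (start_row, stop_row) pair per salt value; stop_row is exclusive."""
    lo = start_ts.to_bytes(8, "big")
    hi = stop_ts.to_bytes(8, "big")
    return [(bytes([b]) + lo, bytes([b]) + hi) for b in range(buckets)]


if __name__ == "__main__":
    # Timestamps 15..25 inclusive, so the exclusive stop is 26.
    for start_row, stop_row in scan_ranges(15, 26, 8):
        print(start_row.hex(), stop_row.hex())
```

With 8 or 16 buckets this is a handful of scans that can run in parallel; with the full range of a byte it becomes 256 of them, which is the trade-off argued over above.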

The opinions expressed here are mine, while they may reflect a cognitive 
thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com
