@James… You’re not listening. The word ‘salt’ has a specific meaning.

On May 18, 2014, at 7:16 PM, James Taylor <[email protected]> wrote:

> @Mike,
>
> The biggest problem is you're not listening. Please actually read my
> response (and you'll understand that what we're calling "salting" is not
> a random seed).
>
> Phoenix already has secondary indexes in two flavors: one optimized for
> write-once data and one more general for fully mutable data. Soon we'll
> have a third for local indexing.
>
> James
>
>
> On Sun, May 18, 2014 at 10:27 AM, Michael Segel
> <[email protected]> wrote:
>
>> @James,
>>
>> I know, and that’s the biggest problem.
>> Salts by definition are random seeds.
>>
>> Now I have two new phrases.
>>
>> 1) We want to remain on a sodium-free diet.
>> 2) Learn to kick the bucket.
>>
>> When you have data that is coming in as a time series, is the data
>> mutable or not?
>>
>> A better approach would be to design a second type of storage to handle
>> serial data and how the regions are split and managed.
>> Or just not use HBase to store the underlying data in the first place
>> and just store the index… ;-)
>> (Yes, I thought about this too.)
>>
>> -Mike
>>
>> On May 16, 2014, at 7:50 PM, James Taylor <[email protected]> wrote:
>>
>>> Hi Mike,
>>> I agree with you - the way you've outlined it is exactly the way
>>> Phoenix has implemented it. It's a bit of a problem with terminology,
>>> though. We call it salting:
>>> http://phoenix.incubator.apache.org/salted.html. We hash the key, mod
>>> the hash with the SALT_BUCKET value you provide, and prepend the row
>>> key with this single byte value. Maybe you can coin a good term for
>>> this technique?
>>>
>>> FWIW, you don't lose the ability to do a range scan when you salt (or
>>> hash the key and mod by the number of "buckets"), but you do need to
>>> run a scan for each possible value of your salt byte (0 to
>>> SALT_BUCKET-1). Then the client does a merge sort among these scans.
>>> It performs well.
>>>
>>> Thanks,
>>> James
>>>
>>>
>>> On Fri, May 9, 2014 at 11:57 PM, Michael Segel
>>> <[email protected]> wrote:
>>>
>>>> 3+ years on and a bad idea is being propagated again.
>>>>
>>>> Now repeat after me… DO NOT USE A SALT.
>>>>
>>>> Having a low-sodium diet, especially for HBase, is really good for
>>>> your health and sanity.
>>>>
>>>> The salt is going to be orthogonal to the row key (Key).
>>>> There is no relationship to the specific Key.
>>>>
>>>> Using a salt means you gain the ability to randomly spread the
>>>> distribution of data to avoid HOT SPOTTING.
>>>> However, you lose the ability to seek to a specific row.
>>>>
>>>> YOU HASH THE KEY.
>>>>
>>>> The hash, whether you use SHA-1 or MD5, is going to yield the same
>>>> result each and every time you provide the key.
>>>>
>>>> But wait, the generated hash is 160 bits long. We don’t need that!
>>>> Absolutely true, if you just want to randomize the key to avoid hot
>>>> spotting. There’s this concept called truncating the hash to the
>>>> desired length.
>>>> So to Adrien’s point, you can truncate it to a single byte, which
>>>> would be sufficient…
>>>> Now when you want to seek to a specific row, you can find it.
>>>>
>>>> The downside to either approach is that you lose the ability to do a
>>>> simple range scan.
>>>> BUT BY USING A HASH AND NOT A SALT, YOU DON’T LOSE THE ABILITY TO
>>>> FETCH A SINGLE ROW VIA A get() CALL.
>>>>
>>>> <rant>
>>>> This simple fact was pointed out several years ago, yet for some
>>>> reason the use of a salt persists.
>>>> I’ve actually made that part of the HBase course I wrote, and I use
>>>> it in my presentation(s) on HBase.
>>>>
>>>> It amazes me that the committers and regulars who post here still
>>>> don’t grok the fact that if you’re going to ‘SALT’ a row, you might
>>>> as well not use HBase and stick with Hive.
>>>> I remember Ed C’s rant about how preferential treatment on Hive
>>>> patches was given to vendors’ committers… that preferential treatment
>>>> seems to also be extended to speakers at conferences. It wouldn’t be
>>>> a problem if said speakers actually knew the topic… ;-)
>>>>
>>>> Propagation of bad ideas means that you’re leaving a lot of
>>>> performance on the table, and it can kill or cripple projects.
>>>>
>>>> </rant>
>>>>
>>>> Sorry for the rant…
>>>>
>>>> -Mike
>>>>
>>>>
>>>> On May 3, 2014, at 4:39 PM, Software Dev <[email protected]>
>>>> wrote:
>>>>
>>>>> Ok, so there is no way around the FuzzyRowFilter checking every
>>>>> single row in the table, correct? If so, what is a valid use case
>>>>> for that filter?
>>>>>
>>>>> Ok, so salt to a low enough prefix that makes scanning reasonable.
>>>>> Our client for accessing these tables is a Rails (not JRuby)
>>>>> application, so we are stuck with either the Thrift or REST client.
>>>>> Can either of these perform multiple gets/scans?
>>>>>
>>>>>
>>>>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data
>>>>>> can be spread well enough among all the possible regions, but you
>>>>>> won't be able to easily benefit from distributed scans to gather
>>>>>> what you want.
>>>>>>
>>>>>> Let's say you want to split (time+login) with a salted key and you
>>>>>> expect to be able to retrieve events from 20140429 pretty fast.
>>>>>> Then I would split input data among 10 "spans", spread over 10
>>>>>> regions and 10 RS (i.e. `$random % 10`). To retrieve ordered data,
>>>>>> I would parallelize Scans over the 10 span groups (<00>-20140429,
>>>>>> <01>-20140429...) and merge-sort everything until I've got all the
>>>>>> expected results.
>>>>>>
>>>>>> So in terms of performance this looks "a little bit" faster than
>>>>>> your 2^32 randomization.
>>>>>>
>>>>>>
>>>>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of
>>>>>>> our time series data (20140501, 20140502...). We can prefix all of
>>>>>>> the keys with 4 random bytes and then just skip these during
>>>>>>> scanning. Is that correct? This *seems* like it will work, but I'm
>>>>>>> questioning the performance of it even if it does.
>>>>>>>
>>>>>>> Also, is this available via the REST client, shell and/or Thrift
>>>>>>> client?
>>>>>>>
>>>>>>> Also, is there a FuzzyColumn equivalent of this feature?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Adrien Mogenet
>>>>>> http://www.borntosegfault.com
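
To make the hash-versus-salt distinction concrete, here is a minimal
sketch of the scheme Mike describes and that Phoenix's "salting" actually
implements: derive a one-byte prefix from a hash of the key itself,
truncated and modded into a fixed number of buckets. The class name, the
bucket count of 16, and the choice of MD5 are illustrative assumptions,
not anything specified in the thread; the point is only that the prefix
is deterministic, so a get() can recompute it.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BucketedKey {

        // Illustrative bucket count; must stay fixed once data is written.
        static final int NUM_BUCKETS = 16;

        // Prepend one byte derived from an MD5 hash of the key itself.
        // The same logical key always yields the same stored key.
        static byte[] toStoredKey(byte[] logicalKey) {
            try {
                byte[] digest =
                        MessageDigest.getInstance("MD5").digest(logicalKey);
                // Truncate the 128-bit hash to one byte, mod into a bucket.
                byte bucket = (byte) ((digest[0] & 0xFF) % NUM_BUCKETS);
                byte[] stored = new byte[logicalKey.length + 1];
                stored[0] = bucket;
                System.arraycopy(logicalKey, 0, stored, 1, logicalKey.length);
                return stored;
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError("MD5 is always available in the JDK", e);
            }
        }

        // Point lookup still works: recompute the prefix from the key.
        static Get pointGet(String logicalKey) {
            return new Get(toStoredKey(Bytes.toBytes(logicalKey)));
        }
    }

A random salt, by contrast, cannot be recomputed at read time, which is
exactly the get() problem Mike is pointing at.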
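
The range-scan side of the trade-off is the merge James describes for
Phoenix and Adrien describes for his 10 "spans": run one scan per
possible bucket byte and merge-sort the streams client-side on the
logical key. Below is a sketch, assuming the pre-2.0 HBase Java client
(Table, Scan with explicit start/stop rows) and the hypothetical
BucketedKey layout above; scanner cleanup on failure and result batching
are omitted for brevity.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BucketedScan {

        // One open scanner plus its current head row, for the merge heap.
        private static final class Head {
            final ResultScanner scanner;
            Result current;

            Head(ResultScanner scanner, Result current) {
                this.scanner = scanner;
                this.current = current;
            }

            byte[] logicalKey() {
                byte[] row = current.getRow();
                return Bytes.copy(row, 1, row.length - 1); // strip bucket byte
            }
        }

        static List<Result> scanRange(Table table, byte[] start, byte[] stop,
                                      int numBuckets) throws IOException {
            PriorityQueue<Head> heap = new PriorityQueue<>(numBuckets,
                    (a, b) -> Bytes.compareTo(a.logicalKey(), b.logicalKey()));

            // One scan per bucket: <00>start..<00>stop, <01>start..<01>stop, ...
            for (int b = 0; b < numBuckets; b++) {
                Scan scan = new Scan(prefixed((byte) b, start),
                                     prefixed((byte) b, stop));
                ResultScanner scanner = table.getScanner(scan);
                Result first = scanner.next();
                if (first != null) {
                    heap.add(new Head(scanner, first));
                } else {
                    scanner.close();
                }
            }

            // Merge sort: repeatedly emit the smallest head row, then
            // advance the scanner it came from.
            List<Result> out = new ArrayList<>();
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                out.add(h.current);
                h.current = h.scanner.next();
                if (h.current != null) {
                    heap.add(h);
                } else {
                    h.scanner.close();
                }
            }
            return out;
        }

        private static byte[] prefixed(byte bucket, byte[] key) {
            byte[] k = new byte[key.length + 1];
            k[0] = bucket;
            System.arraycopy(key, 0, k, 1, key.length);
            return k;
        }
    }

The scans here run sequentially for simplicity; Adrien's point about
parallelizing them across region servers applies unchanged, since each
bucket's scan is independent of the others.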
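
For the FuzzyRowFilter question that opened the thread, here is a sketch
assuming the 0.98-era filter API, a row key shaped as 4 random bytes
followed by a yyyyMMdd date, and the mask convention where 0 means "this
byte must match" and 1 means "any value":

    import java.util.Arrays;
    import java.util.Collections;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.Pair;

    public class FuzzyDayScan {

        // Match rows shaped <4 arbitrary bytes> + "yyyyMMdd".
        static Scan forDay(String day) {
            byte[] pattern = Bytes.add(new byte[4], Bytes.toBytes(day));
            byte[] mask = new byte[pattern.length];
            Arrays.fill(mask, 0, 4, (byte) 1); // first 4 bytes: wildcards
            // Remaining mask bytes stay 0: they must equal the date.
            Scan scan = new Scan();
            scan.setFilter(new FuzzyRowFilter(
                    Collections.singletonList(new Pair<>(pattern, mask))));
            return scan;
        }
    }

This only partially answers the "checking every single row" worry:
FuzzyRowFilter can seek forward past non-matching key ranges rather than
examining each row individually, but because the random prefix leads the
key, matching rows are scattered across the entire key space, so the
scan still fans out to every region. That is the cost Adrien's
fixed-bucket layout avoids.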
