The top two hits when you Google for HBase salt are:
- Sematext blog describing "salting" as I described it in my email
- Phoenix blog again describing "salting" in this same way

I really don't understand what you're arguing about - the mechanism that you're advocating for is exactly the way both these solutions have implemented it. I believe we're all in agreement. It seems that you just aren't happy with the fact that we've called this technique "salting".
On Sun, May 18, 2014 at 11:32 AM, Michael Segel <[email protected]> wrote:

> @James…
> You're not listening. There is a special meaning when you say salt.
>
> On May 18, 2014, at 7:16 PM, James Taylor <[email protected]> wrote:
>
> > @Mike,
> >
> > The biggest problem is you're not listening. Please actually read my response (and you'll understand that what we're calling "salting" is not a random seed).
> >
> > Phoenix already has secondary indexes in two flavors: one optimized for write-once data and one more general for fully mutable data. Soon we'll have a third for local indexing.
> >
> > James
> >
> > On Sun, May 18, 2014 at 10:27 AM, Michael Segel <[email protected]> wrote:
> >
> >> @James,
> >>
> >> I know, and that's the biggest problem.
> >> Salts by definition are random seeds.
> >>
> >> Now I have two new phrases.
> >>
> >> 1) We want to remain on a sodium-free diet.
> >> 2) Learn to kick the bucket.
> >>
> >> When you have data that is coming in on a time series, is the data mutable or not?
> >>
> >> A better approach would be to design a second type of storage to handle serial data and how the regions are split and managed.
> >> Or just not use HBase to store the underlying data in the first place and only store the index… ;-)
> >> (Yes, I thought about this too.)
> >>
> >> -Mike
> >>
> >> On May 16, 2014, at 7:50 PM, James Taylor <[email protected]> wrote:
> >>
> >>> Hi Mike,
> >>> I agree with you - the way you've outlined it is exactly the way Phoenix has implemented it. It's a bit of a problem with terminology, though. We call it salting: http://phoenix.incubator.apache.org/salted.html. We hash the key, mod the hash with the SALT_BUCKETS value you provide, and prepend the row key with this single byte value. Maybe you can coin a good term for this technique?
> >>>
> >>> FWIW, you don't lose the ability to do a range scan when you salt (or hash-the-key and mod by the number of "buckets"), but you do need to run a scan for each possible value of your salt byte (0 to SALT_BUCKETS-1). Then the client does a merge sort among these scans. It performs well.
> >>>
> >>> Thanks,
> >>> James
> >>>
> >>> On Fri, May 9, 2014 at 11:57 PM, Michael Segel <[email protected]> wrote:
> >>>
> >>>> 3+ years on and a bad idea is being propagated again.
> >>>>
> >>>> Now repeat after me… DO NOT USE A SALT.
> >>>>
> >>>> Having a low-sodium diet, especially for HBase, is really good for your health and sanity.
> >>>>
> >>>> The salt is going to be orthogonal to the row key (Key). There is no relationship to the specific Key.
> >>>>
> >>>> Using a salt means you gain the ability to randomly spread the distribution of data to avoid HOT SPOTTING. However, you lose the ability to seek for a specific row.
> >>>>
> >>>> YOU HASH THE KEY.
> >>>>
> >>>> The hash, whether you use SHA-1 or MD5, is going to yield the same result each and every time you provide the key.
> >>>>
> >>>> But wait, the generated hash is 160 bits long. We don't need that! Absolutely true if you just want to randomize the key to avoid hot spotting. There's this concept called truncating the hash to the desired length. So, to Adrien's point, you can truncate it to a single byte, which would be sufficient…
> >>>> Now when you want to seek for a specific row, you can find it.
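
To make the hash-and-prepend idea above concrete, here is a minimal, illustrative sketch (not from the thread): the logical key is hashed with MD5, the digest is truncated to a single byte and reduced mod a bucket count, and that byte is prepended to form the physical row key. The table name "events", the sample key 20140429|login, the bucket count of 16, the helper names, and the use of the newer Connection/Table client API are all assumptions for the sketch.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class HashPrefixedKey {

    static final int BUCKETS = 16; // assumed bucket count; anything <= 256 fits in one byte

    /** Deterministic one-byte prefix: hash the logical key, truncate, mod by the bucket count. */
    static byte bucketFor(byte[] logicalKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(logicalKey);
            // Truncate the digest: a single byte is enough to spread writes across BUCKETS regions.
            return (byte) ((digest[0] & 0xFF) % BUCKETS);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }

    /** Prepend the bucket byte to the logical key to form the physical HBase row key. */
    static byte[] physicalKey(byte[] logicalKey) {
        byte[] rowKey = new byte[logicalKey.length + 1];
        rowKey[0] = bucketFor(logicalKey);
        System.arraycopy(logicalKey, 0, rowKey, 1, logicalKey.length);
        return rowKey;
    }

    public static void main(String[] args) throws Exception {
        byte[] logicalKey = "20140429|login".getBytes(StandardCharsets.UTF_8); // example key shape
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) { // table name is hypothetical
            // Because the prefix is a pure function of the key (not random), a point lookup still works:
            Result r = table.get(new Get(physicalKey(logicalKey)));
            System.out.println("Row found: " + !r.isEmpty());
        }
    }
}

Since the same computation is repeated at read time, the exact physical row key is always recoverable, which is why the Get above still works.
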
> >>>> The downside to any solution is that you lose the ability to do a range scan.
> >>>> BUT BY USING A HASH AND NOT A SALT, YOU DON'T LOSE THE ABILITY TO FETCH A SINGLE ROW VIA A get() CALL.
> >>>>
> >>>> <rant>
> >>>> This simple fact was pointed out several years ago, yet for some reason the use of a salt persists.
> >>>> I've actually made that part of the HBase course I wrote, and I use it in my presentation(s) on HBase.
> >>>>
> >>>> It amazes me that the committers and regulars who post here still don't grok the fact that if you're going to 'SALT' a row, you might as well not use HBase and stick with Hive.
> >>>> I remember Ed C's rant about how preferential treatment on Hive patches was given to vendors' committers… that preferential treatment seems to also be extended to speakers at conferences. It wouldn't be a problem if those speakers actually knew the topic… ;-)
> >>>>
> >>>> Propagation of bad ideas means that you're leaving a lot of performance on the table, and it can kill or cripple projects.
> >>>> </rant>
> >>>>
> >>>> Sorry for the rant…
> >>>>
> >>>> -Mike
> >>>>
> >>>> On May 3, 2014, at 4:39 PM, Software Dev <[email protected]> wrote:
> >>>>
> >>>>> Ok, so there is no way around the FuzzyRowFilter checking every single row in the table, correct? If so, what is a valid use case for that filter?
> >>>>>
> >>>>> Ok, so salt to a low enough prefix that makes scanning reasonable. Our client for accessing these tables is a Rails (not JRuby) application, so we are stuck with either the Thrift or REST client. Can either of these perform multiple gets/scans?
> >>>>>
> >>>>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <[email protected]> wrote:
> >>>>>
> >>>>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data can be split enough among all the possible regions, but you won't be able to easily benefit from distributed scans to gather what you want.
> >>>>>>
> >>>>>> Let's say you want to split (time+login) with a salted key and you expect to be able to retrieve events from 20140429 pretty fast. Then I would split input data among 10 "spans", spread over 10 regions and 10 RS (i.e. `$random % 10`). To retrieve ordered data, I would parallelize Scans over the 10 span groups (<00>-20140429, <01>-20140429...) and merge-sort everything until I've got all the expected results.
> >>>>>>
> >>>>>> So in terms of performance this looks "a little bit" faster than your 2^32 randomization.
> >>>>>>
> >>>>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <[email protected]> wrote:
> >>>>>>
> >>>>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our time series data (20140501, 20140502...). We can prefix all of the keys with 4 random bytes and then just skip these during scanning. Is that correct? This *seems* like it will work, but I'm questioning the performance of this even if it does work.
> >>>>>>>
> >>>>>>> Also, is this available via the REST client, shell and/or Thrift client?
> >>>>>>>
> >>>>>>> Also, is there a FuzzyColumn equivalent of this feature?
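
For the FuzzyRowFilter question just above, here is a rough, illustrative sketch: row keys are assumed to be laid out as 4 random bytes followed by a yyyyMMdd date, so the first four positions are marked fuzzy (may be any value) and the date bytes are fixed. The table name "events" and the Connection/Table client API usage are assumptions; as Adrien notes above, with 2^32 possible prefixes such a scan still has to cover essentially the whole table.

import java.util.Arrays;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyPrefixScan {
    public static void main(String[] args) throws Exception {
        byte[] date = Bytes.toBytes("20140429");       // the 8 fixed bytes after the random prefix

        // Assumed row key layout: [4 random bytes][yyyyMMdd...]
        byte[] template = new byte[4 + date.length];   // bytes at fuzzy positions are ignored
        System.arraycopy(date, 0, template, 4, date.length);

        // Fuzzy-info mask: 1 = "this position may be anything", 0 = "must match the template".
        byte[] mask = new byte[]{1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0};

        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(Arrays.asList(new Pair<>(template, mask))));

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events")); // table name is hypothetical
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
                System.out.println(Bytes.toStringBinary(r.getRow()));
            }
        }
    }
}
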
> >>>>>>
> >>>>>> --
> >>>>>> Adrien Mogenet
> >>>>>> http://www.borntosegfault.com
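
Finally, a sketch of the scan-per-bucket-and-merge approach that James and Adrien describe earlier in the thread: with a one-byte bucket prefix (0 to BUCKETS-1) in front of a date-based key, one scan per bucket covers the date range and the client combines the results. The table name "events", the bucket count of 10, the helper names, and the plain sort standing in for a true k-way merge are all assumptions and simplifications.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedRangeScan {

    static final int BUCKETS = 10;   // assumed; must match the prefix scheme used at write time

    /** Physical key = [1 bucket byte][logical key], so each bucket occupies its own contiguous range. */
    static byte[] prefixed(int bucket, byte[] logicalKey) {
        byte[] key = new byte[logicalKey.length + 1];
        key[0] = (byte) bucket;
        System.arraycopy(logicalKey, 0, key, 1, logicalKey.length);
        return key;
    }

    public static void main(String[] args) throws IOException {
        byte[] start = Bytes.toBytes("20140429");
        byte[] stop  = Bytes.toBytes("20140430");   // exclusive upper bound of the date range

        List<byte[]> logicalRows = new ArrayList<>();
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) { // table name is hypothetical

            // One scan per bucket value; each scan stays within one slice of the table.
            for (int bucket = 0; bucket < BUCKETS; bucket++) {
                Scan scan = new Scan(prefixed(bucket, start), prefixed(bucket, stop));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        byte[] row = r.getRow();
                        // Strip the bucket byte so results from different buckets compare correctly.
                        logicalRows.add(Bytes.copy(row, 1, row.length - 1));
                    }
                }
            }
        }

        // Client-side combine: each per-bucket scan is already ordered, so a k-way merge would do;
        // a plain sort keeps this sketch short.
        logicalRows.sort(Bytes.BYTES_COMPARATOR);
        for (byte[] row : logicalRows) {
            System.out.println(Bytes.toString(row));
        }
    }
}

In practice the per-bucket scans would be issued in parallel and merged lazily; as James notes above, Phoenix's client performs exactly this kind of merge sort across the per-salt-byte scans.
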
