@Mike, the biggest problem is you're not listening. Please actually read my response (and you'll understand that what we're calling "salting" is not a random seed).
Phoenix already has secondary indexes in two flavors: one optimized for
write-once data and one more general for fully mutable data. Soon we'll have
a third for local indexing.

James

On Sun, May 18, 2014 at 10:27 AM, Michael Segel <[email protected]> wrote:
> @James,
>
> I know and that's the biggest problem.
> Salts by definition are random seeds.
>
> Now I have two new phrases.
>
> 1) We want to remain on a sodium free diet.
> 2) Learn to kick the bucket.
>
> When you have data that is coming in on a time series, is the data mutable
> or not?
>
> A better approach would be to redesign a second type of storage to handle
> serial data and how the regions are split and managed.
> Or just not use HBase to store the underlying data in the first place and
> just store the index… ;-)
> (Yes, I thought about this too.)
>
> -Mike
>
> On May 16, 2014, at 7:50 PM, James Taylor <[email protected]> wrote:
>
> > Hi Mike,
> > I agree with you - the way you've outlined is exactly the way Phoenix
> > has implemented it. It's a bit of a problem with terminology, though.
> > We call it salting: http://phoenix.incubator.apache.org/salted.html.
> > We hash the key, mod the hash with the SALT_BUCKETS value you provide,
> > and prepend the row key with this single byte value. Maybe you can coin
> > a good term for this technique?
> >
> > FWIW, you don't lose the ability to do a range scan when you salt (or
> > hash-the-key and mod by the number of "buckets"), but you do need to
> > run a scan for each possible value of your salt byte (0 to
> > SALT_BUCKETS - 1). Then the client does a merge sort among these scans.
> > It performs well.
> >
> > Thanks,
> > James
> >
> > On Fri, May 9, 2014 at 11:57 PM, Michael Segel <[email protected]> wrote:
> >
> >> 3+ years on and a bad idea is being propagated again.
> >>
> >> Now repeat after me… DO NOT USE A SALT.
> >>
> >> Having a low sodium diet, especially for HBase, is really good for
> >> your health and sanity.
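[Editorial aside: the scheme James describes — hash the key, mod by the bucket count, prepend one byte, then fan out one scan per bucket — can be sketched in plain Java. The hash function, key layout, and bucket count below are illustrative assumptions, not Phoenix's actual implementation.]

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SaltedKey {
    static final int SALT_BUCKETS = 8; // illustrative bucket count

    // Deterministic salt byte in [0, SALT_BUCKETS) derived from the row key.
    // Phoenix uses its own hash; Arrays.hashCode is just a stand-in here.
    static byte saltByte(byte[] rowKey) {
        return (byte) Math.floorMod(Arrays.hashCode(rowKey), SALT_BUCKETS);
    }

    // Prepend the salt byte to form the physical row key.
    static byte[] saltedKey(byte[] rowKey) {
        byte[] out = new byte[rowKey.length + 1];
        out[0] = saltByte(rowKey);
        System.arraycopy(rowKey, 0, out, 1, rowKey.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] key = "20140429|user42".getBytes(StandardCharsets.UTF_8);
        // Deterministic: the same logical key always lands in the same
        // bucket, so a client can recompute the prefix and still point-get.
        System.out.println("bucket=" + saltByte(key));
        // A "range scan" over salted data becomes SALT_BUCKETS parallel
        // scans, one per prefix (0 .. SALT_BUCKETS-1), merge-sorted client-side.
        for (int b = 0; b < SALT_BUCKETS; b++) {
            byte[] start = new byte[key.length + 1];
            start[0] = (byte) b;
            System.arraycopy(key, 0, start, 1, key.length);
            // a real client would issue Scan(start, stop) for each bucket
        }
    }
}
```

This is also essentially Adrien's 10-span suggestion further down the thread, with `$random % 10` replaced by a deterministic hash.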
> >>
> >> The salt is going to be orthogonal to the row key (Key).
> >> There is no relationship to the specific Key.
> >>
> >> Using a salt means you now have the ability to randomly spread the
> >> distribution of data to avoid HOT SPOTTING.
> >> However, you lose the ability to seek for a specific row.
> >>
> >> YOU HASH THE KEY.
> >>
> >> The hash, whether you use SHA-1 or MD5, is going to yield the same
> >> result each and every time you provide the key.
> >>
> >> But wait, the generated hash is 160 bits long. We don't need that!
> >> Absolutely true if you just want to randomize the key to avoid hot
> >> spotting. There's this concept called truncating the hash to the
> >> desired length.
> >> So to Adrien's point, you can truncate it to a single byte, which
> >> would be sufficient….
> >> Now when you want to seek for a specific row, you can find it.
> >>
> >> The downside to either solution is that you lose the ability to do a
> >> range scan.
> >> BUT BY USING A HASH AND NOT A SALT, YOU DON'T LOSE THE ABILITY TO
> >> FETCH A SINGLE ROW VIA A get() CALL.
> >>
> >> <rant>
> >> This simple fact has been pointed out several years ago, yet for some
> >> reason, the use of a salt persists.
> >> I've actually made that part of the HBase course I wrote and use it in
> >> my presentation(s) on HBase.
> >>
> >> It amazes me that the committers and regulars who post here still
> >> don't grok the fact that if you're going to 'SALT' a row, you might as
> >> well not use HBase and stick with Hive.
> >> I remember Ed C's rant about how preferential treatment on Hive
> >> patches was given to vendors' committers… that preferential treatment
> >> seems to also be extended to speakers at conferences. It wouldn't be a
> >> problem if those said speakers actually knew the topic… ;-)
> >>
> >> Propagation of bad ideas means that you're leaving a lot of
> >> performance on the table, and it can kill or cripple projects.
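[Editorial aside: the hash-the-key approach Mike describes — a deterministic digest truncated to one byte — might look like this minimal sketch. The class and key layout are illustrative; only `MessageDigest` is real JDK API.]

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedPrefix {
    // Prepend one byte of the key's MD5 digest. Unlike a random salt, the
    // prefix is a pure function of the key, so it can be recomputed at read
    // time and a single row is still reachable with a plain get().
    static byte[] prefixedKey(byte[] rowKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(rowKey);
            byte[] out = new byte[rowKey.length + 1];
            out[0] = digest[0]; // truncate the 128-bit digest to one byte
            System.arraycopy(rowKey, 0, out, 1, rowKey.length);
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is a mandatory JDK algorithm", e);
        }
    }

    public static void main(String[] args) {
        byte[] a = prefixedKey("20140501|event-1".getBytes(StandardCharsets.UTF_8));
        byte[] b = prefixedKey("20140501|event-1".getBytes(StandardCharsets.UTF_8));
        // Same key, same prefix, every time -- this is what makes get() work.
        System.out.println(java.util.Arrays.equals(a, b)); // prints "true"
    }
}
```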
> >>
> >> </rant>
> >>
> >> Sorry for the rant…
> >>
> >> -Mike
> >>
> >> On May 3, 2014, at 4:39 PM, Software Dev <[email protected]> wrote:
> >>
> >>> Ok, so there is no way around the FuzzyRowFilter checking every
> >>> single row in the table, correct? If so, what is a valid use case for
> >>> that filter?
> >>>
> >>> Ok, so salt to a low enough prefix that makes scanning reasonable.
> >>> Our client for accessing these tables is a Rails (not JRuby)
> >>> application, so we are stuck with either the Thrift or REST client.
> >>> Can either of these perform multiple gets/scans?
> >>>
> >>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <[email protected]> wrote:
> >>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data
> >>>> can be split enough among all the possible regions, but you won't be
> >>>> able to easily benefit from distributed scans to gather what you
> >>>> want.
> >>>>
> >>>> Let's say you want to split (time+login) with a salted key and you
> >>>> expect to be able to retrieve events from 20140429 pretty fast. Then
> >>>> I would split input data among 10 "spans", spread over 10 regions
> >>>> and 10 RS (i.e. `$random % 10`). To retrieve ordered data, I would
> >>>> parallelize Scans over the 10 span groups (<00>-20140429,
> >>>> <01>-20140429...) and merge-sort everything until I've got all the
> >>>> expected results.
> >>>>
> >>>> So in terms of performance this looks "a little bit" faster than
> >>>> your 2^32 randomization.
> >>>>
> >>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <[email protected]> wrote:
> >>>>
> >>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of
> >>>>> our time series data (20140501, 20140502...). We can prefix all of
> >>>>> the keys with 4 random bytes and then just skip these during
> >>>>> scanning. Is that correct? This *seems* like it will work, but I'm
> >>>>> questioning the performance of this even if it does work.
> >>>>>
> >>>>> Also, is this available via the REST client, shell, and/or Thrift
> >>>>> client?
> >>>>>
> >>>>> Also, is there a FuzzyColumn equivalent of this feature?
> >>>>
> >>>> --
> >>>> Adrien Mogenet
> >>>> http://www.borntosegfault.com
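[Editorial aside: to close the loop on the original FuzzyRowFilter question — the filter matches each row against a (pattern, mask) pair, where a mask byte of 0 means "this position must equal the pattern" and 1 means "any byte is fine". That is why it can skip a random prefix but still has to consider rows across the whole key space. A dependency-free sketch of that matching rule follows; it mimics the semantics for illustration and is not the actual HBase implementation.]

```java
public class FuzzyMatch {
    // Stand-alone predicate mimicking FuzzyRowFilter's per-byte rule:
    // mask[i] == 0 -> row[i] must equal pattern[i]; mask[i] == 1 -> wildcard.
    static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Four "don't care" prefix bytes, then a fixed date: 20140501.
        byte[] pattern = {0, 0, 0, 0, '2', '0', '1', '4', '0', '5', '0', '1'};
        byte[] mask    = {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0};
        byte[] row     = {9, 42, 7, 3, '2', '0', '1', '4', '0', '5', '0', '1'};
        System.out.println(matches(row, pattern, mask)); // prints "true"
    }
}
```

Note this answers the "valid use case" question from earlier in the thread: FuzzyRowFilter is useful precisely when a fixed pattern sits at a known offset behind a variable prefix, as in the random-4-byte-prefix time-series layout proposed above.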
