@Software Dev - if you use Phoenix, queries would leverage our Skip Scan (which supports a superset of the FuzzyRowFilter perf improvements). Take a look here: http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html
Assuming a row key made up of a low cardinality first value (like a byte representing an enum), followed by a high cardinality second value (like a date/time value), you'd get a large benefit from the skip scan when you're only looking at a small sliver of your time range. Another option would be to create a secondary index over your date: http://phoenix.incubator.apache.org/secondary_indexing.html

Thanks,
James

On Sun, May 18, 2014 at 1:56 PM, James Taylor <[email protected]> wrote:

> The top two hits when you Google for HBase salt are
> - Sematext blog describing "salting" as I described it in my email
> - Phoenix blog again describing "salting" in this same way
> I really don't understand what you're arguing about - the mechanism that
> you're advocating for is exactly the way both these solutions have
> implemented it. I believe we're all in agreement. It seems that you just
> aren't happy with the fact that we've called this technique "salting".
>
>
> On Sun, May 18, 2014 at 11:32 AM, Michael Segel <[email protected]> wrote:
>
>> @James…
>> You’re not listening. There is a special meaning when you say salt.
>>
>> On May 18, 2014, at 7:16 PM, James Taylor <[email protected]> wrote:
>>
>> > @Mike,
>> >
>> > The biggest problem is you're not listening. Please actually read my
>> > response (and you'll understand that what we're calling "salting" is not a
>> > random seed).
>> >
>> > Phoenix already has secondary indexes in two flavors: one optimized for
>> > write-once data and one more general for fully mutable data. Soon we'll
>> > have a third for local indexing.
>> >
>> > James
>> >
>> >
>> > On Sun, May 18, 2014 at 10:27 AM, Michael Segel <[email protected]> wrote:
>> >
>> >> @James,
>> >>
>> >> I know and that’s the biggest problem.
>> >> Salts by definition are random seeds.
>> >>
>> >> Now I have two new phrases.
>> >>
>> >> 1) We want to remain on a sodium free diet.
>> >> 2) Learn to kick the bucket.
>> >>
>> >> When you have data that is coming in on a time series, is the data mutable
>> >> or not?
>> >>
>> >> A better approach would be to redesign a second type of storage to handle
>> >> serial data and how the regions are split and managed.
>> >> Or just not use HBase to store the underlying data in the first place and
>> >> just store the index… ;-)
>> >> (Yes, I thought about this too.)
>> >>
>> >> -Mike
>> >>
>> >> On May 16, 2014, at 7:50 PM, James Taylor <[email protected]> wrote:
>> >>
>> >>> Hi Mike,
>> >>> I agree with you - the way you've outlined is exactly the way Phoenix has
>> >>> implemented it. It's a bit of a problem with terminology, though. We call
>> >>> it salting: http://phoenix.incubator.apache.org/salted.html. We hash the
>> >>> key, mod the hash with the SALT_BUCKET value you provide, and prepend the
>> >>> row key with this single byte value. Maybe you can coin a good term for
>> >>> this technique?
>> >>>
>> >>> FWIW, you don't lose the ability to do a range scan when you salt (or
>> >>> hash-the-key and mod by the number of "buckets"), but you do need to run a
>> >>> scan for each possible value of your salt byte (0 - SALT_BUCKET-1). Then
>> >>> the client does a merge sort among these scans. It performs well.
>> >>>
>> >>> Thanks,
>> >>> James
>> >>>
>> >>>
>> >>> On Fri, May 9, 2014 at 11:57 PM, Michael Segel <[email protected]> wrote:
>> >>>
>> >>>> 3+ Years on and a bad idea is being propagated again.
>> >>>>
>> >>>> Now repeat after me… DO NOT USE A SALT.
>> >>>>
>> >>>> Having a low sodium diet, especially for HBase is really good for your
>> >>>> health and sanity.
>> >>>>
>> >>>> The salt is going to be orthogonal to the row key (Key).
>> >>>> There is no relationship to the specific Key.
>> >>>>
>> >>>> Using a salt means you now have the ability to randomly spread the
>> >>>> distribution of data to avoid HOT SPOTTING.
>> >>>> However you lose the ability to seek for a specific row.
>> >>>>
>> >>>> YOU HASH THE KEY.
>> >>>>
>> >>>> The hash whether you use SHA-1 or MD5 is going to yield the same result
>> >>>> each and every time you provide the key.
>> >>>>
>> >>>> But wait, the generated hash is 160 bits long. We don’t need that!
>> >>>> Absolutely true if you just want to randomize the key to avoid hot
>> >>>> spotting. There’s this concept called truncating the hash to the desired
>> >>>> length.
>> >>>> So to Adrien’s point, you can truncate it to a single byte which would be
>> >>>> sufficient….
>> >>>> Now when you want to seek for a specific row, you can find it.
>> >>>>
>> >>>> The downside to any solution is that you lose the ability to do a range
>> >>>> scan.
>> >>>> BUT BY USING A HASH AND NOT A SALT, YOU DON'T LOSE THE ABILITY TO FETCH A
>> >>>> SINGLE ROW VIA A get() CALL.
>> >>>>
>> >>>> <rant>
>> >>>> This simple fact was pointed out several years ago, yet for some
>> >>>> reason, the use of a salt persists.
>> >>>> I’ve actually made that part of the HBase course I wrote and use it in my
>> >>>> presentation(s) on HBase.
>> >>>>
>> >>>> It amazes me that the committers and regulars who post here still don’t
>> >>>> grok the fact that if you’re going to ‘SALT’ a row, you might as well not
>> >>>> use HBase and stick with Hive.
>> >>>> I remember Ed C’s rant about how preferential treatment on Hive patches
>> >>>> was given to vendors’ committers… that preferential treatment seems to also
>> >>>> be extended to speakers at conferences. It wouldn’t be a problem if those said
>> >>>> speakers actually knew the topic… ;-)
>> >>>>
>> >>>> Propagation of bad ideas means that you’re leaving a lot of performance on
>> >>>> the table and it can kill or cripple projects.
>> >>>>
>> >>>> </rant>
>> >>>>
>> >>>> Sorry for the rant…
>> >>>>
>> >>>> -Mike
>> >>>>
>> >>>>
>> >>>> On May 3, 2014, at 4:39 PM, Software Dev <[email protected]> wrote:
>> >>>>
>> >>>>> Ok so there is no way around the FuzzyRowFilter checking every single
>> >>>>> row in the table correct? If so, what is a valid use case for that
>> >>>>> filter?
>> >>>>>
>> >>>>> Ok so salt to a low enough prefix that makes scanning reasonable. Our
>> >>>>> client for accessing these tables is a Rails (not JRuby) application
>> >>>>> so we are stuck with either the Thrift or Rails client. Can either of
>> >>>>> these perform multiple gets/scans?
>> >>>>>
>> >>>>>
>> >>>>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <[email protected]> wrote:
>> >>>>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data can be
>> >>>>>> split enough among all the possible regions, but you won't be able to
>> >>>>>> easily benefit from distributed scans to gather what you want.
>> >>>>>>
>> >>>>>> Let's say you want to split (time+login) with a salted key and you expect to
>> >>>>>> be able to retrieve events from 20140429 pretty fast. Then I would split
>> >>>>>> input data among 10 "spans", spread over 10 regions and 10 RS (i.e.
>> >>>>>> `$random % 10`). To retrieve ordered data, I would parallelize Scans over the 10
>> >>>>>> span groups (<00>-20140429, <01>-20140429...) and merge-sort everything
>> >>>>>> until I've got all the expected results.
>> >>>>>>
>> >>>>>> So in terms of performance this looks "a little bit" faster than your 2^32
>> >>>>>> randomization.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <[email protected]> wrote:
>> >>>>>>
>> >>>>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our
>> >>>>>>> time series data (20140501, 20140502...).
>> >>>>>>> We can prefix all of the
>> >>>>>>> keys with 4 random bytes and then just skip these during scanning. Is
>> >>>>>>> that correct? This *seems* like it will work but I'm questioning the
>> >>>>>>> performance of this even if it does work.
>> >>>>>>>
>> >>>>>>> Also, is this available via the REST client, shell and/or Thrift client?
>> >>>>>>>
>> >>>>>>> Also, is there a FuzzyColumn equivalent of this feature?
>> >>>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Adrien Mogenet
>> >>>>>> http://www.borntosegfault.com
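
For anyone following the terminology argument above, here is a rough Python sketch of the bucketing James describes: hash the key, mod by the bucket count, prepend the result as a single byte. Because the prefix is derived from the key itself, a point get can recompute it, and a range scan becomes one scan per bucket followed by a merge. `ToyTable`, `NUM_BUCKETS`, and the use of SHA-1 are assumptions made only for illustration; this is not Phoenix or HBase code, and the hash Phoenix actually uses may differ.

```python
import hashlib
import heapq

NUM_BUCKETS = 4  # assumed bucket count, picked only for this example


def bucket_of(key: bytes, buckets: int = NUM_BUCKETS) -> int:
    """Derive the prefix from the key itself: hash it, then mod by the bucket count."""
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % buckets


def salted(key: bytes, buckets: int = NUM_BUCKETS) -> bytes:
    """Prepend the single bucket byte to the original row key."""
    return bytes([bucket_of(key, buckets)]) + key


class ToyTable:
    """In-memory stand-in for a sorted table (hypothetical, not an HBase API)."""

    def __init__(self):
        self._data = {}  # salted row key -> value

    def put(self, key: bytes, value: str) -> None:
        self._data[salted(key)] = value

    def get(self, key: bytes):
        # The prefix is deterministic, so a point lookup recomputes it.
        return self._data.get(salted(key))

    def _scan_bucket(self, bucket: int, start: bytes, stop: bytes):
        # One ordered scan over [start, stop) within a single bucket.
        # A real store would seek; this toy version sorts and filters.
        lo = bytes([bucket]) + start
        hi = bytes([bucket]) + stop
        for row in sorted(self._data):
            if lo <= row < hi:
                yield row[1:], self._data[row]

    def range_scan(self, start: bytes, stop: bytes):
        # Run one scan per bucket, then merge the already-sorted streams.
        scans = [self._scan_bucket(b, start, stop) for b in range(NUM_BUCKETS)]
        return list(heapq.merge(*scans))


if __name__ == "__main__":
    table = ToyTable()
    for k in (b"20140501-bob", b"20140429-zed", b"20140429-amy", b"20140430-kim"):
        table.put(k, k.decode())
    print(table.get(b"20140429-amy"))
    print(table.range_scan(b"20140429", b"20140430"))
```

With a purely random prefix, `get` would have no way to recompute the leading byte and would have to probe every bucket, which is the distinction Mike is drawing between a salt and a hash.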

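
As a companion to the skip scan remark at the top of the thread, the sketch below illustrates the general idea on a sorted list of `(kind, timestamp)` keys, where `kind` is the low cardinality leading value. Instead of reading every row, it seeks to the start of the time range for each leading value and then jumps to the next leading value. This is a toy model under assumed names (`skip_scan`, integer `kind` values, integer dates); it is not how Phoenix implements its skip scan.

```python
from bisect import bisect_left


def skip_scan(sorted_keys, lo_ts, hi_ts):
    """Return (hits, seeks) for keys whose timestamp lies in [lo_ts, hi_ts).

    sorted_keys is a sorted list of (kind, timestamp) tuples with integer kinds.
    """
    hits = []
    seeks = 0
    i = 0
    while i < len(sorted_keys):
        kind = sorted_keys[i][0]
        # Seek to the first key of this kind at or after lo_ts.
        i = bisect_left(sorted_keys, (kind, lo_ts), i)
        seeks += 1
        # Read forward while still inside this kind and inside the time range.
        while (
            i < len(sorted_keys)
            and sorted_keys[i][0] == kind
            and sorted_keys[i][1] < hi_ts
        ):
            hits.append(sorted_keys[i])
            i += 1
        # Skip the remainder of this kind by seeking to the next leading value.
        i = bisect_left(sorted_keys, (kind + 1,), i)
        seeks += 1
    return hits, seeks


if __name__ == "__main__":
    keys = sorted((k, d) for k in range(3) for d in range(20140401, 20140431))
    hits, seeks = skip_scan(keys, 20140429, 20140430)
    print(hits, seeks, len(keys))
```

The number of seeks grows with the number of distinct leading values rather than with the number of rows, which is why the benefit is largest when the first key part has low cardinality and the requested time slice is small.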