The top two hits when you Google for HBase salt are:
- Sematext blog describing "salting" as I described it in my email
- Phoenix blog again describing "salting" in this same way

I really don't understand what you're arguing about - the mechanism that you're advocating for is exactly the way both these solutions have implemented it. I believe we're all in agreement. It seems that you just aren't happy with the fact that we've called this technique "salting".
On Sun, May 18, 2014 at 11:32 AM, Michael Segel <[email protected]> wrote:

> @James…
> You're not listening. There is a special meaning when you say salt.
>
> On May 18, 2014, at 7:16 PM, James Taylor <[email protected]> wrote:
>
> > @Mike,
> >
> > The biggest problem is you're not listening. Please actually read my response (and you'll understand that what we're calling "salting" is not a random seed).
> >
> > Phoenix already has secondary indexes in two flavors: one optimized for write-once data and one more general for fully mutable data. Soon we'll have a third for local indexing.
> >
> > James
> >
> > On Sun, May 18, 2014 at 10:27 AM, Michael Segel <[email protected]> wrote:
> >
> >> @James,
> >>
> >> I know, and that's the biggest problem.
> >> Salts by definition are random seeds.
> >>
> >> Now I have two new phrases.
> >>
> >> 1) We want to remain on a sodium-free diet.
> >> 2) Learn to kick the bucket.
> >>
> >> When you have data that is coming in on a time series, is the data mutable or not?
> >>
> >> A better approach would be to design a second type of storage to handle serial data and how the regions are split and managed.
> >> Or just not use HBase to store the underlying data in the first place and only store the index… ;-)
> >> (Yes, I thought about this too.)
> >>
> >> -Mike
> >>
> >> On May 16, 2014, at 7:50 PM, James Taylor <[email protected]> wrote:
> >>
> >>> Hi Mike,
> >>> I agree with you - the way you've outlined it is exactly the way Phoenix has implemented it. It's a bit of a problem with terminology, though. We call it salting: http://phoenix.incubator.apache.org/salted.html. We hash the key, mod the hash with the SALT_BUCKETS value you provide, and prepend the row key with this single byte value. Maybe you can coin a good term for this technique?
> >>>
> >>> FWIW, you don't lose the ability to do a range scan when you salt (or hash-the-key and mod by the number of "buckets"), but you do need to run a scan for each possible value of your salt byte (0 to SALT_BUCKETS-1). Then the client does a merge sort among these scans. It performs well.
> >>>
> >>> Thanks,
> >>> James
> >>>
> >>> On Fri, May 9, 2014 at 11:57 PM, Michael Segel <[email protected]> wrote:
> >>>
> >>>> 3+ years on and a bad idea is being propagated again.
> >>>>
> >>>> Now repeat after me… DO NOT USE A SALT.
> >>>>
> >>>> Having a low-sodium diet, especially for HBase, is really good for your health and sanity.
> >>>>
> >>>> The salt is going to be orthogonal to the row key (Key). There is no relationship to the specific Key.
> >>>>
> >>>> Using a salt means you gain the ability to randomly spread the distribution of data to avoid HOT SPOTTING. However, you lose the ability to seek for a specific row.
> >>>>
> >>>> YOU HASH THE KEY.
> >>>>
> >>>> The hash, whether you use SHA-1 or MD5, is going to yield the same result each and every time you provide the key.
> >>>>
> >>>> But wait, the generated hash is 160 bits long. We don't need that! Absolutely true if you just want to randomize the key to avoid hot spotting. There's this concept called truncating the hash to the desired length. So, to Adrien's point, you can truncate it to a single byte, which would be sufficient…
> >>>> Now when you want to seek for a specific row, you can find it.
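
To make the hash-and-prepend idea above concrete, here is a minimal, illustrative sketch (not from the thread): the logical key is hashed with MD5, the digest is truncated to a single byte and reduced mod a bucket count, and that byte is prepended to form the physical row key. The table name "events", the sample key 20140429|login, the bucket count of 16, the helper names, and the use of the newer Connection/Table client API are all assumptions for the sketch.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public class HashPrefixedKey {

    static final int BUCKETS = 16; // assumed bucket count; anything <= 256 fits in one byte

    /** Deterministic one-byte prefix: hash the logical key, truncate, mod by the bucket count. */
    static byte bucketFor(byte[] logicalKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(logicalKey);
            // Truncate the digest: a single byte is enough to spread writes across BUCKETS regions.
            return (byte) ((digest[0] & 0xFF) % BUCKETS);
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }

    /** Prepend the bucket byte to the logical key to form the physical HBase row key. */
    static byte[] physicalKey(byte[] logicalKey) {
        byte[] rowKey = new byte[logicalKey.length + 1];
        rowKey[0] = bucketFor(logicalKey);
        System.arraycopy(logicalKey, 0, rowKey, 1, logicalKey.length);
        return rowKey;
    }

    public static void main(String[] args) throws Exception {
        byte[] logicalKey = "20140429|login".getBytes(StandardCharsets.UTF_8); // example key shape
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) { // table name is hypothetical
            // Because the prefix is a pure function of the key (not random), a point lookup still works:
            Result r = table.get(new Get(physicalKey(logicalKey)));
            System.out.println("Row found: " + !r.isEmpty());
        }
    }
}

Since the same computation is repeated at read time, the exact physical row key is always recoverable, which is why the Get above still works.
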
> >>>> The downside to any solution is that you lose the ability to do a range scan.
> >>>> BUT BY USING A HASH AND NOT A SALT, YOU DON'T LOSE THE ABILITY TO FETCH A SINGLE ROW VIA A get() CALL.
> >>>>
> >>>> <rant>
> >>>> This simple fact was pointed out several years ago, yet for some reason the use of a salt persists.
> >>>> I've actually made that part of the HBase course I wrote, and I use it in my presentation(s) on HBase.
> >>>>
> >>>> It amazes me that the committers and regulars who post here still don't grok the fact that if you're going to 'SALT' a row, you might as well not use HBase and stick with Hive.
> >>>> I remember Ed C's rant about how preferential treatment on Hive patches was given to vendors' committers… that preferential treatment seems to also be extended to speakers at conferences. It wouldn't be a problem if those speakers actually knew the topic… ;-)
> >>>>
> >>>> Propagation of bad ideas means that you're leaving a lot of performance on the table, and it can kill or cripple projects.
> >>>> </rant>
> >>>>
> >>>> Sorry for the rant…
> >>>>
> >>>> -Mike
> >>>>
> >>>> On May 3, 2014, at 4:39 PM, Software Dev <[email protected]> wrote:
> >>>>
> >>>>> Ok, so there is no way around the FuzzyRowFilter checking every single row in the table, correct? If so, what is a valid use case for that filter?
> >>>>>
> >>>>> Ok, so salt to a low enough prefix that makes scanning reasonable. Our client for accessing these tables is a Rails (not JRuby) application, so we are stuck with either the Thrift or REST client. Can either of these perform multiple gets/scans?
> >>>>>
> >>>>> On Sat, May 3, 2014 at 1:10 AM, Adrien Mogenet <[email protected]> wrote:
> >>>>>
> >>>>>> Using 4 random bytes you'll get 2^32 possibilities; thus your data can be split enough among all the possible regions, but you won't be able to easily benefit from distributed scans to gather what you want.
> >>>>>>
> >>>>>> Let's say you want to split (time+login) with a salted key and you expect to be able to retrieve events from 20140429 pretty fast. Then I would split input data among 10 "spans", spread over 10 regions and 10 RS (i.e. `$random % 10`). To retrieve ordered data, I would parallelize Scans over the 10 span groups (<00>-20140429, <01>-20140429...) and merge-sort everything until I've got all the expected results.
> >>>>>>
> >>>>>> So in terms of performance this looks "a little bit" faster than your 2^32 randomization.
> >>>>>>
> >>>>>> On Fri, May 2, 2014 at 10:09 PM, Software Dev <[email protected]> wrote:
> >>>>>>
> >>>>>>> I'm planning to work with FuzzyRowFilter to avoid hot spotting of our time series data (20140501, 20140502...). We can prefix all of the keys with 4 random bytes and then just skip these during scanning. Is that correct? This *seems* like it will work, but I'm questioning the performance of this even if it does work.
> >>>>>>>
> >>>>>>> Also, is this available via the REST client, shell and/or Thrift client?
> >>>>>>>
> >>>>>>> Also, is there a FuzzyColumn equivalent of this feature?
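
For the FuzzyRowFilter question just above, here is a rough, illustrative sketch: row keys are assumed to be laid out as 4 random bytes followed by a yyyyMMdd date, so the first four positions are marked fuzzy (may be any value) and the date bytes are fixed. The table name "events" and the Connection/Table client API usage are assumptions; as Adrien notes above, with 2^32 possible prefixes such a scan still has to cover essentially the whole table.

import java.util.Arrays;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyPrefixScan {
    public static void main(String[] args) throws Exception {
        byte[] date = Bytes.toBytes("20140429");       // the 8 fixed bytes after the random prefix

        // Assumed row key layout: [4 random bytes][yyyyMMdd...]
        byte[] template = new byte[4 + date.length];   // bytes at fuzzy positions are ignored
        System.arraycopy(date, 0, template, 4, date.length);

        // Fuzzy-info mask: 1 = "this position may be anything", 0 = "must match the template".
        byte[] mask = new byte[]{1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0};

        Scan scan = new Scan();
        scan.setFilter(new FuzzyRowFilter(Arrays.asList(new Pair<>(template, mask))));

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events")); // table name is hypothetical
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
                System.out.println(Bytes.toStringBinary(r.getRow()));
            }
        }
    }
}
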
> >>>>>>
> >>>>>> --
> >>>>>> Adrien Mogenet
> >>>>>> http://www.borntosegfault.com
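
Finally, a sketch of the scan-per-bucket-and-merge approach that James and Adrien describe earlier in the thread: with a one-byte bucket prefix (0 to BUCKETS-1) in front of a date-based key, one scan per bucket covers the date range and the client combines the results. The table name "events", the bucket count of 10, the helper names, and the plain sort standing in for a true k-way merge are all assumptions and simplifications.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedRangeScan {

    static final int BUCKETS = 10;   // assumed; must match the prefix scheme used at write time

    /** Physical key = [1 bucket byte][logical key], so each bucket occupies its own contiguous range. */
    static byte[] prefixed(int bucket, byte[] logicalKey) {
        byte[] key = new byte[logicalKey.length + 1];
        key[0] = (byte) bucket;
        System.arraycopy(logicalKey, 0, key, 1, logicalKey.length);
        return key;
    }

    public static void main(String[] args) throws IOException {
        byte[] start = Bytes.toBytes("20140429");
        byte[] stop  = Bytes.toBytes("20140430");   // exclusive upper bound of the date range

        List<byte[]> logicalRows = new ArrayList<>();
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) { // table name is hypothetical

            // One scan per bucket value; each scan stays within one slice of the table.
            for (int bucket = 0; bucket < BUCKETS; bucket++) {
                Scan scan = new Scan(prefixed(bucket, start), prefixed(bucket, stop));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        byte[] row = r.getRow();
                        // Strip the bucket byte so results from different buckets compare correctly.
                        logicalRows.add(Bytes.copy(row, 1, row.length - 1));
                    }
                }
            }
        }

        // Client-side combine: each per-bucket scan is already ordered, so a k-way merge would do;
        // a plain sort keeps this sketch short.
        logicalRows.sort(Bytes.BYTES_COMPARATOR);
        for (byte[] row : logicalRows) {
            System.out.println(Bytes.toString(row));
        }
    }
}

In practice the per-bucket scans would be issued in parallel and merged lazily; as James notes above, Phoenix's client performs exactly this kind of merge sort across the per-salt-byte scans.
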
