I see what you are saying, Michael, but I think the following is a blanket assumption:

> Think of it this way... the operation was a success but the patient died.

This is not always the case. Yes, if your use case or system is such that it will have lots of users trying to access it, then perhaps N users kicking off N concurrent/distributed reads is not efficient. But what if you have a batch use case where these distributed scans might actually help? The point being, rather than shooting down the idea as a whole, we can perhaps qualify it with areas where it might be useful and other areas where it can have an adverse effect.

Regards,
Shahab

On Wed, May 1, 2013 at 10:14 AM, Michael Segel <[email protected]> wrote:

> Unfortunately, as this idea keeps popping up, you are going to have this discussion.
>
> 1) As you admit... salting is bad when your primary access vector is get()s.
> 2) Range scans. Instead of 1 range scan, you now have N, where N is the number of salt values. In this case 10.
>
> You wouldn't think of this as bad. However, when you have a system which has a lot of users, and lots of queries which now have to scan N times the number of records for each scan? Excessive overhead. Just because the scans happen in parallel, you are still tying up a finite amount of resources.
>
> So you have to go back and ask the initial question... why?
> Can you change your key?
> What is the problem you're trying to solve?
>
> The point is that just because you can do it doesn't make it a good idea.
>
> Think of it this way... the operation was a success but the patient died.
>
> On May 1, 2013, at 12:12 AM, lars hofhansl <[email protected]> wrote:
>
> > I do not want to be rude or anything... But how often do we need to have this discussion?
> >
> > When you salt your rowkeys with, say, 10 salt values, then for each read you need to fork off 10 read requests, and each of them touches only 1/10th of the table (which works nicely with HBase's prefix scans).
> >
> > Obviously, if you only need point gets you wouldn't be salting; that would be stupid. If you mostly do range scans, then salting is quite nice.
> >
> > Saying that salting is bad because it does not work for point gets is like saying that bulldozers are bad because you cannot use them on race tracks. :)
> >
> > -- Lars
> >
> > ________________________________
> > From: Michael Segel <[email protected]>
> > To: [email protected]
> > Sent: Tuesday, April 30, 2013 10:06 AM
> > Subject: Re: Read access pattern
> >
> > Sure.
> >
> > By definition, the salt number is a random seed that is not associated with the underlying record.
> > A simple example is a round-robin counter (mod the counter by 10, yielding [0..9]).
> >
> > So you get a record, prepend your salt, and write it out to HBase. The salt will push the data out to a different region.
> >
> > But what happens when you want to read the data?
> >
> > So on a full table scan... no biggie, it's the same.
> >
> > But suppose I want to do a partial table scan. Now I have to do multiple partial scans because I don't know the salt.
> > Or if I want to do a simple get(), I now have to do N get()s, where N is the number of salt values allowed. In my example that's 10.
> >
> > And that's the problem.
> >
> > You are better off doing a hash of the record, using the first couple of bytes of the hash, and then writing the record out.
> > When you want the record, take the key, hash it using the same process, and you have 1 get().
> >
> > You're still screwed on doing a range scan, but you can't have everything.
> >
> > THIS IS WHY I AND MANY CARDIOLOGISTS SAY NO TO SALT. The only difference is that they are talking about excess sodium chloride in your diet. I'm talking about using a salt, aka 'random seed'.
> >
> > Does that make sense?
> >
> > On Apr 30, 2013, at 11:17 AM, Shahab Yunus <[email protected]> wrote:
> >
> >> Well, those are *some* words :) Anyway, can you explain in a bit more detail why you feel so strongly about this design/approach?
> >> The salting here is not the only option mentioned, and static hashing can be used as well. Plus, even in the case of salting, wouldn't the distributed scan take care of it? The downside that I see is the bucket_number that we have to maintain at both read and write time, and update in case of cluster restructuring.
> >>
> >> Thanks,
> >> Shahab
> >>
> >> On Tue, Apr 30, 2013 at 11:57 AM, Michael Segel <[email protected]> wrote:
> >>
> >>> Geez, that's a bad article.
> >>> Never salt.
> >>>
> >>> And yes, there's a difference between using a salt and using the first 2-4 bytes from your MD5 hash.
> >>>
> >>> (Hint: Salts are random. Your hash isn't.)
> >>>
> >>> Sorry to be-itch, but it's a bad idea and it shouldn't be propagated.
> >>>
> >>> On Apr 29, 2013, at 10:17 AM, Shahab Yunus <[email protected]> wrote:
> >>>
> >>>> I think you cannot simply use the scanner to do a range scan here, as your keys are not monotonically increasing. You need to apply logic to decode/reverse the mechanism that you used to hash your keys at the time of writing. You might want to check out the SemaText library, which does distributed scans and seems to handle the scenarios that you want to implement.
> >>>>
> >>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >>>>
> >>>> On Mon, Apr 29, 2013 at 11:03 AM, <[email protected]> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I have a rowkey defined by:
> >>>>>     getMD5AsHex(Bytes.toBytes(myObjectId)) + String.format("%19d\n", (Long.MAX_VALUE - changeDate.getTime()));
> >>>>>
> >>>>> How could I get the previous and next row for a given rowkey?
> >>>>> For instance, I have the following ordered keys:
> >>>>>
> >>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370673172227807
> >>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468022807
> >>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468862807
> >>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674984237807
> >>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674987271807
> >>>>>
> >>>>> If I choose the rowkey 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468862807, what would be the correct scan to get the previous and next key?
> >>>>> The result would be:
> >>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674468022807
> >>>>> 00003db1b6c1e7e7d2ece41ff2184f76*9223370674984237807
> >>>>>
> >>>>> Thank you!
> >>>>> R.
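
For the original question at the bottom of the thread (previous and next row for a given rowkey), here is a minimal sketch of the lookup itself. It does not call the HBase client API: it models the table's sorted row-key order with an in-memory TreeSet, which is an assumption made purely for illustration. The class and method names (RowKeyNeighbours, previousKey, nextKey, exampleKeys) are hypothetical. The keys are the five from the question, and because they are plain ASCII strings of equal length, Java's String ordering agrees with byte-wise row-key ordering here.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical illustration only: a TreeSet stands in for the sorted rows of a table.
class RowKeyNeighbours {

    // Returns the key sorted immediately before rowKey, or null if there is none.
    static String previousKey(TreeSet<String> sortedKeys, String rowKey) {
        return sortedKeys.lower(rowKey);
    }

    // Returns the key sorted immediately after rowKey, or null if there is none.
    static String nextKey(TreeSet<String> sortedKeys, String rowKey) {
        return sortedKeys.higher(rowKey);
    }

    // The five example keys quoted in the question.
    static TreeSet<String> exampleKeys() {
        return new TreeSet<>(Arrays.asList(
            "00003db1b6c1e7e7d2ece41ff2184f76*9223370673172227807",
            "00003db1b6c1e7e7d2ece41ff2184f76*9223370674468022807",
            "00003db1b6c1e7e7d2ece41ff2184f76*9223370674468862807",
            "00003db1b6c1e7e7d2ece41ff2184f76*9223370674984237807",
            "00003db1b6c1e7e7d2ece41ff2184f76*9223370674987271807"));
    }

    public static void main(String[] args) {
        TreeSet<String> keys = exampleKeys();
        String chosen = "00003db1b6c1e7e7d2ece41ff2184f76*9223370674468862807";
        System.out.println("previous: " + previousKey(keys, chosen));
        System.out.println("next:     " + nextKey(keys, chosen));
    }
}
```

Against a real table, the "next" half would presumably be a forward scan that starts at the chosen key and stops after the first later row. The "previous" half needs either a reversed scan, where the HBase version in use supports one, or a bounded forward scan over the same MD5 prefix that keeps the last row seen before the chosen key.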

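The salt-versus-hash distinction argued over in this thread can be sketched in a few lines. This is a sketch under stated assumptions, not anyone's actual implementation: the class KeyPrefixSketch, its method names, the "-" separator, and the choice of two MD5 bytes rendered as four hex characters are all hypothetical. The bucket count of 10 follows the example used in the thread. No HBase calls are made; the code only builds key strings.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the two key-prefixing schemes discussed in the thread.
class KeyPrefixSketch {
    static final int BUCKETS = 10; // number of salt values, as in the thread's example

    // Salted key: the prefix comes from a round-robin counter unrelated to the record.
    static String saltedKey(long counter, String key) {
        return Math.floorMod(counter, (long) BUCKETS) + "-" + key;
    }

    // A reader cannot recompute the salt, so a point lookup has to try every bucket.
    static List<String> candidateKeysForGet(String key) {
        List<String> candidates = new ArrayList<>();
        for (int salt = 0; salt < BUCKETS; salt++) {
            candidates.add(salt + "-" + key);
        }
        return candidates;
    }

    // Hash-prefixed key: the prefix is derived from the key itself (first two MD5 bytes),
    // so a reader can recompute it and issue exactly one lookup.
    static String hashPrefixedKey(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return String.format("%02x%02x", digest[0] & 0xff, digest[1] & 0xff) + "-" + key;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("salted write:   " + saltedKey(13, "row-42"));
        System.out.println("salted read:    " + candidateKeysForGet("row-42"));
        System.out.println("hashed, 1 get:  " + hashPrefixedKey("row-42"));
    }
}
```

Both schemes scatter lexicographically adjacent keys across the table, which is why a range scan over the original key order needs extra work either way. The difference shows up only on point lookups: 10 candidate keys with the salt, one with the hash prefix.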