I think there's some hair-splitting going on here. The term "salting,"
strictly defined [0] in the cryptographic context, means introducing
randomness to produce a one-way encoding of a value. The rowkey-design
technique described here introduces no such randomness, nor is it a
one-way operation. Instead, it divides the input more or less evenly
across a fixed number of buckets.
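That bucketing can be sketched concretely; here is a minimal Python
illustration (the bucket count and helper name are mine, not from the
thread):

```python
import hashlib

N_BUCKETS = 16  # fixed bucket count -- an assumption; choose per table

def bucketed_rowkey(key: bytes) -> bytes:
    """Prepend a bucket prefix that is a pure function of the key.

    Because a reader can recompute the prefix from the key alone,
    a single get() suffices -- no scanning, no guessing among prefixes.
    """
    bucket = hashlib.md5(key).digest()[0] % N_BUCKETS
    return bytes([bucket]) + key
```

A truly random salt, by contrast, cannot be recomputed from the key,
which is exactly the distinction at issue here.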
If you've introduced one-way randomness, you have no way to reproduce
the rowkey after it's been inserted. Accessing a specific row would
then be O(n), where n is the number of rows in the table -- you'd have
to scan the table looking for the desired row. A bucketing technique,
on the other hand, adds only a small amount of computation to
recalculate the true rowkey, making access to a single row O(1) + C
with respect to the number of rows in the table.

-n

[0]: http://en.wikipedia.org/wiki/Salt_(cryptography)

On Thu, Dec 20, 2012 at 5:20 AM, Michael Segel <[email protected]> wrote:

> Lars,
>
> Ok... he's talking about buckets.
>
> So when you have N buckets, what is the least number of get()s you
> need to fetch a single row?
> (Hint: The answer is N)
>
> How many scans? (N again)
>
> Do you disagree?
>
>
> On Dec 19, 2012, at 8:06 PM, lars hofhansl <[email protected]> wrote:
>
> > Mike, please think about what you write before you write it.
> > You will most definitely not need a full table scan (much less a
> > *FULL* *TABLE* *SCAN* ;-) ).
> >
> > Read Alex's blog post again; it's a good post (IMHO). He is talking
> > about buckets.
> >
> > -- Lars
> >
> > ________________________________
> > From: Michael Segel <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, December 19, 2012 5:23 PM
> > Subject: Re: Is it necessary to set MD5 on rowkey?
> >
> > Ok,
> >
> > Let's try this one more time...
> >
> > If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order
> > to retrieve the row.
> > If you do something like a salt that uses only a preset of N
> > combinations, you will have to do N get()s in order to fetch the row.
> >
> > This is bad. VERY BAD.
> >
> > If you hash the row, you will get a consistent value each time you
> > hash the key. If you use SHA-1, a collision is mathematically
> > possible, however highly improbable. So people have recommended
> > appending the key to the hash to form the new key.
> > Here, you might as well truncate the hash to just the most
> > significant byte or two and then append the key. This will give you
> > enough of an even distribution that you can avoid hot spotting.
> >
> > So if I use the hash, I can effectively still get the row of data
> > back with a single get(). Otherwise it's a full table scan.
> >
> > Do you see the difference?
> >
> >
> > On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari
> > <[email protected]> wrote:
> >
> >> Hi Mike,
> >>
> >> If in your business case the only thing you need when you retrieve
> >> your data is to do a full scan over MR jobs, then you can salt with
> >> whatever you want: hash, random values, etc.
> >>
> >> If you know you have x regions, then you can simply do a round-robin
> >> salting, or a random salting over those x regions.
> >>
> >> Then when you run your MR job, you discard the first bytes and do
> >> what you want with your data.
> >>
> >> So I also think that salting can still be useful. It all depends on
> >> what you do with your data.
> >>
> >> Just my opinion.
> >>
> >> JM
> >>
> >> 2012/12/19, Michael Segel <[email protected]>:
> >>> Ok...
> >>>
> >>> So you use a random byte or two at the front of the row.
> >>> How do you then use get() to find the row?
> >>> How do you do a partial scan()?
> >>>
> >>> Do you start to see the problem?
> >>> The only way to get to the row is to do a full table scan. That
> >>> kills HBase, and you would be better off going with a partitioned
> >>> Hive table.
> >>>
> >>> Using a hash of the key, or a portion of the hash, is not a salt.
> >>> That's not what I have a problem with. Each time you want to fetch
> >>> the key, you just hash it, truncate the hash, and then prepend it
> >>> to the key. You will then be able to use get().
> >>>
> >>> Using a salt would imply using some form of modulo math to get a
> >>> round-robin prefix, or a random number generator.
> >>>
> >>> That's the issue.
> >>>
> >>> Does that make sense?
> >>>
> >>>
> >>> On Dec 19, 2012, at 3:26 PM, David Arthur <[email protected]> wrote:
> >>>
> >>>> Let's say you want to decompose a url into domain and path to
> >>>> include in your row key.
> >>>>
> >>>> You could of course just use the url as the key, but you will see
> >>>> hotspotting since most will start with "http". To mitigate this,
> >>>> you could add a random byte or two at the beginning (random salt)
> >>>> to improve distribution of keys, but you break single record Gets
> >>>> (and Scans, arguably). Another approach is to use a hash-based
> >>>> salt: hash the whole key and use a few of those bytes as a salt.
> >>>> This fixes Gets, but Scans are still not effective.
> >>>>
> >>>> One approach I've taken is to hash only a part of the key.
> >>>> Consider the following key structure:
> >>>>
> >>>> <2 bytes of hash(domain)><domain><path>
> >>>>
> >>>> With this you get 16 bits for a hash-based salt. The salt is
> >>>> deterministic, so Gets work fine, and for a single domain the salt
> >>>> is the same, so you can easily do Scans across a domain. If you
> >>>> had some further structure to your key that you wished to scan
> >>>> across, you could do something like:
> >>>>
> >>>> <2 bytes of hash(domain)><domain><2 bytes of hash(path)><path>
> >>>>
> >>>> It really boils down to identifying your access patterns and
> >>>> read/write requirements and constructing a row key accordingly.
> >>>>
> >>>> HTH,
> >>>> David
> >>>>
> >>>> On 12/18/12 6:29 PM, Michael Segel wrote:
> >>>>> Alex,
> >>>>> And that's the point. Salt, as you explain it, conceptually
> >>>>> implies that the number you are adding to the key to ensure a
> >>>>> better distribution means that you will have inefficiencies in
> >>>>> terms of scans and gets.
> >>>>>
> >>>>> Using a hash as the full key, or taking the hash, truncating it,
> >>>>> and appending the key may screw up scans, but your get() is intact.
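David Arthur's <2 bytes of hash(domain)><domain><path> layout above can
be sketched as follows (Python; the helper name is illustrative, not
from the thread):

```python
import hashlib

def domain_path_key(domain: bytes, path: bytes) -> bytes:
    """<2 bytes of hash(domain)><domain><path>.

    The 2-byte prefix is deterministic, so a get() for a known URL
    works, and every row for one domain shares the same prefix, so a
    single Scan with that prefix covers the whole domain.
    """
    return hashlib.md5(domain).digest()[:2] + domain + path
```

All paths under one domain sort together, which is what makes the
per-domain Scan cheap.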
> >>>>>
> >>>>> There are other options, like inverting the numeric key...
> >>>>>
> >>>>> And of course doing nothing.
> >>>>>
> >>>>> Using a salt as part of the design pattern is bad.
> >>>>>
> >>>>> With respect to the OP, I was discussing the use of a hash and
> >>>>> some alternatives for how to implement the hash of a key.
> >>>>> Again, doing nothing may also make sense, if you understand the
> >>>>> risks and you know how your data is going to be used.
> >>>>>
> >>>>>
> >>>>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[email protected]> wrote:
> >>>>>
> >>>>>> Mike,
> >>>>>>
> >>>>>> Please read the *full post* before judging. In particular, the
> >>>>>> "Hash-based distribution" section. You can find the same in the
> >>>>>> small HBaseWD README file [1] (not sure if you read it at all
> >>>>>> before commenting on the lib). Round robin is mainly for
> >>>>>> explaining the concept/idea (though not only for that).
> >>>>>>
> >>>>>> Thank you,
> >>>>>> Alex Baranau
> >>>>>> ------
> >>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
> >>>>>> ElasticSearch - Solr
> >>>>>>
> >>>>>> [1] https://github.com/sematext/HBaseWD
> >>>>>>
> >>>>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> Quick answer...
> >>>>>>>
> >>>>>>> Look at the salt.
> >>>>>>> It's just a number from a round-robin counter.
> >>>>>>> There is no tie between the salt and the row.
> >>>>>>>
> >>>>>>> So when you want to fetch a single row, how do you do it?
> >>>>>>> ...
> >>>>>>> ;-)
> >>>>>>>
> >>>>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau
> >>>>>>> <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> @Mike:
> >>>>>>>>
> >>>>>>>> I'm the author of that post :).
> >>>>>>>>
> >>>>>>>> Quick reply to your last comment:
> >>>>>>>>
> >>>>>>>> 1) Could you please describe why "the use of a 'Salt' is a
> >>>>>>>> very, very bad idea" in a more specific way than "Fetching
> >>>>>>>> data takes more effort"? It would be helpful for anyone who is
> >>>>>>>> looking into using this approach.
> >>>>>>>>
> >>>>>>>> 2) The approach described in the post also says you can prefix
> >>>>>>>> with the hash; you probably missed that.
> >>>>>>>>
> >>>>>>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help
> >>>>>>>> bigdata guy. Please re-read the question: the intention is to
> >>>>>>>> distribute the load while still being able to do "partial key
> >>>>>>>> scans". The blog post linked above explains one possible
> >>>>>>>> solution for that, while your answer doesn't.
> >>>>>>>>
> >>>>>>>> @bigdata:
> >>>>>>>>
> >>>>>>>> Basically, when it comes to solving the two issues,
> >>>>>>>> distributing writes and having the ability to read data
> >>>>>>>> sequentially, you have to balance between being good at both
> >>>>>>>> of them. Very good presentation by Lars:
> >>>>>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
> >>>>>>>> slide 22. You will see how these are correlated. In short:
> >>>>>>>> * having an md5/other hash prefix of the key does better
> >>>>>>>> w.r.t. distributing writes, while compromising the ability to
> >>>>>>>> do range scans efficiently
> >>>>>>>> * having a very limited number of 'salt' prefixes still allows
> >>>>>>>> you to do range scans (less efficiently than normal range
> >>>>>>>> scans, of course, but still good enough in many cases) while
> >>>>>>>> providing worse distribution of writes
> >>>>>>>>
> >>>>>>>> In the latter case, by choosing the number of possible 'salt'
> >>>>>>>> prefixes (which could be derived from hashed values, etc.)
> >>>>>>>> you can balance between write-distribution efficiency and the
> >>>>>>>> ability to run fast range scans.
> >>>>>>>>
> >>>>>>>> Hope this helps,
> >>>>>>>>
> >>>>>>>> Alex Baranau
> >>>>>>>> ------
> >>>>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
> >>>>>>>> ElasticSearch - Solr
> >>>>>>>>
> >>>>>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> First, the use of a 'Salt' is a very, very bad idea, and I
> >>>>>>>>> would really hope that the author of that blog takes it down.
> >>>>>>>>> While it may solve an initial problem in terms of region hot
> >>>>>>>>> spotting, it creates another problem when it comes to
> >>>>>>>>> fetching data. Fetching data takes more effort.
> >>>>>>>>>
> >>>>>>>>> With respect to using a hash (MD5 or SHA-1), you are creating
> >>>>>>>>> a more random key that is unique to the record. Some would
> >>>>>>>>> argue that with MD5 or SHA-1 you could mathematically have a
> >>>>>>>>> collision; however, you could then append the key to the hash
> >>>>>>>>> to guarantee uniqueness. You could also do things like take
> >>>>>>>>> the hash, truncate it to the first byte, and then append the
> >>>>>>>>> record key. This should give you enough randomness to avoid
> >>>>>>>>> hot spotting after the initial region completion, and you
> >>>>>>>>> could pre-split out any number of regions. (First byte: 0-255
> >>>>>>>>> values, so you can program the split...)
> >>>>>>>>>
> >>>>>>>>> Having said that... yes, you lose the ability to perform a
> >>>>>>>>> sequential scan of the data. At least to a point. It depends
> >>>>>>>>> on your schema.
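Michael's truncate-the-hash-and-append scheme, together with the 0-255
pre-split he mentions, might look like this (a Python sketch; the
helper names and region count are assumptions, not from the thread):

```python
import hashlib

def hashed_rowkey(key: bytes) -> bytes:
    # First byte of MD5(key), then the key itself: the prefix is
    # deterministic, so get() still works, while writes spread across
    # up to 256 prefixes.
    return hashlib.md5(key).digest()[:1] + key

def presplit_points(n_regions: int) -> list:
    # Evenly spaced single-byte split points over the 0-255 prefix
    # space, for pre-splitting the table into n_regions regions.
    step = 256 // n_regions
    return [bytes([i * step]) for i in range(1, n_regions)]
```

The split points would be handed to table creation so each region
starts with an even share of the prefix space.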
> >>>>>>>>>
> >>>>>>>>> Note that you need to think about how you are primarily
> >>>>>>>>> going to access the data. You can then determine the best way
> >>>>>>>>> to store the data to gain the best performance. For some
> >>>>>>>>> applications, region hot spotting isn't an important issue.
> >>>>>>>>>
> >>>>>>>>> Note YMMV.
> >>>>>>>>>
> >>>>>>>>> HTH
> >>>>>>>>>
> >>>>>>>>> -Mike
> >>>>>>>>>
> >>>>>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> There is a middle ground between sequential keys (hot
> >>>>>>>>>> spotting risk) and md5 (heavy scan):
> >>>>>>>>>> * you can use composed keys with a field that can segregate
> >>>>>>>>>> data (hostname, product name, metric name), like OpenTSDB
> >>>>>>>>>> * or use a salt with a limited number of values (for
> >>>>>>>>>> example, substr(md5(rowid),0,1) = 16 values), so that a scan
> >>>>>>>>>> is a combination of 16 filters, one on each salt value;
> >>>>>>>>>> you can base your code on HBaseWD by Sematext
> >>>>>>>>>>
> >>>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >>>>>>>>>> https://github.com/sematext/HBaseWD
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>>
> >>>>>>>>>> 2012/12/18 bigdata <[email protected]>
> >>>>>>>>>>
> >>>>>>>>>>> Many articles tell me that an MD5 rowkey, or part of one,
> >>>>>>>>>>> is a good method to balance the records stored in different
> >>>>>>>>>>> parts. But if I want to search some sequential rowkey
> >>>>>>>>>>> records, such as a date as rowkey or part of it, I cannot
> >>>>>>>>>>> use a rowkey filter to scan a range of date values at one
> >>>>>>>>>>> time when the date is MD5'd. How to balance this issue?
> >>>>>>>>>>> Thanks.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Damien HARDY
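The limited-salt scheme discussed in the thread (e.g. 16 salt values)
turns one logical range scan into one physical scan per prefix, which
is roughly what HBaseWD automates. A Python sketch with illustrative
names (not HBaseWD's actual API):

```python
def salted_scan_ranges(start: bytes, stop: bytes, n_buckets: int = 16):
    """Fan one logical scan of [start, stop) out to n_buckets prefixed scans.

    The client runs all n_buckets scans and merges the result streams:
    less efficient than one unsalted scan, but far cheaper than a full
    table scan, and writes stay distributed across the buckets.
    """
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(n_buckets)]
```

For a date-keyed table like bigdata's, each (start, stop) pair would
feed one Scan, so the trade-off is 16 scans instead of one, rather
than a full table scan.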
