I think there's some hair-splitting going on here. The term "salting,"
strictly defined [0] in the cryptographic context, means introducing
randomness to produce a one-way encoding of a value. The rowkey-design
technique described here introduces no such randomness, nor is it a
one-way operation. Instead, it divides the input more or less evenly
across a fixed number of buckets.
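That bucketing can be sketched concretely; here is a minimal Python
illustration (the bucket count and helper name are mine, not from the
thread):

```python
import hashlib

N_BUCKETS = 16  # fixed bucket count -- an assumption; choose per table

def bucketed_rowkey(key: bytes) -> bytes:
    """Prepend a bucket prefix that is a pure function of the key.

    Because a reader can recompute the prefix from the key alone,
    a single get() suffices -- no scanning, no guessing among prefixes.
    """
    bucket = hashlib.md5(key).digest()[0] % N_BUCKETS
    return bytes([bucket]) + key
```

A truly random salt, by contrast, cannot be recomputed from the key,
which is exactly the distinction at issue here.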
If you've introduced one-way randomness, you have no way to reproduce
the rowkey after it's been inserted. Accessing a specific row would
then be O(n), where n is the number of rows in the table -- you'd have
to scan the table looking for the desired row. A bucketing technique,
on the other hand, adds only a small amount of computation to
recalculate the true rowkey, making access to a single row O(1) + C
with respect to the number of rows in the table.

-n

[0]: http://en.wikipedia.org/wiki/Salt_(cryptography)

On Thu, Dec 20, 2012 at 5:20 AM, Michael Segel <[email protected]> wrote:

> Lars,
>
> Ok... he's talking about buckets.
>
> So when you have N buckets, what is the least number of get()s you
> need to fetch a single row?
> (Hint: The answer is N)
>
> How many scans? (N again)
>
> Do you disagree?
>
>
> On Dec 19, 2012, at 8:06 PM, lars hofhansl <[email protected]> wrote:
>
> > Mike, please think about what you write before you write it.
> > You will most definitely not need a full table scan (much less a
> > *FULL* *TABLE* *SCAN* ;-) ).
> >
> > Read Alex's blog post again; it's a good post (IMHO). He is talking
> > about buckets.
> >
> > -- Lars
> >
> > ________________________________
> > From: Michael Segel <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, December 19, 2012 5:23 PM
> > Subject: Re: Is it necessary to set MD5 on rowkey?
> >
> > Ok,
> >
> > Let's try this one more time...
> >
> > If you salt, you will have to do a *FULL* *TABLE* *SCAN* in order
> > to retrieve the row.
> > If you do something like a salt that uses only a preset of N
> > combinations, you will have to do N get()s in order to fetch the row.
> >
> > This is bad. VERY BAD.
> >
> > If you hash the row, you will get a consistent value each time you
> > hash the key. If you use SHA-1, a collision is mathematically
> > possible, however highly improbable. So people have recommended
> > appending the key to the hash to form the new key.
> > Here, you might as well truncate the hash to just the most
> > significant byte or two and then append the key. This will give you
> > enough of an even distribution that you can avoid hot spotting.
> >
> > So if I use the hash, I can effectively still get the row of data
> > back with a single get(). Otherwise it's a full table scan.
> >
> > Do you see the difference?
> >
> >
> > On Dec 19, 2012, at 7:11 PM, Jean-Marc Spaggiari
> > <[email protected]> wrote:
> >
> >> Hi Mike,
> >>
> >> If in your business case the only thing you need when you retrieve
> >> your data is to do a full scan over MR jobs, then you can salt with
> >> whatever you want: hash, random values, etc.
> >>
> >> If you know you have x regions, then you can simply do a round-robin
> >> salting, or a random salting over those x regions.
> >>
> >> Then when you run your MR job, you discard the first bytes and do
> >> what you want with your data.
> >>
> >> So I also think that salting can still be useful. It all depends on
> >> what you do with your data.
> >>
> >> Just my opinion.
> >>
> >> JM
> >>
> >> 2012/12/19, Michael Segel <[email protected]>:
> >>> Ok...
> >>>
> >>> So you use a random byte or two at the front of the row.
> >>> How do you then use get() to find the row?
> >>> How do you do a partial scan()?
> >>>
> >>> Do you start to see the problem?
> >>> The only way to get to the row is to do a full table scan. That
> >>> kills HBase, and you would be better off going with a partitioned
> >>> Hive table.
> >>>
> >>> Using a hash of the key, or a portion of the hash, is not a salt.
> >>> That's not what I have a problem with. Each time you want to fetch
> >>> the key, you just hash it, truncate the hash, and then prepend it
> >>> to the key. You will then be able to use get().
> >>>
> >>> Using a salt would imply using some form of modulo math to get a
> >>> round-robin prefix, or a random number generator.
> >>>
> >>> That's the issue.
> >>>
> >>> Does that make sense?
> >>>
> >>>
> >>> On Dec 19, 2012, at 3:26 PM, David Arthur <[email protected]> wrote:
> >>>
> >>>> Let's say you want to decompose a url into domain and path to
> >>>> include in your row key.
> >>>>
> >>>> You could of course just use the url as the key, but you will see
> >>>> hotspotting since most will start with "http". To mitigate this,
> >>>> you could add a random byte or two at the beginning (random salt)
> >>>> to improve distribution of keys, but you break single record Gets
> >>>> (and Scans, arguably). Another approach is to use a hash-based
> >>>> salt: hash the whole key and use a few of those bytes as a salt.
> >>>> This fixes Gets, but Scans are still not effective.
> >>>>
> >>>> One approach I've taken is to hash only a part of the key.
> >>>> Consider the following key structure:
> >>>>
> >>>> <2 bytes of hash(domain)><domain><path>
> >>>>
> >>>> With this you get 16 bits for a hash-based salt. The salt is
> >>>> deterministic, so Gets work fine, and for a single domain the salt
> >>>> is the same, so you can easily do Scans across a domain. If you
> >>>> had some further structure to your key that you wished to scan
> >>>> across, you could do something like:
> >>>>
> >>>> <2 bytes of hash(domain)><domain><2 bytes of hash(path)><path>
> >>>>
> >>>> It really boils down to identifying your access patterns and
> >>>> read/write requirements and constructing a row key accordingly.
> >>>>
> >>>> HTH,
> >>>> David
> >>>>
> >>>> On 12/18/12 6:29 PM, Michael Segel wrote:
> >>>>> Alex,
> >>>>> And that's the point. Salt, as you explain it, conceptually
> >>>>> implies that the number you are adding to the key to ensure a
> >>>>> better distribution means that you will have inefficiencies in
> >>>>> terms of scans and gets.
> >>>>>
> >>>>> Using a hash as the full key, or taking the hash, truncating it,
> >>>>> and appending the key may screw up scans, but your get() is intact.
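David Arthur's <2 bytes of hash(domain)><domain><path> layout above can
be sketched as follows (Python; the helper name is illustrative, not
from the thread):

```python
import hashlib

def domain_path_key(domain: bytes, path: bytes) -> bytes:
    """<2 bytes of hash(domain)><domain><path>.

    The 2-byte prefix is deterministic, so a get() for a known URL
    works, and every row for one domain shares the same prefix, so a
    single Scan with that prefix covers the whole domain.
    """
    return hashlib.md5(domain).digest()[:2] + domain + path
```

All paths under one domain sort together, which is what makes the
per-domain Scan cheap.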
> >>>>>
> >>>>> There are other options, like inverting the numeric key...
> >>>>>
> >>>>> And of course doing nothing.
> >>>>>
> >>>>> Using a salt as part of the design pattern is bad.
> >>>>>
> >>>>> With respect to the OP, I was discussing the use of a hash and
> >>>>> some alternatives for how to implement the hash of a key.
> >>>>> Again, doing nothing may also make sense, if you understand the
> >>>>> risks and you know how your data is going to be used.
> >>>>>
> >>>>>
> >>>>> On Dec 18, 2012, at 11:36 AM, Alex Baranau <[email protected]> wrote:
> >>>>>
> >>>>>> Mike,
> >>>>>>
> >>>>>> Please read the *full post* before judging. In particular, the
> >>>>>> "Hash-based distribution" section. You can find the same in the
> >>>>>> small HBaseWD README file [1] (not sure if you read it at all
> >>>>>> before commenting on the lib). Round robin is mainly for
> >>>>>> explaining the concept/idea (though not only for that).
> >>>>>>
> >>>>>> Thank you,
> >>>>>> Alex Baranau
> >>>>>> ------
> >>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
> >>>>>> ElasticSearch - Solr
> >>>>>>
> >>>>>> [1] https://github.com/sematext/HBaseWD
> >>>>>>
> >>>>>> On Tue, Dec 18, 2012 at 12:24 PM, Michael Segel
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> Quick answer...
> >>>>>>>
> >>>>>>> Look at the salt.
> >>>>>>> It's just a number from a round-robin counter.
> >>>>>>> There is no tie between the salt and the row.
> >>>>>>>
> >>>>>>> So when you want to fetch a single row, how do you do it?
> >>>>>>> ...
> >>>>>>> ;-)
> >>>>>>>
> >>>>>>> On Dec 18, 2012, at 11:12 AM, Alex Baranau
> >>>>>>> <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> @Mike:
> >>>>>>>>
> >>>>>>>> I'm the author of that post :).
> >>>>>>>>
> >>>>>>>> Quick reply to your last comment:
> >>>>>>>>
> >>>>>>>> 1) Could you please describe why "the use of a 'Salt' is a
> >>>>>>>> very, very bad idea" in a more specific way than "Fetching
> >>>>>>>> data takes more effort"? It would be helpful for anyone who is
> >>>>>>>> looking into using this approach.
> >>>>>>>>
> >>>>>>>> 2) The approach described in the post also says you can prefix
> >>>>>>>> with the hash; you probably missed that.
> >>>>>>>>
> >>>>>>>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help
> >>>>>>>> bigdata guy. Please re-read the question: the intention is to
> >>>>>>>> distribute the load while still being able to do "partial key
> >>>>>>>> scans". The blog post linked above explains one possible
> >>>>>>>> solution for that, while your answer doesn't.
> >>>>>>>>
> >>>>>>>> @bigdata:
> >>>>>>>>
> >>>>>>>> Basically, when it comes to solving the two issues,
> >>>>>>>> distributing writes and having the ability to read data
> >>>>>>>> sequentially, you have to balance between being good at both
> >>>>>>>> of them. Very good presentation by Lars:
> >>>>>>>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
> >>>>>>>> slide 22. You will see how these are correlated. In short:
> >>>>>>>> * having an md5/other hash prefix of the key does better
> >>>>>>>> w.r.t. distributing writes, while compromising the ability to
> >>>>>>>> do range scans efficiently
> >>>>>>>> * having a very limited number of 'salt' prefixes still allows
> >>>>>>>> you to do range scans (less efficiently than normal range
> >>>>>>>> scans, of course, but still good enough in many cases) while
> >>>>>>>> providing worse distribution of writes
> >>>>>>>>
> >>>>>>>> In the latter case, by choosing the number of possible 'salt'
> >>>>>>>> prefixes (which could be derived from hashed values, etc.)
> >>>>>>>> you can balance between write-distribution efficiency and the
> >>>>>>>> ability to run fast range scans.
> >>>>>>>>
> >>>>>>>> Hope this helps,
> >>>>>>>>
> >>>>>>>> Alex Baranau
> >>>>>>>> ------
> >>>>>>>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
> >>>>>>>> ElasticSearch - Solr
> >>>>>>>>
> >>>>>>>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> First, the use of a 'Salt' is a very, very bad idea, and I
> >>>>>>>>> would really hope that the author of that blog takes it down.
> >>>>>>>>> While it may solve an initial problem in terms of region hot
> >>>>>>>>> spotting, it creates another problem when it comes to
> >>>>>>>>> fetching data. Fetching data takes more effort.
> >>>>>>>>>
> >>>>>>>>> With respect to using a hash (MD5 or SHA-1), you are creating
> >>>>>>>>> a more random key that is unique to the record. Some would
> >>>>>>>>> argue that with MD5 or SHA-1 you could mathematically have a
> >>>>>>>>> collision; however, you could then append the key to the hash
> >>>>>>>>> to guarantee uniqueness. You could also do things like take
> >>>>>>>>> the hash, truncate it to the first byte, and then append the
> >>>>>>>>> record key. This should give you enough randomness to avoid
> >>>>>>>>> hot spotting after the initial region completion, and you
> >>>>>>>>> could pre-split out any number of regions. (First byte: 0-255
> >>>>>>>>> values, so you can program the split...)
> >>>>>>>>>
> >>>>>>>>> Having said that... yes, you lose the ability to perform a
> >>>>>>>>> sequential scan of the data. At least to a point. It depends
> >>>>>>>>> on your schema.
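Michael's truncate-the-hash-and-append scheme, together with the 0-255
pre-split he mentions, might look like this (a Python sketch; the
helper names and region count are assumptions, not from the thread):

```python
import hashlib

def hashed_rowkey(key: bytes) -> bytes:
    # First byte of MD5(key), then the key itself: the prefix is
    # deterministic, so get() still works, while writes spread across
    # up to 256 prefixes.
    return hashlib.md5(key).digest()[:1] + key

def presplit_points(n_regions: int) -> list:
    # Evenly spaced single-byte split points over the 0-255 prefix
    # space, for pre-splitting the table into n_regions regions.
    step = 256 // n_regions
    return [bytes([i * step]) for i in range(1, n_regions)]
```

The split points would be handed to table creation so each region
starts with an even share of the prefix space.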
> >>>>>>>>>
> >>>>>>>>> Note that you need to think about how you are primarily
> >>>>>>>>> going to access the data. You can then determine the best way
> >>>>>>>>> to store the data to gain the best performance. For some
> >>>>>>>>> applications, region hot spotting isn't an important issue.
> >>>>>>>>>
> >>>>>>>>> Note YMMV.
> >>>>>>>>>
> >>>>>>>>> HTH
> >>>>>>>>>
> >>>>>>>>> -Mike
> >>>>>>>>>
> >>>>>>>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> There is a middle ground between sequential keys (hot
> >>>>>>>>>> spotting risk) and md5 (heavy scan):
> >>>>>>>>>> * you can use composed keys with a field that can segregate
> >>>>>>>>>> data (hostname, product name, metric name), like OpenTSDB
> >>>>>>>>>> * or use a salt with a limited number of values (for
> >>>>>>>>>> example, substr(md5(rowid),0,1) = 16 values), so that a scan
> >>>>>>>>>> is a combination of 16 filters, one on each salt value;
> >>>>>>>>>> you can base your code on HBaseWD by Sematext
> >>>>>>>>>>
> >>>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >>>>>>>>>> https://github.com/sematext/HBaseWD
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>>
> >>>>>>>>>> 2012/12/18 bigdata <[email protected]>
> >>>>>>>>>>
> >>>>>>>>>>> Many articles tell me that an MD5 rowkey, or part of one,
> >>>>>>>>>>> is a good method to balance the records stored in different
> >>>>>>>>>>> parts. But if I want to search some sequential rowkey
> >>>>>>>>>>> records, such as a date as rowkey or part of it, I cannot
> >>>>>>>>>>> use a rowkey filter to scan a range of date values at one
> >>>>>>>>>>> time when the date is MD5'd. How to balance this issue?
> >>>>>>>>>>> Thanks.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Damien HARDY
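The limited-salt scheme discussed in the thread (e.g. 16 salt values)
turns one logical range scan into one physical scan per prefix, which
is roughly what HBaseWD automates. A Python sketch with illustrative
names (not HBaseWD's actual API):

```python
def salted_scan_ranges(start: bytes, stop: bytes, n_buckets: int = 16):
    """Fan one logical scan of [start, stop) out to n_buckets prefixed scans.

    The client runs all n_buckets scans and merges the result streams:
    less efficient than one unsalted scan, but far cheaper than a full
    table scan, and writes stay distributed across the buckets.
    """
    return [(bytes([b]) + start, bytes([b]) + stop) for b in range(n_buckets)]
```

For a date-keyed table like bigdata's, each (start, stop) pair would
feed one Scan, so the trade-off is 16 scans instead of one, rather
than a full table scan.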
