RE: Is it necessary to set MD5 on rowkey?

bigdata Tue, 17 Dec 2013 00:37:03 -0800

Hello,
@Alex Baranau
Thanks for your salt solution. As I understand it, salting divides the data
into several partitions (with a 2-character hex prefix, 00~FF, that gives 256
partitions). My question: when I want to scan data, do I need to scan 256 times
in the following situation?
rowkey:  salt prefix (00~FF) + date + xxx
And if I want to run MapReduce on this table, is
initTableMapperJob(List<Scan>,...) OK?
An example of scanning a salted table would be appreciated!
Thanks.
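For the key layout above, a date range turns into one range scan per salt prefix, so 256 start/stop row pairs for 00~FF. The sketch below only computes those pairs in plain Java. The names SaltedScanRanges and rangesFor, and the yyyyMMdd date format, are assumptions for illustration and are not part of HBase or HBaseWD. In HBase, each pair would presumably back one Scan, and the resulting list is the kind of input an initTableMapperJob(List<Scan>,...) call would take.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one [startRow, stopRow) pair per salt bucket
// for rowkeys laid out as: salt prefix (00~FF) + date + xxx.
class SaltedScanRanges {

    // buckets: number of salt prefixes in use (256 for 00~FF).
    // startDate / stopDate: assumed to be fixed-width strings such as yyyyMMdd,
    // so that lexicographic order matches chronological order.
    static List<String[]> rangesFor(int buckets, String startDate, String stopDate) {
        if (buckets < 1 || buckets > 256) {
            throw new IllegalArgumentException("buckets must be in 1..256");
        }
        List<String[]> ranges = new ArrayList<>();
        for (int salt = 0; salt < buckets; salt++) {
            // Two upper-case hex characters, matching the 00~FF prefix.
            String prefix = String.format("%02X", salt);
            ranges.add(new String[] { prefix + startDate, prefix + stopDate });
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String[]> ranges = rangesFor(256, "20131201", "20131217");
        System.out.println(ranges.size() + " range scans needed");
        System.out.println("first: " + ranges.get(0)[0] + " .. " + ranges.get(0)[1]);
    }
}
```

Since an HBase stop row is exclusive, the stop date here is the first day not to be read. Using fewer salt buckets reduces the number of scans in the same proportion, at the cost of spreading writes over fewer prefixes.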

> Date: Tue, 18 Dec 2012 12:12:37 -0500
> Subject: Re: Is it necessary to set MD5 on rowkey?
> From: [email protected]
> To: [email protected]
> 
> Hello,
> 
> @Mike:
> 
> I'm the author of that post :).
> 
> Quick reply to your last comment:
> 
> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
> idea" in a more specific way than "Fetching data takes more effort"? That
> would be helpful for anyone who is looking into using this approach.
> 
> 2) The approach described in the post also says you can prefix with the
> hash, you probably missed that.
> 
> 3) I believe your answer, "use MD5 or SHA-1" doesn't help bigdata guy.
> Please re-read the question: the intention is to distribute the load while
> still being able to do "partial key scans". The blog post linked above
> explains one possible solution for that, while your answer doesn't.
> 
> @bigdata:
> 
> Basically, when it comes to solving two issues, distributing writes and
> having the ability to read data sequentially, you have to balance between
> being good at both of them. There is a very good presentation by Lars:
> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
> slide 22. You will see how this is correlated. In short:
> * having an md5/other hash prefix of the key does better w.r.t. distributing
> writes, while compromising the ability to do range scans efficiently
> * having a very limited number of 'salt' prefixes still allows range
> scans (less efficiently than normal range scans, of course, but still good
> enough in many cases) while providing worse distribution of writes
> 
> In the latter case, by choosing the number of possible 'salt' prefixes (which
> could be derived from hashed values, etc.) you can balance between
> efficiently distributing writes and the ability to run fast range scans.
> 
> Hope this helps
> 
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
> 
> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel 
> <[email protected]> wrote:
> 
> >
> > Hi,
> >
> > First, the use of a 'Salt' is a very, very bad idea and I would really
> > hope that the author of that blog takes it down.
> > While it may solve an initial problem in terms of region hot spotting, it
> > creates another problem when it comes to fetching data. Fetching data takes
> > more effort.
> >
> > With respect to using a hash (MD5 or SHA-1) you are creating a more random
> > key that is unique to the record.  Some would argue that with MD5 or SHA-1
> > you could mathematically have a collision; however, you could then
> > append the key to the hash to guarantee uniqueness. You could also do
> > things like take the hash and then truncate it to the first byte and then
> > append the record key. This should give you enough randomness to avoid hot
> > spotting after the initial region completion and you could pre-split out
> > any number of regions. (First byte 0-255 for values, so you can program the
> > split...
> >
> >
> > Having said that... yes, you lose the ability to perform a sequential scan
> > of the data.  At least to a point.  It depends on your schema.
> >
> > Note that you need to think about how you are primarily going to access
> > the data.  You can then determine the best way to store the data to gain
> > the best performance. For some applications... the region hot spotting
> > isn't an important issue.
> >
> > Note YMMV
> >
> > HTH
> >
> > -Mike
> >
> > On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > There is a middle ground between sequential keys (hot spotting risk) and
> > > md5 (heavy scan):
> > >  * you can use composite keys with a field that can segregate data
> > > (hostname, productname, metric name) like OpenTSDB
> > >  * or use a salt with a limited number of values (for example,
> > > substr(md5(rowid),0,1) = 16 values)
> > >    so that a scan is a combination of 16 filters, one on each salt value
> > >    you can base your code on HBaseWD by sematext
> > >
> > >
> > >       http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > >       https://github.com/sematext/HBaseWD
> > >
> > > Cheers,
> > >
> > >
> > > 2012/12/18 bigdata <[email protected]>
> > >
> > >> Many articles tell me that an MD5 rowkey, or an MD5 part of it, is a good
> > >> method to balance the records stored in different parts. But if I want to
> > >> search some sequential rowkey records, such as with the date as the rowkey
> > >> or part of it, I cannot use a rowkey filter to scan a range of date values
> > >> in one pass once the date is hashed with MD5. How to balance this issue?
> > >> Thanks.
> > >>
> > >>
> > >
> > >
> > >
> > >
> > > --
> > > Damien HARDY
> >
> >

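The hash-derived prefixes mentioned in the quoted replies (the first hex character of md5(rowid) for 16 values, or the first byte of the hash for 256 values, followed by the record key) can be sketched in plain Java as below. The names HashPrefixedKey, md5Hex and saltedKey are hypothetical and exist only for this illustration; the hex-string key encoding is also an assumption, since a real table might store the prefix as a raw byte.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: prefixes a record key with the leading hex
// characters of its own MD5 hash, so the prefix is deterministic.
class HashPrefixedKey {

    // Lower-case hex MD5 of the UTF-8 bytes of s.
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // hexDigits = 1 gives 16 possible prefixes, hexDigits = 2 gives 256.
    static String saltedKey(String recordKey, int hexDigits) {
        return md5Hex(recordKey).substring(0, hexDigits) + recordKey;
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("20131217-xxx", 1));
        System.out.println(saltedKey("20131217-xxx", 2));
    }
}
```

Because the prefix is computed from the key itself, a single-row read can recompute it, while a range scan still has to cover every possible prefix value.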