Hello,

@Alex Baranau: Thanks for your salt solution. As I understand it, salting divides the data into several partitions (with a two-character prefix, 00~FF, the data is divided into 256 parts). My question: when I want to scan the data, do I need to scan 256 times in the following situation?

rowkey: salt prefix (00~FF) + date + xxx

And if I want to run MapReduce on this table, is initTableMapperJob(List<Scan>, ...) OK? An example of scanning the salted table would be appreciated!

Thanks.

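For illustration, here is a minimal sketch of what "scan once per salt value" means for the rowkey layout above. It only computes the start/stop rows; the class and method names (SaltedScanRanges, Range, build) are hypothetical and not part of HBase or HBaseWD, and it assumes the salt is written as two upper-case hex characters and the date is encoded so that string order matches time order (for example yyyyMMdd).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an HBase or HBaseWD class): builds one
// [startRow, stopRow) pair per salt prefix for a given date range.
class SaltedScanRanges {

    /** One scan range: start row inclusive, stop row exclusive. */
    static final class Range {
        final String startRow;
        final String stopRow;

        Range(String startRow, String stopRow) {
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    /**
     * Assumes rowkey = two-character hex salt (00..FF) + date + rest.
     * Returns one range per salt bucket, so a date-range read over a
     * table salted into 256 buckets becomes 256 range scans.
     */
    static List<Range> build(String startDate, String stopDate, int buckets) {
        List<Range> ranges = new ArrayList<>();
        for (int salt = 0; salt < buckets; salt++) {
            String prefix = String.format("%02X", salt); // 0 -> "00", 255 -> "FF"
            ranges.add(new Range(prefix + startDate, prefix + stopDate));
        }
        return ranges;
    }
}
```

Each range would then become one Scan with that start and stop row, and the resulting list is the kind of input initTableMapperJob(List<Scan>, ...) takes. Rows come back sorted within each bucket but not across buckets, so a globally ordered result needs a merge step on the client side.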
> Date: Tue, 18 Dec 2012 12:12:37 -0500
> Subject: Re: Is it necessary to set MD5 on rowkey?
> From: [email protected]
> To: [email protected]
>
> Hello,
>
> @Mike:
>
> I'm the author of that post :).
>
> Quick reply to your last comment:
>
> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
> idea" in more specific way than "Fetching data takes more effort". Would be
> helpful for anyone who is looking into using this approach.
>
> 2) The approach described in the post also says you can prefix with the
> hash, you probably missed that.
>
> 3) I believe your answer, "use MD5 or SHA-1" doesn't help bigdata guy.
> Please re-read the question: the intention is to distribute the load while
> still being able to do "partial key scans". The blog post linked above
> explains one possible solution for that, while your answer doesn't.
>
> @bigdata:
>
> Basically when it comes to solving two issues: distributing writes and
> having ability to read data sequentially, you have to balance between being
> good at both of them. Very good presentation by Lars:
> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
> slide 22. You will see how this is correlated. In short:
> * having md5/other hash prefix of the key does better w.r.t. distributing
>   writes, while compromises ability to do range scans efficiently
> * having very limited number of 'salt' prefixes still allows to do range
>   scans (less efficiently than normal range scans, of course, but still good
>   enough in many cases) while providing worse distribution of writes
>
> In the latter case by choosing number of possible 'salt' prefixes (which
> could be derived from hashed values, etc.) you can balance between
> distributing writes efficiency and ability to run fast range scans.
>
> Hope this helps
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel
> <[email protected]> wrote:
>
> > Hi,
> >
> > First, the use of a 'Salt' is a very, very bad idea and I would really
> > hope that the author of that blog take it down.
> > While it may solve an initial problem in terms of region hot spotting, it
> > creates another problem when it comes to fetching data. Fetching data takes
> > more effort.
> >
> > With respect to using a hash (MD5 or SHA-1) you are creating a more random
> > key that is unique to the record. Some would argue that using MD5 or SHA-1
> > that mathematically you could have a collision, however you could then
> > append the key to the hash to guarantee uniqueness. You could also do
> > things like take the hash and then truncate it to the first byte and then
> > append the record key. This should give you enough randomness to avoid hot
> > spotting after the initial region completion and you could pre-split out
> > any number of regions. (First byte 0-255 for values, so you can program the
> > split...
> >
> >
> > Having said that... yes, you lose the ability to perform a sequential scan
> > of the data. At least to a point. It depends on your schema.
> >
> > Note that you need to think about how you are primarily going to access
> > the data. You can then determine the best way to store the data to gain
> > the best performance. For some applications... the region hot spotting
> > isn't an important issue.
> >
> > Note YMMV
> >
> > HTH
> >
> > -Mike
> >
> > On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > There is middle term between sequential keys (hot spotting risk) and md5
> > > (heavy scan):
> > > * you can use composed keys with a field that can segregate data
> > >   (hostname, productname, metric name) like OpenTSDB
> > > * or use Salt with a limited number of values (example
> > >   substr(md5(rowid),0,1) = 16 values)
> > >   so that a scan is a combination of 16 filters on each salt values
> > >   you can base your code on HBaseWD by sematext
> > >
> > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > > https://github.com/sematext/HBaseWD
> > >
> > > Cheers,
> > >
> > >
> > > 2012/12/18 bigdata <[email protected]>
> > >
> > >> Many articles tell me that MD5 rowkey or part of it is good method to
> > >> balance the records stored in different parts. But If I want to search some
> > >> sequential rowkey records, such as date as rowkey or partially. I can not
> > >> use rowkey filter to scan a range of date value one time on the date by
> > >> MD5. How to balance this issue?
> > >> Thanks.
> > >
> > > --
> > > Damien HARDY
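
The limited-salt scheme suggested earlier in the thread, substr(md5(rowid),0,1) giving 16 values, can be sketched with plain JDK classes roughly as follows. The class and method names (Md5Salt, saltFor, saltedKey) are hypothetical, and the sketch assumes the row id is a string hashed as UTF-8; it is not taken from HBaseWD.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper illustrating substr(md5(rowid), 0, 1) as a salt.
class Md5Salt {

    /** Returns the first hex character (0-9, a-f) of md5(rowId): 16 possible salts. */
    static char saltFor(String rowId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowId.getBytes(StandardCharsets.UTF_8));
            // The high nibble of the first byte is the first hex digit of the hash.
            return Character.forDigit((digest[0] >> 4) & 0x0F, 16);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Prepends the salt so that writes spread over 16 key ranges. */
    static String saltedKey(String rowId) {
        return saltFor(rowId) + rowId;
    }
}
```

For example, saltedKey("abc") gives "9abc", because the MD5 of "abc" starts with 9. Since the salt depends on the whole row id, a reader cannot predict it from a date range alone, which is why a range read has to cover all 16 prefixes, as described above.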
