Hello, yes, you need 256 scan ranges, or a single (almost) full scan combining one filter per salt range with MUST_PASS_ONE (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.Operator.html#MUST_PASS_ONE).
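For example, the single-scan variant could look roughly like this (a sketch against the HBase client API; it assumes a 1-byte salt 0x00..0xFF directly followed by the date bytes, and "mytable" is a placeholder name):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedDateScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // placeholder table name

    // One prefix filter per salt bucket; a row passes if it matches ANY of them.
    byte[] date = Bytes.toBytes("20131217"); // the sequential part of the key
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ONE);
    for (int salt = 0; salt < 256; salt++) {
      byte[] prefix = Bytes.add(new byte[] { (byte) salt }, date);
      filters.addFilter(new PrefixFilter(prefix));
    }

    Scan scan = new Scan();
    scan.setFilter(filters);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process result ...
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

The 256-scan-ranges variant is the same loop, but building one Scan per salt with start/stop rows derived from the prefix instead of a filter; it usually beats filtering a full scan because each scan seeks directly to its range.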
For MapReduce, the getSplits() method of TableInputFormatBase should be overridden to handle the salt values. This is what is done in https://github.com/sematext/HBaseWD/blob/master/src/main/java/com/sematext/hbase/wd/WdTableInputFormat.java (to return to the HBaseWD example). That way a mapper (or several, if a salt value covers many regions) is dedicated to each salt value, just as the plain TableInputFormat would do without salt.
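Wiring that into a job looks roughly like this (a sketch based on my reading of the HBaseWD docs; the table name, mapper, and bucket count are placeholders, and the distributor must match the one used at write time):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import com.sematext.hbase.wd.AbstractRowKeyDistributor;
import com.sematext.hbase.wd.RowKeyDistributorByOneBytePrefix;
import com.sematext.hbase.wd.WdTableInputFormat;

public class SaltedTableJob {
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value); // real per-row logic goes here
    }
  }

  public static void main(String[] args) throws Exception {
    // Must be the same distributor (and bucket count) used when writing.
    AbstractRowKeyDistributor keyDistributor =
        new RowKeyDistributorByOneBytePrefix((byte) 32);

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-salted-table");
    Scan scan = new Scan(Bytes.toBytes("20131217"), Bytes.toBytes("20131218"));

    TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);

    // Swap out the standard TableInputFormat set by initTableMapperJob,
    // and pass the distributor config so getSplits() can fan out per salt.
    job.setInputFormatClass(WdTableInputFormat.class);
    keyDistributor.addInfo(job.getConfiguration());

    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}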
Best regards,

-- 
Damien

On 17/12/2013 09:36, bigdata wrote:
> Hello,
> @Alex Baranau
> Thanks for your salt solution. In my understanding, the salt solution
> divides the data into several parts (if 2 hex characters, 00~FF, then 256
> parts). My question is: when I want to scan data, do I need to scan 256
> times in the following situation: rowkey = salt prefix (00~FF) + date + xxx?
> And if I want to run MapReduce on this table, is
> initTableMapperJob(List<Scan>, ...) OK?
> An example of scanning the salted table would be appreciated!
> Thanks.
>
>> Date: Tue, 18 Dec 2012 12:12:37 -0500
>> Subject: Re: Is it necessary to set MD5 on rowkey?
>> From: [email protected]
>> To: [email protected]
>>
>> Hello,
>>
>> @Mike:
>>
>> I'm the author of that post :).
>>
>> Quick reply to your last comment:
>>
>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>> idea" in a more specific way than "Fetching data takes more effort"? That
>> would be helpful for anyone who is looking into using this approach.
>>
>> 2) The approach described in the post also says you can prefix with the
>> hash; you probably missed that.
>>
>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help bigdata. Please
>> re-read the question: the intention is to distribute the load while still
>> being able to do "partial key scans". The blog post linked above explains
>> one possible solution for that, while your answer doesn't.
>>
>> @bigdata:
>>
>> Basically, when it comes to solving the two issues, distributing writes
>> and keeping the ability to read data sequentially, you have to balance
>> how good you are at each of them. Very good presentation by Lars:
>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
>> slide 22. You will see how the two are correlated. In short:
>> * an md5/other hash prefix on the key does better at distributing writes,
>> while compromising the ability to do range scans efficiently
>> * a very limited number of 'salt' prefixes still allows range scans (less
>> efficient than normal range scans, of course, but still good enough in
>> many cases) while providing a worse distribution of writes
>>
>> In the latter case, by choosing the number of possible 'salt' prefixes
>> (which could be derived from hashed values, etc.) you can balance between
>> write-distribution efficiency and the ability to run fast range scans.
>>
>> Hope this helps,
>>
>> Alex Baranau
>> ------
>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>> Solr
>>
>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel
>> <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> First, the use of a 'Salt' is a very, very bad idea and I would really
>>> hope that the author of that blog takes it down.
>>> While it may solve an initial problem in terms of region hot spotting,
>>> it creates another problem when it comes to fetching data: fetching data
>>> takes more effort.
>>>
>>> With respect to using a hash (MD5 or SHA-1), you are creating a more
>>> random key that is unique to the record. Some would argue that with MD5
>>> or SHA-1 you could mathematically have a collision; however, you could
>>> then append the key to the hash to guarantee uniqueness. You could also
>>> do things like take the hash, truncate it to the first byte, and then
>>> append the record key. This should give you enough randomness to avoid
>>> hot spotting after the initial region completes, and you could pre-split
>>> out any number of regions. (The first byte takes values 0-255, so you
>>> can program the split.)
>>>
>>> Having said that... yes, you lose the ability to perform a sequential
>>> scan of the data. At least to a point. It depends on your schema.
>>>
>>> Note that you need to think about how you are primarily going to access
>>> the data. You can then determine the best way to store the data to get
>>> the best performance. For some applications, the region hot spotting
>>> isn't an important issue.
>>>
>>> Note YMMV.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> There is a middle ground between sequential keys (hot-spotting risk)
>>>> and md5 (heavy scans):
>>>> * you can use composite keys with a field that can segregate data
>>>> (hostname, product name, metric name), like OpenTSDB does
>>>> * or use a salt with a limited number of values (for example
>>>> substr(md5(rowid),0,1) = 16 values),
>>>> so that a scan is a combination of 16 filters, one per salt value;
>>>> you can base your code on HBaseWD by Sematext:
>>>>
>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>> https://github.com/sematext/HBaseWD
>>>>
>>>> Cheers,
>>>>
>>>> 2012/12/18 bigdata <[email protected]>
>>>>
>>>>> Many articles tell me that an MD5 rowkey, or MD5 as part of the
>>>>> rowkey, is a good way to balance records across regions. But if I want
>>>>> to search sequential rowkey records, such as keys that are (fully or
>>>>> partially) a date, I cannot use a rowkey filter to scan a range of
>>>>> date values in one pass once the date is MD5-hashed. How do I balance
>>>>> this trade-off?
>>>>> Thanks.
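For completeness, a sketch of the salted-key construction discussed in the quoted messages (Damien's substr(md5(rowid),0,1) suggestion, close to Mike's truncated-hash-plus-key variant; the class and method names here are invented for illustration). Prefixing with the first hex character of the key's MD5 gives 16 buckets, and a date-range scan then becomes 16 ranges, or a 16-filter MUST_PASS_ONE list as shown above:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeys {
  // Build "salt + originalKey". The salt is derived from the key itself,
  // so readers that know the full original key can recompute it.
  public static byte[] salted(byte[] originalKey) {
    try {
      byte[] md5 = MessageDigest.getInstance("MD5").digest(originalKey);
      // First hex character of the MD5 -> one of 16 buckets ('0'..'f').
      byte salt = (byte) Character.forDigit((md5[0] >> 4) & 0x0f, 16);
      return Bytes.add(new byte[] { salt }, originalKey);
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // MD5 is available on every JVM
    }
  }
}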
