Hello, yes, you need 256 scan ranges, or a single (almost) full scan combining one filter per salt range with MUST_PASS_ONE (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.Operator.html#MUST_PASS_ONE).
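For example, the single-scan variant could look roughly like this (a sketch against the HBase client API; it assumes a 1-byte salt 0x00..0xFF directly followed by the date bytes, and "mytable" is a placeholder name):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedDateScan {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // placeholder table name

    // One prefix filter per salt bucket; a row passes if it matches ANY of them.
    byte[] date = Bytes.toBytes("20131217"); // the sequential part of the key
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ONE);
    for (int salt = 0; salt < 256; salt++) {
      byte[] prefix = Bytes.add(new byte[] { (byte) salt }, date);
      filters.addFilter(new PrefixFilter(prefix));
    }

    Scan scan = new Scan();
    scan.setFilter(filters);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process result ...
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

The 256-scan-ranges variant is the same loop, but building one Scan per salt with start/stop rows derived from the prefix instead of a filter; it usually beats filtering a full scan because each scan seeks directly to its range.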
For MapReduce, the getSplits() method of TableInputFormatBase should be overridden to handle the salt values. This is what is done in https://github.com/sematext/HBaseWD/blob/master/src/main/java/com/sematext/hbase/wd/WdTableInputFormat.java (to return to the HBaseWD example). That way a mapper (or several, if a salt value covers many regions) is dedicated to each salt value, just as the plain TableInputFormat would do without salt.
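Wiring that into a job looks roughly like this (a sketch based on my reading of the HBaseWD docs; the table name, mapper, and bucket count are placeholders, and the distributor must match the one used at write time):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import com.sematext.hbase.wd.AbstractRowKeyDistributor;
import com.sematext.hbase.wd.RowKeyDistributorByOneBytePrefix;
import com.sematext.hbase.wd.WdTableInputFormat;

public class SaltedTableJob {
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value); // real per-row logic goes here
    }
  }

  public static void main(String[] args) throws Exception {
    // Must be the same distributor (and bucket count) used when writing.
    AbstractRowKeyDistributor keyDistributor =
        new RowKeyDistributorByOneBytePrefix((byte) 32);

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "scan-salted-table");
    Scan scan = new Scan(Bytes.toBytes("20131217"), Bytes.toBytes("20131218"));

    TableMapReduceUtil.initTableMapperJob("mytable", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);

    // Swap out the standard TableInputFormat set by initTableMapperJob,
    // and pass the distributor config so getSplits() can fan out per salt.
    job.setInputFormatClass(WdTableInputFormat.class);
    keyDistributor.addInfo(job.getConfiguration());

    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}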
Best regards,

-- 
Damien

On 17/12/2013 09:36, bigdata wrote:
> Hello,
> @Alex Baranau
> Thanks for your salt solution. In my understanding, the salt solution
> divides the data into several parts (if 2 hex characters, 00~FF, then 256
> parts). My question is: when I want to scan data, do I need to scan 256
> times in the following situation: rowkey = salt prefix (00~FF) + date + xxx?
> And if I want to run MapReduce on this table, is
> initTableMapperJob(List<Scan>, ...) OK?
> An example of scanning the salted table would be appreciated!
> Thanks.
>
>> Date: Tue, 18 Dec 2012 12:12:37 -0500
>> Subject: Re: Is it necessary to set MD5 on rowkey?
>> From: [email protected]
>> To: [email protected]
>>
>> Hello,
>>
>> @Mike:
>>
>> I'm the author of that post :).
>>
>> Quick reply to your last comment:
>>
>> 1) Could you please describe why "the use of a 'Salt' is a very, very bad
>> idea" in a more specific way than "Fetching data takes more effort"? That
>> would be helpful for anyone who is looking into using this approach.
>>
>> 2) The approach described in the post also says you can prefix with the
>> hash; you probably missed that.
>>
>> 3) I believe your answer, "use MD5 or SHA-1", doesn't help bigdata. Please
>> re-read the question: the intention is to distribute the load while still
>> being able to do "partial key scans". The blog post linked above explains
>> one possible solution for that, while your answer doesn't.
>>
>> @bigdata:
>>
>> Basically, when it comes to solving the two issues, distributing writes
>> and keeping the ability to read data sequentially, you have to balance
>> how good you are at each of them. Very good presentation by Lars:
>> http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012,
>> slide 22. You will see how the two are correlated. In short:
>> * an md5/other hash prefix on the key does better at distributing writes,
>> while compromising the ability to do range scans efficiently
>> * a very limited number of 'salt' prefixes still allows range scans (less
>> efficient than normal range scans, of course, but still good enough in
>> many cases) while providing a worse distribution of writes
>>
>> In the latter case, by choosing the number of possible 'salt' prefixes
>> (which could be derived from hashed values, etc.) you can balance between
>> write-distribution efficiency and the ability to run fast range scans.
>>
>> Hope this helps,
>>
>> Alex Baranau
>> ------
>> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
>> Solr
>>
>> On Tue, Dec 18, 2012 at 8:52 AM, Michael Segel
>> <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> First, the use of a 'Salt' is a very, very bad idea and I would really
>>> hope that the author of that blog takes it down.
>>> While it may solve an initial problem in terms of region hot spotting,
>>> it creates another problem when it comes to fetching data: fetching data
>>> takes more effort.
>>>
>>> With respect to using a hash (MD5 or SHA-1), you are creating a more
>>> random key that is unique to the record. Some would argue that with MD5
>>> or SHA-1 you could mathematically have a collision; however, you could
>>> then append the key to the hash to guarantee uniqueness. You could also
>>> do things like take the hash, truncate it to the first byte, and then
>>> append the record key. This should give you enough randomness to avoid
>>> hot spotting after the initial region completes, and you could pre-split
>>> out any number of regions. (The first byte takes values 0-255, so you
>>> can program the split.)
>>>
>>> Having said that... yes, you lose the ability to perform a sequential
>>> scan of the data. At least to a point. It depends on your schema.
>>>
>>> Note that you need to think about how you are primarily going to access
>>> the data. You can then determine the best way to store the data to get
>>> the best performance. For some applications, the region hot spotting
>>> isn't an important issue.
>>>
>>> Note YMMV.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> On Dec 18, 2012, at 3:33 AM, Damien Hardy <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> There is a middle ground between sequential keys (hot-spotting risk)
>>>> and md5 (heavy scans):
>>>> * you can use composite keys with a field that can segregate data
>>>> (hostname, product name, metric name), like OpenTSDB does
>>>> * or use a salt with a limited number of values (for example
>>>> substr(md5(rowid),0,1) = 16 values),
>>>> so that a scan is a combination of 16 filters, one per salt value;
>>>> you can base your code on HBaseWD by Sematext:
>>>>
>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>> https://github.com/sematext/HBaseWD
>>>>
>>>> Cheers,
>>>>
>>>> 2012/12/18 bigdata <[email protected]>
>>>>
>>>>> Many articles tell me that an MD5 rowkey, or MD5 as part of the
>>>>> rowkey, is a good way to balance records across regions. But if I want
>>>>> to search sequential rowkey records, such as keys that are (fully or
>>>>> partially) a date, I cannot use a rowkey filter to scan a range of
>>>>> date values in one pass once the date is MD5-hashed. How do I balance
>>>>> this trade-off?
>>>>> Thanks.
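For completeness, a sketch of the salted-key construction discussed in the quoted messages (Damien's substr(md5(rowid),0,1) suggestion, close to Mike's truncated-hash-plus-key variant; the class and method names here are invented for illustration). Prefixing with the first hex character of the key's MD5 gives 16 buckets, and a date-range scan then becomes 16 ranges, or a 16-filter MUST_PASS_ONE list as shown above:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKeys {
  // Build "salt + originalKey". The salt is derived from the key itself,
  // so readers that know the full original key can recompute it.
  public static byte[] salted(byte[] originalKey) {
    try {
      byte[] md5 = MessageDigest.getInstance("MD5").digest(originalKey);
      // First hex character of the MD5 -> one of 16 buckets ('0'..'f').
      byte salt = (byte) Character.forDigit((md5[0] >> 4) & 0x0f, 16);
      return Bytes.add(new byte[] { salt }, originalKey);
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // MD5 is available on every JVM
    }
  }
}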
