Thank you for the explanation, but I'm a little confused. The key will be monotonically increasing, but the hash of that key will not be.
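As a minimal sketch of that point (Python, for illustration only): the helper name `prefixed_key`, the choice of md5, and the 1,000,000 buckets giving a six-digit prefix are assumptions made for this example, not part of any HBase API.

```python
import hashlib

BUCKETS = 1_000_000  # assumed bucket count; yields a six-digit prefix


def prefixed_key(row_key: str) -> str:
    """Prepend hash(row_key) % BUCKETS, zero-padded, to the original key."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % BUCKETS
    return f"{bucket:06d}_{row_key}"


# Original keys increase monotonically: 1_foobar, 2_foobar, 3_foobar, ...
original_keys = [f"{i}_foobar" for i in range(1, 1001)]
new_keys = [prefixed_key(k) for k in original_keys]

# Print the first few pairs to see where each sequential key ends up.
for old, new in zip(original_keys[:5], new_keys[:5]):
    print(old, "->", new)

# The prefixed keys are not in sorted order even though the originals
# were generated in increasing order.
print("prefixed keys still sorted?", new_keys == sorted(new_keys))
```

Because the prefix is derived from the key itself, a reader can recompute it and still fetch the row with a single get.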
So, even though your original keys may look like:

1_foobar, 2_foobar, 3_foobar

after hashing, they'd look more like:

349000_1_foobar, 999999_2_foobar, 000001_3_foobar

With five regions, the original key ranges for your regions would look something like:

000000-199999, 200000-399999, 400000-599999, 600000-799999, 800000-999999

So let's say you add another row, and it causes a split. Now your regions look like:

000000-199999, 200000-399999, 400000-599999, 600000-799999, 800000-899999, 900000-999999

Since the value that you are prepending to your keys is essentially random, I don't see why your regions would only fill halfway. A new, hashed key would be just as likely to fall within 800000-899999 as it would be to fall within 900000-999999.

Are we working from different assumptions?

On Tue, May 5, 2015 at 4:46 PM, Michael Segel <michael_se...@hotmail.com> wrote:

> Yes, what you described mod(hash(rowkey),n) where n is the number of regions will remove the hotspotting issue.
>
> However, if your key is sequential you will only have regions half full post region split.
>
> Look at it this way…
>
> If I have a key that is a sequential count 1,2,3,4,5 … I am always adding a new row to the last region and it's always being added to the right. (reading left from right.) Always at the end of the line…
>
> So if I have 10,000 rows and I split the region… region 1 has 0 to 4,999 and region 2 has 5000 to 10000.
>
> Now my next row is 10001, the following is 10002 … so they will be added at the tail end of region 2 until it splits. (And so on, and so on…)
>
> If you take a modulus of the hash, you create n buckets. Again for each bucket… I will still be adding a new larger number so it will be added to the right hand side or tail of the list.
>
> Once a region is split… that’s it.
>
> Bucketing will solve the hot spotting issue by creating n lists of rows, but you’re still always adding to the end of the list.
>
> Does that make sense?
>
> > On May 5, 2015, at 10:04 AM, jeremy p <athomewithagroove...@gmail.com> wrote:
> >
> > Thank you for your response!
> >
> > So I guess 'salt' is a bit of a misnomer. What I used to do is this:
> >
> > 1) Say that my key value is something like '1234foobar'
> > 2) I obtain the hash of '1234foobar'. Let's say that's '54824923'
> > 3) I mod the hash by my number of regions. Let's say I have 2000 regions. 54824923 % 2000 = 923
> > 4) I prepend that value to my original key value, so my new key is '923_1234foobar'
> >
> > Is this the same thing you were talking about?
> >
> > A couple questions:
> >
> > * Why would my regions only be 1/2 full?
> > * Why would I only use this for sequential keys? I would think this would give better performance in any situation where I don't need range scans. For example, let's say my key value is a person's last name. That will naturally cluster around certain letters, giving me an uneven distribution.
> >
> > --Jeremy
> >
> > On Sun, May 3, 2015 at 11:46 AM, Michael Segel <michael_se...@hotmail.com> wrote:
> >
> >> Yes, don’t use a salt. Salt implies that your seed is orthogonal (read random) to the base table row key.
> >> You’re better off using a truncated hash (md5 is fastest) so that at least you can use a single get().
> >>
> >> Common?
> >>
> >> Only if your row key is mostly sequential.
> >>
> >> Note that even with bucketing, you will still end up with regions only 1/2 full with the only exception being the last region.
> >>
> >>> On May 1, 2015, at 11:09 AM, jeremy p <athomewithagroove...@gmail.com> wrote:
> >>>
> >>> Hello all,
> >>>
> >>> I've been out of the HBase world for a while, and I'm just now jumping back in.
> >>>
> >>> As of HBase 0.94, it was still common to take a hash of your RowKey and use that to "salt" the beginning of your RowKey to obtain an even distribution among your region servers. Is this still a common practice, or is there a better way to do this in HBase 1.0?
> >>>
> >>> --Jeremy
> >>
> >> The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
> >> Use at your own risk.
> >> Michael Segel
> >> michael_segel (AT) hotmail.com
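For illustration only, the disagreement in this thread can be explored with a toy model. The sketch below is not HBase. It assumes a single initial region, a hypothetical capacity of 1,000 keys per region, a split at the median key whenever a region exceeds that capacity, and an md5-based six-digit prefix as in the example keys above. The names `prefixed_key` and `region_sizes` are made up for this sketch.

```python
import bisect
import hashlib


def prefixed_key(row_key: str) -> str:
    """Hypothetical helper: six-digit md5-derived prefix plus the original key."""
    bucket = int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16) % 1_000_000
    return f"{bucket:06d}_{row_key}"


def region_sizes(keys, capacity=1000):
    """Insert keys in arrival order into range-partitioned regions.

    A region that grows past `capacity` is split at its median key.
    Returns the number of keys held by each region, in key order.
    """
    regions = [[]]   # each region is a sorted list of keys
    starts = [""]    # starts[i] is the lowest key that region i may hold
    for key in keys:
        i = bisect.bisect_right(starts, key) - 1
        bisect.insort(regions[i], key)
        if len(regions[i]) > capacity:
            mid = len(regions[i]) // 2
            left, right = regions[i][:mid], regions[i][mid:]
            regions[i:i + 1] = [left, right]
            starts.insert(i + 1, right[0])
    return [len(r) for r in regions]


# Zero-padded so that string order matches numeric order.
sequential = [f"{i:08d}_foobar" for i in range(10_000)]

seq_sizes = region_sizes(sequential)
hashed_sizes = region_sizes([prefixed_key(k) for k in sequential])

# With sequential keys, every region except the last stays at capacity // 2.
print("sequential:", seq_sizes)
print("hashed:    ", hashed_sizes)
```

In this model the sequential run leaves every region but the last at exactly half capacity, while the hashed run keeps adding rows to both halves of a split region. Whether a real cluster behaves like either run depends on assumptions the model leaves out, such as pre-splitting and how many distinct prefix values are used.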