Thank you for the explanation, but I'm a little confused.  The key will be
monotonically increasing, but the hash of that key will not be.

So, even though your original keys may look like : 1_foobar, 2_foobar,
3_foobar
After the hashing, they'd look more like : 349000_1_foobar,
999999_2_foobar, 000001_3_foobar
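
To make sure we mean the same thing, here is a rough sketch of the kind of
prefixing I have in mind.  The function name, the choice of md5, and the
six-digit prefix space are only my assumptions for illustration, and the
prefixes it produces will not match the made-up ones above.

```python
import hashlib


def hashed_key(row_key: str, space: int = 1_000_000) -> str:
    """Prepend a zero-padded hash-derived prefix to row_key (illustrative only)."""
    # md5 is used purely as a cheap, deterministic hash here, not for security.
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    prefix = int(digest, 16) % space
    # Pad to the width of the largest possible prefix so keys sort lexically.
    width = len(str(space - 1))
    return f"{prefix:0{width}d}_{row_key}"


for key in ("1_foobar", "2_foobar", "3_foobar"):
    print(hashed_key(key))
```

Because the prefix is computed from the key itself, a reader can recompute it
and still fetch the row with a single get.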

With five regions, the original key ranges for your regions would look
something like : 000000-199999, 200000-399999, 400000-599999,
600000-799999, 800000-999999

So let's say you add another row.  It causes a split.  Now your regions
look like :  000000-199999, 200000-399999, 400000-599999, 600000-799999,
800000-899999, 900000-999999

Since the value that you are prepending to your keys is essentially random,
I don't see why your regions would only fill halfway.  A new, hashed key
would be just as likely to fall within 800000-899999 as it would be to fall
within 900000-999999.
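
A quick way to check my reasoning is to hash a run of sequential keys and
count how many land in each half of the split region.  This is only a sketch
under my own assumptions (md5 as the hash, a six-digit prefix space, keys of
the form N_foobar); it is not how HBase itself assigns rows.

```python
import hashlib


def hash_prefix(row_key: str, space: int = 1_000_000) -> int:
    """Map a row key to a numeric prefix in [0, space) (illustrative only)."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % space


def count_in_ranges(num_keys, ranges):
    """Count how many sequential keys hash into each inclusive (lo, hi) range."""
    counts = [0] * len(ranges)
    for i in range(1, num_keys + 1):
        prefix = hash_prefix(f"{i}_foobar")
        for idx, (lo, hi) in enumerate(ranges):
            if lo <= prefix <= hi:
                counts[idx] += 1
                break
    return counts


# The two halves of the region that split in the example above.
lower, upper = count_in_ranges(
    20_000, [(800_000, 899_999), (900_000, 999_999)]
)
print("800000-899999:", lower)
print("900000-999999:", upper)
```

If the prefix behaves like a uniform hash, the two counts should come out
close to each other, which is why I would expect both halves to keep filling.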

Are we working from different assumptions?

On Tue, May 5, 2015 at 4:46 PM, Michael Segel <michael_se...@hotmail.com>
wrote:

> Yes, what you described  mod(hash(rowkey),n) where n is the number of
> regions will remove the hotspotting issue.
>
> However, if your key is sequential you will only have regions half full
> post region split.
>
> Look at it this way…
>
> If I have a key that is a sequential count 1,2,3,4,5 … I am always adding
> a new row to the last region and it's always being added to the right
> (reading left to right). Always at the end of the line…
>
> So if I have 10,000 rows and I split the region… region 1 has 0 to 4,999
> and region 2 has 5000 to 10000.
>
> Now my next row is 10001, the following is 10002 … so they will be added
> at the tail end of region 2 until it splits.  (And so on, and so on…)
>
> If you take a modulus of the hash, you create n buckets. Again for each
> bucket… I will still be adding a new larger number so it will be added to
> the right hand side or tail of the list.
>
> Once a region is split… that’s it.
>
> Bucketing will solve the hot spotting issue by creating n lists of rows,
> but you’re still always adding to the end of the list.
>
> Does that make sense?
>
>
> > On May 5, 2015, at 10:04 AM, jeremy p <athomewithagroove...@gmail.com> wrote:
> >
> > Thank you for your response!
> >
> > So I guess 'salt' is a bit of a misnomer.  What I used to do is this :
> >
> > 1) Say that my key value is something like '1234foobar'
> > 2) I obtain the hash of '1234foobar'.  Let's say that's '54824923'
> > 3) I mod the hash by my number of regions.  Let's say I have 2000
> > regions.  54824923 % 2000 = 923
> > 4) I prepend that value to my original key value, so my new key is
> > '923_1234foobar'
> >
> > Is this the same thing you were talking about?
> >
> > A couple questions :
> >
> > * Why would my regions only be 1/2 full?
> > * Why would I only use this for sequential keys?  I would think this
> > would give better performance in any situation where I don't need range
> > scans.  For example, let's say my key value is a person's last name.
> > That will naturally cluster around certain letters, giving me an uneven
> > distribution.
> >
> > --Jeremy
> >
> >
> >
> > On Sun, May 3, 2015 at 11:46 AM, Michael Segel <michael_se...@hotmail.com> wrote:
> >
> >> Yes, don’t use a salt. Salt implies that your seed is orthogonal (read
> >> random) to the base table row key.
> >> You’re better off using a truncated hash (md5 is fastest) so that at
> >> least you can use a single get().
> >>
> >> Common?
> >>
> >> Only if your row key is mostly sequential.
> >>
> >> Note that even with bucketing, you will still end up with regions only
> >> 1/2 full with the only exception being the last region.
> >>
> >>> On May 1, 2015, at 11:09 AM, jeremy p <athomewithagroove...@gmail.com> wrote:
> >>>
> >>> Hello all,
> >>>
> >>> I've been out of the HBase world for a while, and I'm just now jumping
> >>> back in.
> >>>
> >>> As of HBase .94, it was still common to take a hash of your RowKey and
> >>> use that to "salt" the beginning of your RowKey to obtain an even
> >>> distribution among your region servers.  Is this still a common
> >>> practice, or is there a better way to do this in HBase 1.0?
> >>>
> >>> --Jeremy
> >>
> >> The opinions expressed here are mine, while they may reflect a cognitive
> >> thought, that is purely accidental.
> >> Use at your own risk.
> >> Michael Segel
> >> michael_segel (AT) hotmail.com
