Re: hbase hashing algorithm and schema design

tsuna Tue, 07 Jun 2011 02:08:25 -0700

On Fri, Jun 3, 2011 at 11:33 AM, Sam Seigal <[email protected]> wrote:
> Thanks for your reply. As I mentioned in the previous email, I prefix the
> key with an "event id" (<eventid> + <timestamp>). However, this event id is
> not going to be evenly distributed or random. According to some research I
> did into the data I receive in my system over the last 6 months, 40% of the
> data comes from one event id, then 25% from the other and so on. Since the
> data distribution is skewed, will the region servers holding the regions
> for the "hot" event keys become overloaded for writes? If this happens, is
> splitting the regions going to solve the problem?


Whether or not the skewed distribution will be a problem depends on
how much write load you put on your cluster and how much write
capacity your cluster gives you.

You should test whether your cluster can handle your incoming traffic
under your actual workload.
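Before running that test, a rough back-of-the-envelope check can tell
you how much of the write load the hottest event id would put on a
single server. The sketch below is only an illustration: the 40% and
25% shares come from your email, while the remaining share, the total
write rate, the per-server capacity, and the function names are made-up
assumptions. It is not a substitute for measuring your real cluster.

```python
def hottest_prefix_load(shares_pct, total_writes_per_sec):
    """Return (event_id, writes/sec) for the event id with the largest share.

    shares_pct maps event id -> percentage of all writes (hypothetical input).
    """
    event_id = max(shares_pct, key=shares_pct.get)
    return event_id, total_writes_per_sec * shares_pct[event_id] / 100


def fits_on_one_server(shares_pct, total_writes_per_sec, server_capacity_per_sec):
    """True if the hottest event id alone stays within one server's capacity."""
    _, load = hottest_prefix_load(shares_pct, total_writes_per_sec)
    return load <= server_capacity_per_sec


if __name__ == "__main__":
    # 40 and 25 are from the thread; "rest", 10000 and 5000 are assumptions.
    shares = {"evt1": 40, "evt2": 25, "rest": 35}
    print(hottest_prefix_load(shares, 10000))
    print(fits_on_one_server(shares, 10000, 5000))
```

If the hottest event id alone exceeds what one server can absorb, moving
regions around will not be enough on its own, because all current writes
for that event id go to a single region.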

If your cluster is unable to handle your write load, splitting regions
further won't help.  Within each prefix the keys are ordered by
timestamp, so new rows always land in the "last" region of that
prefix, and with 12 prefixes the same 12 regions will always be the
ones accepting writes.  What you can do, however, is manually move
regions around so that the "hot" regions are evenly spread across all
your physical servers.  You can do this from the HBase shell.  In the
future, when HBase does proper load balancing of regions, this won't
be necessary anymore.
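To make the point about the "last" regions concrete, here is a small
simulation, not HBase code.  It assumes row keys of the form <eventid>
followed by a zero-padded timestamp, models regions as ranges between
sorted split keys, and uses four hypothetical event ids instead of 12.
The helper names, the split points, the server names, and the 20%/15%
shares are all made up for illustration.  The second half sketches one
possible way (a simple greedy assignment) to plan which server each hot
region could be moved to; the actual move would still be done by hand.

```python
from bisect import bisect_right
import heapq


def row_key(event_id, timestamp):
    # Assumed key layout: <eventid> + zero-padded <timestamp>.
    return f"{event_id}{timestamp:010d}"


def region_of(split_keys, key):
    # Region i covers [split_keys[i-1], split_keys[i]); region 0 has no start key.
    return bisect_right(split_keys, key)


def hot_regions(split_keys, event_ids, now):
    """Map each event id to the region that receives its writes at time `now`."""
    return {e: region_of(split_keys, row_key(e, now)) for e in event_ids}


def spread(hot_load, servers):
    """Greedy plan: heaviest region first, onto the least-loaded server so far."""
    heap = [(0.0, s) for s in servers]
    heapq.heapify(heap)
    plan = {}
    for region, load in sorted(hot_load.items(), key=lambda kv: -kv[1]):
        total, server = heapq.heappop(heap)
        plan[region] = server
        heapq.heappush(heap, (total + load, server))
    return plan


if __name__ == "__main__":
    event_ids = ["A", "B", "C", "D"]
    splits = sorted(row_key(e, t) for e in event_ids for t in (100, 200))
    hot = hot_regions(splits, event_ids, now=300)
    print(hot)  # one hot region per event id

    # Splitting A's hot range again still leaves exactly one hot region for A.
    more_splits = sorted(splits + [row_key("A", 250)])
    print(hot_regions(more_splits, event_ids, now=300))

    # Shares: 0.40 and 0.25 are from the thread, the rest are assumptions.
    share = {"A": 0.40, "B": 0.25, "C": 0.20, "D": 0.15}
    print(spread({hot[e]: share[e] for e in event_ids}, ["rs1", "rs2"]))
```

However many times a prefix is split, only the region holding its newest
timestamps takes writes, so the number of hot regions stays equal to
the number of prefixes.  Spreading them helps only as long as no single
hot region exceeds what one server can handle.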

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
