I was studying the OpenTSDB example, where they also prefix the row keys with 
an event id. 

I further modified my row keys to this:

<eventid><uuid><yyyy-mm-dd>

The uuid is fairly unique and random.
Does appending a uuid to the event id help the distribution? 
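For what it's worth, here is a minimal sketch of how I build the key (plain Java, assuming string-encoded components; `build` is a hypothetical helper of mine, not an HBase API). Note that since the event id still leads the key, the uuid only randomizes ordering *within* that event's key range, not across events:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class RowKey {
    // Hypothetical helper: concatenates <eventid><uuid><yyyy-mm-dd> into the
    // byte[] row key. All three parts are UTF-8 encoded strings here.
    static byte[] build(String eventId, UUID uuid, String day) {
        byte[] e = eventId.getBytes(StandardCharsets.UTF_8);
        byte[] u = uuid.toString().getBytes(StandardCharsets.UTF_8);
        byte[] d = day.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[e.length + u.length + d.length];
        System.arraycopy(e, 0, key, 0, e.length);
        System.arraycopy(u, 0, key, e.length, u.length);
        System.arraycopy(d, 0, key, e.length + u.length, d.length);
        return key;
    }

    public static void main(String[] args) {
        byte[] k = build("event01", UUID.randomUUID(), "2011-06-07");
        System.out.println(new String(k, StandardCharsets.UTF_8));
    }
}
```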

Let us say I have 4 region servers to start off with and I start the 
workload. How does HBase decide how many regions it is going to create, and 
which key is going to go into which region? 
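My understanding (please correct me if wrong) is that a table starts as a single region and splits once it grows past the configured size, and that a row is routed to the region whose [startKey, endKey) range contains its key, by lexicographic byte comparison. A rough sketch of that lookup, with made-up split points:

```java
import java.util.Arrays;
import java.util.List;

public class RegionLookup {
    // Regions are described by their sorted start keys; the empty string is
    // the first region's start key. A row belongs to the last region whose
    // start key is <= the row key (lexicographic order), which mirrors how
    // HBase routes a row key to a region.
    static int regionFor(String rowKey, List<String> startKeys) {
        int idx = 0;
        for (int i = 0; i < startKeys.size(); i++) {
            if (startKeys.get(i).compareTo(rowKey) <= 0) {
                idx = i;
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        // Hypothetical split points after the table has split twice.
        List<String> starts = Arrays.asList("", "event05", "event09");
        System.out.println(regionFor("event03x", starts)); // region 0
        System.out.println(regionFor("event07x", starts)); // region 1
        System.out.println(regionFor("event11x", starts)); // region 2
    }
}
```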

I could have gone with something like 

<uuid><eventid><yyyy-mm-dd>

but I would not like to, since my queries are always going to be against a 
particular event id, and I would like those rows to be stored contiguously. 



----- Original Message ----
From: tsuna <[email protected]>
To: [email protected]
Cc: [email protected]
Sent: Tue, June 7, 2011 2:07:16 AM
Subject: Re: hbase hashing algorithm and schema design

On Fri, Jun 3, 2011 at 11:33 AM, Sam Seigal <[email protected]> wrote:
> Thanks for your reply. As I mentioned in the previous email, I prefix the
> key with an "event id" (<eventid> + <timestamp>). However, this event id is
> not going to be evenly distributed or random. According to some research I
> did into the data I receive in my system over the last 6 months, 40% of the
> data comes from one event id, then 25% from the other and so on. Since the
> data distribution is skewed, will the regions servers holding the regions
> for the "hot" event keys become overloaded for writes ? If this happens, is
> splitting the regions going to solve the problem ?

Whether or not the skewed distribution will be a problem depends on
how much write load you put on your cluster and how much write
capacity your cluster gives you.

You should test and see whether with your cluster and your workload,
you can handle your incoming traffic.

If your cluster is unable to handle your write load, splitting regions
further won't help, since the "last" 12 regions will always be the
ones accepting writes (since you have 12 prefixes).  What you can do,
however, is to manually move regions around such that the "hot"
regions are evenly spread across all your physical servers.  You can
do this from the HBase shell.  In the future, when HBase does proper
load balancing of regions, this won't be necessary anymore.
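For reference, moving a region by hand looks roughly like this in the shell (the encoded region name and server name below are placeholders you would read off the master UI or the shell's `status` output):

```
hbase> move 'ENCODED_REGION_NAME', 'HOSTNAME,PORT,STARTCODE'
```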

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
