I was studying the OpenTSDB example, where they also prefix the row keys with an event id.
I further modified my row keys to have this form: <eventid><uuid><yyyy-mm-dd>. The uuid is fairly unique and random. Does appending a uuid to the event id help the distribution? Say I have 4 region servers to start off with and I start the workload: how does HBase decide how many regions it is going to create, and which key is going to go into which region? I could have gone with something like <uuid><eventid><yyyy-mm-dd>, but would rather not, since my queries are always going to be against a particular event id, and I would like those rows to be physically co-located.

----- Original Message ----
From: tsuna <[email protected]>
To: [email protected]
Cc: [email protected]
Sent: Tue, June 7, 2011 2:07:16 AM
Subject: Re: hbase hashing algorithm and schema design

On Fri, Jun 3, 2011 at 11:33 AM, Sam Seigal <[email protected]> wrote:
> Thanks for your reply. As I mentioned in the previous email, I prefix the
> key with an "event id" (<eventid> + <timestamp>). However, this event id is
> not going to be evenly distributed or random. According to some research I
> did into the data I receive in my system over the last 6 months, 40% of the
> data comes from one event id, 25% from the next, and so on. Since the
> data distribution is skewed, will the region servers holding the regions
> for the "hot" event keys become overloaded for writes? If this happens, is
> splitting the regions going to solve the problem?

Whether or not the skewed distribution will be a problem depends on how much write load you put on your cluster and how much write capacity your cluster gives you. You should test and see whether, with your cluster and your workload, you can handle your incoming traffic. If your cluster is unable to handle your write load, splitting regions further won't help, since the "last" 12 regions will always be the ones accepting writes (since you have 12 prefixes).
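The <eventid><uuid><yyyy-mm-dd> layout from the question above can be sketched as a small helper. This is an illustrative assumption, not code from the thread: the class name, method name, and fixed component widths are my own. The idea it demonstrates is that putting the event id first keeps one event id's rows contiguous for scans, while the random UUID after it spreads writes across that event id's key range instead of hot-spotting on the date.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class EventRowKey {
    // Hypothetical helper: concatenates <eventid><uuid><yyyy-mm-dd>
    // into a byte[] row key. Event id leads, so a prefix scan on the
    // event id returns all of its rows; the UUID randomizes placement
    // within that prefix's key range.
    public static byte[] build(String eventId, UUID uuid, String isoDate) {
        String key = eventId + uuid.toString() + isoDate;
        return key.getBytes(StandardCharsets.UTF_8);
    }
}
```

Note that with a random UUID in the middle, rows for one event id are no longer ordered by date, so date-range queries within an event id would need a filter or a different layout.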
What you can do, however, is manually move regions around such that the "hot" regions are evenly spread across all your physical servers. You can do this from the HBase shell. In the future, when HBase does proper load balancing of regions, this won't be necessary anymore.

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
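For reference, moving a region by hand from the HBase shell looks roughly like the following. The exact syntax can vary by HBase version, and ENCODED_REGIONNAME and the 'host,port,startcode' server name are placeholders you would read off the master web UI or the shell's detailed status output:

```
hbase> move 'ENCODED_REGIONNAME', 'host,port,startcode'
```

This only relocates the region to a chosen server; it does not change which regions receive the skewed write traffic.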
