Hi Joey,

Thanks for your reply. As I mentioned in the previous email, I prefix the key with an "event id" (<eventid> + <timestamp>). However, this event id is not evenly distributed or random. From some research I did into the data my system received over the last 6 months, 40% of the data comes from one event id, another 25% from a second, and so on. Since the data distribution is skewed, will the region servers holding the regions for the "hot" event keys become overloaded with writes? And if that happens, will splitting those regions solve the problem?
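To make sure I understand your hash-prefix suggestion, here is the sort of salting I have in mind, as a minimal plain-Java sketch. The bucket count (8), the "|" separator, and the key layout are all my own assumptions for illustration, not anything HBase prescribes:

```java
// Sketch: salt row keys so writes for one hot <eventid> fan out across
// several key ranges (and therefore several region servers).
public class SaltedKey {
    static final int BUCKETS = 8; // assumption: roughly the cluster size

    // Prefix the logical key with a deterministic hash bucket. Because the
    // hash covers the full key (event id + timestamp), even rows sharing a
    // single hot event id spread over BUCKETS distinct key ranges.
    static String rowKey(String eventId, String timestamp) {
        String logical = eventId + "|" + timestamp;
        int bucket = Math.floorMod(logical.hashCode(), BUCKETS);
        return String.format("%02d|%s", bucket, logical);
    }

    public static void main(String[] args) {
        // Two consecutive timestamps for the same event id usually land in
        // different buckets, so no single region takes all the writes.
        System.out.println(rowKey("A", "2011-06-03T05:27:00"));
        System.out.println(rowKey("A", "2011-06-03T05:27:01"));
    }
}
```

The fixed-width `%02d` prefix keeps the salted keys sorting cleanly, which matters for the range scans later.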
Thank you,

Sam

On Fri, Jun 3, 2011 at 5:27 AM, Joey Echeverria <[email protected]> wrote:
> Rows are split into regions of contiguous row keys. Each region is
> assigned to a physical server (region server) that hosts queries and
> updates to rows in that region. Currently, the assignment process is
> random and only balances the number of regions assigned to each server.
>
> The problem with largely sequential key inserts is that they will go to
> the region hosting the end of the key space. That makes this region
> server a potential bottleneck. If you want to improve write performance,
> you can prefix each key with a hash of the key. The downside is that
> sequential scans now have to be performed with multiple scanners and
> re-ordered client side.
>
> -Joey
>
> On Jun 3, 2011, at 3:35, Sam Seigal <[email protected]> wrote:
>
> > Hi,
> >
> > I am not able to find information regarding the algorithm that decides
> > which region a particular row belongs to in an HBase cluster. Does the
> > algorithm take into account the number of physical nodes? Where can I
> > find more details about it?
> >
> > I went through the HBase book and the OpenTSDB schema examples on
> > schema definitions and problems with monotonically increasing row
> > keys, and had a follow-up question.
> >
> > I want to be able to query on ranges of time in HBase. Following the
> > OpenTSDB example, I have the following row key format:
> >
> > <eventid> - <yyyy-mm-dd>
> >
> > My eventId can be one of 12 distinct values (say A-L), and I have a
> > 4-node cluster running HBase right now. However, these event id values
> > are not evenly distributed. I believe this implies that some of the
> > regions in the cluster are going to grow faster than others, and will
> > eventually either split automatically or have to be split manually.
> > Should this be a concern at this point? How does HBase decide which
> > partition a particular key will go to?
> >
> > I feel that knowing more about the algorithm would help me design the
> > schema better.
> >
> > Your help is appreciated.
> >
> > Thank you.
> >
> > Sam
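P.S. To make sure I understand the read-side cost of the hash-prefix approach you described (multiple scanners, re-ordered client side), here is how I picture the range query working. This is plain Java with a TreeMap standing in for the table; the 8-bucket count and the `<bucket>|<eventid>|<timestamp>` key layout are my assumptions, not HBase API:

```java
import java.util.*;

// Sketch of the read side of salted keys: with a bucket prefix on every
// row key, one time-range query for a single event id becomes one scan
// per bucket, with the results merged and re-sorted on the client.
public class SaltedScan {
    static final int BUCKETS = 8; // must match the bucket count used on write

    static List<String> scanEvent(NavigableMap<String, String> table,
                                  String eventId, String from, String to) {
        List<String> rows = new ArrayList<>();
        // One "scanner" (subMap range) per bucket prefix.
        for (int b = 0; b < BUCKETS; b++) {
            String start = String.format("%02d|%s|%s", b, eventId, from);
            String stop  = String.format("%02d|%s|%s", b, eventId, to);
            rows.addAll(table.subMap(start, true, stop, true).keySet());
        }
        // Re-order client side by the logical key (strip the 3-char salt).
        rows.sort(Comparator.comparing((String k) -> k.substring(3)));
        return rows;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("03|A|2011-06-01", "");
        table.put("00|A|2011-06-02", "");
        table.put("05|B|2011-06-01", "");
        // Returns both "A" rows in timestamp order despite different buckets,
        // and skips the "B" row entirely.
        System.out.println(scanEvent(table, "A", "2011-06-01", "2011-06-30"));
    }
}
```

If that is roughly right, the extra cost is one scan per bucket plus a client-side merge, which seems like a fair trade against the write hotspot.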
