follow up question on row key schema design

Sam Seigal Thu, 02 Jun 2011 17:34:15 -0700

Hi,

I have not been able to find information on the algorithm that decides which
region a particular row belongs to in an HBase cluster. Does the algorithm
take into account the number of physical nodes? Where can I find more
details about it?


I went through the HBase book and the OpenTSDB schema examples on schema
definitions and problems with monotonically increasing row keys, and had a
follow up question.

I want to be able to query on ranges of time in HBase. Following the
OpenTSDB example, I have the following row key format:

<eventid> - <yyyy-mm-dd>

My eventId can be one of 12 distinct values (say, A through L), and I
have a 4-node cluster running HBase right now.
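
To make the key format concrete, here is a minimal sketch of how I picture
building the keys and the start/stop keys for a time-range query. The helper
names (make_row_key, scan_range) are my own and not part of any HBase API,
and I am assuming a single "-" separator with no surrounding spaces and a
scan whose stop key is exclusive.

```python
from datetime import date


def make_row_key(event_id: str, day: date) -> bytes:
    """Build a '<eventid>-<yyyy-mm-dd>' row key (hypothetical helper)."""
    # ISO dates sort lexicographically in chronological order, so all
    # rows for one event id are contiguous and ordered by day.
    return f"{event_id}-{day.isoformat()}".encode("ascii")


def scan_range(event_id: str, start_day: date, end_day: date):
    """Return (start_key, stop_key) for one event id.

    Assumption: start is inclusive and stop is exclusive.
    """
    return make_row_key(event_id, start_day), make_row_key(event_id, end_day)


# Byte-wise sorting groups keys by event id first, then by date.
keys = sorted(
    make_row_key(e, d)
    for e in "BA"
    for d in (date(2011, 6, 2), date(2011, 5, 31))
)

start_key, stop_key = scan_range("A", date(2011, 6, 1), date(2011, 7, 1))
june_rows_for_a = [k for k in keys if start_key <= k < stop_key]
```

With this layout every key for event id "A" sorts next to the others, which
is what makes the range query cheap and also what concentrates the "A" data
in one part of the key space.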

After doing some research in our OLTP database, I found that the largest
share (about 45%) of the data written there over the last 6 months has an
event ID equal to "A".

I believe this implies that some regions in the cluster (i.e. the regions
holding row keys starting with "A") are going to grow faster than others,
and will eventually either split automatically or have to be split manually.
Should this be a concern at this point? In any case, I do not expect these
event IDs to be equally distributed in volume. Will read performance suffer
in this case?

Also, when a region is split, how does HBase decide which partition a
particular key goes to? I feel that knowing more about the algorithm would
help me design the schema better.

Your help is appreciated.

Thank you.

Sam
