Hi, I have not been able to find information about the algorithm that decides which region a particular row belongs to in an HBase cluster. Does the algorithm take the number of physical nodes into account? Where can I find more details about it?
I went through the HBase book and the OpenTSDB schema examples, covering schema definitions and the problems with monotonically increasing row keys, and I have a follow-up question.

I want to be able to query on ranges of time in HBase. Following the OpenTSDB example, I have the following row key format: <eventid> - <yyyy-mm-dd>

My event id can be one of 12 distinct values (let us say A through L), and I have a 4-node cluster running HBase right now. After doing some research in our OLTP database, I found that the largest share (about 45%) of the data written there in the last 6 months has the event id "A".

I believe this implies that some of the regions in the cluster (i.e. the regions responsible for holding the row keys starting with "A") are going to grow faster than others, and will eventually either split automatically or have to be split manually. Should this be a concern at this point? In any case, I do not expect these event ids to be equally distributed in volume. Will read performance suffer in this case?

Also, when a region is split, how does HBase decide which of the resulting regions a particular key goes to? I feel that knowing more details about the algorithm would help me design the schema better.

Your help is appreciated. Thank you.
Sam
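
To make the key layout concrete, here is a minimal sketch in plain Python (no HBase client involved). It assumes the key is the event id, a "-" separator with no surrounding spaces, and the date, all encoded as bytes. It also assumes what the HBase book describes: each region covers a contiguous range of row keys, start key inclusive and end key exclusive, compared byte by byte. The function names and the split points B, E, H are hypothetical and chosen only for illustration; a real table's boundaries depend on how it was pre-split and on the splits that have happened since.

```python
from bisect import bisect_right


def make_row_key(event_id: str, day: str) -> bytes:
    """Compose '<eventid>-<yyyy-mm-dd>' as bytes (separator is an assumption)."""
    return f"{event_id}-{day}".encode("utf-8")


def scan_bounds(event_id: str, start_day: str, stop_day: str) -> tuple:
    """Start (inclusive) and stop (exclusive) keys for a time-range scan
    over a single event id."""
    return make_row_key(event_id, start_day), make_row_key(event_id, stop_day)


def region_index(row_key: bytes, split_points: list) -> int:
    """Index of the key range that contains row_key.

    split_points must be sorted. Range 0 is everything below the first
    split point; range i (i >= 1) starts at split_points[i - 1], inclusive.
    Python compares bytes lexicographically, byte by byte.
    """
    return bisect_right(split_points, row_key)


# Hypothetical split points giving four key ranges.
splits = [b"B", b"E", b"H"]

for event_id in ("A", "B", "F", "L"):
    key = make_row_key(event_id, "2013-01-15")
    print(key, "-> range", region_index(key, splits))
```

In this sketch the lookup depends only on the sorted key boundaries, not on the number of nodes. With these made-up split points, every key starting with "A" falls into range 0, which is the skew described above.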
