Hi Joey,

Thanks for your reply. As I mentioned in the previous email, I prefix the key with an "event id" (<eventid> + <timestamp>). However, this event id is not evenly distributed or random. From some research I did into the data my system received over the last 6 months, 40% of the data comes from one event id, another 25% from a second, and so on. Since the data distribution is skewed, will the region servers holding the regions for the "hot" event keys become overloaded with writes? And if that happens, will splitting those regions solve the problem?
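To make sure I understand your hash-prefix suggestion, here is the sort of salting I have in mind, as a minimal plain-Java sketch. The bucket count (8), the "|" separator, and the key layout are all my own assumptions for illustration, not anything HBase prescribes:

```java
// Sketch: salt row keys so writes for one hot <eventid> fan out across
// several key ranges (and therefore several region servers).
public class SaltedKey {
    static final int BUCKETS = 8; // assumption: roughly the cluster size

    // Prefix the logical key with a deterministic hash bucket. Because the
    // hash covers the full key (event id + timestamp), even rows sharing a
    // single hot event id spread over BUCKETS distinct key ranges.
    static String rowKey(String eventId, String timestamp) {
        String logical = eventId + "|" + timestamp;
        int bucket = Math.floorMod(logical.hashCode(), BUCKETS);
        return String.format("%02d|%s", bucket, logical);
    }

    public static void main(String[] args) {
        // Two consecutive timestamps for the same event id usually land in
        // different buckets, so no single region takes all the writes.
        System.out.println(rowKey("A", "2011-06-03T05:27:00"));
        System.out.println(rowKey("A", "2011-06-03T05:27:01"));
    }
}
```

The fixed-width `%02d` prefix keeps the salted keys sorting cleanly, which matters for the range scans later.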
Thank you,

Sam

On Fri, Jun 3, 2011 at 5:27 AM, Joey Echeverria <[email protected]> wrote:
> Rows are split into regions of contiguous row keys. Each region is
> assigned to a physical server (region server) that hosts queries and
> updates to rows in that region. Currently, the assignment process is
> random and only balances the number of regions assigned to each server.
>
> The problem with largely sequential key inserts is that they will go to
> the region hosting the end of the key space. That makes this region
> server a potential bottleneck. If you want to improve write performance,
> you can prefix each key with a hash of the key. The downside is that
> sequential scans now have to be performed with multiple scanners and
> re-ordered client side.
>
> -Joey
>
> On Jun 3, 2011, at 3:35, Sam Seigal <[email protected]> wrote:
>
> > Hi,
> >
> > I am not able to find information regarding the algorithm that decides
> > which region a particular row belongs to in an HBase cluster. Does the
> > algorithm take into account the number of physical nodes? Where can I
> > find more details about it?
> >
> > I went through the HBase book and the OpenTSDB schema examples on
> > schema definitions and problems with monotonically increasing row
> > keys, and had a follow-up question.
> >
> > I want to be able to query on ranges of time in HBase. Following the
> > OpenTSDB example, I have the following row key format:
> >
> > <eventid> - <yyyy-mm-dd>
> >
> > My eventId can be one of 12 distinct values (say A-L), and I have a
> > 4-node cluster running HBase right now. However, these event id values
> > are not evenly distributed. I believe this implies that some of the
> > regions in the cluster are going to grow faster than others, and will
> > eventually either split automatically or have to be split manually.
> > Should this be a concern at this point? How does HBase decide which
> > partition a particular key will go to?
> >
> > I feel that knowing more about the algorithm would help me design the
> > schema better.
> >
> > Your help is appreciated.
> >
> > Thank you.
> >
> > Sam
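P.S. To make sure I understand the read-side cost of the hash-prefix approach you described (multiple scanners, re-ordered client side), here is how I picture the range query working. This is plain Java with a TreeMap standing in for the table; the 8-bucket count and the `<bucket>|<eventid>|<timestamp>` key layout are my assumptions, not HBase API:

```java
import java.util.*;

// Sketch of the read side of salted keys: with a bucket prefix on every
// row key, one time-range query for a single event id becomes one scan
// per bucket, with the results merged and re-sorted on the client.
public class SaltedScan {
    static final int BUCKETS = 8; // must match the bucket count used on write

    static List<String> scanEvent(NavigableMap<String, String> table,
                                  String eventId, String from, String to) {
        List<String> rows = new ArrayList<>();
        // One "scanner" (subMap range) per bucket prefix.
        for (int b = 0; b < BUCKETS; b++) {
            String start = String.format("%02d|%s|%s", b, eventId, from);
            String stop  = String.format("%02d|%s|%s", b, eventId, to);
            rows.addAll(table.subMap(start, true, stop, true).keySet());
        }
        // Re-order client side by the logical key (strip the 3-char salt).
        rows.sort(Comparator.comparing((String k) -> k.substring(3)));
        return rows;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("03|A|2011-06-01", "");
        table.put("00|A|2011-06-02", "");
        table.put("05|B|2011-06-01", "");
        // Returns both "A" rows in timestamp order despite different buckets,
        // and skips the "B" row entirely.
        System.out.println(scanEvent(table, "A", "2011-06-01", "2011-06-30"));
    }
}
```

If that is roughly right, the extra cost is one scan per bucket plus a client-side merge, which seems like a fair trade against the write hotspot.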
