On Apr 14, 2011, at 12:18 PM, David Schnepper wrote:

> Could it be that your row key is not distributing the data well enough?
> That is, if your key is primarily based on the current date, it will only put
> the data into a small number of regions.

This, I have come to realize, is an essential difference between the Cassandra approach and the HBase approach. With HBase, your keys can be randomly distributed over the entire keyspace, but if all your data fits in a single region, then all your requests go to a single regionserver.

The only ways I know around this are to lower the split threshold or to pre-split the table. If you lower the split threshold, you get distribution for smaller tables, but if the tables get big, you have the overhead of more regions to deal with. If you pre-split the table, you're in good shape provided you know the key distribution in advance (although I am concerned about possible bugs involving empty regions, based on one recent experience).

It seems that, until you have enough data relative to your cluster size, you must choose between locality and distribution. (When you have enough data, you get a better balance between the two.)

The HBase rebalancer, as I understand it, adjusts region assignments, but doesn't adjust split points (and hence the number of regions). Maybe that would be a useful feature for some cases.

joe
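
P.S. To make the pre-splitting option concrete, here is a minimal sketch (plain Python, not HBase API code) of choosing split points when you know the keys are spread uniformly over a fixed-width keyspace, e.g. because they carry a hash prefix. The names split_points and region_for and the key_bytes parameter are made up for illustration, and the uniform-distribution assumption is exactly the "know the key distribution in advance" caveat above.

```python
import bisect


def split_points(num_regions, key_bytes=4):
    """Return num_regions - 1 evenly spaced split keys as fixed-width hex strings.

    Assumption: row keys are uniformly distributed over a keyspace that is
    key_bytes wide (for example, a hash prefix). Hypothetical helper.
    """
    if num_regions < 1:
        raise ValueError("num_regions must be at least 1")
    keyspace = 1 << (8 * key_bytes)   # number of distinct key prefixes
    width = 2 * key_bytes             # hex digits per key
    return [format(i * keyspace // num_regions, "0{}x".format(width))
            for i in range(1, num_regions)]


def region_for(key, splits):
    """Index of the region whose [start, end) range contains key.

    Relies on fixed-width lowercase hex sorting the same way as the
    numbers it encodes.
    """
    return bisect.bisect_right(splits, key)
```

With four regions over a one-byte prefix, split_points(4, key_bytes=1) gives three boundaries, and region_for then tells you which of the four ranges a given key prefix falls into. If the real keys are not uniform (date-based keys, say), evenly spaced boundaries like these leave most regions empty, which is the situation David describes.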
