Re: Region Splitting for moderate amount of daily data - Improve MapReduce Performance

Joe Pallas Fri, 15 Apr 2011 12:49:23 -0700

On Apr 14, 2011, at 12:18 PM, David Schnepper wrote:

> Could it be that your row key is not distributing the data well enough?
> That is, if your key is primarily based on the current date, it will only put
> the data into a small number of regions.


This, I have come to realize, is an essential difference between the Cassandra 
approach and the HBase approach.  With HBase, your keys can be randomly 
distributed over the entire keyspace, but if all your data fits in a single 
region, then all your requests are going to a single regionserver.  
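
To make that concrete, here is a toy model of the lookup (a sketch in plain Python, not HBase code; the function names and the MD5-hashed keys are my own assumptions for illustration). A row lands in the region whose key range contains it, so with no split points every key, however random, maps to the same region:

```python
import bisect
import hashlib

def region_for(row_key, split_points):
    """Return the index of the region that holds row_key.

    split_points is a sorted list of region boundary keys; region i covers
    keys from split_points[i-1] (inclusive) up to split_points[i] (exclusive).
    """
    return bisect.bisect_right(split_points, row_key)

def regions_hit(keys, split_points):
    """Set of region indexes touched by a batch of row keys."""
    return {region_for(k, split_points) for k in keys}

# Hypothetical workload: 1000 row keys, randomized by hashing.
keys = [hashlib.md5(str(i).encode()).digest() for i in range(1000)]

one_region = regions_hit(keys, [])                              # table never split
four_regions = regions_hit(keys, [b"\x40", b"\x80", b"\xc0"])   # three split points
```

In this model the unsplit table sends all 1000 keys to region 0, while the same keys spread over all four regions once split points exist. Randomizing the keys only helps after the table has more than one region.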

The only ways I know around this are to make the split threshold low or to 
pre-split the table.  If you make the split threshold low, you get distribution 
for smaller tables, but if the tables get big, you have the overhead of more 
regions to deal with.  If you pre-split the table, you're in good shape 
provided you know the key distribution in advance (although I am concerned 
about possible bugs involving empty regions, based on one recent experience).
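
For the pre-split case, a minimal sketch of computing the split points, assuming the row keys are uniformly distributed over a fixed-width byte space (for example, because they start with a hash). The helper name and the default key width are hypothetical. As far as I know, the resulting keys can be passed as the split-key array when the table is created, but check that against the admin API of your HBase version:

```python
def uniform_split_keys(num_regions, key_width=4):
    """Evenly spaced split keys over a key_width-byte keyspace.

    Returns num_regions - 1 boundary keys, which divide the keyspace into
    num_regions ranges of (nearly) equal size. Assumes num_regions is small
    compared to 256 ** key_width, so the boundaries are distinct.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    space = 256 ** key_width
    return [
        (space * i // num_regions).to_bytes(key_width, "big")
        for i in range(1, num_regions)
    ]

split_keys = uniform_split_keys(4)
```

This only balances load if the keys really are uniform. With date-based keys like the ones David describes, evenly spaced boundaries would still leave most of the regions empty.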

It seems that, until you have enough data relative to your cluster size, you 
must choose between locality and distribution.  (When you have enough data, you 
get a better balance between the two.)

The HBase rebalancer, as I understand it, adjusts region assignments, but 
doesn't adjust split points (and hence doesn't change the number of regions).  
Adjusting split points might be a useful feature for some cases.

joe
