Thanks, that's the way I visualized it happening. The assumption, then, is that 
this process would continue until every server in the cluster has one region of 
data (more or less). My underlying question is that I need to store my data 
with the key starting with the date (YYYY-MM-DD). I know this means I will have 
hot spots during inserts, but it makes retrieval more efficient by using a scan 
with start and end rows. I was thinking of adding a prefix number of 00 to 09, 
one for each of the ten servers. In theory, each server should end up with only 
one of the prefixes. Then during retrieval I could use ten threads, each with a 
start and end row for its prefix, and the query should be distributed evenly 
among the servers. I'm not sure whether using ten threads to insert the data 
would buy me anything. Anyway, I'm going to try this out at home on my own 
cluster to see how it performs.
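For what it's worth, the prefixing idea can be sketched in plain Java. This is just an illustration of the key scheme, not real HBase client code; the method names, the "-" separator, and the use of a hash of the key's suffix to pick the 00-09 salt are all my assumptions about how one might wire it up:

```java
// Sketch of the 00-09 prefix ("salting") scheme from the thread.
// Assumed design: the salt comes from a hash of the non-date part of the
// key, so rows for one date spread across all ten buckets, and each of
// the ten retrieval threads scans exactly one bucket's [start, stop) range.
public class SaltedKeys {
    static final int SALTS = 10; // one bucket per server, per the plan

    // Full row key: two-digit salt + date + the rest of the key.
    static String rowKey(String date, String suffix) {
        int bucket = Math.abs(suffix.hashCode()) % SALTS;
        return String.format("%02d", bucket) + date + "-" + suffix;
    }

    // The scan range one thread would use for a given bucket and date.
    // '~' sorts after digits and dashes in ASCII, so it closes the range.
    static String[] scanRange(int bucket, String date) {
        String p = String.format("%02d", bucket);
        return new String[] { p + date, p + date + "~" };
    }
}
```

Each thread would then take one bucket number, build its range with `scanRange`, and hand the pair to a Scan as start and stop rows; the client merges the ten partial results afterward.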

Thanks

-Pete

-----Original Message-----
From: Buttler, David [mailto:[email protected]] 
Sent: Friday, April 22, 2011 12:10 PM
To: [email protected]
Subject: RE: Row Key Question

Regions split when they grow larger than the configured region size.  Your 
data is small enough to fit in a single region.

Keys are sorted within a region.  When a region splits, the two new regions 
are each about half the size of the original, and each contains half of its 
key space.
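As a toy illustration of that split rule (this is not HBase code, just a sketch of the idea): cutting a region's sorted key list at the midpoint yields two daughters that each cover about half the original key space.

```java
import java.util.List;

// Toy model of a region split: a sorted key list is cut at its midpoint,
// producing two halves that each hold half the keys (and half the key space).
public class SplitSketch {
    static List<List<String>> split(List<String> sortedKeys) {
        int mid = sortedKeys.size() / 2;
        return List.of(sortedKeys.subList(0, mid),
                       sortedKeys.subList(mid, sortedKeys.size()));
    }
}
```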

Dave

-----Original Message-----
From: Peter Haidinyak [mailto:[email protected]] 
Sent: Friday, April 22, 2011 10:41 AM
To: [email protected]
Subject: Row Key Question

I have a question on how HBase decides to save rows based on row keys. Say I 
have a million rows to insert into a new table on a ten-node cluster. Each 
row's key is some random 32-byte value, and there are two columns per row, 
each containing some random 32-byte value. 
My question is: how does HBase know when to 'split' the table between the ten 
nodes? Or how does HBase 'split' the random keys between the ten nodes? 

Thanks

-Pete
