Hi,

I'm doing a distributed scan of an HBase table using MapReduce. I take all the
regions hosted by a regionserver and assign them to a single mapper, so there
is one mapper per regionserver and each mapper talks to only one regionserver.
However, this approach gives me some data skew. For example, I have two
tables, U and T. Each regionserver may host 30 regions in total, but one
regionserver might have 10 regions from table U while another might have 25.
Is there a way to balance regions per table per regionserver (so that each
regionserver has, for example, 15 regions from table U)? Or should I just not
worry about having each individual mapper talk to only one regionserver?
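
To make the skew concrete, here is a minimal sketch in plain Python (it does
not use the HBase API). The hostnames and region names are made up for
illustration, and the layout simply mirrors the 10 vs. 25 example above:

```python
from collections import defaultdict


def group_regions_by_server(region_locations):
    """Group regions by hosting regionserver (one mapper per server)."""
    by_server = defaultdict(list)
    for region, server in region_locations:
        by_server[server].append(region)
    return dict(by_server)


def skew_ratio(by_server):
    """Largest per-mapper workload divided by the smallest, in regions."""
    counts = [len(regions) for regions in by_server.values()]
    return max(counts) / min(counts)


# Hypothetical layout for table U: 10 regions on rs1, 25 regions on rs2.
region_locations = (
    [(f"U,region{i:02d}", "rs1.example.com") for i in range(10)]
    + [(f"U,region{i:02d}", "rs2.example.com") for i in range(10, 35)]
)

by_server = group_regions_by_server(region_locations)
for server, regions in sorted(by_server.items()):
    print(server, len(regions))
print("skew ratio:", skew_ratio(by_server))
```

With this layout the second mapper scans 2.5 times as many regions of table U
as the first, assuming regions are roughly equal in size.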

Also, how do regions get assigned to regionservers?  Is it based on data 
locality?  Region start/end keys?  Randomly?

Thanks,
Albert
