We are facing large region sizes but a small number of regions per table: 10 region 
servers, each holding only one region, and each region is over 10 GB; the map slot 
count of each task tracker is 12. We are planning to 'split' the start/stop key range 
of the large table's regions into more map tasks, so that we can make better use of 
MapReduce resources (currently only 1 of the 12 map slots is used). I have some ideas 
below for the split; please give me comments or advice.
We are considering implementing a TableInputFormat that overrides the method:
@Override
public List<InputSplit> getSplits(JobContext context) throws IOException
Here is the idea:

1)      Split the start/stop key range based on a threshold or the average region size
Set a threshold t1 and collect each region's size. If a region's size is larger than 
t1, 'split' the region's range [startkey, stopkey) into N = {region size} / t1 
sub-ranges: [startkey, stopkey1), [stopkey1, stopkey2), ..., [stopkeyN-1, stopkey).
As for t1, we could set it as we like, or leave it as the average of all region 
sizes. We will set it to a small value when each region is very large, so that the 
'split' actually happens.
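To make idea 1) concrete, here is a minimal, HBase-free sketch of computing the sub-range boundaries by treating keys as fixed-width unsigned integers and interpolating between start and stop (HBase's own Bytes.split takes a similar approach); the class and method names are hypothetical:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Sketch: split a region's [startKey, stopKey) range into n sub-ranges by
// linear interpolation over the keys, viewed as fixed-width unsigned integers.
public class RangeSplitter {

    // Returns n+1 boundary keys: startKey, stopkey1, ..., stopkeyN-1, stopKey.
    static List<byte[]> splitRange(byte[] startKey, byte[] stopKey, int n) {
        // Right-pad both keys to a common width so lexicographic byte order
        // matches numeric order of the padded values.
        int width = Math.max(startKey.length, stopKey.length) + 1;
        BigInteger start = new BigInteger(1, pad(startKey, width));
        BigInteger stop  = new BigInteger(1, pad(stopKey, width));
        BigInteger step  = stop.subtract(start).divide(BigInteger.valueOf(n));

        List<byte[]> boundaries = new ArrayList<>();
        boundaries.add(startKey);
        for (int i = 1; i < n; i++) {
            boundaries.add(toBytes(start.add(step.multiply(BigInteger.valueOf(i))), width));
        }
        boundaries.add(stopKey);
        return boundaries;
    }

    // Right-pad a key with 0x00 bytes up to the given width.
    static byte[] pad(byte[] key, int width) {
        byte[] out = new byte[width];
        System.arraycopy(key, 0, out, 0, key.length);
        return out;
    }

    // Convert an unsigned value back to a fixed-width big-endian byte array.
    static byte[] toBytes(BigInteger v, int width) {
        byte[] raw = v.toByteArray(); // may be shorter, or carry a sign byte
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

The interpolated boundaries are only even in key space, not in data size, which is why sampling real block keys (idea 2) gives better-balanced splits for skewed data.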

2)      Get the split keys by sampling HFile block keys
As for stopkey1, ..., stopkeyN-1, HBase doesn't supply APIs to get them; only 
Pair<byte[][],byte[][]> getStartEndKeys() is given, which returns the start/stop keys 
of the regions. So we could either 1) compute them roughly by interpolating between 
the region's start and stop keys, or 2) directly read each store file's block keys 
through HFile.Reader, merge-sort them, and then sample split keys from the merged 
list.
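The merge-and-sample step in 2) could look like the sketch below, assuming the sorted block keys of each store file have already been extracted (in a real setup they would come from HFile.Reader's block index); the class and method names are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: merge the sorted block keys of all store files and sample n-1
// evenly spaced keys to use as sub-range boundaries.
public class BlockKeySampler {

    static List<String> sampleSplitKeys(List<List<String>> perFileBlockKeys, int n) {
        // Each input list is already sorted; concatenate and re-sort for
        // simplicity (a k-way merge with a priority queue would scale better).
        List<String> merged = new ArrayList<>();
        for (List<String> keys : perFileBlockKeys) {
            merged.addAll(keys);
        }
        Collections.sort(merged);

        // Every block holds roughly the same amount of data, so picking keys
        // at even positions in the merged list yields size-balanced sub-ranges.
        List<String> splitKeys = new ArrayList<>();
        for (int i = 1; i < n; i++) {
            splitKeys.add(merged.get(i * merged.size() / n));
        }
        return splitKeys;
    }
}
```

Because block keys are spaced by on-disk size rather than by key distance, this sampling should balance the map tasks better than pure key interpolation when the row-key distribution is skewed.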
Does this method make sense?

Thanks,
Wei
