Hi Wei,

Have you looked at MAX_FILESIZE? If your table is 1TB in size, and you
have 10 RS and want 12 regions per server, you can set it to
1TB/(10x12), roughly 8.5GB, and you will end up with at least that many
regions (and probably even a few more).
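
For reference, a minimal sketch of setting that attribute through the Java
admin API (the table name is a placeholder; the same thing can also be done
with alter in the shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SetMaxFileSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // 1TB spread over 10 RS x 12 regions per RS ~= 8.5GB per region
    long maxFileSize = (1024L * 1024 * 1024 * 1024) / (10 * 12);

    HTableDescriptor desc = admin.getTableDescriptor("mytable".getBytes());
    desc.setMaxFileSize(maxFileSize);

    admin.disableTable("mytable");
    admin.modifyTable("mytable".getBytes(), desc);
    admin.enableTable("mytable");
    admin.close();
  }
}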

JM

2013/3/25 Lu, Wei <[email protected]>:
> We are facing large regions but a small number of regions per table: 10 region
> servers, each with only one region of over 10G in size, while each task tracker
> has 12 map slots. We are planning to 'split' the start/stop key ranges of the
> large table's regions into more map tasks, so that we can make better use of the
> MapReduce resources (currently only 1 of the 12 map slots is used). I have some
> ideas below on how to split; please give me comments or advice.
> We are considering implementing a TableInputFormat that overrides the
> method:
> @Override
> public List<InputSplit> getSplits(JobContext context) throws IOException
> Here is the idea:
>
> 1)      Split the start/stop key range based on a threshold or the avg. region size
> Set a threshold t1 and collect each region's size; if a region's size is larger
> than t1, then 'split' the region's range [startkey, stopkey) into N =
> {region size} / t1 sub-ranges: [startkey, stopkey1), [stopkey1,
> stopkey2), ..., [stopkeyN-1, stopkey).
> As for t1, we could set it as we like, or default it to the average of all the
> region sizes. We would set it to a small value when the regions are very large,
> so that the 'split' actually happens. (A rough sketch of the resulting getSplits
> is at the end of this mail.)
>
> 2)      Get the split keys by sampling HFile block keys
> As for stopkey1, ..., stopkeyN-1, HBase doesn't supply APIs to get them; only
> Pair<byte[][],byte[][]> getStartEndKeys() is given to get the start/stop keys of
> each region. We could either 1) calculate them roughly, or 2) directly read each
> store file's block keys through HFile.Reader, merge-sort them, and then sample
> from them.
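>
> For the first option, the rough calculation could be a simple linear
> interpolation over the key bytes, something like this (just a sketch;
> RoughKeySplitter is a made-up name, and since it cuts the range evenly in key
> space rather than in data size, the block-key sampling would still give more
> balanced sub-ranges):
>
> import java.math.BigInteger;
> import java.util.Arrays;
>
> public class RoughKeySplitter {
>
>   // Returns the n + 1 boundary keys for [start, stop), i.e. startkey, stopkey1,
>   // ..., stopkeyN-1, stopkey, by treating the keys as fixed-width unsigned integers.
>   public static byte[][] splitKeys(byte[] start, byte[] stop, int n) {
>     int width = Math.max(start.length, stop.length) + 1;
>     byte[] hiBytes = Arrays.copyOf(stop, width);
>     if (stop.length == 0) {
>       Arrays.fill(hiBytes, (byte) 0xFF);  // an empty stop key means "open ended"
>     }
>     BigInteger lo = new BigInteger(1, Arrays.copyOf(start, width));
>     BigInteger hi = new BigInteger(1, hiBytes);
>     BigInteger step = hi.subtract(lo).divide(BigInteger.valueOf(n));
>
>     byte[][] keys = new byte[n + 1][];
>     keys[0] = start;
>     keys[n] = stop;
>     for (int i = 1; i < n; i++) {
>       byte[] raw = lo.add(step.multiply(BigInteger.valueOf(i))).toByteArray();
>       // strip any leading sign byte and left-pad back to the fixed key width
>       byte[] key = new byte[width];
>       int copy = Math.min(raw.length, width);
>       System.arraycopy(raw, raw.length - copy, key, width - copy, copy);
>       keys[i] = key;
>     }
>     return keys;
>   }
> }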
> Does this method make sense?
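>
> To show how the pieces of 1) would fit together, the getSplits override we have
> in mind is roughly the following (again only a sketch against the 0.94-era
> mapreduce API; the class name and the hard-coded t1 are placeholders, and the
> two abstract helpers are exactly the open points: regionSize could sum the
> region's store file sizes, and splitKeys could be the interpolation above or
> the HFile block-key sampling):
>
> import java.io.IOException;
> import java.util.ArrayList;
> import java.util.List;
>
> import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
> import org.apache.hadoop.hbase.mapreduce.TableSplit;
> import org.apache.hadoop.mapreduce.InputSplit;
> import org.apache.hadoop.mapreduce.JobContext;
>
> public abstract class SubRegionTableInputFormat extends TableInputFormat {
>
>   // t1: threshold above which a region's key range is cut into sub-ranges
>   private static final long T1 = 1024L * 1024 * 1024;
>
>   @Override
>   public List<InputSplit> getSplits(JobContext context) throws IOException {
>     List<InputSplit> perRegion = super.getSplits(context); // one split per region
>     List<InputSplit> result = new ArrayList<InputSplit>();
>     for (InputSplit s : perRegion) {
>       TableSplit ts = (TableSplit) s;
>       long size = regionSize(ts);   // e.g. the sum of the region's store file sizes
>       int n = (int) (size / T1);    // N = {region size} / t1
>       if (n <= 1) {
>         result.add(ts);             // small region: keep the original split
>         continue;
>       }
>       // n + 1 boundary keys: startkey, stopkey1, ..., stopkeyN-1, stopkey
>       byte[][] keys = splitKeys(ts.getStartRow(), ts.getEndRow(), n);
>       for (int i = 0; i < n; i++) {
>         result.add(new TableSplit(ts.getTableName(), keys[i], keys[i + 1],
>             ts.getRegionLocation()));
>       }
>     }
>     return result;
>   }
>
>   // The two open questions from 1) and 2), left abstract here on purpose.
>   protected abstract long regionSize(TableSplit split) throws IOException;
>   protected abstract byte[][] splitKeys(byte[] start, byte[] stop, int n) throws IOException;
> }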
>
> Thanks,
> Wei
>
