Hi Wei,

Have you looked at MAX_FILESIZE? If your table is 1 TB in size, and you have 10 RSs and want 12 regions per server, you can set it to 1 TB / (10 x 12) and you will get at least that many regions (and even a bit more).
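For concreteness, the arithmetic above (1 TB table, 10 region servers, 12 regions each) works out to roughly 8.5 GiB per region. A small sketch (class and method names are just for illustration):

```java
public class RegionSizing {
    // Target MAX_FILESIZE so that a table of the given size ends up with
    // roughly (servers * regionsPerServer) regions.
    static long targetMaxFileSize(long tableBytes, int servers, int regionsPerServer) {
        return tableBytes / ((long) servers * regionsPerServer);
    }

    public static void main(String[] args) {
        long oneTb = 1024L * 1024 * 1024 * 1024;
        long max = targetMaxFileSize(oneTb, 10, 12);
        System.out.println(max + " bytes (~" + (max >> 30) + " GiB)");
        // prints: 9162596898 bytes (~8 GiB)
    }
}
```

You could then apply the value per table in the HBase shell, e.g. `alter 't1', MAX_FILESIZE => '9162596898'`.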
JM

2013/3/25 Lu, Wei <[email protected]>:
> We are facing large region sizes but a small number of regions per table: 10
> region servers, each with only one region of over 10 GB, while each task
> tracker has 12 map slots. We are planning to 'split' the start/stop key range
> of the large table's regions to get more map tasks, so that we can make better
> use of MapReduce resources (currently only one of the 12 map slots is used). I
> have some ideas below on how to split; please give me comments or advice.
>
> We are considering implementing a TableInputFormat that overrides the method:
>
> @Override
> public List<InputSplit> getSplits(JobContext context) throws IOException
>
> Here is the idea:
>
> 1) Split the start/stop key range based on a threshold or the average region
> size.
> Set a threshold t1 and collect each region's size; if a region's size is
> larger than t1, then 'split' the range [startkey, stopkey) of the region into
> N = {region size} / t1 sub-ranges: [startkey, stopkey1), [stopkey1,
> stopkey2), ..., [stopkeyN-1, stopkey).
> As for t1, we could set it as we like, or leave it as the average of all
> region sizes. We will set it to a small value when each region is very large,
> so that the 'split' will happen.
>
> 2) Get the split keys by sampling HFile block keys.
> HBase doesn't supply APIs to get stopkey1, ..., stopkeyN-1; only
> Pair<byte[][],byte[][]> getStartEndKeys() is given, which returns the
> start/stop keys of the regions. We could either 1) calculate them roughly, or
> 2) directly read each store file's block keys through HFile.Reader, merge-sort
> them, and then sample.
> Does this method make sense?
>
> Thanks,
> Wei
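Regarding option 1) of idea 2) in the quoted mail: one way to roughly calculate the intermediate stop keys without reading HFiles is to interpolate linearly in the byte key space, similar in spirit to what HBase's Bytes.split() does. A standalone sketch, with no HBase dependencies (splitKeys and the helper names are hypothetical; note that real row keys are rarely uniformly distributed, which is exactly why sampling HFile block keys as proposed would give better-balanced splits):

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {

    // Split [start, stop) into n contiguous sub-ranges by linear interpolation
    // over the byte key space. Returns the n-1 intermediate keys
    // stopkey1 ... stopkeyN-1.
    static List<byte[]> splitKeys(byte[] start, byte[] stop, int n) {
        // Right-pad with 0x00 to a common width (+1 byte of headroom) so that
        // lexicographic ordering of keys matches numeric ordering of the values.
        int width = Math.max(start.length, stop.length) + 1;
        BigInteger lo = new BigInteger(1, rightPad(start, width));
        BigInteger hi = new BigInteger(1, rightPad(stop, width));
        BigInteger range = hi.subtract(lo);
        List<byte[]> keys = new ArrayList<>();
        for (int i = 1; i < n; i++) {
            BigInteger k = lo.add(range.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(n)));
            keys.add(toFixedWidth(k, width));
        }
        return keys;
    }

    static byte[] rightPad(byte[] b, int width) {
        byte[] out = new byte[width]; // trailing bytes stay 0x00
        System.arraycopy(b, 0, out, 0, b.length);
        return out;
    }

    static byte[] toFixedWidth(BigInteger v, int width) {
        byte[] raw = v.toByteArray(); // may carry an extra sign byte
        byte[] out = new byte[width];
        int len = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - len, out, width - len, len);
        return out;
    }

    public static void main(String[] args) {
        // Example: split the range [0x00, 0x10) into 4 sub-ranges.
        for (byte[] k : splitKeys(new byte[]{0x00}, new byte[]{0x10}, 4)) {
            StringBuilder sb = new StringBuilder();
            for (byte b : k) sb.append(String.format("%02x", b));
            System.out.println(sb); // 0400, then 0800, then 0c00
        }
    }
}
```

A custom getSplits() could then emit one TableSplit per sub-range instead of one per region.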
