Looks like this is what you were looking for: HBASE-4063 Improve TableInputFormat to allow application to configure the number of mappers
Cheers

On Mon, Mar 25, 2013 at 7:33 PM, Lu, Wei <[email protected]> wrote:

> Hi, Michael,
>
> Yes, I read some stuff in blogs, and I did pre-split + a large max region
> file size to avoid online splits. I also set the region size large to reduce
> region server heap usage, so I don't want to split manually.
>
> Let me make it clear. The problem I face is spawning more than one map
> task for each large region when running MR on top of HBase. That means
> running several map tasks per region, each scanning a row-key range.
>
> Thanks,
> Wei
>
>
> -----Original Message-----
> From: Michael Segel [mailto:[email protected]]
> Sent: Monday, March 25, 2013 11:52 PM
> To: [email protected]
> Subject: Re: ‘split’ start/stop key range of large table regions for more
> map tasks
>
> I think the problem is that Wei has been reading some stuff in blogs, and
> that's why he has such a large region size to start with.
>
> So if he manually splits the regions and drops the region size to
> something more appropriate...
>
> Or if he unloads the table, drops the table, recreates the table with a
> smaller, more reasonable region size, and reloads... he'd be better off.
>
>
> On Mar 25, 2013, at 6:20 AM, Jean-Marc Spaggiari <[email protected]>
> wrote:
>
> > Hi Wei,
> >
> > Have you looked at MAX_FILESIZE? If your table is 1TB in size, and you
> > have 10 RS and want 12 regions per server, you can set this to
> > 1TB/(10x12) and you will get at least that many regions (and even a
> > bit more).
> >
> > JM
> >
> > 2013/3/25 Lu, Wei <[email protected]>:
> >> We are facing a large region size but a small number of regions per
> >> table: 10 region servers, each with only one region of over 10G, while
> >> each task tracker has 12 map slots. We are planning to ‘split’ the
> >> start/stop key ranges of large table regions into more map tasks, so
> >> that we can make better use of MapReduce resources (currently only 1 of
> >> the 12 map slots is used). I have some ideas below on how to split;
> >> please give me comments or advice.
> >> We are considering implementing a TableInputFormat that overrides the
> >> method:
> >>
> >> @Override
> >> public List<InputSplit> getSplits(JobContext context) throws IOException
> >>
> >> Following is the idea:
> >>
> >> 1) Split the start/stop key range based on a threshold or the average
> >> region size.
> >> Set a threshold t1 and collect each region's size. If a region's size is
> >> larger than t1, then ‘split’ the region's range [startkey, stopkey) into
> >> N = {region size} / t1 sub-ranges: [startkey, stopkey1), [stopkey1,
> >> stopkey2), ..., [stopkeyN-1, stopkey).
> >> As for t1, we could set it as we like, or leave it as the average of all
> >> region sizes. We would set it to a small value when each region is very
> >> large, so that the ‘split’ will happen.
> >>
> >> 2) Get the split keys by sampling HFile block keys.
> >> As for stopkey1, ..., stopkeyN-1, HBase doesn't supply APIs to get them;
> >> only Pair<byte[][],byte[][]> getStartEndKeys() is given, which returns
> >> the start/stop keys of the regions. 1) We could calculate them roughly,
> >> or 2) we could directly get each store file's block keys through
> >> HFile.Reader, merge-sort them, and then sample.
> >> Does this method make sense?
> >>
> >> Thanks,
> >> Wei
> >>
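[Editor's note, not part of the original thread] Wei's idea 1, "calculate them roughly," amounts to interpolating N-1 intermediate keys between a region's start and stop keys. A minimal plain-Java sketch of that interpolation is below; it has no HBase dependency, and the class/method names (RangeSplitter, splitRange) are made up for illustration. HBase ships a similar utility (Bytes.split in org.apache.hadoop.hbase.util), which a real TableInputFormat override would likely use instead.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {

    // Split [start, stop) into n contiguous sub-ranges by linear
    // interpolation, treating keys as unsigned big-endian integers.
    // Returns n+1 boundary keys: start, stopkey1, ..., stopkeyN-1, stop.
    static List<byte[]> splitRange(byte[] start, byte[] stop, int n) {
        int width = Math.max(start.length, stop.length);
        BigInteger lo = new BigInteger(1, pad(start, width));
        BigInteger hi = new BigInteger(1, pad(stop, width));
        BigInteger step = hi.subtract(lo).divide(BigInteger.valueOf(n));
        List<byte[]> bounds = new ArrayList<>();
        bounds.add(start);
        for (int i = 1; i < n; i++) {
            bounds.add(toKey(lo.add(step.multiply(BigInteger.valueOf(i))), width));
        }
        bounds.add(stop);
        return bounds;
    }

    // Right-pad a key with zero bytes to a fixed width so short and long
    // keys are comparable as integers.
    static byte[] pad(byte[] key, int width) {
        byte[] out = new byte[width];
        System.arraycopy(key, 0, out, 0, key.length);
        return out;
    }

    // Convert the interpolated integer back to a fixed-width key.
    static byte[] toKey(BigInteger v, int width) {
        byte[] raw = v.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

For example, splitting the one-byte range [0x00, 0x10) into 4 sub-ranges yields the boundaries 0x00, 0x04, 0x08, 0x0c, 0x10. The interpolation is only "rough" in Wei's sense: it assumes keys are uniformly distributed, which is exactly why idea 2 (sampling actual HFile block keys) gives better-balanced splits.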

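[Editor's note, not part of the original thread] Wei's idea 2 boils down to a k-way merge of the already-sorted key lists from each store file, followed by taking evenly spaced keys as split points. The sketch below uses plain strings in place of byte[] row keys and a made-up name (sampleSplitKeys); real code would obtain the per-file key lists from the HFile block index rather than hard-coded lists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KeySampler {

    // Merge several already-sorted key lists (one per store file) into a
    // single sorted stream, then take n-1 evenly spaced keys as split points.
    static List<String> sampleSplitKeys(List<List<String>> sortedLists, int n) {
        // k-way merge with a priority queue of {list index, position} pairs,
        // ordered by the key each pair currently points at.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> sortedLists.get(e[0]).get(e[1])));
        int total = 0;
        for (int i = 0; i < sortedLists.size(); i++) {
            if (!sortedLists.get(i).isEmpty()) pq.add(new int[]{i, 0});
            total += sortedLists.get(i).size();
        }
        List<String> merged = new ArrayList<>(total);
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            merged.add(sortedLists.get(e[0]).get(e[1]));
            if (e[1] + 1 < sortedLists.get(e[0]).size()) {
                pq.add(new int[]{e[0], e[1] + 1});
            }
        }
        // Sample every (total / n)-th merged key as a split boundary.
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < n; i++) {
            splits.add(merged.get(i * total / n));
        }
        return splits;
    }
}
```

With three store files holding keys {a,d,g}, {b,e,h}, {c,f,i} and n = 3, the merged stream is a..i and the sampled split points are d and g. Because the split points come from real stored keys, the resulting sub-ranges carry roughly equal data, unlike pure byte interpolation.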