Looks like this is what you were looking for: HBASE-4063 Improve TableInputFormat to allow application to configure the number of mappers
Cheers

On Mon, Mar 25, 2013 at 7:33 PM, Lu, Wei <[email protected]> wrote:

> Hi, Michael,
>
> Yes, I read some stuff in blogs, and I did pre-split + a large max region
> file size to avoid online splits. I also set the region size large to reduce
> region server heap usage, so I don't want to split manually.
>
> Let me make it clear. The problem I face is spawning more than one map
> task for each large region when running MR on top of HBase. That means
> running several map tasks per region, each scanning a row-key range.
>
> Thanks,
> Wei
>
>
> -----Original Message-----
> From: Michael Segel [mailto:[email protected]]
> Sent: Monday, March 25, 2013 11:52 PM
> To: [email protected]
> Subject: Re: ‘split’ start/stop key range of large table regions for more
> map tasks
>
> I think the problem is that Wei has been reading some stuff in blogs, and
> that's why he has such a large region size to start with.
>
> So if he manually splits the regions and drops the region size to
> something more appropriate...
>
> Or if he unloads the table, drops the table, recreates the table with a
> smaller, more reasonable region size, and reloads... he'd be better off.
>
>
> On Mar 25, 2013, at 6:20 AM, Jean-Marc Spaggiari <[email protected]>
> wrote:
>
> > Hi Wei,
> >
> > Have you looked at MAX_FILESIZE? If your table is 1TB in size, and you
> > have 10 RS and want 12 regions per server, you can set this to
> > 1TB/(10x12) and you will get at least that many regions (and even a
> > bit more).
> >
> > JM
> >
> > 2013/3/25 Lu, Wei <[email protected]>:
> >> We are facing a large region size but a small number of regions per
> >> table: 10 region servers, each with only one region of over 10G, while
> >> each task tracker has 12 map slots. We are planning to ‘split’ the
> >> start/stop key ranges of large table regions into more map tasks, so
> >> that we can make better use of MapReduce resources (currently only 1 of
> >> the 12 map slots is used). I have some ideas below on how to split;
> >> please give me comments or advice.
> >> We are considering implementing a TableInputFormat that overrides the
> >> method:
> >>
> >> @Override
> >> public List<InputSplit> getSplits(JobContext context) throws IOException
> >>
> >> Following is the idea:
> >>
> >> 1) Split the start/stop key range based on a threshold or the average
> >> region size.
> >> Set a threshold t1 and collect each region's size. If a region's size is
> >> larger than t1, then ‘split’ the region's range [startkey, stopkey) into
> >> N = {region size} / t1 sub-ranges: [startkey, stopkey1), [stopkey1,
> >> stopkey2), ..., [stopkeyN-1, stopkey).
> >> As for t1, we could set it as we like, or leave it as the average of all
> >> region sizes. We would set it to a small value when each region is very
> >> large, so that the ‘split’ will happen.
> >>
> >> 2) Get the split keys by sampling HFile block keys.
> >> As for stopkey1, ..., stopkeyN-1, HBase doesn't supply APIs to get them;
> >> only Pair<byte[][],byte[][]> getStartEndKeys() is given, which returns
> >> the start/stop keys of the regions. 1) We could calculate them roughly,
> >> or 2) we could directly get each store file's block keys through
> >> HFile.Reader, merge-sort them, and then sample.
> >> Does this method make sense?
> >>
> >> Thanks,
> >> Wei
> >>
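[Editor's note, not part of the original thread] Wei's idea 1, "calculate them roughly," amounts to interpolating N-1 intermediate keys between a region's start and stop keys. A minimal plain-Java sketch of that interpolation is below; it has no HBase dependency, and the class/method names (RangeSplitter, splitRange) are made up for illustration. HBase ships a similar utility (Bytes.split in org.apache.hadoop.hbase.util), which a real TableInputFormat override would likely use instead.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class RangeSplitter {

    // Split [start, stop) into n contiguous sub-ranges by linear
    // interpolation, treating keys as unsigned big-endian integers.
    // Returns n+1 boundary keys: start, stopkey1, ..., stopkeyN-1, stop.
    static List<byte[]> splitRange(byte[] start, byte[] stop, int n) {
        int width = Math.max(start.length, stop.length);
        BigInteger lo = new BigInteger(1, pad(start, width));
        BigInteger hi = new BigInteger(1, pad(stop, width));
        BigInteger step = hi.subtract(lo).divide(BigInteger.valueOf(n));
        List<byte[]> bounds = new ArrayList<>();
        bounds.add(start);
        for (int i = 1; i < n; i++) {
            bounds.add(toKey(lo.add(step.multiply(BigInteger.valueOf(i))), width));
        }
        bounds.add(stop);
        return bounds;
    }

    // Right-pad a key with zero bytes to a fixed width so short and long
    // keys are comparable as integers.
    static byte[] pad(byte[] key, int width) {
        byte[] out = new byte[width];
        System.arraycopy(key, 0, out, 0, key.length);
        return out;
    }

    // Convert the interpolated integer back to a fixed-width key.
    static byte[] toKey(BigInteger v, int width) {
        byte[] raw = v.toByteArray();
        byte[] out = new byte[width];
        int copy = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - copy, out, width - copy, copy);
        return out;
    }
}
```

For example, splitting the one-byte range [0x00, 0x10) into 4 sub-ranges yields the boundaries 0x00, 0x04, 0x08, 0x0c, 0x10. The interpolation is only "rough" in Wei's sense: it assumes keys are uniformly distributed, which is exactly why idea 2 (sampling actual HFile block keys) gives better-balanced splits.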

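[Editor's note, not part of the original thread] Wei's idea 2 boils down to a k-way merge of the already-sorted key lists from each store file, followed by taking evenly spaced keys as split points. The sketch below uses plain strings in place of byte[] row keys and a made-up name (sampleSplitKeys); real code would obtain the per-file key lists from the HFile block index rather than hard-coded lists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KeySampler {

    // Merge several already-sorted key lists (one per store file) into a
    // single sorted stream, then take n-1 evenly spaced keys as split points.
    static List<String> sampleSplitKeys(List<List<String>> sortedLists, int n) {
        // k-way merge with a priority queue of {list index, position} pairs,
        // ordered by the key each pair currently points at.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> sortedLists.get(e[0]).get(e[1])));
        int total = 0;
        for (int i = 0; i < sortedLists.size(); i++) {
            if (!sortedLists.get(i).isEmpty()) pq.add(new int[]{i, 0});
            total += sortedLists.get(i).size();
        }
        List<String> merged = new ArrayList<>(total);
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            merged.add(sortedLists.get(e[0]).get(e[1]));
            if (e[1] + 1 < sortedLists.get(e[0]).size()) {
                pq.add(new int[]{e[0], e[1] + 1});
            }
        }
        // Sample every (total / n)-th merged key as a split boundary.
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < n; i++) {
            splits.add(merged.get(i * total / n));
        }
        return splits;
    }
}
```

With three store files holding keys {a,d,g}, {b,e,h}, {c,f,i} and n = 3, the merged stream is a..i and the sampled split points are d and g. Because the split points come from real stored keys, the resulting sub-ranges carry roughly equal data, unlike pure byte interpolation.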