Re: TableInputFormat and number of mappers == number of regions

Avery Ching Mon, 11 Apr 2011 11:27:26 -0700

I found that the code still exists in this code base, for the old mapred interfaces:


src/main/java/org/apache/hadoop/hbase/mapred/TableInputFormatBase.java

I'll adapt it for my needs.  Thanks!

Avery

On Apr 9, 2011, at 9:55 AM, Jean-Daniel Cryans wrote:

> It's weird, I thought we already did something like that and it seems
> that the old TableInputFormatBase does it but not the new one. From
> its javadoc:
> 
>   * Splits are created in number equal to the smallest between numSplits and
>   * the number of {@link HRegion}s in the table. If the number of splits is
>   * smaller than the number of {@link HRegion}s then splits are spanned across
>   * multiple {@link HRegion}s and are grouped the most evenly possible. In the
>   * case splits are uneven the bigger splits are placed first in the
>   * {@link InputSplit} array.
> 
> J-D
> 
> On Sat, Apr 9, 2011 at 9:48 AM, Stack <[email protected]> wrote:
>> Yes, you could make a different Splitter.  Would be nice in the
>> splitter if you could keep the locality where we have the Map task
>> running on the TaskTracker that is adjacent to the hosting
>> RegionServer.  That shouldn't be hard.  Study the current splitter and
>> see how it juggles locations.
>> 
>> Can you put us in contact w/ the person running the cluster (offline
>> if you prefer)?  150k sounds like regions need to be bigger.
>> 
>> Thanks,
>> St.Ack
>> 
>> On Sat, Apr 9, 2011 at 9:33 AM, Avery Ching <[email protected]> wrote:
>>> The number of regions is pretty insane, but not under my control 
>>> unfortunately.  The workaround I suggested is to write another InputFormat 
>>> and InputSplit such that each InputSplit is responsible for a configurable 
>>> number of regions.  For example, if I have 100k regions and I configure 
>>> each InputSplit to handle 1k regions, then I'll only have 100 map tasks.  
>>> Just was wondering if anyone else faced these issues.
>>> 
>>> Thanks for your quick response on a Saturday morning =),
>>> 
>>> Avery
>>> 
>>> On Apr 9, 2011, at 9:26 AM, Jean-Daniel Cryans wrote:
>>> 
>>>> You cannot have more mappers than you have regions, but you can have
>>>> fewer. Try going that way.
>>>> 
>>>> Also 149,624 regions is insane, is that really the case? I don't think
>>>> I've ever seen such a large deploy and it's probably bound to hit some
>>>> issues...
>>>> 
>>>> J-D
>>>> 
>>>> On Sat, Apr 9, 2011 at 9:15 AM, Avery Ching <[email protected]> wrote:
>>>>> Hi,
>>>>> 
>>>>> First off, I'd like to say thanks to the developers for HBase, it's been 
>>>>> fun to work with.
>>>>> 
>>>>> I've been using TableInputFormat to run a Map-Reduce job and ran into an 
>>>>> issue.
>>>>> 
>>>>> Exception in thread "main" org.apache.hadoop.ipc.RemoteException: 
>>>>> java.io.IOException: java.io.IOException: The number of tasks for this 
>>>>> job 149624 exceeds the configured limit 100000
>>>>> 
>>>>> The table I'm accessing has 149624 regions, however my Hadoop instance 
>>>>> won't allow me to start a job with that many map tasks.  After briefly 
>>>>> looking at the TableInputFormatBase code, it appears that since 
>>>>> TableSplit only knows about a single region, my job will be forced into 
>>>>> having mappers == # of regions.  Since the Hadoop instance I'm using is 
>>>>> shared, I'm concerned that even if the configured limit were raised, having 
>>>>> jobs with so many mappers would eventually wreak havoc on the JobTracker.
>>>>> 
>>>>> Given that I have no control over the number of regions in the table 
>>>>> (maintained by someone else), is the only solution to implement another 
>>>>> input format (i.e. MultiRegionTableFormat) that allows InputSplits to 
>>>>> have more than one region?  I don't mind doing it, but didn't want to 
>>>>> write it if another solution already exists.
>>>>> 
>>>>> Apologies if this issue has been raised before, but a quick search didn't 
>>>>> turn anything up for me.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Avery
>>>>> 
>>> 
>>> 
>>
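
The grouping rule in the javadoc quoted above (as many splits as the smaller of numSplits and the region count, spread as evenly as possible, bigger splits first) can be sketched in plain Java. This is only an illustration of the arithmetic, not the HBase implementation: the class RegionGrouping and its methods splitSizes and group are hypothetical names, and the sketch ignores region locations, which a real InputFormat would need to handle to keep map tasks near their RegionServers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of HBase. */
final class RegionGrouping {

    /**
     * Returns the number of regions assigned to each split.
     * There are min(numSplits, numRegions) splits; when the division is
     * uneven, the larger splits come first.
     */
    static int[] splitSizes(int numRegions, int numSplits) {
        if (numRegions < 0 || numSplits <= 0) {
            throw new IllegalArgumentException("numRegions >= 0 and numSplits > 0 required");
        }
        int splits = Math.min(numSplits, numRegions);
        int[] sizes = new int[splits];
        if (splits == 0) {
            return sizes; // empty table: no splits
        }
        int base = numRegions / splits;   // every split gets at least this many
        int extra = numRegions % splits;  // the first 'extra' splits get one more
        for (int i = 0; i < splits; i++) {
            sizes[i] = base + (i < extra ? 1 : 0);
        }
        return sizes;
    }

    /**
     * Groups consecutive regions into splits. Assumes the caller passes the
     * regions in start-key order so that each group covers a contiguous range.
     */
    static <T> List<List<T>> group(List<T> regions, int numSplits) {
        List<List<T>> groups = new ArrayList<>();
        int start = 0;
        for (int size : splitSizes(regions.size(), numSplits)) {
            groups.add(new ArrayList<>(regions.subList(start, start + size)));
            start += size;
        }
        return groups;
    }
}
```

With the numbers from this thread, splitSizes(149624, 100) yields 100 splits: the first 24 hold 1497 regions each and the remaining 76 hold 1496 each.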
