The number of mappers is usually the same as the number of input files (strictly,
one per input split, but each small file becomes its own split). To reduce that
number you can use CombineFileInputFormat, which packs many small files into each
split. I recently wrote an article about it; take a look if it fits your needs.

http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
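
Roughly, the driver side looks like this (a minimal sketch in Hadoop 2.x style; it
assumes a release that ships CombineTextInputFormat, otherwise you subclass
CombineFileInputFormat yourself as the article describes, and the paths are
placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    Configuration conf = new Configuration();
    // Cap each combined split; this property name is the Hadoop 2.x one,
    // older releases read mapred.max.split.size instead.
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                 256L * 1024 * 1024);
    Job job = Job.getInstance(conf, "combine-small-files");
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.addInputPath(job, new Path("/input"));   // placeholder
    FileOutputFormat.setOutputPath(job, new Path("/output"));       // placeholder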

Felix

On Sep 29, 2013, at 6:45 PM, yunming zhang <[email protected]> wrote:

> I am actually trying to reduce the number of mappers because my application 
> takes up a lot of memory (on the order of 1-2 GB of RAM per mapper). I want to 
> be able to use a few mappers but still maintain good CPU utilization through 
> multithreading within a single mapper. MultithreadedMapper doesn't work 
> because it duplicates the in-memory data structures.
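> To be concrete, this is the setup I mean (a rough sketch; MyMapper is a 
> placeholder for my real mapper). Each worker thread of MultithreadedMapper 
> runs its own instance of the wrapped mapper, so whatever the mapper holds 
> in memory is duplicated per thread:
> 
>     // org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper
>     job.setMapperClass(MultithreadedMapper.class);
>     MultithreadedMapper.setMapperClass(job, MyMapper.class); // real map logic
>     MultithreadedMapper.setNumberOfThreads(job, 4);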
> 
> Thanks
> 
> Yunming
> 
> 
> On Sun, Sep 29, 2013 at 6:59 PM, Sonal Goyal <[email protected]> wrote:
> Wouldn't you rather just change your split size so that you can have more 
> mappers work on your input? What else are you doing in the mappers?
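> 
> Split size is just configuration; a rough sketch with the Hadoop 2.x property 
> names (older releases use mapred.max.split.size / mapred.min.split.size):
> 
>     // Smaller max split size => more splits => more mappers.
>     job.getConfiguration().setLong(
>         "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);
>     // A larger min split size gives fewer, bigger splits instead.
>     job.getConfiguration().setLong(
>         "mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);
> 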
> Sent from my iPad
> 
> On Sep 30, 2013, at 2:22 AM, yunming zhang <[email protected]> wrote:
> 
>> Hi, 
>> 
>> I was playing with the Hadoop code, trying to have a single Mapper read an 
>> input split using multiple threads. I am getting an "All datanodes are bad" 
>> IOException, and I am not sure what the issue is. 
>> 
>> The reason for this work is that I suspect my computation was slow because 
>> it takes too long to create the Text() objects from the input split using a 
>> single thread. I tried to modify LineRecordReader (since I am mostly using 
>> TextInputFormat) to provide additional methods that retrieve lines from the 
>> input split: getCurrentKey2(), getCurrentValue2(), nextKeyValue2(). I 
>> created a second FSDataInputStream and a second LineReader object for 
>> getCurrentKey2() and getCurrentValue2() to read from. Essentially I am 
>> trying to open the input split twice with different start points (one at 
>> the very beginning, the other in the middle of the split) so that two 
>> threads can read from the split in parallel.  
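>> 
>> Roughly, the extra reader is set up like this inside 
>> LineRecordReader.initialize() (a sketch of my change; in2, lineReader2 and 
>> secondStart are names I added, while file, job, start and end come from the 
>> stock initialize()):
>> 
>>     // Open the same file a second time and seek to the middle of the split.
>>     FSDataInputStream in2 = file.getFileSystem(job).open(file);
>>     long secondStart = start + (end - start) / 2;
>>     in2.seek(secondStart);
>>     LineReader lineReader2 = new LineReader(in2, job);
>>     // Skip the partial line at the seek position so the second reader starts
>>     // on a record boundary, the same trick the stock reader uses at split
>>     // boundaries.
>>     secondStart += lineReader2.readLine(new Text(), 0,
>>         (int) Math.min(Integer.MAX_VALUE, end - secondStart));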
>> 
>> In the org.apache.hadoop.mapreduce.Mapper.run() method, I changed the loop 
>> so that Thread 1 reads through getCurrentKey()/getCurrentValue() while 
>> Thread 2 reads through getCurrentKey2()/getCurrentValue2(), with both 
>> threads running at the same time:
>>       Thread 1:
>>         while (context.nextKeyValue()) {
>>           map(context.getCurrentKey(), context.getCurrentValue(), context);
>>         }
>> 
>>       Thread 2:
>>         while (context.nextKeyValue2()) {
>>           map(context.getCurrentKey2(), context.getCurrentValue2(), context);
>>           //System.out.println("two iter");
>>         }
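>> 
>> Put together, the modified run() looks roughly like this (again just a 
>> sketch; nextKeyValue2()/getCurrentKey2()/getCurrentValue2() are the methods 
>> I added, not standard Hadoop API, and note that both threads call map() on 
>> the same context):
>> 
>>     public void run(final Context context)
>>         throws IOException, InterruptedException {
>>       setup(context);
>>       // Second thread drains the second half of the split.
>>       Thread second = new Thread(new Runnable() {
>>         public void run() {
>>           try {
>>             while (context.nextKeyValue2()) {
>>               map(context.getCurrentKey2(),
>>                   context.getCurrentValue2(), context);
>>             }
>>           } catch (Exception e) {
>>             throw new RuntimeException(e);
>>           }
>>         }
>>       });
>>       second.start();
>>       // Main thread drains the first half of the split.
>>       while (context.nextKeyValue()) {
>>         map(context.getCurrentKey(), context.getCurrentValue(), context);
>>       }
>>       second.join();
>>       cleanup(context);
>>     }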
>> 
>> However, this causes the "All datanodes are bad" exception. I am fairly 
>> sure I closed the second stream. I have attached a copy of my 
>> LineRecordReader file to show what I changed to enable two simultaneous 
>> reads of the input split. 
>> 
>> I have modified other files (org.apache.hadoop.mapreduce.RecordReader.java, 
>> mapred.MapTask.java, ...) just to enable Mapper.run() to call 
>> LineRecordReader.getCurrentKey2() and the other accessor methods for the 
>> second reader. 
>> 
>> 
>> I would really appreciate it if anyone could give me a bit of advice or 
>> just point me in a direction as to where the problem might be. 
>> 
>> Thanks
>> 
>> Yunming 
>> 
>> <LineRecordReader.java>
> 
