The number of mappers is usually the same as the number of files you feed in. To reduce it, you can use CombineFileInputFormat. I recently wrote an article about it; take a look if it fits your needs:

http://www.idryman.org/blog/2013/09/22/process-small-files-on-hadoop-using-combinefileinputformat-1/
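If it helps, the core of the pattern is small. A minimal sketch (class names here are mine, not necessarily the ones in the article): an input format that packs many small files into each split and delegates per-file reading to the stock LineRecordReader.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
    import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Packs many small files into each split, so far fewer mappers are launched.
    public class CombinedTextInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader instantiates one delegate reader per packed file.
        return new CombineFileRecordReader<LongWritable, Text>(
            (CombineFileSplit) split, context, CombinedLineRecordReader.class);
      }

      // CombineFileRecordReader requires exactly this constructor signature:
      // (CombineFileSplit, TaskAttemptContext, Integer).
      public static class CombinedLineRecordReader
          extends RecordReader<LongWritable, Text> {
        private final CombineFileSplit split;
        private final int index;
        private final LineRecordReader delegate = new LineRecordReader();

        public CombinedLineRecordReader(CombineFileSplit split,
            TaskAttemptContext context, Integer index) {
          this.split = split;
          this.index = index;
        }

        @Override
        public void initialize(InputSplit ignored, TaskAttemptContext context)
            throws IOException, InterruptedException {
          // Present the index-th packed file as an ordinary FileSplit.
          delegate.initialize(new FileSplit(split.getPath(index),
              split.getOffset(index), split.getLength(index),
              split.getLocations()), context);
        }

        @Override public boolean nextKeyValue() throws IOException { return delegate.nextKeyValue(); }
        @Override public LongWritable getCurrentKey() { return delegate.getCurrentKey(); }
        @Override public Text getCurrentValue() { return delegate.getCurrentValue(); }
        @Override public float getProgress() throws IOException { return delegate.getProgress(); }
        @Override public void close() throws IOException { delegate.close(); }
      }
    }

In the driver you point the job at it and cap the combined split size, which is the knob that actually controls how many mappers you end up with:

    job.setInputFormatClass(CombinedTextInputFormat.class);
    // Cap each combined split at 256 MB (tune to your memory budget).
    CombinedTextInputFormat.setMaxInputSplitSize(job, 256L << 20);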
Felix

On Sep 29, 2013, at 6:45 PM, yunming zhang <[email protected]> wrote:

> I am actually trying to reduce the number of mappers, because my application
> takes up a lot of memory (on the order of 1-2 GB of RAM per mapper). I want to
> be able to use a few mappers but still maintain good CPU utilization through
> multithreading within a single mapper. MultithreadedMapper doesn't work
> because it duplicates the in-memory data structures.
>
> Thanks
>
> Yunming
>
>
> On Sun, Sep 29, 2013 at 6:59 PM, Sonal Goyal <[email protected]> wrote:
> Wouldn't you rather just change your split size so that you can have more
> mappers work on your input? What else are you doing in the mappers?
>
> Sent from my iPad
>
> On Sep 30, 2013, at 2:22 AM, yunming zhang <[email protected]> wrote:
>
>> Hi,
>>
>> I was playing with the Hadoop code, trying to have a single Mapper support
>> reading an input split using multiple threads. I am getting an "All datanodes
>> are bad" IOException, and I am not sure what the issue is.
>>
>> The reason for this work is that I suspect my computation is slow because it
>> takes too long to create the Text() objects from the input split using a
>> single thread. I tried to modify LineRecordReader (since I am mostly using
>> TextInputFormat) to provide additional methods for retrieving lines from the
>> input split: getCurrentKey2(), getCurrentValue2(), and nextKeyValue2(). I
>> created a second FSDataInputStream and a second LineReader object for
>> getCurrentKey2() and getCurrentValue2() to read from. Essentially, I am
>> trying to open the input split twice, with different start points (one at the
>> very beginning, the other in the middle of the split), to read from it in
>> parallel using two threads.
>>
>> In the org.apache.hadoop.mapreduce.Mapper.run() method, I modified it to read
>> simultaneously using getCurrentKey() and getCurrentKey2() from Thread 1 and
>> Thread 2 (both threads running at the same time):
>>
>> Thread 1:
>> while (context.nextKeyValue()) {
>>     map(context.getCurrentKey(), context.getCurrentValue(), context);
>> }
>>
>> Thread 2:
>> while (context.nextKeyValue2()) {
>>     map(context.getCurrentKey2(), context.getCurrentValue2(), context);
>> }
>>
>> However, this causes the "All datanodes are bad" exception. I think I made
>> sure that I closed the second file. I have attached a copy of my
>> LineRecordReader file to show what I changed to try to enable two
>> simultaneous reads of the input split.
>>
>> I have modified other files (org.apache.hadoop.mapreduce.RecordReader.java,
>> mapred.MapTask.java, ...) just to enable Mapper.run() to call
>> LineRecordReader.getCurrentKey2() and the other access methods for the second
>> stream.
>>
>> I would really appreciate it if anyone could give me a bit of advice, or just
>> point me in a direction as to where the problem might be.
>>
>> Thanks
>>
>> Yunming
>>
>> <LineRecordReader.java>
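One more thought on the two-stream approach itself: I can't say for certain what triggers the "All datanodes are bad" error, but you can get the CPU parallelism you're after without patching the read path at all. Keep run() as the single thread that touches HDFS, and fan the records out to a pool of workers over a bounded queue. Below is a rough sketch against the stock mapreduce API; the worker count, the output types, and expensiveComputation() are illustrative stand-ins, not code from your attachment or from Hadoop.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One reader thread (run() itself) feeds CPU-bound worker threads,
    // so the HDFS stream is only ever touched by a single thread.
    public class ParallelCpuMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {

      private static final int WORKERS = 4;          // illustrative; match your cores
      private static final Text POISON = new Text(); // shutdown marker

      @Override
      public void run(final Context context)
          throws IOException, InterruptedException {
        setup(context);
        final BlockingQueue<Text> queue = new ArrayBlockingQueue<Text>(1024);

        List<Thread> workers = new ArrayList<Thread>();
        for (int i = 0; i < WORKERS; i++) {
          Thread t = new Thread(new Runnable() {
            public void run() {
              try {
                while (true) {
                  Text line = queue.take();
                  if (line == POISON) {
                    return;
                  }
                  long result = expensiveComputation(line);
                  // Mapper.Context is not thread-safe; serialize writes.
                  synchronized (context) {
                    context.write(line, new LongWritable(result));
                  }
                }
              } catch (Exception e) {
                throw new RuntimeException(e);
              }
            }
          });
          t.start();
          workers.add(t);
        }

        // Single reader: the framework reuses its Text object,
        // so copy each value before handing it off.
        while (context.nextKeyValue()) {
          queue.put(new Text(context.getCurrentValue()));
        }
        for (int i = 0; i < WORKERS; i++) {
          queue.put(POISON);
        }
        for (Thread t : workers) {
          t.join();
        }
        cleanup(context);
      }

      // Stand-in for the real CPU-heavy, per-record work.
      private long expensiveComputation(Text line) {
        return line.getLength();
      }
    }

Unlike MultithreadedMapper, which instantiates a separate Mapper per thread (and hence duplicates your 1-2 GB of state), all the workers here share the one mapper instance, so the large structures are loaded once. If the synchronized write ever becomes the bottleneck, the usual next step is a second queue drained by the reader thread.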
