How to combine input files for a MapReduce job

Agarwal, Nikhil Mon, 13 May 2013 00:20:59 -0700

Hi,

I have a 3-node cluster, with the JobTracker running on one machine and 
TaskTrackers on the other two. Instead of using HDFS, I have written my own 
FileSystem implementation. As an experiment, I kept 1000 text files (all of 
the same size) on both slave nodes and ran a simple WordCount MR job. It took 
around 50 minutes to complete. Afterwards, I concatenated all 1000 files into 
a single file and ran the same WordCount job; it took 35 seconds. From the 
JobTracker UI I could see that the difference comes from the number of 
mappers the JobTracker creates: for 1000 files it creates 1000 maps, and 
for 1 file it creates 1 map (irrespective of file size).


Is there a way to reduce the number of mappers? That is, can I control the 
number of mappers through some configuration parameter, so that Hadoop 
combines files until they reach a specified size (say, 64 MB) and then 
creates 1 map per 64 MB block?
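
To make the behaviour I am after concrete, here is a minimal sketch in plain 
Java. It is not a Hadoop API: the class name SplitPacker and the method pack 
are hypothetical, and the 1 MB file size is only an assumed value for 
illustration. It groups files greedily, in order, so that each group stays at 
or under a size cap, and each group would then go to one mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: shows the grouping I would like.
public class SplitPacker {

    // Groups file sizes (in bytes) in order so that each group's total stays
    // at or under maxSplitBytes. A single file larger than the cap ends up
    // alone in its own group.
    public static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if this file would push it past the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long fileSize = 1L << 20;   // assumed: 1 MB per file (illustration only)
        long cap = 64L << 20;       // 64 MB per map, as in my question
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            sizes.add(fileSize);
        }
        List<List<Long>> groups = pack(sizes, cap);
        System.out.println(groups.size() + " map tasks instead of " + sizes.size());
    }
}
```

With these assumed sizes, 64 files fit under the cap, so the 1000 files would 
be handled by far fewer mappers than one per file.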

Also, how can I see which file is assigned to which TaskTracker? If that is 
not possible, how do I check whether any data is being transferred between 
my slave nodes during an MR job?

Sorry for so many questions, and thank you for your time.

Regards,
Nikhil
