Re: How to combine input files for a MapReduce job

Harsh J Mon, 13 May 2013 00:59:11 -0700

Yes, I believe the branch-1 patch attached there should apply cleanly to 1.0.4.
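
On the earlier "control number of mappers" point quoted below, here is a
rough sketch of the size-based grouping idea: pack files into one split
until a size limit (say, 64 MB) is reached, then start the next split. This
is only an illustration in plain Java, not Hadoop's CombineFileInputFormat
code, which also takes node and rack locality into account. The class name
SplitPacker, the method pack and the 1 MB file size are all made up for the
example (the thread does not state the actual file sizes).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: greedy, size-based grouping of input files into splits.
// Hypothetical names; this is not Hadoop's implementation.
class SplitPacker {

    // Groups file sizes (in bytes), in order, so that each group totals at
    // most maxSplitBytes. A single file larger than the limit still gets a
    // group of its own.
    static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current group if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed workload: 1000 files of 1 MB each, 64 MB limit per split.
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            sizes.add(1024L * 1024L);
        }
        List<List<Long>> splits = pack(sizes, 64L * 1024L * 1024L);
        System.out.println("map tasks: " + splits.size());
    }
}
```

With those assumed sizes, the 1000 files collapse into 16 groups (15 of 64
files and one of 40), i.e. 16 map tasks instead of 1000. In a real job,
CombineFileInputFormat is abstract, so you would still need your own
subclass with a suitable RecordReader.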


On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil
<[email protected]> wrote:
> Hi,
>
> @Harsh: Thanks for the reply. Would the patch work in the Hadoop 1.0.4 release?
>
> -----Original Message-----
> From: Harsh J [mailto:[email protected]]
> Sent: Monday, May 13, 2013 1:03 PM
> To: <[email protected]>
> Subject: Re: How to combine input files for a MapReduce job
>
> For the "control number of mappers" question: you can use
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
> which is designed to solve similar cases. However, you cannot beat the speed 
> you get out of a single large file (or a few large files), as you'll still 
> have file open/close overheads which will bog you down.
>
> For the "which file is being submitted to which" question: having
> https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the 
> version/distribution of Apache Hadoop you use would help.
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <[email protected]> 
> wrote:
>> Hi,
>>
>>
>>
>> I have a 3-node cluster, with the JobTracker running on one machine and
>> TaskTrackers on the other two. Instead of using HDFS, I have written my
>> own FileSystem implementation. As an experiment, I kept 1000 text
>> files (all of the same size) on both the slave nodes and ran a simple
>> Wordcount MR job. It took around 50 mins to complete the task.
>> Afterwards, I concatenated all the 1000 files into a single file and
>> ran the Wordcount MR job again; it took 35 secs. From the JobTracker UI
>> I could make out that the problem is the number of mappers that the
>> JobTracker is creating. For 1000 files it creates 1000 maps and for 1
>> file it creates 1 map (irrespective of file size).
>>
>>
>>
>> Thus, is there a way to reduce the number of mappers, i.e., can I
>> control the number of mappers through some configuration parameter so
>> that Hadoop would club files together until they reach some specified
>> size (say, 64 MB) and then make 1 map per 64 MB block?
>>
>>
>>
>> Also, I wanted to know how to see which file is being submitted to
>> which TaskTracker or if that is not possible then how do I check if
>> some data transfer is happening between my slave nodes during an MR job?
>>
>>
>>
>> Sorry for so many questions, and thank you for your time.
>>
>>
>>
>> Regards,
>>
>> Nikhil
>
>
>
> --
> Harsh J



-- 
Harsh J
