You can configure your third MapReduce job using MultipleInputs and read both files into that job. If one of the files is small, consider the DistributedCache instead, which will give you optimal performance when joining the datasets of file1 and file2. I would also recommend using a job scheduling API such as Oozie to make sure the third job kicks off only when file1 and file2 are available on HDFS (the same can be done with a shell script or a JobControl implementation). Rough sketches of each option follow.
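To make the first option concrete, here is a minimal driver sketch for the MultipleInputs route. The input paths, the two mapper classes (File1Mapper, File2Mapper), and JoinReducer are illustrative assumptions, not classes from your existing jobs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ThirdJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "join file1 and file2");
    job.setJarByClass(ThirdJobDriver.class);

    // Each input path gets its own mapper; both mappers must emit the
    // same intermediate key/value types so the reducer can join them.
    MultipleInputs.addInputPath(job, new Path("/data/file1"),
        TextInputFormat.class, File1Mapper.class);   // hypothetical mapper
    MultipleInputs.addInputPath(job, new Path("/data/file2"),
        TextInputFormat.class, File2Mapper.class);   // hypothetical mapper

    job.setReducerClass(JoinReducer.class);          // hypothetical reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path("/data/joined"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If file1 is small enough to fit in memory on every node, the distributed-cache route turns the join into a map-only job over file2, with no shuffle at all. A rough sketch, assuming tab-delimited key/value lines in both files:

// Driver side: ship the small file to every task node.
job.addCacheFile(new Path("/data/file1").toUri());

// Mapper side: load the cached copy of file1 in setup(), then join
// file2 records against it as they stream through map().
public class MapSideJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> file1Lookup = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    for (URI uri : context.getCacheFiles()) {
      // Cached files are localized into the task working directory
      // under their base name.
      try (BufferedReader reader =
          new BufferedReader(new FileReader(new Path(uri).getName()))) {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2);  // assumed format: key \t value
          file1Lookup.put(parts[0], parts[1]);
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2);
    String match = file1Lookup.get(parts[0]);
    if (match != null) {
      context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
    }
  }
}

And if you stay inside plain MapReduce rather than Oozie, JobControl can express the dependency directly. A sketch, assuming job1, job2 and job3 are your three already-configured Job instances:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class PipelineDriver {
  public static void runPipeline(Job job1, Job job2, Job job3)
      throws Exception {
    ControlledJob cj1 = new ControlledJob(job1.getConfiguration());
    ControlledJob cj2 = new ControlledJob(job2.getConfiguration());
    ControlledJob cj3 = new ControlledJob(job3.getConfiguration());

    // The third job will not be submitted until both producers succeed.
    cj3.addDependingJob(cj1);
    cj3.addDependingJob(cj2);

    JobControl control = new JobControl("file1-file2-join");
    control.addJob(cj1);
    control.addJob(cj2);
    control.addJob(cj3);

    // JobControl implements Runnable; drive it from a thread and poll.
    Thread runner = new Thread(control);
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}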
::::::::::::::::::::::::::::::::::::::::
Raj K Singh
http://in.linkedin.com/in/rajkrrsingh
http://www.rajkrrsingh.blogspot.com
Mobile Tel: +91 (0)9899821370

On Tue, Jan 6, 2015 at 2:25 AM, hitarth trivedi <[email protected]> wrote:

> Hi,
>
> I have a 6-node cluster, and the scenario is as follows:
>
> I have one MapReduce job which writes file1 to HDFS.
> I have another MapReduce job which writes file2 to HDFS.
> In the third MapReduce job I need to use file1 and file2 to do some
> computation and output the value.
>
> What is the best way to store file1 and file2 in HDFS so that they can
> be used in the third MapReduce job?
>
> Thanks,
> Hitarth
