Running map reduce programmatically is unusually slow

Chandra Mohan, Ananda Vel Murugan Mon, 04 Nov 2013 06:03:18 -0800

Hi,

I have written a small utility to run a MapReduce job programmatically. My aim 
is to run the job without using the hadoop shell script. I am planning to 
call this utility from another application.


The following code runs the MapReduce job. I have bundled this Java 
class into a jar (remotemr.jar). The actual MapReduce job is bundled 
in another jar (mapreduce.jar).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;


public class RemoteMapreduce {

    public static void main(String[] args)
            throws IOException, InterruptedException, ClassNotFoundException {

        String inputPath = args[0];
        String outputPath = args[1];
        String specFilePath = args[2];
        Configuration config = new Configuration();
        config.addResource(new Path("/opt/hadoop-1.0.2/bin/core-site.xml"));
        config.addResource(new Path("/opt/hadoop-1.0.2/bin/hdfs-site.xml"));
        JobConf jobConf = new JobConf(config);
        jobConf.set("hadoop.tmp.dir ", "/tmp/hadoop-ananda/");
        jobConf.setJar("/home/ananda/mapreduce.jar");
        jobConf.setMapperClass(Myjob.MapClass.class);
        SequenceFileInputFormat.setInputPaths(jobConf, new Path(inputPath));
        TextOutputFormat.setOutputPath(jobConf, new Path(outputPath));
        jobConf.setMapOutputKeyClass(Text.class);
        jobConf.setMapOutputValueClass(Text.class);
        jobConf.setInputFormat(SequenceFileInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);
        jobConf.set("specPath", specFilePath);
        jobConf.setUser("ananda");
        Job job1 = new Job(jobConf);
        JobClient jc = new JobClient(jobConf);
        jc.submitJob(jobConf);
        /* JobControl ctrl = new JobControl("dar");
        ctrl.addJob(job1);
        ctrl.run(); */

        System.out.println("Job launched!");

    }
}


I am running it as follows:

java -cp <all hadoop jars needed for the job>:/home/ananda/mapreduce.jar:/home/ananda/remotemr.jar RemoteMapreduce <inputpath> <outputpath> <specpath>

It runs without any error, but it takes longer than when I run the same job 
using the hadoop shell script. Also, all three path arguments need to be 
fully qualified HDFS paths, i.e. hdfs://<hostname>:<port>/<path>. If I give 
partial paths, as I do with the hadoop shell script, I get "input path not 
found" errors. Am I doing anything wrong? Please help. Thanks.
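
To make the path symptom concrete, here is a minimal sketch of how a partial 
path could be completed against a default file system URI. It uses only 
java.net.URI, not Hadoop's own path handling, and the class name, the qualify 
helper and the namenode.example:9000 address are hypothetical. If the default 
were the local file system rather than HDFS, a partial path would point at a 
local location, which would be consistent with the errors I see. That is an 
assumption on my part, not something I have verified.

```java
import java.net.URI;

// Illustration only: plain java.net.URI, not Hadoop's Path/FileSystem classes.
final class QualifyPathSketch {

    // Complete a path against a default file system URI.
    // A path that already carries a scheme is returned unchanged.
    static URI qualify(URI defaultFs, String path) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            return p;
        }
        return defaultFs.resolve(p);
    }

    public static void main(String[] args) {
        // Hypothetical name node address, standing in for hdfs://<hostname>:<port>/
        URI hdfs = URI.create("hdfs://namenode.example:9000/");
        // Local file system as the default
        URI local = URI.create("file:///");

        String partial = "/user/ananda/input";

        // With HDFS as the default, the partial path lands on HDFS.
        System.out.println(qualify(hdfs, partial));

        // With the local file system as the default, the same partial path
        // resolves to a local location instead.
        System.out.println(qualify(local, partial));

        // A fully qualified path is unaffected by the default.
        System.out.println(qualify(local, "hdfs://namenode.example:9000/user/ananda/input"));
    }
}
```

In this sketch only the fully qualified form gives the same result under both 
defaults, which matches what I observe when I pass hdfs://<hostname>:<port>/<path>.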

Regards,
Anand.C
