Hi,
I have written a small utility to run a MapReduce job programmatically. My aim
is to run the job without using the hadoop shell script, and I am planning to
call this utility from another application.
The following code runs the MapReduce job. I have bundled this Java class into
a jar (remotemr.jar). The actual MapReduce job is bundled in another jar
(mapreduce.jar).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class RemoteMapreduce {

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        String inputPath = args[0];
        String outputPath = args[1];
        String specFilePath = args[2];

        // Load the cluster configuration files.
        Configuration config = new Configuration();
        config.addResource(new Path("/opt/hadoop-1.0.2/bin/core-site.xml"));
        config.addResource(new Path("/opt/hadoop-1.0.2/bin/hdfs-site.xml"));

        JobConf jobConf = new JobConf(config);
        jobConf.set("hadoop.tmp.dir", "/tmp/hadoop-ananda/");

        // Jar that contains the actual map reduce job classes (Myjob).
        jobConf.setJar("/home/ananda/mapreduce.jar");
        jobConf.setMapperClass(Myjob.MapClass.class);

        // Input and output locations and formats.
        SequenceFileInputFormat.setInputPaths(jobConf, new Path(inputPath));
        TextOutputFormat.setOutputPath(jobConf, new Path(outputPath));
        jobConf.setInputFormat(SequenceFileInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        // Key and value types.
        jobConf.setMapOutputKeyClass(Text.class);
        jobConf.setMapOutputValueClass(Text.class);
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(Text.class);

        // Job-specific parameter read by the mapper.
        jobConf.set("specPath", specFilePath);
        jobConf.setUser("ananda");

        // Submit the job and return without waiting for completion.
        Job job1 = new Job(jobConf);
        JobClient jc = new JobClient(jobConf);
        jc.submitJob(jobConf);
        /* JobControl ctrl = new JobControl("dar");
           ctrl.addJob(job1);
           ctrl.run(); */
        System.out.println("Job launched!");
    }
}

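One thing I have not ruled out yet is whether the two XML files really exist at the paths given to addResource. The following is only a minimal plain-Java sketch of such a sanity check. It uses no Hadoop classes, and the class name ConfigFileCheck and the helper missing() are hypothetical names made up for this illustration. I have not confirmed that a missing file is related to the behaviour described below.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Hadoop.
class ConfigFileCheck {

    // Returns the subset of the given paths that are not readable regular files.
    static List<String> missing(List<String> paths) {
        List<String> result = new ArrayList<>();
        for (String p : paths) {
            Path file = Paths.get(p);
            if (!Files.isRegularFile(file) || !Files.isReadable(file)) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same paths as passed to config.addResource(...) in RemoteMapreduce.
        List<String> expected = Arrays.asList(
                "/opt/hadoop-1.0.2/bin/core-site.xml",
                "/opt/hadoop-1.0.2/bin/hdfs-site.xml");
        System.out.println("Missing config files: " + missing(expected));
    }
}
```

An empty list would mean both files are present and readable on the machine where the utility runs.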
I am running it as follows:

java -cp <all hadoop jars needed for the job>:/home/ananda/mapreduce.jar:/home/ananda/remotemr.jar RemoteMapreduce <inputpath> <outputpath> <specpath>

It runs without any error, but it takes longer than when I run the same job
using the hadoop shell script. Also, all three paths passed as arguments need
to be fully qualified HDFS paths, i.e. hdfs://<hostname>:<port>/<path>. If I
give partial paths, as I do with the hadoop shell script, I get "input path not
found" errors. Am I doing anything wrong? Please help. Thanks
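
To make clear what I mean by partial versus fully qualified paths, here is a minimal plain-Java sketch of how I assume a path without a scheme gets resolved against a default filesystem URI. It does not use Hadoop's own resolution code. The class PathQualifySketch, the helper qualify() and the host and port namenode:9000 are hypothetical. As far as I understand, fs.default.name falls back to file:/// when no core-site.xml overrides it, which would send a partial path to the local filesystem instead of HDFS.

```java
import java.net.URI;

// Hypothetical illustration only; Hadoop's real logic lives in its own Path/FileSystem classes.
class PathQualifySketch {

    // Prefixes an absolute, scheme-less path with the given default filesystem URI.
    // Paths that already carry a scheme (e.g. hdfs://...) are returned unchanged.
    static String qualify(String defaultFs, String path) {
        if (URI.create(path).getScheme() != null) {
            return path;
        }
        if (!path.startsWith("/")) {
            // Simplification: relative paths would need a working directory, not modelled here.
            throw new IllegalArgumentException("only absolute paths are handled: " + path);
        }
        String base = defaultFs.endsWith("/")
                ? defaultFs.substring(0, defaultFs.length() - 1)
                : defaultFs;
        return base + path;
    }

    public static void main(String[] args) {
        String partial = "/user/ananda/input";
        // Default filesystem taken from a cluster config (hypothetical host and port).
        System.out.println(qualify("hdfs://namenode:9000", partial));
        // Default filesystem when no cluster config is picked up.
        System.out.println(qualify("file:///", partial));
    }
}
```

If that assumption holds, the same partial path names two different locations depending on which configuration is in effect.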
Regards,
Anand.C