1) A standard way of doing it would be to have all your file contents inside HDFS. You could then process <key,value> pairs where the key is the name of the file and the value its contents. It would improve performance: data locality, less network traffic... But you may have constraints...
2) Maven is a simple way of doing it.

Regards

Bertrand

On Mon, Aug 13, 2012 at 7:59 PM, Pierre Antoine DuBoDeNa <[email protected]> wrote:

> Hello,
>
> We use Hadoop to distribute a task over our machines.
>
> This task requires only the mapper class to be defined. We want to do some
> text processing on thousands of documents, so we create key-value pairs
> where the key is just an increasing number and the value is the path of
> the file to be processed.
>
> We face a problem including an external jar file/class while running a
> jar file:
>
> $ mkdir Rdg_classes
> $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d Rdg_classes Rdg.java
> $ jar -cvf Rdg.jar -C Rdg_classes/ .
>
> We have tried the following options:
>
> *1. Set HADOOP_CLASSPATH with the location of the external jar files or
> external classes.*
> It doesn't help. Instead, it stops recognizing the Reducer, with the
> error below:
>
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:899)
>     at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:1028)
>     at org.apache.hadoop.mapred.Task$CombinerRunner.create(Task.java:1380)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:981)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:891)
>     ... 10 more
> Caused by: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>     at java.lang.Class.forName0(Native Method)
>     at java.lang.Class.forName(Class.java:247)
>     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
>     ... 11 more
>
> *2. Use the -libjars option as below:*
>
> hadoop jar Rdg.jar my.hadoop.Rdg -libjars Rdg_lib/* tester rdg_output
>
> where Rdg_lib is a folder containing all the required classes/jars,
> stored on HDFS.
> But it starts reading -libjars as an input path and gives this error:
>
> 12/08/10 08:16:24 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameofserver:54310/user/hduser/-libjars
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameofserver:54310/user/hduser/-libjars
>
> Is there any other way to do it? Or are we doing something wrong?
>
> Best,

--
Bertrand Dechoux
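[Editorial note.] The -libjars failure above is consistent with how Hadoop parses arguments: -libjars is a generic option handled by GenericOptionsParser, so it is only stripped from the argument list when the driver runs through ToolRunner (i.e., the main class implements org.apache.hadoop.util.Tool). Otherwise the literal string "-libjars" falls through as a positional argument, which is exactly why it is being treated as an input path here. The option also expects a comma-separated list of jar paths, not a shell glob. A sketch of a corrected invocation (the jar names under Rdg_lib/ are illustrative, not from the original thread):

```shell
# -libjars goes after the main class but before the job's own arguments,
# and takes a comma-separated list of jars (no glob expansion).
# This only takes effect if my.hadoop.Rdg dispatches through
# ToolRunner.run(...), so GenericOptionsParser can consume -libjars
# before the job sees its positional arguments (tester, rdg_output).
hadoop jar Rdg.jar my.hadoop.Rdg \
  -libjars Rdg_lib/dep1.jar,Rdg_lib/dep2.jar \
  tester rdg_output
```

Alternatively, bundling the dependencies into the job jar itself (e.g. via Maven, as suggested in point 2 above) sidesteps -libjars entirely.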
