1) A standard way of doing it would be to have all your file contents inside HDFS. You could then process <key,value> pairs where the key is the name of the file and the value its contents. It would improve performance: data locality, less network traffic... But you may have constraints...
2) Maven is a simple way of doing it.

Regards

Bertrand

On Mon, Aug 13, 2012 at 7:59 PM, Pierre Antoine DuBoDeNa <[email protected]> wrote:

> Hello,
>
> We use Hadoop to distribute a task over our machines.
>
> This task requires only the mapper class to be defined. We want to do some
> text processing on thousands of documents, so we create key-value pairs
> where the key is just an increasing number and the value is the path of
> the file to be processed.
>
> We face a problem including an external jar file/class while running a
> jar file:
>
> $ mkdir Rdg_classes
> $ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d Rdg_classes Rdg.java
> $ jar -cvf Rdg.jar -C Rdg_classes/ .
>
> We have tried the following options:
>
> *1. Set HADOOP_CLASSPATH with the location of the external jar files or
> external classes.*
> It doesn't help. Instead, it stops recognizing the Reducer, with the
> error below:
>
> java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:899)
>     at org.apache.hadoop.mapred.JobConf.getCombinerClass(JobConf.java:1028)
>     at org.apache.hadoop.mapred.Task$CombinerRunner.create(Task.java:1380)
>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:981)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:428)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:891)
>     ... 10 more
> Caused by: java.lang.ClassNotFoundException: hadoop.Rdg$Reduce
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>     at java.lang.Class.forName0(Native Method)
>     at java.lang.Class.forName(Class.java:247)
>     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
>     ... 11 more
>
> *2. Use the -libjars option as below:*
>
> hadoop jar Rdg.jar my.hadoop.Rdg -libjars Rdg_lib/* tester rdg_output
>
> where Rdg_lib is a folder containing all the required classes/jars,
> stored on HDFS.
> But it starts reading -libjars as an input path and gives this error:
>
> 12/08/10 08:16:24 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameofserver:54310/user/hduser/-libjars
> Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://nameofserver:54310/user/hduser/-libjars
>
> Is there any other way to do it? Or are we doing something wrong?
>
> Best,

--
Bertrand Dechoux
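[Editorial note.] The -libjars failure above is consistent with how Hadoop parses arguments: -libjars is a generic option handled by GenericOptionsParser, so it is only stripped from the argument list when the driver runs through ToolRunner (i.e., the main class implements org.apache.hadoop.util.Tool). Otherwise the literal string "-libjars" falls through as a positional argument, which is exactly why it is being treated as an input path here. The option also expects a comma-separated list of jar paths, not a shell glob. A sketch of a corrected invocation (the jar names under Rdg_lib/ are illustrative, not from the original thread):

```shell
# -libjars goes after the main class but before the job's own arguments,
# and takes a comma-separated list of jars (no glob expansion).
# This only takes effect if my.hadoop.Rdg dispatches through
# ToolRunner.run(...), so GenericOptionsParser can consume -libjars
# before the job sees its positional arguments (tester, rdg_output).
hadoop jar Rdg.jar my.hadoop.Rdg \
  -libjars Rdg_lib/dep1.jar,Rdg_lib/dep2.jar \
  tester rdg_output
```

Alternatively, bundling the dependencies into the job jar itself (e.g. via Maven, as suggested in point 2 above) sidesteps -libjars entirely.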
