Hello!

I am trying to run a simple crawl with Nutch 1.6 on a CentOS 6.2 cluster
running CDH 4.2.1.

First I had problems with
# hadoop jar apache-nutch-1.6.job org.apache.nutch.fetcher.Fetcher /nutch/1.6/crawl/segments/20130613095319
which returned:
java.lang.RuntimeException: problem advancing post rec#0
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1183)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:255)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:251)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:40)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:506)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:447)
Caused by: java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
    at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:206)
. . .
I also noticed an inconsistency between the file system shown by hdfs dfs
-ls and the one shown in the CDH4 Hue GUI. The shell command seems to simply
create the folders/files locally and is not aware of the ones I create
through the Hue GUI.
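One way to check which file system the shell client actually uses is to print its fs.defaultFS setting. The LocalJobRunner frame in the trace above would be consistent with the client running in local mode, but that is only a guess. Below is a minimal sketch, assuming the hdfs command on the node supports getconf -confKey; classify_fs is a hypothetical helper name, not part of Hadoop.

```shell
#!/bin/sh
# Classify a default-filesystem URI as local or HDFS.
# classify_fs is a hypothetical helper name, not a Hadoop command.
classify_fs() {
    case "$1" in
        file:*) echo "local" ;;
        hdfs:*) echo "hdfs" ;;
        "")     echo "unset" ;;
        *)      echo "other" ;;
    esac
}

# Query the real client configuration only if the hdfs command is available.
if command -v hdfs >/dev/null 2>&1; then
    uri=$(hdfs getconf -confKey fs.defaultFS 2>/dev/null)
    echo "fs.defaultFS = ${uri:-<unset>} ($(classify_fs "$uri"))"
fi
```

If this reports "local", the shell client is presumably not reading the cluster's configuration files, which would explain why it does not see the folders created through Hue.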
I therefore suspected that the job was not actually running on the CDH4
cluster, so I used the Hue GUI to create the /user/admin/Nutch-1.6 folder
and urls/seed.txt, and to upload the Nutch 1.6 .job file (previously
configured and built with ant in Eclipse).
When I submit the job through Hue, it logs a ClassNotFoundException, although
I correctly specified the path to the .job file on HDFS and the name of a
class contained in that file:
...
Failing Oozie Launcher, Main class [org.apache.nutch.crawl.Injector], exception invoking main(), java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Injector not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Injector not found
...
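To rule out a packaging problem, one could confirm that the class really is inside the .job archive. Below is a minimal sketch, assuming the .job file is an ordinary zip archive (as jar files are), that Info-ZIP unzip is installed, and that a local copy named apache-nutch-1.6.job sits in the current directory; class_to_entry and job_has_class are hypothetical helper names.

```shell
#!/bin/sh
# Convert a fully qualified class name to its expected archive entry path.
# class_to_entry and job_has_class are hypothetical helper names.
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed if the archive lists the class file at its top level.
job_has_class() {
    unzip -Z1 "$1" 2>/dev/null | grep -qxF "$(class_to_entry "$2")"
}

# Assumed local copy of the job file; adjust the path as needed.
JOB=apache-nutch-1.6.job
if command -v unzip >/dev/null 2>&1 && [ -f "$JOB" ]; then
    if job_has_class "$JOB" org.apache.nutch.crawl.Injector; then
        echo "Injector is in $JOB"
    else
        echo "Injector is NOT at the top level of $JOB"
    fi
fi
```

If the class is present, the archive itself is probably fine and the problem more likely lies in how the Hue/Oozie action puts the file on the classpath.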
How should I define the Hue job so that it recognizes Nutch's .job file,
and/or how can I make CDH4 Hue consistent with the hadoop/hdfs shell commands?
This thread looks related:
http://www.mail-archive.com/[email protected]/msg07603.html


Thank you
