Hello! I am trying to run a simple crawl with Nutch 1.6 on CDH4.2.1 on a CentOS 6.2 cluster.
First I had problems with

# hadoop jar apache-nutch-1.6.job org.apache.nutch.fetcher.Fetcher /nutch/1.6/crawl/segments/20130613095319

which was returning:

java.lang.RuntimeException: problem advancing post rec#0
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1183)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:255)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:251)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:40)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:506)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:447)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:447)
Caused by: java.io.IOException: can't find class: org.apache.nutch.protocol.ProtocolStatus because org.apache.nutch.protocol.ProtocolStatus
    at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:206)
    . . .

Also, I noticed an inconsistency between the file system shown by hdfs dfs -ls and the one shown in the CDH4 Hue GUI. The former seems to simply create the folders/files locally and is not aware of the ones I create through the Hue GUI.

Therefore, I suspected that the job was not actually running on the CDH4 cluster, so I used the Hue GUI to create the /user/admin/Nutch-1.6 folder and urls/seed.txt, and to upload the Nutch 1.6 .job file (previously configured and built with ant in Eclipse). When I submit the job through Hue it logs a ClassNotFoundException, although I correctly specified both the path to the .job file on HDFS and the name of a class contained in that file:

...
Failing Oozie Launcher, Main class [org.apache.nutch.crawl.Injector], exception invoking main(), java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Injector not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.nutch.crawl.Injector not found
...
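
One way to rule out a packaging problem behind the ClassNotFoundException is to confirm that the class really is inside the .job archive. Below is a minimal sketch in Python, assuming the .job file is an ordinary zip/jar archive with compiled classes stored under their package path at the archive root. The function name job_contains_class and the local file name in the usage comment are hypothetical, chosen only for illustration.

```python
import zipfile


def job_contains_class(job_path, class_name):
    """Return True if the archive has a .class entry for the given class.

    Assumes classes sit at the archive root under their package path,
    e.g. org/apache/nutch/crawl/Injector.class.
    """
    # Convert the dotted class name to the path used inside a jar.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(job_path) as archive:
        return entry in archive.namelist()


# Hypothetical usage against a local copy of the job file:
# job_contains_class("apache-nutch-1.6.job", "org.apache.nutch.crawl.Injector")
```

If this returns True for a local copy of the file, the archive itself is probably fine, and the problem more likely lies in how the job is submitted or which file system it is read from.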
How should I define the Hue job so that it recognizes Nutch's .job jar file, and/or how can I make Hue on CDH4 consistent with the hadoop/hdfs shell commands?

This thread looks related: http://www.mail-archive.com/[email protected]/msg07603.html

Thank you

