On 2010-05-18 12:19, Stefano Cherchi wrote:
> I'm running Nutch 1.0 on a 4-node Hadoop cluster, so all Nutch data must 
> reside on the Hadoop distributed filesystem rather than on the local fs.
> 
> The odd thing is that some of the nutch crawling "steps" seem to need a local 
> copy of the data in the nutch main directory in order to work correctly.
> 
> As an example, if I run the invertlinks command without a local copy of data 
> then nutch throws an exception:
> 
> LinkDb: java.io.IOException: No input paths specified in job
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:141)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)
> 
> If an exact copy of the dfs directories is present on the local fs, 
> everything runs fine, even though nutch actually works on the data stored in dfs.
>  
> This is forcing me to perform an annoying, time-consuming copy of the data 
> (several GB) to all cluster nodes at every crawl cycle.
> 
> Has anyone else run into this and found a workaround?

This should not happen. Are you sure that your hadoop config files are
consistent across the cluster, especially the FS-related properties?
When you start the job make sure that the classpath that your command
uses pulls in the right hadoop config, one that correctly defines the
filesystem.

Also, check for the presence of multiple copies of hadoop config files
on your classpath - be aware that some of them may be inside jars, e.g.
in your job jar.
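
One way to carry out the classpath check described above is a small
JDK-only program that lists every copy of a Hadoop config file visible
to the class loader (including copies inside jars) and prints the
filesystem each one defines. This is only a sketch under assumptions:
the file names (hadoop-default.xml and hadoop-site.xml for older Hadoop
releases, core-site.xml for 0.20 and later) and the property name
fs.default.name should be checked against the Hadoop version bundled
with your Nutch, and the class name FindHadoopConfigs is made up for
this illustration.

```java
import java.io.InputStream;
import java.net.URL;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FindHadoopConfigs {

    // Assumed file names; adjust to the Hadoop release you actually run.
    static final String[] CONFIG_NAMES = {
        "hadoop-default.xml", "hadoop-site.xml", "core-site.xml"
    };

    /** Returns every classpath location (directory or jar entry) that holds the named resource. */
    static List<URL> findCopies(ClassLoader loader, String name) throws Exception {
        return Collections.list(loader.getResources(name));
    }

    /** Reads the value of the property with the given name from a Hadoop-style XML config, or null. */
    static String readProperty(InputStream in, String key) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            NodeList names = p.getElementsByTagName("name");
            NodeList values = p.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && key.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        for (String name : CONFIG_NAMES) {
            // More than one URL per name means duplicate config files on the classpath.
            for (URL url : findCopies(loader, name)) {
                try (InputStream in = url.openStream()) {
                    System.out.println(url + " -> fs.default.name="
                            + readProperty(in, "fs.default.name"));
                }
            }
        }
    }
}
```

Run it with exactly the classpath that your nutch command uses. If a
site file is listed more than once, or the copy that is found leaves
fs.default.name unset or pointing at file:///, then paths are probably
being resolved against the local filesystem rather than against DFS,
which would match the symptom described.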

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
