I'm running Nutch 1.0 on a 4-node Hadoop cluster, so all Nutch data must reside on the Hadoop distributed filesystem (HDFS) rather than on the local fs.
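For context, the cluster's default filesystem is pointed at HDFS in conf/hadoop-site.xml, roughly along these lines (the NameNode host and port below are placeholders, not my actual values):

```xml
<!-- conf/hadoop-site.xml (excerpt); master:9000 is a placeholder NameNode address -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
```

With this in place, I'd expect relative paths passed to bin/nutch to resolve on HDFS rather than on the local fs, which is why the behaviour described below surprises me.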
The odd thing is that some of the Nutch crawling "steps" seem to need a local copy of the data in the Nutch main directory in order to work correctly. For example, if I run the invertlinks command without a local copy of the data, Nutch throws an exception:

LinkDb: java.io.IOException: No input paths specified in job
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:141)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:170)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:285)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:248)

If an exact copy of the DFS directories is present on the local fs, however, everything runs fine, even though Nutch actually works on the DFS-stored data. This forces me to perform an annoying, time-consuming copy of several GB of data to all cluster nodes at every crawl cycle.

Has anyone else run into this and found a workaround?

S

----------------------------------
"Anyone proposing to run Windows on servers should be prepared to explain
what they know about servers that Google, Yahoo, and Amazon don't."
Paul Graham

"A mathematician is a device for turning coffee into theorems."
Paul Erdos (who obviously never met a sysadmin)

