Hi all,

I'm upgrading from Nutch 1.0 to 1.4 and am having problems running invertlinks.

Error:

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

I notice that each segment's parse_data directory is produced after a fetch (with fetcher.parse set to true), but after merging the segments the parse_data directory no longer exists, which is the path the error above complains about.
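In case it helps anyone reproduce this, here is a minimal sketch of how I'm checking which segments are missing their parse output before running invertlinks. The `check_segments` helper and the paths are just from my setup, not anything shipped with Nutch:

```shell
# List segments that lack the parse_data directory invertlinks reads.
# check_segments is an illustrative helper, not part of Nutch itself.
check_segments() {
  for seg in "$1"/*/; do
    [ -d "${seg}parse_data" ] || echo "missing parse_data: ${seg%/}"
  done
}

# e.g. check_segments /opt/nutch/data/crawl/segments
```

Running that over my merged segments directory is what shows parse_data gone after the merge.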

What behaviour has changed since 1.0, and does anyone have a solution for the above?

Thanks in advance,

Dean.
