Hi Dean,

Depending on the size of the segments you're fetching, in most cases I would advise you to separate fetching and parsing into individual steps. The reason becomes self-evident as your segments increase in size, and with them the possibility of something going wrong when fetching and parsing are done together. This looks to be a segment which, while being fetched, experienced problems during parsing, therefore no parse_data was produced.
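Something along these lines should do it. This is only a sketch: it assumes a local (non-distributed) runtime, that your crawldb and linkdb sit alongside the segments directory from your log, and that fetcher.parse is set to false in conf/nutch-site.xml:

  # generate a small sample segment (the -topN value is arbitrary)
  bin/nutch generate /opt/nutch/data/crawl/crawldb /opt/nutch/data/crawl/segments -topN 100
  s=`ls -d /opt/nutch/data/crawl/segments/* | tail -1`

  # fetch without parsing, then parse as a separate step;
  # the parse step is what writes the parse_data directory
  bin/nutch fetch $s
  bin/nutch parse $s

  # invertlinks should then find parse_data in each segment it reads
  bin/nutch invertlinks /opt/nutch/data/crawl/linkdb -dir /opt/nutch/data/crawl/segments

Note that invertlinks with -dir reads every segment under that directory, so if an older segment (e.g. the one in your log) still has no parse_data, either parse it or move it aside first.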
Could you please try a test fetch (with the parsing boolean set to false) on a sample segment, then an individual parse as sketched above, and report back to us on this one, please.

Thanks

On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <[email protected]> wrote:
> Hi all,
>
> I'm upgrading from nutch 1 to 1.4 and am having problems running
> invertlinks.
>
> Error:
>
> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> I notice that the parse_data directories are produced after a fetch (with
> fetcher.parse set to true), but after the merge the parse_data directory
> doesn't exist.
>
> What behaviour has changed since 1.0 and does anyone have a solution for the
> above?
>
> Thanks in advance,
>
> Dean.

--
Lewis

