Lewis,

Many thanks for your reply.
I've separated the parsing from the fetching, and although each segment (we run the crawl 5 times) has the parse_data directory after parsing (observed by pausing the process), the mergesegs command does not reproduce the parse_data directory, meaning invertlinks fails with the same "parse_data not found" error. The merged segments directory contains only the crawl_generate and crawl_fetch directories, not any of the others you can see in the individual segment directories.

Regards,

Dean.

On 5 Jan 2012, at 17:39, Lewis John Mcgibbney wrote:

> Hi Dean,
>
> Depending on the size of the segments you're fetching, in most cases I
> would advise you to separate fetching and parsing into individual
> steps. The reason becomes self-explanatory as your segments increase in
> size, as does the likelihood of something going wrong when fetching and
> parsing are done together. This looks to be a segment which
> experienced problems during parsing while being fetched, so no
> parse_data was produced.
>
> Can you please try a test fetch (with the parsing boolean set to false)
> on a sample segment, then an individual parse, and report back to us on
> this one please.
>
> Thanks
>
> On Thu, Jan 5, 2012 at 5:28 PM, Dean Pullen <[email protected]> wrote:
>> Hi all,
>>
>> I'm upgrading from Nutch 1.0 to 1.4 and am having problems running
>> invertlinks.
>>
>> Error:
>>
>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not
>> exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
>>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
>>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
>>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
>>     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
>>     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>>
>> I notice that the parse_data directories are produced after a fetch (with
>> fetcher.parse set to true), but after the merge the parse_data directory
>> doesn't exist.
>>
>> What behaviour has changed since 1.0 and does anyone have a solution for the
>> above?
>>
>> Thanks in advance,
>>
>> Dean.
>
>
> --
> Lewis
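[Editor's note: the separated fetch/parse workflow discussed above can be sketched roughly as below. This is an illustrative sequence for Nutch 1.x, not the exact commands Dean ran; the segment path is taken from the error message, and the linkdb/merged-output paths are assumed for the example.]

```shell
#!/bin/sh
# Sketch of fetching and parsing as separate steps (Nutch 1.x).
# Assumes fetcher.parse is set to false in conf/nutch-site.xml so the
# fetch step does not attempt to parse.

CRAWL=/opt/nutch/data/crawl
SEGMENT=$CRAWL/segments/20120105172548

# Fetch only; with parsing disabled this produces crawl_fetch/content
bin/nutch fetch "$SEGMENT"

# Parse as a separate step; this should create parse_data, parse_text
# and crawl_parse inside the segment directory
bin/nutch parse "$SEGMENT"

# Merge all segments into a single output segment
bin/nutch mergesegs "$CRAWL/MERGEDsegments" -dir "$CRAWL/segments"

# Invert links over the merged segment directory
bin/nutch invertlinks "$CRAWL/linkdb" -dir "$CRAWL/MERGEDsegments"
```

If parse_data is present in each input segment but missing from the mergesegs output, checking that the merge step reports processing the parse_data part of each segment (in hadoop.log) would help isolate whether the data is lost during the merge itself.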

