I'd also ask about your Hadoop temp dir, since you seem to be hitting disk errors. Have you set hadoop.tmp.dir explicitly?
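For reference, a minimal sketch of pinning the temp dir in conf/nutch-site.xml; the path below is only an example, point it at a disk with enough free space and write permission:

```xml
<!-- Example only: override Hadoop's default temp location -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/nutch/tmp</value>
</property>
```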
On Tuesday 10 January 2012 17:59:58 Markus Jelsma wrote:
> I haven't followed the entire thread, but this is about the parse_data
> directory disappearing after a merge? We have no issues with merges on
> small crawls.
>
> Do you still store content despite the parsing fetcher? Can you reproduce
> this on a clean Nutch 1.4 build with an example crawl?
>
> On Thursday 05 January 2012 18:28:52 Dean Pullen wrote:
> > Hi all,
> >
> > I'm upgrading from Nutch 1.0 to 1.4 and am having problems running
> > invertlinks.
> >
> > Error:
> >
> > LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
> > not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
> >     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> >     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> >     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> >     at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> >     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> >     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> >     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> >     at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> >     at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
> >
> > I notice that the parse_data directories are produced after a fetch
> > (with fetcher.parse set to true), but after the merge the parse_data
> > directory doesn't exist.
> >
> > What behaviour has changed since 1.0, and does anyone have a solution
> > for the above?
> >
> > Thanks in advance,
> >
> > Dean.

-- 
Markus Jelsma - CTO - Openindex
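To narrow this down, it may help to list which segments actually still contain parse_data before running invertlinks. A sketch of that check; it simulates a throwaway segments layout here, so substitute your real path (e.g. /opt/nutch/data/crawl/segments from the trace) for SEGMENTS:

```shell
# Sketch only: simulate a segments directory to demonstrate the check.
SEGMENTS=$(mktemp -d)
mkdir -p "$SEGMENTS/20120105172548"             # merged segment with parse_data gone
mkdir -p "$SEGMENTS/20120105172549/parse_data"  # segment with parse output intact

# Report each segment and whether its parse_data directory survived.
for seg in "$SEGMENTS"/*; do
  if [ -d "$seg/parse_data" ]; then
    echo "OK      $seg"
  else
    echo "MISSING $seg"
  fi
done
```

If only post-merge segments show up as MISSING, that points at the merge step rather than the fetch/parse step.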

