The disk errors were solved by upgrading Hadoop to 0.20.203 - they no longer appear.

Dean.

On 10/01/2012 17:01, Markus Jelsma wrote:
I'd like to ask about your Hadoop temp dir, since you seem to have disk
errors. Have you set it?
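
For reference: hadoop.tmp.dir defaults to a directory under /tmp, and a full
or flaky /tmp is a common cause of such disk errors. A minimal sketch of
pointing it elsewhere, in conf/nutch-site.xml or Hadoop's core-site.xml (the
path is just an example):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/tmp</value>
    <description>Base for Hadoop temporary files (example path).</description>
  </property>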

On Tuesday 10 January 2012 17:59:58 Markus Jelsma wrote:
I haven't followed the entire thread, but is this about the parse_data
directory disappearing after a merge? We have no issues with merges on small
crawls.

Do you still store content even with the parsing fetcher? Can you reproduce
this on a clean Nutch 1.4 build with an example crawl?
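
If it helps, a reproduction could look roughly like this (paths and the seed
directory are examples; with fetcher.parse=true the fetch step parses as
well):

  # run from the Nutch 1.4 runtime directory, seed URLs in urls/
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/2* | tail -1`   # newest segment
  bin/nutch fetch $s
  # parse_data should now exist under $s; then merge and invert:
  bin/nutch mergesegs crawl/merged -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/merged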

On Thursday 05 January 2012 18:28:52 Dean Pullen wrote:
Hi all,

I'm upgrading from Nutch 1.0 to 1.4 and am having problems running
invertlinks.

Error:

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
      at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
      at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
      at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
      at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

I notice that the parse_data directories are produced after a fetch
(with fetcher.parse set to true), but after the merge the parse_data
directory doesn't exist.
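
For reference, the check is just listing the segment contents before and
after the merge (path taken from the error above; a fully parsed segment
should contain content, crawl_fetch, crawl_generate, crawl_parse, parse_data
and parse_text):

  # before the merge
  ls /opt/nutch/data/crawl/segments/20120105172548
  # after the merge (adjust to wherever the merge output was written)
  ls /opt/nutch/data/crawl/segments/*/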

What behaviour has changed since 1.0, and does anyone have a solution for
the above?

Thanks in advance,

Dean.
