Hi again,
Continuing my investigations into Nutch, I attempted to run two whole-web
crawls against two different target URL sets simultaneously, each with its
own crawl directory. All seemed to be going well until the exception below
appeared in one of the two runs. It looks like something under the hood is
using lock or job files that overlap between the runs. Is it possible to run
two Nutch instances side by side, or would it be a better architecture to
have a single instance of the script running and have it pick up updates to
the set of URLs it has to crawl (e.g. the user specifying new top-level URLs
to crawl)?
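
For what it's worth, the path in the trace suggests where the clash is:
hadoop.tmp.dir defaults to /tmp/hadoop-${user.name} (hence /tmp/hadoop-root),
so both runs appear to share the same local job directory and one run's
job_local_* files can disappear from under the other. A workaround I'm
considering (untested; it assumes the local job runner honours this property)
is to give each instance its own tmp dir in its conf/nutch-site.xml, with the
value below purely illustrative:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/tmp/hadoop-crawl-instance1</value>
    </property>

and a different value for the second instance. If I'm reading bin/nutch
correctly, the NUTCH_CONF_DIR environment variable would let each run pick up
its own conf directory. The single-instance alternative would presumably just
inject newly supplied top-level URLs into the existing crawldb (bin/nutch
inject) between crawl cycles.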
Cheers
Chris
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/hadoop-root/mapred/system/job_local_0001/job.xml does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
        at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1197)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:92)
        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)