Hi again,
Continuing my investigations into Nutch, I attempted to run two whole-web
crawls against two different target URL sets simultaneously, each with its
own crawl directory. All seemed to be going well until the exception below
appeared in one of the two runs. It looks like something under the hood is
using lock or job files that overlap between the runs. Is it possible to run
two Nutch instances side by side, or would it be a better architecture to
have a single instance of the script running and have it pick up updates to
the set of URLs it has to crawl (e.g. the user specifying new top-level URLs
to crawl)?
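
For what it's worth, the path in the trace suggests where the clash is:
hadoop.tmp.dir defaults to /tmp/hadoop-${user.name} (hence /tmp/hadoop-root),
so both runs appear to share the same local job directory and one run's
job_local_* files can disappear from under the other. A workaround I'm
considering (untested; it assumes the local job runner honours this property)
is to give each instance its own tmp dir in its conf/nutch-site.xml, with the
value below purely illustrative:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/tmp/hadoop-crawl-instance1</value>
    </property>

and a different value for the second instance. If I'm reading bin/nutch
correctly, the NUTCH_CONF_DIR environment variable would let each run pick up
its own conf directory. The single-instance alternative would presumably just
inject newly supplied top-level URLs into the existing crawldb (bin/nutch
inject) between crawl cycles.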
Cheers
Chris
Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/hadoop-root/mapred/system/job_local_0001/job.xml does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
        at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1197)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:92)
        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)