Re: Concurrently running multiple nutch crawls

Markus Jelsma Wed, 13 Jul 2011 07:50:46 -0700

You're running locally? You cannot run multiple Nutch instances locally while 
they all share the same /tmp/ directory: change /tmp/ per crawl, run on Hadoop, 
or run the crawls in sequence if you can live with that.
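
To make "change /tmp/ per crawl" concrete: the path in the trace 
(/tmp/hadoop-root/mapred/system) is what you get from Hadoop's default 
hadoop.tmp.dir of /tmp/hadoop-${user.name}, so giving each crawl its own 
hadoop.tmp.dir should keep the local job files apart. Below is a rough sketch 
that generates the property element for each crawl. The crawl names and the 
helper functions are made up for illustration, and where the property goes 
(for example a separate conf directory with its own nutch-site.xml per crawl) 
is an assumption about your setup, not something I have checked against it.

```python
import posixpath
import xml.etree.ElementTree as ET


def tmp_dir_for_crawl(crawl_name, base="/tmp"):
    """Return a temp directory unique to one crawl (hypothetical naming scheme)."""
    return posixpath.join(base, "hadoop-" + crawl_name)


def hadoop_tmp_property(crawl_name, base="/tmp"):
    """Build a Hadoop-style <property> element overriding hadoop.tmp.dir."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hadoop.tmp.dir"
    ET.SubElement(prop, "value").text = tmp_dir_for_crawl(crawl_name, base)
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # "crawl-a" and "crawl-b" are placeholder names for the two concurrent crawls.
    for name in ("crawl-a", "crawl-b"):
        print(hadoop_tmp_property(name))
```

Each crawl would then be started with its own configuration containing its 
own element, so the two local job runners no longer write job_local_0001 
into the same directory.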


On Wednesday 13 July 2011 16:38:04 Chris Alexander wrote:
> Hi again,
> 
> Continuing my investigations into nutch, I attempted running two nutch
> whole-web crawls against two different target URL sets simultaneously and
> with different crawl directories. All seemed to be going very well until
> the exception below appeared in one of the threads. It looks like
> something under the hood is using some lock files that seem to be
> overlapping. Is it possible to run two nutch instances side by side, or
> would it be a better architecture to have a single instance of the script
> running and have it pick up updates to the URLs it has to crawl (e.g. the
> user specifying new top-level URLs to crawl)?
> 
> Cheers
> 
> Chris
> 
> 
> Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/hadoop-root/mapred/system/job_local_0001/job.xml does not exist.
>         at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
>         at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
>         at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
>         at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
>         at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1197)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:92)
>         at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>         at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
