Re: Concurrently running multiple nutch crawls

Julien Nioche Wed, 13 Jul 2011 09:20:27 -0700

Having a single instance is a good solution: it makes the fetching more
efficient (the more domains there are, the more threads can work in
parallel) and simplifies the management of the crawls. You can modify the
scoring so that URLs added as seeds are fetched first; see OPIC scoring for
the default implementation.
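
As a rough illustration of the seed-priority idea, here is a minimal,
self-contained sketch. It does not use the Nutch scoring API: the class
SeedPriorityFrontier, its methods, and the two score constants are all
hypothetical names chosen for this example. It only shows the principle that
giving seeds a higher initial score makes a score-ordered selection pick them
before previously discovered URLs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Toy crawl frontier (NOT the Nutch API): seeds get a higher initial score
// so that a score-ordered selection returns them first.
public class SeedPriorityFrontier {
    // Hypothetical values, chosen only to make seeds outrank other URLs.
    static final float SEED_SCORE = 10.0f;
    static final float DISCOVERED_SCORE = 1.0f;

    private static final class Entry {
        final String url;
        final float score;
        final long order; // insertion order, used to break score ties

        Entry(String url, float score, long order) {
            this.url = url;
            this.score = score;
            this.order = order;
        }
    }

    // Highest score first; among equal scores, earliest inserted first.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>((a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Long.compare(a.order, b.order);
    });
    private long counter = 0;

    // A URL supplied by the user as a new top-level seed.
    void addSeed(String url) {
        queue.add(new Entry(url, SEED_SCORE, counter++));
    }

    // A URL found as an outlink during an earlier fetch round.
    void addDiscovered(String url) {
        queue.add(new Entry(url, DISCOVERED_SCORE, counter++));
    }

    // Select up to topN URLs for the next fetch round, best score first.
    List<String> nextBatch(int topN) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < topN && !queue.isEmpty()) {
            batch.add(queue.poll().url);
        }
        return batch;
    }

    public static void main(String[] args) {
        SeedPriorityFrontier frontier = new SeedPriorityFrontier();
        frontier.addDiscovered("http://a.example/page1");
        frontier.addDiscovered("http://a.example/page2");
        frontier.addSeed("http://b.example/"); // added last, fetched first
        // prints [http://b.example/, http://a.example/page1]
        System.out.println(frontier.nextBatch(2));
    }
}
```

In a real setup the equivalent change would live in the scoring plugin rather
than in a standalone queue like this one.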


Julien


> Continuing my investigations into nutch, I attempted running two nutch
> whole-web crawls against two different target URL sets simultaneously and
> with different crawl directories. All seemed to be going very well until
> the exception below appeared in one of the threads. It looks like something
> under the hood is using some lock files that seem to be overlapping. Is it
> possible to run two nutch instances side by side, or would it be a better
> architecture to have a single instance of the script running and
> have it pick up updates to the URLs it has to crawl (e.g. the user
> specifying new top-level URLs to crawl)?
>
> Cheers
>
> Chris
>
>
> Exception in thread "main" java.io.FileNotFoundException: File file:/tmp/hadoop-root/mapred/system/job_local_0001/job.xml does not exist.
>        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
>        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
>        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
>        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
>        at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
>        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1197)
>        at org.apache.hadoop.mapred.LocalJobRunner$Job.<init>(LocalJobRunner.java:92)
>        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
>        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
>        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:177)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:163)
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
