Thank you, Lewis and Rémy, for your replies. I'll modify my scripts to use the individual commands and switch to the crawl script you mentioned; a rough sketch of what that cycle might look like follows the quoted message below.
Thanks a lot for your help,
Max

-----Original Message-----
From: Lewis John Mcgibbney [mailto:[email protected]]
Sent: 30 August 2012 12:26
To: [email protected]
Subject: Re: recrawl a URL?

Hi Max,

On Tue, Aug 28, 2012 at 3:24 PM, Max Dzyuba <[email protected]> wrote:
> Is it possible to use the same crawldb but store segment data in a
> different directory for consecutive crawls using the "bin/nutch crawl"
> command? I thought that there is no option to specify the path to
> crawldb or linkdb, but only the path to a directory where to save all
> crawl data into. I'm using Nutch 1.5. If it's possible, what would the
> crawl command look like?

No, this is not possible out of the box, as it would make the generic
command-line solution too convoluted. As you mention, in the past we only
specified one directory for all crawl data, and this is still the same.

Please note that the crawl command is now deprecated in trunk and will not
be supported via convenience commands from the nutch script in future
releases. Julian and others implemented a crawl script which gives you much
more control over your crawl cycles.

I must finally add that it would be a piece of cake to edit the script for
your purposes, e.g. set a variable to today's date, create a directory named
after the variable, and then move your data there via the script... or
something similar. For reference, the script can be seen here:

http://svn.apache.org/repos/asf/nutch/trunk/src/bin/crawl
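
For my own notes, here is a minimal sketch of one crawl cycle using the
individual Nutch 1.x commands along the lines Lewis suggests: a single shared
crawldb, with segments written under a directory named after today's date.
The directory names (crawl/crawldb, crawl/segments, crawl/linkdb, urls) and
the -topN value are placeholders of my own, not from this thread, and the
exact arguments should be double-checked against the bin/nutch usage output
for 1.5:

#!/bin/bash
# Sketch of one crawl cycle with individual commands; shared crawldb,
# per-run (dated) segment directory. Paths below are assumptions.

CRAWLDB=crawl/crawldb
LINKDB=crawl/linkdb
SEGMENTS=crawl/segments/$(date +%Y%m%d)   # e.g. crawl/segments/20120830

mkdir -p "$SEGMENTS"

# Seed the crawldb from the urls directory (only needed on the first run).
bin/nutch inject "$CRAWLDB" urls

# Generate a fetch list; this creates a timestamped segment under $SEGMENTS.
bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN 1000

# Pick the newest segment that generate just created.
SEGMENT=$(ls -d "$SEGMENTS"/* | sort | tail -1)

bin/nutch fetch "$SEGMENT"                 # fetch the pages
bin/nutch parse "$SEGMENT"                 # parse them (unless fetcher.parse=true)
bin/nutch updatedb "$CRAWLDB" "$SEGMENT"   # fold results back into the crawldb
bin/nutch invertlinks "$LINKDB" -dir "$SEGMENTS"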

