Hi Max,

On Tue, Aug 28, 2012 at 3:24 PM, Max Dzyuba <[email protected]> wrote:
> Is it possible to use the same crawldb but store segment data in a different
> directory for consecutive crawls using the "bin/nutch crawl" command? I
> thought that there is no option to specify the path to crawldb or linkdb,
> but only the path to a directory where to save all crawl data into. I'm
> using Nutch 1.5. If it's possible, how would the crawl command look like?

No, this is not possible out of the box, as it would make the generic
command-line solution too convoluted. As you mention, in the past we only
specified one directory for all crawl data, and this is still the case.

Please note that the crawl command is now deprecated in trunk and will
not be supported via convenience commands from the nutch script in
future releases. Julian and others implemented a crawl script which
gives you much more control over your crawl cycles. I should finally add
that it would be a piece of cake to edit that script for your purposes,
e.g. set a variable to today's date, create a directory named after the
variable, then move your data there via the script... or something
similar. For reference, the script can be seen here:

http://svn.apache.org/repos/asf/nutch/trunk/src/bin/crawl
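To illustrate the date-based idea, here is a minimal sketch (not taken from the crawl script itself; the paths, the depth/topN values, and the shared crawldb location are all assumptions for the example):

```shell
#!/bin/bash
# Sketch: give each crawl run its own dated directory while keeping the
# crawldb in a shared location. All paths here are illustrative.
SHARED_CRAWLDB="shared/crawldb"          # assumed shared crawldb location
RUN_DIR="crawl-$(date +%Y-%m-%d)"        # e.g. crawl-2012-08-28
mkdir -p "$RUN_DIR"

# Run the crawl into the dated directory (commented out here, since it
# requires a Nutch installation and a seed list):
# bin/nutch crawl urls -dir "$RUN_DIR" -depth 3 -topN 1000

# Afterwards, move the run's crawldb back to the shared location so the
# next run can reuse it, leaving the segments behind in $RUN_DIR:
# rm -rf "$SHARED_CRAWLDB" && mv "$RUN_DIR/crawldb" "$SHARED_CRAWLDB"

echo "$RUN_DIR"
```

Something along these lines, dropped into your copy of the crawl script, would keep each run's segments separated by date.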
