RE: re-crawling with nutch 1.8

Markus Jelsma Fri, 13 Jun 2014 05:10:29 -0700

Hi Ali,

Nutch has no separate re-crawl mode: it refetches every URL once its fetch
interval has elapsed, which is 30 days by default. Usually one keeps Nutch
running indefinitely (e.g. via cron), and the URLs are then automatically
recrawled every 30 days.
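
To illustrate the idea, here is a minimal Python sketch of interval-based
selection. It is not Nutch code: the names is_due and select_due are
hypothetical, and the 30-day constant only mirrors the default mentioned above
(in Nutch itself the interval is, I believe, set in seconds through the
db.fetch.interval.default property).

```python
from datetime import datetime, timedelta

# Assumption: mirrors the 30-day default interval described above.
DEFAULT_FETCH_INTERVAL = timedelta(days=30)


def is_due(last_fetch, now, interval=DEFAULT_FETCH_INTERVAL):
    """Return True once the interval has elapsed since the last fetch."""
    return now >= last_fetch + interval


def select_due(last_fetch_times, now, interval=DEFAULT_FETCH_INTERVAL):
    """Return the URLs whose fetch interval has elapsed (hypothetical helper)."""
    return [url for url, fetched in last_fetch_times.items()
            if is_due(fetched, now, interval)]
```

Each scheduled run then only picks up the URLs that are due, so running the
same crawl cycle repeatedly acts as both the first crawl and every recrawl.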


Markus

 
 
-----Original message-----
> From:Ali Nazemian <[email protected]>
> Sent: Thursday 5th June 2014 21:25
> To: [email protected]
> Subject: re-crawling with nutch 1.8
> 
> Hi,
> I recently got familiar with Nutch and I want to use it for whole-web
> crawling. The problem is that I did not find any useful tutorial on how to
> re-crawl using Nutch. I know that some configuration parameters should be
> changed for recrawling, and I am aware of them. The thing that I don't
> know is how to run a crawler that crawls as a first step and recrawls in
> the following steps. As far as I found out, the default crawl script
> provided with Nutch cannot be used for my purpose. Could somebody tell me
> how I can do that? What are the prerequisites? Do I need a web
> application server such as Tomcat for this purpose?
> FYI I am using nutch 1.8.
> 
> Regards.
> 
> -- 
> A.Nazemian
>
