Re: Incremental crawling with nutch

S.L Wed, 04 Jun 2014 07:20:40 -0700

Ali,

In case you have not found it yet, the nutch-site.xml parameter I was referring to is db.fetch.interval.max.
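
As a rough sketch, an override in conf/nutch-site.xml would look something like the fragment below. The value shown is only the stock default from nutch-default.xml (7776000 seconds, i.e. 90 days) used as a placeholder, not a recommendation; the interval that suits a news crawl is an assumption you would need to choose and test yourself.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Properties set here override the ones in nutch-default.xml. -->
  <property>
    <name>db.fetch.interval.max</name>
    <!-- Maximum number of seconds between re-fetches of a page.
         7776000 (90 days) is the stock default, shown as a placeholder;
         replace it with the interval you actually need. -->
    <value>7776000</value>
  </property>
</configuration>
```

Once this period has passed, a page in the crawl db is retried regardless of its status, so the value acts as an upper bound on how long a page can go without being re-fetched.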



----- Reply message -----
From: "Ali Nazemian" <[email protected]>
To: <[email protected]>
Subject: Incremental crawling with nutch
Date: Mon, Jun 2, 2014 4:52 AM

Hi,
Could you please explain in more detail?
Which parameter do you mean, and how do I set it?
Regards.


On Mon, Jun 2, 2014 at 3:42 AM, S.L <[email protected]> wrote:

> Hi Ali
>
> Please see the nutch-site.xml parameters; one of them does that.
>
> Sent from my HTC
>
> ----- Reply message -----
> From: "Ali Nazemian" <[email protected]>
> To: <[email protected]>
> Subject: Incremental crawling with nutch
> Date: Sun, Jun 1, 2014 10:46 AM
>
> Hi everybody,
> I am going to use Nutch to crawl some news websites. These websites
> are updated regularly, so I need to recrawl them at least every 2
> hours. The problem is that I want an incremental recrawl: Nutch
> should fetch only the URLs that are new and have not been fetched before
> (except the main page of each site, which is needed to extract new URLs).
> In each recrawl, only the new URLs should be fetched and sent to Solr for
> indexing. Could somebody guide me through this scenario with Nutch 1.8?
> Best regards.
>
> --
> A.Nazemian
>



-- 
A.Nazemian
