Re: Incremental crawling with nutch

Ali Nazemian Wed, 04 Jun 2014 10:33:31 -0700

Thank you very much. But that is just a parameter for specifying the
interval between re-crawls. The problem is that the Nutch re-crawl does
not work with the default crawl script.
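
For anyone following the thread, the parameter mentioned below would be
overridden in conf/nutch-site.xml roughly as follows. This is only a sketch:
the values are illustrative assumptions (7200 seconds corresponds to the
2-hour re-crawl from the original question), and db.fetch.interval.default
is not named in this thread; it is included on the assumption that it is
the property controlling the normal re-fetch interval, while
db.fetch.interval.max only caps it.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed value: re-fetch a page 7200 s (2 hours) after its last fetch. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>7200</value>
  </property>
  <!-- Upper bound on the re-fetch interval, in seconds (assumed value: 1 day). -->
  <property>
    <name>db.fetch.interval.max</name>
    <value>86400</value>
  </property>
</configuration>
```

Note that these settings apply to every URL in the crawldb, so on their own
they do not restrict a re-crawl to new URLs only.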



On Wed, Jun 4, 2014 at 6:49 PM, S.L <[email protected]> wrote:

> Ali,
>
> If you have not found this out yet, I was referring to
> db.fetch.interval.max.
>
> Sent from my HTC
>
> ----- Reply message -----
> From: "Ali Nazemian" <[email protected]>
> To: <[email protected]>
> Subject: Incremental crawling with nutch
> Date: Mon, Jun 2, 2014 4:52 AM
>
> Hi,
> Could you please explain more?
> What parameter? How can I do that?!
> Regards.
>
>
> On Mon, Jun 2, 2014 at 3:42 AM, S.L <[email protected]> wrote:
>
> > Hi Ali
> >
> > Please see the nutch-site.xml parameters; one of them does that.
> >
> > Sent from my HTC
> >
> > ----- Reply message -----
> > From: "Ali Nazemian" <[email protected]>
> > To: <[email protected]>
> > Subject: Incremental crawling with nutch
> > Date: Sun, Jun 1, 2014 10:46 AM
> >
> > Hi everybody,
> > I am going to use Nutch to crawl some news websites. These websites
> > are updated regularly, so I need to re-crawl them at least every 2
> > hours. The problem is that I want an incremental re-crawl, meaning
> > Nutch should crawl only the URLs that are new and have not been fetched
> > before (except the main page of each site, which is needed to extract
> > new URLs). In each re-crawl, I want only the new URLs to be fetched and
> > sent to Solr for indexing. Could somebody guide me through this scenario
> > with Nutch 1.8?
> > Best regards.
> >
> > --
> > A.Nazemian
> >
>
>
>
> --
> A.Nazemian
>



-- 
A.Nazemian
