> Hi,
>
> I took a look at the recrawl script and noticed that all the steps except
> URL injection are repeated on each subsequent indexing run, and I wondered
> why we would generate new segments. Is it possible to run the fetch and
> updatedb steps over all the previous segments $s1..$sn, and then the
> invertlinks and index steps?
No, the generator generates a segment with a list of URLs for the fetcher to
fetch. You can, if you like, then merge the segments afterwards (see the
sketch at the bottom of this message).

> Thanks.
> Alex.
>
> -----Original Message-----
> From: Julien Nioche <[email protected]>
> To: user <[email protected]>
> Sent: Wed, Jun 1, 2011 12:59 am
> Subject: Re: keeping index up to date
>
> You should use the adaptive fetch schedule. See
> http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
> for details.
>
> On 1 June 2011 07:18, <[email protected]> wrote:
> > Hello,
> >
> > I use nutch-1.2 to index about 3000 sites. One of them has about 1500 PDF
> > files which do not change over time.
> > I wondered if there is a way of configuring Nutch not to fetch unchanged
> > documents again and again, but to keep the old index entries for them.
> >
> > Thanks.
> > Alex.
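
To make the answer above concrete, here is a rough sketch of one recrawl
cycle with the Nutch 1.2 command-line tools, assuming a crawl directory laid
out as crawl/crawldb, crawl/segments and crawl/linkdb (the paths, the -topN
value and the number of rounds are just placeholders, not taken from your
script):

#!/bin/sh
CRAWL=crawl

# inject is only needed when you have new seed URLs
bin/nutch inject $CRAWL/crawldb urls

for round in 1 2; do
  # generate writes a brand new segment holding the URLs that are due for fetching
  bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 1000
  SEGMENT=`ls -d $CRAWL/segments/* | tail -1`
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT                     # skip if fetcher.parse=true
  bin/nutch updatedb $CRAWL/crawldb $SEGMENT
done

# optional: fold the per-round segments into a single merged segment
bin/nutch mergesegs $CRAWL/segments_merged -dir $CRAWL/segments

bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments
bin/nutch index $CRAWL/indexes $CRAWL/crawldb $CRAWL/linkdb $CRAWL/segments/*
# (or bin/nutch solrindex <solr url> ... if you index into Solr instead)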

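For the adaptive fetch schedule that Julien mentions in the quoted reply, the
switch is made in conf/nutch-site.xml. The property names below are the ones
shipped in nutch-default.xml; the interval values are only examples, so tune
them to how often your sites actually change:

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <!-- starting interval: 30 days -->
  <value>2592000</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <!-- never refetch a page more often than once a day -->
  <value>86400</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <!-- pages that never change back off to at most 90 days -->
  <value>7776000</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <!-- interval grows by 40% each time a page comes back unmodified -->
  <value>0.4</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <!-- interval shrinks by 20% each time a page has changed -->
  <value>0.2</value>
</property>

Whether a page counts as modified is decided from its signature, so documents
that never change, like the PDFs in the original question, should drift
towards max_interval and be fetched far less often.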
