Re: keeping index up to date

Markus Jelsma Tue, 26 Jul 2011 12:56:25 -0700

We have the injector for that ;)
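
As a rough illustration of the injector step, here is a minimal Python sketch that builds (and optionally runs) the Nutch 1.x command "bin/nutch inject <crawldb> <url_dir>" against an existing crawldb. The paths crawl/crawldb and new_urls, the file name seed.txt, and the helper function names are assumptions made for illustration; none of them come from this thread, so adjust them to your own installation.

```python
import subprocess
from pathlib import Path


def write_seed_file(urls, url_dir, filename="seed.txt"):
    """Write one URL per line into url_dir/filename and return the path.

    The file name is an arbitrary choice (an assumption of this sketch);
    the injector is pointed at the directory, not at a specific file.
    """
    directory = Path(url_dir)
    directory.mkdir(parents=True, exist_ok=True)
    seed_path = directory / filename
    seed_path.write_text("".join(f"{url}\n" for url in urls), encoding="utf-8")
    return seed_path


def inject_command(crawldb, url_dir, nutch_bin="bin/nutch"):
    """Build the argv for: bin/nutch inject <crawldb> <url_dir>."""
    return [nutch_bin, "inject", str(crawldb), str(url_dir)]


def inject(crawldb, url_dir, nutch_bin="bin/nutch"):
    """Run the injector; raises CalledProcessError on a non-zero exit."""
    subprocess.run(inject_command(crawldb, url_dir, nutch_bin), check=True)


if __name__ == "__main__":
    # Dry run: only print the command instead of executing it.
    print(" ".join(inject_command("crawl/crawldb", "new_urls")))
```

Once injected into the existing crawldb, the new URLs should be eligible for selection by the next generate step of the recrawl, subject to limits such as -topN.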


>  Hello,
> 
> One more question. Is there a way to add new URLs to a crawldb created in
> previous crawls so that they are included in subsequent recrawls?
> 
> Thanks.
> Alex.
> 
> 
> 
> -----Original Message-----
> From: lewis john mcgibbney <[email protected]>
> To: user <[email protected]>; markus.jelsma <[email protected]>
> Sent: Tue, Jun 7, 2011 1:16 pm
> Subject: Re: keeping index up to date
> 
> 
> Hi,
> 
> To add to Markus' comments, if you take a look at the script, it is written
> in such a way that, if run in safe mode, it protects us against an error
> which may occur. If this is the case, we can recover segments etc. and take
> appropriate action to resolve it.
> 
> On Tue, Jun 7, 2011 at 9:01 PM, Markus Jelsma <[email protected]> wrote:
> > >  Hi,
> > > 
> > > I took a look at the recrawl script and noticed that all the steps
> > > except URL injection are repeated on each subsequent indexing run, and
> > > wondered why we would generate new segments. Is it possible to do the
> > > fetch and update steps for all previous $s1..$sn, followed by the
> > > invertlinks and index steps?
> > 
> > No, the generator generates a segment with a list of URLs for the fetcher
> > to fetch. You can, if you like, then merge segments.
> > 
> > > Thanks.
> > > Alex.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Julien Nioche <[email protected]>
> > > To: user <[email protected]>
> > > Sent: Wed, Jun 1, 2011 12:59 am
> > > Subject: Re: keeping index up to date
> > > 
> > > 
> > > You should use the adaptive fetch schedule. See
> > > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
> > > for details.
> > > 
> > > On 1 June 2011 07:18, <[email protected]> wrote:
> > > > Hello,
> > > > 
> > > > I use nutch-1.2 to index about 3000 sites. One of them has about 1500
> > > > pdf files which do not change over time.
> > > > I wondered if there is a way of configuring nutch not to fetch
> > > > unchanged documents again and again, but keep the old index for them.
> > > > 
> > > > 
> > > > Thanks.
> > > > Alex.
