Hi,

To add to Markus' comments, if you take a look at the script you will see it
is written in such a way that, when run in safe mode, it protects us against
errors that may occur. If an error does occur, we can recover the segments
etc. and take appropriate action to resolve it.
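
To illustrate the idea, here is a minimal sketch of that kind of guard,
assuming a safe-mode flag around the updatedb step; the $safe, $crawldb and
$segment names are my own placeholders, not taken from the script itself:

    # back up the crawldb before touching it, so a failed update is recoverable
    if [ "$safe" = "yes" ]; then
      cp -r "$crawldb" "$crawldb.bak"
    fi

    # fold the fetched segment back into the crawldb
    if ! bin/nutch updatedb "$crawldb" "$segment"; then
      echo "updatedb failed; restoring crawldb from backup" >&2
      [ "$safe" = "yes" ] && rm -rf "$crawldb" && mv "$crawldb.bak" "$crawldb"
      exit 1
    fi

    # success: drop the backup
    [ "$safe" = "yes" ] && rm -rf "$crawldb.bak"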

On Tue, Jun 7, 2011 at 9:01 PM, Markus Jelsma <[email protected]> wrote:

>
> > Hi,
> >
> > I took a look at the recrawl script and noticed that all the steps except
> > URL injection are repeated on each subsequent run, and I wondered why we
> > would generate new segments. Is it possible to do the fetch and update
> > steps for all previous segments $s1..$sn, followed by the invertlinks and
> > index steps?
>
> No, the generator generates a segment containing the list of URLs for the
> fetcher to fetch. You can, if you like, then merge segments.
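>
> To make that concrete, one pass of the loop looks roughly like this in
> Nutch 1.x (the paths are illustrative, and whether you need the separate
> parse step depends on your fetcher.parse setting):
>
>     # generate a new segment: the list of URLs that are due for fetching
>     bin/nutch generate crawl/crawldb crawl/segments
>     segment=`ls -d crawl/segments/2* | tail -1`
>
>     # fetch the segment, parse it, and fold the results into the crawldb
>     bin/nutch fetch $segment
>     bin/nutch parse $segment
>     bin/nutch updatedb crawl/crawldb $segment
>
>     # optionally merge all segments into one to keep their number down
>     bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments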
>
> >
> > Thanks.
> > Alex.
> >
> > -----Original Message-----
> > From: Julien Nioche <[email protected]>
> > To: user <[email protected]>
> > Sent: Wed, Jun 1, 2011 12:59 am
> > Subject: Re: keeping index up to date
> >
> >
> > You should use the adaptive fetch schedule. See
> > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for
> > details.
> >
> > On 1 June 2011 07:18, <[email protected]> wrote:
> > > Hello,
> > >
> > > I use nutch-1.2 to index about 3000 sites. One of them has about 1500
> > > PDF files which do not change over time.
> > > I wondered if there is a way of configuring Nutch not to fetch unchanged
> > > documents again and again, but to keep the old index entries for them.
> > >
> > > Thanks.
> > > Alex.
>



-- 
*Lewis*
