This will work only for URLs whose servers honor the If-Modified-Since header, but most URLs do not support it.
Thanks,
Alex

-----Original Message-----
From: Max Dzyuba <[email protected]>
To: Markus Jelsma <[email protected]>; user <[email protected]>
Sent: Fri, Aug 24, 2012 9:02 am
Subject: RE: recrawl a URL?

Thanks again! I'll have to test it more then in my 1.5.1.

Best regards,
Max

Markus Jelsma <[email protected]> wrote:

Hmm, I had to look it up but it is supported in 1.5 and 1.5.1:
http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup

-----Original message-----
> From: Max Dzyuba <[email protected]>
> Sent: Fri 24-Aug-2012 17:35
> To: Markus Jelsma <[email protected]>; [email protected]
> Subject: RE: recrawl a URL?
>
> Thank you for the reply. Does it mean that it is not supported in the latest stable release of Nutch?
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: 24 August 2012 17:21
> To: [email protected]; Max Dzyuba
> Subject: RE: recrawl a URL?
>
> Hi,
>
> Trunk has a feature for this: indexer.skip.notmodified
>
> Cheers
>
> -----Original message-----
> > From: Max Dzyuba <[email protected]>
> > Sent: Fri 24-Aug-2012 17:19
> > To: [email protected]
> > Subject: recrawl a URL?
> >
> > Hello everyone,
> >
> > I run a crawl command every day, but I don't want Nutch to submit an
> > update to Solr if a particular page hasn't changed. How do I achieve
> > that? Right now the value of db.fetch.interval.default doesn't seem to
> > help prevent the crawl since the updates are submitted to Solr as if
> > the page has been changed. I know for sure that the page has not been
> > changed. This happens for every new crawl command.
> >
> > Thanks in advance,
> >
> > Max
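The indexer.skip.notmodified switch Markus mentions above, and the db.fetch.interval.default property from the original question, are ordinary Nutch configuration properties that can be overridden in conf/nutch-site.xml. A minimal sketch of such an override follows; the description text and the 30-day interval are illustrative values, not copied from nutch-default.xml:

    <!-- Skip records whose CrawlDb status is db_notmodified at indexing time,
         so unchanged pages are not resubmitted to Solr. -->
    <property>
      <name>indexer.skip.notmodified</name>
      <value>true</value>
      <description>When true, the indexer skips documents that were not
      modified since the last fetch.</description>
    </property>

    <!-- How long (in seconds) to wait before refetching a page at all;
         2592000 = 30 days is shown here only as an example. -->
    <property>
      <name>db.fetch.interval.default</name>
      <value>2592000</value>
    </property>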

