Do I understand correctly that if I use the "bin/nutch crawl" command and store the crawl data in a new directory every time, then there is no way for Nutch to know whether a page has changed since the last crawl?
Thanks,
Max

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: 24 August 2012 21:26
To: [email protected]; [email protected]
Subject: RE: recrawl a URL?

No, the CrawlDatum's status field will be set to db_notmodified if the signatures match, regardless of the HTTP headers. The header only sets fetch_notmodified, but that is not relevant for the db_* status.

-----Original message-----
> From: [email protected] <[email protected]>
> Sent: Fri 24-Aug-2012 20:14
> To: [email protected]; [email protected]
> Subject: Re: recrawl a URL?
>
> This will work only for URLs that have If-Modified-Since headers, but most
> URLs do not have this header.
>
> Thanks.
> Alex.
>
> -----Original Message-----
> From: Max Dzyuba <[email protected]>
> To: Markus Jelsma <[email protected]>; user <[email protected]>
> Sent: Fri, Aug 24, 2012 9:02 am
> Subject: RE: recrawl a URL?
>
> Thanks again! I'll have to test it more then in my 1.5.1.
>
> Best regards,
> Max
>
> Markus Jelsma <[email protected]> wrote:
>
> Hmm, I had to look it up, but it is supported in 1.5 and 1.5.1:
>
> http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup
>
> -----Original message-----
> > From: Max Dzyuba <[email protected]>
> > Sent: Fri 24-Aug-2012 17:35
> > To: Markus Jelsma <[email protected]>; [email protected]
> > Subject: RE: recrawl a URL?
> >
> > Thank you for the reply. Does this mean that it is not supported in the
> > latest stable release of Nutch?
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: 24 August 2012 17:21
> > To: [email protected]; Max Dzyuba
> > Subject: RE: recrawl a URL?
> >
> > Hi,
> >
> > Trunk has a feature for this: indexer.skip.notmodified
> >
> > Cheers
> >
> > -----Original message-----
> > > From: Max Dzyuba <[email protected]>
> > > Sent: Fri 24-Aug-2012 17:19
> > > To: [email protected]
> > > Subject: recrawl a URL?
> > >
> > > Hello everyone,
> > >
> > > I run a crawl command every day, but I don't want Nutch to submit
> > > an update to Solr if a particular page hasn't changed. How do I
> > > achieve that? Right now the value of db.fetch.interval.default
> > > doesn't seem to help, since updates are submitted to Solr as if the
> > > page had been changed. I know for sure that the page has not
> > > changed; this happens on every new crawl command.
> > >
> > > Thanks in advance,
> > > Max
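
For anyone landing on this thread later: the indexer.skip.notmodified property Markus points to is an ordinary Nutch configuration property, so it can be enabled in conf/nutch-site.xml. A minimal sketch, assuming Nutch 1.5.1; the description text below is illustrative rather than copied from nutch-default.xml:

  <!-- conf/nutch-site.xml (sketch): skip db_notmodified records at indexing time -->
  <configuration>
    <property>
      <name>indexer.skip.notmodified</name>
      <value>true</value>
      <description>Illustrative description: when true, documents whose
      CrawlDatum status is db_notmodified are not passed to the indexer,
      so unchanged pages are not re-submitted to Solr on every recrawl.
      </description>
    </property>
  </configuration>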
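
As for the question at the top of the thread about writing the crawl data to a new directory each time: the signature comparison that produces db_notmodified happens in the updatedb step, so the same crawldb has to be reused from run to run. Below is a rough sketch of one step-by-step recrawl cycle on Nutch 1.5.x, assuming a seed list in urls/ and a Solr instance at http://localhost:8983/solr/ (both placeholders); the exact solrindex arguments vary a little between 1.x releases, so check the usage printed by bin/nutch solrindex.

  # One recrawl cycle, reusing the same crawldb so that updatedb can
  # compare page signatures and mark unchanged pages as db_notmodified.
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=$(ls -d crawl/segments/* | tail -1)    # newest segment
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"    # signature comparison happens here
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"

Combined with the indexer.skip.notmodified setting above, the final solrindex step should then leave the db_notmodified records out instead of re-submitting unchanged pages.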

