Do I understand correctly that if I use the "bin/nutch crawl" command and store the crawl data in a new directory every time, then there is no way for Nutch to know whether a page has changed since the last crawl?
Thanks,
Max

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: 24 August 2012 21:26
To: [email protected]; [email protected]
Subject: RE: recrawl a URL?

No, the CrawlDatum's status field will be set to db_notmodified if the signatures match, regardless of the HTTP headers. The header only sets fetch_notmodified, but that is not relevant for the db_* status.

-----Original message-----
> From: [email protected] <[email protected]>
> Sent: Fri 24-Aug-2012 20:14
> To: [email protected]; [email protected]
> Subject: Re: recrawl a URL?
>
> This will work only for URLs that have If-Modified-Since headers, but most
> URLs do not have this header.
>
> Thanks.
> Alex.
>
> -----Original Message-----
> From: Max Dzyuba <[email protected]>
> To: Markus Jelsma <[email protected]>; user <[email protected]>
> Sent: Fri, Aug 24, 2012 9:02 am
> Subject: RE: recrawl a URL?
>
> Thanks again! I'll have to test it more then in my 1.5.1.
>
> Best regards,
> Max
>
> Markus Jelsma <[email protected]> wrote:
>
> Hmm, I had to look it up, but it is supported in 1.5 and 1.5.1:
>
> http://svn.apache.org/viewvc/nutch/tags/release-1.5.1/src/java/org/apache/nutch/indexer/IndexerMapReduce.java?view=markup
>
> -----Original message-----
> > From: Max Dzyuba <[email protected]>
> > Sent: Fri 24-Aug-2012 17:35
> > To: Markus Jelsma <[email protected]>; [email protected]
> > Subject: RE: recrawl a URL?
> >
> > Thank you for the reply. Does this mean that it is not supported in the
> > latest stable release of Nutch?
> >
> > -----Original Message-----
> > From: Markus Jelsma [mailto:[email protected]]
> > Sent: 24 August 2012 17:21
> > To: [email protected]; Max Dzyuba
> > Subject: RE: recrawl a URL?
> >
> > Hi,
> >
> > Trunk has a feature for this: indexer.skip.notmodified
> >
> > Cheers
> >
> > -----Original message-----
> > > From: Max Dzyuba <[email protected]>
> > > Sent: Fri 24-Aug-2012 17:19
> > > To: [email protected]
> > > Subject: recrawl a URL?
> > >
> > > Hello everyone,
> > >
> > > I run a crawl command every day, but I don't want Nutch to submit
> > > an update to Solr if a particular page hasn't changed. How do I
> > > achieve that? Right now the value of db.fetch.interval.default
> > > doesn't seem to help, since updates are submitted to Solr as if the
> > > page had been changed. I know for sure that the page has not
> > > changed; this happens on every new crawl command.
> > >
> > > Thanks in advance,
> > > Max
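
For anyone landing on this thread later: the indexer.skip.notmodified property Markus points to is an ordinary Nutch configuration property, so it can be enabled in conf/nutch-site.xml. A minimal sketch, assuming Nutch 1.5.1; the description text below is illustrative rather than copied from nutch-default.xml:

  <!-- conf/nutch-site.xml (sketch): skip db_notmodified records at indexing time -->
  <configuration>
    <property>
      <name>indexer.skip.notmodified</name>
      <value>true</value>
      <description>Illustrative description: when true, documents whose
      CrawlDatum status is db_notmodified are not passed to the indexer,
      so unchanged pages are not re-submitted to Solr on every recrawl.
      </description>
    </property>
  </configuration>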
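
As for the question at the top of the thread about writing the crawl data to a new directory each time: the signature comparison that produces db_notmodified happens in the updatedb step, so the same crawldb has to be reused from run to run. Below is a rough sketch of one step-by-step recrawl cycle on Nutch 1.5.x, assuming a seed list in urls/ and a Solr instance at http://localhost:8983/solr/ (both placeholders); the exact solrindex arguments vary a little between 1.x releases, so check the usage printed by bin/nutch solrindex.

  # One recrawl cycle, reusing the same crawldb so that updatedb can
  # compare page signatures and mark unchanged pages as db_notmodified.
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=$(ls -d crawl/segments/* | tail -1)    # newest segment
  bin/nutch fetch "$SEGMENT"
  bin/nutch parse "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"    # signature comparison happens here
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"

Combined with the indexer.skip.notmodified setting above, the final solrindex step should then leave the db_notmodified records out instead of re-submitting unchanged pages.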

