I was wondering: how do you know whether a page has changed without actually fetching it?
On Wednesday, May 23, 2012, wrote:
> Hello,
>
> As far as I understand, Nutch recrawls URLs when their fetch time has
> passed the current time, regardless of whether those URLs were modified.
> Is there any initiative to restrict recrawls to only those URLs whose
> modified time (MT) is greater than the old MT?
> In detail: if Nutch has crawled a URL with its next fetch time in 30 days,
> then in the second recrawl Nutch must visit this URL, retrieve its modified
> time, and compare it with the modified time that we have in the crawldb.
> It should recrawl the URL if the new MT is greater than the old one, and
> skip it otherwise.
>
> Thanks.
> Alex.
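
For what it's worth, one common way to ask "has it changed?" without downloading the body is an HTTP HEAD request that reads the Last-Modified header. Below is a rough sketch of the comparison described in the quoted message, under stated assumptions: this is not Nutch code, the class and method names (RecrawlCheck, shouldRefetch, remoteModifiedTime) are made up for illustration, and it assumes the server reports Last-Modified honestly, which many dynamic sites do not.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for illustration only; not part of Nutch.
class RecrawlCheck {

    // Decision rule from the proposal: refetch only when the server's
    // modified time is newer than the one stored from the last crawl.
    // A remote value of 0 means no Last-Modified header was sent, so we
    // cannot tell and refetch to be safe (an assumption of this sketch).
    static boolean shouldRefetch(long storedModifiedTime, long remoteModifiedTime) {
        if (remoteModifiedTime <= 0L) {
            return true;
        }
        return remoteModifiedTime > storedModifiedTime;
    }

    // Requests headers only (HEAD), so the page body is not downloaded.
    // Returns the Last-Modified time in epoch milliseconds, or 0 if absent.
    static long remoteModifiedTime(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setConnectTimeout(5000);
            conn.setReadTimeout(5000);
            return conn.getLastModified();
        } finally {
            conn.disconnect();
        }
    }
}
```

A caller would do something like shouldRefetch(storedMT, remoteModifiedTime(url)). Note that this still costs one round trip per URL. A conditional GET with If-Modified-Since is a variation that lets the server answer 304 Not Modified or send the new content in the same request.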

