Hello,

As far as I understand, Nutch recrawls URLs once their fetch time has passed
the current time, regardless of whether those URLs have been modified.
Is there any initiative to restrict recrawls to only those URLs whose
modified time (MT) is greater than the old MT?
In detail: if Nutch has crawled a URL with a next fetch time 30 days out, then
on the next recrawl Nutch should visit this URL, retrieve its modified time, and
compare it with the modified time stored in the crawldb. It should recrawl the
URL if the new MT is greater than the old one, and skip it otherwise.
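
To make the comparison concrete, here is a minimal sketch in plain Java of the
check I have in mind. It does not use any Nutch classes: the names RecrawlCheck
and shouldRefetch are made up for illustration, times are assumed to be epoch
milliseconds, and treating a missing MT (zero or negative) as "recrawl" is my
own assumption rather than anything Nutch is known to do.

```java
// Hypothetical helper, not part of Nutch: decides whether a URL that is due
// for a recrawl actually needs to be fetched again.
final class RecrawlCheck {

    private RecrawlCheck() {
        // static helper only
    }

    /**
     * @param storedModifiedTime MT recorded in the crawldb (epoch millis, assumed)
     * @param newModifiedTime    MT reported by the server now (epoch millis, assumed)
     * @return true if the URL should be recrawled, false if it can be skipped
     */
    static boolean shouldRefetch(long storedModifiedTime, long newModifiedTime) {
        // Assumption: if either MT is unknown we cannot prove the page is
        // unchanged, so we recrawl to be safe.
        if (storedModifiedTime <= 0 || newModifiedTime <= 0) {
            return true;
        }
        // Recrawl only when the page is strictly newer than what we stored.
        return newModifiedTime > storedModifiedTime;
    }
}
```

With this, a URL whose new MT equals the stored MT would be skipped, and only
URLs with a strictly greater MT would be recrawled.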

Thanks.
Alex.

