I was wondering how you would know whether the page had changed without
actually fetching it.
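
For pages served over HTTP, one common way to ask this question without
downloading the body is a conditional GET: send the stored modified time in
an If-Modified-Since header and let the server answer 304 Not Modified. The
sketch below only illustrates that idea. It is not Nutch code, the function
names (conditional_headers, is_modified, check) are made up for this
example, and it assumes the server reports a trustworthy Last-Modified
value, which many servers do not.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
import urllib.error
import urllib.request


def conditional_headers(stored_mt):
    """Build an If-Modified-Since header from the stored modified time.

    stored_mt must be a timezone-aware UTC datetime (hypothetical stand-in
    for the modified time kept in the crawldb).
    """
    return {"If-Modified-Since": format_datetime(stored_mt, usegmt=True)}


def is_modified(status, last_modified_header, stored_mt):
    """Decide whether the page should be treated as changed."""
    if status == 304:
        # The server says nothing changed since stored_mt.
        return False
    if last_modified_header is None:
        # No modified time available: assume changed to be safe.
        return True
    return parsedate_to_datetime(last_modified_header) > stored_mt


def check(url, stored_mt):
    """Issue the conditional GET; returns (changed, body or None)."""
    req = urllib.request.Request(url, headers=conditional_headers(stored_mt))
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            changed = is_modified(
                resp.status, resp.headers.get("Last-Modified"), stored_mt
            )
            return changed, resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib surfaces 304 as an HTTPError; no body was transferred.
            return False, None
        raise
```

Note that this still costs one request per URL; it saves the transfer and
parsing of unchanged pages, not the visit itself.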

On Wednesday, May 23, 2012, wrote:

> Hello,
>
> As far as I understand, Nutch recrawls URLs once their fetch time has
> passed, regardless of whether those URLs were modified.
> Is there any initiative to restrict recrawls to only those URLs whose
> modified time (MT) is greater than the old MT?
> In detail: if Nutch has crawled a URL with a next fetch time in 30 days,
> then on the second recrawl Nutch must visit this URL, retrieve its modified
> time, compare it with the modified time we have in the crawldb, and
> recrawl it if the new MT is greater than the old one; otherwise skip it.
>
> Thanks.
> Alex.
