Hi, did you solve the problem yourself? I'm running into the same issue.
Maybe someone else could help here?

Regards
Hannes

On Wed, Oct 27, 2010 at 12:28 PM, Davide Cavalaglio <[email protected]> wrote:
> Hi,
> I have a problem with the If-Modified-Since option in Nutch.
> I want to crawl a website every day, so I have set the
> db.fetch.interval.default property in nutch-site.xml accordingly.
> But I want to limit Nutch to fetching only pages that have changed, using
> the If-Modified-Since header.
>
> I found some resources on the web for this task, but when I recrawl pages
> after the fetch interval, Nutch downloads all pages again. I use Nutch 1.0
> with protocol http. I don't use the Adaptive Scheduler. In
> HttpResponse.java I added this code:
>
>     if (datum.getModifiedTime() > 0) {
>       String httpDate = HttpDateFormat.toString(datum.getModifiedTime());
>       Http.LOG.debug("modified time: " + httpDate);
>       reqStr.append("If-Modified-Since: " + httpDate);
>       reqStr.append("\r\n");
>     }
>     else if (datum.getFetchTime() > 0) {
>       String httpDate = HttpDateFormat.toString(datum.getFetchTime());
>       Http.LOG.debug("modified time: " + httpDate);
>       reqStr.append("If-Modified-Since: " + httpDate);
>       reqStr.append("\r\n");
>     }
>
>     reqStr.append("\r\n");
>
> because there was a bug that prevented the use of If-Modified-Since.
> I also changed Fetcher.java so that I have the correct value of
> LastModified in the CrawlDb.
> I tried crawling other websites because I wanted to understand whether the
> problem is that my web server does not support If-Modified-Since. But in
> every test I always get response code 200, even if the last-modified time
> of the web page is older than the LastModified in the CrawlDb.
>
> Can anyone tell me how to correctly use If-Modified-Since?
> Thanks,
> Cavalaglio Davide
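
One thing that might be worth ruling out is the header line itself: a server is expected to ignore If-Modified-Since when the value is not a valid HTTP-date in GMT, and it then answers with a plain 200. Below is a minimal sketch in plain Java of what the line should look like on the wire. It is not Nutch code: the class and method names are hypothetical, and it assumes the timestamps are milliseconds since the epoch. The fallback from modified time to fetch time mirrors the quoted snippet.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
final class ConditionalGetSketch {

    // Fixed-length HTTP-date: English day and month names, two-digit day, always GMT.
    private static final DateTimeFormatter HTTP_DATE =
        DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                         .withZone(ZoneOffset.UTC);

    // Formats a timestamp (assumed: milliseconds since the epoch) as an HTTP-date.
    static String httpDate(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    // Prefers the modified time, falls back to the fetch time,
    // and emits no header at all when neither is known.
    static String ifModifiedSinceLine(long modifiedTime, long fetchTime) {
        long t = modifiedTime > 0 ? modifiedTime : fetchTime;
        return t > 0 ? "If-Modified-Since: " + httpDate(t) + "\r\n" : "";
    }

    public static void main(String[] args) {
        // Prints the header line for a sample timestamp.
        System.out.print(ifModifiedSinceLine(1288182480000L, 0L));
    }
}
```

If the debug log shows a line in a different shape (local time zone, localized month names, a missing space after the colon), that alone could explain the 200 responses. If the line looks right, a 304 would only be expected when the date sent is at or after the page's Last-Modified on the server.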

