Nutch 2.x is very similar to 1.x; lib-http and protocol-http(client) did 
not really change. This is not possible out of the box in Nutch 1.7, there are 
no switches for this behaviour. I don't think this is easy to do with 
protocol-httpclient, unless HttpClient is already capable of this, but you'd 
have to check the ancient javadocs to be sure. It is possible to patch 
protocol-http for this. The CrawlDatum is passed in, so you know the previous 
date and can stop reading bytes after the headers.
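
To make the patch idea a bit more concrete, here is a rough sketch of the 
decision in plain Java. It does not use any Nutch classes: 
ConditionalBodyFetch, isUnmodified and fetchIfModified are made-up names, and 
storedModifiedTime stands in for whatever modified time you would take from 
the CrawlDatum. It uses java.net.HttpURLConnection rather than the 
protocol-http code, so treat it as an illustration only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of Nutch or protocol-http.
final class ConditionalBodyFetch {

    /**
     * True when both timestamps (epoch millis) are known and the server's
     * Last-Modified is not newer than the one stored from the previous fetch.
     */
    static boolean isUnmodified(long serverLastModified, long storedModifiedTime) {
        // HttpURLConnection.getLastModified() returns 0 when the header is
        // absent or unparsable, so 0 is treated as "unknown".
        return serverLastModified > 0
                && storedModifiedTime > 0
                && serverLastModified <= storedModifiedTime;
    }

    /** Returns the body, or null when the page looks unmodified and the body is skipped. */
    static byte[] fetchIfModified(URL url, long storedModifiedTime) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try {
            // Sends the request and reads the response headers only.
            long serverLastModified = conn.getLastModified();
            if (isUnmodified(serverLastModified, storedModifiedTime)) {
                return null; // the finally block disconnects without draining the body
            }
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes(); // Java 9+
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Keep in mind that part of the body may already be sitting in socket buffers 
by the time the headers are parsed, so the saving is at most the remainder 
of the page.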

I don't think this is a good idea to implement. You won't really notice a 
faster fetcher unless you're processing many millions of pages. You also 
_cannot trust_ HTTP headers: you are guaranteed to run into sites with crazy 
HTTP headers and crazy values for Last-Modified. Nothing makes sense on the 
internet.
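
If you do go down this road, you would at least want to sanity-check the 
header before acting on it. A minimal sketch, assuming RFC 1123 formatted 
dates and treating anything unparsable, in the future, or at or before the 
epoch as "unknown". LastModifiedSanity and parseTrusted are made-up names, 
and the rejection rules are only my guess at a reasonable baseline.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of Nutch.
final class LastModifiedSanity {

    /**
     * Parses a Last-Modified header value and returns it only if it looks
     * plausible relative to 'now'. Anything suspicious yields Optional.empty().
     */
    static Optional<Instant> parseTrusted(String header, Instant now) {
        if (header == null || header.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            Instant parsed = ZonedDateTime
                    .parse(header.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            // Reject timestamps in the future and at or before the epoch.
            if (parsed.isAfter(now) || parsed.getEpochSecond() <= 0) {
                return Optional.empty();
            }
            return Optional.of(parsed);
        } catch (DateTimeParseException e) {
            return Optional.empty(); // garbage value
        }
    }
}
```

An empty result means you fall back to reading the body, which is the safe 
default.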

Besides, most dynamic sites don't return that header, so you'll have to 
compare digests anyway. You can then move to efficient fetching by using an 
adaptive fetch scheduler.
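
The idea behind digest comparison plus adaptive scheduling can be sketched 
roughly like this: hash the fetched content, compare it with the hash from 
the previous fetch, and lengthen the refetch interval for pages that did not 
change while shortening it for pages that did. This is not Nutch's actual 
code. DigestAdaptiveSketch, the choice of MD5, the rates and the interval 
bounds are all illustrative assumptions.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch, not Nutch's scheduler implementation.
final class DigestAdaptiveSketch {

    // Illustrative values only.
    static final double INC_RATE = 0.2;          // grow the interval when unchanged
    static final double DEC_RATE = 0.2;          // shrink the interval when changed
    static final long MIN_INTERVAL_SEC = 60L;
    static final long MAX_INTERVAL_SEC = 365L * 24 * 3600;

    /** MD5 digest of the page content (assumed digest algorithm). */
    static byte[] digest(byte[] content) {
        try {
            return MessageDigest.getInstance("MD5").digest(content);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Next refetch interval in seconds, clamped to the configured bounds. */
    static long nextInterval(long currentSec, byte[] oldDigest, byte[] newDigest) {
        // No previous digest means first fetch: treat the page as modified.
        boolean modified = oldDigest == null || !Arrays.equals(oldDigest, newDigest);
        double next = modified
                ? currentSec * (1.0 - DEC_RATE)
                : currentSec * (1.0 + INC_RATE);
        return Math.max(MIN_INTERVAL_SEC, Math.min(MAX_INTERVAL_SEC, Math.round(next)));
    }
}
```

With something like this, stable pages drift towards the maximum interval 
and get fetched rarely, which saves far more bandwidth than cutting off 
individual responses after the headers.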
 
 
-----Original message-----
> From:Otis Gospodnetic <[email protected]>
> Sent: Friday 22nd November 2013 19:36
> To: Nutch User List <[email protected]>
> Subject: Not reading page body if page not modified?
> 
> Hi,
> 
> Is Nutch 2.x capable of issuing a GET request, comparing the reported
> Last-Modified date with the last modified date from the previous fetch of a
> page and, if the page is deemed unmodified since the last fetch, avoid
> fetching the rest of the page?
> 
> .... and thus save bandwidth (and maybe speed up fetching)?
> 
> Thanks,
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
> 
