Hi,

This question is about politeness policies.
If I have understood correctly, Nutch adheres to politeness policies by ensuring a few things in its crawl logic:

- It supports robots.txt.
- It partitions domains so that all URLs from the same domain are necessarily fetched by the same map task.
- By default, it runs a single fetch thread for each such host.
- Again by default, it uses a 5-second delay between queries to the same domain.

The question is: if a website has not advertised any specific constraints on crawl politeness, how fast can we go? I know this ultimately depends on what the website will permit, but I was wondering whether there are any experiences users can share in terms of actual numbers. Also, are there ways (other than trial and error and getting blocked) to find out how much 'impoliteness' would be tolerated?

For instance, if we have a website with about 200,000 URLs, can we configure enough threads and short enough delays to, say, finish the crawl in about 10 hours (assuming the required bandwidth is available) without being perceived as impolite?

Thanks
Hemanth
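P.S. To make the numbers concrete, here is a quick back-of-the-envelope sketch of what a 10-hour crawl of 200,000 URLs from a single host would imply (assuming serial fetching; retries, redirects, and robots.txt fetches are ignored):

```python
# Back-of-the-envelope: average request spacing needed to fetch
# ~200,000 URLs from one host in 10 hours (figures from the question above).

urls = 200_000
hours = 10

seconds_available = hours * 3600              # 36,000 s in 10 hours
delay_per_fetch = seconds_available / urls    # average gap between requests
pace = urls / seconds_available               # fetches per second

print(f"required pace:  {pace:.2f} fetches/sec")
print(f"implied delay:  {delay_per_fetch:.2f} s between requests")

# For comparison, at Nutch's default 5-second delay the same crawl
# would take roughly urls * 5 seconds:
print(f"at default 5 s: {urls * 5 / 86400:.1f} days")
```

So hitting 10 hours means roughly 5-6 fetches per second against one host, i.e. an average gap of about 0.18 s, versus the default 5 s (which would stretch the crawl to well over a week). If I'm reading the configuration docs right, the relevant knobs are properties such as `fetcher.server.delay` and `fetcher.threads.per.queue` (`fetcher.threads.per.host` in older releases), but please correct me if those names are wrong for the version in question.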

