Ken,

>> Hi,
>>
>> This question is about politeness policies.
>>
>> If I have understood correctly, Nutch adheres to politeness policies
>> by ensuring a few things in its crawl logic:
>>
>> - It supports robots.txt
>> - It ensures domains are partitioned in a way that all URLs from the
>>   same domain are necessarily fetched from the same map task.
>> - By default, it runs a single thread for all such hosts
>> - Again by default, it uses a 5 second delay between queries to the
>>   same domain.
>>
>> The question is, if a website has not advertised any specific
>> constraints about crawl politeness, how fast can we go? I know this
>> really depends on what the website will permit eventually. But I was
>> wondering if there are any experiences that users can share in terms of
>> numbers? Also, are there ways (other than trial-and-error and getting
>> shut off) to find out how much 'impoliteness' would be tolerated?
>>
>> For instance, if we have a website with about 200,000 URLs, can we
>> configure enough threads and short delays to, say, finish the crawl in
>> about 10 hours (assuming the required b/w is available) without being
>> perceived as impolite?
>
> As soon as you hit a site with more than one thread, you're no longer
> polite.
>
> Also, in addition to the default crawl delay, there's also pages/day,
> which most bigger sites monitor.
>
> If you're not Google, then I'd suggest a max of 5K/day unless you've got
> some special understanding with the target domain.
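For context, the gap between the 10-hour goal and the polite defaults is easy to quantify. A minimal back-of-envelope sketch, assuming the defaults described above (5-second delay, one thread per host) and Ken's suggested 5K pages/day budget:

```python
# Back-of-envelope check of the numbers in the thread.
# Assumes the defaults described above: 5-second delay between requests
# to one host, single fetch thread per host.

URLS = 200_000
DEFAULT_DELAY_S = 5       # seconds between requests to one host
TARGET_HOURS = 10

# Rate needed to finish 200,000 URLs in 10 hours
needed_rate = URLS / (TARGET_HOURS * 3600)     # ~5.56 URLs/sec

# Rate the polite defaults allow against a single host
polite_rate = 1 / DEFAULT_DELAY_S              # 0.2 URLs/sec

# Crawl duration at the polite default rate
polite_days = URLS * DEFAULT_DELAY_S / 86_400  # ~11.6 days

# Crawl duration at the suggested 5K pages/day budget
budget_days = URLS / 5_000                     # 40 days

print(f"needed: {needed_rate:.2f} URLs/s vs polite: {polite_rate} URLs/s")
print(f"at defaults: {polite_days:.1f} days; at 5K/day: {budget_days:.0f} days")
```

So hitting the 10-hour target against a single host would require roughly 28x the default request rate, which is exactly why it risks being perceived as impolite.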
We're not Google :-), hence these numbers are useful pieces of information. Thanks.

However, we have stumbled on some commercial crawling services that claim to do 1 URL/host/sec. That works out to a much larger number than what you've given. Do you think this is typically managed by getting some agreement with the target? Or do you think they may be taking a risk?

Thanks
Hemanth
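For comparison, the claimed commercial rate can be put against the earlier 5K/day suggestion directly. A quick sketch, assuming the claimed 1 URL/host/sec is sustained around the clock:

```python
# Compare the commercial services' claimed rate against the 5K/day
# guideline from earlier in the thread. Assumes the 1 URL/host/sec
# rate is sustained for a full 24 hours.

claimed_rate = 1                          # URLs per host per second (claimed)
pages_per_day = claimed_rate * 86_400     # 86,400 pages/day per host

guideline = 5_000                         # suggested polite daily budget
ratio = pages_per_day / guideline         # ~17x the guideline

print(f"{pages_per_day} pages/day, {ratio:.1f}x the 5K/day guideline")
```

At a sustained 1 URL/sec, a single host would see over 17 times the suggested daily budget, which supports the view that such services either have agreements in place or are accepting the risk of being blocked.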

