Hi,

This question is about politeness policies.

If I have understood correctly, Nutch adheres to politeness policies
by ensuring a few things in its crawl logic:

- It honors robots.txt
- It partitions URLs by domain, so that all URLs from the same domain
are necessarily fetched by the same map task.
- By default, it runs a single fetch thread per host queue.
- Again by default, it waits 5 seconds between successive requests to
the same domain.
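For reference, here is a sketch of how I believe those defaults map onto properties in conf/nutch-site.xml (property names taken from the stock nutch-default.xml; the values shown are the defaults, so this is illustrative rather than something that needs to be set, and please correct me if the semantics have changed):

```
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds to wait between requests to the same host
  when one fetch thread is used per queue.</description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Fetch threads allowed per host queue. My understanding
  is that raising this above 1 switches the per-request spacing to
  fetcher.server.min.delay instead.</description>
</property>
```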

The question is: if a website has not advertised any specific
constraints on crawl politeness, how fast can we go? I know this
ultimately depends on what the website will tolerate, but are there
any experiences users can share in terms of numbers? Also, are there
ways (other than trial and error and getting blocked) to find out how
much 'impoliteness' would be tolerated?

For instance, if we have a website with about 200,000 URLs, can we
configure enough threads and short enough delays to finish the crawl
in, say, about 10 hours (assuming the required bandwidth is available)
without being perceived as impolite?
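To put numbers on that example (plain arithmetic, nothing Nutch-specific): 200,000 URLs in 10 hours works out to a sustained rate of about 5.6 fetches/second against a single site, whereas the default 5-second per-host delay caps a single host at 0.2 fetches/second:

```python
# Back-of-the-envelope numbers for the 200,000-URL / 10-hour example.
urls = 200_000
target_hours = 10

# Sustained fetch rate needed to finish in the target time:
required_rate = urls / (target_hours * 3600)  # fetches per second

# With the default 5-second per-host delay, a single host yields
# at most one fetch every 5 seconds, so the whole crawl would take:
default_delay = 5.0  # seconds between requests to the same host
single_host_days = urls * default_delay / 86_400

print(f"required rate: {required_rate:.1f} fetches/sec")
print(f"crawl time at the default 5s delay: {single_host_days:.1f} days")
```

So hitting the 10-hour target on one site means sustaining roughly 28x the default politeness rate, which is part of why I am asking what sites typically tolerate.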

Thanks
Hemanth
