Ken,

>
>> Hi,
>>
>> This question is about politeness policies.
>>
>> If I have understood correctly, Nutch adheres to politeness policies
>> by ensuring a few things in its crawl logic:
>>
>> - It supports robots.txt
>> - It partitions domains so that all URLs from the same domain are
>> necessarily fetched by the same map task.
>> - By default, it fetches each such host with a single thread.
>> - Again by default, it uses a 5-second delay between successive queries
>> to the same domain.
>>
>> The question is, if a website has not advertised any specific
>> constraints about crawl politeness, how fast can we go? I know this
>> really depends on what the website will ultimately permit, but I was
>> wondering if there are any experiences users can share in terms of
>> numbers. Also, are there ways (other than trial-and-error and getting
>> shut off) to find out how much 'impoliteness' would be tolerated?
>>
>> For instance, if we have a website with about 200,000 URLs, can we
>> configure enough threads and short delays to, say, finish the crawl in
>> about 10 hours (assuming the required b/w is available) without being
>> perceived as impolite ?
>
> As soon as you hit a site with more than one thread, you're no longer
> polite.
>
> Also, beyond the default crawl delay, there's a pages/day limit that
> most bigger sites monitor.
>
> If you're not Google, then I'd suggest a max of 5K/day unless you've got
> some special understanding with the target domain.
>

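The politeness scheme described in the quoted thread (per-host partitioning, one thread per host, a fixed delay between requests) can be sketched roughly as follows. This is a toy illustration of the idea, not Nutch's actual fetcher code:

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

# Mirrors Nutch's default 5-second delay between queries to the same host.
CRAWL_DELAY = 5.0  # seconds

def partition_by_host(urls):
    """Group URLs so each host gets exactly one fetch queue."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlparse(url).netloc].append(url)
    return queues

def polite_fetch_plan(urls, delay=CRAWL_DELAY):
    """Return (time_offset, url) pairs: a single logical thread per host,
    with `delay` seconds between consecutive fetches from the same host.
    Different hosts can be fetched in parallel without hurting any one
    of them, which is why the per-host partitioning matters."""
    plan = []
    for host, queue in partition_by_host(urls).items():
        for i, url in enumerate(queue):
            plan.append((i * delay, url))
    return sorted(plan)
```

With two URLs on one host and one on another, the second URL of the shared host is scheduled 5 seconds later, while the other host's URL can go immediately.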
We're not Google :-), so these numbers are useful pieces of
information. Thanks. But we have stumbled on some commercial crawling
services that claim to do 1 URL per host per second. That works out to
a much larger number than the one you've given. Do you think this is
typically managed by getting some agreement with the target site? Or
do you think they may be taking a risk?
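For scale, a quick back-of-envelope on the numbers in this thread: 1 URL per host per second sustained for a day, the suggested ~5K pages/day ceiling, and the rate needed to cover 200,000 URLs in 10 hours:

```python
# Back-of-envelope rates for the figures discussed in this thread.
SECONDS_PER_DAY = 24 * 60 * 60

# 1 URL per host per second, sustained for a full day:
per_day_at_1qps = 1 * SECONDS_PER_DAY       # 86,400 pages/day

# versus the suggested ceiling of ~5,000 pages/day:
ratio = per_day_at_1qps / 5_000             # roughly 17x over the suggestion

# 200,000 URLs in 10 hours requires roughly:
required_qps = 200_000 / (10 * 60 * 60)     # about 5.6 requests/second
```

So the commercial services' claimed rate is over an order of magnitude above the 5K/day suggestion, which is why some prior agreement with the target site seems plausible.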

Thanks
Hemanth
