On May 25, 2010, at 6:56pm, Hemanth Yamijala wrote:
Ken,
Hi,
This question is about politeness policies.
If I have understood correctly, Nutch adheres to politeness policies
by ensuring a few things in its crawl logic:
- It supports robots.txt
- It ensures domains are partitioned in a way that all URLs from the
same domain are necessarily fetched from the same map task.
- By default, it runs a single thread for all such hosts
- Again by default, it uses a 5 second delay between queries to the
same domain.
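The points in the list above can be sketched as a toy scheduler. This is only an illustration under stated assumptions, not Nutch's actual fetcher code: `group_by_host` and `schedule` are hypothetical helper names, URLs are grouped by hostname, fetch time is ignored, and the 5 second constant simply mirrors the default delay mentioned in the list.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumption: mirrors the 5 second default delay described in the list above.
CRAWL_DELAY_SECONDS = 5.0


def group_by_host(urls):
    """Bucket URLs by hostname so each host can be fetched serially."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[urlsplit(url).hostname].append(url)
    return dict(buckets)


def schedule(urls, delay=CRAWL_DELAY_SECONDS):
    """Return sorted (start_time_seconds, url) pairs, one request at a time per host.

    Hosts proceed independently of each other. Within a host, each fetch
    starts `delay` seconds after the previous one (fetch duration is ignored).
    """
    plan = []
    for host_urls in group_by_host(urls).values():
        for i, url in enumerate(host_urls):
            plan.append((i * delay, url))
    return sorted(plan)


if __name__ == "__main__":
    for start, url in schedule([
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
    ]):
        print(f"t={start:>4.1f}s  {url}")
```

Different hosts start at the same time, while the second URL on `a.example` waits out the delay, which is the single-thread-per-host behaviour described above.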
The question is, if a website has not advertised any specific
constraints about crawl politeness, how fast can we go? I know this
really depends on what the website will eventually permit. But I was
wondering whether users have any experiences they can share in terms of
numbers. Also, are there ways (other than trial and error and getting
shut off) to find out how much 'impoliteness' would be tolerated?
For instance, if we have a website with about 200,000 URLs, can we
configure enough threads and short enough delays to, say, finish the
crawl in about 10 hours (assuming the required bandwidth is available)
without being perceived as impolite?
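For scale, the arithmetic behind that target can be worked out directly. This is a back-of-the-envelope sketch using only the figures from the question (200,000 URLs, 10 hours, 5 second default delay); the variable names are illustrative and fetch time is ignored.

```python
URLS = 200_000               # site size from the question
WINDOW_HOURS = 10            # target crawl duration from the question
DEFAULT_DELAY_SECONDS = 5    # default per-host delay mentioned earlier

# Request rate needed to finish inside the window.
required_rate = URLS / (WINDOW_HOURS * 3600)   # URLs per second
required_gap = 1 / required_rate               # seconds between requests

# Duration with one thread and the default delay (fetch time ignored).
default_days = URLS * DEFAULT_DELAY_SECONDS / 86_400

print(f"required rate: {required_rate:.2f} URLs/s "
      f"(one request every {required_gap:.2f} s)")
print(f"one thread at a {DEFAULT_DELAY_SECONDS} s delay: {default_days:.1f} days")
```

So the 10-hour target needs roughly 5.56 requests per second against the one site, while the default settings would take about 11.6 days.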
As soon as you hit a site with more than one thread, you're no longer
polite.
In addition to the default crawl delay, there's also the pages/day
total, which most bigger sites monitor.
If you're not Google, then I'd suggest a max of 5K/day unless you've got
some special understanding with the target domain.
We're not Google :-), hence these numbers are useful pieces of
information. Thanks. But we have stumbled on some commercial crawling
services that claim to do 1 URL per host per second. That works out to a
much larger number than what you've given. Do you think this is typically
managed by getting some agreement with the target? Or do you think
they may be taking a risk?
A default crawl delay of 1 second is definitely pushing it. But note
that the total number of URLs/site/day is a separate value.
And if they're pulling more than 5K pages/day, then they definitely will get
blocked by sites that don't have the bandwidth/capacity of the top 1%.
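To put the two limits side by side, here is a rough calculation using only the numbers mentioned in this thread (1 URL per host per second, a 5K pages/day cap, a 200,000 URL site). The variable names are illustrative, and it assumes a crawler running around the clock.

```python
SECONDS_PER_DAY = 86_400
DAILY_CAP = 5_000        # suggested max pages/day per site
SITE_URLS = 200_000      # site size from the original question

# What 1 URL per host per second adds up to over a full day.
pages_per_day_at_1s = SECONDS_PER_DAY // 1

# Average gap between requests that stays within the daily cap.
avg_gap_for_cap = SECONDS_PER_DAY / DAILY_CAP

# Time to cover the whole site without exceeding the cap.
days_under_cap = SITE_URLS / DAILY_CAP

print(f"1 URL/s sustained: {pages_per_day_at_1s} pages/day")
print(f"gap needed for {DAILY_CAP} pages/day: {avg_gap_for_cap:.2f} s")
print(f"{SITE_URLS} URLs at {DAILY_CAP} pages/day: {days_under_cap:.0f} days")
```

A sustained 1 second delay works out to 86,400 pages/day, about 17 times the suggested cap, and staying under the cap means roughly 40 days for a 200,000 URL site.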
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g