On Jul 6, 2011, at 10:59 AM, Cam Bazz wrote:
> Hello,
>
> I am crawling multiple sites, in the range of hundreds, with 256
> concurrent threads and 4 connections per site at a time.
>
> It seems that if a site is having a bad day, all the threads slow
> down and that site clogs everything; basically, one slow site
> gradually ties up all of the threads.
>
> Is there a way around this? Or, if a site is returning excessive 404s
> or 500s, how can we avoid crawling it during that run?
>
There is a property you can set in nutch-site.xml to stop fetching URLs
from a given site once it has produced too many errors:
  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>-1</value>
    <description>The maximum number of protocol-level exceptions (e.g. timeouts)
    per host (or IP) queue. Once this value is reached, any remaining entries
    from this queue are purged, effectively stopping the fetching from this
    host/IP. The default value of -1 deactivates this limit.
    </description>
  </property>
Our crawler is configured to fetch only 500 URLs per site for each fetch task,
so I've set fetcher.max.exceptions.per.queue to 10% of that value, i.e. 50
exceptions per queue.
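
Concretely, the nutch-site.xml override would look something like the sketch
below. The generate.max.count / generate.count.mode pair is just one way to cap
a fetchlist at 500 URLs per host; your generate setup may differ:

  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>50</value>
    <description>Purge a host/IP queue after 50 protocol-level exceptions
    (roughly 10% of the 500 URLs allowed per site per fetch task).
    </description>
  </property>

  <!-- assumption: one way to cap each fetchlist at 500 URLs per host -->
  <property>
    <name>generate.max.count</name>
    <value>500</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>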
Blessings,
TwP