On Jul 6, 2011, at 10:59 AM, Cam Bazz wrote:
> Hello,
>
> I am crawling multiple sites, in the range of hundreds, with 256
> concurrent threads and 4 connections per site at a time.
>
> It seems that if a site is having a bad day, all the threads slow
> down and that site clogs everything; basically, one slow site
> gradually ties up all of the threads.
>
> Is there a way around this? Or, if a site is returning excessive 404s
> or 500s, how can we avoid crawling it during that run?
>
There is a property you can set in nutch-site.xml to stop fetching URLs
from a given site once it has produced too many errors:
  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>-1</value>
    <description>The maximum number of protocol-level exceptions (e.g. timeouts)
    per host (or IP) queue. Once this value is reached, any remaining entries
    from this queue are purged, effectively stopping the fetching from this
    host/IP. The default value of -1 deactivates this limit.
    </description>
  </property>
Our crawler is configured to fetch only 500 URLs per site for each fetch task,
so I've set fetcher.max.exceptions.per.queue to 10% of that value, i.e. 50
exceptions per queue.
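
Concretely, the nutch-site.xml override would look something like the sketch
below. The generate.max.count / generate.count.mode pair is just one way to cap
a fetchlist at 500 URLs per host; your generate setup may differ:

  <property>
    <name>fetcher.max.exceptions.per.queue</name>
    <value>50</value>
    <description>Purge a host/IP queue after 50 protocol-level exceptions
    (roughly 10% of the 500 URLs allowed per site per fetch task).
    </description>
  </property>

  <!-- assumption: one way to cap each fetchlist at 500 URLs per host -->
  <property>
    <name>generate.max.count</name>
    <value>500</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>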
Blessings,
TwP