Hi Blackice,

On top of what other folks have already contributed: over the years we
have enforced that Nutch be a good bot by default. For example, take
the following default configuration parameter:
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server. Note that this might get
  overridden by a Crawl-Delay from a robots.txt and is used ONLY if
  fetcher.threads.per.queue is set to 1.</description>
</property>

By default we wait a MINIMUM of 5 seconds between successive requests
to the same host. From personal experience, if you are being blocked
by IP or HOST then it is an issue outwith the Nutch framework.

I work alongside a number of other community members to ensure that
shared open source software is consumed within Nutch for exactly this
reason:
https://github.com/crawler-commons/crawler-commons/tree/master/src/main/java/crawlercommons/robots

Nutch uses the Crawler Commons Java library for robots.txt parsing. If
there is an issue or a bug in that area, then it is a bug further
downstream. There is a short usage sketch below the quoted message.

Thanks for reporting,
Lewis

On Fri, May 27, 2016 at 1:51 AM, <[email protected]> wrote:
>
> From: Markus Jelsma <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Date: Wed, 25 May 2016 10:25:23 +0000
> Subject: RE: Robots.txt
> Hi - that is a curious case indeed as Nutch adheres to robots.txt. Can
> they provide you with a reason for marking your Nutch as impolite?
> Markus
>
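For anyone who wants to see what that delegation looks like in
practice, here is a minimal sketch against the crawler-commons 0.x API
(current at the time of writing). The robots.txt content, agent name,
URLs, and the 5000 ms fallback are illustrative assumptions on my part,
not code lifted from Nutch's fetcher:

    import java.nio.charset.StandardCharsets;
    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheck {
        public static void main(String[] args) {
            // A toy robots.txt; in Nutch this would be the bytes
            // actually fetched from the target host.
            String robotsTxt = "User-agent: *\n"
                             + "Crawl-delay: 10\n"
                             + "Disallow: /private/\n";

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                "http://example.com/robots.txt",            // source URL
                robotsTxt.getBytes(StandardCharsets.UTF_8), // raw content
                "text/plain",                               // content type
                "mynutchbot");                              // agent name to match

            // Disallowed paths are rejected before any fetch happens.
            System.out.println(
                rules.isAllowed("http://example.com/private/x")); // false

            // crawler-commons reports Crawl-delay in milliseconds.
            long delay = rules.getCrawlDelay(); // 10000 here
            // Mirroring the behaviour described above: honour Crawl-Delay
            // when present, otherwise fall back to fetcher.server.delay
            // (5.0 seconds by default).
            long effectiveMs = (delay != BaseRobotRules.UNSET_CRAWL_DELAY)
                             ? delay : 5000L;
            System.out.println("Effective delay: " + effectiveMs + " ms");
        }
    }

If you want to verify what will be honoured for your agent string, a
standalone test like this against a live robots.txt is a quick way to
rule the parser in or out before pointing the finger at Nutch itself.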

