Hi Blackice,

On top of what other folks have already contributed: over the years we
have enforced that Nutch be a good bot by default. For example, take
the following default configuration parameter:
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server. Note that this might get
  overridden by a Crawl-Delay from a robots.txt and is used ONLY if
  fetcher.threads.per.queue is set to 1.</description>
</property>

By default we wait a MINIMUM of 5 seconds between successive requests
to the same host. From personal experience, if you are being blocked
by IP or HOST then it is an issue outwith the Nutch framework.

I work alongside a number of other community members to ensure that
shared open source software is consumed within Nutch for exactly this
reason:
https://github.com/crawler-commons/crawler-commons/tree/master/src/main/java/crawlercommons/robots

Nutch uses the Crawler Commons Java library for robots.txt parsing. If
there is an issue or a bug in that area, then it is a bug further
downstream. There is a short usage sketch below the quoted message.

Thanks for reporting,
Lewis

On Fri, May 27, 2016 at 1:51 AM, <[email protected]> wrote:
>
> From: Markus Jelsma <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Date: Wed, 25 May 2016 10:25:23 +0000
> Subject: RE: Robots.txt
> Hi - that is a curious case indeed as Nutch adheres to robots.txt. Can
> they provide you with a reason for marking your Nutch as impolite?
> Markus
>
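For anyone who wants to see what that delegation looks like in
practice, here is a minimal sketch against the crawler-commons 0.x API
(current at the time of writing). The robots.txt content, agent name,
URLs, and the 5000 ms fallback are illustrative assumptions on my part,
not code lifted from Nutch's fetcher:

    import java.nio.charset.StandardCharsets;
    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRulesParser;

    public class RobotsCheck {
        public static void main(String[] args) {
            // A toy robots.txt; in Nutch this would be the bytes
            // actually fetched from the target host.
            String robotsTxt = "User-agent: *\n"
                             + "Crawl-delay: 10\n"
                             + "Disallow: /private/\n";

            SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
            BaseRobotRules rules = parser.parseContent(
                "http://example.com/robots.txt",            // source URL
                robotsTxt.getBytes(StandardCharsets.UTF_8), // raw content
                "text/plain",                               // content type
                "mynutchbot");                              // agent name to match

            // Disallowed paths are rejected before any fetch happens.
            System.out.println(
                rules.isAllowed("http://example.com/private/x")); // false

            // crawler-commons reports Crawl-delay in milliseconds.
            long delay = rules.getCrawlDelay(); // 10000 here
            // Mirroring the behaviour described above: honour Crawl-Delay
            // when present, otherwise fall back to fetcher.server.delay
            // (5.0 seconds by default).
            long effectiveMs = (delay != BaseRobotRules.UNSET_CRAWL_DELAY)
                             ? delay : 5000L;
            System.out.println("Effective delay: " + effectiveMs + " ms");
        }
    }

If you want to verify what will be honoured for your agent string, a
standalone test like this against a live robots.txt is a quick way to
rule the parser in or out before pointing the finger at Nutch itself.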

