Hi Charith,

I believe you should obey Crawl-delay if it is set in robots.txt, in order
to be polite. You don't want to fetch faster than the rate the webmaster
has specified.

So if a website (e.g. https://www.aoncadis.org/robots.txt) specifies a
10-second crawl delay, your configured rate of 1 request per second will be
considered impolite and you might get banned.
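For reference, here is a rough sketch of the politeness-related properties
in conf/nutch-site.xml (the property names are from Nutch's configuration;
the values are purely illustrative, not recommendations):

```xml
<!-- Sketch of crawl-politeness settings in conf/nutch-site.xml.
     Values are illustrative only. -->
<configuration>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
    <description>Fetcher threads allowed per host/queue.</description>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
    <description>Delay (seconds) between requests to the same host,
    used when robots.txt specifies no Crawl-delay.</description>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>30</value>
    <description>If robots.txt asks for a Crawl-delay longer than this
    many seconds, pages from that host are skipped rather than
    fetched at that rate.</description>
  </property>
</configuration>
```

When robots.txt does declare a Crawl-delay, Nutch honors it (up to
fetcher.max.crawl.delay), which is why a 1-second configured delay does not
override a 10-second Crawl-delay.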


Best regards,
Mohammad Al-Mohsin

On Fri, Feb 20, 2015 at 1:03 PM, Charith Wickramarachchi <
[email protected]> wrote:

> Hi,
>
> I am new to Nutch and trying to figure out the best configuration for
> crawling. In my Nutch configuration I have set fetcher.threads.per.queue
> to 2 and fetcher.server.min.delay to 2 seconds.
>
> So, as per my understanding of the documentation, in this case Nutch won't
> make more than 2 requests every 2 seconds, so the request rate averages
> out to 1 request per second.
>
> Is this correct? Even though this ignores robots.txt, I assume 1 req/sec
> is a polite request rate for many servers.
>
> It will be great if you could give me a clarification.
>
>
> Thanks,
> Charith
>
>
> --
> Charith Dhanushka Wickramaarachchi
>
> Tel  +1 213 447 4253
> Blog  http://charith.wickramaarachchi.org/
> <http://charithwiki.blogspot.com/>
> Twitter  @charithwiki <https://twitter.com/charithwiki>
>
>
