Hi Charith,

I believe you should obey the Crawl-delay directive if it is present in robots.txt, in order to be polite: you don't want to fetch faster than the rate the webmaster has specified.
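For reference, the directive in question looks like this in a robots.txt file (it is a non-standard but widely honored directive; the value below is illustrative):

    User-agent: *
    Crawl-delay: 10

A crawler matching the User-agent line is expected to wait at least that many seconds between successive requests to the host.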
So if a website (e.g. https://www.aoncadis.org/robots.txt) declares a 10-second crawl-delay, your configuration, which averages about 1 request per second, will be considered impolite and you might get banned.

Best regards,
Mohammad Al-Mohsin

On Fri, Feb 20, 2015 at 1:03 PM, Charith Wickramarachchi <[email protected]> wrote:

> Hi,
>
> I am new to Nutch and trying to figure out the best configuration for
> crawling. In my Nutch configuration, I have set fetcher.threads.per.queue
> to 2 and fetcher.server.min.delay to 2 seconds.
>
> As per my understanding of the documentation, in this case Nutch won't
> make more than 2 requests in each 2-second window, so the request rate
> will average out to 1 request per second.
>
> Is this correct? Even though in this case it ignores robots.txt, I assume
> 1 req/sec is a polite request rate for many servers.
>
> It would be great if you could give me a clarification.
>
> Thanks,
> Charith
>
> --
> Charith Dhanushka Wickramaarachchi
>
> Tel +1 213 447 4253
> Blog http://charith.wickramaarachchi.org/
> Twitter @charithwiki
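P.S. For readers finding this thread later: the two settings Charith mentions go in conf/nutch-site.xml and would look roughly like this. This is a sketch using the values from his mail, with comments paraphrased from the descriptions in nutch-default.xml:

    <property>
      <name>fetcher.threads.per.queue</name>
      <value>2</value>
      <!-- maximum number of fetcher threads allowed to access one
           host queue at a time (the default is 1) -->
    </property>

    <property>
      <name>fetcher.server.min.delay</name>
      <value>2.0</value>
      <!-- minimum delay in seconds between successive requests to the
           same server; applies only when fetcher.threads.per.queue
           is greater than 1 -->
    </property>

Note that these settings only control Nutch's own pacing; whether a robots.txt Crawl-delay overrides them is exactly the question at hand, so don't rely on this sketch alone for politeness.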