Side note: fetcher.max.crawl.delay might help, to ignore pages with an unusually high crawl delay.
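
To make the side note concrete, here is a rough Python sketch of the skip-or-wait rule as I understand it. This is not Nutch's actual code. The function names are made up for illustration, and the 5-second fallback is an assumption standing in for whatever delay you have configured when robots.txt gives no Crawl-delay.

```python
from typing import Optional


def effective_delay(robots_delay: Optional[float],
                    fallback_delay: float = 5.0,
                    max_crawl_delay: float = 30.0) -> Optional[float]:
    """Return the wait between requests in seconds, or None if the page is skipped.

    robots_delay    -- Crawl-delay from robots.txt, or None if the site sets none
    fallback_delay  -- assumed delay used when robots.txt has no Crawl-delay
    max_crawl_delay -- models fetcher.max.crawl.delay (30 is the default quoted above)
    """
    if robots_delay is None:
        return fallback_delay
    if robots_delay > max_crawl_delay:
        # Crawl-delay above the limit: the page is ignored, not fetched.
        return None
    return robots_delay


def estimated_fetch_seconds(pages: int, delay: float) -> float:
    """Lower bound on the time to fetch `pages` pages from one host, one at a time."""
    return pages * delay


if __name__ == "__main__":
    two_days = 2 * 24 * 60 * 60  # a Crawl-delay of 2 days, in seconds
    print(effective_delay(two_days, max_crawl_delay=10.0))  # skipped
    print(effective_delay(10.0, max_crawl_delay=10.0))      # obeyed
    print(estimated_fetch_seconds(1000, 10.0) / 3600)       # hours for 1000 pages
```

With a limit of 10 seconds, a host that asks for exactly 10 seconds is still fetched, but 1000 pages from it take at least 10000 seconds, which is a little under 3 hours.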
https://wiki.apache.org/nutch/OptimizingCrawls

6) Crawl-delay can be used in robots.txt and is obeyed by Nutch. Most sites don't use this setting, but a few do (some of them maliciously). I have seen crawl-delays as high as 2 days, expressed in seconds. The fetcher.max.crawl.delay variable will ignore pages with crawl delays > x. I usually set this to 10 seconds; the default is 30. Even at 10 seconds, if you have a lot of pages from a site from which you can only crawl 1 page every 10 seconds, it is going to be slow. On the flip side, setting this to a low value will ignore and not fetch those pages.

On Sat, Feb 21, 2015 at 12:22 AM, Charith Wickramarachchi <[email protected]> wrote:

> Great! Thanks a lot!
>
> On Fri, Feb 20, 2015 at 1:41 PM, Mohammad Al-Mohsin <[email protected]> wrote:
>
> > Hi Charith,
> >
> > I believe you should obey crawl-delay if it exists in robots.txt in order
> > to be polite. You don't want to go faster than what's specified by a
> > webmaster.
> >
> > So if a website (e.g. https://www.aoncadis.org/robots.txt) has a 10-second
> > crawl-delay, your configuration of 1 second will be
> > considered impolite and you might get banned.
> >
> > Best regards,
> > Mohammad Al-Mohsin
> >
> > On Fri, Feb 20, 2015 at 1:03 PM, Charith Wickramarachchi <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > I am new to Nutch and trying to figure out the best configuration for
> > > crawling. In my Nutch configuration, I have configured
> > > fetcher.threads.per.queue to be 2 and fetcher.server.min.delay to be 2
> > > seconds.
> > >
> > > So as per my understanding from the documentation, in this case Nutch
> > > won't do more than 2 requests per each 2 seconds. So the requests per
> > > second will be averaged to 1 request per second.
> > >
> > > Is this correct? Even though in this case it ignores robots.txt, I
> > > assume 1 req/sec is a polite request rate for many servers.
> > >
> > > It will be great if you could give me a clarification.
> > >
> > > Thanks,
> > > Charith
> > >
> > > --
> > > Charith Dhanushka Wickramaarachchi
> > >
> > > Tel +1 213 447 4253
> > > Blog http://charith.wickramaarachchi.org/
> > > <http://charithwiki.blogspot.com/>
> > > Twitter @charithwiki <https://twitter.com/charithwiki>
> > >
> > > This communication may contain privileged or other confidential
> > > information and is intended exclusively for the addressee/s. If you are
> > > not the intended recipient/s, or believe that you may have received this
> > > communication in error, please reply to the sender indicating that fact
> > > and delete the copy you received and in addition, you should not print,
> > > copy, retransmit, disseminate, or otherwise use the information
> > > contained in this communication. Internet communications cannot be
> > > guaranteed to be timely, secure, error or virus-free. The sender does
> > > not accept liability for any errors or omissions

