Can anyone point me to some good information on how to optimize crawling speed while maintaining politeness? My current situation is that Nutch is running reliably for me on a single Hadoop node. Before bringing up additional nodes, I want to make it go reasonably fast on this one node. At the moment it is fetching fewer than 1 URL per second. It seems like it should be able to do much more than this, yet it is using very little internet bandwidth and CPU time.
I originally seeded it with 6 URLs, each on a different domain, and I generate with topN 1000 in each round. I have set generate.max.count to 100 and fetcher.server.delay to 1.0, and I do not explicitly set the number of fetcher threads. After 10 rounds, which took about 12 hours of elapsed time, I get the following statistics:

16/11/04 08:17:47 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /orgs/data/crawldb
16/11/04 08:17:47 INFO crawl.CrawlDbReader: TOTAL urls: 56976
16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 0: 56949
16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 1: 27
16/11/04 08:17:47 INFO crawl.CrawlDbReader: min score: 0.0
16/11/04 08:17:47 INFO crawl.CrawlDbReader: avg score: 1.2285875E-4
16/11/04 08:17:47 INFO crawl.CrawlDbReader: max score: 1.0
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 47486
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 2 (db_fetched): 6697
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 3 (db_gone): 2424
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 38
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 202
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 129
16/11/04 08:17:47 INFO crawl.CrawlDbReader: CrawlDb statistics: done
Fri Nov 4 08:17:48 PDT 2016 : Finished loop with 10 iterations

I use the standard crawl script, with only sizeFetchlist changed. It issues the following generate command:

/home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true /orgs/data/crawldb /orgs/data/segments -topN 1000 -numFetchers 1 -noFilter -adddays 30

and the following fetch command:

/home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch fetch -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 /orgs/data/segments/20161104110458 -noParsing -threads 50
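In case it matters, the two settings mentioned above are the only overrides I have made in conf/nutch-site.xml; everything else is, as far as I can tell, still at the 1.12 defaults (in particular I have not touched fetcher.threads.fetch, fetcher.threads.per.queue, or fetcher.queue.mode):

<!-- my only overrides in conf/nutch-site.xml -->
<property>
  <name>generate.max.count</name>
  <!-- cap on the number of URLs per host in each fetch list -->
  <value>100</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <!-- politeness delay, in seconds, between requests to the same host -->
  <value>1.0</value>
</property>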
Any suggestions would be greatly appreciated. By the way, thanks for all the help so far!
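P.S. To put a number on "fewer than 1 URL per second": 6,697 pages fetched over roughly 12 hours of elapsed time works out to 6697 / 43200 ≈ 0.15 pages per second. Even if the crawl were still confined to the 6 seed domains, a 1.0-second fetcher.server.delay per host should allow something closer to 6 pages per second, so unless my math is off I am well below even the politeness-limited ceiling.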
