Can anyone point me to some good information on how to optimize crawling speed 
while maintaining politeness?
My current situation is that Nutch is running reliably for me on a single
Hadoop node. Before bringing up additional nodes, I want to make it go
reasonably fast on this one. At the moment it is only fetching fewer than
one URL per second. It seems like it should be able to do much more than
that, but it is using very little network bandwidth and CPU time.

I originally seeded it with 6 URLs, each on a different domain, and I generate
with topN 1000 in each round. I have set generate.max.count to 100 and
fetcher.server.delay to 1.0; I do not explicitly set a number of fetcher threads.
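
For reference, here is how those two settings look as nutch-site.xml overrides
(a sketch of just the properties mentioned above):

<property>
  <!-- limit on URLs per host in a single fetchlist -->
  <name>generate.max.count</name>
  <value>100</value>
</property>
<property>
  <!-- seconds the fetcher waits between successive requests to the same server -->
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>
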
After 10 rounds, I get the following statistics. This took about 12 hours of 
elapsed time.
16/11/04 08:17:47 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /orgs/data/crawldb
16/11/04 08:17:47 INFO crawl.CrawlDbReader: TOTAL urls: 56976
16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 0:    56949
16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 1:    27
16/11/04 08:17:47 INFO crawl.CrawlDbReader: min score:  0.0
16/11/04 08:17:47 INFO crawl.CrawlDbReader: avg score:  1.2285875E-4
16/11/04 08:17:47 INFO crawl.CrawlDbReader: max score:  1.0
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 1 (db_unfetched):   47486
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 2 (db_fetched):     6697
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 3 (db_gone):        2424
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):  38
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):  202
16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 7 (db_duplicate):   129
16/11/04 08:17:47 INFO crawl.CrawlDbReader: CrawlDb statistics: done
Fri Nov 4 08:17:48 PDT 2016 : Finished loop with 10 iterations
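
To put a number on the fetch rate: 6697 pages fetched over roughly 12 hours
works out to 6697 / 43,200 seconds ≈ 0.15 pages per second, consistent with
the under-one-URL-per-second rate mentioned above.
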
I use the standard crawl script, with only sizeFetchlist changed. It issues
the following generate command:

/home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch generate \
  -D mapreduce.job.reduces=2 \
  -D mapred.child.java.opts=-Xmx1000m \
  -D mapreduce.reduce.speculative=false \
  -D mapreduce.map.speculative=false \
  -D mapreduce.map.output.compress=true \
  /orgs/data/crawldb /orgs/data/segments \
  -topN 1000 -numFetchers 1 -noFilter -adddays 30
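
One thing I notice on the generate side (please correct me if my model is
wrong): generate.max.count = 100 with the default per-host counting caps each
host at 100 URLs per fetchlist, so with only a handful of hosts discovered so
far, a round can contain far fewer than the requested topN 1000.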

It issues the following fetch command:

/home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch fetch \
  -D mapreduce.job.reduces=2 \
  -D mapred.child.java.opts=-Xmx1000m \
  -D mapreduce.reduce.speculative=false \
  -D mapreduce.map.speculative=false \
  -D mapreduce.map.output.compress=true \
  -D fetcher.timelimit.mins=180 \
  /orgs/data/segments/20161104110458 -noParsing -threads 50
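
My back-of-the-envelope politeness math, in case my mental model is off: with
fetcher.server.delay at 1.0 and the default of one fetch at a time per host
queue, each distinct host can yield at most about one page per second, so the
50 fetcher threads only help once the fetchlist spans many hosts. With
fetchlists drawn largely from my 6 seed domains, that would explain the low
bandwidth and CPU use. Is that the right way to think about it?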

Any suggestions would be greatly appreciated. By the way, thanks for all the 
help so far!
