Hi All,

We have a few domains, and we would like to crawl all pages (deep crawling) from those domains, excluding external links.
We started with a domain that has 400 URLs and crawled it using Nutch. Here is the time taken between the two modes for the smaller domain:

local mode = 5 minutes
distributed mode (a cluster of 3 nodes) = 2 hours

We tried the same with a domain that has > 100K URLs, and local mode still seems to be faster. Time taken for the bigger domain:

local mode crawled 28K URLs in 4 hours
distributed mode crawled only 12K URLs in 11 hours

When I looked at the information printed to the console, I saw that in distributed mode a MapReduce job is launched for every step of each iteration. It looks to me like the overhead of these MapReduce jobs is what slows things down for such a modest number of URLs.

Here is some of the configuration:

db.ignore.external.links=true
fetcher.server.delay=0.1
fetcher.queue.mode=byHost

smaller domain:
fetcher.threads.fetch=100
fetcher.threads.per.queue=100

bigger domain (as we wanted to see whether the number of threads makes a difference):
fetcher.threads.fetch=400
fetcher.threads.per.queue=200

The performance looks surprisingly slow. Are we missing something? Any suggestions would be really appreciated.

Thanks,
Srini
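In case it helps anyone reproduce this, here is how the shared settings above look in our nutch-site.xml (the property names and values are exactly the ones listed; the layout of the file is just the standard Hadoop-style configuration format):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Stay within our own domains: drop outlinks to external hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
  <!-- Delay (in seconds) between successive requests to the same server -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.1</value>
  </property>
  <!-- Partition the fetch queues by host -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>
  <!-- Values used for the bigger domain; the smaller domain used 100/100 -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>400</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>200</value>
  </property>
</configuration>
```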

