Hi everyone, I am using Nutch 2.0 to crawl three large sites, each containing millions of pages. I have set "fetcher.queue.mode" to "byHost" and "partition.url.mode" to "byHost". My Hadoop/HBase cluster has one master node and seven slave nodes, but only one reduce task, running on a single node, is fetching pages from all three sites. Fetching all three sites through one task on one node is too slow. How can I speed up the job? What should I configure so that each site is crawled by a separate task on a different node?
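
To show what I understand host-based partitioning to mean, here is a minimal Python sketch. It is not Nutch code: the function `partition_by_host`, the CRC32 hash, and the `site-*.example` URLs are all made up for illustration, and Nutch's real partitioner may work differently. The point is only that when the reducer is chosen from the host name alone, three hosts can occupy at most three reducers, however many nodes the cluster has.

```python
from collections import defaultdict
from urllib.parse import urlparse
from zlib import crc32


def partition_by_host(urls, num_reducers):
    """Assign each URL to a reducer index based only on its host name."""
    partitions = defaultdict(list)
    for url in urls:
        host = urlparse(url).hostname or ""
        # Deterministic hash: every URL of one host maps to the same reducer.
        index = crc32(host.encode("utf-8")) % num_reducers
        partitions[index].append(url)
    return dict(partitions)


# Hypothetical URLs standing in for the three large sites.
urls = [
    "http://site-a.example/page1",
    "http://site-a.example/page2",
    "http://site-b.example/page1",
    "http://site-b.example/page2",
    "http://site-c.example/page1",
    "http://site-c.example/page2",
]

# Seven reducers, one per slave node in the cluster described above.
partitions = partition_by_host(urls, num_reducers=7)
print(f"{len(partitions)} of 7 reducers received URLs")
for index, bucket in sorted(partitions.items()):
    print(index, bucket)
```

If that picture is right, three tasks would be the upper bound for my crawl under "byHost", so the open question is why I get one task rather than three.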
-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-speed-up-crawling-of-some-certains-sites-with-millions-of-pages-tp4068519.html Sent from the Nutch - User mailing list archive at Nabble.com.

