How to speed up crawling of certain sites with millions of pages

weishenyun Thu, 06 Jun 2013 06:03:12 -0700

Hi everyone, I have used Nutch 2.0 to crawl three big sites, each of which
contains millions of pages. I set "fetcher.queue.mode"="byHost" and
"partition.url.mode"="byHost". My Hadoop/HBase cluster has one master node
and seven slave nodes, but only one reduce task, running on a single node,
is fetching pages from all three sites. Fetching three sites through one
task on one node is too slow. How can I speed up the job? What should I
configure so that each site is crawled by a separate task on a different
node?
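
For reference, the two settings mentioned above would look roughly like
this as a configuration fragment. Only the property names and values come
from my setup; placing them in conf/nutch-site.xml and the surrounding
layout are assumptions based on the usual Nutch configuration format.

```xml
<!-- Sketch only: assumed to live in conf/nutch-site.xml -->
<configuration>
  <!-- Group fetch queues by host name -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>
  <!-- Partition URLs across tasks by host name -->
  <property>
    <name>partition.url.mode</name>
    <value>byHost</value>
  </property>
</configuration>
```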




--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-speed-up-crawling-of-some-certains-sites-with-millions-of-pages-tp4068519.html
Sent from the Nutch - User mailing list archive at Nabble.com.
