Hi,

I'm trying to crawl millions of pages from a single site with Nutch 2.0. Since
Nutch partitions URLs by host, all pages from the same domain/host end up in a
single reducer task during fetching, so I tried to launch multiple Nutch jobs
on the same Hadoop cluster to speed up the crawl. But the different jobs
generated the same fetchlist. How can I configure the crawl parameters to
achieve my goal? For example, suppose 10 million pages from the same site are
already stored in the webpage table, and I want to launch two jobs to fetch
them in parallel. How can I configure things so that the first job fetches the
first 5 million pages and the second job fetches the other 5 million?
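
To make the question concrete, this is roughly the sequence I had in mind
(only a sketch: I'm assuming that generate marks the rows it selects with a
batch id so that a second generate run skips them, and the -batchId and
-threads options are from my reading of the 2.x scripts, so the exact flags
may be off):

    # first generate: select 5 million pages and mark them as batch "b1"
    bin/nutch generate -topN 5000000 -batchId b1

    # second generate: rows already marked in b1 should be skipped,
    # so this should select the next 5 million pages as batch "b2"
    bin/nutch generate -topN 5000000 -batchId b2

    # fetch the two batches as two jobs running in parallel
    bin/nutch fetch b1 -threads 50 &
    bin/nutch fetch b2 -threads 50 &
    wait

Is this the right way to split the rows in the table, or is there a cleaner
configuration for it, e.g. via generate.max.count and generate.count.mode?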


