Hi, I am trying to crawl millions of pages from a single site with Nutch 2.0. Since Nutch uses only one reducer task to fetch all the pages from the same domain/host, I tried launching multiple Nutch jobs on the same Hadoop cluster to speed up the crawl, but the different jobs generated the same fetchlist. How can I configure the crawl parameters to achieve my goal?

For example, say there are 10 million pages from the same site, already stored in the table, and I want to launch two jobs to fetch them in parallel. How can I configure things so that the first job fetches the first 5 million pages and the second job fetches the other 5 million?
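Here is a minimal sketch of what I have in mind, assuming the Nutch 2.x generate/fetch CLI; the batch ids, -topN, and -threads values are placeholders I picked, and whether two FetcherJobs can safely run against the same table in parallel is exactly what I am unsure about:

    # Generate two disjoint fetchlists; each generate run marks the rows
    # it claims with its batch id, so the second run should skip them.
    bin/nutch generate -topN 5000000 -batchId batch-1
    bin/nutch generate -topN 5000000 -batchId batch-2

    # Fetch the two batches as two separate Hadoop jobs in parallel.
    bin/nutch fetch batch-1 -threads 50 &
    bin/nutch fetch batch-2 -threads 50 &
    wait

Is generating two batches like this the right way to partition the pages, or is there a generate/partition property (e.g. generate.max.count) I should be setting instead?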

