Hi,

Try running more than one reducer by adding the numTasks option to the fetch
command.
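
For example (a sketch, assuming Nutch 2.x where the FetcherJob accepts a
-numTasks option to set the number of reduce tasks; batch id and crawl id
below are placeholders):

    bin/nutch fetch -batchId 1372636828-123 -crawlId webcrawl -numTasks 4

With more reduce tasks, pages from the same host are still fetched by a
single reducer per queue, but the fetch load across hosts can be spread
over several reducers instead of one.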

hth,
Alex. 


-----Original Message-----
From: weishenyun <[email protected]>
To: user <[email protected]>
Sent: Mon, Jul 1, 2013 7:44 pm
Subject: Running multiple nutch jobs to fetch a same site with millions of pages


Hi,

I tried to crawl millions of pages from a single site with Nutch 2.0. Since
Nutch will only use one reducer task to fetch all the pages from the same
domain/host, I tried to launch multiple Nutch jobs on the same Hadoop
cluster to accelerate the crawl speed. But it seems that the different jobs
generated the same fetchlist. How can I configure the crawl parameters to
achieve my goal? For example, say there are 10 million pages from the same
site already stored in the table, and I want to launch two jobs to fetch
them in parallel. How can I configure things so that the first job fetches
the first 5 million pages and the second job fetches the other 5 million?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Running-multiple-nutch-jobs-to-fetch-a-same-site-with-millions-of-pages-tp4074523.html
Sent from the Nutch - User mailing list archive at Nabble.com.
