Most of the time, reducing the number of URLs per cycle solves the problem. You
can also limit the fetcher's run time; check the fetcher.* settings.
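For example (a sketch using Nutch's standard generator and fetcher property names; the values are illustrative and should be tuned to your crawl), you can cap how many URLs are generated per host in each cycle and put a hard time limit on each fetch round in conf/nutch-site.xml:

```xml
<configuration>
  <!-- Count generated URLs per host rather than per domain. -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>
  <!-- At most 1000 URLs per host in each fetch cycle, so one deep
       site cannot dominate a segment. Illustrative value. -->
  <property>
    <name>generate.max.count</name>
    <value>1000</value>
  </property>
  <!-- Abort a fetch round after 30 minutes; remaining URLs are
       picked up in later cycles. Illustrative value. -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>30</value>
  </property>
</configuration>
```

With these two limits together, the slow sites no longer hold a whole segment hostage, and the deep fast site is drained gradually over several shorter cycles.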
 
-----Original message-----
> From:Alberto Ramos <[email protected]>
> Sent: Tuesday 18th February 2014 15:38
> To: [email protected]
> Subject: Crawling slow and fast sites in parallel
> 
> Hi,
> I use Nutch 2 on Hadoop in order to crawl a few sites.
> One of them is deep and fast, and the others are shallow and slow.
> In the first fetches, the fast site finishes after about 2 minutes and
> then waits for the slow sites, which finish after about 40 minutes. After
> Nutch is done crawling the slow sites, the fast site is still being fetched
> (because it is deeper). I don't want to use fetcher.max.crawl.delay, since I
> do want to crawl both sites. My temporary solution is to run a separate
> Nutch process for each site, which is obviously very ugly and doesn't take
> advantage of the Hadoop architecture.
> Any suggestions for performance improvement?
> 
