In addition, you could use generate.max.count to cap the number of URLs generated per host per cycle at a fixed maximum. That may help to keep the fetch load balanced across hosts / sites; a rough sketch is below.
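Untested, but something along these lines in nutch-site.xml should do it (the property names are the standard ones, the values are only placeholders you would tune per site; fetcher.timelimit.mins is the run-time limit Markus mentioned):

  <property>
    <name>generate.count.mode</name>
    <value>host</value>
    <description>Count generate.max.count per host ("domain" is the other option).</description>
  </property>
  <property>
    <name>generate.max.count</name>
    <value>1000</value>
    <description>Put at most 1000 URLs per host into a single fetch list (-1 = unlimited).</description>
  </property>
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>30</value>
    <description>Stop the fetch job after 30 minutes so one slow site cannot hold up the whole cycle.</description>
  </property>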
On 02/18/2014 04:01 PM, Markus Jelsma wrote:
> Most of the time reducing the number of URLs per cycle solves the problem. You
> can also limit the fetcher's run time, check the fetcher.* settings.
>
> -----Original message-----
>> From: Alberto Ramos <[email protected]>
>> Sent: Tuesday 18th February 2014 15:38
>> To: [email protected]
>> Subject: Crawling on slow and fast sites parallely
>>
>> Hi,
>> I use Nutch 2 on Hadoop in order to crawl a few sites.
>> One of them is deep and fast and the others are shallow and slow.
>> In the first fetches the fast site finishes after about 2 minutes and then
>> waits for the slow sites, which finish after about 40 minutes. After Nutch
>> is done crawling the slow sites, the fast site is still being fetched
>> (because it is deeper). I don't want to use fetcher.max.crawl.delay since I
>> do want to crawl both sites. My temporary solution is to run a separate
>> Nutch process for each site, which is obviously very ugly and doesn't take
>> advantage of the Hadoop architecture.
>> Any suggestions for performance improvement?

