Crawling slow and fast sites in parallel

Alberto Ramos Tue, 18 Feb 2014 06:38:41 -0800

Hi,
I use Nutch 2 on Hadoop to crawl a few sites.
One of them is deep and fast, and the others are shallow and slow.
In the first fetch rounds, the fast site finishes after about 2 minutes and
then waits for the slow sites, which finish after about 40 minutes. Once Nutch
is done crawling the slow sites, the fast site is still being fetched
(because it is deeper). I don't want to use fetcher.max.crawl.delay, since I
do want to crawl both kinds of sites. My temporary solution is to run a separate
Nutch process for each site, which is obviously very ugly and doesn't take
advantage of the Hadoop architecture.
Any suggestions for performance improvement?
