In addition, you could use generate.max.count
to limit the number of URLs per host and cycle
to a fixed maximum size. That may help to keep
the balance between hosts / sites.
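For example, something along these lines in nutch-site.xml could be a
starting point (the values 100 and 30 below are only placeholders to
tune for your own crawl):

  <!-- Count generated URLs per host (can also be set to "domain"). -->
  <property>
    <name>generate.count.mode</name>
    <value>host</value>
  </property>

  <!-- Cap the number of URLs selected per host in each generate cycle. -->
  <property>
    <name>generate.max.count</name>
    <value>100</value>
  </property>

  <!-- One of the fetcher.* settings Markus mentioned: stop a fetch
       round after this many minutes, regardless of remaining URLs. -->
  <property>
    <name>fetcher.timelimit.mins</name>
    <value>30</value>
  </property>

That way the deep, fast site can't monopolize a whole cycle, and the
slow sites can't keep the fetcher running for 40 minutes either.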

On 02/18/2014 04:01 PM, Markus Jelsma wrote:
> Most of the time reducing the number of URLs per cycle solves the problem. You 
> can also limit the fetcher's run time, check the fetcher.* settings.
>  
> -----Original message-----
>> From:Alberto Ramos <[email protected]>
>> Sent: Tuesday 18th February 2014 15:38
>> To: [email protected]
>> Subject: Crawling on slow and fast sites parallely
>>
>> Hi,
>> I use Nutch 2 on Hadoop in order to crawl a few sites.
>> One of them is deep and fast and others are shallow and slow.
>> During the first fetches the fast site finishes after about 2 minutes and
>> then waits for the slow sites, which finish after about 40 minutes. After
>> Nutch is done crawling the slow sites, the fast site is still being fetched
>> (because it is deeper). I don't want to use fetcher.max.crawl.delay since I
>> do want to crawl both sites. My temporary solution is to run a separate
>> Nutch process for each site, which is obviously very ugly and doesn't take
>> advantage of the Hadoop architecture.
>> Any suggestions for performance improvement?
>>
