Hello Shubham,

You can always eliminate the separate parse step by enabling the fetcher.parse 
parameter, so pages are parsed while they are fetched. This used to be 
considered bad advice, but nowadays it is only very rarely a problem: hanging 
fetchers can still terminate themselves properly. I am not sure about 2.x, but 
I think you can use this parameter there as well.
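If it works in your version, a minimal nutch-site.xml fragment would look 
roughly like this (sketch only; verify the property against the 
nutch-default.xml shipped with your release):

```xml
<!-- nutch-site.xml: parse documents inside the fetch job,
     skipping the separate parse step -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
```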

Maximizing bandwidth and CPU usage is a matter of finding the right balance 
between the number of fetcher tasks and the number of threads, both of which 
you control. Tune them as you see fit. And remember, crawling a lot simply 
takes a lot of time, as it always will :)
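For the thread side, the usual knobs are properties like these in 
nutch-site.xml (the values below are just examples, not recommendations; 
defaults vary per release):

```xml
<!-- nutch-site.xml: fetcher thread settings (example values) -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>60</value> <!-- fetcher threads per fetch task -->
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value> <!-- threads per host queue; raising this hits
                        individual hosts harder, so be polite -->
</property>
```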

Markus
 
-----Original message-----
> From:shubham.gupta <[email protected]>
> Sent: Friday 29th July 2016 6:00
> To: [email protected]
> Subject: Nutch is taking very long time to complete crawl job :Nutch 2.3.1 + 
> hadoop 2.7.1 +yarn
> 
> Hi
> 
> I am trying to use Nutch 2.3.1 with a 3-datanode (4 GB RAM each) Hadoop 
> 2.7.1 cluster. The seed list provided consists of around 5000 URLs. I 
> am using 60 threads and 5 numTasks for crawling these URLs at a distance 
> of 1, but it is taking 1 day to complete the crawl job (Inject: 1 
> minute 35 seconds, Generate: 1 minute 35 seconds, Fetch: 11 hours 41 
> minutes, Parse: 13 hours 42 minutes, Update-DB: 38 minutes 43 seconds), 
> which is very long in terms of crawling. I want to crawl these URLs 
> within 2-3 hours.
> 
> The maximum memory allocated to YARN per container is 8 GB and the 
> vCores provided are 8.
> 
> I am unable to identify whether this is a problem with the Hadoop 
> cluster configuration or with Nutch.
> 
> Please help. Thanks in advance.
> 
> 
> -- 
> Shubham Gupta
> 
> 
