Hello Shubham,

You can eliminate the separate parse step by enabling the fetcher.parse parameter, so that pages are parsed while they are fetched. This used to be considered bad advice, but it is only very rarely a problem these days: hanging fetchers can still terminate themselves in a proper manner. I am not sure about 2.x, but I think you can use this parameter there as well.
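As a sketch, enabling it would look like this in conf/nutch-site.xml (the property name comes from the standard Nutch configuration; check your nutch-default.xml for the current default):

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>fetcher.parse</name>
  <!-- Parse pages during the fetch job instead of in a separate parse job -->
  <value>true</value>
</property>
```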
Maximizing bandwidth and CPU usage is a matter of finding the right balance between the number of fetchers and threads, which you control. Tune it as you see fit. And remember, crawling a lot simply takes a lot of time, as it always will :)

Markus

-----Original message-----
> From: shubham.gupta <[email protected]>
> Sent: Friday 29th July 2016 6:00
> To: [email protected]
> Subject: Nutch is taking very long time to complete crawl job: Nutch 2.3.1 +
> hadoop 2.7.1 + yarn
>
> Hi
>
> I am trying to use Nutch 2.3.1 with a 3-datanode (4 GB RAM each) Hadoop
> 2.7.1 cluster. The seed list provided consists of around 5000 URLs. I
> am using 60 threads and 5 numTasks for crawling these URLs at a distance
> of 1, but it is taking 1 day to complete the crawl job (Inject: 1
> minute 35 seconds, Generate: 1 minute 35 seconds, Fetch: 11 hours 41
> minutes, Parse: 13 hours 42 minutes, Update-DB: 38 minutes 43 seconds),
> which is very long in terms of crawling. I want to crawl these URLs
> within 2-3 hours.
>
> The maximum memory allocated to YARN per container is 8 GB and the vCores
> provided are 8.
>
> I am unable to identify whether this is a problem with the Hadoop cluster
> configuration or with Nutch.
>
> Please help. Thanks in advance.
>
>
> --
> Shubham Gupta
>
>
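P.S. For reference, the fetcher/thread balance mentioned above maps to configuration properties like the following (a sketch only; the values shown are illustrative and should be tuned to your own cluster and politeness requirements):

```xml
<!-- conf/nutch-site.xml -->
<property>
  <name>fetcher.threads.fetch</name>
  <!-- Total number of fetcher threads per fetch task -->
  <value>60</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <!-- Threads allowed per host queue; raising this above 1 increases
       per-host load and may violate crawl politeness -->
  <value>1</value>
</property>
```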

