Nutch is taking a very long time to complete crawl job: Nutch 2.3.1 + Hadoop 2.7.1 + YARN

shubham.gupta Thu, 28 Jul 2016 21:01:31 -0700

Hi

I am trying to use Nutch 2.3.1 with a Hadoop 2.7.1 cluster of 3 datanodes (4 GB RAM each). The seed list consists of around 5000 URLs. I am using 60 threads and 5 numTasks to crawl these URLs at a distance of 1, but the crawl job takes about 1 day to complete (Inject: 1 minute 35 seconds, Generate: 1 minute 35 seconds, Fetch: 11 hours 41 minutes, Parse: 13 hours 42 minutes, Update-DB: 38 minutes 43 seconds), which is very long for a crawl. I want to crawl these URLs within 2-3 hours.
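
To make the breakdown easier to read, the phase timings above can be totaled with a small script. This is only a sketch of the arithmetic: the durations are copied from the figures above, and the names `phases` and `to_seconds` are just illustrative, not part of Nutch or Hadoop.

```python
# Phase durations as reported above, as (hours, minutes, seconds).
phases = {
    "Inject": (0, 1, 35),
    "Generate": (0, 1, 35),
    "Fetch": (11, 41, 0),
    "Parse": (13, 42, 0),
    "Update-DB": (0, 38, 43),
}


def to_seconds(hours, minutes, seconds):
    """Convert an (h, m, s) duration to a number of seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Seconds per phase and the overall total.
seconds = {name: to_seconds(*hms) for name, hms in phases.items()}
total = sum(seconds.values())

# Print each phase with its share of the total run time.
for name, secs in seconds.items():
    print(f"{name:<10} {secs:>6} s  {100 * secs / total:5.1f}%")

hours, rest = divmod(total, 3600)
minutes, secs = divmod(rest, 60)
print(f"Total      {hours}h {minutes}m {secs}s")
```

By this count the run takes roughly 26 hours, and Fetch and Parse together account for about 97% of it, so those two phases are where the time is going.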

The maximum memory allocated to YARN per container is 8 GB, and 8 vCores are provided.

I am unable to identify whether this is a problem with the Hadoop cluster configuration or with Nutch.


Please help. Thanks in advance.


--
Shubham Gupta
