Hi

I am trying to use Nutch 2.3.1 on a Hadoop 2.7.1 cluster with 3 datanodes (4 GB RAM each). The seed list contains around 5000 URLs. I am using 60 threads and 5 numTasks to crawl these URLs at a distance of 1, but the crawl job takes about 1 day to complete (Inject: 1 minute 35 seconds, Generate: 1 minute 35 seconds, Fetch: 11 hours 41 minutes, Parse: 13 hours 42 minutes, Update-DB: 38 minutes 43 seconds), which is far too long for a crawl of this size. I want to crawl these URLs within 2-3 hours.
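
To make the breakdown easier to read, here is a small sketch that totals the phase timings reported above and shows each phase's share of the run. The names `PHASES`, `to_seconds`, and `summarize` are just illustrative helpers, not part of Nutch or Hadoop. The durations are copied from the figures above, with missing seconds assumed to be zero.

```python
# Phase durations as reported above, written as (hours, minutes, seconds).
# Where no seconds were reported, 0 is assumed.
PHASES = {
    "Inject": (0, 1, 35),
    "Generate": (0, 1, 35),
    "Fetch": (11, 41, 0),
    "Parse": (13, 42, 0),
    "Update-DB": (0, 38, 43),
}


def to_seconds(hours, minutes, seconds):
    """Convert an (h, m, s) duration to a number of seconds."""
    return hours * 3600 + minutes * 60 + seconds


def summarize(phases):
    """Return the total run time in seconds and each phase's share in percent."""
    seconds = {name: to_seconds(*hms) for name, hms in phases.items()}
    total = sum(seconds.values())
    shares = {name: 100.0 * secs / total for name, secs in seconds.items()}
    return total, shares


if __name__ == "__main__":
    total, shares = summarize(PHASES)
    print(f"Total: {total // 3600}h {total % 3600 // 60}m {total % 60}s")
    # List the phases from largest to smallest share of the run.
    for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:10s} {pct:5.1f}%")
```

By this count the run adds up to a little over 26 hours, and Fetch and Parse together account for roughly 97% of it, so those two phases are where I would most like to find the bottleneck.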

The maximum memory allocated to YARN per container is 8 GB, and 8 vCores are provided.

I am unable to identify whether this is a problem with the Hadoop cluster configuration or with Nutch.

Please help. Thanks in advance.


--
Shubham Gupta
