Hi,
I am trying to use Nutch 2.3.1 on a Hadoop 2.7.1 cluster with 3 datanodes
(4 GB RAM each). The seed list contains around 5000 URLs. I am crawling
these URLs at a distance of 1, using 60 threads and numTasks set to 5, but
the crawl job takes about a day to complete:

  Inject:    1 minute 35 seconds
  Generate:  1 minute 35 seconds
  Fetch:     11 hours 41 minutes
  Parse:     13 hours 42 minutes
  Update-DB: 38 minutes 43 seconds

That is far too long for a crawl of this size. I would like to crawl these
URLs within 2-3 hours.
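
For reference, here is a small Python sketch that totals the phase timings reported above and shows each phase's share of the run. The variable names (phases, total, fetch_parse_share) are only illustrative and are not part of Nutch or Hadoop; the durations are the ones quoted in this message.

```python
from datetime import timedelta

# Phase durations as reported for this crawl job.
phases = {
    "Inject": timedelta(minutes=1, seconds=35),
    "Generate": timedelta(minutes=1, seconds=35),
    "Fetch": timedelta(hours=11, minutes=41),
    "Parse": timedelta(hours=13, minutes=42),
    "Update-DB": timedelta(minutes=38, seconds=43),
}

# Total wall-clock time across all phases.
total = sum(phases.values(), timedelta())

# Fraction of the total spent in Fetch and Parse combined.
fetch_parse_share = (phases["Fetch"] + phases["Parse"]) / total

for name, duration in phases.items():
    print(f"{name:<10} {str(duration):>9} {duration / total:6.1%}")
print(f"Total: {total}")
print(f"Fetch + Parse share: {fetch_parse_share:.1%}")
```

By this count the job takes roughly 26 hours in total, and Fetch and Parse together account for about 97% of it, so those two phases are where I am looking for the bottleneck.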
The maximum memory allocated to YARN per container is 8 GB, and 8 vCores
are provided.
I am unable to tell whether this is a problem with the Hadoop cluster
configuration or with Nutch.
Please help. Thanks in advance.
--
Shubham Gupta