Hey Markus,
I am trying to perform RSS crawling with Nutch, so I need the whole crawl cycle to complete within 1 hour.
Following your suggestion, I set fetcher.parse = true, which reduced the
fetch phase to 44 minutes and fetched 9195 pages. However, the parse
phase took over 14 hours to complete and parsed only 11935 documents,
which is a very low count for the time taken.
These are the settings I am using:
http.timeout = 99999
fetcher.threads.per.queue = 5
fetcher.threads.fetch = 100
numTasks = 5
fetcher.queue.mode = byHost
Number of URLs in seed list = 5085
3-node Hadoop cluster with 6 GB RAM on each datanode.
Also, we are using MongoDB as the datastore.
Network Bandwidth dedicated to Nutch: 2 Mbps
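For reference, a minimal sketch of the corresponding entries in conf/nutch-site.xml (assuming the standard Nutch property names, with the values listed above):

  <property>
    <name>http.timeout</name>
    <value>99999</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>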
Please help.
Shubham Gupta
On 07/29/2016 05:03 PM, Markus Jelsma wrote:
Hello Shubham,
You can always eliminate the separate parse step by enabling the fetcher.parse
parameter. That used to be bad advice, but it is only very rarely a problem now;
hanging fetchers can still terminate themselves in a proper manner. I am not
sure about 2.x, but I think you can use this parameter there as well.
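For example, a minimal sketch of how it could be enabled in conf/nutch-site.xml:

  <property>
    <name>fetcher.parse</name>
    <value>true</value>
    <description>Parse content during fetching instead of in a separate parse job.</description>
  </property>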
Maximizing bandwidth and CPU usage is a matter of finding the right balance
between the number of fetcher tasks and the number of threads, both of which
you control. Try to tune them as you see fit. And remember, crawling a lot of
pages simply takes a lot of time, as it always will :)
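As a rough illustration only (the numbers are an assumption, not a recommendation): 5 fetcher map tasks with fetcher.threads.fetch = 20 give roughly 5 x 20 = 100 concurrent fetch threads in total, while fetcher.threads.per.queue still caps how many of those hit any single host at once.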
Markus
-----Original message-----
From: shubham.gupta <[email protected]>
Sent: Friday 29th July 2016 6:00
To: [email protected]
Subject: Nutch is taking a very long time to complete crawl job: Nutch 2.3.1 +
Hadoop 2.7.1 + YARN
Hi
I am trying to use Nutch 2.3.1 on a 3-datanode (4 GB RAM each) Hadoop
2.7.1 cluster. The seed list provided consists of around 5000 URLs. I
am using 60 threads and 5 numTasks to crawl these URLs at a depth
of 1, but it is taking 1 day to complete the crawl job (Inject: 1
minute 35 seconds, Generate: 1 minute 35 seconds, Fetch: 11 hours 41
minutes, Parse: 13 hours 42 minutes, Update-DB: 38 minutes 43 seconds),
which is very long. I want to crawl these URLs within 2-3 hours.
The maximum memory allocated to YARN per container is 8 GB, and 8 vCores
are provided.
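For reference, a sketch of the yarn-site.xml entries I believe correspond to these limits (assuming these are the properties being referred to):

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>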
I am unable to tell whether this is a problem with the Hadoop cluster
configuration or with Nutch.
Please help. Thanks in advance.
--
Shubham Gupta