Hey Markus,
I am trying to perform RSS crawling with Nutch, so I need the whole crawl cycle to complete within 1 hour.
Following your suggestion, I set fetcher.parse = true, which reduced the
fetch phase to 44 minutes and fetched 9195 pages. However, the parse
phase took over 14 hours to complete and parsed only 11935 documents,
which is a very low count for the time taken.
These are the settings I am using:
http.timeout = 99999
fetcher.threads.per.queue = 5
fetcher.threads.fetch = 100
numTasks = 5
fetcher.queue.mode = byHost
Number of URLs in seed list = 5085
3-node Hadoop cluster with 6 GB RAM on each datanode.
Also, we are using MongoDB as the datastore.
Network Bandwidth dedicated to Nutch: 2 Mbps
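For reference, a minimal sketch of the corresponding entries in conf/nutch-site.xml (assuming the standard Nutch property names, with the values listed above):

  <property>
    <name>http.timeout</name>
    <value>99999</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>5</value>
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>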
Please help.
Shubham Gupta
On 07/29/2016 05:03 PM, Markus Jelsma wrote:
Hello Shubham,
You can always eliminate the separate parse step by enabling the fetcher.parse
parameter. That used to be bad advice, but it is only very rarely a problem now;
hanging fetchers can still terminate themselves in a proper manner. I am not
sure about 2.x, but I think you can use this parameter there as well.
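For example, a minimal sketch of how it could be enabled in conf/nutch-site.xml:

  <property>
    <name>fetcher.parse</name>
    <value>true</value>
    <description>Parse content during fetching instead of in a separate parse job.</description>
  </property>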
Maximizing bandwidth and CPU usage is a matter of finding the right balance
between the number of fetcher tasks and the number of threads, both of which
you control. Try to tune them as you see fit. And remember, crawling a lot of
pages simply takes a lot of time, as it always will :)
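As a rough illustration only (the numbers are an assumption, not a recommendation): 5 fetcher map tasks with fetcher.threads.fetch = 20 give roughly 5 x 20 = 100 concurrent fetch threads in total, while fetcher.threads.per.queue still caps how many of those hit any single host at once.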
Markus
-----Original message-----
From: shubham.gupta <[email protected]>
Sent: Friday 29th July 2016 6:00
To: [email protected]
Subject: Nutch is taking a very long time to complete crawl job: Nutch 2.3.1 +
Hadoop 2.7.1 + YARN
Hi
I am trying to use Nutch 2.3.1 on a 3-datanode (4 GB RAM each) Hadoop
2.7.1 cluster. The seed list provided consists of around 5000 URLs. I
am using 60 threads and 5 numTasks to crawl these URLs at a depth
of 1, but it is taking 1 day to complete the crawl job (Inject: 1
minute 35 seconds, Generate: 1 minute 35 seconds, Fetch: 11 hours 41
minutes, Parse: 13 hours 42 minutes, Update-DB: 38 minutes 43 seconds),
which is very long. I want to crawl these URLs within 2-3 hours.
The maximum memory allocated to YARN per container is 8 GB, and 8 vCores
are provided.
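For reference, a sketch of the yarn-site.xml entries I believe correspond to these limits (assuming these are the properties being referred to):

  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>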
I am unable to tell whether this is a problem with the Hadoop cluster
configuration or with Nutch.
Please help. Thanks in advance.
--
Shubham Gupta