Hello Shubham,

If you have set fetcher.parse to true, then do not run the parse job as well, 
because the content is already parsed during fetching. I don't know about 2.x, 
but in 1.x you cannot parse a segment that has already been parsed. If you use 
a crawl script that has the parse step hardcoded regardless of fetcher.parse, 
then either don't use that script or modify it so it skips the parse job.
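
In nutch-site.xml that setting would look roughly like this (a sketch; the 
property name is standard, the description text is mine):

  <property>
    <name>fetcher.parse</name>
    <value>true</value>
    <description>Parse content within the fetcher, so the separate
    parse job can be skipped.</description>
  </property>

And if your crawl script hardcodes the parse step, comment out or guard the 
line that launches it, something like this (illustration only, the exact 
invocation differs per script):

  # Skip the parse job: fetcher.parse=true already handles parsing.
  # "$bin/nutch" parse ...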

Markus

-----Original message-----
> From: shubham.gupta <[email protected]>
> Sent: Tuesday 2nd August 2016 7:01
> To: [email protected]
> Subject: Re: Nutch is taking very long time to complete crawl job: Nutch 
> 2.3.1 + hadoop 2.7.1 + Yarn
> 
> Hey Markus,
> 
> What I am trying to do is RSS crawling with Nutch, so I need the whole 
> process to complete within 1 hour. Following your suggestion, I set 
> fetcher.parse = true, which reduced the fetch phase to 44 minutes and 
> fetched 9195 pages. But the parse phase took over 14 hours to complete 
> and parsed only 11935 documents, which is a very low count considering 
> the time taken.
> 
> These are the settings I am using (see the nutch-site.xml sketch after 
> this list):
> 
> http.timeout = 99999
> 
> fetcher.threads.per.queue = 5
> 
> fetcher.threads.fetch = 100
> 
> numTasks = 5
> 
> fetcher.queue.mode = byHost
> 
> Number of URLs in seed list = 5085
> 
> 3-node Hadoop cluster with 6 GB RAM on each datanode.
> 
> Also, we are using MongoDB as the backing database.
> 
> Network Bandwidth dedicated to Nutch: 2 Mbps
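> 
> For reference, in nutch-site.xml these map to properties roughly like the 
> following sketch; numTasks is passed on the crawl command line rather than 
> set as a property, as far as I know:
> 
>   <property>
>     <name>http.timeout</name>
>     <value>99999</value>
>   </property>
>   <property>
>     <name>fetcher.threads.per.queue</name>
>     <value>5</value>
>   </property>
>   <property>
>     <name>fetcher.threads.fetch</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>fetcher.queue.mode</name>
>     <value>byHost</value>
>   </property>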
> 
> Please help.
> 
> Shubham Gupta
> 
> On 07/29/2016 05:03 PM, Markus Jelsma wrote:
> > Hello Shubham,
> >
> > You can always eliminate the separate parse step by enabling the 
> > fetcher.parse parameter. That used to be considered bad advice, but it is 
> > only very rarely a problem these days; hanging fetchers can still terminate 
> > themselves in a proper manner. I am not sure about 2.x, but I think you can 
> > use this parameter there as well.
> >
> > Maximizing bandwidth and CPU usage is a matter of finding the right balance 
> > between the number of fetcher tasks and threads, both of which you control. 
> > Tune them as you see fit. And remember, crawling a lot simply takes a lot 
> > of time, as it always will :)
> >
> > Markus
> >   
> > -----Original message-----
> >> From: shubham.gupta <[email protected]>
> >> Sent: Friday 29th July 2016 6:00
> >> To: [email protected]
> >> Subject: Nutch is taking very long time to complete crawl job: Nutch 2.3.1 
> >> + hadoop 2.7.1 + Yarn
> >>
> >> Hi
> >>
> >> I am trying to use Nutch 2.3.1 on a Hadoop 2.7.1 cluster with 3 datanodes
> >> (4 GB RAM each). The seed list provided consists of around 5000 URLs. I
> >> am using 60 threads and 5 numTasks to crawl these URLs at a depth of 1,
> >> but it is taking 1 day to complete the crawl job (Inject: 1 minute 35
> >> seconds, Generate: 1 minute 35 seconds, Fetch: 11 hours 41 minutes,
> >> Parse: 13 hours 42 minutes, Update-DB: 38 minutes 43 seconds), which is
> >> very long in terms of crawling. I want to crawl these URLs within 2-3
> >> hours.
> >>
> >> The maximum memory allocated to YARN per container is 8 GB, and 8 vCores
> >> are provided.
> >>
> >> I am unable to identify whether this is a problem with the Hadoop cluster
> >> configuration or with Nutch.
> >>
> >> Please help. Thanks in advance.
> >>
> >>
> >> -- 
> >> Shubham Gupta
> >>
> >>
> 
> 
