Thanks, Alex, this seems to be the problem, indeed. I increased the heap size for HBase (non-distributed, everything in the same JVM) and now it fetches continuously at 30 MBit/s, fantastic!
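
For anyone hitting the same symptom: the change was nothing more than raising the standard HBase heap setting in conf/hbase-env.sh. The value below (in MB, as on the HBase 0.9x line used with Nutch 2.x) is only an illustration, not the exact number I used - size it to your own load:

    # conf/hbase-env.sh - heap for the (pseudo-)standalone HBase JVM
    export HBASE_HEAPSIZE=4000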
On Fri, 2013-07-05 at 18:17 -0400, [email protected] wrote:
> It seems to me that in this case HBase consumes a lot of memory under a
> heavy load. Try to decrease the number of threads or even the number of
> mappers.
>
> hth.
> Alex.
>
> -----Original Message-----
> From: Martin Aesch <[email protected]>
> To: user <[email protected]>
> Sent: Fri, Jul 5, 2013 2:51 pm
> Subject: Nutch 2.x performant and hassle-free crawling
>
> Dear nutchers,
>
> So, I set up my nutch-2.2.1 with an HBase backend (finally) on a Core i7
> with 32 GB RAM and 6 TB disk. Before scaling out, I want to find the limits
> of a single machine (HBase: pseudo-distributed, MR: 8 parallel mappers,
> 8 parallel reducers). I am using the full DMOZ URL set with roughly 4M raw
> URLs, fetching with 8 mapper tasks, each with 75 threads. I am really happy
> with Hadoop/HBase. What I want (see also below): as much fetching as
> possible - if some URLs are not fetched, I do not care. On my way to that
> goal I ran into some trouble:
>
> [Beforehand: I had some issues with GeneratorReducer: "Task ... failed to
> report status for 600 seconds. Killing!" I extended GeneratorReducer to do
> a context.progress() every 0.5 * mapred.task.timeout, and now it seemingly
> works without issues.]
>
> Again, I get the "failed to report status for 600 seconds", this time in
> the "fetch" step. It is a little different here. What I already noted is
> that FetcherReducer does, at the end of reduce():
>
>     // some requests seem to hang, despite all intentions
>     if ((System.currentTimeMillis() - lastRequestStart.get()) > timeout) {
>       LOG.warn("Aborting with " + activeThreads + " hung threads.");
>       return;
>     }
>
> But it does not kill the currently running fetch threads. What it does is
> just prevent new URLs from being put onto the fetch queues. In the Hadoop
> log I see, exactly 600 s after the "Aborting with ... hung threads"
> message, that Hadoop kills the task. Not Nutch.
>
> Then I added fetcher.throughput.threshold.pages=5 in nutch-site.xml. Still
> the same problem, I get the "failed to report status for 600 seconds". So:
> is there any possibility to limit the runtime of each individual HTTP GET
> request? Did I overlook any other possibility?
>
> Other idea: fetcher.timelimit.mins. This sets "timelimit" in
> FetcherReducer, which, looking at the source, more or less just hinders new
> URLs from being queued. And additionally, it seemingly prevents the failed
> reduce task from being requeued again by Hadoop, right? Is that all?
>
> All of these mechanisms prevent new URLs from being queued, nothing else,
> right? How can I reliably make the task end? If not: can I just assume
> that, ALTHOUGH some of the reduce tasks of a job fail with "failed to
> report status for 600 seconds", I can proceed with parsing? I do not care
> about missing some URLs. I need quick and efficient fetching and want to
> optimize my crawls in that direction. What I fear is: the problematic URLs
> are not necessarily at the end of a task. They might be at the start or in
> the middle and agglomerate. What then?
>
> Best wishes,
> Martin
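
PS, for the archives: the progress-reporting workaround mentioned above amounts to something like the sketch below. It is an illustration only, not the actual patch - the class name is made up, and it reports progress from a background timer rather than from inside reduce() itself - but the effect is the same: a context.progress() call at least every 0.5 * mapred.task.timeout, so the framework does not kill the task after 600 seconds while a single slow fetch blocks the reducer.

    import java.util.Timer;
    import java.util.TimerTask;

    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative base class, not Nutch code: any reducer extending it keeps
    // sending heartbeats to the framework while reduce() is busy.
    public abstract class ProgressReportingReducer<KI, VI, KO, VO>
        extends Reducer<KI, VI, KO, VO> {

      private Timer heartbeat;

      @Override
      protected void setup(final Context context) {
        // mapred.task.timeout is in milliseconds; the Hadoop 1.x default is 600000.
        long timeout = context.getConfiguration()
            .getLong("mapred.task.timeout", 600000L);
        heartbeat = new Timer(true); // daemon thread, dies with the task JVM
        heartbeat.scheduleAtFixedRate(new TimerTask() {
          @Override
          public void run() {
            context.progress(); // tell the TaskTracker this task is still alive
          }
        }, timeout / 2, timeout / 2);
      }

      @Override
      protected void cleanup(Context context) {
        if (heartbeat != null) {
          heartbeat.cancel();
        }
      }
    }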

