It seems to me that in this case HBase consumes a lot of memory under heavy load. Try to decrease the number of threads, or even the number of mappers.

hth. Alex.
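
[A quick, hypothetical way to see which knobs that advice maps to, assuming the stock property names (fetcher.threads.fetch for threads per fetch task, and the Hadoop 1.x tasktracker slot limits for parallel mappers/reducers). The actual lowering would be done in nutch-site.xml and mapred-site.xml; the snippet only reads and prints the values:]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    /**
     * Hypothetical helper that prints the load-related knobs. The fallback
     * values passed to getInt() are only used when a property is not set
     * anywhere on the classpath.
     */
    public class ShowFetchLoadSettings {
      public static void main(String[] args) {
        // Loads nutch-default.xml plus nutch-site.xml from the classpath.
        Configuration conf = NutchConfiguration.create();

        // Threads per fetch task (Martin runs 75); lowered via nutch-site.xml.
        System.out.println("fetcher.threads.fetch = "
            + conf.getInt("fetcher.threads.fetch", 10));

        // Parallel map/reduce slots per node (Hadoop 1.x property names);
        // lowered via mapred-site.xml to run fewer tasks at once.
        System.out.println("mapred.tasktracker.map.tasks.maximum = "
            + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
        System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
            + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
      }
    }
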
-----Original Message-----
From: Martin Aesch <[email protected]>
To: user <[email protected]>
Sent: Fri, Jul 5, 2013 2:51 pm
Subject: Nutch 2.x performant and hassle-free crawling

Dear nutchers,

So, I set up my nutch-2.2.1 with HBase backend (finally) on a Core i7 with 32GB RAM and 6TB disk. Before scaling out, I want to find the limits of a single machine (HBase: pseudo-distributed, MR: 8 parallel mappers, 8 parallel reducers). I am using the full DMOZ URL set with roughly 4M raw URLs, fetching with 8 mapper tasks, each with 75 threads. I am really happy with Hadoop/HBase.

What I want (see also below): as much fetching as possible - if some URLs are not fetched, I do not care. On my way to that goal I ran into some trouble.

[Beforehand: I had some issues with GeneratorReducer: "Task ... failed to report status for 600 seconds. Killing!" I extended GeneratorReducer and call context.progress() every 0.5*mapred.task.timeout; now it seemingly works without issues.]

Again, I get the "failed to report status for 600 seconds", this time in the "fetch" step. It is a little different here. What I already noted is that FetcherReducer does, at the end of reduce():

    // some requests seem to hang, despite all intentions
    if ((System.currentTimeMillis() - lastRequestStart.get()) > timeout) {
      LOG.warn("Aborting with " + activeThreads + " hung threads.");
      return;
    }

But this does not kill the currently running fetch threads. What it does is just prevent new URLs from being put onto the fetch queues. In the Hadoop log I see, exactly 600s after the message "Aborting with ... hung threads", that Hadoop kills the task - not Nutch.

Then, I added fetcher.throughput.threshold.pages=5 in nutch-site.xml. Still the same problem: I get the "failed to report status for 600 seconds".

So: is there any possibility to limit the runtime of each individual HTTP GET request? Did I overlook any other possibility?

Other idea: fetcher.timelimit.mins. Looking at the source, this sets "timelimit" in FetcherReducer, which more or less just stops new URLs from being queued. And additionally, it seemingly prevents the failed reduce job from being requeued by Hadoop, right?

Is that all? All these mechanisms prevent new URLs from being queued, nothing else, right? How can I reliably make the task end?

If not: can I just assume that ALTHOUGH some of the reduce tasks of a job fail with "failed to report status for 600 seconds", I can proceed with parsing? I do not care about missing some URLs. I need quick and efficient fetching and want to optimize my crawls in that direction.

What I fear is: the problematic URLs are not necessarily at the end of a task. They might be at the start or in the middle and pile up. What then?

Best wishes,
Martin
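
[A minimal sketch of the keep-alive idea Martin describes above for the generator step: a small scheduled thread that calls context.progress() at half of mapred.task.timeout so the task tracker does not kill a slow reduce. This is not the actual GeneratorReducer patch; the class name and the dummy reduce body are invented for illustration:]

    import java.io.IOException;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    /**
     * Hypothetical reducer that reports progress from a timer thread so that a
     * long-running reduce() does not trip "failed to report status for 600 seconds".
     */
    public class KeepAliveReducer extends Reducer<Text, Text, Text, Text> {

      private ScheduledExecutorService keepAlive;

      @Override
      protected void setup(final Context context) {
        // Ping every half of mapred.task.timeout (default 600000 ms).
        long timeout = context.getConfiguration().getLong("mapred.task.timeout", 600000L);
        keepAlive = Executors.newSingleThreadScheduledExecutor();
        keepAlive.scheduleAtFixedRate(new Runnable() {
          public void run() {
            context.progress(); // tell the tasktracker this task is still alive
          }
        }, timeout / 2, timeout / 2, TimeUnit.MILLISECONDS);
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        // The real (possibly slow) per-key work would go here.
        for (Text value : values) {
          context.write(key, value);
        }
      }

      @Override
      protected void cleanup(Context context) {
        keepAlive.shutdownNow();
      }
    }
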
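[On the "Aborting with ... hung threads" snippet above: as Martin notes, it only stops new URLs from being queued. If the goal were to make the reduce task actually end, the remaining fetcher threads would also have to be tracked and interrupted. A rough, hypothetical sketch of that idea follows - not the real FetcherReducer, all names are invented, and interrupt() only helps where the blocked I/O is interruptible, so a truly stuck socket read may additionally need its socket closed:]

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    /**
     * Hypothetical skeleton: keep references to the fetcher threads and interrupt
     * the stragglers once the hung-thread deadline passes, instead of merely
     * refusing to queue new URLs.
     */
    public class HungThreadReaper {

      private final List<Thread> fetcherThreads = new ArrayList<Thread>();
      private final AtomicLong lastRequestStart = new AtomicLong(System.currentTimeMillis());

      public void startFetcher(Runnable fetchLoop) {
        Thread t = new Thread(fetchLoop, "fetcher-" + fetcherThreads.size());
        fetcherThreads.add(t);
        t.start();
      }

      /** Fetcher threads call this whenever they begin a new request. */
      public void markRequestStart() {
        lastRequestStart.set(System.currentTimeMillis());
      }

      /** Called where the original code only logs "Aborting with ... hung threads". */
      public void reapIfHung(long timeoutMs) {
        if (System.currentTimeMillis() - lastRequestStart.get() > timeoutMs) {
          for (Thread t : fetcherThreads) {
            if (t.isAlive()) {
              t.interrupt(); // only unblocks interruptible waits, sleeps and NIO reads
            }
          }
          // Give the threads a short window to unwind before reduce() returns.
          for (Thread t : fetcherThreads) {
            try {
              t.join(5000);
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
              break;
            }
          }
        }
      }
    }
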
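[And on limiting the runtime of each individual HTTP GET: one generic way to put a hard cap on a single request is to run it through a Future and cancel it after a deadline. This is only the general Java pattern, not how Nutch's protocol plugins are wired; the class, method and values below are illustrative. Nutch's http.timeout property covers the socket-level timeout, but a slowly dripping response can keep resetting a read timeout, so it is not a hard cap on the whole request.]

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    /**
     * Hypothetical illustration of a hard per-request cap: run the GET in a worker
     * thread and give up (cancel it) after a fixed deadline.
     */
    public class BoundedGet {

      private static final ExecutorService POOL = Executors.newCachedThreadPool();

      /** Returns the HTTP status code, or -1 if the request missed the deadline or failed. */
      public static int fetchWithDeadline(final String url, long deadlineSeconds) {
        Future<Integer> request = POOL.submit(new Callable<Integer>() {
          public Integer call() throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(10000); // per-phase socket timeouts still apply
            conn.setReadTimeout(10000);
            try {
              return conn.getResponseCode();
            } finally {
              conn.disconnect();
            }
          }
        });
        try {
          return request.get(deadlineSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
          request.cancel(true); // stop waiting; the worker thread is asked to give up
          return -1;
        } catch (Exception e) {
          return -1;
        }
      }

      public static void main(String[] args) {
        System.out.println(fetchWithDeadline("http://example.com/", 30));
        POOL.shutdownNow();
      }
    }
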

