It seems to me that in this case HBase consumes a lot of memory under heavy load. Try to decrease the number of threads, or even the number of mappers.

hth. Alex.
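
[A quick, hypothetical way to see which knobs that advice maps to, assuming the stock property names (fetcher.threads.fetch for threads per fetch task, and the Hadoop 1.x tasktracker slot limits for parallel mappers/reducers). The actual lowering would be done in nutch-site.xml and mapred-site.xml; the snippet only reads and prints the values:]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    /**
     * Hypothetical helper that prints the load-related knobs. The fallback
     * values passed to getInt() are only used when a property is not set
     * anywhere on the classpath.
     */
    public class ShowFetchLoadSettings {
      public static void main(String[] args) {
        // Loads nutch-default.xml plus nutch-site.xml from the classpath.
        Configuration conf = NutchConfiguration.create();

        // Threads per fetch task (Martin runs 75); lowered via nutch-site.xml.
        System.out.println("fetcher.threads.fetch = "
            + conf.getInt("fetcher.threads.fetch", 10));

        // Parallel map/reduce slots per node (Hadoop 1.x property names);
        // lowered via mapred-site.xml to run fewer tasks at once.
        System.out.println("mapred.tasktracker.map.tasks.maximum = "
            + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
        System.out.println("mapred.tasktracker.reduce.tasks.maximum = "
            + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
      }
    }
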
-----Original Message-----
From: Martin Aesch <[email protected]>
To: user <[email protected]>
Sent: Fri, Jul 5, 2013 2:51 pm
Subject: Nutch 2.x performant and hassle-free crawling

Dear nutchers,

So, I set up my nutch-2.2.1 with HBase backend (finally) on a Core i7 with 32GB RAM and 6TB disk. Before scaling out, I want to find the limits of a single machine (HBase: pseudo-distributed, MR: 8 parallel mappers, 8 parallel reducers). I am using the full DMOZ URL set with roughly 4M raw URLs, fetching with 8 mapper tasks, each with 75 threads. I am really happy with Hadoop/HBase.

What I want (see also below): as much fetching as possible - if some URLs are not fetched, I do not care. On my way to that goal I ran into some trouble.

[Beforehand: I had some issues with GeneratorReducer: "Task ... failed to report status for 600 seconds. Killing!" I extended GeneratorReducer and call context.progress() every 0.5*mapred.task.timeout; now it seemingly works without issues.]

Again, I get the "failed to report status for 600 seconds", this time in the "fetch" step. It is a little different here. What I already noted is that FetcherReducer does, at the end of reduce():

    // some requests seem to hang, despite all intentions
    if ((System.currentTimeMillis() - lastRequestStart.get()) > timeout) {
      LOG.warn("Aborting with " + activeThreads + " hung threads.");
      return;
    }

But this does not kill the currently running fetch threads. What it does is just prevent new URLs from being put onto the fetch queues. In the Hadoop log I see, exactly 600s after the message "Aborting with ... hung threads", that Hadoop kills the task - not Nutch.

Then, I added fetcher.throughput.threshold.pages=5 in nutch-site.xml. Still the same problem: I get the "failed to report status for 600 seconds".

So: is there any possibility to limit the runtime of each individual HTTP GET request? Did I overlook any other possibility?

Other idea: fetcher.timelimit.mins. Looking at the source, this sets "timelimit" in FetcherReducer, which more or less just stops new URLs from being queued. And additionally, it seemingly prevents the failed reduce job from being requeued by Hadoop, right?

Is that all? All these mechanisms prevent new URLs from being queued, nothing else, right? How can I reliably make the task end?

If not: can I just assume that ALTHOUGH some of the reduce tasks of a job fail with "failed to report status for 600 seconds", I can proceed with parsing? I do not care about missing some URLs. I need quick and efficient fetching and want to optimize my crawls in that direction.

What I fear is: the problematic URLs are not necessarily at the end of a task. They might be at the start or in the middle and pile up. What then?

Best wishes,
Martin
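
[A minimal sketch of the keep-alive idea Martin describes above for the generator step: a small scheduled thread that calls context.progress() at half of mapred.task.timeout so the task tracker does not kill a slow reduce. This is not the actual GeneratorReducer patch; the class name and the dummy reduce body are invented for illustration:]

    import java.io.IOException;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    /**
     * Hypothetical reducer that reports progress from a timer thread so that a
     * long-running reduce() does not trip "failed to report status for 600 seconds".
     */
    public class KeepAliveReducer extends Reducer<Text, Text, Text, Text> {

      private ScheduledExecutorService keepAlive;

      @Override
      protected void setup(final Context context) {
        // Ping every half of mapred.task.timeout (default 600000 ms).
        long timeout = context.getConfiguration().getLong("mapred.task.timeout", 600000L);
        keepAlive = Executors.newSingleThreadScheduledExecutor();
        keepAlive.scheduleAtFixedRate(new Runnable() {
          public void run() {
            context.progress(); // tell the tasktracker this task is still alive
          }
        }, timeout / 2, timeout / 2, TimeUnit.MILLISECONDS);
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        // The real (possibly slow) per-key work would go here.
        for (Text value : values) {
          context.write(key, value);
        }
      }

      @Override
      protected void cleanup(Context context) {
        keepAlive.shutdownNow();
      }
    }
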
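[On the "Aborting with ... hung threads" snippet above: as Martin notes, it only stops new URLs from being queued. If the goal were to make the reduce task actually end, the remaining fetcher threads would also have to be tracked and interrupted. A rough, hypothetical sketch of that idea follows - not the real FetcherReducer, all names are invented, and interrupt() only helps where the blocked I/O is interruptible, so a truly stuck socket read may additionally need its socket closed:]

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    /**
     * Hypothetical skeleton: keep references to the fetcher threads and interrupt
     * the stragglers once the hung-thread deadline passes, instead of merely
     * refusing to queue new URLs.
     */
    public class HungThreadReaper {

      private final List<Thread> fetcherThreads = new ArrayList<Thread>();
      private final AtomicLong lastRequestStart = new AtomicLong(System.currentTimeMillis());

      public void startFetcher(Runnable fetchLoop) {
        Thread t = new Thread(fetchLoop, "fetcher-" + fetcherThreads.size());
        fetcherThreads.add(t);
        t.start();
      }

      /** Fetcher threads call this whenever they begin a new request. */
      public void markRequestStart() {
        lastRequestStart.set(System.currentTimeMillis());
      }

      /** Called where the original code only logs "Aborting with ... hung threads". */
      public void reapIfHung(long timeoutMs) {
        if (System.currentTimeMillis() - lastRequestStart.get() > timeoutMs) {
          for (Thread t : fetcherThreads) {
            if (t.isAlive()) {
              t.interrupt(); // only unblocks interruptible waits, sleeps and NIO reads
            }
          }
          // Give the threads a short window to unwind before reduce() returns.
          for (Thread t : fetcherThreads) {
            try {
              t.join(5000);
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
              break;
            }
          }
        }
      }
    }
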
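[And on limiting the runtime of each individual HTTP GET: one generic way to put a hard cap on a single request is to run it through a Future and cancel it after a deadline. This is only the general Java pattern, not how Nutch's protocol plugins are wired; the class, method and values below are illustrative. Nutch's http.timeout property covers the socket-level timeout, but a slowly dripping response can keep resetting a read timeout, so it is not a hard cap on the whole request.]

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    /**
     * Hypothetical illustration of a hard per-request cap: run the GET in a worker
     * thread and give up (cancel it) after a fixed deadline.
     */
    public class BoundedGet {

      private static final ExecutorService POOL = Executors.newCachedThreadPool();

      /** Returns the HTTP status code, or -1 if the request missed the deadline or failed. */
      public static int fetchWithDeadline(final String url, long deadlineSeconds) {
        Future<Integer> request = POOL.submit(new Callable<Integer>() {
          public Integer call() throws Exception {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(10000); // per-phase socket timeouts still apply
            conn.setReadTimeout(10000);
            try {
              return conn.getResponseCode();
            } finally {
              conn.disconnect();
            }
          }
        });
        try {
          return request.get(deadlineSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
          request.cancel(true); // stop waiting; the worker thread is asked to give up
          return -1;
        } catch (Exception e) {
          return -1;
        }
      }

      public static void main(String[] args) {
        System.out.println(fetchWithDeadline("http://example.com/", 30));
        POOL.shutdownNow();
      }
    }
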

