Thanks, Alex, this seems to be the problem, indeed. I increased the heap size for HBase (non-distributed, everything in the same JVM) and now it fetches continuously at 30 MBit/s, fantastic!
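
For anyone hitting the same symptom: the change was nothing more than raising the standard HBase heap setting in conf/hbase-env.sh. The value below (in MB, as on the HBase 0.9x line used with Nutch 2.x) is only an illustration, not the exact number I used - size it to your own load:

    # conf/hbase-env.sh - heap for the (pseudo-)standalone HBase JVM
    export HBASE_HEAPSIZE=4000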
On Fri, 2013-07-05 at 18:17 -0400, [email protected] wrote:
> It seems to me that in this case HBase consumes a lot of memory under a
> heavy load. Try to decrease the number of threads or even the number of
> mappers.
>
> hth.
> Alex.
>
> -----Original Message-----
> From: Martin Aesch <[email protected]>
> To: user <[email protected]>
> Sent: Fri, Jul 5, 2013 2:51 pm
> Subject: Nutch 2.x performant and hassle-free crawling
>
> Dear nutchers,
>
> So, I set up my nutch-2.2.1 with an HBase backend (finally) on a Core i7
> with 32 GB RAM and 6 TB disk. Before scaling out, I want to find the limits
> of a single machine (HBase: pseudo-distributed, MR: 8 parallel mappers,
> 8 parallel reducers). I am using the full DMOZ URL set with roughly 4M raw
> URLs, fetching with 8 mapper tasks, each with 75 threads. I am really happy
> with Hadoop/HBase. What I want (see also below): as much fetching as
> possible - if some URLs are not fetched, I do not care. On my way to that
> goal I ran into some trouble:
>
> [Beforehand: I had some issues with GeneratorReducer: "Task ... failed to
> report status for 600 seconds. Killing!" I extended GeneratorReducer to do
> a context.progress() every 0.5 * mapred.task.timeout, and now it seemingly
> works without issues.]
>
> Again, I get the "failed to report status for 600 seconds", this time in
> the "fetch" step. It is a little different here. What I already noted is
> that FetcherReducer does, at the end of reduce():
>
>     // some requests seem to hang, despite all intentions
>     if ((System.currentTimeMillis() - lastRequestStart.get()) > timeout) {
>       LOG.warn("Aborting with " + activeThreads + " hung threads.");
>       return;
>     }
>
> But it does not kill the currently running fetch threads. What it does is
> just prevent new URLs from being put onto the fetch queues. In the Hadoop
> log I see, exactly 600 s after the "Aborting with ... hung threads"
> message, that Hadoop kills the task. Not Nutch.
>
> Then I added fetcher.throughput.threshold.pages=5 in nutch-site.xml. Still
> the same problem, I get the "failed to report status for 600 seconds". So:
> is there any possibility to limit the runtime of each individual HTTP GET
> request? Did I overlook any other possibility?
>
> Other idea: fetcher.timelimit.mins. This sets "timelimit" in
> FetcherReducer, which, looking at the source, more or less just hinders new
> URLs from being queued. And additionally, it seemingly prevents the failed
> reduce task from being requeued again by Hadoop, right? Is that all?
>
> All of these mechanisms prevent new URLs from being queued, nothing else,
> right? How can I reliably make the task end? If not: can I just assume
> that, ALTHOUGH some of the reduce tasks of a job fail with "failed to
> report status for 600 seconds", I can proceed with parsing? I do not care
> about missing some URLs. I need quick and efficient fetching and want to
> optimize my crawls in that direction. What I fear is: the problematic URLs
> are not necessarily at the end of a task. They might be at the start or in
> the middle and agglomerate. What then?
>
> Best wishes,
> Martin
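
PS, for the archives: the progress-reporting workaround mentioned above amounts to something like the sketch below. It is an illustration only, not the actual patch - the class name is made up, and it reports progress from a background timer rather than from inside reduce() itself - but the effect is the same: a context.progress() call at least every 0.5 * mapred.task.timeout, so the framework does not kill the task after 600 seconds while a single slow fetch blocks the reducer.

    import java.util.Timer;
    import java.util.TimerTask;

    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative base class, not Nutch code: any reducer extending it keeps
    // sending heartbeats to the framework while reduce() is busy.
    public abstract class ProgressReportingReducer<KI, VI, KO, VO>
        extends Reducer<KI, VI, KO, VO> {

      private Timer heartbeat;

      @Override
      protected void setup(final Context context) {
        // mapred.task.timeout is in milliseconds; the Hadoop 1.x default is 600000.
        long timeout = context.getConfiguration()
            .getLong("mapred.task.timeout", 600000L);
        heartbeat = new Timer(true); // daemon thread, dies with the task JVM
        heartbeat.scheduleAtFixedRate(new TimerTask() {
          @Override
          public void run() {
            context.progress(); // tell the TaskTracker this task is still alive
          }
        }, timeout / 2, timeout / 2);
      }

      @Override
      protected void cleanup(Context context) {
        if (heartbeat != null) {
          heartbeat.cancel();
        }
      }
    }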

