Hi Lewis,

I have about 25K URLs per map task and around 8M URLs total. All 6 mappers are running and produce output continuously. The aggregated parse rate is < 100 URLs/sec.
What I did now: I replaced neko with tagsoup in nutch-site.xml and resumed the parsing. As expected, I now mostly see "Skipping ... already parsed". The aggregated parse rate is the same, less than 100 URLs/sec. Load is now < 1, CPU is 95% idle. It looks as if the mapper tasks do not get enough input. Region server heap usage is now 4G out of 12G, with about 225 regions assigned. I am monitoring my system with Ganglia and did not see anything suspicious (being a Hadoop/HBase noob). I am about to increase gora.buffer.read.limit for a new test, although the default of 10000 already seems very reasonable to me.

Martin

On Fri, 2013-07-19 at 21:29 -0700, Lewis John Mcgibbney wrote:
> Hi Martin,
> Have you checked that all mappers are working while the parsing job is running?
> How many URLs are you trying to parse here?
>
> On Friday, July 19, 2013, Martin Aesch <[email protected]> wrote:
> > Dear nutchers,
> >
> > Having Nutch 2.2.1/HBase 0.90.6/Hadoop 1.1.2/6 Mappers/6 Reducers/Core
> > i7-3770/32GB (no swap)/2x3TB
> >
> > When I parse (in mapper, 6 simultaneously running map tasks), this is
> > very slow. Max load is ~1.5, max iowait is 5%, max CPU per task is only
> > 30%, max CPU for hmaster is about 30%. iotop consequently also shows
> > low numbers.
> >
> > Since parsing is a CPU-intensive job and all IO activity is very low,
> > I wonder why parsing does not run faster and with full CPU usage. It
> > really takes a long time to finish. Where might the bottleneck be?
> >
> > Thanks for any advice,
> > Martin
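P.S. For the archives, the two nutch-site.xml settings I mentioned look roughly like this (a sketch only; parser.html.impl is the standard Nutch 2.x property for switching the HTML parser implementation, and the gora.buffer.read.limit value shown is just an example, not a recommendation):

```xml
<!-- nutch-site.xml fragment (sketch, values are examples) -->
<property>
  <!-- switch the parse-html plugin from the default "neko" to "tagsoup" -->
  <name>parser.html.impl</name>
  <value>tagsoup</value>
</property>
<property>
  <!-- how many rows Gora buffers per scan; default is 10000 -->
  <name>gora.buffer.read.limit</name>
  <value>20000</value>
</property>
```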

