Hi Lewis,

I have about 25K URLs per map task and around 8M URLs total.
All 6 mappers are running and producing output continuously. The
aggregated parse rate is < 100 URLs/sec.

What I did now is replace neko with tagsoup in nutch-site.xml and
resume the parsing. As expected, I now mostly see "Skipping ... already
parsed". The aggregated parse rate is the same, less than 100 URLs/sec.
Load is now < 1 and the CPU is 95% idle. It looks as if the mapper
tasks are not getting enough input.
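For reference, the neko-to-tagsoup swap mentioned above can be done by overriding the HTML parser implementation in conf/nutch-site.xml (a minimal sketch; property name as in nutch-default.xml):

```xml
<!-- conf/nutch-site.xml: select TagSoup instead of NekoHTML
     for the parse-html plugin -->
<property>
  <name>parser.html.impl</name>
  <value>tagsoup</value>
</property>
```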

Region server heap usage is currently 4G out of 12G, with about 225
regions assigned. I am monitoring the system with Ganglia and did not
see anything suspicious (being a hadoop/hbase noob). I am about to
increase gora.buffer.read.limit for a new test. On the other hand, the
default of 10000 seems very reasonable to me.
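For the test described above, the buffer limit can be raised the same way in conf/nutch-site.xml (a sketch; the value 50000 is just an example to try, not a recommendation):

```xml
<!-- conf/nutch-site.xml: raise the Gora scan buffer from its
     default of 10000 rows per read -->
<property>
  <name>gora.buffer.read.limit</name>
  <value>50000</value>
</property>
```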

Martin

On Fri, 2013-07-19 at 21:29 -0700, Lewis John Mcgibbney wrote:
> Hi Martin,
> Have you checked that all mappers are working while the parsing job is running?
> How many URLs are you trying to parse here?
> 
> On Friday, July 19, 2013, Martin Aesch <[email protected]> wrote:
> > Dear nutchers,
> >
> > Having Nutch 2.2.1/HBase 0.90.6/Hadoop 1.1.2/6Mappers/6Reducers/Core
> > i7-3770/32GB (no swap)/2x3TB
> >
> > When I parse (in the mapper, 6 simultaneously running map tasks), it is
> > very slow. Max load is ~1.5, max iowait is 5%, max CPU per task is only
> > 30%, and max CPU for hmaster is about 30%. iotop consequently also shows
> > low numbers.
> >
> > Since parsing is a CPU-intensive job and all the I/O numbers are very
> > low, I wonder why parsing does not run faster and with full CPU
> > usage. It really takes a long time to finish. Where might the
> > bottleneck be?
> >
> > Thanks for any advice,
> > Martin
> >
> 
