Hi Martin,

On Saturday, July 20, 2013, Martin Aesch <[email protected]>
wrote:
> I have about 25K URLs per map task and around 8M URLs total.
> All 6 mappers run and produce continuous output. The aggregated
> parse rate is < 100 URLs/sec.

Wow, that is painfully slow indeed. This sounds similar to the problem
folks were reporting prior to the 2.2.1 release.

> What I did now is I replaced neko with tagsoup in nutch-site.xml and
> resumed the parsing. As expected, I now mostly see "Skipping ... already
> parsed". The aggregated parse rate is the same, less than 100 URLs/sec.
> Load is now < 1, CPU is 95% idle. It looks somehow as if the mapper
> tasks do not get enough input.

Wow... this is not the same as what we were seeing before. Parsing is
also heavy on CPU, so something is definitely fishy.
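For reference, the parser switch described above would look something like this in nutch-site.xml. The `parser.html.impl` property name and its accepted values are taken from nutch-default.xml; verify them against your Nutch version:

```xml
<!-- nutch-site.xml: switch the HTML parser implementation from neko
     to tagsoup (property as defined in nutch-default.xml; check your
     Nutch version) -->
<property>
  <name>parser.html.impl</name>
  <value>tagsoup</value>
  <description>HTML parser implementation; recognized keywords
  are "neko" and "tagsoup".</description>
</property>
```

Since Martin sees the same parse rate with either parser while the CPU sits idle, the bottleneck is presumably not in the parser itself.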

> Region server heap usage is "now" 4G out of 12G with about 225 regions
> assigned. I am monitoring my system with Ganglia and did not see
> anything suspicious (being a Hadoop/HBase noob). I am about to
> increase gora.buffer.read.limit for a new test. On the other hand,
> the default of 10000 seems very reasonable to me.

Yes, it is a very reasonable default. Off topic: for injecting and some
other tasks I actually found that a lower value of 1000 for Gora writes
(with the Cassandra backend) gave faster overall completion times.
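A sketch of how those buffer sizes could be set in nutch-site.xml. `gora.buffer.read.limit` (default 10000) is the property mentioned above; `gora.buffer.write.limit` is assumed to be the analogous write-side property, so verify both names against your nutch-default.xml:

```xml
<!-- nutch-site.xml sketch: Gora read/write buffer tuning.
     gora.buffer.read.limit controls how many records Gora fetches per
     read batch; gora.buffer.write.limit (assumed write-side
     counterpart) controls how many records are buffered before a
     flush to the backend. -->
<property>
  <name>gora.buffer.read.limit</name>
  <value>10000</value>
</property>
<property>
  <name>gora.buffer.write.limit</name>
  <value>1000</value>
</property>
```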

Is the data all local, or are you having to send it over the network?
I am merely trying to see why so few URLs are being processed.

-- 
*Lewis*
