Hi Martin,

On Saturday, July 20, 2013, Martin Aesch <[email protected]> wrote:

> I have about 25K URLs per map task and around 8M URLs total. All 6
> mappers run and have continuous output. The aggregated parse rate is
> < 100 URLs/sec.
Wow, this is painfully slow indeed. It is similar to the problem folks were reporting prior to the 2.2.1 release.

> What I did now is I replaced neko by tagsoup in nutch-site.xml and
> resumed the parsing. I see now, as expected, mostly "Skipping ...
> already parsed". The aggregated parse rate is the same, less than
> 100 URLs/sec. Load is now < 1, CPU is 95% idle. It looks as if the
> mapper tasks do not get enough input.

Wow... this is not the same as what we were seeing before. Parsing is also heavy on CPU... something is definitely fishy here.

> Region server heap usage is "now" 4G out of 12G with about 225 regions
> assigned. I am monitoring my system with Ganglia and did not see
> anything suspicious (being a Hadoop/HBase noob). I am on the way to
> increase gora.buffer.read.limit for a new test. On the other hand, the
> default of 10000 seems very reasonable to me.

Yes, it is a very reasonable default. Off topic: for injecting and some other tasks I actually found that a lower value of 1000 for Gora writes (with the Cassandra backend) gave a faster overall completion time.

Is the data all local, or does it have to be sent over the network? I am merely trying to work out why such low numbers of URLs are being processed.

-- 
*Lewis*
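P.S. For anyone following along, both of the changes discussed above are overrides that would go in conf/nutch-site.xml. A sketch, with the property names as I recall them from nutch-default.xml and the Gora docs, and the buffer value purely illustrative:

```xml
<!-- Sketch only: property names from memory, values illustrative. -->

<!-- Switch the parse-html plugin's HTML parser from neko to tagsoup. -->
<property>
  <name>parser.html.impl</name>
  <value>tagsoup</value>
</property>

<!-- Raise the Gora read buffer (default 10000, per the thread above)
     to test whether the mapper tasks are being starved for input. -->
<property>
  <name>gora.buffer.read.limit</name>
  <value>20000</value>
</property>
```

Settings in nutch-site.xml take precedence over nutch-default.xml, so no changes to the shipped defaults file should be needed.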

