Hi Lewis,

I am running in pseudo-distributed mode; everything is local. I have attached
some output from my successful but slow parse job (nutch parse -resume):
altogether 82M URLs in the webpage table, around 8M in the batch as I said,
of which one third was unparsed. It took 24 hours to complete. Again, low I/O
and low CPU usage.
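For completeness, the resume invocation looked roughly like this (batch id is a placeholder; exact option order may differ depending on the Nutch build):

```shell
# Resume parsing of a partially parsed batch;
# rows already parsed are skipped ("Skipping ... already parsed")
bin/nutch parse <batchId> -resume
```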

The next trials were again parse resumes, but since everything is already
parsed, absolutely no parsing took place ("Skipping..."). I deactivated the
firewall in case something was not bound to localhost; it made no difference.
I also tried gora.buffer.read.limit = 100000, again with no difference.
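In case it matters, this is how I set it in nutch-site.xml (value shown is the one I tried; the default is 10000):

```xml
<!-- nutch-site.xml: raise the Gora read buffer for scans -->
<property>
  <name>gora.buffer.read.limit</name>
  <value>100000</value>
</property>
```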

I checked the preceding generate job (the webtable was also 82M URLs in
total). It completed in 1 hour. I started a fresh generate job, which is on
its way and should finish in the same order of magnitude, around 1 hour.

GeneratorMapper and ParserMapper look quite similar in terms of
computational effort, as far as I can judge (which is obviously not very far).
On the other hand, I see that ParserJob adds more fields for
ParserMapper/GoraMapper to read, including the actual content.
(Regardless of "my" issue, does this mean that ParserMapper really reads every
piece of content in the webpage table?)
But as I said, my parse jobs were neither I/O-bound nor CPU-bound.

What else could I try?


Martin

--------------------------------------------------------------------------------
13/07/21 02:28:58 INFO mapred.JobClient:  map 99% reduce 0%
13/07/21 02:33:32 INFO mapred.JobClient:  map 100% reduce 0%
13/07/21 02:33:32 INFO mapred.JobClient: Job complete: job_201307121441_0022
13/07/21 02:33:32 INFO mapred.JobClient: Counters: 20
13/07/21 02:33:32 INFO mapred.JobClient:   ParserStatus
13/07/21 02:33:32 INFO mapred.JobClient:     failed=21314
13/07/21 02:33:32 INFO mapred.JobClient:     success=2107033
13/07/21 02:33:32 INFO mapred.JobClient:   Job Counters 
13/07/21 02:33:32 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=521752464
13/07/21 02:33:32 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/21 02:33:32 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/21 02:33:32 INFO mapred.JobClient:     Launched map tasks=223
13/07/21 02:33:32 INFO mapred.JobClient:     Data-local map tasks=223
13/07/21 02:33:32 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
13/07/21 02:33:32 INFO mapred.JobClient:   File Output Format Counters 
13/07/21 02:33:32 INFO mapred.JobClient:     Bytes Written=0
13/07/21 02:33:32 INFO mapred.JobClient:   FileSystemCounters
13/07/21 02:33:32 INFO mapred.JobClient:     HDFS_BYTES_READ=258415
13/07/21 02:33:32 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=17061620
13/07/21 02:33:32 INFO mapred.JobClient:   File Input Format Counters 
13/07/21 02:33:32 INFO mapred.JobClient:     Bytes Read=0
13/07/21 02:33:32 INFO mapred.JobClient:   Map-Reduce Framework
13/07/21 02:33:32 INFO mapred.JobClient:     Map input records=82277147
13/07/21 02:33:32 INFO mapred.JobClient:     Physical memory (bytes) snapshot=69987168256
13/07/21 02:33:32 INFO mapred.JobClient:     Spilled Records=0
13/07/21 02:33:32 INFO mapred.JobClient:     CPU time spent (ms)=46213980
13/07/21 02:33:32 INFO mapred.JobClient:     Total committed heap usage (bytes)=79857254400
13/07/21 02:33:32 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=373927632896
13/07/21 02:33:32 INFO mapred.JobClient:     Map output records=2657067
13/07/21 02:33:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=258415
13/07/21 02:33:32 INFO parse.ParserJob: ParserJob: success

-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
Reply-to: [email protected]
To: [email protected] <[email protected]>
Subject: Re: Nutch 2.2.1 parse (slow?)
Date: Sat, 20 Jul 2013 21:34:10 -0700

Hi Martin,

On Saturday, July 20, 2013, Martin Aesch <[email protected]>
wrote:
> I have about 25K URLs per map task and around 8M URLs total
> All 6 mappers run and have continuously output. The aggregated parse
> rate is < 100URLs/sec.

wow this is painfully slow indeed. This is similar to the problem
folks were reporting prior to the 2.2.1 release.

> What I did now is I replaced neko by tagsoup in nutch-site.xml and
> resumed the parsing. I see now as expected mostly Skipping ... already
> parsed. The aggregated parse rate is the same, less than 100 URLs/sec.
> Load is now < 1, cpu is 95% idle. It looks as if the mapper tasks do
> not get enough input.

wow...  this is not the same as we were seeing before. Parsing is also
heavy on cpu... something defo fishy.

> Region server heap usage is "now" 4G out of 12G with about 225 regions
> assigned. I am monitoring my system with ganglia and did not see
> anything suspicious (being a hadoop/hbase noob). I am on the way to
> increase gora.buffer.read.limit for a new test. On the other hand, the
> default of 10000 seems to me very reasonable.

Yes it is a very reasonable default. Off topic: for Injecting and some other
tasks I actually found that a lower value of 1000 for gora writes (with the
Cassandra backend) provided faster overall completion time.

Is the data all local or are you having to send it over the network?
I am merely trying to see why such low levels of URLs are being processed.



