Hi,

I have Nutch running on a Hadoop cluster. Inject, generate, and fetch are working 
fine and are executed on multiple nodes. However, we seem to get only one mapper 
for the parse job, so the parse step runs on a single node and takes a minute or 
so to parse one page. Please see the log below (1 min 41 s to parse thetimes).

2013-02-13 13:46:02,658 INFO org.apache.nutch.parse.ParserJob: Parsing http://www.thetimes.co.uk/tto/news/
2013-02-13 13:47:43,415 INFO org.apache.nutch.parse.ParserJob: Parsing http://online.wsj.com/home-page
I am using the parse-html plugin to do the job, with Cassandra as the DB. When 
running locally, all is fine.
I am running parse with this:
hadoop jar apache-nutch-2.1-SNAPSHOT.job org.apache.nutch.parse.ParserJob $id
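For reference, this is the kind of invocation I would use to suggest a higher map count to the framework. This assumes ParserJob accepts generic Hadoop options via ToolRunner, and as far as I know mapred.map.tasks is only a hint (not a hard setting) in Hadoop 1.x, so I am not sure it actually changes anything here:

```shell
# Hypothetical variant: pass a generic -D option before the job arguments
# to suggest 4 map tasks (mapred.map.tasks is only a hint to Hadoop 1.x)
hadoop jar apache-nutch-2.1-SNAPSHOT.job org.apache.nutch.parse.ParserJob \
  -D mapred.map.tasks=4 $id
```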


Also including the log from the jobtracker:

Hadoop job_201302131311_0006
Job Name: parse
Job-ACLs: All users are allowed
Status: Succeeded
Started at: Wed Feb 13 13:44:06 GMT 2013
Finished at: Wed Feb 13 14:06:30 GMT 2013
Finished in: 22mins, 23sec



Counter                                                        Map            Reduce  Total
ParserStatus          success                                  13             0       13
                      notparsed                                1              0       1
Job Counters          SLOTS_MILLIS_MAPS                        0              0       1,335,834
                      Total time spent by all reduces
                        waiting after reserving slots (ms)     0              0       0
                      Total time spent by all maps
                        waiting after reserving slots (ms)     0              0       0
                      Launched map tasks                       0              0       1
                      SLOTS_MILLIS_REDUCES                     0              0       0
File Output Format
  Counters            Bytes Written                            0              0       0
File Input Format
  Counters             Bytes Read                              0              0       0
FileSystemCounters    HDFS_BYTES_READ                          689            0       689
                      FILE_BYTES_WRITTEN                       32,142         0       32,142
Map-Reduce Framework  Map input records                        138            0       138
                      Physical memory (bytes) snapshot         417,538,048    0       417,538,048
                      Spilled Records                          0              0       0
                      Total committed heap usage (bytes)       186,449,920    0       186,449,920
                      CPU time spent (ms)                      1,379,340      0       1,379,340
                      Virtual memory (bytes) snapshot          1,163,165,696  0       1,163,165,696
                      SPLIT_RAW_BYTES                          689            0       689
                      Map output records                       14             0       14