I'm using Nutch 1.5 to crawl 30 sites in deploy mode on Amazon Elastic Map
Reduce with 30 m1.small machines with the following settings:
Parameter Value
HADOOP_JOBTRACKER_HEAPSIZE 512
HADOOP_NAMENODE_HEAPSIZE 512
HADOOP_TASKTRACKER_HEAPSIZE 256
HADOOP_DATANODE_HEAPSIZE 128
mapred.child.java.opts -Xmx512m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1
topN is 1,000,000 and the depth is 10
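For what it's worth, a back-of-the-envelope check of the per-node memory budget (assuming m1.small's roughly 1.7 GB of RAM, and assuming each node runs both a TaskTracker and a DataNode -- that layout is my guess, not something I've verified) suggests the configured heaps alone over-commit the box, before counting JVM overhead beyond -Xmx:

```python
# Rough per-node memory budget for the settings above.
# total_ram_mb (~1.7 GB for an m1.small) and the daemon mix per node
# are assumptions; adjust for the actual EMR cluster layout.
total_ram_mb = 1700

map_slots = 2         # mapred.tasktracker.map.tasks.maximum
reduce_slots = 1      # mapred.tasktracker.reduce.tasks.maximum
child_heap_mb = 512   # mapred.child.java.opts = -Xmx512m
tasktracker_mb = 256  # HADOOP_TASKTRACKER_HEAPSIZE
datanode_mb = 128     # HADOOP_DATANODE_HEAPSIZE

committed_mb = ((map_slots + reduce_slots) * child_heap_mb
                + tasktracker_mb + datanode_mb)
print(committed_mb)                  # 1920
print(committed_mb > total_ram_mb)   # True: heaps alone exceed RAM
```

So if all three task slots fill at once, the node is already past physical memory on heap limits alone.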
At depth=3, the CrawlDb update job starts to throw errors like the following:
12/06/20 00:31:58 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000161_0, Status : FAILED
Error: Java heap space
12/06/20 00:31:58 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000165_0, Status : FAILED
Error: Java heap space
12/06/20 00:31:58 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000168_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: map 42% reduce 2%
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000170_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000152_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000171_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000153_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000172_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000135_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:01 INFO mapred.JobClient: map 43% reduce 3%
12/06/20 00:32:01 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000160_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:01 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000126_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:02 INFO mapred.JobClient: map 45% reduce 3%
12/06/20 00:32:02 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000163_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:03 INFO mapred.JobClient: map 46% reduce 3%
12/06/20 00:32:04 INFO mapred.JobClient: map 49% reduce 3%
12/06/20 00:32:04 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000141_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:04 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000137_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:05 INFO mapred.JobClient: map 50% reduce 3%
12/06/20 00:32:05 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000181_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:05 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000184_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:05 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000185_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:06 INFO mapred.JobClient: map 52% reduce 3%
12/06/20 00:32:06 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000193_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:07 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000159_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:08 INFO mapred.JobClient: map 54% reduce 3%
12/06/20 00:32:09 INFO mapred.JobClient: map 55% reduce 3%
12/06/20 00:32:09 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000188_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of
255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
My Nutch 1.5 build has the NUTCH-702 patch applied, and I've set both
db.update.max.inlinks and db.max.inlinks to 10.
URL normalization and URL filtering are also turned off for the CrawlDb
update step; both are done in the parse step that runs before the update.
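For reference, the inlink limits are set in my nutch-site.xml roughly like this (property names as they appear in nutch-default.xml; the snippet is a sketch of my config, not a verbatim copy):

```xml
<!-- nutch-site.xml: cap inlinks considered during CrawlDb update -->
<property>
  <name>db.update.max.inlinks</name>
  <value>10</value>
</property>
<property>
  <name>db.max.inlinks</name>
  <value>10</value>
</property>
```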
Can I tweak some settings to use less memory?
Do I need to use larger machines?
Do I need to use more machines?
Any other insights?
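In case it matters, the knobs I've been considering lowering would go in mapred-site.xml, something like the sketch below (the values are guesses I haven't tested; io.sort.mb seems relevant because the map-side sort buffer is allocated inside the child's -Xmx):

```xml
<!-- mapred-site.xml sketch: untested guesses, not a working config -->
<property>
  <name>io.sort.mb</name>
  <value>50</value> <!-- default is 100; lives inside the child heap -->
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value> <!-- fewer concurrent children per m1.small -->
</property>
```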
I'd really appreciate help in solving this issue.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-1-5-Error-Java-heap-space-during-MAP-step-of-CrawlDb-update-tp3990448.html
Sent from the Nutch - User mailing list archive at Nabble.com.