I'm using Nutch 1.5 to crawl 30 sites in deploy mode on Amazon Elastic Map
Reduce with 30 m1.small machines with the following settings:
Parameter Value
HADOOP_JOBTRACKER_HEAPSIZE 512
HADOOP_NAMENODE_HEAPSIZE 512
HADOOP_TASKTRACKER_HEAPSIZE 256
HADOOP_DATANODE_HEAPSIZE 128
mapred.child.java.opts -Xmx512m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1
topN is 1,000,000 and the depth is 10
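For what it's worth, a back-of-the-envelope check of the per-node memory budget (assuming m1.small's roughly 1.7 GB of RAM, and assuming each node runs both a TaskTracker and a DataNode -- that layout is my guess, not something I've verified) suggests the configured heaps alone over-commit the box, before counting JVM overhead beyond -Xmx:

```python
# Rough per-node memory budget for the settings above.
# total_ram_mb (~1.7 GB for an m1.small) and the daemon mix per node
# are assumptions; adjust for the actual EMR cluster layout.
total_ram_mb = 1700

map_slots = 2         # mapred.tasktracker.map.tasks.maximum
reduce_slots = 1      # mapred.tasktracker.reduce.tasks.maximum
child_heap_mb = 512   # mapred.child.java.opts = -Xmx512m
tasktracker_mb = 256  # HADOOP_TASKTRACKER_HEAPSIZE
datanode_mb = 128     # HADOOP_DATANODE_HEAPSIZE

committed_mb = ((map_slots + reduce_slots) * child_heap_mb
                + tasktracker_mb + datanode_mb)
print(committed_mb)                  # 1920
print(committed_mb > total_ram_mb)   # True: heaps alone exceed RAM
```

So if all three task slots fill at once, the node is already past physical memory on heap limits alone.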
At depth=3, the CrawlDb update job starts to throw errors like the following:
12/06/20 00:31:58 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000161_0, Status : FAILED
Error: Java heap space
12/06/20 00:31:58 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000165_0, Status : FAILED
Error: Java heap space
12/06/20 00:31:58 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000168_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: map 42% reduce 2%
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000170_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000152_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000171_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000153_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000172_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000135_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:01 INFO mapred.JobClient: map 43% reduce 3%
12/06/20 00:32:01 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000160_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:01 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000126_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:02 INFO mapred.JobClient: map 45% reduce 3%
12/06/20 00:32:02 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000163_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:03 INFO mapred.JobClient: map 46% reduce 3%
12/06/20 00:32:04 INFO mapred.JobClient: map 49% reduce 3%
12/06/20 00:32:04 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000141_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:04 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000137_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:05 INFO mapred.JobClient: map 50% reduce 3%
12/06/20 00:32:05 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000181_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:05 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000184_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:05 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000185_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:06 INFO mapred.JobClient: map 52% reduce 3%
12/06/20 00:32:06 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000193_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:07 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000159_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:08 INFO mapred.JobClient: map 54% reduce 3%
12/06/20 00:32:09 INFO mapred.JobClient: map 55% reduce 3%
12/06/20 00:32:09 INFO mapred.JobClient: Task Id :
attempt_201206192134_0022_m_000188_0, Status : FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of
255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
My Nutch 1.5 build has the NUTCH-702 patch applied, and I've set both
db.update.max.inlinks and db.max.inlinks to 10.
URL normalization and URL filtering are also turned off for the CrawlDb
update step; both are done in the parse step that runs before the update.
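For reference, the inlink limits are set in my nutch-site.xml roughly like this (property names as they appear in nutch-default.xml; the snippet is a sketch of my config, not a verbatim copy):

```xml
<!-- nutch-site.xml: cap inlinks considered during CrawlDb update -->
<property>
  <name>db.update.max.inlinks</name>
  <value>10</value>
</property>
<property>
  <name>db.max.inlinks</name>
  <value>10</value>
</property>
```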
Can I tweak some settings to use less memory?
Do I need to use larger machines?
Do I need to use more machines?
Any other insights?
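In case it matters, the knobs I've been considering lowering would go in mapred-site.xml, something like the sketch below (the values are guesses I haven't tested; io.sort.mb seems relevant because the map-side sort buffer is allocated inside the child's -Xmx):

```xml
<!-- mapred-site.xml sketch: untested guesses, not a working config -->
<property>
  <name>io.sort.mb</name>
  <value>50</value> <!-- default is 100; lives inside the child heap -->
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value> <!-- fewer concurrent children per m1.small -->
</property>
```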
I'd really appreciate help in solving this issue.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Nutch-1-5-Error-Java-heap-space-during-MAP-step-of-CrawlDb-update-tp3990448.html
Sent from the Nutch - User mailing list archive at Nabble.com.