Thanks for the reply.
The MAP tasks are the ones failing, and most of them simply fail with:
attempt_201206200559_0032_m_000313_0 task_201206200559_0032_m_000313
10.76.89.196 FAILED
Error: Java heap space
Some of the MAP tasks have a trace as follows:
attempt_201206200559_0032_m_000322_1 task_201206200559_0032_m_000322
10.242.110.38 FAILED
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
Eventually, after too many failures, the job aborts:
12/06/20 10:53:21 INFO mapred.JobClient: Job Failed: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201206200559_0032_m_000434
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1312)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Crawl settings:
seed urls - 30
topN - 1,000,000
depth - 10 (Execution crashes at depth 6)
Cluster:
Amazon Elastic Map Reduce
Machines:
type - c1.medium
number - 70
JAVA settings:
HADOOP_JOBTRACKER_HEAPSIZE 768
HADOOP_NAMENODE_HEAPSIZE 512
HADOOP_TASKTRACKER_HEAPSIZE 256
HADOOP_DATANODE_HEAPSIZE 128
mapred.child.java.opts -Xmx512m
mapred.tasktracker.map.tasks.maximum 2
mapred.tasktracker.reduce.tasks.maximum 1
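For reference, here is a rough worst-case heap budget for one worker node with these settings. It is only a sketch under two assumptions that are not from the cluster itself: that a c1.medium has about 1.7 GiB of RAM (taken from the EC2 instance specs), and that the HEAPSIZE values are in MB. It sums maximum heap sizes, not measured usage, and it ignores non-heap JVM overhead.

```python
# Rough worst-case heap budget for one worker node, using the settings above.
# ASSUMPTION: c1.medium has about 1.7 GiB of RAM (from the EC2 instance specs).
NODE_RAM_MB = 1.7 * 1024

# ASSUMPTION: the *_HEAPSIZE values are expressed in MB.
TASKTRACKER_HEAP_MB = 256   # HADOOP_TASKTRACKER_HEAPSIZE
DATANODE_HEAP_MB = 128      # HADOOP_DATANODE_HEAPSIZE
CHILD_HEAP_MB = 512         # mapred.child.java.opts -Xmx512m
MAP_SLOTS = 2               # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS = 1            # mapred.tasktracker.reduce.tasks.maximum


def worst_case_heap_mb():
    """Sum of maximum heaps if every task slot on the node is busy."""
    daemons = TASKTRACKER_HEAP_MB + DATANODE_HEAP_MB
    children = (MAP_SLOTS + REDUCE_SLOTS) * CHILD_HEAP_MB
    return daemons + children


total = worst_case_heap_mb()
print(f"worst-case heap: {total} MB, node RAM: {NODE_RAM_MB:.0f} MB")
print(f"headroom: {NODE_RAM_MB - total:.0f} MB")
```

If those assumptions hold, the maximum heaps add up to more than the node's physical memory when all three slots are busy. That is a separate issue from the "Java heap space" error, which is raised when a single task hits its own 512 MB limit.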
CrawlDB stats after the crash:
12/06/20 09:26:08 INFO mapred.JobClient: CrawlDB status
12/06/20 09:26:08 INFO mapred.JobClient: db_redir_temp=2117
12/06/20 09:26:08 INFO mapred.JobClient: db_redir_perm=11542
12/06/20 09:26:08 INFO mapred.JobClient: db_unfetched=2616086
12/06/20 09:26:08 INFO mapred.JobClient: db_gone=2722
12/06/20 09:26:08 INFO mapred.JobClient: db_fetched=238775
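To put the size of the CrawlDb in perspective, the counters above can be summed. This is just a small sketch; the helper name `parse_counts` is made up for illustration and is not part of Nutch or Hadoop.

```python
import re

# The CrawlDB status counters from the log above, copied verbatim.
LOG = """\
12/06/20 09:26:08 INFO mapred.JobClient: db_redir_temp=2117
12/06/20 09:26:08 INFO mapred.JobClient: db_redir_perm=11542
12/06/20 09:26:08 INFO mapred.JobClient: db_unfetched=2616086
12/06/20 09:26:08 INFO mapred.JobClient: db_gone=2722
12/06/20 09:26:08 INFO mapred.JobClient: db_fetched=238775
"""


def parse_counts(log):
    """Return {status_name: count} for every 'db_xxx=N' pair in the log text."""
    return {name: int(value) for name, value in re.findall(r"(db_\w+)=(\d+)", log)}


counts = parse_counts(LOG)
total = sum(counts.values())
print(f"total URLs in CrawlDb: {total}")
print(f"unfetched share: {counts['db_unfetched'] / total:.1%}")
```

So the update job is working on roughly 2.9 million entries, the large majority of them still unfetched.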