Thanks for the reply.

The MAP tasks are the ones failing, and most of them simply fail with:

attempt_201206200559_0032_m_000313_0 task_201206200559_0032_m_000313
10.76.89.196   FAILED
Error: Java heap space


Some of the MAP tasks have a trace as follows:

attempt_201206200559_0032_m_000322_1 task_201206200559_0032_m_000322
10.242.110.38 FAILED
java.lang.Throwable: Child Error
  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 255.
  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)


Eventually, after too many failures, the job itself fails:

12/06/20 10:53:21 INFO mapred.JobClient: Job Failed: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201206200559_0032_m_000434

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1312)
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:105)
        at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:63)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:140)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)



Crawl settings:
seed urls - 30
topN - 1,000,000
depth - 10 (Execution crashes at depth 6)

Cluster: 
Amazon Elastic Map Reduce

Machines:
type - c1.medium 
number - 70

JAVA settings:
HADOOP_JOBTRACKER_HEAPSIZE      768
HADOOP_NAMENODE_HEAPSIZE        512
HADOOP_TASKTRACKER_HEAPSIZE     256
HADOOP_DATANODE_HEAPSIZE        128
mapred.child.java.opts  -Xmx512m
mapred.tasktracker.map.tasks.maximum    2
mapred.tasktracker.reduce.tasks.maximum 1
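
For reference, here is a rough sketch of the worst-case JVM heap that these settings allow on a single slave node. It only adds up the -Xmx style limits listed above. The figure for physical RAM on a c1.medium (about 1.7 GB) is my assumption and is not taken from the cluster itself, and the function name is just illustrative. Note that "Java heap space" means a child JVM hit its own 512 MB limit, so this calculation only indicates how much room there might be to raise that limit.

```python
# Heap limits in MB, copied from the settings listed above.
DAEMON_HEAP_MB = {"tasktracker": 256, "datanode": 128}  # daemons on a slave node
CHILD_HEAP_MB = 512   # mapred.child.java.opts -Xmx512m
MAP_SLOTS = 2         # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS = 1      # mapred.tasktracker.reduce.tasks.maximum

# ASSUMPTION (not from the post): a c1.medium has roughly 1.7 GB of RAM.
ASSUMED_NODE_RAM_MB = int(1.7 * 1024)


def worst_case_heap_mb():
    """Sum of all heap limits if every task slot is busy at once."""
    task_heap = CHILD_HEAP_MB * (MAP_SLOTS + REDUCE_SLOTS)
    return task_heap + sum(DAEMON_HEAP_MB.values())


if __name__ == "__main__":
    total = worst_case_heap_mb()
    print("worst-case heap limits per slave node (MB):", total)
    print("assumed physical RAM (MB):", ASSUMED_NODE_RAM_MB)
    print("headroom (MB):", ASSUMED_NODE_RAM_MB - total)
```

This ignores non-heap JVM overhead and the operating system, so the real footprint would be somewhat higher than the sum of the heap limits.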


CrawlDB stats after the crash:
12/06/20 09:26:08 INFO mapred.JobClient:   CrawlDB status
12/06/20 09:26:08 INFO mapred.JobClient:     db_redir_temp=2117
12/06/20 09:26:08 INFO mapred.JobClient:     db_redir_perm=11542
12/06/20 09:26:08 INFO mapred.JobClient:     db_unfetched=2616086
12/06/20 09:26:08 INFO mapred.JobClient:     db_gone=2722
12/06/20 09:26:08 INFO mapred.JobClient:     db_fetched=238775
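
To give a sense of the CrawlDb size at the point of the crash, the counters above can be totalled with a small sketch like the following. The dictionary simply restates the logged values; the variable names are mine.

```python
# Status counters copied from the CrawlDB stats above.
crawldb_status = {
    "db_redir_temp": 2117,
    "db_redir_perm": 11542,
    "db_unfetched": 2616086,
    "db_gone": 2722,
    "db_fetched": 238775,
}

# Total number of URLs accounted for by these status counters.
total_urls = sum(crawldb_status.values())

# Share of URLs that have actually been fetched so far.
fetched_share = crawldb_status["db_fetched"] / total_urls

if __name__ == "__main__":
    print("total URLs in CrawlDb:", total_urls)
    print("fetched share: {:.1%}".format(fetched_share))
```

So the update job is working over roughly 2.9 million entries, of which only about 8% have been fetched.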

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Nutch-1-5-Error-Java-heap-space-during-MAP-step-of-CrawlDb-update-tp3990448p3990579.html
Sent from the Nutch - User mailing list archive at Nabble.com.
