Can you tell me how to unsubscribe from this mailing list? I get a lot of mail from "nutch.apache.org" and I don't want it anymore.

On 2012-06-20, "sidbatra" <[email protected]> wrote:

-----Original Message-----
From: "sidbatra" <[email protected]>
Sent: Wednesday, June 20, 2012
To: "user" <[email protected]>
Subject: Nutch 1.5 - "Error: Java heap space" during MAP step of CrawlDb update

I'm using Nutch 1.5 to crawl 30 sites in deploy mode on Amazon Elastic Map Reduce with 30 m1.small machines and the following settings:

Parameter                                  Value
HADOOP_JOBTRACKER_HEAPSIZE                 512
HADOOP_NAMENODE_HEAPSIZE                   512
HADOOP_TASKTRACKER_HEAPSIZE                256
HADOOP_DATANODE_HEAPSIZE                   128
mapred.child.java.opts                     -Xmx512m
mapred.tasktracker.map.tasks.maximum       2
mapred.tasktracker.reduce.tasks.maximum    1

topN is 1,000,000 and the depth is 10.

At depth=3 the CrawlDb update job starts to throw errors as follows:

12/06/20 00:31:58 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000161_0, Status : FAILED
Error: Java heap space
12/06/20 00:31:58 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000165_0, Status : FAILED
Error: Java heap space
12/06/20 00:31:58 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000168_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: map 42% reduce 2%
12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000170_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000152_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000171_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000153_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000172_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000135_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:01 INFO mapred.JobClient: map 43% reduce 3%
12/06/20 00:32:01 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000160_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:01 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000126_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:02 INFO mapred.JobClient: map 45% reduce 3%
12/06/20 00:32:02 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000163_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:03 INFO mapred.JobClient: map 46% reduce 3%
12/06/20 00:32:04 INFO mapred.JobClient: map 49% reduce 3%
12/06/20 00:32:04 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000141_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:04 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000137_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:05 INFO mapred.JobClient: map 50% reduce 3%
12/06/20 00:32:05 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000181_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:05 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000184_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:05 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000185_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:06 INFO mapred.JobClient: map 52% reduce 3%
12/06/20 00:32:06 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000193_0, Status : FAILED
Error: Java heap space
12/06/20 00:32:07 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000159_1, Status : FAILED
Error: Java heap space
12/06/20 00:32:08 INFO mapred.JobClient: map 54% reduce 3%
12/06/20 00:32:09 INFO mapred.JobClient: map 55% reduce 3%
12/06/20 00:32:09 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000188_0, Status : FAILED
java.lang.Throwable: Child Error
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

Nutch 1.5 has the patch NUTCH-702 applied, and I've set both db.update.max.inlinks and db.max.inlinks to 10. URL normalization and URL filtering are turned off for the CrawlDb update step; they are done in the parse step prior to the update step.

Can I tweak some settings to use less memory?
Do I need to use larger machines?
Do I need to use more machines?
Any other insights?

I'd really appreciate help in solving this issue.

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-1-5-Error-Java-heap-space-during-MAP-step-of-CrawlDb-update-tp3990448.html
Sent from the Nutch - User mailing list archive at Nabble.com.
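
For reference, the heap settings quoted in the message above can be tallied per worker node. This is only a rough sketch: it assumes an m1.small instance has about 1.7 GB of RAM (a figure not stated in the message), that the JobTracker and NameNode run on the master rather than on the workers, and it counts configured maximum heap only, ignoring JVM non-heap overhead and the operating system. The names in the code are illustrative, not Hadoop APIs.

```python
# Per-worker heap tally for the settings quoted in the message (values in MB).
DAEMON_HEAP_MB = {
    "tasktracker": 256,  # HADOOP_TASKTRACKER_HEAPSIZE
    "datanode": 128,     # HADOOP_DATANODE_HEAPSIZE
}
CHILD_HEAP_MB = 512      # mapred.child.java.opts -Xmx512m
MAP_SLOTS = 2            # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS = 1         # mapred.tasktracker.reduce.tasks.maximum

# ASSUMPTION: approximate RAM of one m1.small worker; not given in the message.
ASSUMED_NODE_RAM_MB = 1700


def worst_case_heap_mb(daemon_heap_mb, child_heap_mb, map_slots, reduce_slots):
    """Sum of maximum heaps if every task slot is busy at the same time."""
    return sum(daemon_heap_mb.values()) + child_heap_mb * (map_slots + reduce_slots)


total_mb = worst_case_heap_mb(DAEMON_HEAP_MB, CHILD_HEAP_MB, MAP_SLOTS, REDUCE_SLOTS)
headroom_mb = ASSUMED_NODE_RAM_MB - total_mb

print(f"worst-case configured heap: {total_mb} MB")
print(f"headroom under assumed RAM: {headroom_mb} MB")
```

Under these assumptions the configured heaps already add up to more than the node's memory when all three task slots are busy. The "Java heap space" error itself means a map task exhausted its own 512 MB heap, so the tally does not explain the failure; it only suggests that raising -Xmx on these machines would probably require lowering the number of task slots as well.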

