Do you have a stacktrace of a failed task?

On Wed, Jun 20, 2012 at 3:08 AM, sidbatra <[email protected]> wrote:
> I'm using Nutch 1.5 to crawl 30 sites in deploy mode on Amazon Elastic
> MapReduce with 30 m1.small machines and the following settings:
>
>   Parameter                                 Value
>   HADOOP_JOBTRACKER_HEAPSIZE                512
>   HADOOP_NAMENODE_HEAPSIZE                  512
>   HADOOP_TASKTRACKER_HEAPSIZE               256
>   HADOOP_DATANODE_HEAPSIZE                  128
>   mapred.child.java.opts                    -Xmx512m
>   mapred.tasktracker.map.tasks.maximum      2
>   mapred.tasktracker.reduce.tasks.maximum   1
>
> topN is 1,000,000 and the depth is 10.
>
> At depth=3 the CrawlDb update job starts to throw errors as follows:
>
> 12/06/20 00:31:58 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000161_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:31:58 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000165_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:31:58 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000168_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:00 INFO mapred.JobClient: map 42% reduce 2%
> 12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000170_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000152_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000171_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000153_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000172_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:00 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000135_1, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:01 INFO mapred.JobClient: map 43% reduce 3%
> 12/06/20 00:32:01 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000160_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:01 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000126_1, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:02 INFO mapred.JobClient: map 45% reduce 3%
> 12/06/20 00:32:02 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000163_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:03 INFO mapred.JobClient: map 46% reduce 3%
> 12/06/20 00:32:04 INFO mapred.JobClient: map 49% reduce 3%
> 12/06/20 00:32:04 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000141_1, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:04 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000137_1, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:05 INFO mapred.JobClient: map 50% reduce 3%
> 12/06/20 00:32:05 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000181_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:05 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000184_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:05 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000185_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:06 INFO mapred.JobClient: map 52% reduce 3%
> 12/06/20 00:32:06 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000193_0, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:07 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000159_1, Status : FAILED
> Error: Java heap space
> 12/06/20 00:32:08 INFO mapred.JobClient: map 54% reduce 3%
> 12/06/20 00:32:09 INFO mapred.JobClient: map 55% reduce 3%
> 12/06/20 00:32:09 INFO mapred.JobClient: Task Id : attempt_201206192134_0022_m_000188_0, Status : FAILED
> java.lang.Throwable: Child Error
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
> Caused by: java.io.IOException: Task process exit with nonzero status of 255.
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>
> Nutch 1.5 has the patch NUTCH-702 applied, and I've set
> db.update.max.inlinks and db.max.inlinks both to 10.
>
> The CrawlDb step also has URL normalization and URL filtering turned off;
> this is done in the parse step prior to the update step.
>
> Can I tweak some settings to use less memory?
> Do I need to use larger machines?
> Do I need to use more machines?
> Any other insights?
>
> I'd really appreciate help in solving this issue.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-1-5-Error-Java-heap-space-during-MAP-step-of-CrawlDb-update-tp3990448.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
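[Editor's note: a rough memory budget helps explain the failures. An m1.small instance has about 1.7 GB of RAM; two map slots at -Xmx512m plus the TaskTracker (256 MB) and DataNode (128 MB) heaps already commit roughly 1.4 GB before OS and JVM overhead. A minimal sketch of one way to trade map-slot parallelism for per-task heap in mapred-site.xml; the specific values below are illustrative assumptions, not tested recommendations:]

```xml
<!-- Sketch only: run one map task per node so each child JVM can be
     given a larger heap while total usage stays within ~1.7 GB of RAM. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx768m</value>
</property>
```

On Elastic MapReduce, settings like these are typically applied at cluster launch (for example via a Hadoop-configuration bootstrap action) rather than by editing files on running nodes.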

