It's your topN.  You're pulling too many URLs at a time.  At the end of
each fetch map task, Nutch pulls all of the downloaded data into memory
to merge and sort it.  If that data exceeds your allocated memory,
you'll get a Java heap exception.

There are only two ways to "fix" it: either decrease your topN so each
segment contains fewer URLs, or increase the allocated memory.  The
various sort options don't actually help here.
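For reference, here's a sketch of both options. The topN value and heap
size below are just example numbers, not recommendations; pick values
that fit your cluster:

```shell
# Option 1: generate smaller segments by lowering topN
# (paths are examples; use your own crawldb/segments dirs)
bin/nutch generate crawl/crawldb crawl/segments -topN 50000

# Option 2: give each child task more heap, in conf/mapred-site.xml:
#   <property>
#     <name>mapred.child.java.opts</name>
#     <value>-Xmx1024m</value>
#   </property>
```

Note that the heap setting has to cover io.sort.mb (200 MB in your logs)
plus whatever the task itself needs, so bumping -Xmx without lowering
io.sort.mb only helps if the JVM actually gets that much memory.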


On Wed, Jun 20, 2012 at 10:15 PM, sidbatra <[email protected]> wrote:

> Ok, here are the syslogs from the individual machines. They all have a
> stack trace similar to this:
>
>
>
> 2012-06-21 00:28:40,838 WARN org.apache.hadoop.conf.Configuration (main):
> DEPRECATED: hadoop-site.xml found in the classpath. Usage of
> hadoop-site.xml
> is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml
> to override properties of core-default.xml, mapred-default.xml and
> hdfs-default.xml respectively
> 2012-06-21 00:28:41,826 INFO org.apache.hadoop.util.NativeCodeLoader
> (main):
> Loaded the native-hadoop library
> 2012-06-21 00:28:42,043 WARN
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name ugi
> already exists!
> 2012-06-21 00:28:42,133 INFO org.apache.hadoop.mapred.MapTask (main): Host
> name: ip-10-244-113-139.ec2.internal
> 2012-06-21 00:28:42,164 INFO org.apache.hadoop.util.ProcessTree (main):
> setsid exited with exit code 0
> 2012-06-21 00:28:42,207 INFO org.apache.hadoop.mapred.Task (main):  Using
> ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@17e4dee
> 2012-06-21 00:28:42,303 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory
> (main): Successfully loaded & initialized native-zlib library
> 2012-06-21 00:28:42,303 INFO org.apache.hadoop.io.compress.CodecPool
> (main):
> Got brand-new decompressor
> 2012-06-21 00:28:42,313 INFO org.apache.hadoop.mapred.MapTask (main):
> numReduceTasks: 105
> 2012-06-21 00:28:42,319 INFO org.apache.hadoop.mapred.MapTask (main):
> io.sort.mb = 200
> 2012-06-21 00:28:43,081 INFO org.apache.hadoop.mapred.MapTask (main): data
> buffer = 159383552/199229440
> 2012-06-21 00:28:43,082 INFO org.apache.hadoop.mapred.MapTask (main):
> record
> buffer = 524288/655360
> 2012-06-21 00:28:43,102 WARN
> org.apache.hadoop.io.compress.snappy.LoadSnappy
> (main): Snappy native library is available
> 2012-06-21 00:28:43,102 INFO
> org.apache.hadoop.io.compress.snappy.LoadSnappy
> (main): Snappy native library loaded
> 2012-06-21 00:28:52,114 INFO org.apache.hadoop.mapred.MapTask (main):
> Starting flush of map output
> 2012-06-21 00:28:52,289 INFO org.apache.hadoop.io.compress.CodecPool
> (main):
> Got brand-new compressor
> 2012-06-21 00:28:53,668 INFO org.apache.hadoop.mapred.MapTask (main):
> Finished spill 0
> 2012-06-21 00:28:53,673 INFO org.apache.hadoop.mapred.Task (main):
> Task:attempt_201206202314_0017_m_000055_0 is done. And is in the process of
> commiting
> 2012-06-21 00:28:54,591 INFO org.apache.hadoop.mapred.Task (main): Task
> 'attempt_201206202314_0017_m_000055_0' done.
> 2012-06-21 00:28:54,594 INFO org.apache.hadoop.mapred.TaskLogsTruncater
> (main): Initializing logs' truncater with mapRetainSize=-1 and
> reduceRetainSize=-1
> 2012-06-21 00:28:54,707 INFO org.apache.hadoop.io.nativeio.NativeIO (main):
> Initialized cache for UID to User mapping with a cache timeout of 14400
> seconds.
> 2012-06-21 00:28:54,707 INFO org.apache.hadoop.io.nativeio.NativeIO (main):
> Got UserName hadoop for UID 106 from the native implementation
> 2012-06-21 00:28:55,760 INFO org.apache.hadoop.mapred.TaskLog (main):
> Starting logging for a new task attempt_201206202314_0017_m_000114_0 in the
> same JVM as that of the first task
>
> /mnt/var/log/hadoop/userlogs/job_201206202314_0017/attempt_201206202314_0017_m_000055_0
> 2012-06-21 00:28:55,761 WARN
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): MapTask metrics
> system already initialized!
> 2012-06-21 00:28:55,761 WARN
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name jvm
> already exists!
> 2012-06-21 00:28:55,766 INFO org.apache.hadoop.mapred.MapTask (main): Host
> name: ip-10-244-113-139.ec2.internal
> 2012-06-21 00:28:55,773 INFO org.apache.hadoop.mapred.Task (main):  Using
> ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@bfed5a
> 2012-06-21 00:28:55,862 INFO org.apache.hadoop.mapred.MapTask (main):
> numReduceTasks: 105
> 2012-06-21 00:28:55,863 INFO org.apache.hadoop.mapred.MapTask (main):
> io.sort.mb = 200
> 2012-06-21 00:28:57,804 INFO org.apache.hadoop.mapred.MapTask (main): data
> buffer = 159383552/199229440
> 2012-06-21 00:28:57,804 INFO org.apache.hadoop.mapred.MapTask (main):
> record
> buffer = 524288/655360
> 2012-06-21 00:28:59,370 INFO org.apache.hadoop.mapred.MapTask (main):
> Starting flush of map output
> 2012-06-21 00:28:59,452 INFO org.apache.hadoop.mapred.MapTask (main):
> Finished spill 0
> 2012-06-21 00:28:59,455 INFO org.apache.hadoop.mapred.Task (main):
> Task:attempt_201206202314_0017_m_000114_0 is done. And is in the process of
> commiting
> 2012-06-21 00:29:01,818 INFO org.apache.hadoop.mapred.Task (main): Task
> 'attempt_201206202314_0017_m_000114_0' done.
> 2012-06-21 00:29:01,820 INFO org.apache.hadoop.mapred.TaskLogsTruncater
> (main): Initializing logs' truncater with mapRetainSize=-1 and
> reduceRetainSize=-1
> 2012-06-21 00:29:05,408 INFO org.apache.hadoop.mapred.TaskLog (main):
> Starting logging for a new task attempt_201206202314_0017_m_000130_0 in the
> same JVM as that of the first task
>
> /mnt/var/log/hadoop/userlogs/job_201206202314_0017/attempt_201206202314_0017_m_000055_0
> 2012-06-21 00:29:05,409 WARN
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): MapTask metrics
> system already initialized!
> 2012-06-21 00:29:05,409 WARN
> org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name jvm
> already exists!
> 2012-06-21 00:29:05,412 INFO org.apache.hadoop.mapred.MapTask (main): Host
> name: ip-10-244-113-139.ec2.internal
> 2012-06-21 00:29:05,419 INFO org.apache.hadoop.mapred.Task (main):  Using
> ResourceCalculatorPlugin :
> org.apache.hadoop.util.LinuxResourceCalculatorPlugin@13fba1
> 2012-06-21 00:29:05,459 INFO org.apache.hadoop.mapred.MapTask (main):
> numReduceTasks: 105
> 2012-06-21 00:29:05,460 INFO org.apache.hadoop.mapred.MapTask (main):
> io.sort.mb = 200
> 2012-06-21 00:29:07,739 INFO org.apache.hadoop.mapred.TaskLogsTruncater
> (main): Initializing logs' truncater with mapRetainSize=-1 and
> reduceRetainSize=-1
> 2012-06-21 00:29:07,786 FATAL org.apache.hadoop.mapred.Child (main): Error
> running child : java.lang.OutOfMemoryError: Java heap space
>        at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:965)
>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:433)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:377)
>        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at javax.security.auth.Subject.doAs(Subject.java:396)
>        at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>        at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
> Cleaner format on pastie:
> Stack trace 1 - http://pastie.org/pastes/4123975/text
> Stack trace 2 - http://pastie.org/pastes/4123976/text
>
>
> I'd really appreciate some help on this. Please let me know if there are
> any other logs that would help debug this.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-1-5-Error-Java-heap-space-during-MAP-step-of-CrawlDb-update-tp3990448p3990627.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>