BTW, depending on the resource manager, 10G per executor may not
necessarily be sufficient. I never plan less than 1.5G per core (after
excluding the block manager, or 3G per core including the block manager).
That means 10G of executor memory might be barely enough for 4-core
worker nodes, so my own setup uses at least 90G per executor with 32
cores. If a box has 32 hardware threads, chances are it has at least
128G of RAM, and often more.
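As a rough illustration of the rule of thumb above (the per-core figures are planning numbers, not hard limits), the arithmetic works out like this:

```shell
# Back-of-the-envelope executor sizing from the per-core rule of thumb:
# ~1.5G per core for task heap alone, ~3G per core once block manager
# (storage) memory is included. Figures are illustrative, not hard limits.
cores_per_executor=32
gb_per_core=3                      # including block manager
executor_mem_gb=$((cores_per_executor * gb_per_core))
echo "plan at least ${executor_mem_gb}g per ${cores_per_executor}-core executor"

# The same rule shows why 10g is tight for a 4-core worker:
echo "4 cores * 3g = $((4 * 3))g needed, against 10g configured"
```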

I have never run row similarity on anything substantial, though, so I
cannot be sure how it behaves under load.

On Tue, Feb 16, 2016 at 9:59 AM, Jaume Galí <jg...@konodrac.com> wrote:

> Hi,
>
> I did everything you suggested, but I couldn’t solve the problem yet and I
> don’t know what else to do.
>
> Now I have a machine with 64GB of RAM, so physical memory should not be a
> problem any more.
> I have attached the input matrix; if anybody could try running the command,
> that would be great.
>
> This is what I tried:
>
> - I used this command as Angelo suggested:
>
> /opt/mahout/bin/mahout spark-rowsimilarity -i matrix_country_115k.dat -o
> test_country_115k_output.tmp --maxObservations 500 --maxSimilaritiesPerRow
> 100 --omitStrength --master local --sparkExecutorMem 10g
> -D:spark.dynamicAllocation.enabled=true
> -D:spark.shuffle.service.enabled=true
>
> - I increased *MAHOUT_HEAPSIZE* up to 32GB, in two ways:
>
>
> + Mahout script (MAHOUT_HOME/bin/mahout):
>
> JAVA=$JAVA_HOME/bin/java
>
> JAVA_HEAP_MAX=-Xmx4g
>
> MAHOUT_HEAPSIZE=32768
>
>
> + Setting environment variables in ~/.profile:
>
> #Global conf JAVA
> export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
> export JAVA_OPTS=-Xmx32g
> export _JAVA_OPTIONS=-Xmx32g
> export HADOOP_PREFIX=/opt/hadoop
> export SPARK_HOME=/opt/spark
> export MAHOUT_HOME=/opt/mahout
> export MAHOUT_HEAPSIZE=32g
>
>
> I printed a trace of the memory settings from the mahout script; this is
> the output:
>
> run with heapsize 32768
>
> -Xmx32768m
>
> So Mahout is reading the memory parameters correctly.
>
>
> I would be glad if you could guide me on which parameters to tune or check
> in order to solve this issue, because I don’t know what else to do.
>
> Thanks in advance.
> Jaume.
>
>
>
> On 13/2/2016, at 22:56, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> OK, this makes sense. When people see out-of-memory errors they naturally
> try to give more memory to the process throwing the exception, but what is
> often happening is that the collection of other processes on the machine
> has been given too much, so there is not enough to go around and the
> allocation fails in Spark. In that case you need to allocate less to Spark,
> so you can guarantee it will always be able to get that much.
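> To make the point concrete, here is a minimal budget check (the numbers are
> illustrative, not from this thread) to run mentally before raising any
> single memory setting:

```shell
# Everything running on one box must fit in physical RAM, with headroom
# for the OS and JVM off-heap overhead. Illustrative numbers for a 15G
# machine; adjust to your own allocations.
physical_gb=15
driver_gb=4
executor_gb=8
headroom_gb=2                      # OS, page cache, off-heap overhead
total=$((driver_gb + executor_gb + headroom_gb))
if [ "$total" -le "$physical_gb" ]; then
  echo "budget OK: ${total}g of ${physical_gb}g committed"
else
  echo "over-committed: ${total}g on a ${physical_gb}g machine; lower the Spark settings"
fi
```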
>
>
> On Feb 13, 2016, at 9:30 AM, Angelo Leto <angl...@gmail.com> wrote:
>
> I was able to make it work by setting the executor memory to 10g and
> using -D:spark.dynamicAllocation.enabled=true:
>
> mahout spark-rowsimilarity --input hdfs:/indata/row-similarity.tsv
> --output rowsim-out --omitStrength --sparkExecutorMem 10g --master
> yarn-client -D:spark.dynamicAllocation.enabled=true
> -D:spark.shuffle.service.enabled=true
>
>
> On Sat, Feb 13, 2016 at 2:42 PM, Angelo Leto <angl...@gmail.com> wrote:
>
> Hello,
> I have the same problem described above using spark-rowsimilarity.
> I have a ~65k-line input file (each row with fewer than 300 items), and I
> run the job on a small cluster with 1 master and 2 workers; each machine
> has 15GB of RAM.
> I tried to increase executor and driver memory:
> --sparkExecutorMem 15g
> -D:spark.driver.memory=15g
>
> but I get the OutOfMemoryError exception:
>
> 16/02/13 13:00:36 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID
> 12)
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>       at
> org.apache.mahout.math.OrderedIntDoubleMapping.growTo(OrderedIntDoubleMapping.java:86)
>       at
> org.apache.mahout.math.OrderedIntDoubleMapping.set(OrderedIntDoubleMapping.java:118)
> [...]
>
> Thanks for any hint.
> Angelo
>
> On Fri, Feb 12, 2016 at 10:15 PM, Pat Ferrel <p...@occamsmachete.com>
> wrote:
>
> You have to set the executor memory. BTW, you have given the driver all
> the memory on the machine.
>
> On Feb 10, 2016, at 9:30 AM, Jaume Galí <jg...@konodrac.com> wrote:
>
> Hi again,
> (Sorry for the delay, but we didn’t have a machine available to test your
> suggestions about the memory issue.)
>
> The problem is still happening when testing with an input matrix of 100k
> rows by 300 items; I increased memory as you suggested, but nothing
> changed. I have attached spark-env.sh and the new machine specs.
>
> Machine specs:
>
> AWS m3.xlarge (Ivy Bridge, 15GB RAM, 2x40GB disks)
>
> This is my spark-env.sh:
>
>        #!/usr/bin/env bash
> # Licensed to ...
>
> export SPARK_HOME=${SPARK_HOME:-/usr/lib/spark}
> export SPARK_LOG_DIR=${SPARK_LOG_DIR:-/var/log/spark}
> export HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
> export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
> export HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
>
> export STANDALONE_SPARK_MASTER_HOST=ip-10-12-17-235.eu-west-1.compute.internal
> export SPARK_MASTER_PORT=7077
> export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
> export SPARK_MASTER_WEBUI_PORT=8080
>
> export SPARK_WORKER_DIR=${SPARK_WORKER_DIR:-/var/run/spark/work}
> export SPARK_WORKER_PORT=7078
> export SPARK_WORKER_WEBUI_PORT=8081
>
> export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0
> export HIVE_SERVER2_THRIFT_PORT=10001
>
> export SPARK_DRIVER_MEMORY=15G
> export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS
> -XX:OnOutOfMemoryError='kill -9 %p'"
>
> Log:
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent
> failure: Lost task 0.0 in stage 12.0 (TID 24, localhost):
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> [...]
>
> Driver stacktrace:
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> [...]
>
>
> Thanks in advance.
>
> On 2/2/2016, at 7:48, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> You probably need to increase your driver memory; 8g will not be enough.
> 16g is probably the smallest standalone machine that will work, since the
> driver and the executors both run on it.
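> For example (sizes illustrative, flags as used elsewhere in this thread),
> on a 16g machine you might split the memory along these lines rather than
> giving either side the whole box:

```shell
# Hypothetical sizing for a single 16g machine running driver + executor:
# leave a few GB for the OS instead of committing all 16g to Spark.
mahout spark-rowsimilarity \
  -i input-matrix.dat -o rowsim-output.tmp \
  --master local \
  --sparkExecutorMem 8g \
  -D:spark.driver.memory=4g
```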
>
> On Feb 1, 2016, at 1:24 AM, jg...@konodrac.com wrote:
>
> Hello everybody,
>
> We are experiencing problems when we use the "mahout spark-rowsimilarity"
> operation. We have an input matrix with 100k rows and 100 items, and the
> process throws "Exception in task 0.0 in stage 13.0 (TID 13)
> java.lang.OutOfMemoryError: Java heap space". We have tried increasing the
> Java heap, the Mahout heap (MAHOUT_HEAPSIZE), and spark.driver.memory.
>
> Environment versions:
> Mahout: 0.11.1
> Spark: 1.6.0.
>
> Mahout command line:
>   /opt/mahout/bin/mahout spark-rowsimilarity -i 50k_rows__50items.dat -o
> test_output.tmp --maxObservations 500 --maxSimilaritiesPerRow 100
> --omitStrength --master local --sparkExecutorMem 8g
>
> This process is running on a machine with the following specs:
> RAM: 8GB
> CPU: 8 cores
>
> .profile file:
> export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
> export HADOOP_HOME=/opt/hadoop-2.6.0
> export SPARK_HOME=/opt/spark
> export MAHOUT_HOME=/opt/mahout
> export MAHOUT_HEAPSIZE=8192
>
> It throws this exception:
>
> 16/01/22 11:45:06 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID
> 13)
> java.lang.OutOfMemoryError: Java heap space
>    at org.apache.mahout.math.DenseMatrix.<init>(DenseMatrix.java:66)
>    at
> org.apache.mahout.sparkbindings.drm.package$$anonfun$blockify$1.apply(package.scala:70)
>    at
> org.apache.mahout.sparkbindings.drm.package$$anonfun$blockify$1.apply(package.scala:59)
>    at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>    at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>    at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>    at org.apache.spark.scheduler.Task.run(Task.scala:89)
>    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>    at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>    at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>    at java.lang.Thread.run(Thread.java:745)
> 16/01/22 11:45:06 WARN NettyRpcEndpointRef: Error sending message [message
> = Heartbeat(driver,[Lscala.Tuple2;@12498227,BlockManagerId(driver,
> localhost, 42107))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
> seconds]. This timeout is controlled by spark.rpc.askTimeout
>    at org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>    at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>    at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>    at
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>    at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>    at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>    at org.apache.spark.executor.Executor.org
> $apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:448)
>    at
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>    at
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>    at
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468)
>    at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>    at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>    at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>    at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>    at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>    at java.lang.Thread.run(Thread.java:745)
> 16/01/22 11:45:06 WARN NettyRpcEndpointRef: Error sending message [message
> = Heartbeat(driver,[Lscala.Tuple2;@12498227,BlockManagerId(driver,
> localhost, 42107))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
> seconds]. This timeout is controlled by spark.rpc.askTimeout
>    at org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>    at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>    at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>    at
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>    at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>    at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>    at org.apache.spark.executor.Executor.org
> $apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:448)
>    at
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>    at
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>    at
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>    at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468)
>    at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>    at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>    at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>    at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>    at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>    at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [120 seconds]
>    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>    at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>    at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>    at scala.concurrent.Await$.result(package.scala:107)
>    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>    ...
>
> Can you please advise?
>
>
> Thanks in advance.
> Cheers.
>
