The original exception definitely happens in the task, when Mahout tries to build an entire matrix block out of a partition. Use more tasks that are smaller in size initially: using par(min=??) will repartition to at least ?? tasks. The off-HDFS defaults are just too big for matrix processing. I'm not sure how to do that with the command-line utility; Pat may be able to help.
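To make the sizing argument concrete, here is a rough back-of-envelope sketch in Python. It is only an illustration, not anything from Mahout itself: the function names are made up, and it assumes that each partition is turned into one dense block of 8-byte doubles (which is what DenseMatrix.<init> under blockify in the trace suggests) and that rows are spread evenly over partitions. The column count and the per-block budget below are hypothetical numbers, so plug in your own.

```python
import math

BYTES_PER_DOUBLE = 8  # a dense double matrix stores 8 bytes per cell


def dense_block_bytes(rows_in_partition, n_cols):
    """Approximate heap needed to hold one partition as a dense block."""
    return rows_in_partition * n_cols * BYTES_PER_DOUBLE


def min_partitions(n_rows, n_cols, block_budget_bytes):
    """Smallest partition count that keeps every dense block within the budget.

    Assumes rows are distributed evenly across partitions.
    """
    max_rows_per_block = block_budget_bytes // (n_cols * BYTES_PER_DOUBLE)
    if max_rows_per_block < 1:
        raise ValueError("a single dense row already exceeds the block budget")
    return math.ceil(n_rows / max_rows_per_block)


if __name__ == "__main__":
    n_rows = 100_000           # row count mentioned in the thread
    n_cols = 50_000            # hypothetical column count
    budget = 256 * 1024 ** 2   # hypothetical 256 MiB budget per block

    whole = dense_block_bytes(n_rows, n_cols)
    print("one partition, dense: %.1f GiB" % (whole / 1024 ** 3))
    print("minimum partitions for the budget:", min_partitions(n_rows, n_cols, budget))
```

The number it prints is the kind of value one would try for the ?? in par(min=??); treat it as a starting point rather than a tuned setting.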
On Tue, Feb 16, 2016 at 9:59 AM, Jaume Galí <jg...@konodrac.com> wrote:

> Hi,
>
> I did everything you suggested, but I couldn't solve the problem yet and I don't know what else to do.
>
> Now I have a machine with 64 GB of RAM, so physical memory should not be a problem any more.
> I attach the input matrix; it would be great if anybody could try to execute the command.
>
> This is what I tried:
>
> - I used this command, as Angelo suggested:
>
> /opt/mahout/bin/mahout spark-rowsimilarity -i matrix_country_115k.dat -o test_country_115k_output.tmp --maxObservations 500 --maxSimilaritiesPerRow 100 --omitStrength --master local --sparkExecutorMem 10g -D:spark.dynamicAllocation.enabled=true -D:spark.shuffle.service.enabled=true
>
> - I increased MAHOUT_HEAPSIZE up to 32 GB in two ways:
>
> + Mahout script (MAHOUT_HOME/bin/mahout):
>
> JAVA=$JAVA_HOME/bin/java
> JAVA_HEAP_MAX=-Xmx4g
> MAHOUT_HEAPSIZE=32768
>
> + ~/.profile setting environment variables:
>
> #Global conf JAVA
> export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
> export JAVA_OPTS=-Xmx32g
> export _JAVA_OPTIONS=-Xmx32g
> export HADOOP_PREFIX=/opt/hadoop
> export SPARK_HOME=/opt/spark
> export MAHOUT_HOME=/opt/mahout
> export MAHOUT_HEAPSIZE=32g
>
> I printed a trace of the memory settings from the mahout script, and this is the output:
>
> run with heapsize 32768
> -Xmx32768m
>
> So mahout is reading the memory parameters fine.
>
> I would be glad if you could guide me on which parameters I have to tune or check in order to solve this issue, because I don't know what else to do.
>
> Thanks in advance.
> Jaume.
>
> On 13/2/2016, at 22:56, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> OK, this makes sense. When people see Out of Memory problems they naturally try to give more to the process throwing the exception, but what is often happening is that you have given too much to the collection of other processes on the machine, so there is not enough to go around and the allocation fails on Spark. In which case you need to allocate less to Spark so you can guarantee it will always be able to get that much.
>
> On Feb 13, 2016, at 9:30 AM, Angelo Leto <angl...@gmail.com> wrote:
>
> I was able to make it work by setting the executor memory to 10g and with -D:spark.dynamicAllocation.enabled=true :
>
> mahout spark-rowsimilarity --input hdfs:/indata/row-similarity.tsv --output rowsim-out --omitStrength --sparkExecutorMem 10g --master yarn-client -D:spark.dynamicAllocation.enabled=true -D:spark.shuffle.service.enabled=true
>
> On Sat, Feb 13, 2016 at 2:42 PM, Angelo Leto <angl...@gmail.com> wrote:
>
> Hello,
> I have the same problem described above using spark-rowsimilarity.
> I have a ~65k-line input file (each row with fewer than 300 items), and I run the job on a small cluster with 1 master and 2 workers; each machine has 15 GB of RAM.
> I tried to increase executor and driver memory:
> --sparkExecutorMem 15g
> -D:spark.driver.memory=15g
>
> but I get the OutOfMemoryError exception:
>
> 16/02/13 13:00:36 ERROR Executor: Exception in task 0.0 in stage 12.0 (TID 12)
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at org.apache.mahout.math.OrderedIntDoubleMapping.growTo(OrderedIntDoubleMapping.java:86)
>     at org.apache.mahout.math.OrderedIntDoubleMapping.set(OrderedIntDoubleMapping.java:118)
> [...]
>
> Thanks for any hint.
> Angelo
>
> On Fri, Feb 12, 2016 at 10:15 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> You have to set the executor memory. BTW you have given the driver all memory on the machine.
>
> On Feb 10, 2016, at 9:30 AM, Jaume Galí <jg...@konodrac.com> wrote:
>
> Hi again,
> (Sorry for my delay, but we didn't have a machine to test your thoughts about the memory issue.)
>
> The problem is still happening when testing with an input matrix of 100k rows by 300 items. I increased memory as you suggested, but nothing changed. I attached spark_env.sh and the new specs of the machine.
>
> Machine specs:
>
> m3.xlarge AWS (Ivy Bridge, 15 GB RAM, 2x40 GB HD)
>
> This is my spark-env.sh:
>
> #!/usr/bin/env bash
> # Licensed to ...
>
> export SPARK_HOME=${SPARK_HOME:-/usr/lib/spark}
> export SPARK_LOG_DIR=${SPARK_LOG_DIR:-/var/log/spark}
> export HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
> export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop/conf}
> export HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
>
> export STANDALONE_SPARK_MASTER_HOST=ip-10-12-17-235.eu-west-1.compute.internal
> export SPARK_MASTER_PORT=7077
> export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST
> export SPARK_MASTER_WEBUI_PORT=8080
>
> export SPARK_WORKER_DIR=${SPARK_WORKER_DIR:-/var/run/spark/work}
> export SPARK_WORKER_PORT=7078
> export SPARK_WORKER_WEBUI_PORT=8081
>
> export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0
> export HIVE_SERVER2_THRIFT_PORT=10001
>
> export SPARK_DRIVER_MEMORY=15G
> export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -XX:OnOutOfMemoryError='kill -9 %p'"
>
> Log:
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 24, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
> ...
>
> Driver stacktrace:
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> ...
>
> Thanks in advance
>
> On 2/2/2016, at 7:48, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> You probably need to increase your driver memory, and 8g will not work. 16g is probably the smallest standalone machine that will work, since the driver and executors run on it.
>
> On Feb 1, 2016, at 1:24 AM, jg...@konodrac.com wrote:
>
> Hello everybody,
>
> We are experiencing problems when we use the "mahout spark-rowsimilarity" operation. We have an input matrix with 100k rows and 100 items, and the process throws the exception "Exception in task 0.0 in stage 13.0 (TID 13) java.lang.OutOfMemoryError: Java heap space". We tried increasing the Java heap memory, the Mahout heap memory and spark.driver.memory.
>
> Environment versions:
> Mahout: 0.11.1
> Spark: 1.6.0
>
> Mahout command line:
> /opt/mahout/bin/mahout spark-rowsimilarity -i 50k_rows__50items.dat -o test_output.tmp --maxObservations 500 --maxSimilaritiesPerRow 100 --omitStrength --master local --sparkExecutorMem 8g
>
> This process is running on a machine with the following specifications:
> RAM: 8 GB
> CPU with 8 cores
>
> .profile file:
> export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
> export HADOOP_HOME=/opt/hadoop-2.6.0
> export SPARK_HOME=/opt/spark
> export MAHOUT_HOME=/opt/mahout
> export MAHOUT_HEAPSIZE=8192
>
> Throws exception:
>
> 16/01/22 11:45:06 ERROR Executor: Exception in task 0.0 in stage 13.0 (TID 13)
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.mahout.math.DenseMatrix.<init>(DenseMatrix.java:66)
>     at org.apache.mahout.sparkbindings.drm.package$$anonfun$blockify$1.apply(package.scala:70)
>     at org.apache.mahout.sparkbindings.drm.package$$anonfun$blockify$1.apply(package.scala:59)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:89)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> 16/01/22 11:45:06 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@12498227,BlockManagerId(driver, localhost, 42107))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>     at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:448)
>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>     at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> 16/01/22 11:45:06 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@12498227,BlockManagerId(driver, localhost, 42107))] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>     at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>     at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:448)
>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>     at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>     at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
>     at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:468)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>     at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>     at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>     at scala.concurrent.Await$.result(package.scala:107)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> ...
>
> Can you please advise?
>
> Thanks in advance.
> Cheers.