Can anyone help point me to configuration options that allow me to reduce the max buffer size when the BlockManager calls doGetRemote()?
I'm assuming that is my problem, based on the stack trace below. Any help thinking this through (especially if you have dealt with datasets larger than RAM) would be appreciated.

14/06/09 21:33:26 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-16,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:329)
at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
at org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:128)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:489)
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:487)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:487)
at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:473)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:513)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:39)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
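From the trace, the remote block appears to be materialized as a single ByteBuffer (the allocate call above), and there does not seem to be a setting in this line of releases that caps that buffer directly. Below is a minimal sketch of the levers that usually apply instead: a bigger executor heap, a smaller storage fraction, and more (hence smaller) partitions so each cached block stays small. The property values, app name, input path, and partition count are illustrative assumptions, not a confirmed fix.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // All values below are assumptions chosen to illustrate the knobs.
    val conf = new SparkConf()
      .setAppName("large-dataset-job")             // hypothetical app name
      .set("spark.executor.memory", "8g")          // more heap per executor
      .set("spark.storage.memoryFraction", "0.3")  // reserve less heap for cached blocks

    val sc = new SparkContext(conf)

    // Each RDD partition becomes one block, so more partitions mean smaller
    // blocks and a smaller allocation when a block is fetched from a remote node.
    val data = sc.textFile("hdfs:///path/to/input")  // hypothetical input path
      .repartition(400)
      .persist(StorageLevel.DISK_ONLY)

The exact partition count matters less than the outcome: no single rdd_* block should come anywhere near the executor heap.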
On Mon, Jun 9, 2014 at 10:47 PM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:

> Sorry for the stream of consciousness, but after thinking about this a bit more, I'm thinking that the FileNotFoundExceptions are due to tasks being cancelled/restarted and that the root cause is the OutOfMemoryError.
>
> If anyone has any insights on how to debug this more deeply, or relevant config settings, that would be much appreciated.
>
> Otherwise, I figure the next step would be to enable more debug logging in the Spark code to see how much memory it is trying to allocate. At this point, I'm wondering if the block could be in the GB range.
>
> -Suren
>
> On Mon, Jun 9, 2014 at 10:27 PM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
>
>> I don't know if this is related, but a little earlier in stderr I also have the following stack trace. This one, though, seems to occur while the code is grabbing RDD data from a remote node, which is different from the above.
>>
>> 14/06/09 21:33:26 ERROR executor.ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-16,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>> at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>> at java.nio.ByteBuffer.allocate(ByteBuffer.java:329)
>> at org.apache.spark.storage.BlockMessage.set(BlockMessage.scala:94)
>> at org.apache.spark.storage.BlockMessage$.fromByteBuffer(BlockMessage.scala:176)
>> at org.apache.spark.storage.BlockMessageArray.set(BlockMessageArray.scala:63)
>> at org.apache.spark.storage.BlockMessageArray$.fromBufferMessage(BlockMessageArray.scala:109)
>> at org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:128)
>> at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:489)
>> at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:487)
>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:487)
>> at org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:473)
>> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:513)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:39)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>> at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>> at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>> at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
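On the idea above of enabling more debug logging to see how much memory is being allocated, here is a minimal sketch of one way to do it without rebuilding Spark, assuming the stock log4j setup; org.apache.spark.storage is the package that contains BlockManager and BlockMessage, and org.apache.spark.network covers the connection manager used for remote fetches.

    import org.apache.log4j.{Level, Logger}

    // A programmatic override only affects the JVM it runs in (the driver).
    // For executor logs, where the OutOfMemoryError occurs, the equivalent
    // entries in conf/log4j.properties on each worker would be:
    //   log4j.logger.org.apache.spark.storage=DEBUG
    //   log4j.logger.org.apache.spark.network=DEBUG
    Logger.getLogger("org.apache.spark.storage").setLevel(Level.DEBUG)
    Logger.getLogger("org.apache.spark.network").setLevel(Level.DEBUG)

With that in place, the executor stderr should include the storage layer's per-block messages, which may help confirm whether a single cached block really is in the GB range.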
>> On Mon, Jun 9, 2014 at 10:05 PM, Surendranauth Hiraman <suren.hira...@velos.io> wrote:
>>
>>> I have a dataset of about 10 GB. I am using persist(DISK_ONLY) to avoid out-of-memory issues when running my job.
>>>
>>> When I run with a dataset of about 1 GB, the job is able to complete.
>>>
>>> But when I run with the larger 10 GB dataset, I get the following error/stack trace, which seems to happen when the RDD is being written out to disk.
>>>
>>> Anyone have any ideas as to what is going on, or whether there is a setting I can tune?
>>>
>>> 14/06/09 21:33:55 ERROR executor.Executor: Exception in task ID 560
>>> java.io.FileNotFoundException: /tmp/spark-local-20140609210741-0bb8/14/rdd_331_175 (No such file or directory)
>>> at java.io.FileOutputStream.open(Native Method)
>>> at java.io.FileOutputStream.<init>(FileOutputStream.java:209)
>>> at java.io.FileOutputStream.<init>(FileOutputStream.java:160)
>>> at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:79)
>>> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:698)
>>> at org.apache.spark.storage.BlockManager.put(BlockManager.scala:546)
>>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:95)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>>> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:77)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:51)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:679)

--
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io
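A final note on the FileNotFoundException under /tmp/spark-local-*: as suggested earlier in the thread, it is most likely fallout from tasks being killed after the OutOfMemoryError rather than a separate bug. But if /tmp is small or gets cleaned by the OS, spark.local.dir is the setting that moves DISK_ONLY blocks and shuffle files to a larger volume; the directories below are placeholders.

    import org.apache.spark.SparkConf

    // Placeholder paths; a comma-separated list spreads scratch I/O across disks.
    // This is where the spark-local-* directories (and the rdd_* block files
    // written by persist(DISK_ONLY)) are created.
    val conf = new SparkConf()
      .set("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp")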