Looks like a standard "not enough memory" issue. The usual quick win is to increase the number of partitions so that each task works on a smaller slice of the data.
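A minimal sketch of that change, in case it is useful. The partition count (960), application name, and input path below are placeholders, not values taken from your job:

    // Scala, Spark 1.6.x: spread the work over more, smaller partitions
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("repartition-sketch")             // placeholder name
      .set("spark.default.parallelism", "960")      // default shuffle width, up from 240
    val sc = new SparkContext(conf)

    val input = sc.textFile("hdfs:///path/to/input")  // placeholder path
    val widened = input.repartition(960)              // or repartition explicitly before the shuffle-heavy stage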
Also, your JVMs have an enormous amount of memory, which can lead to very long GC pause times. You might try reducing the executor memory to about 20 GB and running roughly 10x as many executors (see the sketch below).

Finally, you might want to check how many file handles your user is allowed to open, since shuffles can consume a lot of them. On a *nix system, use the ulimit command to see what the restriction (if any) is.
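A rough sketch of that resizing, with illustrative values only. One caveat from memory: on a standalone cluster of this vintage, spark.executor.cores usually has to be set before a worker will launch more than one executor for the same application.

    // Scala, Spark 1.6.x: smaller heaps, more executors (values illustrative, not tuned)
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "20g")  // ~20 GB heap instead of 199 GB, to keep GC pauses short
      .set("spark.executor.cores", "1")     // lets each 6-core worker host several executors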
Phill

On Wed, Aug 10, 2016 at 8:34 PM, شجاع الرحمن بیگ <shujamug...@gmail.com> wrote:

> Hi,
>
> I am getting the following error while processing a large input:
>
> ...
> [Stage 18:====================>                 (90 + 24) / 240]16/08/10 19:39:54 WARN TaskSetManager: Lost task 86.1 in stage 18.0 (TID 2517, bscpower8n2-data): FetchFailed(null, shuffleId=0, mapId=-1, reduceId=86, message=
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
>   at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:542)
>   at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:538)
>   at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:538)
>   at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:155)
>   at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:47)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:51)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
>
> ...
>
> The specifications are as follows:
>
> Spark version: 1.6.1
> Cluster mode: Standalone
> Storage level: Memory and Disk
> Spark worker cores: 6
> Spark worker memory: 200 GB
> Spark executor memory: 199 GB
> Spark driver memory: 5 GB
> Number of input partitions: 240
> Input data set: 34 GB
>
> I investigated the issue further and monitored free RAM with vmstat during the execution of the workload: the job keeps running successfully while free memory is available, but starts throwing this exception once free memory runs out.
>
> Has anyone faced a similar problem? If so, please share the solution.
>
> Thanks
> Shuja
>
> --
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>