Re: Running out of space (when there's no shortage)
Hi Joe, you might increase spark.yarn.executor.memoryOverhead to see if it fixes the problem; a minimal example of setting it is below. Please take a look at this report: https://issues.apache.org/jira/browse/SPARK-4996 Hope this helps.
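A sketch of that setting (the 1024 MB value here is only an illustrative starting point, and the property only takes effect when running on YARN):

    # spark-defaults.conf
    spark.yarn.executor.memoryOverhead   1024

    # or equivalently on the command line, added to your usual spark-submit invocation:
    spark-submit --conf spark.yarn.executor.memoryOverhead=1024 ...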
Running out of space (when there's no shortage)
I'm running a cluster of 3 Amazon EC2 machines (small number because it's expensive when experiments keep crashing after a day!). Today's crash looks like this (stacktrace at end of message):

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

On my three nodes, I have plenty of space and inodes:

A $ df -i
Filesystem      Inodes  IUsed     IFree IUse% Mounted on
/dev/xvda1      524288  97937    426351   19% /
tmpfs          1909200      1   1909199    1% /dev/shm
/dev/xvdb      2457600     54   2457546    1% /mnt
/dev/xvdc      2457600     24   2457576    1% /mnt2
/dev/xvds    831869296  23844 831845452    1% /vol0

A $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  3.4G  4.5G  44% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
/dev/xvdb        37G  1.2G   34G   4% /mnt
/dev/xvdc        37G  177M   35G   1% /mnt2
/dev/xvds      1000G  802G  199G  81% /vol0

B $ df -i
Filesystem      Inodes  IUsed     IFree IUse% Mounted on
/dev/xvda1      524288  97947    426341   19% /
tmpfs          1906639      1   1906638    1% /dev/shm
/dev/xvdb      2457600     54   2457546    1% /mnt
/dev/xvdc      2457600     24   2457576    1% /mnt2
/dev/xvds    816200704  24223 816176481    1% /vol0

B $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  3.6G  4.3G  46% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
/dev/xvdb        37G  1.2G   34G   4% /mnt
/dev/xvdc        37G  177M   35G   1% /mnt2
/dev/xvds      1000G  805G  195G  81% /vol0

C $ df -i
Filesystem      Inodes  IUsed     IFree IUse% Mounted on
/dev/xvda1      524288  97938    426350   19% /
tmpfs          1906897      1   1906896    1% /dev/shm
/dev/xvdb      2457600     54   2457546    1% /mnt
/dev/xvdc      2457600     24   2457576    1% /mnt2
/dev/xvds    755218352  24024 755194328    1% /vol0

C $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  3.4G  4.5G  44% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
/dev/xvdb        37G  1.2G   34G   4% /mnt
/dev/xvdc        37G  177M   35G   1% /mnt2
/dev/xvds      1000G  820G  181G  82% /vol0

The devices may be ~80% full but that still leaves ~200G free on each.

My spark-env.sh has:

export SPARK_LOCAL_DIRS=/vol0/spark

I have manually verified that on each slave the only temporary files are stored on /vol0, all looking something like this:

/vol0/spark/spark-f05d407c/spark-fca3e573/spark-78c06215/spark-4f0c4236/20/rdd_8_884

So it looks like all the files are being stored on the large drives (incidentally they're AWS EBS volumes, but that's the only way to get enough storage).

My process crashed before with a slightly different exception under the same circumstances:

kryo.KryoException: java.io.IOException: No space left on device

These both happen after several hours and several GB of temporary files. Why does Spark think it's run out of space?
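For reference, this is roughly the check I mean (run on each slave; SPARK_HOME is assumed to point at the Spark install):

    $ grep SPARK_LOCAL_DIRS $SPARK_HOME/conf/spark-env.sh
    $ sudo du -sh /vol0/spark          # what the Spark scratch directory actually holds
    $ df -h /vol0 ; df -i /vol0        # what the filesystem reports, in bytes and inodes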
TIA

Joe

Stack trace 1:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
    at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
    at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
    at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:109)
Re: Running out of space (when there's no shortage)
Usually this happens on Linux when an application deletes a file without double-checking that there are no open file descriptors on it (a resource leak). In that case, Linux keeps the space allocated and does not release it until the application exits (crashes, in your case). You check the file system and everything looks normal: you have enough space, yet you have no idea why the application reports "no space left on device". Just a guess; a quick check is sketched below.

-Vladimir Rodionov
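A quick way to test this guess on a worker while the job is still running (assuming lsof is installed; /vol0 is the volume from your df output):

    # open file descriptors whose files have already been unlinked (link count 0):
    # this is space that df still counts as used but du can no longer see
    $ sudo lsof +L1 | grep /vol0

    # a large gap between these two numbers points the same way
    $ df -h /vol0
    $ sudo du -sh /vol0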
Re: Running out of space (when there's no shortage)
Here is a tool which may give you some clue: http://file-leak-detector.kohsuke.org/ (a rough sketch of wiring it into the executors is below).

Cheers
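For a Spark job you could try attaching it to the executor JVMs as a Java agent. This is only a rough sketch: the jar path and the http=19999 agent option are my assumptions, so check the tool's page for the exact flags it supports.

    # spark-defaults.conf
    spark.executor.extraJavaOptions   -javaagent:/path/to/file-leak-detector.jar=http=19999

    # then, while the job runs, ask a worker for its list of currently open files
    # (assuming the http option serves that list on the given port):
    $ curl http://localhost:19999/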
Re: Running out of space (when there's no shortage)
Hi there, I assume you are using Spark 1.2.1, right? I faced the exact same issue, switched to 1.1.1 with the same configuration, and the problem was solved.
Re: Running out of space (when there's no shortage)
No problem, Joe. There you go: https://issues.apache.org/jira/browse/SPARK-5081 And there is also this one, https://issues.apache.org/jira/browse/SPARK-5715, which is marked as resolved.

On 24 February 2015 at 21:51, Joe Wass jw...@crossref.org wrote:

Thanks everyone. Yiannis, do you know if there's a bug report for this regression? For some other (possibly connected) reason I upgraded from 1.1.1 to 1.2.1, but I can't remember what the bug was.

Joe
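P.S. If the extra shuffle output in 1.2 turns out to come from the sort-based shuffle that became the default in that release (that is only my assumption; the JIRA discussion has the details), one cheap experiment is to switch back to the hash shuffle manager and rerun:

    # spark-defaults.conf (experiment only; "sort" is the 1.2 default)
    spark.shuffle.manager   hash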