Re: Running out of space (when there's no shortage)

2015-02-27 Thread Kelvin Chu
Hi Joe, you might try increasing spark.yarn.executor.memoryOverhead to see if
it fixes the problem. Please take a look at this report:
https://issues.apache.org/jira/browse/SPARK-4996
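
For example (only relevant when running on YARN; the 2048 below is just an
illustrative value, in MB, and should be tuned to your executor size):

  spark-submit --conf spark.yarn.executor.memoryOverhead=2048 ... (your usual arguments)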

Hope this helps.


Running out of space (when there's no shortage)

2015-02-24 Thread Joe Wass
I'm running a cluster of 3 Amazon EC2 machines (small number because it's
expensive when experiments keep crashing after a day!).

Today's crash looks like this (stacktrace at end of message).
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0

On my three nodes, I have plenty of space and inodes:

A $ df -i
Filesystem        Inodes  IUsed      IFree IUse% Mounted on
/dev/xvda1        524288  97937     426351   19% /
tmpfs            1909200      1    1909199    1% /dev/shm
/dev/xvdb        2457600     54    2457546    1% /mnt
/dev/xvdc        2457600     24    2457576    1% /mnt2
/dev/xvds      831869296  23844  831845452    1% /vol0

A $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  3.4G  4.5G  44% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
/dev/xvdb        37G  1.2G   34G   4% /mnt
/dev/xvdc        37G  177M   35G   1% /mnt2
/dev/xvds      1000G  802G  199G  81% /vol0

B $ df -i
Filesystem        Inodes  IUsed      IFree IUse% Mounted on
/dev/xvda1        524288  97947     426341   19% /
tmpfs            1906639      1    1906638    1% /dev/shm
/dev/xvdb        2457600     54    2457546    1% /mnt
/dev/xvdc        2457600     24    2457576    1% /mnt2
/dev/xvds      816200704  24223  816176481    1% /vol0

B $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  3.6G  4.3G  46% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
/dev/xvdb        37G  1.2G   34G   4% /mnt
/dev/xvdc        37G  177M   35G   1% /mnt2
/dev/xvds      1000G  805G  195G  81% /vol0

C $ df -i
Filesystem        Inodes  IUsed      IFree IUse% Mounted on
/dev/xvda1        524288  97938     426350   19% /
tmpfs            1906897      1    1906896    1% /dev/shm
/dev/xvdb        2457600     54    2457546    1% /mnt
/dev/xvdc        2457600     24    2457576    1% /mnt2
/dev/xvds      755218352  24024  755194328    1% /vol0

C $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  3.4G  4.5G  44% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
/dev/xvdb        37G  1.2G   34G   4% /mnt
/dev/xvdc        37G  177M   35G   1% /mnt2
/dev/xvds      1000G  820G  181G  82% /vol0

The devices may be ~80% full but that still leaves ~200G free on each. My
spark-env.sh has

export SPARK_LOCAL_DIRS=/vol0/spark
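
(As an aside, SPARK_LOCAL_DIRS also accepts a comma-separated list, so a
variant like the sketch below would spread the temporary files over the
other mounts as well; the paths are just this cluster's mount points:)

  export SPARK_LOCAL_DIRS=/mnt/spark,/mnt2/spark,/vol0/spark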

I have manually verified that on each slave the only temporary files are
stored on /vol0, all looking something like this

/vol0/spark/spark-f05d407c/spark-fca3e573/spark-78c06215/spark-4f0c4236/20/rdd_8_884

So it looks like all the files are being stored on the large drives
(incidentally they're AWS EBS volumes, but that's the only way to get
enough storage). My process crashed before with a slightly different
exception under the same circumstances: kryo.KryoException:
java.io.IOException: No space left on device

These both happen after several hours and several GB of temporary files.

Why does Spark think it's run out of space?

TIA

Joe

Stack trace 1:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:109)
at

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Vladimir Rodionov
Usually this happens on Linux when an application deletes a file without
checking that there are no open file descriptors on it (a resource leak). In
that case, Linux keeps the allocated space and does not release it until the
application exits (or, in your case, crashes). You check the file system,
everything looks normal, you have plenty of space, and yet the application
reports no space left on device.

Just a guess.
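
If that is the cause, a rough way to check it (just a sketch; assumes lsof
is available on the nodes) is to compare du with df and look for files that
have been deleted but are still held open:

  # a large gap between the two suggests space held by deleted-but-open files
  df -h /vol0
  sudo du -sh /vol0

  # list open files whose link count is already 0 (i.e. unlinked/deleted)
  sudo lsof +L1 | grep /vol0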

-Vladimir Rodionov


Re: Running out of space (when there's no shortage)

2015-02-24 Thread Ted Yu
Here is a tool which may give you a clue:
http://file-leak-detector.kohsuke.org/
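
As a rough sketch of how it could be attached to the Spark executors (the
jar path and the http=19999 agent option are illustrative; check the tool's
docs for the exact options):

  spark-submit \
    --conf "spark.executor.extraJavaOptions=-javaagent:/path/to/file-leak-detector.jar=http=19999" \
    ... (rest of your usual arguments)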

Cheers


Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
Hi there,

I assume you are using Spark 1.2.1, right?
I faced the exact same issue and switched to 1.1.1 with the same
configuration, and the problem was solved.

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
No problem, Joe. There you go:
https://issues.apache.org/jira/browse/SPARK-5081
And there is also this one, https://issues.apache.org/jira/browse/SPARK-5715,
which is marked as resolved.

On 24 February 2015 at 21:51, Joe Wass jw...@crossref.org wrote:

 Thanks everyone.

 Yiannis, do you know if there's a bug report for this regression? For some
 other (possibly connected) reason I upgraded from 1.1.1 to 1.2.1, but I
 can't remember what the bug was.

 Joe



