>> This sounds like either a bug or somehow the S3 library requiring lots of
>> memory to read a block. There isn’t a separate way to run HDFS over S3.
>> Hadoop just has different implementations of “file systems”, one of which is
>> S3. There’s a pointer to these versions at the bottom of
>> http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3
>> but it is indeed pretty hidden in the docs.
>
> Hmmm. Maybe a bug then. If I read a small 600-byte file via the s3n:// URI,
> it works on a Spark cluster. If I try a 20 GB file, it just sits there,
> frozen. Is there anything I can do to instrument this and figure out what is
> going on?
Try taking a look at the stderr log of the executor that failed. You should hopefully see a more detailed error message there. The stderr logs can be found by browsing to http://mymaster:8080, where `mymaster` is the hostname of your Spark master.

Hope that helps,
-Jey
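
For reference, here is a minimal sketch of the kind of job described above, reading an object over the s3n:// filesystem and forcing a full pass so any per-block failures surface in the executor stderr logs. The master URL, bucket, object key, and the use of environment variables for credentials are all placeholders, not values from this thread:

    import org.apache.spark.SparkContext

    object S3ReadSketch {
      def main(args: Array[String]): Unit = {
        // "spark://mymaster:7077" is a placeholder for your cluster's master URL.
        val sc = new SparkContext("spark://mymaster:7077", "s3-read-test")

        // One way to supply AWS credentials for s3n:// paths; they can also be
        // embedded in the URI or picked up from the Hadoop configuration.
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

        // count() forces every block of the object to be read, so a hang or
        // out-of-memory error on a large file should show up in the failed
        // executor's stderr log on the master web UI (port 8080).
        val lines = sc.textFile("s3n://my-bucket/big-file.txt")
        println(lines.count())

        sc.stop()
      }
    }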
