Alright, I found the issue: I wasn't setting the "fs.s3.buffer.dir" configuration. Here is the final Spark conf snippet that works:
"spark.hadoop.fs.s3n.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem", "spark.hadoop.fs.s3.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem", "spark.hadoop.fs.s3bfs.impl": "org.apache.hadoop.fs.s3.S3FileSystem", "spark.hadoop.fs.s3.buffer.dir": "/mnt/var/lib/hadoop/s3,/mnt1/var/lib/hadoop/s3", "spark.hadoop.fs.s3n.endpoint": "s3.amazonaws.com", "spark.hadoop.fs.emr.configuration.version": "1.0", "spark.hadoop.fs.s3n.multipart.uploads.enabled": "true", "spark.hadoop.fs.s3.enableServerSideEncryption": "false", "spark.hadoop.fs.s3.serverSideEncryptionAlgorithm": "AES256", "spark.hadoop.fs.s3.consistent": "true", "spark.hadoop.fs.s3.consistent.retryPolicyType": "exponential", "spark.hadoop.fs.s3.consistent.retryPeriodSeconds": "10", "spark.hadoop.fs.s3.consistent.retryCount": "5", "spark.hadoop.fs.s3.maxRetries": "4", "spark.hadoop.fs.s3.sleepTimeSeconds": "10", "spark.hadoop.fs.s3.consistent.throwExceptionOnInconsistency": "true", "spark.hadoop.fs.s3.consistent.metadata.autoCreate": "true", "spark.hadoop.fs.s3.consistent.metadata.tableName": "EmrFSMetadata", "spark.hadoop.fs.s3.consistent.metadata.read.capacity": "500", "spark.hadoop.fs.s3.consistent.metadata.write.capacity": "100", "spark.hadoop.fs.s3.consistent.fastList": "true", "spark.hadoop.fs.s3.consistent.fastList.prefetchMetadata": "false", "spark.hadoop.fs.s3.consistent.notification.CloudWatch": "false", "spark.hadoop.fs.s3.consistent.notification.SQS": "false" Thanks, Aniket On Fri Jan 30 2015 at 23:29:25 Aniket Bhatnagar <[email protected]> wrote: > Right. Which makes me to believe that the directory is perhaps configured > somewhere and i have missed configuring the same. The process that is > submitting jobs (basically becomes driver) is running in sudo mode and the > executors are executed by YARN. The hadoop username is configured as > 'hadoop' (default user in EMR). 
>
> On Fri, Jan 30, 2015, 11:25 PM Sven Krasser <[email protected]> wrote:
>
>> From your stacktrace, it appears that the S3 writer tries to write the
>> data to a temp file on the local file system first. Taking a guess: that
>> local directory doesn't exist, or you don't have permissions for it.
>> -Sven
>>
>> On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar <
>> [email protected]> wrote:
>>
>>> I am programmatically submitting Spark jobs in yarn-client mode on EMR.
>>> Whenever a job tries to save a file to S3, it throws the exception
>>> below. I think the issue might be that EMR is not set up properly, as I
>>> have to set all Hadoop configurations manually in SparkContext. However,
>>> I am not sure which configuration I am missing (if any).
>>>
>>> Configurations that I am using in SparkContext to set up EMRFS:
>>> "spark.hadoop.fs.s3n.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
>>> "spark.hadoop.fs.s3.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
>>> "spark.hadoop.fs.emr.configuration.version": "1.0",
>>> "spark.hadoop.fs.s3n.multipart.uploads.enabled": "true",
>>> "spark.hadoop.fs.s3.enableServerSideEncryption": "false",
>>> "spark.hadoop.fs.s3.serverSideEncryptionAlgorithm": "AES256",
>>> "spark.hadoop.fs.s3.consistent": "true",
>>> "spark.hadoop.fs.s3.consistent.retryPolicyType": "exponential",
>>> "spark.hadoop.fs.s3.consistent.retryPeriodSeconds": "10",
>>> "spark.hadoop.fs.s3.consistent.retryCount": "5",
>>> "spark.hadoop.fs.s3.maxRetries": "4",
>>> "spark.hadoop.fs.s3.sleepTimeSeconds": "10",
>>> "spark.hadoop.fs.s3.consistent.throwExceptionOnInconsistency": "true",
>>> "spark.hadoop.fs.s3.consistent.metadata.autoCreate": "true",
>>> "spark.hadoop.fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
>>> "spark.hadoop.fs.s3.consistent.metadata.read.capacity": "500",
>>> "spark.hadoop.fs.s3.consistent.metadata.write.capacity": "100",
>>> "spark.hadoop.fs.s3.consistent.fastList": "true",
>>> "spark.hadoop.fs.s3.consistent.fastList.prefetchMetadata": "false",
>>> "spark.hadoop.fs.s3.consistent.notification.CloudWatch": "false",
>>> "spark.hadoop.fs.s3.consistent.notification.SQS": "false",
>>>
>>> Exception:
>>> java.io.IOException: No such file or directory
>>> at java.io.UnixFileSystem.createFileExclusively(Native Method)
>>> at java.io.File.createNewFile(File.java:1006)
>>> at java.io.File.createTempFile(File.java:1989)
>>> at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.startNewTempFile(S3FSOutputStream.java:269)
>>> at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.writeInternal(S3FSOutputStream.java:205)
>>> at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.flush(S3FSOutputStream.java:136)
>>> at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.close(S3FSOutputStream.java:156)
>>> at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
>>> at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
>>> at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
>>> at org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.close(MultipleOutputFormat.java:116)
>>> at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
>>> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
>>> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Hints? Suggestions?
>>> >> >> >> >> -- >> http://sites.google.com/site/krasser/?utm_source=sig >> >
