Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-02-02 Thread Aniket Bhatnagar
Alright... I found the issue: I wasn't setting the fs.s3.buffer.dir
configuration. Here is the final Spark conf snippet that works:


spark.hadoop.fs.s3n.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
spark.hadoop.fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
spark.hadoop.fs.s3bfs.impl: org.apache.hadoop.fs.s3.S3FileSystem,
spark.hadoop.fs.s3.buffer.dir: /mnt/var/lib/hadoop/s3,/mnt1/var/lib/hadoop/s3,
spark.hadoop.fs.s3n.endpoint: s3.amazonaws.com,
spark.hadoop.fs.emr.configuration.version: 1.0,
spark.hadoop.fs.s3n.multipart.uploads.enabled: true,
spark.hadoop.fs.s3.enableServerSideEncryption: false,
spark.hadoop.fs.s3.serverSideEncryptionAlgorithm: AES256,
spark.hadoop.fs.s3.consistent: true,
spark.hadoop.fs.s3.consistent.retryPolicyType: exponential,
spark.hadoop.fs.s3.consistent.retryPeriodSeconds: 10,
spark.hadoop.fs.s3.consistent.retryCount: 5,
spark.hadoop.fs.s3.maxRetries: 4,
spark.hadoop.fs.s3.sleepTimeSeconds: 10,
spark.hadoop.fs.s3.consistent.throwExceptionOnInconsistency: true,
spark.hadoop.fs.s3.consistent.metadata.autoCreate: true,
spark.hadoop.fs.s3.consistent.metadata.tableName: EmrFSMetadata,
spark.hadoop.fs.s3.consistent.metadata.read.capacity: 500,
spark.hadoop.fs.s3.consistent.metadata.write.capacity: 100,
spark.hadoop.fs.s3.consistent.fastList: true,
spark.hadoop.fs.s3.consistent.fastList.prefetchMetadata: false,
spark.hadoop.fs.s3.consistent.notification.CloudWatch: false,
spark.hadoop.fs.s3.consistent.notification.SQS: false
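
For reference, here is a rough sketch of how these settings can be applied
programmatically when building the SparkContext (a minimal illustration, not
my exact driver code; the app name is a placeholder). Spark copies any
spark.hadoop.*-prefixed property into the Hadoop Configuration it hands to
the file system layer:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("emrfs-example")   // placeholder app name
  .setMaster("yarn-client")      // jobs are submitted in yarn-client mode on EMR
  .set("spark.hadoop.fs.s3.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
  .set("spark.hadoop.fs.s3n.impl", "com.amazon.ws.emr.hadoop.fs.EmrFileSystem")
  // The setting I was missing: local directories EMRFS buffers to before uploading to S3.
  .set("spark.hadoop.fs.s3.buffer.dir",
       "/mnt/var/lib/hadoop/s3,/mnt1/var/lib/hadoop/s3")
  .set("spark.hadoop.fs.s3.consistent", "true")
  .set("spark.hadoop.fs.s3.consistent.metadata.tableName", "EmrFSMetadata")
  // ...plus the remaining fs.s3.* settings listed above...

val sc = new SparkContext(conf)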

Thanks,
Aniket


On Fri Jan 30 2015 at 23:29:25 Aniket Bhatnagar aniket.bhatna...@gmail.com
wrote:

 Right. Which makes me believe that the directory is perhaps configured
 somewhere and I have missed configuring it. The process that submits the
 jobs (and effectively becomes the driver) runs under sudo, and the
 executors are launched by YARN. The Hadoop username is configured as
 'hadoop' (the default user on EMR).

 On Fri, Jan 30, 2015, 11:25 PM Sven Krasser kras...@gmail.com wrote:

 From your stacktrace it appears that the S3 writer tries to write the
 data to a temp file on the local file system first. Taking a guess, that
 local directory doesn't exist or you don't have permission to write to it.
 -Sven

 On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I am programmatically submitting Spark jobs in yarn-client mode on EMR.
 Whenever a job tries to save a file to S3, it throws the exception below.
 I think the issue might be that EMR is not set up properly, as I have to
 set all Hadoop configurations manually in the SparkContext. However, I am
 not sure which configuration (if any) I am missing.

 Configurations that I am using in the SparkContext to set up EMRFS:
 spark.hadoop.fs.s3n.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
 spark.hadoop.fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
 spark.hadoop.fs.emr.configuration.version: 1.0,
 spark.hadoop.fs.s3n.multipart.uploads.enabled: true,
 spark.hadoop.fs.s3.enableServerSideEncryption: false,
 spark.hadoop.fs.s3.serverSideEncryptionAlgorithm: AES256,
 spark.hadoop.fs.s3.consistent: true,
 spark.hadoop.fs.s3.consistent.retryPolicyType: exponential,
 spark.hadoop.fs.s3.consistent.retryPeriodSeconds: 10,
 spark.hadoop.fs.s3.consistent.retryCount: 5,
 spark.hadoop.fs.s3.maxRetries: 4,
 spark.hadoop.fs.s3.sleepTimeSeconds: 10,
 spark.hadoop.fs.s3.consistent.throwExceptionOnInconsistency: true,
 spark.hadoop.fs.s3.consistent.metadata.autoCreate: true,
 spark.hadoop.fs.s3.consistent.metadata.tableName: EmrFSMetadata,
 spark.hadoop.fs.s3.consistent.metadata.read.capacity: 500,
 spark.hadoop.fs.s3.consistent.metadata.write.capacity: 100,
 spark.hadoop.fs.s3.consistent.fastList: true,
 spark.hadoop.fs.s3.consistent.fastList.prefetchMetadata: false,
 spark.hadoop.fs.s3.consistent.notification.CloudWatch: false,
 spark.hadoop.fs.s3.consistent.notification.SQS: false,

 Exception:
 java.io.IOException: No such file or directory
 at java.io.UnixFileSystem.createFileExclusively(Native Method)
 at java.io.File.createNewFile(File.java:1006)
 at java.io.File.createTempFile(File.java:1989)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.startNewTempFile(S3FSOutputStream.java:269)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.writeInternal(S3FSOutputStream.java:205)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.flush(S3FSOutputStream.java:136)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.close(S3FSOutputStream.java:156)
 at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
 at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
 at org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.close(MultipleOutputFormat.java:116)
 at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
 at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
 at

Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Sven Krasser
From your stacktrace it appears that the S3 writer tries to write the data
to a temp file on the local file system first. Taking a guess, that local
directory doesn't exist or you don't have permission to write to it.
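
A quick way to sanity-check that from within Spark (just a sketch; the
directory list below is an assumption, use whatever fs.s3.buffer.dir resolves
to on your cluster) is to probe the directories from inside a task, so the
check runs as the same user as the executors:

import java.io.File

// Probe candidate buffer directories on the executor nodes.
val dirs = Seq("/mnt/var/lib/hadoop/s3", "/mnt1/var/lib/hadoop/s3")  // assumed paths
sc.parallelize(1 to sc.defaultParallelism, sc.defaultParallelism)
  .map { _ =>
    dirs.map { d =>
      val f = new File(d)
      s"$d exists=${f.exists} writable=${f.canWrite}"
    }.mkString(", ")
  }
  .collect()
  .foreach(println)
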
-Sven

On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar 
aniket.bhatna...@gmail.com wrote:

 I am programmatically submitting Spark jobs in yarn-client mode on EMR.
 Whenever a job tries to save a file to S3, it throws the exception below.
 I think the issue might be that EMR is not set up properly, as I have to
 set all Hadoop configurations manually in the SparkContext. However, I am
 not sure which configuration (if any) I am missing.

 Configurations that I am using in the SparkContext to set up EMRFS:
 spark.hadoop.fs.s3n.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
 spark.hadoop.fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
 spark.hadoop.fs.emr.configuration.version: 1.0,
 spark.hadoop.fs.s3n.multipart.uploads.enabled: true,
 spark.hadoop.fs.s3.enableServerSideEncryption: false,
 spark.hadoop.fs.s3.serverSideEncryptionAlgorithm: AES256,
 spark.hadoop.fs.s3.consistent: true,
 spark.hadoop.fs.s3.consistent.retryPolicyType: exponential,
 spark.hadoop.fs.s3.consistent.retryPeriodSeconds: 10,
 spark.hadoop.fs.s3.consistent.retryCount: 5,
 spark.hadoop.fs.s3.maxRetries: 4,
 spark.hadoop.fs.s3.sleepTimeSeconds: 10,
 spark.hadoop.fs.s3.consistent.throwExceptionOnInconsistency: true,
 spark.hadoop.fs.s3.consistent.metadata.autoCreate: true,
 spark.hadoop.fs.s3.consistent.metadata.tableName: EmrFSMetadata,
 spark.hadoop.fs.s3.consistent.metadata.read.capacity: 500,
 spark.hadoop.fs.s3.consistent.metadata.write.capacity: 100,
 spark.hadoop.fs.s3.consistent.fastList: true,
 spark.hadoop.fs.s3.consistent.fastList.prefetchMetadata: false,
 spark.hadoop.fs.s3.consistent.notification.CloudWatch: false,
 spark.hadoop.fs.s3.consistent.notification.SQS: false,

 Exception:
 java.io.IOException: No such file or directory
 at java.io.UnixFileSystem.createFileExclusively(Native Method)
 at java.io.File.createNewFile(File.java:1006)
 at java.io.File.createTempFile(File.java:1989)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.startNewTempFile(S3FSOutputStream.java:269)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.writeInternal(S3FSOutputStream.java:205)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.flush(S3FSOutputStream.java:136)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.close(S3FSOutputStream.java:156)
 at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
 at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
 at org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.close(MultipleOutputFormat.java:116)
 at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
 at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
 at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Hints? Suggestions?




-- 
http://sites.google.com/site/krasser/?utm_source=sig


Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Aniket Bhatnagar
Right. Which makes me believe that the directory is perhaps configured
somewhere and I have missed configuring it. The process that submits the
jobs (and effectively becomes the driver) runs under sudo, and the
executors are launched by YARN. The Hadoop username is configured as
'hadoop' (the default user on EMR).
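
To see what the driver actually ends up with, here is a small sketch I can
run (assuming sc is my SparkContext) to dump the relevant keys from the
Hadoop configuration that Spark builds; any spark.hadoop.* property set on
the SparkConf should show up here without the prefix:

// Inspect the effective Hadoop configuration on the driver.
val hc = sc.hadoopConfiguration
Seq("fs.s3.impl", "fs.s3n.impl", "fs.s3.buffer.dir", "hadoop.tmp.dir")
  .foreach(k => println(s"$k = ${hc.get(k)}"))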

On Fri, Jan 30, 2015, 11:25 PM Sven Krasser kras...@gmail.com wrote:

 From your stacktrace it appears that the S3 writer tries to write the data
 to a temp file on the local file system first. Taking a guess, that local
 directory doesn't exist or you don't have permission to write to it.
 -Sven

 On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I am programmatically submitting Spark jobs in yarn-client mode on EMR.
 Whenever a job tries to save a file to S3, it throws the exception below.
 I think the issue might be that EMR is not set up properly, as I have to
 set all Hadoop configurations manually in the SparkContext. However, I am
 not sure which configuration (if any) I am missing.

 Configurations that I am using in the SparkContext to set up EMRFS:
 spark.hadoop.fs.s3n.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
 spark.hadoop.fs.s3.impl: com.amazon.ws.emr.hadoop.fs.EmrFileSystem,
 spark.hadoop.fs.emr.configuration.version: 1.0,
 spark.hadoop.fs.s3n.multipart.uploads.enabled: true,
 spark.hadoop.fs.s3.enableServerSideEncryption: false,
 spark.hadoop.fs.s3.serverSideEncryptionAlgorithm: AES256,
 spark.hadoop.fs.s3.consistent: true,
 spark.hadoop.fs.s3.consistent.retryPolicyType: exponential,
 spark.hadoop.fs.s3.consistent.retryPeriodSeconds: 10,
 spark.hadoop.fs.s3.consistent.retryCount: 5,
 spark.hadoop.fs.s3.maxRetries: 4,
 spark.hadoop.fs.s3.sleepTimeSeconds: 10,
 spark.hadoop.fs.s3.consistent.throwExceptionOnInconsistency: true,
 spark.hadoop.fs.s3.consistent.metadata.autoCreate: true,
 spark.hadoop.fs.s3.consistent.metadata.tableName: EmrFSMetadata,
 spark.hadoop.fs.s3.consistent.metadata.read.capacity: 500,
 spark.hadoop.fs.s3.consistent.metadata.write.capacity: 100,
 spark.hadoop.fs.s3.consistent.fastList: true,
 spark.hadoop.fs.s3.consistent.fastList.prefetchMetadata: false,
 spark.hadoop.fs.s3.consistent.notification.CloudWatch: false,
 spark.hadoop.fs.s3.consistent.notification.SQS: false,

 Exception:
 java.io.IOException: No such file or directory
 at java.io.UnixFileSystem.createFileExclusively(Native Method)
 at java.io.File.createNewFile(File.java:1006)
 at java.io.File.createTempFile(File.java:1989)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.startNewTempFile(S3FSOutputStream.java:269)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.writeInternal(S3FSOutputStream.java:205)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.flush(S3FSOutputStream.java:136)
 at com.amazon.ws.emr.hadoop.fs.s3.S3FSOutputStream.close(S3FSOutputStream.java:156)
 at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
 at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
 at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
 at org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.close(MultipleOutputFormat.java:116)
 at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
 at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1068)
 at org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1047)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Hints? Suggestions?




 --
 http://sites.google.com/site/krasser/?utm_source=sig