Thanks, Andy. I am indeed often doing something similar now -- copying data locally rather than dealing with the S3 implementation selection and AWS credentials issues. It'd be nice if it worked a little more smoothly out of the box, though!
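For anyone following along, the property wiring described in my quoted message below might look roughly like the following sketch. The class and method names (S3HadoopProps, credentialProps, implProps) are hypothetical, and this only builds the property map; in a real Spark app the values would come from a DefaultAWSCredentialsProviderChain (AWS SDK v1) and each entry would be applied with sparkContext.hadoopConfiguration().set(key, value):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper sketching the Hadoop property wiring discussed in the
// thread. In a real Spark app, obtain accessKeyId/secretAccessKey from a
// DefaultAWSCredentialsProviderChain and apply each entry with
// sparkContext.hadoopConfiguration().set(key, value).
public class S3HadoopProps {

    private static final String[] SCHEMES = {"s3", "s3n", "s3a"};
    private static final String S3A_IMPL =
            "org.apache.hadoop.fs.s3a.S3AFileSystem";

    // Credential properties for the s3, s3n, and s3a schemes.
    public static Map<String, String> credentialProps(
            String accessKeyId, String secretAccessKey) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String scheme : SCHEMES) {
            props.put("fs." + scheme + ".awsAccessKeyId", accessKeyId);
            props.put("fs." + scheme + ".awsSecretAccessKey", secretAccessKey);
        }
        return props;
    }

    // Force the s3:// and s3n:// schemes onto the S3A implementation,
    // since URIs in the wild are usually written s3://.
    public static Map<String, String> implProps() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("fs.s3.impl", S3A_IMPL);
        props.put("fs.s3n.impl", S3A_IMPL);
        return props;
    }
}
```

This keeps credentials out of environment variables and XML files, at the cost of a little startup code that only runs when not on EMR.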
On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson <a...@santacruzintegration.com> wrote:

> Hi Everett
>
> I always do my initial data exploration and all our product development in
> my local dev env. I typically select a small data set and copy it to my
> local machine.
>
> My main() has an optional command line argument '--runLocal'. Normally I
> load data from either hdfs:/// or s3n://. If the arg is set I read from
> file:/// instead.
>
> Sometimes I use a CLI arg '--dataFileURL'.
>
> So in your case I would log into my data cluster, use "aws s3 cp" to copy
> the data into my cluster, and then use "scp" to copy the data from the
> data center back to my local env.
>
> Andy
>
> From: Everett Anderson <ever...@nuna.com.INVALID>
> Date: Tuesday, July 19, 2016 at 2:30 PM
> To: "user @spark" <user@spark.apache.org>
> Subject: Role-based S3 access outside of EMR
>
> Hi,
>
> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
> FileSystem implementation for s3:// URLs and seems to install the
> necessary S3 credentials properties as well.
>
> Often, though, it's nice during development to run outside of a cluster,
> even with the "local" Spark master, which I've found to be more
> troublesome. I'm curious whether I'm doing this the right way.
>
> There are two issues: AWS credentials, and finding the right combination
> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
>
> *Credentials and Hadoop Configuration*
>
> For credentials, some guides recommend setting the AWS_SECRET_ACCESS_KEY
> and AWS_ACCESS_KEY_ID environment variables or putting the corresponding
> properties in Hadoop XML config files, but it seems better practice to
> rely on machine roles and not expose these.
> What I end up doing is, in code, when not running on EMR, creating a
> DefaultAWSCredentialsProviderChain
> <https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html>
> and then using it to install the following properties in the Hadoop
> Configuration:
>
> fs.s3.awsAccessKeyId
> fs.s3n.awsAccessKeyId
> fs.s3a.awsAccessKeyId
> fs.s3.awsSecretAccessKey
> fs.s3n.awsSecretAccessKey
> fs.s3a.awsSecretAccessKey
>
> I also set the fs.s3.impl and fs.s3n.impl properties to
> org.apache.hadoop.fs.s3a.S3AFileSystem to force those schemes onto the
> S3A implementation, since people usually use "s3://" URIs.
>
> *SDK and File System Dependencies*
>
> Some special combination
> <https://issues.apache.org/jira/browse/HADOOP-12420> of the Hadoop
> version, AWS SDK version, and hadoop-aws version is necessary.
>
> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems
> to be:
>
> --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
>
> Is this generally what people do? Is there a better way?
>
> I realize this isn't entirely a Spark-specific problem, but since so many
> people seem to be using S3 with Spark, I imagine this community has faced
> the problem a lot.
>
> Thanks!
>
> - Everett
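For completeness, the dependency combination quoted above would be passed on the command line roughly like this (spark-shell shown here as an assumed example; spark-submit takes the same flag):

```shell
# Hypothetical local invocation; adjust versions to match your Hadoop build.
spark-shell \
  --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
```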