Thanks, Andy.

I am indeed often doing something similar now -- copying data locally
rather than dealing with the S3 implementation selection and AWS credentials
issues. It'd be nice if it worked a little more easily out of the box, though!


On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi Everett
>
> I always do my initial data exploration and all our product development in
> my local dev env. I typically select a small data set and copy it to my
> local machine.
>
> My main() has an optional command line argument '--runLocal'. Normally I
> load data from either hdfs:/// or s3n://. If the arg is set, I read from
> file:/// instead.
>
> Sometimes I use a CLI arg '--dataFileURL'.
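>
> A rough sketch of that pattern (the argument handling and paths here are
> just illustrative, not my actual code):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   object ExploreJob {
>     def main(args: Array[String]): Unit = {
>       // Hypothetical flag handling mirroring --runLocal / --dataFileURL above.
>       val runLocal = args.contains("--runLocal")
>       val dataFileURL = args.sliding(2).collectFirst {
>         case Array("--dataFileURL", url) => url
>       }.getOrElse(
>         if (runLocal) "file:///tmp/sample-data/"   // small local copy
>         else "s3n://my-bucket/path/"               // hypothetical bucket/path
>       )
>
>       // The master comes from spark-submit (e.g. local[*] in the dev env).
>       val sc = new SparkContext(new SparkConf().setAppName("ExploreJob"))
>       val lines = sc.textFile(dataFileURL) // same code for file://, hdfs://, s3n://
>       println(s"Read ${lines.count()} lines from $dataFileURL")
>       sc.stop()
>     }
>   }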
>
> So in your case I would log into my data cluster and use "aws s3 cp" to
> copy the data into my cluster, and then use "scp" to copy the data from the
> data center back to my local env.
>
> Andy
>
> From: Everett Anderson <ever...@nuna.com.INVALID>
> Date: Tuesday, July 19, 2016 at 2:30 PM
> To: "user @spark" <user@spark.apache.org>
> Subject: Role-based S3 access outside of EMR
>
> Hi,
>
> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
> FileSystem implementation for s3:// URLs and seems to install the
> necessary S3 credentials properties, as well.
>
> During development, though, it's often nice to run outside of a cluster,
> even with the "local" Spark master, and I've found that setup to be more
> troublesome. I'm curious whether I'm doing this the right way.
>
> There are two issues -- AWS credentials and finding the right combination
> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
>
> *Credentials and Hadoop Configuration*
>
> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
> properties in Hadoop XML config files, but it seems better practice to rely
> on machine roles and not expose these.
>
> What I end up doing in code, when not running on EMR, is creating a
> DefaultAWSCredentialsProviderChain
> <https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html>
> and then using it to install the following properties in the Hadoop
> Configuration:
>
> fs.s3.awsAccessKeyId
> fs.s3n.awsAccessKeyId
> fs.s3a.awsAccessKeyId
> fs.s3.awsSecretAccessKey
> fs.s3n.awsSecretAccessKey
> fs.s3a.awsSecretAccessKey
>
> I also set the fs.s3.impl and fs.s3n.impl properties to
> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
> implementation since people usually use "s3://" URIs.
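>
> Roughly, the sketch looks like this (simplified; the helper name and wiring
> are illustrative, but the properties are the ones listed above):
>
>   import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>   import org.apache.hadoop.conf.Configuration
>
>   // Resolves credentials from the environment or machine role via the
>   // SDK's standard lookup order, without hard-coding any keys.
>   def installS3Config(hadoopConf: Configuration): Unit = {
>     val creds = new DefaultAWSCredentialsProviderChain().getCredentials
>     for (scheme <- Seq("s3", "s3n", "s3a")) {
>       hadoopConf.set(s"fs.$scheme.awsAccessKeyId", creds.getAWSAccessKeyId)
>       hadoopConf.set(s"fs.$scheme.awsSecretAccessKey", creds.getAWSSecretKey)
>     }
>     // Route s3:// and s3n:// URIs through the S3A implementation.
>     hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>     hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   }
>
>   // e.g. installS3Config(sc.hadoopConfiguration) before reading any S3 paths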
>
> *SDK and File System Dependencies*
>
> A specific combination
> <https://issues.apache.org/jira/browse/HADOOP-12420> of the Hadoop
> version, AWS SDK version, and hadoop-aws version is necessary.
>
> One S3A combination that works for me with Spark 1.6.1 + Hadoop 2.7.x is:
>
> --packages
> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
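>
> For a project build instead of --packages, the same coordinates would look
> roughly like this in a build.sbt (a sketch, not from an actual build file):
>
>   libraryDependencies ++= Seq(
>     "com.amazonaws" % "aws-java-sdk" % "1.7.4",
>     "org.apache.hadoop" % "hadoop-aws" % "2.7.2"
>   )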
>
> Is this generally what people do? Is there a better way?
>
> I realize this isn't entirely a Spark-specific problem, but since so many
> people seem to be using S3 with Spark, I imagine this community has faced
> the problem a lot.
>
> Thanks!
>
> - Everett
>
>
