But that would mean accessing the data over the internet, which increases
read latency and the risk of transmission failures. Why are you not using EMR?


On Thu, Jul 21, 2016 at 1:06 AM, Everett Anderson <ever...@nuna.com.invalid>
wrote:

> Thanks, Andy.
> I am indeed often doing something similar, now -- copying data locally
> rather than dealing with the S3 impl selection and AWS credentials issues.
> It'd be nice if it worked a little easier out of the box, though!
> On Tue, Jul 19, 2016 at 2:47 PM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>> Hi Everett
>> I always do my initial data exploration and all our product development
>> in my local dev env. I typically select a small data set and copy it to my
>> local machine.
>> My main() has an optional command line argument '--runLocal'. Normally I
>> load data from either hdfs:/// or s3n://. If the arg is set, I read from
>> file:///.
>> Sometimes I use a CLI arg '--dataFileURL'.
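The flag handling described above could look roughly like the following. This is a minimal sketch, not Andy's actual code: the function name `resolve_input_url` and both default URLs are hypothetical, and letting '--dataFileURL' take precedence over '--runLocal' is an assumption.

```python
import argparse

# Hypothetical default locations; neither path comes from the thread.
CLUSTER_URL = "s3n://example-bucket/data/input"
LOCAL_URL = "file:///tmp/sample/input"


def resolve_input_url(argv):
    """Pick the input URL from the two command-line flags."""
    parser = argparse.ArgumentParser()
    # Optional switch: read the small local copy instead of cluster data.
    parser.add_argument("--runLocal", action="store_true")
    # Optional explicit URL; assumed here to override --runLocal.
    parser.add_argument("--dataFileURL", default=None)
    args = parser.parse_args(argv)
    if args.dataFileURL:
        return args.dataFileURL
    return LOCAL_URL if args.runLocal else CLUSTER_URL
```

Calling `resolve_input_url(sys.argv[1:])` from main() keeps the rest of the job unaware of where the data lives.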
>> So in your case I would log into my data cluster and use "aws s3 cp" to
>> copy the data into my cluster and then use "scp" to copy the data from the
>> data center back to my local env.
>> Andy
>> From: Everett Anderson <ever...@nuna.com.INVALID>
>> Date: Tuesday, July 19, 2016 at 2:30 PM
>> To: "user @spark" <user@spark.apache.org>
>> Subject: Role-based S3 access outside of EMR
>> Hi,
>> When running on EMR, AWS configures Hadoop to use their EMRFS Hadoop
>> FileSystem implementation for s3:// URLs and seems to install the
>> necessary S3 credentials properties, as well.
>> Often, it's nice during development to run outside of a cluster even with
>> the "local" Spark master, though, which I've found to be more troublesome.
>> I'm curious if I'm doing this the right way.
>> There are two issues -- AWS credentials and finding the right combination
>> of compatible AWS SDK and Hadoop S3 FileSystem dependencies.
>> *Credentials and Hadoop Configuration*
>> For credentials, some guides recommend setting AWS_SECRET_ACCESS_KEY and
>> AWS_ACCESS_KEY_ID environment variables or putting the corresponding
>> properties in Hadoop XML config files, but it seems better practice to rely
>> on machine roles and not expose these.
>> What I end up doing is, in code, when not running on EMR, creating a
>> DefaultAWSCredentialsProviderChain
>> <https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html>
>> and then installing the following properties in the Hadoop Configuration
>> using it:
>> fs.s3.awsAccessKeyId
>> fs.s3n.awsAccessKeyId
>> fs.s3a.awsAccessKeyId
>> fs.s3.awsSecretAccessKey
>> fs.s3n.awsSecretAccessKey
>> fs.s3a.awsSecretAccessKey
>> I also set the fs.s3.impl and fs.s3n.impl properties to
>> org.apache.hadoop.fs.s3a.S3AFileSystem to force them to use the S3A
>> implementation since people usually use "s3://" URIs.
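The property set described above can be sketched as a plain mapping. This is only an illustration under stated assumptions: the function name `s3_hadoop_properties` is hypothetical, the property names are copied as listed in the message, and the key and secret are passed in as strings rather than resolved through DefaultAWSCredentialsProviderChain.

```python
S3_SCHEMES = ("s3", "s3n", "s3a")
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def s3_hadoop_properties(access_key_id, secret_access_key):
    """Build the Hadoop properties listed in the message as a dict."""
    props = {}
    # Same credentials for every scheme, using the names from the message.
    for scheme in S3_SCHEMES:
        props["fs.%s.awsAccessKeyId" % scheme] = access_key_id
        props["fs.%s.awsSecretAccessKey" % scheme] = secret_access_key
    # Route s3:// and s3n:// URIs to the S3A implementation.
    for scheme in ("s3", "s3n"):
        props["fs.%s.impl" % scheme] = S3A_IMPL
    return props
```

One way to apply the result, not mentioned in the thread, is to pass each pair to Spark as a `spark.hadoop.`-prefixed configuration property, which Spark copies into the Hadoop Configuration.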
>> *SDK and File System Dependencies*
>> Some special combination
>> <https://issues.apache.org/jira/browse/HADOOP-12420> of the Hadoop
>> version, AWS SDK version, and hadoop-aws is necessary.
>> One working S3A combination with Spark 1.6.1 + Hadoop 2.7.x for me seems
>> to be with
>> --packages
>> com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.2
>> Is this generally what people do? Is there a better way?
>> I realize this isn't entirely a Spark-specific problem, but as so many
>> people seem to be using S3 with Spark, I imagine this community's faced the
>> problem a lot.
>> Thanks!
>> - Everett
