Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Kostiantyn Kudriavtsev
Hi guys,

the only big issue with this approach:
> spark.hadoop.s3a.access.key is now visible everywhere - in logs, in the Spark 
> web UI - and is not secured at all...

On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev 
 wrote:

> thanks Jerry, it works!
> really appreciate your help 
> 
> Thank you,
> Konstantin Kudryavtsev
> 
> On Fri, Jan 1, 2016 at 4:35 PM, Jerry Lam  wrote:
> Hi Kostiantyn,
> 
> You should be able to use spark.conf to specify s3a keys.
> 
> I don't remember exactly, but you can add Hadoop properties by prefixing them 
> with spark.hadoop.*, where * is the s3a property. For instance,
> 
> spark.hadoop.s3a.access.key wudjgdueyhsj
> 
> Of course, you need to make sure the property key is right. I'm using my 
> phone so I cannot easily verify it.
> 
> Then you can specify a different user by using a different spark.conf via 
> --properties-file when running spark-submit.
> 
> HTH,
> 
> Jerry
> 
> Sent from my iPhone
> 
> On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
>> Hi Jerry,
>> 
>> what you suggested looks to be working (I put hdfs-site.xml into 
>> $SPARK_HOME/conf folder), but could you shed some light on how it can be 
>> federated per user?
>> Thanks in advance!
>> 
>> Thank you,
>> Konstantin Kudryavtsev
>> 
>> On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam  wrote:
>> Hi Kostiantyn,
>> 
>> I want to confirm that it works first by using hdfs-site.xml. If yes, you 
>> could define different spark-{user-x}.conf and source them during 
>> spark-submit. let us know if hdfs-site.xml works first. It should.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> Sent from my iPhone
>> 
>> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev 
>>  wrote:
>> 
>>> Hi Jerry,
>>> 
>>> I want to run different jobs on different S3 buckets - with different AWS creds 
>>> - on the same instances. Could you shed some light on whether it's possible to 
>>> achieve this with hdfs-site?
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
>>> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
>>> Hi Kostiantyn,
>>> 
>>> Can you define those properties in hdfs-site.xml and make sure it is 
>>> visible in the class path when you spark-submit? It looks like a conf 
>>> sourcing issue to me. 
>>> 
>>> Cheers,
>>> 
>>> Sent from my iPhone
>>> 
>>> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
>>>  wrote:
>>> 
 Chris,
 
 thanks for the hint with IAM roles, but in my case I need to run 
 different jobs with different S3 permissions on the same cluster, so this 
 approach doesn't work for me, as far as I understood it
 
 Thank you,
 Konstantin Kudryavtsev
 
 On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
 couple things:
 
 1) switch to IAM roles if at all possible - explicitly passing AWS 
 credentials is a long and lonely road in the end
 
 2) one really bad workaround/hack is to run a job that hits every worker 
 and writes the credentials to the proper location (~/.awscredentials or 
 whatever)
 
 ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
 autoscaling, but i'm mentioning it anyway as it is a temporary fix.
 
 if you switch to IAM roles, things become a lot easier as you can 
 authorize all of the EC2 instances in the cluster - and it handles 
 autoscaling very well - and at some point, you will want to autoscale.
 
 On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
  wrote:
 Chris,
 
  good question, as you can see from the code I set them up on the driver, so I 
 expect they will be propagated to all nodes, won't they?
 
 Thank you,
 Konstantin Kudryavtsev
 
 On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:
 are the credentials visible from each Worker node to all the Executor JVMs 
 on each Worker?
 
 On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev 
  wrote:
 
> Dear Spark community,
> 
> I faced the following issue when trying to access data on S3a; my code is 
> the following:
> 
> val sparkConf = new SparkConf()
> 
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
> val sqlContext = SQLContext.getOrCreate(sc)
> val df = sqlContext.read.parquet(...)
> df.count
> 
> It results in the following exception and log messages:
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: Access key or secret key is 

Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Jerry Lam
Hi Kostiantyn,

Yes. If security is a concern then this approach cannot satisfy it. The keys 
are visible in the properties files. If the goal is to hide them, you might be 
able to go a bit further with this approach. Have you looked at the Spark security page?

Best Regards,

Jerry 

Sent from my iPhone

> On 6 Jan, 2016, at 8:49 am, Kostiantyn Kudriavtsev 
>  wrote:
> 
> Hi guys,
> 
> the only one big issue with this approach:
>>> spark.hadoop.s3a.access.key  is now visible everywhere, in logs, in spark 
>>> webui and is not secured at all...
> 
>> On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev 
>>  wrote:
>> 
>> thanks Jerry, it works!
>> really appreciate your help 
>> 
>> Thank you,
>> Konstantin Kudryavtsev
>> 
>>> On Fri, Jan 1, 2016 at 4:35 PM, Jerry Lam  wrote:
>>> Hi Kostiantyn,
>>> 
>>> You should be able to use spark.conf to specify s3a keys.
>>> 
>>> I don't remember exactly but you can add hadoop properties by prefixing 
>>> spark.hadoop.*
>>> * is the s3a properties. For instance,
>>> 
>>> spark.hadoop.s3a.access.key wudjgdueyhsj
>>> 
>>> Of course, you need to make sure the property key is right. I'm using my 
>>> phone so I cannot easily verifying.
>>> 
>>> Then you can specify different user using different spark.conf via 
>>> --properties-file when spark-submit
>>> 
>>> HTH,
>>> 
>>> Jerry
>>> 
>>> Sent from my iPhone
>>> 
 On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev 
  wrote:
 
 Hi Jerry,
 
 what you suggested looks to be working (I put hdfs-site.xml into 
 $SPARK_HOME/conf folder), but could you shed some light on how it can be 
 federated per user?
 Thanks in advance!
 
 Thank you,
 Konstantin Kudryavtsev
 
> On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam  wrote:
> Hi Kostiantyn,
> 
> I want to confirm that it works first by using hdfs-site.xml. If yes, you 
> could define different spark-{user-x}.conf and source them during 
> spark-submit. let us know if hdfs-site.xml works first. It should.
> 
> Best Regards,
> 
> Jerry
> 
> Sent from my iPhone
> 
>> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev 
>>  wrote:
>> 
>> Hi Jerry,
>> 
>> I want to run different jobs on different S3 buckets - different AWS 
>> creds - on the same instances. Could you shed some light if it's 
>> possible to achieve with hdfs-site?
>> 
>> Thank you,
>> Konstantin Kudryavtsev
>> 
>>> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
>>> Hi Kostiantyn,
>>> 
>>> Can you define those properties in hdfs-site.xml and make sure it is 
>>> visible in the class path when you spark-submit? It looks like a conf 
>>> sourcing issue to me. 
>>> 
>>> Cheers,
>>> 
>>> Sent from my iPhone
>>> 
 On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
  wrote:
 
 Chris,
 
 thanks for the hist with AIM roles, but in my case  I need to run 
 different jobs with different S3 permissions on the same cluster, so 
 this approach doesn't work for me as far as I understood it
 
 Thank you,
 Konstantin Kudryavtsev
 
> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  
> wrote:
> couple things:
> 
> 1) switch to IAM roles if at all possible - explicitly passing AWS 
> credentials is a long and lonely road in the end
> 
> 2) one really bad workaround/hack is to run a job that hits every 
> worker and writes the credentials to the proper location 
> (~/.awscredentials or whatever)
> 
> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
> 
> if you switch to IAM roles, things become a lot easier as you can 
> authorize all of the EC2 instances in the cluster - and handles 
> autoscaling very well - and at some point, you will want to autoscale.
> 
>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
>>  wrote:
>> Chris,
>> 
>>  good question, as you can see from the code I set up them on 
>> driver, so I expect they will be propagated to all nodes, won't them?
>> 
>> Thank you,
>> Konstantin Kudryavtsev
>> 
>>> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  
>>> wrote:
>>> are the credentials visible from each Worker node to all the 
>>> Executor JVMs on each Worker?
>>> 
 On Dec 30, 

Re: SparkSQL integration issue with AWS S3a

2016-01-02 Thread KOSTIANTYN Kudriavtsev
thanks Jerry, it works!
really appreciate your help

Thank you,
Konstantin Kudryavtsev

On Fri, Jan 1, 2016 at 4:35 PM, Jerry Lam  wrote:

> Hi Kostiantyn,
>
> You should be able to use spark.conf to specify s3a keys.
>
> I don't remember exactly but you can add hadoop properties by prefixing
> spark.hadoop.*
> * is the s3a properties. For instance,
>
> spark.hadoop.s3a.access.key wudjgdueyhsj
>
> Of course, you need to make sure the property key is right. I'm using my
> phone so I cannot easily verifying.
>
> Then you can specify different user using different spark.conf via
> --properties-file when spark-submit
>
> HTH,
>
> Jerry
>
> Sent from my iPhone
>
> On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> Hi Jerry,
>
> what you suggested looks to be working (I put hdfs-site.xml into
> $SPARK_HOME/conf folder), but could you shed some light on how it can be
> federated per user?
> Thanks in advance!
>
> Thank you,
> Konstantin Kudryavtsev
>
> On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam  wrote:
>
>> Hi Kostiantyn,
>>
>> I want to confirm that it works first by using hdfs-site.xml. If yes, you
>> could define different spark-{user-x}.conf and source them during
>> spark-submit. let us know if hdfs-site.xml works first. It should.
>>
>> Best Regards,
>>
>> Jerry
>>
>> Sent from my iPhone
>>
>> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>> Hi Jerry,
>>
>> I want to run different jobs on different S3 buckets - different AWS
>> creds - on the same instances. Could you shed some light if it's possible
>> to achieve with hdfs-site?
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
>>
>>> Hi Kostiantyn,
>>>
>>> Can you define those properties in hdfs-site.xml and make sure it is
>>> visible in the class path when you spark-submit? It looks like a conf
>>> sourcing issue to me.
>>>
>>> Cheers,
>>>
>>> Sent from my iPhone
>>>
>>> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev <
>>> kudryavtsev.konstan...@gmail.com> wrote:
>>>
>>> Chris,
>>>
>>> thanks for the hist with AIM roles, but in my case  I need to run
>>> different jobs with different S3 permissions on the same cluster, so this
>>> approach doesn't work for me as far as I understood it
>>>
>>> Thank you,
>>> Konstantin Kudryavtsev
>>>
>>> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
>>>
 couple things:

 1) switch to IAM roles if at all possible - explicitly passing AWS
 credentials is a long and lonely road in the end

 2) one really bad workaround/hack is to run a job that hits every
 worker and writes the credentials to the proper location (~/.awscredentials
 or whatever)

 ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle
 autoscaling, but i'm mentioning it anyway as it is a temporary fix.

 if you switch to IAM roles, things become a lot easier as you can
 authorize all of the EC2 instances in the cluster - and handles autoscaling
 very well - and at some point, you will want to autoscale.

 On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev <
 kudryavtsev.konstan...@gmail.com> wrote:

> Chris,
>
>  good question, as you can see from the code I set up them on driver,
> so I expect they will be propagated to all nodes, won't them?
>
> Thank you,
> Konstantin Kudryavtsev
>
> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly 
> wrote:
>
>> are the credentials visible from each Worker node to all the Executor
>> JVMs on each Worker?
>>
>> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>> Dear Spark community,
>>
>> I faced the following issue with trying accessing data on S3a, my
>> code is the following:
>>
>> val sparkConf = new SparkConf()
>>
>> val sc = new SparkContext(sparkConf)
>> sc.hadoopConfiguration.set("fs.s3a.impl", 
>> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
>> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>>
>> val sqlContext = SQLContext.getOrCreate(sc)
>>
>> val df = sqlContext.read.parquet(...)
>>
>> df.count
>>
>>
>> It results in the following exception and log messages:
>>
>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
>> credentials from BasicAWSCredentialsProvider: *Access key or secret key 
>> is null*
>> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
>> metadata service at URL: 
>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>> 15/12/30 
>> 

Re: SparkSQL integration issue with AWS S3a

2016-01-01 Thread Jerry Lam
Hi Kostiantyn,

You should be able to use spark.conf to specify s3a keys.

I don't remember exactly, but you can add Hadoop properties by prefixing them with 
spark.hadoop.*, where * is the s3a property. For instance,

spark.hadoop.s3a.access.key wudjgdueyhsj

Of course, you need to make sure the property key is right. I'm using my phone 
so I cannot easily verify it.

Then you can specify a different user by using a different spark.conf via 
--properties-file when running spark-submit.
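
For example (purely as an illustration, not something verified on this thread's 
setup), a per-user properties file could look like the following. The underlying 
Hadoop keys, as used in the original code in this thread, are fs.s3a.access.key 
and fs.s3a.secret.key, so the Spark-prefixed form should be spark.hadoop.fs.s3a.*; 
the file name and values below are placeholders:

# spark-user-a.conf (hypothetical file)
spark.hadoop.fs.s3a.access.key   AKIA-USER-A-PLACEHOLDER
spark.hadoop.fs.s3a.secret.key   USER-A-SECRET-PLACEHOLDER

and each job would then be submitted against its own file with something like:

spark-submit --properties-file spark-user-a.conf --class com.example.MyJob my-job.jar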

HTH,

Jerry

Sent from my iPhone

> On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Hi Jerry,
> 
> what you suggested looks to be working (I put hdfs-site.xml into 
> $SPARK_HOME/conf folder), but could you shed some light on how it can be 
> federated per user?
> Thanks in advance!
> 
> Thank you,
> Konstantin Kudryavtsev
> 
>> On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam  wrote:
>> Hi Kostiantyn,
>> 
>> I want to confirm that it works first by using hdfs-site.xml. If yes, you 
>> could define different spark-{user-x}.conf and source them during 
>> spark-submit. let us know if hdfs-site.xml works first. It should.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> Sent from my iPhone
>> 
>>> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev 
>>>  wrote:
>>> 
>>> Hi Jerry,
>>> 
>>> I want to run different jobs on different S3 buckets - different AWS creds 
>>> - on the same instances. Could you shed some light if it's possible to 
>>> achieve with hdfs-site?
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
 On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
 Hi Kostiantyn,
 
 Can you define those properties in hdfs-site.xml and make sure it is 
 visible in the class path when you spark-submit? It looks like a conf 
 sourcing issue to me. 
 
 Cheers,
 
 Sent from my iPhone
 
> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Chris,
> 
> thanks for the hist with AIM roles, but in my case  I need to run 
> different jobs with different S3 permissions on the same cluster, so this 
> approach doesn't work for me as far as I understood it
> 
> Thank you,
> Konstantin Kudryavtsev
> 
>> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
>> couple things:
>> 
>> 1) switch to IAM roles if at all possible - explicitly passing AWS 
>> credentials is a long and lonely road in the end
>> 
>> 2) one really bad workaround/hack is to run a job that hits every worker 
>> and writes the credentials to the proper location (~/.awscredentials or 
>> whatever)
>> 
>> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
>> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
>> 
>> if you switch to IAM roles, things become a lot easier as you can 
>> authorize all of the EC2 instances in the cluster - and handles 
>> autoscaling very well - and at some point, you will want to autoscale.
>> 
>>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
>>>  wrote:
>>> Chris,
>>> 
>>>  good question, as you can see from the code I set up them on driver, 
>>> so I expect they will be propagated to all nodes, won't them?
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
 On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:
 are the credentials visible from each Worker node to all the Executor 
 JVMs on each Worker?
 
> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Dear Spark community,
> 
> I faced the following issue with trying accessing data on S3a, my 
> code is the following:
> 
> val sparkConf = new SparkConf()
> 
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
> val sqlContext = SQLContext.getOrCreate(sc)
> val df = sqlContext.read.parquet(...)
> df.count
> 
> It results in the following exception and log messages:
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: Access key or secret 
> key is null
> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
> metadata service at URL: 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> 

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread Steve Loughran

> On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Hi Jerry,
> 
> I want to run different jobs on different S3 buckets - different AWS creds - 
> on the same instances. Could you shed some light if it's possible to achieve 
> with hdfs-site?
> 
> Thank you,
> Konstantin Kudryavtsev
> 


The Hadoop s3a client doesn't have much (anything?) in the way of multiple 
logins. 

It'd be possible to do it by hand (create a Hadoop Configuration object, fill it 
with the credentials, and set "fs.s3a.impl.disable.cache" = true to make sure you 
weren't getting an existing cached version). 

I don't know how you'd hook that up to Spark jobs. Maybe try setting the 
credentials and that fs.s3a.impl.disable.cache flag in your Spark context to 
see if together they get picked up.
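
A minimal Scala sketch of the by-hand route (untested here and purely illustrative; 
the bucket name and key values are placeholders):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// a dedicated Hadoop configuration for one job / one set of credentials
val conf = new Configuration()
conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("fs.s3a.access.key", "AKIA-PLACEHOLDER")
conf.set("fs.s3a.secret.key", "SECRET-PLACEHOLDER")
// don't reuse a cached FileSystem instance that may carry other credentials
conf.set("fs.s3a.impl.disable.cache", "true")

// sanity-check the credentials outside Spark
val fs = FileSystem.get(new URI("s3a://some-bucket/"), conf)
println(fs.exists(new Path("s3a://some-bucket/some-prefix")))

For the SparkSQL path itself, the closest equivalent is probably setting the same 
keys (plus fs.s3a.impl.disable.cache) on sc.hadoopConfiguration, as in the original 
code in this thread.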

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread Brian London
Since you're running in standalone mode, can you try it using Spark 1.5.1
please?
On Thu, Dec 31, 2015 at 9:09 AM Steve Loughran 
wrote:

>
> > On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
> >
> > Hi Jerry,
> >
> > I want to run different jobs on different S3 buckets - different AWS
> creds - on the same instances. Could you shed some light if it's possible
> to achieve with hdfs-site?
> >
> > Thank you,
> > Konstantin Kudryavtsev
> >
>
>
> The Hadoop s3a client doesn't have much (anything?) in the way for
> multiple logins.
>
> It'd be possible to do it by hand (create a Hadoop Configuration object,
> fill with the credential, and set "fs.s3a.impl.disable.cache"= true to make
> sure you weren't getting an existing version.
>
> I don't know how you'd hook that up to spark jobs. maybe try setting the
> credentials and that fs.s3a.impl.disable.cache flag in your spark context
> to see if together they get picked up
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread KOSTIANTYN Kudriavtsev
Hi Jerry,

thanks for the hint. Could you please be more specific: how can I pass a
different spark-{usr}.conf per user during job submit, and which property I
can use to specify a custom hdfs-site.xml? I tried to google it, but didn't find
anything

Thank you,
Konstantin Kudryavtsev

On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam  wrote:

> Hi Kostiantyn,
>
> I want to confirm that it works first by using hdfs-site.xml. If yes, you
> could define different spark-{user-x}.conf and source them during
> spark-submit. let us know if hdfs-site.xml works first. It should.
>
> Best Regards,
>
> Jerry
>
> Sent from my iPhone
>
> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> Hi Jerry,
>
> I want to run different jobs on different S3 buckets - different AWS creds
> - on the same instances. Could you shed some light if it's possible to
> achieve with hdfs-site?
>
> Thank you,
> Konstantin Kudryavtsev
>
> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
>
>> Hi Kostiantyn,
>>
>> Can you define those properties in hdfs-site.xml and make sure it is
>> visible in the class path when you spark-submit? It looks like a conf
>> sourcing issue to me.
>>
>> Cheers,
>>
>> Sent from my iPhone
>>
>> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>> Chris,
>>
>> thanks for the hist with AIM roles, but in my case  I need to run
>> different jobs with different S3 permissions on the same cluster, so this
>> approach doesn't work for me as far as I understood it
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
>>
>>> couple things:
>>>
>>> 1) switch to IAM roles if at all possible - explicitly passing AWS
>>> credentials is a long and lonely road in the end
>>>
>>> 2) one really bad workaround/hack is to run a job that hits every worker
>>> and writes the credentials to the proper location (~/.awscredentials or
>>> whatever)
>>>
>>> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle
>>> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
>>>
>>> if you switch to IAM roles, things become a lot easier as you can
>>> authorize all of the EC2 instances in the cluster - and handles autoscaling
>>> very well - and at some point, you will want to autoscale.
>>>
>>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev <
>>> kudryavtsev.konstan...@gmail.com> wrote:
>>>
 Chris,

  good question, as you can see from the code I set up them on driver,
 so I expect they will be propagated to all nodes, won't them?

 Thank you,
 Konstantin Kudryavtsev

 On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:

> are the credentials visible from each Worker node to all the Executor
> JVMs on each Worker?
>
> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> Dear Spark community,
>
> I faced the following issue with trying accessing data on S3a, my code
> is the following:
>
> val sparkConf = new SparkConf()
>
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>
> val sqlContext = SQLContext.getOrCreate(sc)
>
> val df = sqlContext.read.parquet(...)
>
> df.count
>
>
> It results in the following exception and log messages:
>
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: *Access key or secret key 
> is null*
> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
> metadata service at URL: 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 
>  
> 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials 
> from InstanceProfileCredentialsProvider: The requested metadata is not 
> found at http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 
>  
> 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
> com.amazonaws.AmazonClientException: Unable to load AWS credentials from 
> any provider in the chain
>   at 
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>   at 
> 

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread KOSTIANTYN Kudriavtsev
Hi Jerry,

what you suggested looks to be working (I put hdfs-site.xml into the
$SPARK_HOME/conf folder), but could you shed some light on how it can be
federated per user?
Thanks in advance!

Thank you,
Konstantin Kudryavtsev

On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam  wrote:

> Hi Kostiantyn,
>
> I want to confirm that it works first by using hdfs-site.xml. If yes, you
> could define different spark-{user-x}.conf and source them during
> spark-submit. let us know if hdfs-site.xml works first. It should.
>
> Best Regards,
>
> Jerry
>
> Sent from my iPhone
>
> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> Hi Jerry,
>
> I want to run different jobs on different S3 buckets - different AWS creds
> - on the same instances. Could you shed some light if it's possible to
> achieve with hdfs-site?
>
> Thank you,
> Konstantin Kudryavtsev
>
> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
>
>> Hi Kostiantyn,
>>
>> Can you define those properties in hdfs-site.xml and make sure it is
>> visible in the class path when you spark-submit? It looks like a conf
>> sourcing issue to me.
>>
>> Cheers,
>>
>> Sent from my iPhone
>>
>> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>> Chris,
>>
>> thanks for the hist with AIM roles, but in my case  I need to run
>> different jobs with different S3 permissions on the same cluster, so this
>> approach doesn't work for me as far as I understood it
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
>>
>>> couple things:
>>>
>>> 1) switch to IAM roles if at all possible - explicitly passing AWS
>>> credentials is a long and lonely road in the end
>>>
>>> 2) one really bad workaround/hack is to run a job that hits every worker
>>> and writes the credentials to the proper location (~/.awscredentials or
>>> whatever)
>>>
>>> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle
>>> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
>>>
>>> if you switch to IAM roles, things become a lot easier as you can
>>> authorize all of the EC2 instances in the cluster - and handles autoscaling
>>> very well - and at some point, you will want to autoscale.
>>>
>>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev <
>>> kudryavtsev.konstan...@gmail.com> wrote:
>>>
 Chris,

  good question, as you can see from the code I set up them on driver,
 so I expect they will be propagated to all nodes, won't them?

 Thank you,
 Konstantin Kudryavtsev

 On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:

> are the credentials visible from each Worker node to all the Executor
> JVMs on each Worker?
>
> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> Dear Spark community,
>
> I faced the following issue with trying accessing data on S3a, my code
> is the following:
>
> val sparkConf = new SparkConf()
>
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>
> val sqlContext = SQLContext.getOrCreate(sc)
>
> val df = sqlContext.read.parquet(...)
>
> df.count
>
>
> It results in the following exception and log messages:
>
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: *Access key or secret key 
> is null*
> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
> metadata service at URL: 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 
>  
> 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials 
> from InstanceProfileCredentialsProvider: The requested metadata is not 
> found at http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 
>  
> 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
> com.amazonaws.AmazonClientException: Unable to load AWS credentials from 
> any provider in the chain
>   at 
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>   at 

SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Dear Spark community,

I faced the following issue when trying to access data on S3a; my code is
the following:

val sparkConf = new SparkConf()

val sc = new SparkContext(sparkConf)
sc.hadoopConfiguration.set("fs.s3a.impl",
"org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")

val sqlContext = SQLContext.getOrCreate(sc)

val df = sqlContext.read.parquet(...)

df.count


It results in the following exception and log messages:

15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load
credentials from BasicAWSCredentialsProvider: *Access key or secret
key is null*
15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance
metadata service at URL:
http://x.x.x.x/latest/meta-data/iam/security-credentials/
15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials
from InstanceProfileCredentialsProvider: The requested metadata is not
found at http://x.x.x.x/latest/meta-data/iam/security-credentials/
15/12/30 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
com.amazonaws.AmazonClientException: Unable to load AWS credentials
from any provider in the chain
at 
com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at 
com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at 
com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)


I run standalone spark 1.5.2 and using hadoop 2.7.1

any ideas/workarounds?

AWS credentials are correct for this bucket

Thank you,
Konstantin Kudryavtsev


Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
are the credentials visible from each Worker node to all the Executor JVMs on 
each Worker?

> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Dear Spark community,
> 
> I faced the following issue with trying accessing data on S3a, my code is the 
> following:
> 
> val sparkConf = new SparkConf()
> 
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
> val sqlContext = SQLContext.getOrCreate(sc)
> val df = sqlContext.read.parquet(...)
> df.count
> 
> It results in the following exception and log messages:
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: Access key or secret key is null
> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
> metadata service at URL: 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from InstanceProfileCredentialsProvider: The requested metadata 
> is not found at http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any 
> provider in the chain
>   at 
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
> 
> I run standalone spark 1.5.2 and using hadoop 2.7.1
> 
> any ideas/workarounds?
> 
> AWS credentials are correct for this bucket
> 
> Thank you,
> Konstantin Kudryavtsev


Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Chris,

 good question, as you can see from the code I set them up on the driver, so I
expect they will be propagated to all nodes, won't they?

Thank you,
Konstantin Kudryavtsev

On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:

> are the credentials visible from each Worker node to all the Executor JVMs
> on each Worker?
>
> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> Dear Spark community,
>
> I faced the following issue with trying accessing data on S3a, my code is
> the following:
>
> val sparkConf = new SparkConf()
>
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>
> val sqlContext = SQLContext.getOrCreate(sc)
>
> val df = sqlContext.read.parquet(...)
>
> df.count
>
>
> It results in the following exception and log messages:
>
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: *Access key or secret key is 
> null*
> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
> metadata service at URL: 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30  
> 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials from 
> InstanceProfileCredentialsProvider: The requested metadata is not found at 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30  
> 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any 
> provider in the chain
>   at 
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>
>
> I run standalone spark 1.5.2 and using hadoop 2.7.1
>
> any ideas/workarounds?
>
> AWS credentials are correct for this bucket
>
> Thank you,
> Konstantin Kudryavtsev
>
>



Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Blaž Šnuderl
Try setting s3 credentials using keys specified here
https://github.com/Aloisius/hadoop-s3a/blob/master/README.md

Blaz
On Dec 30, 2015 6:48 PM, "KOSTIANTYN Kudriavtsev" <
kudryavtsev.konstan...@gmail.com> wrote:

> Dear Spark community,
>
> I faced the following issue with trying accessing data on S3a, my code is
> the following:
>
> val sparkConf = new SparkConf()
>
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>
> val sqlContext = SQLContext.getOrCreate(sc)
>
> val df = sqlContext.read.parquet(...)
>
> df.count
>
>
> It results in the following exception and log messages:
>
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: *Access key or secret key is 
> null*
> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
> metadata service at URL: 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30  
> 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials from 
> InstanceProfileCredentialsProvider: The requested metadata is not found at 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30  
> 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any 
> provider in the chain
>   at 
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>
>
> I run standalone spark 1.5.2 and using hadoop 2.7.1
>
> any ideas/workarounds?
>
> AWS credentials are correct for this bucket
>
> Thank you,
> Konstantin Kudryavtsev
>


Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Hi Blaz,

I did, the same result

Thank you,
Konstantin Kudryavtsev

On Wed, Dec 30, 2015 at 12:54 PM, Blaž Šnuderl  wrote:

> Try setting s3 credentials using keys specified here
> https://github.com/Aloisius/hadoop-s3a/blob/master/README.md
>
> Blaz
> On Dec 30, 2015 6:48 PM, "KOSTIANTYN Kudriavtsev" <
> kudryavtsev.konstan...@gmail.com> wrote:
>
>> Dear Spark community,
>>
>> I faced the following issue with trying accessing data on S3a, my code is
>> the following:
>>
>> val sparkConf = new SparkConf()
>>
>> val sc = new SparkContext(sparkConf)
>> sc.hadoopConfiguration.set("fs.s3a.impl", 
>> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
>> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>>
>> val sqlContext = SQLContext.getOrCreate(sc)
>>
>> val df = sqlContext.read.parquet(...)
>>
>> df.count
>>
>>
>> It results in the following exception and log messages:
>>
>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
>> credentials from BasicAWSCredentialsProvider: *Access key or secret key is 
>> null*
>> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
>> metadata service at URL: 
>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>> 15/12/30  
>> 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials from 
>> InstanceProfileCredentialsProvider: The requested metadata is not found at 
>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>> 15/12/30  
>> 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
>> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any 
>> provider in the chain
>>  at 
>> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>  at 
>> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>>  at 
>> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>>  at 
>> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>>  at 
>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>>
>>
>> I run standalone spark 1.5.2 and using hadoop 2.7.1
>>
>> any ideas/workarounds?
>>
>> AWS credentials are correct for this bucket
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>


Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Chris Fregly
couple things:

1) switch to IAM roles if at all possible - explicitly passing AWS
credentials is a long and lonely road in the end

2) one really bad workaround/hack is to run a job that hits every worker
and writes the credentials to the proper location (~/.awscredentials or
whatever)

^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle
autoscaling, but i'm mentioning it anyway as it is a temporary fix.

if you switch to IAM roles, things become a lot easier as you can authorize
all of the EC2 instances in the cluster - and it handles autoscaling very well
- and at some point, you will want to autoscale.

On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev <
kudryavtsev.konstan...@gmail.com> wrote:

> Chris,
>
>  good question, as you can see from the code I set up them on driver, so I
> expect they will be propagated to all nodes, won't them?
>
> Thank you,
> Konstantin Kudryavtsev
>
> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:
>
>> are the credentials visible from each Worker node to all the Executor
>> JVMs on each Worker?
>>
>> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>> Dear Spark community,
>>
>> I faced the following issue with trying accessing data on S3a, my code is
>> the following:
>>
>> val sparkConf = new SparkConf()
>>
>> val sc = new SparkContext(sparkConf)
>> sc.hadoopConfiguration.set("fs.s3a.impl", 
>> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
>> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>>
>> val sqlContext = SQLContext.getOrCreate(sc)
>>
>> val df = sqlContext.read.parquet(...)
>>
>> df.count
>>
>>
>> It results in the following exception and log messages:
>>
>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
>> credentials from BasicAWSCredentialsProvider: *Access key or secret key is 
>> null*
>> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
>> metadata service at URL: 
>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>> 15/12/30  
>> 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials from 
>> InstanceProfileCredentialsProvider: The requested metadata is not found at 
>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>> 15/12/30  
>> 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
>> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any 
>> provider in the chain
>>  at 
>> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>  at 
>> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>>  at 
>> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>>  at 
>> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>>  at 
>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>>
>>
>> I run standalone spark 1.5.2 and using hadoop 2.7.1
>>
>> any ideas/workarounds?
>>
>> AWS credentials are correct for this bucket
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>>
>


-- 

*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com


Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Chris,

thanks for the hint with IAM roles, but in my case I need to run different
jobs with different S3 permissions on the same cluster, so this approach
doesn't work for me, as far as I understood it

Thank you,
Konstantin Kudryavtsev

On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:

> couple things:
>
> 1) switch to IAM roles if at all possible - explicitly passing AWS
> credentials is a long and lonely road in the end
>
> 2) one really bad workaround/hack is to run a job that hits every worker
> and writes the credentials to the proper location (~/.awscredentials or
> whatever)
>
> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle
> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
>
> if you switch to IAM roles, things become a lot easier as you can
> authorize all of the EC2 instances in the cluster - and handles autoscaling
> very well - and at some point, you will want to autoscale.
>
> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
>> Chris,
>>
>>  good question, as you can see from the code I set up them on driver, so
>> I expect they will be propagated to all nodes, won't them?
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
>> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:
>>
>>> are the credentials visible from each Worker node to all the Executor
>>> JVMs on each Worker?
>>>
>>> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev <
>>> kudryavtsev.konstan...@gmail.com> wrote:
>>>
>>> Dear Spark community,
>>>
>>> I faced the following issue with trying accessing data on S3a, my code
>>> is the following:
>>>
>>> val sparkConf = new SparkConf()
>>>
>>> val sc = new SparkContext(sparkConf)
>>> sc.hadoopConfiguration.set("fs.s3a.impl", 
>>> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>>> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
>>> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>>>
>>> val sqlContext = SQLContext.getOrCreate(sc)
>>>
>>> val df = sqlContext.read.parquet(...)
>>>
>>> df.count
>>>
>>>
>>> It results in the following exception and log messages:
>>>
>>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
>>> credentials from BasicAWSCredentialsProvider: *Access key or secret key is 
>>> null*
>>> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
>>> metadata service at URL: 
>>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>> 15/12/30 
>>>  
>>> 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials from 
>>> InstanceProfileCredentialsProvider: The requested metadata is not found at 
>>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>> 15/12/30 
>>>  
>>> 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
>>> com.amazonaws.AmazonClientException: Unable to load AWS credentials from 
>>> any provider in the chain
>>> at 
>>> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>> at 
>>> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>>> at 
>>> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>>> at 
>>> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>>> at 
>>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>>>
>>>
>>> I run standalone spark 1.5.2 and using hadoop 2.7.1
>>>
>>> any ideas/workarounds?
>>>
>>> AWS credentials are correct for this bucket
>>>
>>> Thank you,
>>> Konstantin Kudryavtsev
>>>
>>>
>>
>
>
> --
>
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com
>


Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn,

Can you define those properties in hdfs-site.xml and make sure it is visible on 
the class path when you run spark-submit? It looks like a conf sourcing issue to 
me. 
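
For reference, a minimal hdfs-site.xml carrying the s3a credentials might look 
roughly like the snippet below (property names as used in the code quoted in this 
thread; values are placeholders), dropped into a directory Spark picks Hadoop 
configuration up from, e.g. $SPARK_HOME/conf:

<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>AKIA-PLACEHOLDER</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>SECRET-PLACEHOLDER</value>
  </property>
</configuration>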

Cheers,

Sent from my iPhone

> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Chris,
> 
> thanks for the hist with AIM roles, but in my case  I need to run different 
> jobs with different S3 permissions on the same cluster, so this approach 
> doesn't work for me as far as I understood it
> 
> Thank you,
> Konstantin Kudryavtsev
> 
>> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
>> couple things:
>> 
>> 1) switch to IAM roles if at all possible - explicitly passing AWS 
>> credentials is a long and lonely road in the end
>> 
>> 2) one really bad workaround/hack is to run a job that hits every worker and 
>> writes the credentials to the proper location (~/.awscredentials or whatever)
>> 
>> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
>> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
>> 
>> if you switch to IAM roles, things become a lot easier as you can authorize 
>> all of the EC2 instances in the cluster - and handles autoscaling very well 
>> - and at some point, you will want to autoscale.
>> 
>>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
>>>  wrote:
>>> Chris,
>>> 
>>>  good question, as you can see from the code I set up them on driver, so I 
>>> expect they will be propagated to all nodes, won't them?
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
 On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:
 are the credentials visible from each Worker node to all the Executor JVMs 
 on each Worker?
 
> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Dear Spark community,
> 
> I faced the following issue with trying accessing data on S3a, my code is 
> the following:
> 
> val sparkConf = new SparkConf()
> 
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
> val sqlContext = SQLContext.getOrCreate(sc)
> val df = sqlContext.read.parquet(...)
> df.count
> 
> It results in the following exception and log messages:
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: Access key or secret key is 
> null
> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
> metadata service at URL: 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from InstanceProfileCredentialsProvider: The requested 
> metadata is not found at 
> http://x.x.x.x/latest/meta-data/iam/security-credentials/
> 15/12/30 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 
> 3)
> com.amazonaws.AmazonClientException: Unable to load AWS credentials from 
> any provider in the chain
>   at 
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>   at 
> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>   at 
> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
> 
> I run standalone spark 1.5.2 and using hadoop 2.7.1
> 
> any ideas/workarounds?
> 
> AWS credentials are correct for this bucket
> 
> Thank you,
> Konstantin Kudryavtsev
>> 
>> 
>> 
>> -- 
>> 
>> Chris Fregly
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
> 


Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread KOSTIANTYN Kudriavtsev
Hi Jerry,

I want to run different jobs on different S3 buckets - with different AWS creds
- on the same instances. Could you shed some light on whether it's possible to
achieve this with hdfs-site?

Thank you,
Konstantin Kudryavtsev

On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:

> Hi Kostiantyn,
>
> Can you define those properties in hdfs-site.xml and make sure it is
> visible in the class path when you spark-submit? It looks like a conf
> sourcing issue to me.
>
> Cheers,
>
> Sent from my iPhone
>
> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev <
> kudryavtsev.konstan...@gmail.com> wrote:
>
> Chris,
>
> thanks for the hist with AIM roles, but in my case  I need to run
> different jobs with different S3 permissions on the same cluster, so this
> approach doesn't work for me as far as I understood it
>
> Thank you,
> Konstantin Kudryavtsev
>
> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
>
>> couple things:
>>
>> 1) switch to IAM roles if at all possible - explicitly passing AWS
>> credentials is a long and lonely road in the end
>>
>> 2) one really bad workaround/hack is to run a job that hits every worker
>> and writes the credentials to the proper location (~/.awscredentials or
>> whatever)
>>
>> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle
>> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
>>
>> if you switch to IAM roles, things become a lot easier as you can
>> authorize all of the EC2 instances in the cluster - and handles autoscaling
>> very well - and at some point, you will want to autoscale.
>>
>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev <
>> kudryavtsev.konstan...@gmail.com> wrote:
>>
>>> Chris,
>>>
>>>  good question, as you can see from the code I set up them on driver, so
>>> I expect they will be propagated to all nodes, won't them?
>>>
>>> Thank you,
>>> Konstantin Kudryavtsev
>>>
>>> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:
>>>
 are the credentials visible from each Worker node to all the Executor
 JVMs on each Worker?

 On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev <
 kudryavtsev.konstan...@gmail.com> wrote:

 Dear Spark community,

 I faced the following issue with trying accessing data on S3a, my code
 is the following:

 val sparkConf = new SparkConf()

 val sc = new SparkContext(sparkConf)
 sc.hadoopConfiguration.set("fs.s3a.impl", 
 "org.apache.hadoop.fs.s3a.S3AFileSystem")
 sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
 sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")

 val sqlContext = SQLContext.getOrCreate(sc)

 val df = sqlContext.read.parquet(...)

 df.count


 It results in the following exception and log messages:

 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
 credentials from BasicAWSCredentialsProvider: *Access key or secret key is 
 null*
 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
 metadata service at URL: 
 http://x.x.x.x/latest/meta-data/iam/security-credentials/
 15/12/30 
  
 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials 
 from InstanceProfileCredentialsProvider: The requested metadata is not 
 found at http://x.x.x.x/latest/meta-data/iam/security-credentials/
 15/12/30 
  
 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
 com.amazonaws.AmazonClientException: Unable to load AWS credentials from 
 any provider in the chain
at 
 com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at 
 com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at 
 com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at 
 com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at 
 org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)


 I run standalone spark 1.5.2 and using hadoop 2.7.1

 any ideas/workarounds?

 AWS credentials are correct for this bucket

 Thank you,
 Konstantin Kudryavtsev


>>>
>>
>>
>> --
>>
>> *Chris Fregly*
>> Principal Data Solutions Engineer
>> IBM Spark Technology Center, San Francisco, CA
>> http://spark.tc | http://advancedspark.com
>>
>
>


Re: SparkSQL integration issue with AWS S3a

2015-12-30 Thread Jerry Lam
Hi Kostiantyn,

I want to confirm that it works first by using hdfs-site.xml. If yes, you could 
define different spark-{user-x}.conf files and source them during spark-submit. Let 
us know if hdfs-site.xml works first. It should.
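
As a rough sketch of that second step (the file names here are hypothetical), each 
user gets their own properties file and each job is submitted against it:

spark-submit --properties-file conf/spark-user-a.conf ...   # job reading user A's bucket
spark-submit --properties-file conf/spark-user-b.conf ...   # job reading user B's bucket

where each spark-{user-x}.conf carries that user's s3a credentials (for example via 
the spark.hadoop.* prefix discussed elsewhere in this thread).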

Best Regards,

Jerry

Sent from my iPhone

> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
> Hi Jerry,
> 
> I want to run different jobs on different S3 buckets - different AWS creds - 
> on the same instances. Could you shed some light if it's possible to achieve 
> with hdfs-site?
> 
> Thank you,
> Konstantin Kudryavtsev
> 
>> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
>> Hi Kostiantyn,
>> 
>> Can you define those properties in hdfs-site.xml and make sure it is visible 
>> in the class path when you spark-submit? It looks like a conf sourcing issue 
>> to me. 
>> 
>> Cheers,
>> 
>> Sent from my iPhone
>> 
>>> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
>>>  wrote:
>>> 
>>> Chris,
>>> 
>>> thanks for the hist with AIM roles, but in my case  I need to run different 
>>> jobs with different S3 permissions on the same cluster, so this approach 
>>> doesn't work for me as far as I understood it
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
 On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
 couple things:
 
 1) switch to IAM roles if at all possible - explicitly passing AWS 
 credentials is a long and lonely road in the end
 
 2) one really bad workaround/hack is to run a job that hits every worker 
 and writes the credentials to the proper location (~/.awscredentials or 
 whatever)
 
 ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
 autoscaling, but i'm mentioning it anyway as it is a temporary fix.
 
 if you switch to IAM roles, things become a lot easier as you can 
 authorize all of the EC2 instances in the cluster - and handles 
 autoscaling very well - and at some point, you will want to autoscale.
 
> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
>  wrote:
> Chris,
> 
>  good question, as you can see from the code I set up them on driver, so 
> I expect they will be propagated to all nodes, won't them?
> 
> Thank you,
> Konstantin Kudryavtsev
> 
>> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:
>> are the credentials visible from each Worker node to all the Executor 
>> JVMs on each Worker?
>> 
>>> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev 
>>>  wrote:
>>> 
>>> Dear Spark community,
>>> 
>>> I faced the following issue with trying accessing data on S3a, my code 
>>> is the following:
>>> 
>>> val sparkConf = new SparkConf()
>>> 
>>> val sc = new SparkContext(sparkConf)
>>> sc.hadoopConfiguration.set("fs.s3a.impl", 
>>> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>>> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
>>> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>>> val sqlContext = SQLContext.getOrCreate(sc)
>>> val df = sqlContext.read.parquet(...)
>>> df.count
>>> 
>>> It results in the following exception and log messages:
>>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
>>> credentials from BasicAWSCredentialsProvider: Access key or secret key 
>>> is null
>>> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance 
>>> metadata service at URL: 
>>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
>>> credentials from InstanceProfileCredentialsProvider: The requested 
>>> metadata is not found at 
>>> http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>> 15/12/30 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 
>>> (TID 3)
>>> com.amazonaws.AmazonClientException: Unable to load AWS credentials 
>>> from any provider in the chain
>>> at 
>>> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>> at 
>>> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>>> at 
>>> com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>>> at 
>>> com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>>> at 
>>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>>> 
>>> I run standalone spark 1.5.2 and using hadoop 2.7.1
>>> 
>>> any ideas/workarounds?
>>> 
>>> AWS credentials are correct for this bucket
>>> 
>>> Thank you,
>>> Konstantin