On 2 Aug 2017, at 10:34, Riccardo Ferrari <[email protected]> wrote:
> Hi list!
>
> I am working on a PySpark streaming job (Spark 2.2.0) and I need to enable
> checkpointing. At a high level, my Python script goes like this:
>
> class StreamingJob():
>     def __init__(self, sparkContext, ..):
>         ...
>         # set S3A credentials on the live context's Hadoop configuration
>         sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key', ....)
>         sparkContext._jsc.hadoopConfiguration().set('fs.s3a.secret.key', ....)
>
>     def doJob(self):
>         ssc = StreamingContext.getOrCreate('<S3-location>', <function to create ssc>)
> and I run it:
>
> myJob = StreamingJob(...)
> myJob.doJob()
> The problem is that StreamingContext.getOrCreate cannot access the Hadoop
> configuration set in the constructor, and recovering from the checkpoint
> fails with:
>
> "com.amazonaws.AmazonClientException: Unable to load AWS credentials from
> any provider in the chain"
> If I export the AWS credentials as environment variables before starting
> the script, it works!
Spark magically copies the env vars over for you when you launch a job.
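
For example, before launching the script (a minimal sketch; AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY are the standard AWS SDK variable names, and the
script name is a placeholder):

export AWS_ACCESS_KEY_ID=<access key>
export AWS_SECRET_ACCESS_KEY=<secret key>
spark-submit my_streaming_job.py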
> I see the Scala version of getOrCreate has an option to provide the Hadoop
> configuration, but that is not available in Python.
>
> I don't have a full Hadoop installation, just Spark, so I don't really want
> to configure Hadoop's XML files and such.
Set the spark.hadoop.* options in spark-defaults.conf; every such option is
copied into the Hadoop configuration when you set up the context:

spark.hadoop.fs.s3a.access.key=<access key>
spark.hadoop.fs.s3a.secret.key=<secret key>
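
Equivalently, the same properties can be passed at submit time with --conf
(the script name is a placeholder):

spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=<access key> \
  --conf spark.hadoop.fs.s3a.secret.key=<secret key> \
  my_streaming_job.py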
Reminder: do keep your secret key a secret, and avoid checking it into any
form of revision control.
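
Putting it together, a minimal sketch of the job structure once the
credentials are supplied at submit time as above (the bucket, app name,
batch interval, and the socket source are all placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT = 's3a://my-bucket/checkpoints'  # hypothetical location

def create_ssc():
    # Runs only when no checkpoint exists yet; on restart,
    # getOrCreate rebuilds everything from the checkpoint instead.
    sc = SparkContext(appName='my-streaming-job')
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint(CHECKPOINT)
    lines = ssc.socketTextStream('localhost', 9999)  # placeholder source
    lines.pprint()
    return ssc

# The checkpoint is read here with a Hadoop configuration derived from
# Spark's own config (hence spark.hadoop.* at submit time), not from a
# hadoopConfiguration() mutated on a live context.
ssc = StreamingContext.getOrCreate(CHECKPOINT, create_ssc)
ssc.start()
ssc.awaitTermination()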