Hello,

I'm attempting to run Spark within a Docker container with the hope of
eventually running Spark on Kubernetes. Nearly all the data we currently
process with Spark is stored in S3, so I need to be able to interface with
it using the S3A filesystem.

I feel like I've gotten close to getting this working, but for some reason
I can't yet get my local Spark installations to interface with S3 correctly.

A basic example of what I've tried:

   - Build the Kubernetes Docker image by downloading the
   spark-2.4.5-bin-hadoop2.7.tgz archive and building the
   kubernetes/dockerfiles/spark/Dockerfile image from it.
   - Run an interactive Docker container from the image built above.
   - Within that container, run spark-shell, passing valid AWS credentials
   by setting spark.hadoop.fs.s3a.access.key and
   spark.hadoop.fs.s3a.secret.key with --conf flags, and pulling in the
   hadoop-aws package with --packages org.apache.hadoop:hadoop-aws:2.7.3
   (see the sketch just after this list).
   - Try to access the simple public file as outlined in the "Integration
   with Cloud Infrastructures
   <https://spark.apache.org/docs/latest/cloud-integration.html#installation>"
   documentation by running:
   sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
   - Observe this fail with a 403 Forbidden exception thrown by S3.

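Concretely, the spark-shell step looks roughly like this (the image tag and
the credential values below are placeholders):

    # Start an interactive shell in the image built from
    # kubernetes/dockerfiles/spark/Dockerfile (tag is whatever it was built with)
    docker run -it --rm spark:2.4.5 /bin/bash

    # Inside the container, launch spark-shell with the S3A credentials and
    # the hadoop-aws package
    /opt/spark/bin/spark-shell \
      --packages org.apache.hadoop:hadoop-aws:2.7.3 \
      --conf spark.hadoop.fs.s3a.access.key=<ACCESS_KEY> \
      --conf spark.hadoop.fs.s3a.secret.key=<SECRET_KEY>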

I've tried a variety of other means of setting credentials (like exporting
the standard AWS_ACCESS_KEY_ID environment variable before launching
spark-shell) and other means of building a Spark image that includes the
appropriate libraries (see this GitHub repo:
https://github.com/drboyer/spark-s3a-demo), all with the same result. I've
also tried accessing objects within our own AWS account, rather than the
object in the public landsat-pds bucket, and the same 403 error is thrown.
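
For reference, the environment-variable variant looked roughly like this
(again with placeholder values):

    # Inside the container, before launching spark-shell
    export AWS_ACCESS_KEY_ID=<ACCESS_KEY>
    export AWS_SECRET_ACCESS_KEY=<SECRET_KEY>

    /opt/spark/bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3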

Can anyone help explain why I can't seem to connect to S3 successfully from
Spark, or point me to where I could look for additional clues about what's
misconfigured? I've tried turning up the logging verbosity and didn't see
much that was particularly useful, but I'm happy to share additional log
output too.
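
For what it's worth, the extra verbosity was roughly this in
conf/log4j.properties (my best guess at the relevant logger names):

    # Log S3A filesystem and AWS SDK activity in detail
    log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
    log4j.logger.com.amazonaws=DEBUG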

Thanks for any help you can provide!

Best,
Devin Boyer
