Hello, I'm attempting to run Spark within a Docker container with the hope of eventually running Spark on Kubernetes. Nearly all the data we currently process with Spark is stored in S3, so I need to be able to interface with it using the S3A filesystem.
I feel like I've gotten close to getting this working, but for some reason I cannot yet get my local Spark installations to interface correctly with S3. A basic example of what I've tried:

- Build Kubernetes Docker images by downloading the spark-2.4.5-bin-hadoop2.7.tgz archive and building the kubernetes/dockerfiles/spark/Dockerfile image.
- Run an interactive Docker container using the image built above.
- Within that container, run spark-shell. This command passes valid AWS credentials by setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key using --conf flags, and downloads the hadoop-aws package by specifying the --packages org.apache.hadoop:hadoop-aws:2.7.3 flag.
- Try to access the simple public file as outlined in the "Integration with Cloud Infrastructures <https://spark.apache.org/docs/latest/cloud-integration.html#installation>" documentation by running: sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
- Observe that this fails with a 403 Forbidden exception thrown by S3.

I've tried a variety of other means of setting credentials (like exporting the standard AWS_ACCESS_KEY_ID environment variable before launching spark-shell), and other means of building a Spark image and including the appropriate libraries (see this GitHub repo: https://github.com/drboyer/spark-s3a-demo), all with the same results. I've also tried accessing objects within our AWS account, rather than the object from the public landsat-pds bucket, and the same 403 error is thrown.

Can anyone help explain why I can't seem to connect to S3 successfully using Spark, or point me to where I could look for additional clues as to what's misconfigured? I've tried turning up the logging verbosity and didn't see much that was particularly useful, but I'm happy to share additional log output too.

Thanks for any help you can provide!

Best,
Devin Boyer
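
P.S. For concreteness, the spark-shell launch described in the steps above can be sketched roughly as follows. The --conf properties and the hadoop-aws coordinates are the ones named above. The variable names ACCESS_KEY and SECRET_KEY, the REPLACE_ME placeholders, and the dry-run fallback are assumptions added for illustration, so this is not necessarily the exact command that was run.

```shell
#!/bin/sh
# Hypothetical placeholders: read the keys from the standard AWS environment
# variables if they are set, otherwise fall back to an obvious dummy value.
ACCESS_KEY="${AWS_ACCESS_KEY_ID:-REPLACE_ME}"
SECRET_KEY="${AWS_SECRET_ACCESS_KEY:-REPLACE_ME}"

# Collect the arguments as positional parameters so each one stays a single word.
set -- \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf "spark.hadoop.fs.s3a.access.key=${ACCESS_KEY}" \
  --conf "spark.hadoop.fs.s3a.secret.key=${SECRET_KEY}"

# Launch only when spark-shell is on the PATH (e.g. inside the container);
# otherwise just print the command so it can be reviewed.
if command -v spark-shell >/dev/null 2>&1; then
  spark-shell "$@"
else
  echo "spark-shell $*"
fi
```

Once the shell is up, the read from the steps above is sc.textFile("s3a://landsat-pds/scene_list.gz").take(5).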