Hi,

I'm struggling to figure out the best way to make Python Beam jobs execute on a Spark cluster running on Kubernetes. Unfortunately, the available documentation is incomplete and confusing at best.

The most flexible way I found is to build a self-executing JAR from my Python pipeline and submit that via spark-submit. Unfortunately, this path seems to be extremely buggy, and I cannot get logs from the SDK harness containers forwarded through the Spark executors back to the driver. See: https://github.com/apache/beam/issues/29683
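
For context, this is roughly the workflow I have been using (module name, image tag and paths below are placeholders, not my exact setup):

    # Build a self-executing JAR from the Python pipeline
    python -m my_pipeline \
        --runner=SparkRunner \
        --output_executable_path=/tmp/my_pipeline.jar \
        --environment_type=DOCKER \
        --environment_config=apache/beam_python3.11_sdk:2.53.0

    # Submit the JAR to the Kubernetes-hosted Spark cluster
    spark-submit \
        --master k8s://https://<api-server>:6443 \
        --deploy-mode cluster \
        --class org.apache.beam.runners.spark.SparkPipelineRunner \
        /tmp/my_pipeline.jar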

The other way would be to use a Beam job server, but there I cannot find a sensible way to set any Spark configuration options besides the master URL. I have a spark-defaults.conf with vital configuration that needs to be passed to the job. I see two ways forward here:

1) I could let users run the job server locally in a Docker container (a rough sketch of this setup follows right after this list). That way they could potentially mount their spark-defaults.conf somewhere, but I don't really see where (pointers here?). They would also need to mount their Kubernetes access credentials somehow, otherwise the job server cannot reach the cluster.

2) I could run the job server in the Kubernetes cluster, which would resolve the Kubernetes credential issue but not the Spark config issue. And even if that were solved, I would be forcing all users onto the same Spark config (not ideal).
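
To make option 1 concrete, this is the kind of setup I have in mind (the kubeconfig mount point and the versions are guesses on my part, not something I have verified):

    # Run the Spark job server locally, pointing it at the k8s master;
    # mounting the kubeconfig is my guess at how to give it cluster access
    docker run --net=host \
        -v $HOME/.kube:/root/.kube \
        apache/beam_spark3_job_server:2.53.0 \
        --spark-master-url=k8s://https://<api-server>:6443

    # Submit the Python pipeline against that job server
    python -m my_pipeline \
        --runner=PortableRunner \
        --job_endpoint=localhost:8099 \
        --environment_type=DOCKER \
        --environment_config=apache/beam_python3.11_sdk:2.53.0

What I still don't see is where a spark-defaults.conf would have to go inside that container so the job server actually picks it up.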

Is there a better way? From what I can see, the compiled JAR is the only viable option, but the log issue is a deal breaker.

Thanks
Janek
