Hi,

Apologies if this has been answered before - I did some searching but wasn't able to come up with anything helpful.
I'm attempting to run a pyspark job in yarn-cluster mode using Oozie on a medium-sized cluster where Spark is only installed on the master node. Normally, when used in this mode, the SparkSubmit class handles distributing work to the slave nodes via YARN. My issue stems from the fact that when such a Spark job is run via Oozie, the launcher container for SparkSubmit lands on an arbitrary slave node, which Oozie assumes has Spark (and therefore the necessary configuration, such as spark-defaults.conf and spark-env.sh) installed locally. This leads to failures like the one described in https://issues.apache.org/jira/browse/OOZIE-2482 .

Has anyone else run into this? Is there work being done in Oozie to address this assumption, which places an unnecessary constraint on the user? I have yet to come across a method of ensuring a specific job gets launched on a particular node, or to see any documentation referring to this problem.

If I can provide more information, please don't hesitate to ask.
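For context, the action in my workflow.xml looks roughly like the sketch below (the job name, script path, and spark-opts are illustrative placeholders, not my exact setup):

    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>my-pyspark-job</name>
            <!-- For pyspark, the jar element points at the python script -->
            <jar>${nameNode}/user/me/app/my_job.py</jar>
            <spark-opts>--executor-memory 2G --num-executors 4</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>

As far as I can tell, nothing in this action lets me influence which node YARN picks for the launcher container, which is why the missing local Spark install becomes a problem.

Best,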
