Hi,

Apologies if this has been answered before - I did some searching but
wasn't able to come up with anything helpful.

I'm attempting to run a PySpark job in yarn-cluster mode using Oozie on a
medium-sized cluster where Spark is only installed on the master node.
Normally the SparkSubmit class, when used in this mode, handles
distributing the work to the slave nodes via YARN.

My issue stems from the fact that when running such a Spark job via Oozie,
the launcher container that runs SparkSubmit is placed on an arbitrary
slave node, which is assumed to have Spark (and therefore the necessary
configuration, such as spark-defaults.conf and spark-env.sh) installed
locally, leading to issues like the one described in
https://issues.apache.org/jira/browse/OOZIE-2482.
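
For context, the workflow action I'm using looks roughly like the
following (the app name, paths, and property values here are placeholders
rather than my actual ones):

<workflow-app name="pyspark-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <master>yarn-cluster</master>
            <name>example-pyspark-job</name>
            <!-- For a Python job, <jar> points at the .py file on HDFS -->
            <jar>${nameNode}/user/${wf:user()}/apps/example_job.py</jar>
            <spark-opts>--executor-memory 2G</spark-opts>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Spark action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

My understanding is that yarn-cluster mode should leave distribution to
YARN, but the launcher container that invokes SparkSubmit can still land
on any node.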

Is this an issue anyone else has run into before? Is there work being done
in Oozie to address this assumption, which places unnecessary constraints
on the user? I have yet to come across a way of ensuring that a specific
job is launched on a particular node, or any documentation referring to
this problem.

If I can provide more information, please don't hesitate to ask.

Best,
