Greetings,
Thanks for the suggestion. I tried this and noted two things. The first
is that one has to prepend `oozie.launcher.` to the parameter in order
for it to have an effect on the actual environment of the script. The
second is that when I did this, the Python script exited claiming it
couldn't find the module pyspark.sql.types, which leads me to believe
that `mapred.map.child.env` is used underneath to pass some other
environment variables, and that I overwrote those when I manually set it
to a particular set of k=v pairs. I don't know if this is actually the
case, though; I'm just inferring it from the observed behavior.
In the end I managed to get the ${wf:id()} result by reading the
SparkConf object inside the SparkContext that Oozie provides for the
spark action. I noticed in the stdout log that when the script is run,
one of the command-line parameters passed to spark-submit is actually
`--conf spark.oozie.job.id=${wf:id()}`. So in the end I was lucky, and
wf_id = sc._conf.get("spark.oozie.job.id") did the trick for me from
within the script. However, I'd still like to find a way of doing it as
I originally intended, which is by being able to access some environment
variable set from the XML workflow definition.
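For reference, the workaround above can be sketched like this (a minimal sketch; the helper name is mine, and sc._conf is private API, so SparkContext.getConf() is the safer public equivalent):

```python
# Hedged sketch: retrieve the Oozie workflow id from the Spark configuration.
# Oozie's spark action passes `--conf spark.oozie.job.id=<wf id>` to
# spark-submit, so the value is visible on the SparkConf. The helper takes any
# object with a .get(key, default) method, so it also works against a plain
# dict for testing outside a cluster.

def oozie_workflow_id(conf):
    """Return the workflow id Oozie injected, or None if absent."""
    return conf.get("spark.oozie.job.id", None)

# Inside a real job you would call it on the active context's configuration:
#   from pyspark import SparkContext
#   sc = SparkContext.getOrCreate()
#   wf_id = oozie_workflow_id(sc.getConf())  # or sc._conf, as in the thread
```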
On 05/14/2018 05:58 AM, Peter Cseh wrote:
Hi!
There is no easy and straightforward way of doing this for the Spark
action, but you can take advantage of the fact that Oozie 4.1.0 uses
MapReduce to launch Spark.
Just put "mapred.map.child.env" in the action configuration using the
format k1=v1,k2=v2. EL functions should also work here.
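Sketching this suggestion as workflow XML, it might look as follows (an assumption on my part that the property belongs in the action's <configuration> block; and, per the follow-up at the top of the thread, the oozie.launcher. prefix is needed, and setting the value wholesale may clobber variables Oozie itself passes this way):

```xml
<spark xmlns="uri:oozie:spark-action:0.1">
    <!-- ... job-tracker, name-node, master, jar, etc. ... -->
    <configuration>
        <property>
            <!-- oozie.launcher. prefix so it reaches the launcher job -->
            <name>oozie.launcher.mapred.map.child.env</name>
            <value>OOZIE_WORKFLOW_ID=${wf:id()}</value>
        </property>
    </configuration>
</spark>
```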
Gp
On Thu, May 10, 2018 at 6:39 PM, Richard Primera <
richard.prim...@woombatcg.com> wrote:
Greetings,
How can I set an environment variable to be accessible from either a .jar
or .py script launched via a spark action?
The idea is to set the environment variable with the output of the EL
function ${wf:id()} from within the XML workflow definition, something
along these lines:
<jar>script.py</jar>
<env>OOZIE_WORKFLOW_ID=${wf:id()}</env>
And then have the ability to do wf_id = os.getenv("OOZIE_WORKFLOW_ID")
from the script without having to pass it as a command-line argument. The
problem with command-line arguments is that they don't scale as well,
since they rely on a specific ordering or some custom parsing
implementation.
This seems easy to do with a shell action, but I've been unable to find a
similarly straightforward way of doing it for a spark action.
Oozie Version: 4.1.0-cdh5.12.1
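For completeness, the intended consumer side of this would be nothing more than the following (OOZIE_WORKFLOW_ID is the hypothetical variable name from the <env> example above; whether it is actually set depends on the launcher exporting it for the spark action):

```python
import os

# Hypothetical variable name, matching the <env> example in the question.
# os.getenv returns None when the variable is not present in the environment,
# so the script can detect the case where the launcher did not export it.
wf_id = os.getenv("OOZIE_WORKFLOW_ID")
if wf_id is None:
    print("OOZIE_WORKFLOW_ID not set; not running under the expected launcher")
```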