Hi Richard!

I'm happy you've found a workaround for your issue.

Yes, "SPARK_HOME" is set to the current working directory in
SparkActionExecutor
<https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/action/hadoop/SparkActionExecutor.java#L111>

Based on the surrounding code, it should be merged with the user-defined
environment properties, so the behavior you're experiencing sounds like a
bug to me.
The issue might be that SparkActionExecutor uses "mapred.child.env"
<https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/action/hadoop/SparkActionExecutor.java#L47>
rather than mapred.map.child.env, so MapReduce overwrites one with the
other instead of merging them.
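For reference, this is the kind of setting we've been discussing. Treat it
as a rough sketch (I haven't verified it against 4.1.0-cdh5.12.1), and as
you observed, the launcher-prefixed property currently seems to replace
the launcher's environment rather than being merged into it:

    <spark xmlns="uri:oozie:spark-action:0.1">
        ...
        <!-- sketch only: namespace version and element order may differ
             in your Oozie release -->
        <configuration>
            <property>
                <name>oozie.launcher.mapred.map.child.env</name>
                <value>OOZIE_WORKFLOW_ID=${wf:id()}</value>
            </property>
        </configuration>
        ...
        <jar>script.py</jar>
        ...
    </spark>

Until that's sorted out, reading the value from the SparkConf as you're
doing now is probably the safer route.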
gp


On Wed, May 16, 2018 at 11:11 PM, Richard Primera <
richard.prim...@woombatcg.com> wrote:

> Greetings,
>
> Thanks for the suggestion. I tried this and noted two things. The first is
> that one has to prepend `oozie.launcher` to the parameter in order for it
> to have any effect on the actual environment of the script. The second is
> that when I did this, the Python script exited claiming it couldn't find
> the module pyspark.sql.types, which leads me to believe that
> `mapred.map.child.env` is used underneath to pass some other environment
> variables, and that I overwrote them when I manually set it to a
> particular set of k=v pairs. I don't know if this is the case though; I'm
> just inferring it from the observed behavior.
>
> In the end I managed to get the ${wf:id()} result by appealing to the
> SparkConf object inside the SparkContext provided by Oozie for the spark
> action. I noticed in the stdout log that when the script is run, one of the
> command line parameters given to spark-submit is actually `--conf
> spark.oozie.job.id=${wf:id()}`. So in the end, I was lucky and wf_id =
> sc._conf.get("spark.oozie.job.id") did the trick for me from within the
> script. However, I'd still like to find a way of doing it as I originally
> intended, which is by being able to access some environment variable set
> from the XML definition.
>
>
>
> On 05/14/2018 05:58 AM, Peter Cseh wrote:
>
>> Hi!
>>
>> There is no easy and straightforward way of doing this for the Spark
>> action, but you can take advantage of the fact that Oozie 4.1.0 uses
>> MapReduce to launch Spark.
>> Just put "mapred.map.child.env" in the action configuration using the
>> format k1=v1,k2=v2. EL functions should also work here.
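>>
>> For example, something along these lines inside the action's
>> <configuration> block (a rough sketch, not tested on 4.1.0):
>>
>>     <!-- sketch: variable name is just an example -->
>>     <property>
>>         <name>mapred.map.child.env</name>
>>         <value>OOZIE_WORKFLOW_ID=${wf:id()}</value>
>>     </property>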
>>
>> Gp
>>
>>
>> On Thu, May 10, 2018 at 6:39 PM, Richard Primera <
>> richard.prim...@woombatcg.com> wrote:
>>
>> Greetings,
>>>
>>> How can I set an environment variable to be accessible from either a .jar
>>> or .py script launched via a spark action?
>>>
>>> The idea is to set the environment variable with the output of the EL
>>> function ${wf:id()} from within the XML workflow definition, something
>>> along these lines:
>>>
>>> <jar>script.py</jar>
>>>
>>>      <env>OOZIE_WORKFLOW_ID=${wf:id()}</env>
>>>
>>> And then have the ability to do wf_id = os.getenv("OOZIE_WORKFLOW_ID")
>>> from the script without having to pass it as a command line argument.
>>> The thing about command line arguments is that they don't scale as well,
>>> because they rely on a specific ordering or some custom parsing
>>> implementation. This seems easy to do with a shell action, but I've been
>>> unable to find a similarly straightforward way of doing it for a spark
>>> action.
>>>
>>> Oozie Version: 4.1.0-cdh5.12.1
>>>
>>>
>>>
>>
>


-- 
Peter Cseh | Software Engineer
cloudera.com <https://www.cloudera.com>
