Hi,

We're running into an issue where workflows that fail and have to be re-run (with oozie.wf.rerun.failnodes=true) immediately fail again with the message "invalid execution path" in the Oozie log.
The consistent pattern we observe is this: in a workflow where a fork (fork_2) leads to a join (join_2), which in turn leads to another fork (fork_3), a failure in any of the jobs started by fork_3 causes the re-run to fail immediately. If there is no failure, the workflow executes to completion normally.

We've also observed that when fork_2 leads to several jobs (bash_cp_3, bash_cp_4, bash_cp_6, bash_cp_8, bash_cp_10, bash_cp_12), the apparently invalid execution paths are any of the first five. In other words, if Oozie records any of the first five (seemingly at random) as the execution_path for join_2 in its wf_actions table, the re-run fails. Only if bash_cp_12 is recorded does the workflow re-run successfully.

Another thing that might be relevant is that we are using a custom action executor that submits to SGE (for legacy reasons). The code is available at https://github.com/SeqWare/oozie-sge/tree/1.0.2

This is with Oozie version 3.3.2-cdh4.5.0.

Are there any thoughts on whether there is some API call that we're failing to make in our custom action executor that affects the execution path? Are we structuring our workflows in some unexpected manner? What is the meaning of an execution path for a control node such as a join anyway?

Thanks for any insight! Large amounts of text follow.
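To make the observed pattern concrete, here is a minimal Python sketch of the rule as we see it empirically. It only restates our observation and says nothing about Oozie's actual logic: the function name `rerun_expected_to_succeed` is hypothetical, and the branch list is simply copied from fork_2.

```python
# Branches of fork_2, in the order they appear in the workflow definition.
FORK_2_BRANCHES = ["bash_cp_3", "bash_cp_4", "bash_cp_6",
                   "bash_cp_8", "bash_cp_10", "bash_cp_12"]


def rerun_expected_to_succeed(join_execution_path, fork_branches):
    """Empirical rule only (an assumption drawn from our runs, not from Oozie):
    re-runs have worked solely when the join's recorded execution_path
    names the last branch of the preceding fork."""
    branch = join_execution_path.strip("/")
    return branch == fork_branches[-1]


if __name__ == "__main__":
    # Print the expectation for every path that could be recorded for join_2.
    for name in FORK_2_BRANCHES:
        path = "/" + name + "/"
        print(path, rerun_expected_to_succeed(path, FORK_2_BRANCHES))
```

Feeding it the execution_path stored for join_2 in wf_actions would tell us in advance whether a re-run is likely to hit the error.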
Relevant error in log:

2014-06-05 14:06:01,599 DEBUG SignalXCommand:545 - USER[dyuen] GROUP[-] TOKEN[] APP[HelloWorld] JOB[0000000-140605140030484-oozie-oozi-W] ACTION[0000000-140605140030484-oozie-oozi-W@join_2] STARTED SignalCommand for jobid=0000000-140605140030484-oozie-oozi-W, actionId=0000000-140605140030484-oozie-oozi-W@join_2
2014-06-05 14:06:01,600 DEBUG LiteWorkflowInstance:545 - USER[dyuen] GROUP[-] TOKEN[] APP[HelloWorld] JOB[0000000-140605140030484-oozie-oozi-W] ACTION[0000000-140605140030484-oozie-oozi-W@join_2] Signaling job execution path [/bash_cp_3/] signal value [OK]
2014-06-05 14:06:01,600 ERROR LiteWorkflowInstance:536 - USER[dyuen] GROUP[-] TOKEN[] APP[HelloWorld] JOB[0000000-140605140030484-oozie-oozi-W] ACTION[0000000-140605140030484-oozie-oozi-W@join_2] invalid execution path [/bash_cp_3/]
2014-06-05 14:06:01,601 WARN LiteWorkflowInstance:542 - USER[dyuen] GROUP[-] TOKEN[] APP[HelloWorld] JOB[0000000-140605140030484-oozie-oozi-W] ACTION[0000000-140605140030484-oozie-oozi-W@join_2] Workflow completed [FAILED], failing [0] running nodes

Oozie wf_actions table for the relevant workflow:

 id                                                             | name                      | signal_value | status | transition                | execution_path
----------------------------------------------------------------+---------------------------+--------------+--------+---------------------------+------------------------
 0000000-140605140030484-oozie-oozi-W@:start:                   | :start:                   | OK           | OK     | start_0                   | /
 0000000-140605140030484-oozie-oozi-W@start_0                   | start_0                   | OK           | OK     | provisionFile_file_in_0_1 | /
 0000000-140605140030484-oozie-oozi-W@provisionFile_file_in_0_1 | provisionFile_file_in_0_1 | OK           | OK     | bash_mkdir_2              | /
 0000000-140605140030484-oozie-oozi-W@bash_mkdir_2              | bash_mkdir_2              | OK           | OK     | fork_2                    | /
 0000000-140605140030484-oozie-oozi-W@fork_2                    | fork_2                    | OK           | OK     | *                         | /
 0000000-140605140030484-oozie-oozi-W@bash_cp_3                 | bash_cp_3                 | OK           | OK     | join_2                    | /bash_cp_3/
 0000000-140605140030484-oozie-oozi-W@bash_cp_4                 | bash_cp_4                 | OK           | OK     | join_2                    | /bash_cp_4/
 0000000-140605140030484-oozie-oozi-W@bash_cp_6                 | bash_cp_6                 | OK           | OK     | join_2                    | /bash_cp_6/
 0000000-140605140030484-oozie-oozi-W@bash_cp_8                 | bash_cp_8                 | OK           | OK     | join_2                    | /bash_cp_8/
 0000000-140605140030484-oozie-oozi-W@bash_cp_10                | bash_cp_10                | OK           | OK     | join_2                    | /bash_cp_10/
 0000000-140605140030484-oozie-oozi-W@bash_cp_12                | bash_cp_12                | OK           | OK     | join_2                    | /bash_cp_12/
 0000000-140605140030484-oozie-oozi-W@join_2                    | join_2                    | OK           | OK     | fork_3                    | /bash_cp_3/
 0000000-140605140030484-oozie-oozi-W@fork_3                    | fork_3                    | OK           | OK     | *                         | /
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_5       | provisionFile_out_5       | OK           | OK     | join_3                    | /provisionFile_out_5/
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_7       | provisionFile_out_7       | OK           | OK     | join_3                    | /provisionFile_out_7/
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_9       | provisionFile_out_9       | OK           | OK     | join_3                    | /provisionFile_out_9/
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_11      | provisionFile_out_11      | OK           | OK     | join_3                    | /provisionFile_out_11/
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_13      | provisionFile_out_13      | OK           | OK     | join_3                    | /provisionFile_out_13/
 0000000-140605140030484-oozie-oozi-W@fail                      | fail                      | OK           | OK     |                           | /bash_cp_14/
(19 rows)

The workflow:

<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.4" name="HelloWorld">
  <start to="start_0" />
  <action name="start_0" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/start_0-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/start_0-qsub.opts</options-file>
    </sge>
    <ok to="provisionFile_file_in_0_1" />
    <error to="fail" />
  </action>
  <action name="provisionFile_file_in_0_1" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_file_in_0_1-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_file_in_0_1-qsub.opts</options-file>
    </sge>
    <ok to="bash_mkdir_2" />
    <error to="fail" />
  </action>
  <action name="bash_mkdir_2" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_mkdir_2-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_mkdir_2-qsub.opts</options-file>
    </sge>
    <ok to="fork_2" />
    <error to="fail" />
  </action>
  <fork name="fork_2">
    <path start="bash_cp_3" />
    <path start="bash_cp_4" />
    <path start="bash_cp_6" />
    <path start="bash_cp_8" />
    <path start="bash_cp_10" />
    <path start="bash_cp_12" />
  </fork>
  <action name="bash_cp_3" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_3-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_3-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_4" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_4-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_4-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_6" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_6-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_6-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_8" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_8-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_8-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_10" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_10-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_10-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_12" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_12-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_12-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <join name="join_2" to="fork_3" />
  <fork name="fork_3">
    <path start="bash_cp_14" />
    <path start="provisionFile_out_5" />
    <path start="provisionFile_out_7" />
    <path start="provisionFile_out_9" />
    <path start="provisionFile_out_11" />
    <path start="provisionFile_out_13" />
  </fork>
  <action name="bash_cp_14" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_14-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_14-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_5" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_5-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_5-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_7" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_7-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_7-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_9" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_9-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_9-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_11" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_11-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_11-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_13" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_13-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_13-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <join name="join_3" to="fork_4" />
  <fork name="fork_4">
    <path start="bash_cp_15" />
    <path start="bash_cp_17" />
    <path start="bash_cp_19" />
    <path start="bash_cp_21" />
    <path start="bash_cp_23" />
  </fork>
  <action name="bash_cp_15" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_15-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_15-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <action name="bash_cp_17" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_17-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_17-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <action name="bash_cp_19" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_19-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_19-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <action name="bash_cp_21" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_21-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_21-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <action name="bash_cp_23" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_23-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_23-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <join name="join_4" to="fork_5" />
  <fork name="fork_5">
    <path start="provisionFile_out_16" />
    <path start="provisionFile_out_18" />
    <path start="provisionFile_out_20" />
    <path start="provisionFile_out_22" />
    <path start="provisionFile_out_24" />
  </fork>
  <action name="provisionFile_out_16" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_16-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_16-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_18" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_18-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_18-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_20" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_20-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_20-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_22" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_22-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_22-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_24" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_24-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_24-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <join name="join_5" to="done" />
  <join name="join_274314800376896" to="done" />
  <action name="done">
    <fs>
      <delete path="hdfs://localhost:8020/user/dyuen/seqware_workflow/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b" />
    </fs>
    <ok to="end" />
    <error to="fail" />
  </action>
  <kill name="fail">
    <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end" />
</workflow-app>
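
For anyone who would rather not read the XML by hand, the fork branches can be pulled out of a workflow like the one above with a few lines of Python. This is only a sketch using the standard library: the function name `fork_branches` is hypothetical, the sample is a trimmed stand-in for the real file, and it assumes the `uri:oozie:workflow:0.4` namespace declared on our workflow-app element.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <workflow-app> in our workflow.
WF_NS = {"wf": "uri:oozie:workflow:0.4"}


def fork_branches(workflow_xml):
    """Map each fork name to the ordered list of nodes its paths start."""
    root = ET.fromstring(workflow_xml)
    return {
        fork.get("name"): [p.get("start") for p in fork.findall("wf:path", WF_NS)]
        for fork in root.findall("wf:fork", WF_NS)
    }


# Trimmed stand-in for the real workflow (illustration only).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.4" name="HelloWorld">
  <fork name="fork_2">
    <path start="bash_cp_3" />
    <path start="bash_cp_12" />
  </fork>
</workflow-app>"""

if __name__ == "__main__":
    print(fork_branches(SAMPLE))
```

The last entry of each list is the branch that, in our runs, had to be recorded as the join's execution_path for a re-run to succeed.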
