Hi,

We're running into an issue where workflows that fail and have to be re-run 
(with oozie.wf.rerun.failnodes=true) immediately fail again with the message 
"invalid execution path" in the Oozie log.

The consistent pattern we observe involves a workflow with a fork (fork_2) 
leading to a join (join_2), which in turn leads to another fork (fork_3). If a 
failure occurs in one of the jobs that fork_3 leads to, the re-run fails 
immediately. If there is no failure, the workflow executes to completion 
normally. We've also observed that fork_2 leads to six jobs (bash_cp_3, 
bash_cp_4, bash_cp_6, bash_cp_8, bash_cp_10, bash_cp_12), and the apparently 
invalid execution paths are any of the first five. In other words, Oozie 
seemingly at random records one of these jobs as the execution_path for join_2 
in its wf_actions table; if it is any of the first five, the re-run fails. 
Only if it is "bash_cp_12" does the workflow re-run successfully.
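
To make the observed pattern concrete, here is a minimal Python sketch. It 
only encodes what we have seen empirically (a re-run works only when join_2's 
recorded execution_path names the last path of fork_2); it is not Oozie's 
actual validation logic, and the function name rerun_expected_to_fail is made 
up for illustration.

```python
# Paths of fork_2, in the order they appear in the workflow definition.
FORK_2_PATHS = [
    "bash_cp_3", "bash_cp_4", "bash_cp_6",
    "bash_cp_8", "bash_cp_10", "bash_cp_12",
]


def rerun_expected_to_fail(fork_paths, join_execution_path):
    """Encode the empirical observation only (hypothetical helper).

    Returns True when the join's recorded execution_path names any fork
    path other than the last one, which is the case where we see
    "invalid execution path" on re-run.
    """
    name = join_execution_path.strip("/")
    if name not in fork_paths:
        raise ValueError("unexpected execution path: " + join_execution_path)
    return name != fork_paths[-1]


if __name__ == "__main__":
    for path in FORK_2_PATHS:
        recorded = "/" + path + "/"
        print(recorded, rerun_expected_to_fail(FORK_2_PATHS, recorded))
```

In the table further down, join_2 has execution_path /bash_cp_3/, which falls 
into the failing case under this observation.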

Another thing that might be relevant is that we are using a custom action 
executor that submits to SGE (for legacy reasons). The code is available at 
https://github.com/SeqWare/oozie-sge/tree/1.0.2 and we are running Oozie 
version 3.3.2-cdh4.5.0.

Are there any thoughts on whether there is some API call that we're failing to 
make in our custom action executor that affects the execution path?
Are we structuring our workflows in some unexpected manner?
What is the meaning of an execution path for a control node such as a join, 
anyway?

Thanks for any insight!

Large amounts of text follow ....

Relevant error in log:
2014-06-05 14:06:01,599 DEBUG SignalXCommand:545 - USER[dyuen] GROUP[-] TOKEN[] APP[HelloWorld] JOB[0000000-140605140030484-oozie-oozi-W] ACTION[0000000-140605140030484-oozie-oozi-W@join_2] STARTED SignalCommand for jobid=0000000-140605140030484-oozie-oozi-W, actionId=0000000-140605140030484-oozie-oozi-W@join_2
2014-06-05 14:06:01,600 DEBUG LiteWorkflowInstance:545 - USER[dyuen] GROUP[-] TOKEN[] APP[HelloWorld] JOB[0000000-140605140030484-oozie-oozi-W] ACTION[0000000-140605140030484-oozie-oozi-W@join_2] Signaling job execution path [/bash_cp_3/] signal value [OK]
2014-06-05 14:06:01,600 ERROR LiteWorkflowInstance:536 - USER[dyuen] GROUP[-] TOKEN[] APP[HelloWorld] JOB[0000000-140605140030484-oozie-oozi-W] ACTION[0000000-140605140030484-oozie-oozi-W@join_2] invalid execution path [/bash_cp_3/]
2014-06-05 14:06:01,601  WARN LiteWorkflowInstance:542 - USER[dyuen] GROUP[-] TOKEN[] APP[HelloWorld] JOB[0000000-140605140030484-oozie-oozi-W] ACTION[0000000-140605140030484-oozie-oozi-W@join_2] Workflow completed [FAILED], failing [0] running nodes

Oozie wf_actions table for the relevant workflow:

                               id                               |           name            | signal_value | status |        transition         |     execution_path
----------------------------------------------------------------+---------------------------+--------------+--------+---------------------------+------------------------
 0000000-140605140030484-oozie-oozi-W@:start:                   | :start:                   | OK           | OK     | start_0                   | /
 0000000-140605140030484-oozie-oozi-W@start_0                   | start_0                   | OK           | OK     | provisionFile_file_in_0_1 | /
 0000000-140605140030484-oozie-oozi-W@provisionFile_file_in_0_1 | provisionFile_file_in_0_1 | OK           | OK     | bash_mkdir_2              | /
 0000000-140605140030484-oozie-oozi-W@bash_mkdir_2              | bash_mkdir_2              | OK           | OK     | fork_2                    | /
 0000000-140605140030484-oozie-oozi-W@fork_2                    | fork_2                    | OK           | OK     | *                         | /
 0000000-140605140030484-oozie-oozi-W@bash_cp_3                 | bash_cp_3                 | OK           | OK     | join_2                    | /bash_cp_3/
 0000000-140605140030484-oozie-oozi-W@bash_cp_4                 | bash_cp_4                 | OK           | OK     | join_2                    | /bash_cp_4/
 0000000-140605140030484-oozie-oozi-W@bash_cp_6                 | bash_cp_6                 | OK           | OK     | join_2                    | /bash_cp_6/
 0000000-140605140030484-oozie-oozi-W@bash_cp_8                 | bash_cp_8                 | OK           | OK     | join_2                    | /bash_cp_8/
 0000000-140605140030484-oozie-oozi-W@bash_cp_10                | bash_cp_10                | OK           | OK     | join_2                    | /bash_cp_10/
 0000000-140605140030484-oozie-oozi-W@bash_cp_12                | bash_cp_12                | OK           | OK     | join_2                    | /bash_cp_12/
 0000000-140605140030484-oozie-oozi-W@join_2                    | join_2                    | OK           | OK     | fork_3                    | /bash_cp_3/
 0000000-140605140030484-oozie-oozi-W@fork_3                    | fork_3                    | OK           | OK     | *                         | /
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_5       | provisionFile_out_5       | OK           | OK     | join_3                    | /provisionFile_out_5/
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_7       | provisionFile_out_7       | OK           | OK     | join_3                    | /provisionFile_out_7/
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_9       | provisionFile_out_9       | OK           | OK     | join_3                    | /provisionFile_out_9/
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_11      | provisionFile_out_11      | OK           | OK     | join_3                    | /provisionFile_out_11/
 0000000-140605140030484-oozie-oozi-W@provisionFile_out_13      | provisionFile_out_13      | OK           | OK     | join_3                    | /provisionFile_out_13/
 0000000-140605140030484-oozie-oozi-W@fail                      | fail                      | OK           | OK     |                           | /bash_cp_14/
(19 rows)


The workflow:

<?xml version="1.0" encoding="UTF-8"?>
<workflow-app xmlns="uri:oozie:workflow:0.4" name="HelloWorld">
  <start to="start_0" />
  <action name="start_0" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/start_0-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/start_0-qsub.opts</options-file>
    </sge>
    <ok to="provisionFile_file_in_0_1" />
    <error to="fail" />
  </action>
  <action name="provisionFile_file_in_0_1" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_file_in_0_1-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_file_in_0_1-qsub.opts</options-file>
    </sge>
    <ok to="bash_mkdir_2" />
    <error to="fail" />
  </action>
  <action name="bash_mkdir_2" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_mkdir_2-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_mkdir_2-qsub.opts</options-file>
    </sge>
    <ok to="fork_2" />
    <error to="fail" />
  </action>
  <fork name="fork_2">
    <path start="bash_cp_3" />
    <path start="bash_cp_4" />
    <path start="bash_cp_6" />
    <path start="bash_cp_8" />
    <path start="bash_cp_10" />
    <path start="bash_cp_12" />
  </fork>
  <action name="bash_cp_3" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_3-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_3-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_4" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_4-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_4-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_6" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_6-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_6-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_8" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_8-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_8-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_10" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_10-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_10-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <action name="bash_cp_12" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_12-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_12-qsub.opts</options-file>
    </sge>
    <ok to="join_2" />
    <error to="fail" />
  </action>
  <join name="join_2" to="fork_3" />
  <fork name="fork_3">
    <path start="bash_cp_14" />
    <path start="provisionFile_out_5" />
    <path start="provisionFile_out_7" />
    <path start="provisionFile_out_9" />
    <path start="provisionFile_out_11" />
    <path start="provisionFile_out_13" />
  </fork>
  <action name="bash_cp_14" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_14-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_14-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_5" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_5-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_5-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_7" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_7-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_7-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_9" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_9-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_9-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_11" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_11-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_11-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_13" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_13-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_13-qsub.opts</options-file>
    </sge>
    <ok to="join_3" />
    <error to="fail" />
  </action>
  <join name="join_3" to="fork_4" />
  <fork name="fork_4">
    <path start="bash_cp_15" />
    <path start="bash_cp_17" />
    <path start="bash_cp_19" />
    <path start="bash_cp_21" />
    <path start="bash_cp_23" />
  </fork>
  <action name="bash_cp_15" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_15-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_15-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <action name="bash_cp_17" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_17-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_17-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <action name="bash_cp_19" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_19-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_19-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <action name="bash_cp_21" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_21-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_21-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <action name="bash_cp_23" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_23-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/bash_cp_23-qsub.opts</options-file>
    </sge>
    <ok to="join_4" />
    <error to="fail" />
  </action>
  <join name="join_4" to="fork_5" />
  <fork name="fork_5">
    <path start="provisionFile_out_16" />
    <path start="provisionFile_out_18" />
    <path start="provisionFile_out_20" />
    <path start="provisionFile_out_22" />
    <path start="provisionFile_out_24" />
  </fork>
  <action name="provisionFile_out_16" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_16-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_16-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_18" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_18-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_18-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_20" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_20-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_20-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_22" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_22-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_22-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <action name="provisionFile_out_24" retry-max="5" retry-interval="5">
    <sge xmlns="uri:oozie:sge-action:1.0">
      <script>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_24-runner.sh</script>
      <options-file>/usr/tmp/oozie/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b/generated-scripts/provisionFile_out_24-qsub.opts</options-file>
    </sge>
    <ok to="join_5" />
    <error to="fail" />
  </action>
  <join name="join_5" to="done" />
  <join name="join_274314800376896" to="done" />
  <action name="done">
    <fs>
      <delete path="hdfs://localhost:8020/user/dyuen/seqware_workflow/oozie-8d157b87-5f1a-496f-b66c-8374cd05233b" />
    </fs>
    <ok to="end" />
    <error to="fail" />
  </action>
  <kill name="fail">
    <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end" />
</workflow-app>
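
For anyone who wants to compare the order of fork paths against the 
execution_path recorded for a join, here is a minimal Python sketch that 
lists each fork's paths (in document order) and the nodes those paths 
transition to on "ok". The function name fork_join_map and the trimmed SAMPLE 
workflow are made up for illustration; the sge action bodies are left out of 
the sample, so it is not a schema-valid workflow, only enough to show the 
structure.

```python
import xml.etree.ElementTree as ET

# Namespace used by the workflow definition above.
NS = {"wf": "uri:oozie:workflow:0.4"}

# Trimmed, hypothetical sample: action bodies omitted for brevity.
SAMPLE = """<workflow-app xmlns="uri:oozie:workflow:0.4" name="HelloWorld">
  <start to="fork_2" />
  <fork name="fork_2">
    <path start="bash_cp_3" />
    <path start="bash_cp_12" />
  </fork>
  <action name="bash_cp_3"><ok to="join_2" /><error to="fail" /></action>
  <action name="bash_cp_12"><ok to="join_2" /><error to="fail" /></action>
  <join name="join_2" to="end" />
  <kill name="fail"><message>failed</message></kill>
  <end name="end" />
</workflow-app>"""


def fork_join_map(workflow_xml):
    """Map each fork name to (ordered path starts, set of "ok" targets)."""
    root = ET.fromstring(workflow_xml)
    # Action name -> node it transitions to on success.
    ok_target = {}
    for action in root.findall("wf:action", NS):
        ok = action.find("wf:ok", NS)
        if ok is not None:
            ok_target[action.get("name")] = ok.get("to")
    result = {}
    for fork in root.findall("wf:fork", NS):
        starts = [p.get("start") for p in fork.findall("wf:path", NS)]
        # Only path starts that are plain actions have an "ok" target here.
        targets = {ok_target[s] for s in starts if s in ok_target}
        result[fork.get("name")] = (starts, targets)
    return result


if __name__ == "__main__":
    for name, (starts, targets) in fork_join_map(SAMPLE).items():
        print(name, starts, sorted(targets))
```

Run against the full workflow, the last entry of each fork's path list is 
the one that, in our observations, allows a successful re-run when it ends up 
as the join's execution_path.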
