[jira] [Created] (FLINK-30444) State recovery error not handled correctly and always causes JM failure

Gyula Fora (Jira) Fri, 16 Dec 2022 09:37:08 -0800

Gyula Fora created FLINK-30444:
----------------------------------

             Summary: State recovery error not handled correctly and always causes JM failure
                 Key: FLINK-30444
                 URL: https://issues.apache.org/jira/browse/FLINK-30444
             Project: Flink
          Issue Type: Bug
          Components: Client / Job Submission
    Affects Versions: 1.15.3, 1.14.6, 1.16.0
            Reporter: Gyula Fora



When you submit a job in Application mode and try to restore from an 
incompatible savepoint, the resulting behaviour is very unexpected.

Even with the following config:
{noformat}
execution.shutdown-on-application-finish: false
execution.submit-failed-job-on-application-error: true
{noformat}

The job goes into a FAILED state and the JobManager fails. In a Kubernetes 
environment (when using the native Kubernetes integration), this means that the 
JobManager is restarted automatically.

If the job result store is enabled, this means that after the JM comes 
back you end up with an empty application cluster.

I think the correct behaviour would be, depending on the config mentioned above:

1. If there is a job recovery error and 
execution.submit-failed-job-on-application-error is enabled, the job should 
show up as failed and the JM should not exit (if 
execution.shutdown-on-application-finish is false).
2. If execution.shutdown-on-application-finish is true, the JobManager should 
exit cleanly, as it does when a job reaches a normal terminal state, and thus 
stop the deployment in Kubernetes, preventing a JM restart cycle.
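
The two cases above can be read as a small decision table. The sketch below is 
only an illustration of that proposal, not Flink code: the class 
RecoveryErrorPolicy, the Outcome enum and the decide method are hypothetical 
names, and it assumes item 2 applies whenever 
execution.shutdown-on-application-finish is true. The combination with both 
options set to false is not discussed in this report, so it is returned as an 
explicit "not covered" value rather than guessed.

```java
public class RecoveryErrorPolicy {

    /** Hypothetical outcomes for a state recovery error; not part of Flink. */
    public enum Outcome {
        JOB_FAILED_JM_KEEPS_RUNNING, // item 1: job is shown as FAILED, JM stays up
        JM_EXITS_CLEANLY,            // item 2: JM exits as on a normal terminal state
        NOT_COVERED_BY_REPORT        // combination not described in this issue
    }

    /**
     * Maps the two config flags to the behaviour proposed in this issue.
     * Assumption: item 2 depends only on shutdown-on-application-finish.
     */
    public static Outcome decide(boolean submitFailedJobOnApplicationError,
                                 boolean shutdownOnApplicationFinish) {
        if (shutdownOnApplicationFinish) {
            return Outcome.JM_EXITS_CLEANLY;
        }
        if (submitFailedJobOnApplicationError) {
            return Outcome.JOB_FAILED_JM_KEEPS_RUNNING;
        }
        return Outcome.NOT_COVERED_BY_REPORT;
    }

    public static void main(String[] args) {
        // The config from this report: submit-failed-job=true, shutdown=false.
        System.out.println(decide(true, false));
        // Same error with shutdown-on-application-finish=true.
        System.out.println(decide(true, true));
    }
}
```

With the config shown earlier in this report, the first call corresponds to 
the case where the job should be listed as failed while the JM keeps running.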



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
