[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186961#comment-15186961 ] ASF GitHub Bot commented on FLINK-3396: --- Github user asfgit closed the pull request at: https://github.com/apache/flink/pull/1657 > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179633#comment-15179633 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1657#issuecomment-192209616 If there are no objections, I am going to merge this to master later today. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150499#comment-15150499 ] ASF GitHub Bot commented on FLINK-3396: --- GitHub user uce opened a pull request: https://github.com/apache/flink/pull/1657 [FLINK-3396] [runtime] Suppress job restart if adding to job graph store fails @tillrohrmann, this leaves everything as is, but suppresses the restart in case of a failure to add the job graph to ZooKeeper (the standalone case should never fail). This applies only to the initial job submission. The behaviour before was possibly leading to restart and the job being submitted w/o being added to ZooKeeper. Furthermore, the submission was not ACK'd in such a case. Now it will simply fail and the user can/has to react. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/flink 3396-submit Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1657.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1657 commit 3847e304c93a30c18560719d8f169ca424d734e6 Author: Ufuk CelebiDate: 2016-02-12T21:09:29Z [hotfix] Rename UnrecoverableException to SuppressRestartsException commit 07753af54db020ddef5540ab0ae3dcbb3af69c54 Author: Ufuk Celebi Date: 2016-02-16T16:49:13Z [FLINK-3396] [runtime] Suppress job restart if adding to job graph store fails A failure to add the job graph to the submitted job graphs in ZooKeeper could lead to a job restart w/o the job graph ever being added to the submitted graphs store. Although the job submission was not ACK'd in this case before, it received job status messages. Now the job will not be ACK'd as before, but the job will be failed. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150374#comment-15150374 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce closed the pull request at: https://github.com/apache/flink/pull/1656 > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15150365#comment-15150365 ] ASF GitHub Bot commented on FLINK-3396: --- GitHub user uce opened a pull request: https://github.com/apache/flink/pull/1656 [FLINK-3396] [runtime] Fix JobGraph submission and client ACK logic A failure when recovering savepoint state could lead to not ACKing the job submission. For detached submissions, this could have lead to a submission timeout although the job eventually starts to run. Moreover, a failure to restore savepoint state, could lead to a job graph skipping the submitted graph store for HA. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/flink 3396-submit Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1656.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1656 commit 3847e304c93a30c18560719d8f169ca424d734e6 Author: Ufuk CelebiDate: 2016-02-12T21:09:29Z [hotfix] Rename UnrecoverableException to SuppressRestartsException commit 92e900b3266b93204ea84050d189f00bbbca6678 Author: Ufuk Celebi Date: 2016-02-16T16:49:13Z [FLINK-3396] [runtime] Fix JobGraph submission and client ACK logic A failure when recovering savepoint state could lead to not ACKing the job submission. For detached submissions, this could have lead to a submission timeout although the job eventually starts to run. Moreover, a failure to restore savepoint state, could lead to a job graph skipping the submitted graph store for HA. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148774#comment-15148774 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1633#issuecomment-184719993 Closing this... > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148775#comment-15148775 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce closed the pull request at: https://github.com/apache/flink/pull/1633 > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148700#comment-15148700 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1633#discussion_r53019879 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -1079,6 +1079,9 @@ class JobManager( executionGraph.registerExecutionListener(gateway) executionGraph.registerJobStatusListener(gateway) } + +// All good. Submission succeeded! +jobInfo.client ! decorateMessage(JobSubmitSuccess(jobGraph.getJobID)) --- End diff -- Oh boy... not my day today. Thanks for catching that. This was not expected. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148695#comment-15148695 ] ASF GitHub Bot commented on FLINK-3396: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/1633#discussion_r53019468 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -1079,6 +1079,9 @@ class JobManager( executionGraph.registerExecutionListener(gateway) executionGraph.registerJobStatusListener(gateway) } + +// All good. Submission succeeded! +jobInfo.client ! decorateMessage(JobSubmitSuccess(jobGraph.getJobID)) --- End diff -- Hmm but now we have the problem that the user might see a `JobSubmitSuccess` without the job being stored in the `SubmittedJobGraphStore`, right? This means that if the `JobManager` dies before the job is persisted, it will never be recovered. I think this violates the `JobSubmitSuccess` contract. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148628#comment-15148628 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1633#issuecomment-184694713 Travis has passed. Did you have another look at this @tillrohrmann? > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148495#comment-15148495 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1633#discussion_r53000913 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -1073,57 +1073,73 @@ class JobManager( // execute the recovery/writing the jobGraph into the SubmittedJobGraphStore asynchronously // because it is a blocking operation future { -try { - if (isRecovery) { -executionGraph.restoreLatestCheckpointedState() - } - else { -val snapshotSettings = jobGraph.getSnapshotSettings -if (snapshotSettings != null) { - val savepointPath = snapshotSettings.getSavepointPath() +val restoreStateSuccess = + try { +if (isRecovery) { + executionGraph.restoreLatestCheckpointedState() --- End diff -- Regarding the `JobSubmitSuccess`: we had it as a follow up to have more fine-grained integration with the the client and left it as a duplicate submit message for the time being (instead of something like `JobRecovered`). The other behaviour is back to the previous state now. I hear you that it makes sense to integrate the state restore behaviour with the execution graph restart. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148485#comment-15148485 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1633#discussion_r52999704 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -1079,6 +1079,9 @@ class JobManager( executionGraph.registerExecutionListener(gateway) executionGraph.registerJobStatusListener(gateway) } + +// All good. Submission succeeded! +jobInfo.client ! decorateMessage(JobSubmitSuccess(jobGraph.getJobID)) --- End diff -- Moved this one up to have correct ACKing behaviour. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148480#comment-15148480 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1633#discussion_r52998947 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -1073,57 +1073,73 @@ class JobManager( // execute the recovery/writing the jobGraph into the SubmittedJobGraphStore asynchronously // because it is a blocking operation future { -try { - if (isRecovery) { -executionGraph.restoreLatestCheckpointedState() - } - else { -val snapshotSettings = jobGraph.getSnapshotSettings -if (snapshotSettings != null) { - val savepointPath = snapshotSettings.getSavepointPath() +val restoreStateSuccess = + try { +if (isRecovery) { + executionGraph.restoreLatestCheckpointedState() --- End diff -- Had an offline discussion with Stephan. He agrees with you that the failure in this case is too hard. I'll undo that change by ACK'ing the submission earlier. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148478#comment-15148478 ] ASF GitHub Bot commented on FLINK-3396: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/1633#discussion_r52998567 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -1073,57 +1073,73 @@ class JobManager( // execute the recovery/writing the jobGraph into the SubmittedJobGraphStore asynchronously // because it is a blocking operation future { -try { - if (isRecovery) { -executionGraph.restoreLatestCheckpointedState() - } - else { -val snapshotSettings = jobGraph.getSnapshotSettings -if (snapshotSettings != null) { - val savepointPath = snapshotSettings.getSavepointPath() +val restoreStateSuccess = + try { +if (isRecovery) { + executionGraph.restoreLatestCheckpointedState() --- End diff -- The behaviour right now for a failure while doing a job recovery would simply fail the `ExecutionGraph` triggering a restart. A successful job recovery would send a `JobSubmitSuccess` to the client. I'm not sure whether this is actually correct, since the client already received a `JobSubmitMessage` from the `JobManager` while initially submitting the job. But I think this will simply be ignored. Thus, suppressing the restart behaviour in case of a job recovery would actually change the behaviour. If it makes sense and if it is possible to recover from failures while recovering a job or restoring a savepoint, it would make sense to not directly fail the job without restarting. Maybe one should distinguish that based on the actually occurring exception. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148466#comment-15148466 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on a diff in the pull request: https://github.com/apache/flink/pull/1633#discussion_r52997375 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -1073,57 +1073,73 @@ class JobManager( // execute the recovery/writing the jobGraph into the SubmittedJobGraphStore asynchronously // because it is a blocking operation future { -try { - if (isRecovery) { -executionGraph.restoreLatestCheckpointedState() - } - else { -val snapshotSettings = jobGraph.getSnapshotSettings -if (snapshotSettings != null) { - val savepointPath = snapshotSettings.getSavepointPath() +val restoreStateSuccess = + try { +if (isRecovery) { + executionGraph.restoreLatestCheckpointedState() --- End diff -- But then I would not keep the behaviour as it is right now. Instead, we should then consider the job submitted before trying to recover any checkpoint state and keep the restart behaviour. What do you think? > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148413#comment-15148413 ] ASF GitHub Bot commented on FLINK-3396: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/1633#discussion_r52992451 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -1073,57 +1073,73 @@ class JobManager( // execute the recovery/writing the jobGraph into the SubmittedJobGraphStore asynchronously // because it is a blocking operation future { -try { - if (isRecovery) { -executionGraph.restoreLatestCheckpointedState() - } - else { -val snapshotSettings = jobGraph.getSnapshotSettings -if (snapshotSettings != null) { - val savepointPath = snapshotSettings.getSavepointPath() +val restoreStateSuccess = + try { +if (isRecovery) { + executionGraph.restoreLatestCheckpointedState() --- End diff -- I'm not so sure about that, to be honest. What if the `restoreLatestCheckpointedState` fails because of some HDFS/ZooKeeper problems. Then you would like to try restarting the job, wouldn't you? The client should then be notified once all restarting attempts have been exhausted. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148354#comment-15148354 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1633#issuecomment-184599724 Thanks for review Till! I've [opened an issue](https://issues.apache.org/jira/browse/FLINK-3411) for the failure behaviour in case of checkpoint state recovery. I'll merge this today. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147615#comment-15147615 ] ASF GitHub Bot commented on FLINK-3396: --- Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/1633#issuecomment-184313039 Changes look good to me @uce. I had only one inline question concerning a semantic change. Apart from that +1 for merging. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147613#comment-15147613 ] ASF GitHub Bot commented on FLINK-3396: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/1633#discussion_r52924511 --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala --- @@ -1073,57 +1073,73 @@ class JobManager( // execute the recovery/writing the jobGraph into the SubmittedJobGraphStore asynchronously // because it is a blocking operation future { -try { - if (isRecovery) { -executionGraph.restoreLatestCheckpointedState() - } - else { -val snapshotSettings = jobGraph.getSnapshotSettings -if (snapshotSettings != null) { - val savepointPath = snapshotSettings.getSavepointPath() +val restoreStateSuccess = + try { +if (isRecovery) { + executionGraph.restoreLatestCheckpointedState() --- End diff -- Is it intended that now failures in the `restoreLatestCheckpointedState` are non recoverable as well? This seems to be different from the former implementation. > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147465#comment-15147465 ] ASF GitHub Bot commented on FLINK-3396: --- Github user uce commented on the pull request: https://github.com/apache/flink/pull/1633#issuecomment-184250973 @tillrohrmann, could you have a look at this change? > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (FLINK-3396) Job submission Savepoint restore logic flawed
[ https://issues.apache.org/jira/browse/FLINK-3396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15145312#comment-15145312 ] ASF GitHub Bot commented on FLINK-3396: --- GitHub user uce opened a pull request: https://github.com/apache/flink/pull/1633 [FLINK-3396] [runtime] Fail job submission after state restore failure If state restore fails during job graph submission, the submitting client never gets notified about the submission failure. For detached submissions, this results in a time out rather than the actual failure cause. The actual failure can only be seen in the logs. This PR changes the behaviour to fail the submission. The failure cause is wrapped in a `SuppressRestartsException` in order to prevent the restart strategy from kicking in. We call `ExecutionGraph#fail(Throwable)` in the first place in order to have proper clean up. You can merge this pull request into a Git repository by running: $ git pull https://github.com/uce/flink 3396-state_restore Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/1633.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1633 commit 982857bc0f5248a694d908141df0330de2befa3c Author: Ufuk CelebiDate: 2016-02-12T20:59:58Z [FLINK-3396] [runtime] Fail job submission after state restore failure Problem: If state restore fails during job graph submission, the submitting client never gets notified about the submission failure. For detached submissions, this results in a time out rather than the actual failure cause. The actual failure can only be seen in the logs. Solution: Fail the submission if state restore fails. commit 9d2f4d88e99c3e50d8c001c8aa2cdb2848c80478 Author: Ufuk Celebi Date: 2016-02-12T21:09:29Z [hotfix] Rename UnrecoverableException to SuppressRestartsException > Job submission Savepoint restore logic flawed > - > > Key: FLINK-3396 > URL: https://issues.apache.org/jira/browse/FLINK-3396 > Project: Flink > Issue Type: Bug >Reporter: Ufuk Celebi >Assignee: Ufuk Celebi > Fix For: 1.0.0 > > > When savepoint restoring fails, the thrown Exception fails the execution > graph, but the client is not informed about the failure. > The expected behaviour is that the submission should be acked with success or > failure in any case. With savepoint restore failures, the ack message will be > skipped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)