[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails
[ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16419265#comment-16419265 ]

ASF GitHub Bot commented on FLINK-9097:
---

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/5774

> Jobs can be dropped in HA when job submission fails
> ---
>
>                 Key: FLINK-9097
>                 URL: https://issues.apache.org/jira/browse/FLINK-9097
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> Jobs can be dropped in HA mode if the job submission step fails. In such a
> case, we should fail fatally to let the {{Dispatcher}} restart and retry to
> recover all jobs.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails
[ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418988#comment-16418988 ]

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/5774

    Thanks for the review @GJL. Merging this PR and #5784.
[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails
[ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418711#comment-16418711 ]

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/5774

    I had to put another commit on top of it to fix the failing `DispatcherTest#testWaitingForJobMasterLeadership`, @GJL. The new commit makes sure that we first recover all jobs before we set the fencing token of the `Dispatcher`. That way, no other action, e.g. another job submission, can interfere with the job recovery.
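The ordering described in this comment, recover all jobs first and set the fencing token only afterwards, can be sketched roughly as below. This is a minimal illustration and not Flink's `Dispatcher`: the class `FencedRecoverySketch` and its methods are hypothetical, jobs are represented as plain strings, and the sketch assumes that any call carrying a missing or non-matching fencing token is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical sketch; the names do not mirror Flink's Dispatcher internals. */
final class FencedRecoverySketch {

    /** Stays null until recovery has finished, so every fenced call is rejected until then. */
    private UUID fencingToken;

    private final List<String> runningJobs = new ArrayList<>();

    /** Accepts a submission only if the caller presents the currently valid fencing token. */
    boolean submitJob(UUID callerToken, String jobId) {
        if (fencingToken == null || !fencingToken.equals(callerToken)) {
            return false; // not yet leader, or stale token
        }
        runningJobs.add(jobId);
        return true;
    }

    /** Recovers the persisted jobs first and only then starts accepting fenced calls. */
    void grantLeadership(UUID newToken, List<String> persistedJobs) {
        runningJobs.addAll(persistedJobs); // step 1: recovery, no submission can interleave
        fencingToken = newToken;           // step 2: the token becomes valid
    }

    /** Returns a copy of the jobs known to this sketch, in insertion order. */
    List<String> runningJobs() {
        return new ArrayList<>(runningJobs);
    }
}
```

Because the token is assigned last, a submission that arrives while recovery is still in progress is rejected rather than mixed in with the recovered jobs.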
[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails
[ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417298#comment-16417298 ]

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/5774

    Rebased on the latest master. Sorry for the inconvenience, @GJL. I've addressed all your comments so far.
[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails
[ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417297#comment-16417297 ]

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5774#discussion_r177739972

    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherTest.java ---
    @@ -507,19 +539,21 @@ private TestingDispatcher(
                     jobManagerMetricGroup,
                     metricQueryServicePath,
                     archivedExecutionGraphStore,
    -                new ExpectedJobIdJobManagerRunnerFactory(expectedJobId),
    +                jobManagerRunnerFactory,
                     fatalErrorHandler,
                     null);
             }
    -        @Override
    -        public CompletableFuture submitJob(final JobGraph jobGraph, final Time timeout) {
    -            final CompletableFuture submitJobFuture = super.submitJob(jobGraph, timeout);
    -
    -            submitJobFuture.thenAccept(ignored -> submitJobLatch.countDown());
    -            return submitJobFuture;
    -        }
    +//
    +//        @Override
    --- End diff --

    Will remove it
[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails
[ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417287#comment-16417287 ]

ASF GitHub Bot commented on FLINK-9097:
---

Github user GJL commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5774#discussion_r177726769

    --- Diff: flink-runtime/src/test/java/org/apache/flink/runtime/dispatcher/DispatcherTest.java ---
    @@ -507,19 +539,21 @@ private TestingDispatcher(
                     jobManagerMetricGroup,
                     metricQueryServicePath,
                     archivedExecutionGraphStore,
    -                new ExpectedJobIdJobManagerRunnerFactory(expectedJobId),
    +                jobManagerRunnerFactory,
                     fatalErrorHandler,
                     null);
             }
    -        @Override
    -        public CompletableFuture submitJob(final JobGraph jobGraph, final Time timeout) {
    -            final CompletableFuture submitJobFuture = super.submitJob(jobGraph, timeout);
    -
    -            submitJobFuture.thenAccept(ignored -> submitJobLatch.countDown());
    -            return submitJobFuture;
    -        }
    +//
    +//        @Override
    --- End diff --

    delete commented code?
[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails
[ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417281#comment-16417281 ]

ASF GitHub Bot commented on FLINK-9097:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5774#discussion_r177734109

    --- Diff: flink-runtime/src/test/resources/log4j-test.properties ---
    @@ -16,7 +16,7 @@
     # limitations under the License.

    -log4j.rootLogger=OFF, console
    +log4j.rootLogger=INFO, console
    --- End diff --

    Good catch. Will revert it.
[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails
[ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16417244#comment-16417244 ]

ASF GitHub Bot commented on FLINK-9097:
---

Github user GJL commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5774#discussion_r177725828

    --- Diff: flink-runtime/src/test/resources/log4j-test.properties ---
    @@ -16,7 +16,7 @@
     # limitations under the License.

    -log4j.rootLogger=OFF, console
    +log4j.rootLogger=INFO, console
    --- End diff --

    No
[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails
[ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16415378#comment-16415378 ]

ASF GitHub Bot commented on FLINK-9097:
---

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/5774

    [FLINK-9097] Fail fatally if job submission fails when recovering jobs

    ## What is the purpose of the change

    In order to not drop jobs, we have to fail fatally if a job submission fails when recovering jobs. In HA mode, this will restart the Dispatcher and let it retry to recover all jobs.

    This PR is based on #5746.

    cc @GJL

    ## Brief change log

    - Restructured `Dispatcher#submitJob` method
    - Registered callback to listen to job submission result
    - Fail `Dispatcher` if the job submission fails while recovering a job

    ## Verifying this change

    - Added `DispatcherTest#testJobSubmissionErrorAfterJobRecovery`

    ## Does this pull request potentially affect one of the following parts:

    - Dependencies (does it add or upgrade a dependency): (no)
    - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
    - The serializers: (no)
    - The runtime per-record code paths (performance sensitive): (no)
    - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
    - The S3 file system connector: (no)

    ## Documentation

    - Does this pull request introduce a new feature? (no)
    - If yes, how is the feature documented? (not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink reintroduceFatalErrorHandler

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5774

commit 6c69755c077e78c46a40a0d9f35e435d7ef1618b
Author: Till Rohrmann
Date:   2018-03-22T09:46:04Z

    [hotfix] Extend TestingFatalErrorHandler to return an error future

commit 080132d8ec938eddb545ff3a80d0039402c48e94
Author: Till Rohrmann
Date:   2018-03-22T09:46:28Z

    [hotfix] Add BiFunctionWithException

commit ff19155c1bcca8610ca78ae41fa607ede94ddffc
Author: Till Rohrmann
Date:   2018-03-21T21:36:33Z

    [FLINK-8943] [ha] Fail Dispatcher if jobs cannot be recovered from HA store

    In HA mode, the Dispatcher should fail if it cannot recover the persisted jobs. The idea
    is that another Dispatcher will be brought up and tries it again. This is better than
    simply dropping the not recovered jobs.

commit 4656c2adeb93500c02d63adbfb90b8eecabb474b
Author: Till Rohrmann
Date:   2018-03-27T07:45:13Z

    [hotfix] Re-introduce FatalErrorHandler to JobManagerRunner

commit 93bb2799c08b23398b6927fb599770260fad2c8f
Author: Till Rohrmann
Date:   2018-03-27T08:00:56Z

    [hotfix] Correct JavaDocs in SubmittedJobGraphStore and add Nullable annotation

commit 2ba75d09e38d23c242e147adf613af91328b219a
Author: Till Rohrmann
Date:   2018-03-27T08:59:54Z

    [FLINK-9097] Fail fatally if job submission fails when recovering jobs

    In order to not drop jobs, we have to fail fatally if a job submission fails when
    recovering jobs. In HA mode, this will restart the Dispatcher and let it retry to
    recover all jobs.
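The idea in the last commit message, registering a callback on the submission result and escalating a failure during recovery as fatal, could look roughly like the following. This is only a sketch under stated assumptions, not the code from the pull request: `RecoveringDispatcherSketch` and `submitRecoveredJob` are hypothetical names, the fatal error handler is modelled as a plain `Consumer<Throwable>`, and the submission result is a `CompletableFuture<String>` standing in for whatever acknowledgement type the real code uses.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Hypothetical stand-in for a dispatcher; not Flink's actual Dispatcher class. */
final class RecoveringDispatcherSketch {

    /** Assumed callback for unrecoverable errors, e.g. one that terminates the process. */
    private final Consumer<Throwable> fatalErrorHandler;

    RecoveringDispatcherSketch(Consumer<Throwable> fatalErrorHandler) {
        this.fatalErrorHandler = fatalErrorHandler;
    }

    /**
     * Registers a callback on the submission result of a recovered job.
     * A failure is reported to the fatal error handler instead of being dropped.
     */
    CompletableFuture<String> submitRecoveredJob(String jobId, CompletableFuture<String> submission) {
        return submission.whenComplete((ack, failure) -> {
            if (failure != null) {
                fatalErrorHandler.accept(
                    new IllegalStateException("Could not recover job " + jobId, failure));
            }
        });
    }
}
```

In an HA setup, the handler would presumably bring the process down so that a restarted or standby instance retries the recovery; in this sketch it simply receives the wrapped exception.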