[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886694#comment-16886694 ]

Aljoscha Krettek commented on FLINK-4810:

This feature has been implemented in FLINK-12364.

> Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
>
> Key: FLINK-4810
> URL: https://issues.apache.org/jira/browse/FLINK-4810
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Checkpointing
> Reporter: Stephan Ewen
> Assignee: vinoyang
> Priority: Major
> Labels: pull-request-available
>
> The Checkpoint coordinator should track the number of consecutive unsuccessful checkpoints.
> If more than {{n}} (configured value) checkpoints fail in a row, it should call {{fail()}} on the execution graph to trigger a recovery.
> The design document is here:
> https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
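The behaviour requested in the issue description (count consecutive unsuccessful checkpoints, and trigger a recovery once more than n have failed in a row) can be illustrated with a minimal, self-contained sketch. This is not the code from FLINK-12364 or from PR #3334. The class `ConsecutiveFailureTracker`, its method names, and the `Runnable` standing in for `fail()` on the execution graph are all hypothetical, and resetting the counter after the job has been failed is an assumption made for the example.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not Flink API): counts consecutive checkpoint failures. */
final class ConsecutiveFailureTracker {

    /** The configured "n": how many consecutive failures are tolerated. */
    private final int maxFailedCheckpoints;

    /** Stand-in for calling fail() on the execution graph. */
    private final Runnable failJob;

    /** Number of checkpoints that failed in a row so far. */
    private final AtomicInteger numFailedCheckpoints = new AtomicInteger(0);

    ConsecutiveFailureTracker(int maxFailedCheckpoints, Runnable failJob) {
        if (maxFailedCheckpoints < 0) {
            throw new IllegalArgumentException("maxFailedCheckpoints must be >= 0");
        }
        this.maxFailedCheckpoints = maxFailedCheckpoints;
        this.failJob = Objects.requireNonNull(failJob);
    }

    /** A completed checkpoint ends the current run of failures. */
    void onCheckpointCompleted() {
        numFailedCheckpoints.set(0);
    }

    /**
     * Records one failed checkpoint.
     *
     * @return true if more than n checkpoints failed in a row and failJob was invoked
     */
    boolean onCheckpointFailed() {
        int failures = numFailedCheckpoints.incrementAndGet();
        if (failures > maxFailedCheckpoints) {
            // Assumption: start counting from zero again after the recovery is triggered.
            numFailedCheckpoints.set(0);
            failJob.run();
            return true;
        }
        return false;
    }
}

final class ConsecutiveFailureTrackerDemo {
    public static void main(String[] args) {
        ConsecutiveFailureTracker tracker =
            new ConsecutiveFailureTracker(2, () -> System.out.println("failing the job"));
        tracker.onCheckpointFailed();   // first failure in a row, tolerated
        tracker.onCheckpointFailed();   // second failure in a row, tolerated
        tracker.onCheckpointFailed();   // third failure exceeds n = 2, job is failed
    }
}
```

With n = 2, the first two consecutive failures are tolerated and the third one invokes the callback. A successful checkpoint in between resets the run.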
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718490#comment-16718490 ]

ASF GitHub Bot commented on FLINK-4810:

ramkrish86 commented on issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446471220

Closing the PR as per request.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718491#comment-16718491 ]

ASF GitHub Bot commented on FLINK-4810:

ramkrish86 closed pull request #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java b/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
index 0592e3d9aea..9f453d0f2c8 100644
--- a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
+++ b/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
@@ -132,6 +132,8 @@
     /** The maximum number of checkpoints that may be in progress at the same time */
     private final int maxConcurrentCheckpointAttempts;
+    /** The maximum number of unsuccessful checkpoints */
+    private final int maxFailedCheckpoints;
     /** The timer that handles the checkpoint timeouts and triggers periodic checkpoints */
     private final Timer timer;
@@ -142,6 +144,9 @@
     /** The number of consecutive failed trigger attempts */
     private final AtomicInteger numUnsuccessfulCheckpointsTriggers = new AtomicInteger(0);
+    /** The number of consecutive failed checkpoints */
+    private final AtomicInteger numFailedCheckpoints = new AtomicInteger(0);
+
     private ScheduledTrigger currentPeriodicTrigger;
     /** The timestamp (via {@link System#nanoTime()}) when the last checkpoint completed */
@@ -163,6 +168,23 @@
     private CheckpointStatsTracker statsTracker;
     //
+    public CheckpointCoordinator(
+            JobID job,
+            long baseInterval,
+            long checkpointTimeout,
+            long minPauseBetweenCheckpoints,
+            int maxConcurrentCheckpointAttempts,
+            ExternalizedCheckpointSettings externalizeSettings,
+            ExecutionVertex[] tasksToTrigger,
+            ExecutionVertex[] tasksToWaitFor,
+            ExecutionVertex[] tasksToCommitTo,
+            CheckpointIDCounter checkpointIDCounter,
+            CompletedCheckpointStore completedCheckpointStore,
+            String checkpointDirectory,
+            Executor executor) {
+        this(job, baseInterval, checkpointTimeout, minPauseBetweenCheckpoints, maxConcurrentCheckpointAttempts, 0, externalizeSettings, tasksToTrigger, tasksToWaitFor, tasksToCommitTo,
+            checkpointIDCounter, completedCheckpointStore, checkpointDirectory, executor);
+    }
     public CheckpointCoordinator(
             JobID job,
@@ -170,6 +192,7 @@ public CheckpointCoordinator(
             long checkpointTimeout,
             long minPauseBetweenCheckpoints,
             int maxConcurrentCheckpointAttempts,
+            int maxFailedCheckpoints,
             ExternalizedCheckpointSettings externalizeSettings,
             ExecutionVertex[] tasksToTrigger,
             ExecutionVertex[] tasksToWaitFor,
@@ -184,6 +207,7 @@ public CheckpointCoordinator(
         checkArgument(checkpointTimeout >= 1, "Checkpoint timeout must be larger than zero");
         checkArgument(minPauseBetweenCheckpoints >= 0, "minPauseBetweenCheckpoints must be >= 0");
         checkArgument(maxConcurrentCheckpointAttempts >= 1, "maxConcurrentCheckpointAttempts must be >= 1");
+        checkArgument(maxFailedCheckpoints >= 0, "maxFailedCheckpoints must be >= 0");

         if (externalizeSettings.externalizeCheckpoints() && checkpointDirectory == null) {
             throw new IllegalStateException("CheckpointConfig says to persist periodic " +
@@ -207,6 +231,7 @@ public CheckpointCoordinator(
         this.checkpointTimeout = checkpointTimeout;
         this.minPauseBetweenCheckpointsNanos = minPauseBetweenCheckpoints * 1_000_000;
         this.maxConcurrentCheckpointAttempts = maxConcurrentCheckpointAttempts;
+        this.maxFailedCheckpoints = maxFailedCheckpoints;
         this.tasksToTrigger = checkNotNull(tasksToTrigger);
         this.tasksToWaitFor = checkNotNull(tasksToWaitFor);
         this.tasksToCommitTo = checkNotNull(tasksToCommitTo);
@@ -461,6 +486
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716617#comment-16716617 ]

ASF GitHub Bot commented on FLINK-4810:

azagrebin edited a comment on issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446127077

@ramkrish86 thanks for the information. Could you then close this PR for now? cc @yanghua
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716614#comment-16716614 ]

ASF GitHub Bot commented on FLINK-4810:

azagrebin edited a comment on issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446127077

@ramkrish86 could you then close this PR for now? cc @yanghua
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716608#comment-16716608 ]

ASF GitHub Bot commented on FLINK-4810:

azagrebin commented on issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446127077

@ramkrish86 could you then close this PR for now?
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716171#comment-16716171 ]

ASF GitHub Bot commented on FLINK-4810:

ramkrish86 commented on issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446070175

@azagrebin - Thanks for the ping. I am currently not working on this. Please feel free to work on this or on the related JIRA FLINK-10074. I will add myself as a watcher to understand more about it. Thanks once again.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714968#comment-16714968 ]

vinoyang commented on FLINK-4810:

[~azagrebin] OK, I'd like to write a design document about refactoring the checkpoint failure process.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714929#comment-16714929 ]

Andrey Zagrebin commented on FLINK-4810:

[~ram_krish], [~yanghua] I think we need a design document to proceed with the PRs related to this issue. It could also reflect the results of the existing PR discussions:
https://github.com/apache/flink/pull/3334
https://github.com/apache/flink/pull/6567
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714887#comment-16714887 ]

ASF GitHub Bot commented on FLINK-4810:

azagrebin edited a comment on issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-445847190

@ramkrish86 do you plan to continue working on this PR? There is also another ongoing effort addressing this issue, which turned out to be a duplicate of this one:
https://issues.apache.org/jira/browse/FLINK-10074
Do you want to join the discussions? cc @tillrohrmann
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16714863#comment-16714863 ]

ASF GitHub Bot commented on FLINK-4810:

azagrebin commented on issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-445847190

@ramkrish86 do you plan to continue working on this PR? There is also another ongoing effort addressing this issue. Do you want to join the discussions?
https://issues.apache.org/jira/browse/FLINK-10074
cc @tillrohrmann
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492091#comment-16492091 ]

ASF GitHub Bot commented on FLINK-4810:

Github user eliaslevy commented on the issue: https://github.com/apache/flink/pull/3334

Any chance this will be merged now that 1.5 is out?
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004086#comment-16004086 ]

ramkrishna.s.vasudevan commented on FLINK-4810:

[~StephanEwen] Can I rebase this PR onto the current code? I am not sure about the current status of CheckpointCoordinator. Has this already been taken care of?
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902723#comment-15902723 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334

@StephanEwen No problem. I appreciate your time and efforts.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902712#comment-15902712 ]

ASF GitHub Bot commented on FLINK-4810:

Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3334

@ramkrish86 I would like to get to this one here after the additions to the checkpoint coordinator I am currently working on are done.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15902515#comment-15902515 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334

@StephanEwen I saw one of your comments in another JIRA where you talked about refactoring CheckpointCoordinator and PendingCheckpoint. So would you like this PR to wait until then?
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15895513#comment-15895513 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334

Ping for reviews here!!!
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894047#comment-15894047 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334

@StephanEwen, @wenlong88, @shixiaogang Please have a look at the latest push. I am now tracking the checkpointing failures and incrementing a new counter based on them. I have also added test cases. I have not changed the constructors of the affected class because that touches many files. I can update it based on the feedback on the latest PR.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892091#comment-15892091 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334

I think I got a better way to track this. Will update the PR soon.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889998#comment-15889998 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334

Thanks for the input. I read the code. There are two ways a checkpoint can fail (as per my understanding of the code). If for some reason checkpointing cannot be performed, we send a DeclineCheckpoint message, which is handled by the CheckpointCoordinator. The other is an external error in checkpointing, in which case we call failExternally. That transitions the state to FAILED, closes all the watchdogs, and also cancels the invokable. Is the intent now to track how many times this happens, count such occurrences of failure, and then fail the execution graph?
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889872#comment-15889872 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334

I think I got what you are saying here. Execution#triggerCheckpoint is the actual checkpoint call, and currently we don't track whether it fails. So your point is that it is better to know whether the actual checkpoint triggering failed at the Task level, and then count that as a failure. Am I right @wenlong88?
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889830#comment-15889830 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334

@wenlong88 Can you say more about what you mean by checkpointing failure and trigger failure? If you are talking about tracking the number of times the execution fails after restoring from a checkpoint, I think FLINK-4815 is trying to address that.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889822#comment-15889822 ]

ASF GitHub Bot commented on FLINK-4810:

Github user ramkrish86 commented on a diff in the pull request: https://github.com/apache/flink/pull/3334#discussion_r103638771

--- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
@@ -537,12 +562,27 @@ else if (!props.forceCheckpoint()) {
                 if (!checkpoint.isDiscarded()) {
                     checkpoint.abortError(new Exception("Failed to trigger checkpoint"));
                 }
+                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
+                    return failExecution(executions);
+                }
                 return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
             }
         } // end trigger lock
     }
+    private CheckpointTriggerResult failExecution(Execution[] executions) {
+        if (currentPeriodicTrigger != null) {
+            currentPeriodicTrigger.cancel();
+            currentPeriodicTrigger = null;
+        }
+        for (Execution execution : executions) {
+            // fail the graph
+            execution.fail(new Throwable("The number of max unsuccessful checkpoints attempts exhausted"));
--- End diff --

I verified the code once again. There is no reference to ExecutionGraph in CheckpointCoordinator, and calling fail on the current Execution actually triggers the restart flow: Execution#fail() -> marks state as FAILED -> vertex#executionFailed() -> graph#jobVertexInFinalState(). So you think this way of failing won't work?
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889681#comment-15889681 ]

ASF GitHub Bot commented on FLINK-4810:
---

Github user wenlong88 commented on the issue:

    https://github.com/apache/flink/pull/3334

    Currently `numUnsuccessfulCheckpointsTriggers` is reset after a successful trigger rather than after a successful checkpoint. But I think trigger failures are actually rare, and monitoring checkpoint failures is more valuable. What do you think?
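The distinction raised in this comment (resetting on a completed checkpoint rather than on a successful trigger) could be sketched roughly as follows. This is only an illustration under stated assumptions: `ConsecutiveCheckpointFailureCounter` and its method names are hypothetical and do not appear in the pull request or in Flink.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of the pull request): counts checkpoints
 * that failed in a row and reports when the configured limit is exceeded.
 */
final class ConsecutiveCheckpointFailureCounter {

    /** Number of consecutive failures that is still tolerated. */
    private final int maxConsecutiveFailures;

    /** Length of the current failure streak. */
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    ConsecutiveCheckpointFailureCounter(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 0) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 0");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Call when a checkpoint completes; a completed checkpoint ends the streak. */
    void onCheckpointCompleted() {
        consecutiveFailures.set(0);
    }

    /**
     * Call when a checkpoint fails (trigger failure, decline, or expiry).
     *
     * @return true if more than the configured number of checkpoints failed in a row
     */
    boolean onCheckpointFailed() {
        return consecutiveFailures.incrementAndGet() > maxConsecutiveFailures;
    }

    /** Current length of the failure streak. */
    int currentStreak() {
        return consecutiveFailures.get();
    }
}
```

With a limit of 2, two failures followed by a completed checkpoint start the streak over, so only a third failure in a row makes `onCheckpointFailed()` return true. The comparison is strict, matching the issue text's "more than n".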
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889524#comment-15889524 ]

ASF GitHub Bot commented on FLINK-4810:
---

Github user shixiaogang commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103612613

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -428,6 +450,9 @@ CheckpointTriggerResult triggerCheckpoint(
                 catch (Throwable t) {
                     int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
                     LOG.warn("Failed to trigger checkpoint (" + numUnsuccessful + " consecutive failed attempts so far)", t);
    +                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    --- End diff --

    You are right. I missed it. Sorry for that.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889518#comment-15889518 ]

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103612421

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -537,12 +562,27 @@ else if (!props.forceCheckpoint()) {
                     if (!checkpoint.isDiscarded()) {
                         checkpoint.abortError(new Exception("Failed to trigger checkpoint"));
                     }
    +                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    +                    return failExecution(executions);
    +                }
                     return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
                 }
             } // end trigger lock
         }

    +    private CheckpointTriggerResult failExecution(Execution[] executions) {
    +        if (currentPeriodicTrigger != null) {
    +            currentPeriodicTrigger.cancel();
    +            currentPeriodicTrigger = null;
    +        }
    +        for (Execution execution : executions) {
    +            // fail the graph
    +            execution.fail(new Throwable("The number of max unsuccessful checkpoints attempts exhausted"));
    --- End diff --

    Ok sure. I will add tests for this.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889516#comment-15889516 ]

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103612320

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -121,6 +121,8 @@
         /** The maximum number of checkpoints that may be in progress at the same time */
         private final int maxConcurrentCheckpointAttempts;

    +    /** The maximum number of unsuccessful checkpoints */
    +    private final int maxUnsuccessfulCheckpoints;
    --- End diff --

    ok.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889447#comment-15889447 ]

ASF GitHub Bot commented on FLINK-4810:
---

Github user shixiaogang commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103605788

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -428,6 +450,9 @@ CheckpointTriggerResult triggerCheckpoint(
                 catch (Throwable t) {
                     int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
                     LOG.warn("Failed to trigger checkpoint (" + numUnsuccessful + " consecutive failed attempts so far)", t);
    +                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    --- End diff --

    Here the counter records the total number of failed attempts. Since a streaming job is intended to run for quite a long time, the number of failed attempts will eventually exceed the limit. We should use a different counter here, one that is reset once a pending checkpoint successfully completes.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889445#comment-15889445 ]

ASF GitHub Bot commented on FLINK-4810:
---

Github user shixiaogang commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103605271

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -537,12 +562,27 @@ else if (!props.forceCheckpoint()) {
                     if (!checkpoint.isDiscarded()) {
                         checkpoint.abortError(new Exception("Failed to trigger checkpoint"));
                     }
    +                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
    +                    return failExecution(executions);
    +                }
                     return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
                 }
             } // end trigger lock
         }

    +    private CheckpointTriggerResult failExecution(Execution[] executions) {
    +        if (currentPeriodicTrigger != null) {
    +            currentPeriodicTrigger.cancel();
    +            currentPeriodicTrigger = null;
    +        }
    +        for (Execution execution : executions) {
    +            // fail the graph
    +            execution.fail(new Throwable("The number of max unsuccessful checkpoints attempts exhausted"));
    --- End diff --

    I think it's not good here to fail the executions one by one. We should call `ExecutionGraph#fail` to fail the execution graph.
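One way to read this suggestion is that the coordinator escalates once, through a single job-level callback, instead of looping over executions. The sketch below assumes a hypothetical `JobFailureHandler` interface standing in for whatever object can fail the whole graph; neither it nor `CheckpointFailureEscalator` exists in the pull request or in Flink, and the real `ExecutionGraph#fail` signature is not shown in this thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the object that can fail the whole job. */
interface JobFailureHandler {
    void failJob(Throwable cause);
}

/**
 * Hypothetical helper: fails the job as a whole, at most once, when the
 * number of consecutive checkpoint failures exceeds the allowed maximum.
 */
final class CheckpointFailureEscalator {

    private final JobFailureHandler handler;

    /** Guards against failing the job repeatedly for the same streak. */
    private final AtomicBoolean escalated = new AtomicBoolean(false);

    CheckpointFailureEscalator(JobFailureHandler handler) {
        this.handler = handler;
    }

    /**
     * @return true if this call triggered the job-level failure
     */
    boolean escalate(int consecutiveFailures, int maxAllowed) {
        if (consecutiveFailures <= maxAllowed) {
            return false; // still within the tolerated range
        }
        if (!escalated.compareAndSet(false, true)) {
            return false; // another call already failed the job
        }
        handler.failJob(new Exception(
            "Exceeded " + maxAllowed + " consecutive unsuccessful checkpoints"));
        return true;
    }
}
```

Passing the handler in from outside also keeps the coordinator free of a direct reference to the graph, which may matter given the remark elsewhere in this thread that CheckpointCoordinator holds no ExecutionGraph reference.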
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889446#comment-15889446 ]

ASF GitHub Bot commented on FLINK-4810:
---

Github user shixiaogang commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3334#discussion_r103604470

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java ---
    @@ -121,6 +121,8 @@
         /** The maximum number of checkpoints that may be in progress at the same time */
         private final int maxConcurrentCheckpointAttempts;

    +    /** The maximum number of unsuccessful checkpoints */
    +    private final int maxUnsuccessfulCheckpoints;
    --- End diff --

    I think `failed` is a better word than `unsuccessful`.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885697#comment-15885697 ]

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

    https://github.com/apache/flink/pull/3334

    @StephanEwen - Ping for initial reviews. Will work on it based on the feedback.
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15872260#comment-15872260 ]

ASF GitHub Bot commented on FLINK-4810:
---

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3334

    Thank you for opening this pull request. I'll try to review it in the coming days...
[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints
[ https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15869773#comment-15869773 ]

ASF GitHub Bot commented on FLINK-4810:
---

GitHub user ramkrish86 opened a pull request:

    https://github.com/apache/flink/pull/3334

    FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

    Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
    If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the [How To Contribute guide](http://flink.apache.org/how-to-contribute.html).
    In addition to going through the list, please provide a meaningful description of your changes.

    - [ ] General
      - The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
      - The pull request addresses only one issue
      - Each commit in the PR has a meaningful commit message (including the JIRA id)

    - [ ] Documentation
      - Documentation has been added for new functionality
      - Old documentation affected by the pull request has been updated
      - JavaDoc for public methods has been added

    - [ ] Tests & Build
      - Functionality added by the pull request is covered by tests
      - `mvn clean verify` has been executed successfully locally or a Travis build has passed

    Ran mvn clean verify. I have not added test cases yet, because I wanted first-level feedback first.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ramkrish86/flink FLINK-4810

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3334.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3334

commit 6e0fb38272e6bb59528065461c6ec6fdd43689ad
Author: Ramkrishna
Date: 2017-02-16T11:29:37Z

    FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints