[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2019-07-16 Thread Aljoscha Krettek (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886694#comment-16886694
 ] 

Aljoscha Krettek commented on FLINK-4810:
-

This feature has been implemented in FLINK-12364.

> Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful 
> checkpoints
> 
>
> Key: FLINK-4810
> URL: https://issues.apache.org/jira/browse/FLINK-4810
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Checkpointing
>Reporter: Stephan Ewen
>Assignee: vinoyang
>Priority: Major
>  Labels: pull-request-available
>
> The Checkpoint coordinator should track the number of consecutive 
> unsuccessful checkpoints.
> If more than {{n}} (configured value) checkpoints fail in a row, it should 
> call {{fail()}} on the execution graph to trigger a recovery.
> The design document is here:
> https://docs.google.com/document/d/1ce7RtecuTxcVUJlnU44hzcO2Dwq9g4Oyd8_biy94hJc/edit?usp=sharing
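
The behaviour requested in the issue description can be sketched in a few lines of Java. This is only a minimal illustration under stated assumptions, not the code from PR #3334 or from FLINK-12364: the class ConsecutiveFailureTracker, its method names, and the Runnable that stands in for calling fail() on the execution graph are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper; it does not exist in Flink under this name. */
final class ConsecutiveFailureTracker {
    /** The configured value "n" from the issue description. */
    private final int maxFailedCheckpoints;
    /** Stands in for calling fail() on the execution graph (an assumption). */
    private final Runnable failJob;
    /** Number of checkpoints that have failed in a row. */
    private final AtomicInteger consecutiveFailures = new AtomicInteger(0);

    ConsecutiveFailureTracker(int maxFailedCheckpoints, Runnable failJob) {
        if (maxFailedCheckpoints < 0) {
            throw new IllegalArgumentException("maxFailedCheckpoints must be >= 0");
        }
        this.maxFailedCheckpoints = maxFailedCheckpoints;
        this.failJob = failJob;
    }

    /** A completed checkpoint ends the current streak of failures. */
    void onCheckpointSuccess() {
        consecutiveFailures.set(0);
    }

    /** Returns true if this failure made the streak exceed "n" and the job was failed. */
    boolean onCheckpointFailure() {
        int failures = consecutiveFailures.incrementAndGet();
        if (failures > maxFailedCheckpoints) {
            // Reset so that the recovered job starts with a clean streak.
            consecutiveFailures.set(0);
            failJob.run();
            return true;
        }
        return false;
    }

    int currentStreak() {
        return consecutiveFailures.get();
    }
}
```

With n = 2, two failures in a row are tolerated and the third one triggers the recovery; any successful checkpoint in between resets the count.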



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718490#comment-16718490
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

ramkrish86 commented on issue #3334: FLINK-4810 Checkpoint Coordinator should 
fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446471220
 
 
   Closing the PR as per request.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718491#comment-16718491
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

ramkrish86 closed pull request #3334: FLINK-4810 Checkpoint Coordinator should 
fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java b/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
index 0592e3d9aea..9f453d0f2c8 100644
--- a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
+++ b/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
@@ -132,6 +132,8 @@
 
     /** The maximum number of checkpoints that may be in progress at the same time */
     private final int maxConcurrentCheckpointAttempts;
+    /** The maximum number of unsuccessful checkpoints */
+    private final int maxFailedCheckpoints;
 
     /** The timer that handles the checkpoint timeouts and triggers periodic checkpoints */
     private final Timer timer;
@@ -142,6 +144,9 @@
     /** The number of consecutive failed trigger attempts */
     private final AtomicInteger numUnsuccessfulCheckpointsTriggers = new AtomicInteger(0);
 
+    /** The number of consecutive failed checkpoints */
+    private final AtomicInteger numFailedCheckpoints = new AtomicInteger(0);
+
     private ScheduledTrigger currentPeriodicTrigger;
 
     /** The timestamp (via {@link System#nanoTime()}) when the last checkpoint completed */
@@ -163,6 +168,23 @@
     private CheckpointStatsTracker statsTracker;
 
     // 
+    public CheckpointCoordinator(
+            JobID job,
+            long baseInterval,
+            long checkpointTimeout,
+            long minPauseBetweenCheckpoints,
+            int maxConcurrentCheckpointAttempts,
+            ExternalizedCheckpointSettings externalizeSettings,
+            ExecutionVertex[] tasksToTrigger,
+            ExecutionVertex[] tasksToWaitFor,
+            ExecutionVertex[] tasksToCommitTo,
+            CheckpointIDCounter checkpointIDCounter,
+            CompletedCheckpointStore completedCheckpointStore,
+            String checkpointDirectory,
+            Executor executor) {
+        this(job, baseInterval, checkpointTimeout, minPauseBetweenCheckpoints, maxConcurrentCheckpointAttempts, 0, externalizeSettings, tasksToTrigger, tasksToWaitFor, tasksToCommitTo,
+            checkpointIDCounter, completedCheckpointStore, checkpointDirectory, executor);
+    }
 
     public CheckpointCoordinator(
             JobID job,
@@ -170,6 +192,7 @@ public CheckpointCoordinator(
             long checkpointTimeout,
             long minPauseBetweenCheckpoints,
             int maxConcurrentCheckpointAttempts,
+            int maxFailedCheckpoints,
             ExternalizedCheckpointSettings externalizeSettings,
             ExecutionVertex[] tasksToTrigger,
             ExecutionVertex[] tasksToWaitFor,
@@ -184,6 +207,7 @@ public CheckpointCoordinator(
         checkArgument(checkpointTimeout >= 1, "Checkpoint timeout must be larger than zero");
         checkArgument(minPauseBetweenCheckpoints >= 0, "minPauseBetweenCheckpoints must be >= 0");
         checkArgument(maxConcurrentCheckpointAttempts >= 1, "maxConcurrentCheckpointAttempts must be >= 1");
+        checkArgument(maxFailedCheckpoints >= 0, "maxFailedCheckpoints must be >= 0");
 
         if (externalizeSettings.externalizeCheckpoints() && checkpointDirectory == null) {
             throw new IllegalStateException("CheckpointConfig says to persist periodic " +
@@ -207,6 +231,7 @@ public CheckpointCoordinator(
         this.checkpointTimeout = checkpointTimeout;
         this.minPauseBetweenCheckpointsNanos = minPauseBetweenCheckpoints * 1_000_000;
         this.maxConcurrentCheckpointAttempts = maxConcurrentCheckpointAttempts;
+        this.maxFailedCheckpoints = maxFailedCheckpoints;
         this.tasksToTrigger = checkNotNull(tasksToTrigger);
         this.tasksToWaitFor = checkNotNull(tasksToWaitFor);
         this.tasksToCommitTo = checkNotNull(tasksToCommitTo);
@@ -461,6 +486,9 @@ 

[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716617#comment-16716617
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

azagrebin edited a comment on issue #3334: FLINK-4810 Checkpoint Coordinator 
should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446127077
 
 
   @ramkrish86 thanks for the information, could you then close this PR for now?
   cc @yanghua




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716614#comment-16716614
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

azagrebin edited a comment on issue #3334: FLINK-4810 Checkpoint Coordinator 
should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446127077
 
 
   @ramkrish86 could you then close this PR for now?
   cc @yanghua




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-11 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716608#comment-16716608
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

azagrebin commented on issue #3334: FLINK-4810 Checkpoint Coordinator should 
fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446127077
 
 
   @ramkrish86 could you then close this PR for now?




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716171#comment-16716171
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

ramkrish86 commented on issue #3334: FLINK-4810 Checkpoint Coordinator should 
fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-446070175
 
 
   @azagrebin - Thanks for the ping. I am not currently working on this. Please 
feel free to work on this or the related JIRA FLINK-10074. I will add myself as 
a watcher to understand more about it. Thanks once again.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-10 Thread vinoyang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714968#comment-16714968
 ] 

vinoyang commented on FLINK-4810:
-

[~azagrebin] OK, I'd like to write a design document about refactoring the 
checkpoint failure process.



[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-10 Thread Andrey Zagrebin (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714929#comment-16714929
 ] 

Andrey Zagrebin commented on FLINK-4810:


[~ram_krish], [~yanghua]

I think we need a design document to proceed with PRs related to this issue.

It could also reflect the results of the existing PR discussions:

[https://github.com/apache/flink/pull/3334]

https://github.com/apache/flink/pull/6567

 



[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714887#comment-16714887
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

azagrebin edited a comment on issue #3334: FLINK-4810 Checkpoint Coordinator 
should fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-445847190
 
 
   @ramkrish86 do you plan to continue working on this PR?
   There is also another ongoing effort addressing this issue, which turned 
out to be a duplicate of this one.
   https://issues.apache.org/jira/browse/FLINK-10074 
   Do you want to join the discussions?
   cc @tillrohrmann 




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-12-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714863#comment-16714863
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

azagrebin commented on issue #3334: FLINK-4810 Checkpoint Coordinator should 
fail ExecutionGraph after "n" unsuccessful checkpoints
URL: https://github.com/apache/flink/pull/3334#issuecomment-445847190
 
 
   @ramkrish86 do you plan to continue working on this PR?
   There is also another ongoing effort addressing this issue. Do you want to 
join the discussions?
   https://issues.apache.org/jira/browse/FLINK-10074
   cc @tillrohrmann 




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2018-05-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16492091#comment-16492091
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user eliaslevy commented on the issue:

https://github.com/apache/flink/pull/3334
  
Any chance this will be merged now that 1.5 is out?




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-05-09 Thread ramkrishna.s.vasudevan (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16004086#comment-16004086
 ] 

ramkrishna.s.vasudevan commented on FLINK-4810:
---

[~StephanEwen]
Can I rebase this PR onto the current code? I am not sure about the current 
status of CheckpointCoordinator. Has this already been taken care of?



[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902723#comment-15902723
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/3334
  
@StephanEwen 
No problem. I appreciate your time and efforts. 




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-09 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902712#comment-15902712
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3334
  
@ramkrish86 I would like to get to this one here after the additions to the 
checkpoint coordinator I am currently working on are done.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902515#comment-15902515
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/3334
  
@StephanEwen 
I saw one of your comments in another JIRA where you talked about 
refactoring CheckpointCoordinator and PendingCheckpoint. So would you like this 
PR to wait until then?




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895513#comment-15895513
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/3334
  
Ping for reviews here!!!




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894047#comment-15894047
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/3334
  
@StephanEwen , @wenlong88 , @shixiaogang 
Please have a look at the latest push. I am now tracking the checkpointing 
failures and incrementing a new counter based on them. I have also added test 
cases. 
I have not changed the constructors of the affected class because that 
touches many files. I can update it based on the feedback on the latest PR.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892091#comment-15892091
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/3334
  
I think I have found a better way to track this. Will update the PR soon.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889998#comment-15889998
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/3334
  
Thanks for the input. I read the code. As far as I understand it, there are 
two ways a checkpoint can fail. If checkpointing cannot be performed for some 
reason, we send a DeclineCheckpoint message, which is handled by the 
CheckpointCoordinator.
The other is an external error during checkpointing, in which case we call 
failExternally, which transitions the state to FAILED, closes all the 
watchdogs, and cancels the invokable. Is the intent to track how many times 
this happens and, if so, to count such occurrences and then fail the execution 
graph?




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889872#comment-15889872
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/3334
  
I think I got what you are saying here. Execution#triggerCheckpoint is the 
actual checkpoint call, and currently we don't track whether it fails. So your 
point is that it is better to know whether the actual checkpoint triggering 
failed at the Task level and then count that as a failure. Am I right, 
@wenlong88?




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889830#comment-15889830
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/3334
  
@wenlong88 
Can you explain what you mean by checkpointing failure and trigger failure? 
If you mean tracking the number of times the execution fails after restoring 
from a checkpoint, I think FLINK-4815 is trying to address that.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889822#comment-15889822
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on a diff in the pull request:

https://github.com/apache/flink/pull/3334#discussion_r103638771
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
 ---
@@ -537,12 +562,27 @@ else if (!props.forceCheckpoint()) {
                 if (!checkpoint.isDiscarded()) {
                     checkpoint.abortError(new Exception("Failed to trigger checkpoint"));
                 }
+                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
+                    return failExecution(executions);
+                }
                 return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
             }
 
         } // end trigger lock
     }
 
+    private CheckpointTriggerResult failExecution(Execution[] executions) {
+        if (currentPeriodicTrigger != null) {
+            currentPeriodicTrigger.cancel();
+            currentPeriodicTrigger = null;
+        }
+        for (Execution execution : executions) {
+            // fail the graph
+            execution.fail(new Throwable("The number of max unsuccessful checkpoints attempts exhausted"));
--- End diff --

I verified the code once again. There is no reference to ExecutionGraph in 
CheckpointCoordinator, and calling fail on the current Execution does 
trigger the restart flow:
Execution#fail() -> marks the state as FAILED -> vertex#executionFailed() -> 
graph#jobVertexInFinalState(). So you think this way of failing won't work?




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889681#comment-15889681
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user wenlong88 commented on the issue:

https://github.com/apache/flink/pull/3334
  
Currently the `numUnsuccessfulCheckpointsTriggers` counter is reset after a 
successful trigger rather than after a successful checkpoint. But I think 
trigger failures are actually rare, and monitoring checkpoint failures is 
more valuable. What do you guys think?
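
The distinction can be made concrete with a minimal sketch. This is not the code from the pull request: the class name `CheckpointFailureTracker` and its callbacks are hypothetical, and the sketch assumes a checkpoint can fail either while it is being triggered or later, after a successful trigger. The counter is cleared only when a checkpoint completes, so a successful trigger on its own does not hide later failures.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical tracker, for illustration only; not part of Flink. */
final class CheckpointFailureTracker {

    private final int maxConsecutiveFailures;
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    CheckpointFailureTracker(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** The trigger itself failed. Returns true if the limit is now exceeded. */
    boolean onTriggerFailed() {
        return recordFailure();
    }

    /** The trigger succeeded. Deliberately does not reset the counter. */
    void onTriggerSucceeded() {
        // Nothing to do: only a completed checkpoint counts as a success.
    }

    /** The checkpoint failed after a successful trigger. Returns true if the limit is now exceeded. */
    boolean onCheckpointFailed() {
        return recordFailure();
    }

    /** The checkpoint completed, which ends the current run of failures. */
    void onCheckpointCompleted() {
        consecutiveFailures.set(0);
    }

    int consecutiveFailures() {
        return consecutiveFailures.get();
    }

    private boolean recordFailure() {
        // "More than n in a row", as worded in the issue description.
        return consecutiveFailures.incrementAndGet() > maxConsecutiveFailures;
    }
}
```

With a limit of 2, three failed checkpoints in a row report the limit as exceeded even though each of them was triggered successfully.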




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889524#comment-15889524
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user shixiaogang commented on a diff in the pull request:

https://github.com/apache/flink/pull/3334#discussion_r103612613
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
 ---
@@ -428,6 +450,9 @@ CheckpointTriggerResult triggerCheckpoint(
             catch (Throwable t) {
                 int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
                 LOG.warn("Failed to trigger checkpoint (" + numUnsuccessful + " consecutive failed attempts so far)", t);
+                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
--- End diff --

You are right. I missed it. Sorry for that.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889518#comment-15889518
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on a diff in the pull request:

https://github.com/apache/flink/pull/3334#discussion_r103612421
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
 ---
@@ -537,12 +562,27 @@ else if (!props.forceCheckpoint()) {
                 if (!checkpoint.isDiscarded()) {
                     checkpoint.abortError(new Exception("Failed to trigger checkpoint"));
                 }
+                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
+                    return failExecution(executions);
+                }
                 return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
             }
 
         } // end trigger lock
     }
 
+    private CheckpointTriggerResult failExecution(Execution[] executions) {
+        if (currentPeriodicTrigger != null) {
+            currentPeriodicTrigger.cancel();
+            currentPeriodicTrigger = null;
+        }
+        for (Execution execution : executions) {
+            // fail the graph
+            execution.fail(new Throwable("The number of max unsuccessful checkpoints attempts exhausted"));
--- End diff --

Ok sure. I will add tests for this.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889516#comment-15889516
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on a diff in the pull request:

https://github.com/apache/flink/pull/3334#discussion_r103612320
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
 ---
@@ -121,6 +121,8 @@
 
     /** The maximum number of checkpoints that may be in progress at the same time */
     private final int maxConcurrentCheckpointAttempts;
+    /** The maximum number of unsuccessful checkpoints */
+    private final int maxUnsuccessfulCheckpoints;
--- End diff --

ok.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889447#comment-15889447
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user shixiaogang commented on a diff in the pull request:

https://github.com/apache/flink/pull/3334#discussion_r103605788
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
 ---
@@ -428,6 +450,9 @@ CheckpointTriggerResult triggerCheckpoint(
             catch (Throwable t) {
                 int numUnsuccessful = numUnsuccessfulCheckpointsTriggers.incrementAndGet();
                 LOG.warn("Failed to trigger checkpoint (" + numUnsuccessful + " consecutive failed attempts so far)", t);
+                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
--- End diff --

Here the counter records the total number of failed attempts. Since a 
streaming job is intended to run for a long time, the number of failed 
attempts will eventually exceed the limit. We should use a different counter 
here, one that is reset once a pending checkpoint completes successfully.
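
The difference between the two counters shows up in a small simulation. The following is only a sketch under stated assumptions: `FailureCounters` and `simulate` are hypothetical names, not taken from the pull request, and the class is not thread-safe. It counts total failures alongside the longest run of consecutive failures for a job in which every tenth checkpoint fails.

```java
/** Hypothetical illustration; the names are not taken from the pull request. Not thread-safe. */
final class FailureCounters {

    private int totalFailures;          // grows for the lifetime of the job
    private int consecutiveFailures;    // cleared whenever a checkpoint completes
    private int maxConsecutiveFailures; // longest run of failures seen so far

    void onCheckpointFailed() {
        totalFailures++;
        consecutiveFailures++;
        maxConsecutiveFailures = Math.max(maxConsecutiveFailures, consecutiveFailures);
    }

    void onCheckpointCompleted() {
        consecutiveFailures = 0;
    }

    int totalFailures() {
        return totalFailures;
    }

    int maxConsecutiveFailures() {
        return maxConsecutiveFailures;
    }

    /** Simulates a run of checkpoints in which every failEvery-th checkpoint fails. */
    static FailureCounters simulate(int checkpoints, int failEvery) {
        FailureCounters counters = new FailureCounters();
        for (int i = 1; i <= checkpoints; i++) {
            if (i % failEvery == 0) {
                counters.onCheckpointFailed();
            } else {
                counters.onCheckpointCompleted();
            }
        }
        return counters;
    }
}
```

For `FailureCounters.simulate(1000, 10)` the total reaches 100 while the longest run of consecutive failures is 1. With a limit of 3, for example, a check against the total would fire on the fourth isolated failure, whereas a check against the consecutive count would never fire in this run.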




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889445#comment-15889445
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user shixiaogang commented on a diff in the pull request:

https://github.com/apache/flink/pull/3334#discussion_r103605271
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
 ---
@@ -537,12 +562,27 @@ else if (!props.forceCheckpoint()) {
                 if (!checkpoint.isDiscarded()) {
                     checkpoint.abortError(new Exception("Failed to trigger checkpoint"));
                 }
+                if(numUnsuccessful > maxUnsuccessfulCheckpoints) {
+                    return failExecution(executions);
+                }
                 return new CheckpointTriggerResult(CheckpointDeclineReason.EXCEPTION);
             }
 
         } // end trigger lock
     }
 
+    private CheckpointTriggerResult failExecution(Execution[] executions) {
+        if (currentPeriodicTrigger != null) {
+            currentPeriodicTrigger.cancel();
+            currentPeriodicTrigger = null;
+        }
+        for (Execution execution : executions) {
+            // fail the graph
+            execution.fail(new Throwable("The number of max unsuccessful checkpoints attempts exhausted"));
--- End diff --

I think it's not good here to fail the executions one by one. We should 
call `ExecutionGraph#fail` to fail the execution graph.
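
One possible shape for this, as a rough sketch only: the coordinator is handed a narrow callback for failing the job as a whole and calls it once, instead of looping over the executions. `FailableGraph` and `CheckpointFailureHandler` are hypothetical stand-ins invented for this illustration; the real Flink classes have much richer interfaces and may be wired together differently.

```java
/** Hypothetical stand-in for the part of the execution graph the coordinator needs. */
interface FailableGraph {
    void fail(Throwable cause);
}

/** Hypothetical handler, for illustration only; not part of Flink. */
final class CheckpointFailureHandler {

    private final FailableGraph graph;

    CheckpointFailureHandler(FailableGraph graph) {
        this.graph = graph;
    }

    /** Fails the whole graph with a single call; tasks are not failed one by one. */
    void onTooManyFailedCheckpoints(int failedInARow) {
        graph.fail(new Exception(
                "Exceeded the tolerated number of consecutive failed checkpoints: " + failedInARow));
    }
}
```

Because `FailableGraph` has a single method, a test can pass a lambda that records the call, and the coordinator needs no direct dependency on the graph class.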




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15889446#comment-15889446
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user shixiaogang commented on a diff in the pull request:

https://github.com/apache/flink/pull/3334#discussion_r103604470
  
--- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java
 ---
@@ -121,6 +121,8 @@
 
     /** The maximum number of checkpoints that may be in progress at the same time */
     private final int maxConcurrentCheckpointAttempts;
+    /** The maximum number of unsuccessful checkpoints */
+    private final int maxUnsuccessfulCheckpoints;
--- End diff --

I think `failed` is a better word than `unsuccessful`.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885697#comment-15885697
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user ramkrish86 commented on the issue:

https://github.com/apache/flink/pull/3334
  
@StephanEwen - Ping for initial reviews. Will work on it based on the 
feedback.




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872260#comment-15872260
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3334
  
Thank you for opening this pull request.
I'll try to review it in the coming days...




[jira] [Commented] (FLINK-4810) Checkpoint Coordinator should fail ExecutionGraph after "n" unsuccessful checkpoints

2017-02-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869773#comment-15869773
 ] 

ASF GitHub Bot commented on FLINK-4810:
---

GitHub user ramkrish86 opened a pull request:

https://github.com/apache/flink/pull/3334

FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n" 
unsuccessful checkpoints


Thanks for contributing to Apache Flink. Before you open your pull request, 
please take the following check list into consideration.
If your changes take all of the items into account, feel free to open your 
pull request. For more information and/or questions please refer to the [How To 
Contribute guide](http://flink.apache.org/how-to-contribute.html).
In addition to going through the list, please provide a meaningful 
description of your changes.

- [ ] General
  - The pull request references the related JIRA issue ("[FLINK-XXX] Jira 
title text")
  - The pull request addresses only one issue
  - Each commit in the PR has a meaningful commit message (including the 
JIRA id)

- [ ] Documentation
  - Documentation has been added for new functionality
  - Old documentation affected by the pull request has been updated
  - JavaDoc for public methods has been added

- [ ] Tests & Build
  - Functionality added by the pull request is covered by tests
  - `mvn clean verify` has been executed successfully locally or a Travis 
build has passed


I ran `mvn clean verify`. I have not added test cases yet, because I would 
like to get first-level feedback on the approach before doing so.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ramkrish86/flink FLINK-4810

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/3334.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3334


commit 6e0fb38272e6bb59528065461c6ec6fdd43689ad
Author: Ramkrishna 
Date:   2017-02-16T11:29:37Z

FLINK-4810 Checkpoint Coordinator should fail ExecutionGraph after "n"
unsuccessful checkpoints



