[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user eliaslevy commented on the issue: https://github.com/apache/flink/pull/3334 Any chance this will be merged now that 1.5 is out? ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 @StephanEwen No problem. I appreciate your time and efforts. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3334 @ramkrish86 I would like to get to this one here after the additions to the checkpoint coordinator I am currently working on are done. ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 @StephanEwen I saw one of your comments in another JIRA where you talked about refactoring CheckpointCoordinator and PendingCheckpoint. So would you like this PR to wait until then? ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 Just updated and did a force push to avoid the merge commit. Now things are fine. ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 Ping for reviews here!!! ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 @StephanEwen , @wenlong88 , @shixiaogang Please have a look at the latest push. I am now tracking checkpointing failures and incrementing a new counter for each one. I have also added test cases. I have not changed the constructors of the affected classes because that would touch many files. I can update them based on feedback on the latest push. ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 I think I have a better way to track this. Will update the PR soon. ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 Thanks for the input. I read the code. As far as I understand it, there are two ways a checkpoint can fail. If checkpointing cannot be performed for some reason, we send a DeclineCheckpoint message, which is handled by the CheckpointCoordinator. The other is an external error during checkpointing, in which case we call failExternally, which transitions the state to FAILED, closes the watchdog, and also cancels the invokable. Is the intent to track how many times such failures occur and then fail the execution graph? ---
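To make the question above concrete, here is a minimal sketch of counting failures from both paths and failing the job once a threshold is exceeded. Everything in it is hypothetical: `CheckpointFailureTracker`, `FailureKind`, `maxTolerated`, and the `Runnable` callback are illustrative names, not Flink APIs, and the real coordinator and task classes are not modeled.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical sketch, not Flink code: counts checkpoint failures by cause. */
public class CheckpointFailureTracker {

    /** The two failure paths described in the comment above. */
    public enum FailureKind { DECLINED, EXTERNAL_FAILURE }

    private final int maxTolerated;   // assumed threshold, illustrative only
    private final Runnable failJob;   // stand-in for failing the execution graph
    private final Map<FailureKind, Integer> counts = new EnumMap<>(FailureKind.class);
    private boolean jobFailed;

    public CheckpointFailureTracker(int maxTolerated, Runnable failJob) {
        this.maxTolerated = maxTolerated;
        this.failJob = failJob;
    }

    /** Records one failure; runs the callback once when the total exceeds the threshold. */
    public synchronized void onFailure(FailureKind kind) {
        counts.merge(kind, 1, Integer::sum);
        if (!jobFailed && total() > maxTolerated) {
            jobFailed = true;
            failJob.run();
        }
    }

    /** Number of failures recorded so far for the given cause. */
    public synchronized int count(FailureKind kind) {
        return counts.getOrDefault(kind, 0);
    }

    /** Number of failures recorded so far across all causes. */
    public synchronized int total() {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }
}
```

Whether declined checkpoints and external failures should share one threshold, as they do here, is a design choice this sketch does not settle.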
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 I think I got what you are saying here. Execution#triggerCheckpoint is the actual checkpoint call, and currently we don't track whether it fails. So your point is that it is better to know whether the actual checkpoint triggering failed at the Task level and then count that as a failure. Am I right @wenlong88 ? ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 @wenlong88 Can you say more about what you mean by checkpointing failure and trigger failure? If you mean tracking the number of times the execution fails after restoring from a checkpoint, I think FLINK-4815 is focused on that. ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user wenlong88 commented on the issue: https://github.com/apache/flink/pull/3334 Currently `numUnsuccessfulCheckpointsTriggers` is reset after a successful trigger instead of after a successful checkpoint. But I think trigger failures are actually rare, and monitoring checkpoint failures is more valuable. What do you guys think? ---
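The difference between the two reset policies in the comment above can be illustrated with a small sketch. It is not Flink code: `ResetPolicyDemo`, `ConsecutiveCounter`, and `attempt` are hypothetical names, and each checkpoint attempt is reduced to two booleans (did the trigger succeed, did the checkpoint complete).

```java
/** Hypothetical sketch, not Flink code: contrasts two reset policies for a failure counter. */
public class ResetPolicyDemo {

    /** Counter of consecutive failures that is cleared by reset(). */
    public static final class ConsecutiveCounter {
        private int value;
        public void increment() { value++; }
        public void reset() { value = 0; }
        public int get() { return value; }
    }

    public final ConsecutiveCounter resetOnTrigger = new ConsecutiveCounter();
    public final ConsecutiveCounter resetOnCompletion = new ConsecutiveCounter();

    /** One checkpoint attempt: did the trigger succeed, and did the checkpoint complete? */
    public void attempt(boolean triggerOk, boolean checkpointOk) {
        boolean completed = triggerOk && checkpointOk;
        // Policy A: only trigger failures count; any successful trigger clears the counter.
        if (triggerOk) {
            resetOnTrigger.reset();
        } else {
            resetOnTrigger.increment();
        }
        // Policy B: any attempt that does not complete counts; only completion clears it.
        if (completed) {
            resetOnCompletion.reset();
        } else {
            resetOnCompletion.increment();
        }
    }

    public static void main(String[] args) {
        ResetPolicyDemo demo = new ResetPolicyDemo();
        // Three attempts that trigger fine but never complete.
        for (int i = 0; i < 3; i++) {
            demo.attempt(true, false);
        }
        System.out.println("reset on trigger:    " + demo.resetOnTrigger.get());    // 0
        System.out.println("reset on completion: " + demo.resetOnCompletion.get()); // 3
    }
}
```

Under policy A, a job whose checkpoints trigger successfully but never complete would not accumulate any failures, which is the concern raised above.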
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user ramkrish86 commented on the issue: https://github.com/apache/flink/pull/3334 @StephanEwen - Ping for initial reviews. Will work on it based on the feedback. ---
[GitHub] flink issue #3334: FLINK-4810 Checkpoint Coordinator should fail ExecutionGr...
Github user StephanEwen commented on the issue: https://github.com/apache/flink/pull/3334 Thank you for opening this pull request. I'll try to review it in the coming days... ---