[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-501997307 @StefanRRichter I have fixed most of the issues. Please review again. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-501568073 Hi @StefanRRichter thanks for your suggestion. I have fixed some issues. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-501212069 Hi @StefanRRichter I have fixed the conflicts and rebased the code. When you have time, please review it again. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-496903602 Hi @StefanRRichter I think I have fixed the existed issues. I have triggered the Travis twice, both of them are green. So I'd invite you to review it again. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-496101546 Hi @StefanRRichter In the last few days, I did further refactor, include: * remove `org.apache.flink.runtime.checkpoint.decline.xxxException` and enumerate more reason in `CheckpointFailureReason` * remove `org.apache.flink.runtime.taskexecutor.exceptions.CheckpointException` and reused unified `org.apache.flink.runtime.checkpoint.CheckpointException` * temporarily ignore serval failure reason to keep compatibility with `failOnCheckpointingErrors ` There is just one issue left now. So would you mind to give a full review again? IMO, these refactor about checkpoint exception and the unified invocation for `PendingCheckpoint#abort` will be good for other issues. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-495915181 Hi @StefanRRichter I understand what you mean and have given a reply. I think this PR has a relationship with FLINK-11662, but there is no dependency. If FLINK-11662 can be completed before this PR, it will make the CheckpointFailureManager looks better. But if it is resolved after this PR, it will not affect the semantics of this PR. I think the semantics of this PR is: tolerate failure in the situation we have controlled, and the current unhandled exception is not tolerated. In addition, the failure of the job caused by FLINK-11662 will cause the entire job to start again. Isn't the introduction of `CheckpointFailureManager` so that it can handle `CheckpointFailureReason` better? In fact, `CheckpointFailureReason` is a change point (it may increase because maybe there are other issues like FLINK-11662 we do not find or track them, or it may decrease in the future), we encapsulate its counting logic in the `handleCheckpointException` to change their counting logic in the future. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-495483854 > @StefanRRichter After thinking seriously, my view of point has changed. > > There are two options: > > * to keep all test cases and user jobs' default behavior, the `tolerableCheckpointFailureNumber`'s default value seems should be `Integer.MAX_VALUE`; > * to keep the compatibility, the `tolerableCheckpointFailureNumber`'s default value seems should be 0; > > The key problem is the two options ' **range of action** is different. The `failOnCheckpointingErrors` option just cover the task's checkpointing error in the TaskManager end (even not the whole execution phase). While the `tolerableCheckpointFailureNumber` option needs to cover the whole trigger and execution phases. > > The thought to support option1: > > In many scenes, if we set the `tolerableCheckpointFailureNumber`'s default value to 0. The behavior of the users' job would be changed. It would cause more frequency to fail and restart. For example, the task is not ready to do checkpoint so it sends a decline message to trigger failure manager to fail and restart the job. So it changed the test cases and user jobs' default behavior. This is the reason why I change the default value to `Integer.MAX_VALUE`, although they are sporadic. > > The thought to support option2: > > If we set the `tolerableCheckpointFailureNumber`'s default value to `Integer.MAX_VALUE`, it may introduce a **compatibility issue** how to handle the `failOnCheckpointingErrors` config option in the future? The original thought is that the failOnCheckpointingErrors(two values : 0 and Integer.MAX_VALUE) is a subset of the tolerableCheckpointFailureNumber(0 to Integer.MAX_VALUE). We had wanted to deprecate the `failOnCheckpointingErrors` option in the third step. > > What do you think? I tried to find a way to fix the compatibility issue (user job behavior and `failOnCheckpointingErrors`) by refining the decline exception. Currently, I extracted a new failure reason `CHECKPOINT_DECLINED_TASK_NOT_READY`(CheckpointDeclineTaskNotReadyException) from coarse-grained `CHECKPOINT_DECLINED`. I think in this way, we can keep `tolerableCheckpointFailureNumber`'s default value as 0 and do not change the user jobs' behavior. In addition, we should also continue to reduce the exception classes in package `org.apache.flink.runtime.checkpoint.decline`. It seems it's the unfinished work left over from the first step. WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-495481836 > About the problem with the SQL test, having the detailed logs from JM/TMs would be helpful. However, if I see correctly, those are batch tests and should not care about changes to checkpointing - so the error might very well be unrelated. @StefanRRichter I have debugged in my local. It seems the problem comes from the mechanism of failing job. The `DeduplicateITCase` also triggered job fail because of `CheckpointDeclineTaskNotReadyException`. But in `ExecutionGraph#failGlobal` method, it should check the main thread by calling `assertRunningInJobMasterMainThread` method. I found it can not jump out from this method. My guess is the trigger thread is the Timer in `CheckpointCoordinator`, not the main thread. So we may figure out a new way to fail the job. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-495081839 @StefanRRichter I happened to have an exception about Kafka's end-to-end testing: https://api.travis-ci.org/v3/job/536105563/log.txt This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-495066930 @klion26 Thanks for your review suggestion. What do you think about the new change? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-494781182 The essential difference between `failOnCheckpointingErrors` and `tolerableCheckpointFailureNumber` is that the way the two cause the job to fail is different. The former is triggered on the TaskManager side and the latter is triggered on the JobManager side. It is difficult to maintain compatibility at the API level. But I personally prefer to set the default value of `tolerableCheckpointFailureNumber` to Integer.MAX_VALUE. WDYT? @StefanRRichter This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-494394030 @StefanRRichter OK, if it is unrelated, we can ignore it. In addition, what do you think about my prior opinion? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-493837160 Triggered again, still has SQL deadlock : https://travis-ci.org/apache/flink/jobs/534620910 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-493826613 @StefanRRichter After thinking seriously, my view of point has changed. There are two options: * to keep all test cases and user jobs' default behavior, the `tolerableCheckpointFailureNumber`'s default value seems should be `Integer.MAX_VALUE`; * to keep the compatibility, the `tolerableCheckpointFailureNumber`'s default value seems should be 0; The key problem is the two options ' **range of action** is different. The `failOnCheckpointingErrors` option just cover the task's checkpointing error in the TaskManager end (even not the whole execution phase). While the `tolerableCheckpointFailureNumber` option needs to cover the whole trigger and execution phases. The thought to support option1: In many scenes, if we set the `tolerableCheckpointFailureNumber`'s default value to 0. The behavior of the users' job would be changed. It would cause more frequency to fail and restart. For example, the task is not ready to do checkpoint so it sends a decline message to trigger failure manager to fail and restart the job. So it changed the test cases and user jobs' default behavior. This is the reason why I change the default value to `Integer.MAX_VALUE`, although they are sporadic. The thought to support option2: If we set the `tolerableCheckpointFailureNumber`'s default value to `Integer.MAX_VALUE`, it may introduce a **compatibility issue** how to handle the `failOnCheckpointingErrors` config option in the future? The original thought is that the failOnCheckpointingErrors(two values : 0 and Integer.MAX_VALUE) is a subset of the tolerableCheckpointFailureNumber(0 to Integer.MAX_VALUE). We had wanted to deprecate the `failOnCheckpointingErrors` option in the third step. What do you think? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-493817814 retriggered once, result failure, reason: dead lock (sql) detail : https://travis-ci.org/apache/flink/jobs/534067083 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-493402537 For a clearer analysis, I plan to do three steps: * Rebase master branch * Remove all four settings about tolerance number * Squash unrelated commits We can analyze the log again. What do you think? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-493345060 @StefanRRichter So far, the test problems caused by this PR have been fixed, but Travis has been not very stable recently. During the multiple builds, other unrelated failures have been created and tracked as Jira issues. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-493080226 @StefanRRichter Have updated. What do you think about the new commit and the whole PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-492885118 Now, it seems Travis's failure is not related to this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-492636770 Have fixed conflicts. In addition, I have refactored `CheckpointCoordinator`'s constructor and `ExecutionGraph#enableCheckpointing` about the new introduced parameter `isPreferCheckpointForRecovery` because we have converged multiple parameters to the instance of the `CheckpointCoordinatorConfiguration`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-492620693 @StefanRRichter My another PR #8410 which just merged caused many conflicts. The reason is that they both changed checkpoint specific configuration. If you do not mind, I'd like to squash this PR to rebase it and fix the conflicts. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-492471326 Irrelevant test error, Have created an issue to report it, will trigger rebuild~ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-491497620 @StefanRRichter Sorry, I still do not provide a final implementation of this issue. It's a temporary version just for discussion. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-490365680 @StefanRRichter Really sorry for the late reply, I just took a holiday and attended QCon conference Beijing. I have given a more detailed implementation so that we could discuss more easily. The new commit cannot be passed by Travis, it is for discussion reason. When we agree about the result, I will complete this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [flink] yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure
yanghua commented on issue #8322: [FLINK-12364] Introduce a CheckpointFailureManager to centralized manage checkpoint failure URL: https://github.com/apache/flink/pull/8322#issuecomment-487947096 @StefanRRichter This is the second step's PR of the checkpoint failure process improvement. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services