[GitHub] [flink] zhuzhurk commented on a diff in pull request #21970: [FLINK-31041][runtime] Fix multiple restoreState when GlobalFailure occurs in a short period.
zhuzhurk commented on code in PR #21970:
URL: https://github.com/apache/flink/pull/21970#discussion_r1113903871

## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java: ##
@@ -377,6 +377,10 @@ private void restartTasks(
         final Set verticesToRestart =
                 executionVertexVersioner.getUnmodifiedExecutionVertices(executionVertexVersions);
+        if (verticesToRestart.isEmpty()) {
+            return;

Review Comment:
Yes, you are right. If a global failover is in progress, no regional failover will be triggered. Thanks for the explanation!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [flink] zhuzhurk commented on a diff in pull request #21970: [FLINK-31041][runtime] Fix multiple restoreState when GlobalFailure occurs in a short period.
zhuzhurk commented on code in PR #21970:
URL: https://github.com/apache/flink/pull/21970#discussion_r1112513443

## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java: ##
@@ -377,6 +377,10 @@ private void restartTasks(
         final Set verticesToRestart =
                 executionVertexVersioner.getUnmodifiedExecutionVertices(executionVertexVersions);
+        if (verticesToRestart.isEmpty()) {
+            return;

Review Comment:
I'm not entirely sure, but I'm a bit concerned that Flink may take some important actions in `CheckpointCoordinator#restoreLatestCheckpointedStateInternal()` even if `verticesToRestart` is empty, e.g. invoking `OperatorCoordinator#resetToCheckpoint(...)`. These actions were always taken previously, but could be skipped after this change (when a global failover and a regional failover happen concurrently). I haven't had the chance to examine it thoroughly yet. It would be appreciated if you could also help examine this.
[GitHub] [flink] zhuzhurk commented on a diff in pull request #21970: [FLINK-31041][runtime] Fix multiple restoreState when GlobalFailure occurs in a short period.
zhuzhurk commented on code in PR #21970:
URL: https://github.com/apache/flink/pull/21970#discussion_r1113780430

## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java: ##
@@ -377,6 +377,10 @@ private void restartTasks(
         final Set verticesToRestart =
                 executionVertexVersioner.getUnmodifiedExecutionVertices(executionVertexVersions);
+        if (verticesToRestart.isEmpty()) {
+            return;

Review Comment:
A global failover can be superseded by a regional failover with regard to the tasks to restart. Here's an example: a job consists of only one pipelined region. A global failure happens first (caused by the OperatorCoordinator) and needs to restart all the tasks. It also needs `OperatorCoordinatorHolder#resetToCheckpoint()` to be invoked to recover from an inconsistent state. However, a task failure happens later, at almost the same time, and it also needs to restart all the tasks. Therefore, `verticesToRestart` would be empty when `restartTasks(...)` is invoked for the global failure, and `OperatorCoordinatorHolder#resetToCheckpoint()` will not be invoked.
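To make the scenario above easier to follow, here is a minimal, self-contained sketch of version-based restart filtering. It is not Flink's implementation: `FailoverVersionSketch`, `Versioner`, `recordModification`, `getUnmodified`, and `simulate` are hypothetical names, vertices are plain strings, and the model does not distinguish global from regional restore. It only shows which `restartTasks(...)` invocation would get past the proposed early return, assuming each failure handling bumps the version of every vertex it wants to restart.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Simplified, hypothetical model of version-based restart filtering; not Flink's classes. */
public class FailoverVersionSketch {

    /** Hypothetical stand-in for a vertex versioner: one version counter per vertex. */
    static final class Versioner {
        private final Map<String, Long> current = new HashMap<>();

        /** Bumps the version of each vertex and returns a snapshot of the new versions. */
        Map<String, Long> recordModification(Set<String> vertices) {
            Map<String, Long> snapshot = new HashMap<>();
            for (String vertex : vertices) {
                snapshot.put(vertex, current.merge(vertex, 1L, Long::sum));
            }
            return snapshot;
        }

        /** Returns the vertices whose current version still matches the snapshot. */
        Set<String> getUnmodified(Map<String, Long> snapshot) {
            Set<String> unmodified = new HashSet<>();
            for (Map.Entry<String, Long> entry : snapshot.entrySet()) {
                if (entry.getValue().equals(current.get(entry.getKey()))) {
                    unmodified.add(entry.getKey());
                }
            }
            return unmodified;
        }
    }

    /** Models restartTasks(...) with the early return; true means the restore step is reached. */
    static boolean restartTasks(Versioner versioner, Map<String, Long> snapshot) {
        Set<String> verticesToRestart = versioner.getUnmodified(snapshot);
        if (verticesToRestart.isEmpty()) {
            return false; // early return: anything done during state restore is skipped
        }
        return true;
    }

    /** Runs the scenario and returns {global invocation restored, task-failure invocation restored}. */
    static boolean[] simulate() {
        Versioner versioner = new Versioner();
        // Assumption: a single pipelined region, so both failures target all tasks.
        Set<String> allTasks = new HashSet<>(Arrays.asList("source-0", "sink-0"));

        // The global failure is handled first; the task failure follows before either restart fires.
        Map<String, Long> globalSnapshot = versioner.recordModification(allTasks);
        Map<String, Long> taskSnapshot = versioner.recordModification(allTasks);

        boolean globalRestored = restartTasks(versioner, globalSnapshot);
        boolean taskRestored = restartTasks(versioner, taskSnapshot);
        return new boolean[] {globalRestored, taskRestored};
    }

    public static void main(String[] args) {
        boolean[] result = simulate();
        System.out.println("global failure reached restore: " + result[0]);
        System.out.println("task failure reached restore: " + result[1]);
    }
}
```

In this model the invocation for the global failure sees an empty set and returns early, while the invocation for the later task failure proceeds. Whether the real scheduler can reach this interleaving depends on how it orders concurrent global and regional failovers, which the sketch does not model.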
[GitHub] [flink] zhuzhurk commented on a diff in pull request #21970: [FLINK-31041][runtime] Fix multiple restoreState when GlobalFailure occurs in a short period.
zhuzhurk commented on code in PR #21970:
URL: https://github.com/apache/flink/pull/21970#discussion_r1112513443

## flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java: ##
@@ -377,6 +377,10 @@ private void restartTasks(
         final Set verticesToRestart =
                 executionVertexVersioner.getUnmodifiedExecutionVertices(executionVertexVersions);
+        if (verticesToRestart.isEmpty()) {
+            return;

Review Comment:
I'm not entirely sure, but I'm a bit concerned that Flink may take some important actions in `CheckpointCoordinator#restoreLatestCheckpointedStateInternal()` even if `verticesToRestart` is empty, e.g. invoking `OperatorCoordinator#resetToCheckpoint(...)`. These actions were always taken previously, but could be skipped after this change (when a global failover and a regional failover happen concurrently). I haven't had the chance to examine it thoroughly yet. It would be appreciated if you could also help examine this.