[GitHub] [flink] zhuzhurk commented on a diff in pull request #21970: [FLINK-31041][runtime] Fix multiple restoreState when GlobalFailure occurs in a short period.

2023-02-21 Thread via GitHub


zhuzhurk commented on code in PR #21970:
URL: https://github.com/apache/flink/pull/21970#discussion_r1113903871


##
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java:
##
@@ -377,6 +377,10 @@ private void restartTasks(
         final Set verticesToRestart =
                 executionVertexVersioner.getUnmodifiedExecutionVertices(executionVertexVersions);
 
+        if (verticesToRestart.isEmpty()) {
+            return;

Review Comment:
   Yes, you are right. If a global failover is in progress, no regional failover 
will be triggered.
   Thanks for the explanation!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [flink] zhuzhurk commented on a diff in pull request #21970: [FLINK-31041][runtime] Fix multiple restoreState when GlobalFailure occurs in a short period.

2023-02-21 Thread via GitHub


zhuzhurk commented on code in PR #21970:
URL: https://github.com/apache/flink/pull/21970#discussion_r1112513443


##
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java:
##
@@ -377,6 +377,10 @@ private void restartTasks(
         final Set verticesToRestart =
                 executionVertexVersioner.getUnmodifiedExecutionVertices(executionVertexVersions);
 
+        if (verticesToRestart.isEmpty()) {
+            return;

Review Comment:
   I'm not entirely sure, but I am a bit concerned that Flink may take some 
important actions in 
`CheckpointCoordinator#restoreLatestCheckpointedStateInternal()` even if 
`verticesToRestart` is empty, e.g. invoking 
`OperatorCoordinator#resetToCheckpoint(...)`. These actions were always taken 
previously, but may be skipped after this change (when a global failover and a 
regional failover happen concurrently).
   
   I haven't had the chance to examine it thoroughly yet. I'd appreciate it if you 
could also help examine this.






[GitHub] [flink] zhuzhurk commented on a diff in pull request #21970: [FLINK-31041][runtime] Fix multiple restoreState when GlobalFailure occurs in a short period.

2023-02-21 Thread via GitHub


zhuzhurk commented on code in PR #21970:
URL: https://github.com/apache/flink/pull/21970#discussion_r1113780430


##
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java:
##
@@ -377,6 +377,10 @@ private void restartTasks(
         final Set verticesToRestart =
                 executionVertexVersioner.getUnmodifiedExecutionVertices(executionVertexVersions);
 
+        if (verticesToRestart.isEmpty()) {
+            return;

Review Comment:
   A global failover can be superseded by a regional failover with regard to the 
tasks to restart.
   Here's an example: consider a job that consists of only one pipelined region. A 
global failure happens first (caused by the OperatorCoordinator) and needs to 
restart all the tasks. It also needs 
`OperatorCoordinatorHolder#resetToCheckpoint()` to be invoked to recover from 
an inconsistent state. However, a task failure happens later, at almost the same 
time, and it also needs to restart all the tasks. Therefore, `verticesToRestart` 
would be empty when `restartTasks(...)` is invoked for the global failure, and 
`OperatorCoordinatorHolder#resetToCheckpoint()` would not be invoked.
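   
   The interleaving described in this comment can be sketched in isolation. 
This is a minimal illustration under stated assumptions, not Flink's actual 
code: `SimpleVersioner`, `recordModifications`, `getUnmodified` and 
`coordinatorResetRuns` are hypothetical names that only mimic the idea of 
version-based filtering, and vertices are plain strings.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SupersededFailoverSketch {

    /** Hypothetical, simplified stand-in for a vertex versioner (not Flink's class). */
    static class SimpleVersioner {
        private final Map<String, Long> currentVersions = new HashMap<>();

        /** Bumps the version of each vertex and returns the versions just assigned. */
        Map<String, Long> recordModifications(List<String> vertices) {
            Map<String, Long> recorded = new HashMap<>();
            for (String vertex : vertices) {
                recorded.put(vertex, currentVersions.merge(vertex, 1L, Long::sum));
            }
            return recorded;
        }

        /** Returns the vertices whose version is still the one that was recorded. */
        Set<String> getUnmodified(Map<String, Long> recorded) {
            Set<String> unmodified = new HashSet<>();
            for (Map.Entry<String, Long> entry : recorded.entrySet()) {
                if (entry.getValue().equals(currentVersions.get(entry.getKey()))) {
                    unmodified.add(entry.getKey());
                }
            }
            return unmodified;
        }
    }

    /**
     * Replays the scenario: a global failure, then a task failure that covers the
     * same (single) region. Returns whether the assumed coordinator reset step
     * would still be reached when the restart for the global failure runs.
     */
    public static boolean coordinatorResetRuns(boolean earlyReturnOnEmpty) {
        SimpleVersioner versioner = new SimpleVersioner();
        List<String> allTasks = List.of("task-1", "task-2");

        // 1. Global failure: every vertex gets a new version.
        Map<String, Long> globalFailureVersions = versioner.recordModifications(allTasks);

        // 2. Task failure shortly after; with one pipelined region it also
        //    covers every vertex, so all versions are bumped again.
        versioner.recordModifications(allTasks);

        // 3. The delayed restart for the global failure now sees no unmodified vertices.
        Set<String> verticesToRestart = versioner.getUnmodified(globalFailureVersions);
        if (earlyReturnOnEmpty && verticesToRestart.isEmpty()) {
            return false; // the early return skips everything after it
        }
        return true; // the restore / coordinator reset step would be reached
    }

    public static void main(String[] args) {
        System.out.println("reset reached without early return: " + coordinatorResetRuns(false));
        System.out.println("reset reached with early return:    " + coordinatorResetRuns(true));
    }
}
```

   In this sketch the second call models the change under review: once the 
task failure has bumped every version, the restart for the global failure 
finds nothing to restart and returns before the reset step.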






[GitHub] [flink] zhuzhurk commented on a diff in pull request #21970: [FLINK-31041][runtime] Fix multiple restoreState when GlobalFailure occurs in a short period.

2023-02-20 Thread via GitHub


zhuzhurk commented on code in PR #21970:
URL: https://github.com/apache/flink/pull/21970#discussion_r1112513443


##
flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/DefaultScheduler.java:
##
@@ -377,6 +377,10 @@ private void restartTasks(
         final Set verticesToRestart =
                 executionVertexVersioner.getUnmodifiedExecutionVertices(executionVertexVersions);
 
+        if (verticesToRestart.isEmpty()) {
+            return;

Review Comment:
   I'm not entirely sure, but I am a bit concerned that Flink may take some 
important actions in `CheckpointCoordinator#restoreLatestCheckpointedStateInternal()` 
even if `verticesToRestart` is empty, e.g. invoking 
`OperatorCoordinator#resetToCheckpoint(...)`. These actions were always taken 
previously, but may be skipped after this change (when a global failover and a 
regional failover happen concurrently).
   
   I haven't had the chance to examine it thoroughly yet. I'd appreciate it if you 
could also help examine this.


