[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814524#comment-17814524 ]

Matthias Pohl edited comment on FLINK-34007 at 2/5/24 10:10 PM:

* master
** [79cccd7103a304bfa07104dcafd1f65a032c88ce|https://github.com/apache/flink/commit/79cccd7103a304bfa07104dcafd1f65a032c88ce]
** [95417a4857ec87a349c0fa9f4d3951f7d3807844|https://github.com/apache/flink/commit/95417a4857ec87a349c0fa9f4d3951f7d3807844]
** [927972ff4ad6252fd933fcc627c7d95dbbdae431|https://github.com/apache/flink/commit/927972ff4ad6252fd933fcc627c7d95dbbdae431]
* 1.18: will be handled in FLINK-34333

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0, 1.18.1, 1.18.2
> Reporter: Zhenqiu Huang
> Assignee: Matthias Pohl
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.19.0
>
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
> The observation is that Job manager goes to suspend state with a failed container not able to register itself to resource manager after timeout.
> JM Log, see attached

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813150#comment-17813150 ]

Matthias Pohl edited comment on FLINK-34007 at 2/1/24 10:18 AM:

Hi [~yunta], we're not aware of any issues related to FLINK-34007 in Flink 1.17. The issue started to appear with the upgrade of the k8s client dependency to v6.6.2 in FLINK-31997 (which ended up in Flink 1.18). That being said, [~ZhenqiuHuang] reported similar errors in Flink 1.17 and 1.16 deployments as well, which we cannot explain. We were not able to investigate the cause due to missing logs. We agreed to cover any other problems in a separate Jira issue if [~ZhenqiuHuang] comes up with new information (see his comment above).
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810038#comment-17810038 ]

Matthias Pohl edited comment on FLINK-34007 at 1/24/24 8:37 AM:

Just checking whether a new instance of {{LeaderElector}} was created is tricky because {{LeaderElector}} is coupled with the k8s backend: we would either have to mock {{LeaderElector}} or have to wire up some backend for Flink's {{KubernetesLeaderElector}} to work properly. Anyway, I came up with a fairly straightforward ITCase. I added it to the PR, which is ready to be reviewed now.
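One mock-free way to observe instance creation is to route construction through a factory seam and count calls. This is a hypothetical sketch (not the ITCase that was actually added, and {{CountingFactory}} is not a Flink class); it only illustrates the idea of verifying re-instantiation without Mockito:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical test helper: counts how often the delegate factory is invoked.
class CountingFactory<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    final AtomicInteger created = new AtomicInteger();

    CountingFactory(Supplier<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public T get() {
        created.incrementAndGet();
        return delegate.get();
    }
}

public class FactoryCountDemo {
    public static void main(String[] args) {
        CountingFactory<Object> factory = new CountingFactory<>(Object::new);
        factory.get(); // stands in for the initial leader election
        factory.get(); // stands in for re-election after leadership loss
        // A test could now assert that a second instance was created.
        System.out.println("instances created: " + factory.created.get());
    }
}
```

The catch, as noted above, is that production code would have to accept such a factory in the first place, which is exactly the coupling problem with the k8s backend.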
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808872#comment-17808872 ]

Matthias Pohl edited comment on FLINK-34007 at 1/20/24 6:12 AM:

Increasing the thread count appeared to be necessary because the old Flink code executed the fabric8io {{LeaderElector#run}} command in [KubernetesLeaderElector#run|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L103] in the same thread pool that the fabric8io LeaderElector uses for running the CompletableFutures in its loop: the {{LeaderElector#run}} call waits for the CompletableFuture returned by {{LeaderElector#acquire}} to complete (see [LeaderElector:70|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L70]). {{LeaderElector#acquire}} triggers an asynchronous call on the {{executorService}}, which wouldn't pick up the task because the single thread is waiting for the acquire call to complete. This deadlock situation is reproduced by Flink's {{KubernetesLeaderElectionITCase}}. I would imagine that this is the timeout [~gyfora] was experiencing when upgrading to v6.6.2.

The 3 threads that were introduced by FLINK-31997 shouldn't have caused any issues because the LeaderElector only writes to the ConfigMap in [LeaderElector#tryAcquireOrRenew|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L210] and [LeaderElector#release|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L147]. Both methods are synchronized; hence, they shouldn't cause a race condition in any way as far as I can see.

I created a [draft PR with a fix|https://github.com/apache/flink/pull/24132] that recreates the fabric8io {{LeaderElector}} instead of reusing it. The fix also reverts the thread pool size from 3 back to 1, with {{KubernetesLeaderElectionITCase}} passing again. I still have to think of a way to test Flink's KubernetesLeaderElector on the issue that caused the failure of this Jira. The only way I could think of is doing something similar to what's done in the [fabric8io codebase|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/test/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElectorTest.java#L63] with Mockito. Any other ideas are appreciated; I would like to avoid Mockito.
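The single-thread deadlock described above can be reproduced with a minimal, self-contained sketch. The class and comments are illustrative (this is not Flink or fabric8io code): one runnable blocks the pool's only thread while waiting on a future whose task is queued on that same pool.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SingleThreadDeadlockDemo {
    public static void main(String[] args) throws Exception {
        // One thread, as in Flink before FLINK-31997 bumped the pool to 3.
        ExecutorService pool = Executors.newFixedThreadPool(1);

        // Stands in for Flink submitting LeaderElector#run to the shared pool.
        pool.submit(() -> {
            // Stands in for LeaderElector#acquire scheduling work on the same pool.
            CompletableFuture<Void> acquired = CompletableFuture.runAsync(() -> {}, pool);
            try {
                // LeaderElector#run then blocks on that future. The pool's only
                // thread is busy right here, so the queued task can never start.
                acquired.get(1, TimeUnit.SECONDS);
                System.out.println("no deadlock");
            } catch (TimeoutException e) {
                System.out.println("deadlocked: acquire task never got a thread");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Prints "deadlocked: acquire task never got a thread"; switching to {{newFixedThreadPool(2)}} makes the queued task runnable and prints "no deadlock" instead, which is why adding threads originally masked the problem.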
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808237#comment-17808237 ]

Matthias Pohl edited comment on FLINK-34007 at 1/18/24 2:30 PM:

I checked the implementation (since we're getting close to the 1.19 feature freeze). We have the following options:
# We could downgrade the fabric8io kubernetes client dependency back to v5.12.4 (essentially reverting FLINK-31997).
# We could fix the issue in the fabric8io kubernetes client and update the dependency as soon as the fix is released. Here, I'm not that confident that we would be able to bring the fix into Flink before 1.19 is released, because we're relying on the release of another project.
# We could refactor the k8s implementation to allow the restart of the KubernetesLeaderElector within the KubernetesLeaderElectionDriver. That would require updating the lockIdentity as well. The problem is that the lockIdentity is actually not owned by the KubernetesLeaderElector but by the k8s HighAvailabilityServices (because it's also used by the KubernetesStateHandleStore when checking the leadership). Even though moving the lockIdentity into the KubernetesLeaderElector makes sense (the KubernetesStateHandleStore should instead rely on the LeaderElectionService to detect whether leadership is acquired), it is a larger effort, and I am hesitant to work on it that close to the 1.19 feature freeze.

I feel like we should apply all three options in the above order: option #1 would end up in 1.19.0 and 1.18.3, with option #2 being the follow-up. Option #3 could be considered as a dedicated refactoring effort in 1.20 or later. What's your view on that?

My proposal only covers the issue we identified in k8s client v6.6.2. I ignored the fact that there are issues observed in deployments with Flink 1.17-. Any investigation around Flink 1.17- leadership issues should be moved into a separate Jira issue.
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807229#comment-17807229 ]

Matthias Pohl edited comment on FLINK-34007 at 1/16/24 2:04 PM:

The timeout of the renew operation leads to a {{stopLeading}} call in [LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119], which results in the LeaderElector being stopped (the {{stopped}} flag is set to true in [LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119] and is never reset). Any later call of {{tryAcquireOrRenew}} will not perform any action because of the {{stopped}} flag being true (see [LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]). Flink doesn't re-instantiate the {{LeaderElector}} but calls {{LeaderElector#run}} on the same (now stopped) instance in the [KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214] callback. That explains the behavior in the Flink 1.18 deployments, if I didn't miss anything. It sounds like a bug in the fabric8io k8s client implementation, which we should be able to work around by recreating the LeaderElector on leadership loss. I replied to [fabric8io:kubernetes-client#5635|https://github.com/fabric8io/kubernetes-client/issues/5635] because the bug report provides the same stacktrace.

That also explains why restarting the JobManager fixes the issue: it results in a new (not-stopped) LeaderElector being instantiated. I'm still puzzled about the issues in other Flink versions, though.
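The one-way {{stopped}} flag and the recreate-on-loss workaround can be condensed into a toy model. {{Elector}} below is an illustrative stand-in, not the fabric8io {{LeaderElector}} API; it only mirrors the relevant guard and shows why a fresh instance per leadership session avoids the no-op behavior:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Toy model of the failure mode: once stopped, the instance is permanently inert.
class Elector {
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    boolean tryAcquireOrRenew() {
        if (stopped.get()) {
            return false; // mirrors the stopped-flag guard in tryAcquireOrRenew
        }
        return true;
    }

    void stopLeading() {
        stopped.set(true); // set on renew timeout, never reset
    }
}

public class RecreateOnLeadershipLoss {
    public static void main(String[] args) {
        // Buggy pattern: reusing the same instance after leadership loss.
        Elector reused = new Elector();
        reused.stopLeading();
        System.out.println("reused instance re-acquires: " + reused.tryAcquireOrRenew());

        // Workaround: build a fresh elector for every leadership session.
        Supplier<Elector> electorFactory = Elector::new;
        Elector fresh = electorFactory.get();
        System.out.println("fresh instance re-acquires: " + fresh.tryAcquireOrRenew());
    }
}
```

The reused instance prints {{false}} (it silently declines to re-acquire, matching the stuck-in-suspend symptom), while the freshly created one prints {{true}}, which is the same effect a JobManager restart has.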
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807229#comment-17807229 ] Matthias Pohl edited comment on FLINK-34007 at 1/16/24 2:02 PM: The timeout of the renew operation leads to a {{stopLeading}} call in [LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119] which results in LeaderElector being stopped (the {{stopped}} flag is set to true in [LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119] and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not perform any action because of the {{stopped}} flag being true (see [LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]). Flink doesn't re-instantiate the {{LeaderElector}} but calls {{LeaderElector#run}} on the same (now stopped) instance in the [KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214] callback. That explains the behavior in the Flink 1.18 deployments, if I didn't miss anything. It sounds like a bug in the fabric8io k8s client implementation which we should be able to workaround by recreating a new LeaderElector on leadership loss. I replied to [fabric8io:kubernetes-client#5635|https://github.com/fabric8io/kubernetes-client/issues/5635] because the bug report provides the same stacktrace. 
That also explains why restarting the JobManager fixes the issue. Because that results in a new LeaderElector being instantiated. I'm still puzzled about the issues in other Flink versions, though. was (Author: mapohl): The timeout of the renew operation leads to a {{stopLeading}} call in [LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119] which results in LeaderElector being stopped (the {{stopped}} flag is set to true in [LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119] and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not perform any action because of the {{stopped}} flag being true (see [LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]). Flink doesn't re-instantiate the {{LeaderElector}} but calls {{LeaderElector#run}} on the same (now stopped) instance in the [KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214] callback. That explains the behavior in the Flink 1.18 deployments, if I didn't miss anything. It sounds like a bug in the fabric8io k8s client implementation which we should be able to workaround by recreating a new LeaderElector on leadership loss. That also explains why restarting the JobManager fixes the issue. Because that results in a new LeaderElector being instantiated. 
I'm still puzzled about the issues in other Flink versions, though. > Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Priority: Major > Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807229#comment-17807229 ] Matthias Pohl edited comment on FLINK-34007 at 1/16/24 1:52 PM: The timeout of the renew operation leads to a {{stopLeading}} call in [LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119] which results in LeaderElector being stopped (the {{stopped}} flag is set to true in [LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119] and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not perform any action because of the {{stopped}} flag being true (see [LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]). Flink doesn't re-instantiate the {{LeaderElector}} but calls {{LeaderElector#run}} on the same (now stopped) instance in the [KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214] callback. That explains the behavior in the Flink 1.18 deployments, if I didn't miss anything. It sounds like a bug in the fabric8io k8s client implementation which we should be able to workaround by recreating a new LeaderElector on leadership loss. That also explains why restarting the JobManager fixes the issue. Because that results in a new LeaderElector being instantiated. 
I'm still puzzled about the issues in other Flink versions, though. was (Author: mapohl): The timeout of the renew operation leads to a {{stopLeading}} call in [LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119] which results in LeaderElector being stopped (the {{stopped}} flag is set to true in [LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119] and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not perform any action because of the {{stopped}} flag being true (see [LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]). Flink doesn't re-instantiate the {{LeaderElector}} but calls {{LeaderElector#run}} on the same (now stopped) instance in the [KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214] callback. That explains the behavior in the Flink 1.18 deployments, if I didn't miss anything. It sounds like a bug in the fabric8io k8s client implementation which we should be able to workaround by recreating a new LeaderElector on leadership loss. That also explains why restarting the JobManager fixes the issue. Because that results in a new LeaderElector being instantiated. I'm still puzzled about the issues in other Flink versions, though. 
> Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Priority: Major > Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807017#comment-17807017 ] Zhenqiu Huang edited comment on FLINK-34007 at 1/16/24 3:40 AM: [~mapohl] [~wangyang0918] I am intensively testing Flink 1.18. Within two days, users reported the job manager stuck issue in 1.17 and 1.16 as well. The 1.18 and 1.17 job instances are running in the same cluster; 1.16 is in a different cluster. I attached another LeaderElector-Debug.json file that contains the debug log of a Flink 1.18 app. The issue happened several times: 1. the ConfigMap was not accessible from the API server, so the renew timeout was exceeded; 2. a patch failed on an updated ConfigMap. The interesting part of the behavior over the last several days is that the job manager was not stuck but exited directly. Then a new job manager pod started correctly, which is why a new leader is elected in the log above. Hopefully this is useful for your diagnosis. [~wangyang0918] From my initial observation (before creating the Jira issue), the leader annotation update stopped when the job manager was stuck.
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806759#comment-17806759 ] Matthias Pohl edited comment on FLINK-34007 at 1/15/24 11:57 AM: But it should be an issue that is present in different k8s client versions (not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|
At least based on the reports of this Jira issue, there must have been an incident (which caused the lease to not be renewed) in a k8s cluster that triggered the same failure in multiple Flink clusters (with versions 1.18, 1.17 and 1.16 at least) ...if I understand it correctly. Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2]. But the change you're referring to ([PR #4152|https://github.com/fabric8io/kubernetes-client/pull/4125]) seems to be the only bigger change in {{LeaderElector}}, indeed.
— On another note: I remembered that there is a slight difference in the revocation protocol in the [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box] changes:
* The old implementation (see [1.15 DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238]) did try to clear the leader information from the ConfigMap.
* The new implementation (see [1.18+ DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484]) doesn't clear the component leader information anymore.
Here, the reasoning was that the data couldn't be updated anymore because the leadership is already lost. That change still seems reasonable based on my findings: In the k8s client 6.6.2 codebase, {{stopLeading}} is either called after noticing the change in the lock identity ([LeaderElector:L238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238]; the lock identity change would prevent the clearing of the data) or when the lease wasn't renewed ([LeaderElector:95|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L95]), where we would have to assume that other leader information has already been written. And this change shouldn't be related to the issues with the lock lifecycle in general because it only affects metadata and not the lock annotation itself, should it? WDYT?
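To make the revocation-protocol difference concrete, here is a toy sketch contrasting the two behaviors. The map, keys, and method names are hypothetical illustrations, not the actual DefaultLeaderElectionService code: the pre-FLIP-285 path clears the leader info from the ConfigMap on revocation, while the newer path leaves it in place.

```java
// Toy contrast of the two revocation behaviors; names and keys are illustrative.
import java.util.HashMap;
import java.util.Map;

class RevocationDemo {
    // Stand-in for the HA ConfigMap holding the component leader info.
    static Map<String, String> configMap = new HashMap<>();

    /** Pre-FLIP-285 behavior (1.15): clear the leader info on revocation. */
    static void revokeAndClear() {
        configMap.remove("address");
    }

    /** Post-FLIP-285 behavior (1.18+): leave the component leader info alone,
     *  since leadership is already lost and the write couldn't be trusted. */
    static void revokeWithoutClearing() {
        // intentionally no ConfigMap update
    }

    public static void main(String[] args) {
        configMap.put("address", "akka.tcp://flink@jm-1:6123");

        revokeWithoutClearing();
        System.out.println(configMap.containsKey("address")); // true: info stays

        revokeAndClear();
        System.out.println(configMap.containsKey("address")); // false: old path wiped it
    }
}
```

As argued above, this difference only touches metadata left behind in the ConfigMap, not the lock annotation that the LeaderElector itself manages.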
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806759#comment-17806759 ] Matthias Pohl edited comment on FLINK-34007 at 1/15/24 11:41 AM: - But it should be an issue that is available in different k8s client version (not only 6.6.2): ||Flink||k8s client||Jira issue|| |1.18|6.6.2|FLINK-31997| |1.17|5.12.4|FLINK-30231| |1.16|5.12.3|FLINK-28481| |1.14-1.15|5.5.0|FLINK-22802| At least based on the reports of this Jira issue, there must have been an incident (which caused the lease to not be renewed) in a k8s cluster that triggered the same failure in multiple Flink clusters (with versions of 1.18, 1.17 and 1.16 at least) that triggered the same issue in all of those deployments. ...if I understand it correctly. Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2]. — On another note: I remembered that there is a slight difference in the revocation protocol in the [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box] changes: * The old implementation (see [1.15 DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238]) did try to clear the leader information from the ConfigMap. * The new implementation (see [1.18+ DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484]) doesn't clear the component leader information, anymore. Here, the reasoning was that the data wouldn't be able to be updated, anymore, because the leadership is already lost. 
But that change still seems to be reasonable based on my findings: In the k8s client 6.6.2 codebase, {{stopLeading}} is either called after noticing the change in the lock identity ([LeaderElector:L238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238]; the lock identity change would prevent the clearing of the data) or when the lease wasn't renewed ([LeaderElector:95|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L95]) where we would have to assume that other leader information is already written. And this change shouldn't be related to the issues with the lock lifecycle in general because it only affects metadata and not the lock annotation itself, should it? WDYT? was (Author: mapohl): But it should be an issue that is available in different k8s client version (not only 6.6.2): ||Flink||k8s client||Jira issue|| |1.18|6.6.2|FLINK-31997| |1.17|5.12.4|FLINK-30231| |1.16|5.12.3|FLINK-28481| |1.14-1.15|5.5.0|FLINK-22802| At least based on the reports of this Jira issue, there must have been an incident in a k8s cluster that triggered the same failure in multiple Flink clusters (with versions of 1.18, 1.17 and 1.16 at least) that triggered the same issue in all of those deployments. ...if I understand it correctly. Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2]. 
— On another note: I remembered that there is a slight difference in the revocation protocol in the [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box] changes: * The old implementation (see [1.15 DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238]) did try to clear the leader information from the ConfigMap. * The new implementation (see [1.18+ DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484]) doesn't clear the component leader information, anymore. Here, the reasoning was that the data wouldn't be able to be updated, anymore, because the leadership is already lost. But that change still seems to be reasonable based on my findings: In the k8s client 6.6.2 codebase, {{stopLeading}} is either called after noticing the change in the lock identity
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806759#comment-17806759 ] Matthias Pohl edited comment on FLINK-34007 at 1/15/24 11:38 AM: - But it should be an issue that is available in different k8s client version (not only 6.6.2): ||Flink||k8s client||Jira issue|| |1.18|6.6.2|FLINK-31997| |1.17|5.12.4|FLINK-30231| |1.16|5.12.3|FLINK-28481| |1.14-1.15|5.5.0|FLINK-22802| At least based on the reports of this Jira issue, there must have been an incident in a k8s cluster that triggered the same failure in multiple Flink clusters (with versions of 1.18, 1.17 and 1.16 at least) that triggered the same issue in all of those deployments. ...if I understand it correctly. Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2]. — On another note: I remembered that there is a slight difference in the revocation protocol in the [FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box] changes: * The old implementation (see [1.15 DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238]) did try to clear the leader information from the ConfigMap. * The new implementation (see [1.18+ DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484]) doesn't clear the component leader information, anymore. Here, the reasoning was that the data wouldn't be able to be updated, anymore, because the leadership is already lost. 
But that change still seems reasonable based on my findings: In the k8s client 6.6.2 codebase, {{stopLeading}} is called either after noticing a change in the lock identity ([LeaderElector:238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238]; the changed lock identity would prevent the clearing of the data) or when the lease wasn't renewed ([LeaderElector:95|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L95]), in which case we would have to assume that another leader's information has already been written. And this change shouldn't be related to the issues with the lock lifecycle in general, because it only affects metadata and not the lock annotation itself, should it? WDYT?
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806699#comment-17806699 ] Matthias Pohl edited comment on FLINK-34007 at 1/15/24 9:17 AM: Sorry, I misunderstood you initially. If the annotation is the problem, I would assume the issue lies somewhere within the k8s client library. The annotation was never touched by the Flink code (there is only one read access, in [KubernetesLeaderElector#hasLeadership|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L115], to verify the leadership) ...even before the changes of FLINK-24038. The lifecycle of the annotation is handled within the fabric8.io client's {{LeaderElector}}. I'm going to do some more investigation of that code segment. But having more debug k8s/kubernetes-client logs would help, I guess.
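The read-only check mentioned above can be sketched roughly as follows. The annotation key matches the one the fabric8 client uses for its ConfigMap lock, but the record handling is heavily simplified (the real code deserializes a {{LeaderElectionRecord}} rather than matching strings), and the class and method names are illustrative.

```java
import java.util.Map;

// Hedged sketch of a read-only leadership check
// (cf. KubernetesLeaderElector#hasLeadership in the Flink codebase).
public class HasLeadershipSketch {

    static final String LEADER_ANNOTATION = "control-plane.alpha.kubernetes.io/leader";

    // Returns true iff the lock annotation names this elector as the holder.
    // Flink never writes this annotation itself; the fabric8 LeaderElector
    // owns its entire lifecycle.
    static boolean hasLeadership(Map<String, String> annotations, String lockIdentity) {
        String record = annotations.get(LEADER_ANNOTATION);
        return record != null
                && record.contains("\"holderIdentity\":\"" + lockIdentity + "\"");
    }

    public static void main(String[] args) {
        Map<String, String> annotations = Map.of(
                LEADER_ANNOTATION,
                "{\"holderIdentity\":\"jm-pod-a\",\"leaseDuration\":15}");
        System.out.println(hasLeadership(annotations, "jm-pod-a")); // true
        System.out.println(hasLeadership(annotations, "jm-pod-b")); // false
    }
}
```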
> Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Priority: Major > Attachments: Debug.log, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805960#comment-17805960 ] Matthias Pohl edited comment on FLINK-34007 at 1/12/24 10:19 AM: - I couldn't find anything in the Flink code related to ConfigMap cleanup being triggered during leadership loss (Flink's [KubernetesLeaderElector:82|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L82] sets up the [callback for the leadership loss|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L124], which is implemented by [KubernetesLeaderElectionDriver#LeaderCallbackHandlerImpl|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L212]). This behavior is on par with the old (i.e. pre-FLINK-24038) version of the Flink 1.15 codebase (see the 1.15 class [KubernetesLeaderElectionDriver:202|https://github.com/apache/flink/blob/bba7c417217be878fffb12efedeac50dec2a7459/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L202]). That's also not something I would expect: it should be handled by the LeaderElector instead, because the LeaderElector knows the state of the leader election and can trigger a cleanup before the new leader information is written to the ConfigMap entry. Flink shouldn't trigger a cleanup because it doesn't know whether a new leader was already elected (in which case cleaning up the ConfigMap entry would mean losing the leadership information of the new leader).
And the Flink process wouldn't be able to clean it up anyway, because the process isn't the leader anymore. Or am I missing something here? On another note: I came across [this change|https://github.com/apache/flink/pull/22540/files#diff-0e859df42954459619211d2ec60957742b24c9fc6ce55526616fddc540f0f8ffL59-R60] in the FLINK-31997 PR (k8s client update to 6.6.2): we're changing the thread pool size from 1 to 3, essentially allowing the same internal LeaderElector to be executed multiple times (because we trigger another {{KubernetesLeaderElector#run}} call when the leadership is revoked). The old version of the code uses a single thread, which means the run call would get queued until the previous {{LeaderElector#run}} failed for whatever reason. That change sounds strange, but it shouldn't be the cause of this Jira issue because the change only went into 1.18 and we're also experiencing this in older versions of Flink.
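The executor-size observation above can be illustrated with a small, self-contained sketch. The blocking tasks below are hypothetical stand-ins for {{LeaderElector#run}} (this is not Flink's actual code): with a pool of size 1, a re-triggered run queues behind the still-blocked previous one, while a pool of size 3 lets both be live at once.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ElectorPoolSketch {

    // Submits two blocking "run()" tasks and reports the maximum number
    // observed in flight at the same time.
    static int maxConcurrentRuns(int poolSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        CountDownLatch bothStarted = new CountDownLatch(2);
        Runnable run = () -> {
            int now = inFlight.incrementAndGet();
            maxSeen.accumulateAndGet(now, Math::max);
            bothStarted.countDown();
            try {
                // a real LeaderElector#run blocks here in its renew loop;
                // with a single thread the latch never opens and times out
                bothStarted.await(1, TimeUnit.SECONDS);
            } catch (InterruptedException ignored) {
            }
            inFlight.decrementAndGet();
        };
        pool.submit(run); // initial leader election
        pool.submit(run); // re-triggered after leadership revocation
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return maxSeen.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(maxConcurrentRuns(1)); // 1: second run() waits in the queue
        System.out.println(maxConcurrentRuns(3)); // 2: both run() calls are live
    }
}
```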
[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode
[ https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805411#comment-17805411 ] Matthias Pohl edited comment on FLINK-34007 at 1/11/24 7:36 AM: I still don't fully understand the error you shared: Shouldn't the KubernetesClientException resolve itself because the logic runs in a loop? Is the stacktrace you shared a one-time thing, or does it reoccur (which would confirm the execution in the loop and indicate that the ConfigMap is in some odd state)? Another thing I'm wondering is why the ConfigMap was concurrently updated (which caused the KubernetesClientException, as far as I understand) when there's only one JM running. Are there other processes accessing the ConfigMap?
{quote} [...] flink will not able to restart services (such RM and dispatcher) as DefaultLeaderRetrievalService is stopped also [...] {quote}
The DefaultLeaderRetrievalService is not in charge of restarting any services. The LeaderElectionService will trigger the restart of any shut-down services (in our case the SessionDispatcherLeaderProcess, which would be started by the DefaultDispatcherRunner; the latter maintains the Dispatcher's leader election) as soon as the JobManager gets the leadership again.
> Flink Job stuck in suspend state after losing leadership in HA Mode > --- > > Key: FLINK-34007 > URL: https://issues.apache.org/jira/browse/FLINK-34007 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination >Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2 >Reporter: Zhenqiu Huang >Priority: Major > Attachments: Debug.log, job-manager.log > > > The observation is that Job manager goes to suspend state with a failed > container not able to register itself to resource manager after timeout. > JM Log, see attached > -- This message was sent by Atlassian Jira (v8.20.10#820010)