[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-02-05 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17814524#comment-17814524
 ] 

Matthias Pohl edited comment on FLINK-34007 at 2/5/24 10:10 PM:


* master
** 
[79cccd7103a304bfa07104dcafd1f65a032c88ce|https://github.com/apache/flink/commit/79cccd7103a304bfa07104dcafd1f65a032c88ce]
** 
[95417a4857ec87a349c0fa9f4d3951f7d3807844|https://github.com/apache/flink/commit/95417a4857ec87a349c0fa9f4d3951f7d3807844]
** 
[927972ff4ad6252fd933fcc627c7d95dbbdae431|https://github.com/apache/flink/commit/927972ff4ad6252fd933fcc627c7d95dbbdae431]
* 1.18: will be handled in FLINK-34333


was (Author: mapohl):
* master
** 
[79cccd7103a304bfa07104dcafd1f65a032c88ce|https://github.com/apache/flink/commit/79cccd7103a304bfa07104dcafd1f65a032c88ce]
** 
[95417a4857ec87a349c0fa9f4d3951f7d3807844|https://github.com/apache/flink/commit/95417a4857ec87a349c0fa9f4d3951f7d3807844]
** 
[927972ff4ad6252fd933fcc627c7d95dbbdae431|https://github.com/apache/flink/commit/927972ff4ad6252fd933fcc627c7d95dbbdae431]

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.19.0
>
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-02-01 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813150#comment-17813150
 ] 

Matthias Pohl edited comment on FLINK-34007 at 2/1/24 10:18 AM:


Hi [~yunta], we're not aware of any issues related to FLINK-34007 in Flink 1.17. 
The issue started to appear with the upgrade of the k8s client dependency to 
v6.6.2 in FLINK-31997 (which ended up in Flink 1.18).

That being said, [~ZhenqiuHuang] reported similar errors in Flink 1.17 and 1.16 
deployments as well, which we cannot explain. We were not able to investigate 
the cause due to missing logs. We agreed to cover any other problems in a 
separate Jira issue if [~ZhenqiuHuang] comes up with new information (see his 
comment above).


was (Author: mapohl):
Hi [~yunta] we're not aware of any issues related to FLINK-34007 in Flink 1.17. 
The issue started to appear with the upgrade of the k8s client dependency to 
v6.6.2 in FLINK-31997 (which ended up in Flink 1.18).

That said, [~ZhenqiuHuang] reported similar errors in Flink 1.17 and 1.16 
deployments as well which we cannot explain. We were not able to investigate 
the cause due to missing logs. We agreed to cover any other problems in a 
separate Jira issue if [~ZhenqiuHuang] comes up with new information (see his 
comment above).

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.19.0
>
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-24 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810038#comment-17810038
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/24/24 8:37 AM:


Just checking whether a new instance of {{LeaderElector}} was created is tricky 
because {{LeaderElector}} is coupled with the k8s backend. We would either have 
to mock {{LeaderElector}} or wire up some backend so that it works properly 
within Flink's {{{}KubernetesLeaderElector{}}}. Anyway, I came up with a quite 
straightforward ITCase. I added it to the PR, which is ready to be reviewed now.


was (Author: mapohl):
Just checking whether a new instance of \{{LeaderElector}} was created is 
tricky because \{{LeaderElector}} is coupled with the k8s backend. We would 
either have to mock \{{LeaderElector}} or have to work with some backend. 
Anyway, I came up with a quite straight-forward ITCase. I added it to the PR 
which is ready to be reviewed now.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.19.0, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Assignee: Matthias Pohl
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-19 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808872#comment-17808872
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/20/24 6:11 AM:


Increasing the thread count was necessary because we executed the run command 
in 
[KubernetesLeaderElector#run|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L103]
 in the same thread pool that is used in the fabric8io LeaderElector for 
running the CompletableFutures in the loop:

The {{LeaderElector#run}} call waits for the CompletableFuture returned by 
{{LeaderElector#acquire}} to complete (see 
[LeaderElector:70|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L70]).
 {{LeaderElector#acquire}} will trigger an asynchronous call on the 
{{executorService}} which wouldn't pick up the task because the single thread 
is waiting for the acquire call to complete. This deadlock situation is 
reproduced by Flink's {{{}KubernetesLeaderElectionITCase{}}}. I would imagine 
that this is the timeout [~gyfora] was experiencing when upgrading to v6.6.2.
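
For illustration, here is a minimal, self-contained sketch of that deadlock 
pattern (hypothetical class name, not the fabric8io or Flink code): a task 
submitted to a single-thread executor blocks on a future that can only be 
completed by another task scheduled on the same executor.

{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleThreadDeadlockSketch {
    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();

        executor.submit(() -> {
            // Stand-in for LeaderElector#run: it waits for the "acquire" future ...
            CompletableFuture<Void> acquired =
                    CompletableFuture.runAsync(() -> System.out.println("acquired"), executor);
            // ... but the pool's only thread is busy right here, so the future can
            // never complete: the program hangs with a pool size of 1.
            acquired.join();
        });

        // With Executors.newFixedThreadPool(2) instead, "acquired" would be printed.
    }
}
{code}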

The 3 threads that were introduced by FLINK-31997 shouldn't have caused any 
issues because the LeaderElector only writes to the ConfigMap in 
[LeaderElector#tryAcquireOrRenew|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L210]
 and 
[LeaderElector#release|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L147].
 Both methods are synchronized. Hence, they shouldn't cause a race condition in 
any way as far as I can see.

I created a [draft PR with a fix|https://github.com/apache/flink/pull/24132] 
that recreates the fabric8io {{LeaderElector}} instead of reusing it. The fix 
also reverts the thread pool size from 3 to 1, with 
{{KubernetesLeaderElectionITCase}} passing again. I still have to think of a 
way to test Flink's KubernetesLeaderElector against the issue that caused the 
failure reported in this Jira. The only way I could think of is doing something 
similar to what's done in the [fabric8io 
codebase|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/test/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElectorTest.java#L63]
 with Mockito. Any other ideas are appreciated. I would like to avoid Mockito.
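
The rough shape of that fix, as a hedged sketch with hypothetical names (not 
the actual PR code): a fresh elector instance is built for every election 
attempt, and {{run()}} is executed on an executor owned by the wrapper, so the 
fabric8io pool is never blocked by the outer call.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical sketch of the idea: never reuse a stopped elector instance.
final class RecreatingLeaderElector {

    /** Minimal stand-in for the fabric8io LeaderElector used in this sketch. */
    interface Elector {
        void run(); // blocks until leadership is lost or the elector stops
    }

    private final Supplier<Elector> electorFactory; // builds a new elector per attempt
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    RecreatingLeaderElector(Supplier<Elector> electorFactory) {
        this.electorFactory = electorFactory;
    }

    /** Called initially and again from the not-leader callback. */
    void runAttempt() {
        executor.execute(() -> electorFactory.get().run());
    }
}
{code}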


was (Author: mapohl):
Increasing the thread size was necessary because we executed the run command in 
[KubernetesLeaderElector#run|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L103]
 in the same thread pool that is used in the fabric8io LeaderElector for 
running the CompetableFutures in the loop:

The {{LeaderElector#run}} call waits for the CompletableFuture returned by 
{{LeaderElector#acquire}} to complete (see 
[LeaderElector:70|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L70]).
 {{LeaderElector#acquire}} will trigger an asynchronous call on the 
{{executorService}} which wouldn't pick up the task because the single thread 
is waiting for the acquire call to complete. This deadlock situation is 
reproduced by Flink's {{{}KubernetesLeaderElectionITCase{}}}. I would imagine 
that this is the timeout, [~gyfora] was experiencing when upgrading to v6.6.2.

The 3 threads that were introduced by FLINK-31997 shouldn't have caused any 
issues because the LeaderElector only writes to the ConfigMap in 
[LeaderElector#tryAcquireOrRenew|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L210]
 and 
[LeaderElector#release|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L147].
 Both methods are synchronized. Hence, they shouldn't cause a race condition in 
any way as far as I can see.

I created a [draft PR with a fix|https://github.com/apache/flink/pull/24132] 
that recreates the fabric8io {{LeaderElector}} instead of reusing 

[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-19 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808872#comment-17808872
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/20/24 6:12 AM:


Increasing the thread count appeared to be necessary because the old Flink code 
executed the fabric8io {{LeaderElector#run}} command in 
[KubernetesLeaderElector#run|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L103]
 in the same thread pool that is used in the fabric8io LeaderElector for 
running the CompletableFutures in the loop:

The {{LeaderElector#run}} call waits for the CompletableFuture returned by 
{{LeaderElector#acquire}} to complete (see 
[LeaderElector:70|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L70]).
 {{LeaderElector#acquire}} will trigger an asynchronous call on the 
{{executorService}} which wouldn't pick up the task because the single thread 
is waiting for the acquire call to complete. This deadlock situation is 
reproduced by Flink's {{{}KubernetesLeaderElectionITCase{}}}. I would imagine 
that this is the timeout [~gyfora] was experiencing when upgrading to v6.6.2.

The 3 threads that were introduced by FLINK-31997 shouldn't have caused any 
issues because the LeaderElector only writes to the ConfigMap in 
[LeaderElector#tryAcquireOrRenew|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L210]
 and 
[LeaderElector#release|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L147].
 Both methods are synchronized. Hence, they shouldn't cause a race condition in 
any way as far as I can see.

I created a [draft PR with a fix|https://github.com/apache/flink/pull/24132] 
that recreates the fabric8io {{LeaderElector}} instead of reusing it. The fix 
also reverts the thread pool size from 3 to 1, with 
{{KubernetesLeaderElectionITCase}} passing again. I still have to think of a 
way to test Flink's KubernetesLeaderElector against the issue that caused the 
failure reported in this Jira. The only way I could think of is doing something 
similar to what's done in the [fabric8io 
codebase|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/test/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElectorTest.java#L63]
 with Mockito. Any other ideas are appreciated. I would like to avoid Mockito.


was (Author: mapohl):
Increasing the thread count was necessary because we executed the run command 
in 
[KubernetesLeaderElector#run|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L103]
 in the same thread pool that is used in the fabric8io LeaderElector for 
running the CompetableFutures in the loop:

The {{LeaderElector#run}} call waits for the CompletableFuture returned by 
{{LeaderElector#acquire}} to complete (see 
[LeaderElector:70|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L70]).
 {{LeaderElector#acquire}} will trigger an asynchronous call on the 
{{executorService}} which wouldn't pick up the task because the single thread 
is waiting for the acquire call to complete. This deadlock situation is 
reproduced by Flink's {{{}KubernetesLeaderElectionITCase{}}}. I would imagine 
that this is the timeout, [~gyfora] was experiencing when upgrading to v6.6.2.

The 3 threads that were introduced by FLINK-31997 shouldn't have caused any 
issues because the LeaderElector only writes to the ConfigMap in 
[LeaderElector#tryAcquireOrRenew|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L210]
 and 
[LeaderElector#release|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L147].
 Both methods are synchronized. Hence, they shouldn't cause a race condition in 
any way as far as I can see.

I created a [draft PR with a fix|https://github.com/apache/flink/pull/24132] 
that 

[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-18 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17808237#comment-17808237
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/18/24 2:30 PM:


I checked the implementation (since we're getting close to the 1.19 feature 
freeze). We have the following options:
 # We could downgrade the fabric8io kubernetes client dependency back to 
v5.12.4 (essentially reverting FLINK-31997).
 # We could fix the issue in the fabric8io kubernetes client and update the 
dependency as soon as the fix is released. Here, I'm also not that confident 
that we would be able to bring the fix into Flink before 1.19 is 
released ...because we're relying on the release of another project.
 # Refactor the k8s implementation to allow the restart of the 
KubernetesLeaderElector within the KubernetesLeaderElectionDriver. That would 
require updating the lockIdentity as well. The problem is that the lockIdentity 
is actually not owned by the KubernetesLeaderElector but by the k8s 
HighAvailabilityServices (because it's also used by the 
KubernetesStateHandleStore when checking the leadership). Even though moving 
the lockIdentity into the KubernetesLeaderElector makes sense (the 
KubernetesStateHandleStore should rely on the LeaderElectionService to detect 
whether leadership is acquired instead), it is a larger effort and I am 
hesitant to work on it this close to the 1.19 feature freeze (a rough sketch of 
the lockIdentity ownership follows below).
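
A rough sketch (hypothetical names, not Flink's actual classes) of the 
lockIdentity ownership mentioned in option 3: the identity is created by the HA 
services and handed to both the elector and the state handle store, so 
restarting the elector with a new identity would break the store's leadership 
check.

{code:java}
import java.util.UUID;

// Hypothetical sketch: the lock identity is owned by the HA services and shared by
// two components, which is why it cannot simply be regenerated inside the elector.
final class HaServicesSketch {
    private final String lockIdentity = UUID.randomUUID().toString();

    ElectorSketch createLeaderElector() {
        return new ElectorSketch(lockIdentity);
    }

    StateHandleStoreSketch createStateHandleStore() {
        // The store compares against the same identity before writing state handles.
        return new StateHandleStoreSketch(lockIdentity);
    }

    record ElectorSketch(String lockIdentity) {}

    record StateHandleStoreSketch(String lockIdentity) {}
}
{code}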

I feel like we should apply all three options in the above order. Option #1 
would end up in 1.19.0 and 1.18.3, with option #2 being the follow-up. Option #3 
could be considered as a dedicated refactoring effort in 1.20 or later. What's 
your view on that?

My proposal only covers the issue we identified in k8s client v6.6.2. I ignored 
the fact that there are issues observed in deployments with Flink 1.17-. Any 
investigation around Flink 1.17- leadership issues should be moved into a 
separate Jira issue.


was (Author: mapohl):
I checked the implementation (since we're getting close to the 1.19 feature 
freeze). We have the following options:
 # We could downgrade the fabric8io kubernetes client dependency back to 
v5.12.4 (essentially reverting FLINK-31997).
 # We could fix the issue in the fabric8io kubernetes client and update the 
dependency as soon as the fix is released. Here, I'm also not that confident 
that we would be able to bring the fix into Flink before 1.19 should be 
released. ...because we're relying on the release of another project.
 # Refactor the k8s implementation to allow the restart of the 
KubernetesLeaderElector within the KubernetesLeaderElectionDriver. That would 
require updating the lockIdentify as well. The problem is that the lockIdentity 
is actually not owned by the KubernetesLeaderElector but by the k8s 
HighAvailabilityServices (because it's also used by the 
KubernetesStateHandleStore when checking the leadership). Even though moving 
the lockIdentity into the KubernetesLeaderElector makes sense (the 
KubernetesStateHandleStore should rely on the LeaderElectionService to detect 
whether leadership is acquired, instead), it is a larger effort and I am 
hesitant to work on that one that closely to the 1.19 feature freeze.

I feel like we should apply all the three options in the above order. Option #1 
would end up in 1.19.0 and 1.18.3 with option #2 being the follow-up. Option #3 
could be considered as a dedicated refactoring effort in 1.20 or later. What's 
your view on that?

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
>  Labels: pull-request-available
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807229#comment-17807229
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/16/24 2:04 PM:


The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).
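
As a simplified illustration of that {{stopped}} flag behavior (hypothetical 
class, not the actual fabric8io {{LeaderElector}}): once the renew loop gives 
up, the flag is set and never reset, so re-running the same instance cannot 
re-acquire the lease.

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

final class StoppedFlagSketch {
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    void run() {
        while (!stopped.get()) {
            if (!tryAcquireOrRenew()) {
                stopLeading(); // renew timed out -> stopped becomes true
            }
        }
        // A second run() on this same instance returns immediately: stopped stays true.
    }

    boolean tryAcquireOrRenew() {
        if (stopped.get()) {
            return false; // mirrors the guard described above
        }
        return renewLeaseAgainstApiServer();
    }

    private void stopLeading() {
        stopped.set(true);
    }

    private boolean renewLeaseAgainstApiServer() {
        return false; // stand-in for the ConfigMap/Lease update
    }
}
{code}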

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation, which 
we should be able to work around by recreating the LeaderElector on leadership 
loss. I replied to 
[fabric8io:kubernetes-client#5635|https://github.com/fabric8io/kubernetes-client/issues/5635]
 because the bug report provides the same stacktrace. That also explains why 
restarting the JobManager fixes the issue: it results in a new (not-stopped) 
LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.


was (Author: mapohl):
The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation which 
we should be able to workaround by recreating a new LeaderElector on leadership 
loss. I replied to 
[fabric8io:kubernetes-client#5635|https://github.com/fabric8io/kubernetes-client/issues/5635]
 because the bug report provides the same stacktrace. That also explains why 
restarting the JobManager fixes the issue. Because that results in a new 
LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager 

[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807229#comment-17807229
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/16/24 2:02 PM:


The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation, which 
we should be able to work around by recreating the LeaderElector on leadership 
loss. I replied to 
[fabric8io:kubernetes-client#5635|https://github.com/fabric8io/kubernetes-client/issues/5635]
 because the bug report provides the same stacktrace. That also explains why 
restarting the JobManager fixes the issue: it results in a new 
LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.


was (Author: mapohl):
The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation which 
we should be able to workaround by recreating a new LeaderElector on leadership 
loss. That also explains why restarting the JobManager fixes the issue. Because 
that results in a new LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807229#comment-17807229
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/16/24 1:52 PM:


The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation, which 
we should be able to work around by recreating the LeaderElector on leadership 
loss. That also explains why restarting the JobManager fixes the issue: it 
results in a new LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.


was (Author: mapohl):
The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation which 
we should be able to workaround by recreating a new LeaderElector on leadership 
loss. That also explains why restarting the JobManager fixes the issue. Because 
that results in a new LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807229#comment-17807229
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/16/24 1:50 PM:


The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and is never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation, which 
we should be able to work around by recreating the LeaderElector on leadership 
loss. That also explains why restarting the JobManager fixes the issue: it 
results in a new LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.


was (Author: mapohl):
The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation which 
we should be able to workaround by recreating a new LeaderElector on leadership 
loss. That also explains why restarting the JobManager fixes the issue. Because 
that results in a new LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807229#comment-17807229
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/16/24 1:01 PM:


The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation, which 
we should be able to work around by recreating the LeaderElector on leadership 
loss. That also explains why restarting the JobManager fixes the issue: it 
results in a new LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.


was (Author: mapohl):
The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

But Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now) stopped instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation which 
we should be able to workaround by recreating a new LeaderElector on leadership 
loss. That also explains why restarting the JobManager fixes the issue. Because 
that results in a new LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-16 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807229#comment-17807229
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/16/24 1:01 PM:


The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now stopped) instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation, which 
we should be able to work around by recreating the LeaderElector on leadership 
loss. That also explains why restarting the JobManager fixes the issue: it 
results in a new LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.


was (Author: mapohl):
The timeout of the renew operation leads to a {{stopLeading}} call in 
[LeaderElector:96|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 which results in LeaderElector being stopped (the {{stopped}} flag is set to 
true in 
[LeaderElector:119|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L119]
 and never disabled again). Any later call of {{tryAcquireOrRenew}} will not 
perform any action because of the {{stopped}} flag being true (see 
[LeaderElector#tryAcquireOrRenew:211|https://github.com/fabric8io/kubernetes-client/blob/0f6c696509935a6a86fdb4620caea023d8e680f1/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L211]).

Flink doesn't re-instantiate the {{LeaderElector}} but calls 
{{LeaderElector#run}} on the same (now) stopped instance in the 
[KubernetesLeaderElectionDriver#notLeader|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L214]
 callback.

That explains the behavior in the Flink 1.18 deployments, if I didn't miss 
anything. It sounds like a bug in the fabric8io k8s client implementation which 
we should be able to workaround by recreating a new LeaderElector on leadership 
loss. That also explains why restarting the JobManager fixes the issue. Because 
that results in a new LeaderElector being instantiated.

I'm still puzzled about the issues in other Flink versions, though.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807017#comment-17807017
 ] 

Zhenqiu Huang edited comment on FLINK-34007 at 1/16/24 3:40 AM:


[~mapohl] [~wangyang0918]
I am intensively testing Flink 1.18. Within two days, users reported the job 
manager stuck issue in 1.17 and 1.16 as well. The 1.18 and 1.17 job instances 
are running in the same cluster; 1.16 is in a different cluster.

I attached another LeaderElector-Debug.json file that contains the debug log of 
a Flink 1.18 app. The issue happened several times:
1. the ConfigMap was not accessible from the API server, so the renew timeout 
was exceeded
2. a patch on an updated ConfigMap failed


The interesting part of the behavior of the last several days is that the job 
manager was not stuck but exited directly. Then a new job manager pod started 
correctly, which is why a new leader is selected in the log above. Hopefully it 
is useful for your diagnosis.


[~wangyang0918]
From my initial observation (before creating the Jira), the leader annotation 
update stopped when the job manager was stuck.







was (Author: zhenqiuhuang):
[~mapohl] [~wangyang0918]
I am intensively testing flink 1.18. Within two days, there are users reported 
the job manager stuck issue in 1.17 and 1.16. 1.18 and 1.17 job instances are 
running in the same cluster. 1.16 is in different cluster.

I attached another LeaderElector-Debug.json file that contains debug log of a 
flink 1.18 app. The issue happened several times:
1. due to the configmap not accessible from api sever then renew timeout 
exceeded 
2. a failure on patch on a updated configmap

The interesting part of the behavior of last several days is that job manager 
was not stuck but exit directly. Then, new job manager pod started correctly 
that is why new leader is selected in the log above. Hopefully, it is useful 
for your diagnosis.






> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Zhenqiu Huang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807017#comment-17807017
 ] 

Zhenqiu Huang edited comment on FLINK-34007 at 1/16/24 3:34 AM:


[~mapohl] [~wangyang0918]
I am intensively testing Flink 1.18. Within two days, users reported the job 
manager stuck issue in 1.17 and 1.16 as well. The 1.18 and 1.17 job instances 
are running in the same cluster; 1.16 is in a different cluster.

I attached another LeaderElector-Debug.json file that contains the debug log of 
a Flink 1.18 app. The issue happened several times:
1. the ConfigMap was not accessible from the API server, so the renew timeout 
was exceeded
2. a patch on an updated ConfigMap failed

The interesting part of the behavior of the last several days is that the job 
manager was not stuck but exited directly. Then a new job manager pod started 
correctly, which is why a new leader is selected in the log above. Hopefully it 
is useful for your diagnosis.







was (Author: zhenqiuhuang):
[~mapohl]
I am intensively testing flink 1.18. Within two days, there are users reported 
the job manager stuck issue in 1.17 and 1.16. 1.18 and 1.17 job instances are 
running in the same cluster. 1.16 is in different cluster.

I attached another LeaderElector-Debug.json file that contains debug log of a 
flink 1.18 app. The issue happened several times:
1. due to the configmap not accessible from api sever then renew timeout 
exceeded 
2. a failure on patch on a updated configmap

The interesting part of the behavior of last several days is that job manager 
was not stuck but exit directly. Then, new job manager pod started correctly 
that is why new leader is selected in the log above. Hopefully, it is useful 
for your diagnosis.






> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, LeaderElector-Debug.json, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806759#comment-17806759
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/15/24 11:57 AM:
-

But it should be an issue that is present in different k8s client versions 
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|

At least based on the reports in this Jira issue, there must have been an 
incident (which caused the lease to not be renewed) in a k8s cluster that 
triggered the same failure in multiple Flink clusters (running at least 1.18, 
1.17 and 1.16) ...if I understand it correctly.

Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2]. 
But the change you're referring to ([PR 
#4152|https://github.com/fabric8io/kubernetes-client/pull/4125]) seems to be 
the only bigger change in {{{}LeaderElector{}}}, indeed.

—

On another note: I remembered that there is a slight difference in the 
revocation protocol in the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes:
 * The old implementation (see [1.15 
DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238])
 did try to clear the leader information from the ConfigMap.
 * The new implementation (see [1.18+ 
DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484])
 doesn't clear the component leader information anymore. Here, the reasoning 
was that the data couldn't be updated anymore because the 
leadership is already lost.

But that change still seems reasonable based on my findings: In the k8s client 
6.6.2 codebase, {{stopLeading}} is called either after noticing a change in the 
lock identity 
([LeaderElector:L238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238];
 in that case the lock identity change would prevent the clearing of the data 
anyway) or when the lease wasn't renewed 
([LeaderElector:95|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L95]),
 in which case we would have to assume that other leader information has 
already been written. And this change shouldn't be related to the issues with 
the lock lifecycle in general, because it only affects metadata and not the 
lock annotation itself, should it? WDYT?
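
To make that reasoning a bit more concrete, here is a minimal sketch 
(hypothetical code, neither Flink's nor the k8s client's implementation) of 
what a guarded cleanup on revocation would have to look like. The 
{{control-plane.alpha.kubernetes.io/leader}} annotation key, the string-based 
holder check and the leader-information keys are assumptions for illustration 
only; the point is that once the lock identity has changed, the guard fails and 
nothing can (or should) be cleared anymore:

{code:java}
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;

import java.util.Map;

public class RevocationCleanupSketch {

    // Annotation under which the fabric8 ConfigMapLock keeps its leader record
    // (assumption for illustration).
    private static final String LEADER_ANNOTATION = "control-plane.alpha.kubernetes.io/leader";

    /**
     * Hypothetical revocation handler: clear the component's leader data only while the
     * ConfigMap still names this process as the lock holder. After a lock-identity change
     * (a new leader has already acquired the lock) the guard fails and nothing is cleared.
     */
    static void onLeadershipRevoked(KubernetesClient client, String ns, String name, String ownIdentity) {
        ConfigMap cm = client.configMaps().inNamespace(ns).withName(name).get();
        if (cm == null) {
            return; // ConfigMap is already gone, nothing to clear
        }
        Map<String, String> annotations = cm.getMetadata().getAnnotations();
        String record = annotations == null ? null : annotations.get(LEADER_ANNOTATION);
        // crude holder-identity check instead of parsing the JSON record properly
        boolean stillOwnsLock =
                record != null && record.contains("\"holderIdentity\":\"" + ownIdentity + "\"");
        if (!stillOwnsLock) {
            return; // leadership already taken over; clearing now would drop the new leader's data
        }
        client.configMaps().inNamespace(ns).withName(name)
                .edit(c -> new ConfigMapBuilder(c)
                        .removeFromData("address")   // hypothetical leader-information keys
                        .removeFromData("sessionId")
                        .build());
    }
}
{code}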


was (Author: mapohl):
But it should be an issue that is available in different k8s client version 
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|

At least based on the reports of this Jira issue, there must have been an 
incident (which caused the lease to not be renewed) in a k8s cluster that 
triggered the same failure in multiple Flink clusters (with versions of 1.18, 
1.17 and 1.16 at least) that triggered the same issue in all of those 
deployments. ...if I understand it correctly.

Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2].

—

On another note: I remembered that there is a slight difference in the 
revocation protocol in the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes:
 * The old implementation (see [1.15 
DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238])
 did try to clear the leader information from the ConfigMap.
 * The new implementation (see [1.18+ 
DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484])
 doesn't clear the component leader information, anymore. Here, the reasoning 
was that the data wouldn't be able to be updated, anymore, because the 
leadership is already lost.

But that change still seems to be reasonable based on my findings: In the k8s 
client 6.6.2 codebase, 

[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806759#comment-17806759
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/15/24 11:41 AM:
-

But it should be an issue that is available in different k8s client version 
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|

At least based on the reports of this Jira issue, there must have been an 
incident (which caused the lease to not be renewed) in a k8s cluster that 
triggered the same failure in multiple Flink clusters (with versions of 1.18, 
1.17 and 1.16 at least) that triggered the same issue in all of those 
deployments. ...if I understand it correctly.

Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2].

—

On another note: I remembered that there is a slight difference in the 
revocation protocol in the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes:
 * The old implementation (see [1.15 
DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238])
 did try to clear the leader information from the ConfigMap.
 * The new implementation (see [1.18+ 
DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484])
 doesn't clear the component leader information, anymore. Here, the reasoning 
was that the data wouldn't be able to be updated, anymore, because the 
leadership is already lost.

But that change still seems to be reasonable based on my findings: In the k8s 
client 6.6.2 codebase, {{stopLeading}} is either called after noticing the 
change in the lock identity 
([LeaderElector:L238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238];
 the lock identity change would prevent the clearing of the data) or when the 
lease wasn't renewed 
([LeaderElector:95|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L95])
 where we would have to assume that other leader information is already 
written. And this change shouldn't be related to the issues with the lock 
lifecycle in general because it only affects metadata and not the lock 
annotation itself, should it? WDYT?


was (Author: mapohl):
But it should be an issue that is available in different k8s client version 
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|

At least based on the reports of this Jira issue, there must have been an 
incident in a k8s cluster that triggered the same failure in multiple Flink 
clusters (with versions of 1.18, 1.17 and 1.16 at least) that triggered the 
same issue in all of those deployments. ...if I understand it correctly.

Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2].

—

On another note: I remembered that there is a slight difference in the 
revocation protocol in the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes:
 * The old implementation (see [1.15 
DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238])
 did try to clear the leader information from the ConfigMap.
 * The new implementation (see [1.18+ 
DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484])
 doesn't clear the component leader information, anymore. Here, the reasoning 
was that the data wouldn't be able to be updated, anymore, because the 
leadership is already lost.

But that change still seems to be reasonable based on my findings: In the k8s 
client 6.6.2 codebase, {{stopLeading}} is either called after noticing the 
change in the lock identity 

[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806759#comment-17806759
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/15/24 11:38 AM:
-

But it should be an issue that is available in different k8s client version 
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|

At least based on the reports of this Jira issue, there must have been an 
incident in a k8s cluster that triggered the same failure in multiple Flink 
clusters (with versions of 1.18, 1.17 and 1.16 at least) that triggered the 
same issue in all of those deployments. ...if I understand it correctly.

Therefore, the issue should exist in the entire version range [5.12.3, 6.6.2].

—

On another note: I remembered that there is a slight difference in the 
revocation protocol in the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes:
 * The old implementation (see [1.15 
DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238])
 did try to clear the leader information from the ConfigMap.
 * The new implementation (see [1.18+ 
DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484])
 doesn't clear the component leader information, anymore. Here, the reasoning 
was that the data wouldn't be able to be updated, anymore, because the 
leadership is already lost.

But that change still seems to be reasonable based on my findings: In the k8s 
client 6.6.2 codebase, {{stopLeading}} is either called after noticing the 
change in the lock identity 
([LeaderElector:L238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238];
 the lock identity change would prevent the clearing of the data) or when the 
lease wasn't renewed 
([LeaderElector:95|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L95])
 where we would have to assume that other leader information is already 
written. And this change shouldn't be related to the issues with the lock 
lifecycle in general because it only affects metadata and not the lock 
annotation itself, should it? WDYT?


was (Author: mapohl):
But it should be an issue that is available in different k8s client version 
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|

At least based on the reports of this Jira issue, there must have been an 
incident in a k8s cluster that triggered the same failure in multiple Flink 
clusters (with versions of 1.18, 1.17 and 1.16 at least) that triggered the 
same issue in all of those deployments. ...if I understand it correctly.

Therefore, the issue should exist in [5.12.3, 6.6.2].

—

On another note: I remembered that there is a slight difference in the 
revocation protocol in the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes:
 * The old implementation (see [1.15 
DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238])
 did try to clear the leader information from the ConfigMap.
 * The new implementation (see [1.18+ 
DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484])
 doesn't clear the component leader information, anymore. Here, the reasoning 
was that the data wouldn't be able to be updated, anymore, because the 
leadership is already lost.

But that change still seems to be reasonable based on my findings: In the k8s 
client 6.6.2 codebase, {{stopLeading}} is either called after noticing the 
change in the lock identity 

[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806759#comment-17806759
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/15/24 11:24 AM:
-

But it should be an issue that is available in different k8s client version 
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|

At least based on the reports of this Jira issue, there must have been an 
incident in a k8s cluster that triggered the same failure in multiple Flink 
clusters (with versions of 1.18, 1.17 and 1.16 at least) that triggered the 
same issue in all of those deployments. ...if I understand it correctly.

Therefore, the issue should exist in [5.12.3, 6.6.2].

—

On another note: I remembered that there is a slight difference in the 
revocation protocol in the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes:
 * The old implementation (see [1.15 
DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238])
 did try to clear the leader information from the ConfigMap.
 * The new implementation (see [1.18+ 
DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484])
 doesn't clear the component leader information, anymore. Here, the reasoning 
was that the data wouldn't be able to be updated, anymore, because the 
leadership is already lost.

But that change still seems to be reasonable based on my findings: In the k8s 
client 6.6.2 codebase, {{stopLeading}} is either called after noticing the 
change in the lock identity 
([LeaderElector:L238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238];
 the lock identity change would prevent the clearing of the data) or when the 
lease wasn't renewed 
([LeaderElector:95|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L95])
 where we would have to assume that other leader information is already 
written. And this change shouldn't be related to the issues with the lock 
lifecycle in general because it only affects metadata and not the lock 
annotation itself, should it? WDYT?


was (Author: mapohl):
But it should be an issue that is available in different k8s client version 
(not only 6.6.2):
||Flink||k8s client||Jira issue||
|1.18|6.6.2|FLINK-31997|
|1.17|5.12.4|FLINK-30231|
|1.16|5.12.3|FLINK-28481|
|1.14-1.15|5.5.0|FLINK-22802|

At least based on the reports of this Jira issue, there must have been an 
incident in a k8s cluster that triggered the same failure in multiple Flink 
clusters (with versions of 1.18, 1.17 and 1.16 at least) that triggered the 
same issue in all of those deployments. ...if I understand it correctly.

Therefore, the issue should exist in [5.12.3, 6.6.2].

---

On another note: I remembered that there is a slight difference in the 
revocation protocol in the 
[FLIP-285|https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box]
 changes:
 * The old implementation (see [1.15 
DefaultLeaderElectionService:238|https://github.com/apache/flink/blob/6e1caa390882996bf2d602951b54e4bb2d9c90dc/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L238])
 did try to clear the leader information from the ConfigMap.
 * The new implementation (see [1.18+ 
DefaultLeaderElectionService:484|https://github.com/apache/flink/blob/773feebbb2426ab1a8f7684f59b9a73db8f6a613/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/DefaultLeaderElectionService.java#L484])
 doesn't clear the component leader information, anymore. Here, the reasoning 
was that the data wouldn't be able to be updated, anymore, because the 
leadership is already lost.

But that change still seems to be reasonable based on my findings: In the k8s 
client 6.6.2 codebase, {{stopLeading}} is either called after noticing the 
change in the lock identity 
([LeaderElector:L238|https://github.com/fabric8io/kubernetes-client/blob/f91e0bd8e364f9a3758af0b90b9c661d0fc0a9eb/kubernetes-client-api/src/main/java/io/fabric8/kubernetes/client/extended/leaderelection/LeaderElector.java#L238];
 

[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806699#comment-17806699
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/15/24 9:17 AM:


Sorry, I misunderstood you initially. If the annotation is the problem, I would 
assume the issue lies somewhere within the k8s client library. The annotation 
was never touched by the Flink code (there's only one read access in 
[KubernetesLeaderElector#hasLeadership|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L115]
 to verify the leadership). ...even before the changes of FLINK-24038. The 
lifecycle of the annotation is handled within the fabric8.io client's 
{{{}LeaderElector{}}}. I'm going to do some more investigation of that code 
segment. But having more debug k8s/kubernetes-client logs would help, I guess.


was (Author: mapohl):
If the annotation is the problem, I would assume the issue being somewhere 
within the k8s client library. The annotation was never touched by the Flink 
code (there's only one read access in 
[KubernetesLeaderElector#hasLeadership|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L115]
 to verify the leadership). ...even before the changes of FLINK-24038. The 
lifecycle of the annotation is handled within the fabric8.io client's 
{{{}LeaderElector{}}}. I'm gonna do some more investigation of that code 
segment. But having more debug k8s/kubernetes-client logs would help, I guess.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-15 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17806699#comment-17806699
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/15/24 8:56 AM:


If the annotation is the problem, I would assume the issue being somewhere 
within the k8s client library. The annotation was never touched by the Flink 
code (there's only one read access in 
[KubernetesLeaderElector#hasLeadership|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L115]
 to verify the leadership). ...even before the changes of FLINK-24038. The 
lifecycle of the annotation is handled within the fabric8.io client's 
{{{}LeaderElector{}}}. I'm gonna do some more investigation of that code 
segment. But having more debug k8s/kubernetes-client logs would help, I guess.


was (Author: mapohl):
If the annotation is the problem, I would assume the issue being somewhere 
within the k8s client library. The annotation was never touched by the Flink 
code (there's only one read access in 
[KubernetesLeaderElector#hasLeadership|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L115]
 to verify the leadership). The lifecycle of the annotation is handled within 
the fabric8.io client's \{{LeaderElector}}. I'm gonna do some more 
investigation of that code segment. But having more debug k8s/kubernetes-client 
logs would help, I guess.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-12 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805960#comment-17805960
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/12/24 10:19 AM:
-

I couldn't find anything related to ConfigMap cleanup being triggered during 
leadership loss in the Flink code (Flink's 
[KubernetesLeaderElector:82|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L82]
 sets up the [callback for the leadership 
loss|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L124]
 which is implemented by 
[KubernetesLeaderElectionDriver#LeaderCallbackHandlerImpl|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L212]).
 This behavior is on-par with the old (i.e. pre-FLINK-24038) version of the 
Flink 1.15 codebase (see 1.15 class 
[KubernetesLeaderElectionDriver:202|https://github.com/apache/flink/blob/bba7c417217be878fffb12efedeac50dec2a7459/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L202]).

That's also not something I would expect. It should be handled by the 
LeaderElector instead, because the LeaderElector knows the state of the leader 
election and can trigger a cleanup before the new leader information is written 
to the ConfigMap entry. Flink shouldn't trigger a cleanup because it doesn't 
know whether a new leader was already elected (in which case cleaning up the 
ConfigMap entry would result in losing the leadership information of the new 
leader). And the Flink process wouldn't be able to clean it up anyway, because 
the process isn't the leader anymore. Or am I missing something here?

On another note: I came across [this 
change|https://github.com/apache/flink/pull/22540/files#diff-0e859df42954459619211d2ec60957742b24c9fc6ce55526616fddc540f0f8ffL59-R60]
 in the FLINK-31997 PR (k8s client update to 6.6.2): We're changing the thread 
pool size from 1 to 3, essentially allowing the same internal LeaderElector to 
be executed multiple times concurrently (because we trigger another 
{{KubernetesLeaderElector#run}} call when the leadership is revoked). The old 
version of the code used a single thread, which means a new run call would be 
queued until the previous {{LeaderElector#run}} failed for whatever reason. 
That change sounds strange but shouldn't be the cause of this Jira issue, 
because the change only went into 1.18 and we're also experiencing this in 
older versions of Flink.
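
The scheduling difference is easy to see with a plain executor. This is only a 
toy example, not the actual Flink wiring; the sleep stands in for a blocking 
{{LeaderElector#run}} call:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ExecutorQueueingDemo {

    // Simulates a long-running, blocking LeaderElector#run call.
    private static void blockingRun(String name) {
        System.out.println(name + " started on " + Thread.currentThread().getName());
        try {
            TimeUnit.SECONDS.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        System.out.println(name + " finished");
    }

    public static void main(String[] args) throws InterruptedException {
        // With a single thread, the second submission is queued until the first run returns.
        ExecutorService singleThread = Executors.newFixedThreadPool(1);
        singleThread.execute(() -> blockingRun("run-1 (single thread)"));
        singleThread.execute(() -> blockingRun("run-2 (single thread)")); // waits for run-1

        // With three threads, both submissions run concurrently, i.e. two
        // "LeaderElector" loops could be active at the same time.
        ExecutorService threeThreads = Executors.newFixedThreadPool(3);
        threeThreads.execute(() -> blockingRun("run-1 (pool of 3)"));
        threeThreads.execute(() -> blockingRun("run-2 (pool of 3)")); // starts immediately

        singleThread.shutdown();
        threeThreads.shutdown();
        singleThread.awaitTermination(30, TimeUnit.SECONDS);
        threeThreads.awaitTermination(30, TimeUnit.SECONDS);
    }
}
{code}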


was (Author: mapohl):
I couldn't find anything related to ConfigMap cleanup being triggered during 
leadership loss in the Flink code (Flink's 
[KubernetesLeaderElector:82|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L82]
 sets up the [callback for the leadership 
loss|https://github.com/apache/flink/blob/c5808b04fdce9ca0b705b6cc7a64666ab6426875/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L124]
 which is implemented by 
[KubernetesLeaderElectionDriver#LeaderCallbackHandlerImpl|https://github.com/apache/flink/blob/11259ef52466889157ef473f422ecced72bab169/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L212]).
 This behavior is on-par with the old (i.e. pre-FLINK-24038) version of the 
Flink 1.15 codebase (see 1.15 class 
[KubernetesLeaderElectionDriver:202|https://github.com/apache/flink/blob/bba7c417217be878fffb12efedeac50dec2a7459/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesLeaderElectionDriver.java#L202]).

That's also not something I would expect. It should be handled by the 
LeaderElector, instead, because the LeaderElector knows the state of the leader 
election and can trigger a clean up before the new leader information is 
written to the ConfigMap entry. Flink shouldn't trigger a clean up because it 
doesn't know whether a new leader was already elected (in which case cleaning 
up the ConfigMap entry would result in losing the leadership information of the 
new leader). And the Flink process wouldn't be able to clean it up, anyway, 
because the process isn't the leader anymore. Or am I missing something here?

On another note: I came across [this 
change|https://github.com/apache/flink/pull/22540/files#diff-0e859df42954459619211d2ec60957742b24c9fc6ce55526616fddc540f0f8ffL59-R60]
 in the FLINK-31997 PR (k8s client 

[jira] [Comment Edited] (FLINK-34007) Flink Job stuck in suspend state after losing leadership in HA Mode

2024-01-10 Thread Matthias Pohl (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-34007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805411#comment-17805411
 ] 

Matthias Pohl edited comment on FLINK-34007 at 1/11/24 7:36 AM:


I still don't fully understand the error you shared: Shouldn't the 
KubernetesClientException resolve itself because the logic runs in a loop? Is 
the stacktrace you shared a one-time thing, or does it recur (which would 
confirm the execution in the loop and indicate that the ConfigMap is in some 
odd state)? Another thing I'm wondering is why the ConfigMap was concurrently 
updated (which caused the KubernetesClientException, as far as I understand) 
when there's only one JM running. Are there other processes accessing the 
ConfigMap?

{quote}
[...] flink will not able to restart services (such RM and dispatcher) as 
DefaultLeaderRetrievalService is stopped also [...]
{quote}
The DefaultLeaderRetrievalService is not in charge of restarting any services. 
The LeaderElectionService will trigger the restart of any shut-down services 
(in our case the SessionDispatcherLeaderProcess, which would be started by the 
DefaultDispatcherRunner; the latter maintains the Dispatcher's leader election) 
as soon as the JobManager gets the leadership again.
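
Just to spell out what I mean by "the logic runs in a loop": the renewal logic 
is structured roughly like the sketch below (a simplification, not the fabric8 
implementation; {{tryRenew}} is a hypothetical stand-in for the 
optimistic-concurrency update of the leader ConfigMap). A single 
KubernetesClientException, e.g. a 409 caused by a concurrent update, would just 
surface again on the next tick and normally heal itself, unless the ConfigMap 
is stuck in some odd state:

{code:java}
import io.fabric8.kubernetes.client.KubernetesClientException;

import java.time.Duration;

public class RenewLoopSketch {

    /** Hypothetical helper: one optimistic-concurrency update of the leader ConfigMap. */
    interface LeaseAction {
        void tryRenew() throws KubernetesClientException;
    }

    // A renew loop in the spirit of LeaderElector: a single failed renewal (e.g. a 409
    // caused by a concurrent ConfigMap update) is not fatal; the loop simply tries again
    // on the next tick until the renew deadline expires. Only then is leadership given up.
    static boolean renewUntilDeadline(LeaseAction action, Duration retryPeriod, Duration renewDeadline)
            throws InterruptedException {
        long deadline = System.nanoTime() + renewDeadline.toNanos();
        while (System.nanoTime() < deadline) {
            try {
                action.tryRenew();
                return true; // renewed successfully; a transient conflict resolved itself
            } catch (KubernetesClientException e) {
                // e.g. "the object has been modified" (HTTP 409): retry on the next tick
            }
            Thread.sleep(retryPeriod.toMillis());
        }
        return false; // renew deadline exceeded -> leadership is revoked
    }
}
{code}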


was (Author: mapohl):
I still don't fully understand the error you shared: Shouldn't the 
KubernetesClientException resolve itself because the logic runs in a loop? Is 
this stacktrace you shared only a one-time thing or does it reoccur (which 
would confirm the execution in the loop and indicate that the ConfigMap is in 
some odd state)? Another thing I'm wondering is why the ConfigMap was 
concurrently updated (which caused the KubernetesClientException as far as I 
understand) when there's only one JM running. Are there other processes 
accessing the ConfigMap?

{quote}
[...] flink will not able to restart services (such RM and dispatcher) as 
DefaultLeaderRetrievalService is stopped also [...]
{quote}
The DefaultLeaderRetrievalService is not in charge of restarting any services. 
The LeaderElectionService will trigger the restart of any shut down services 
(in that case the SessionDispatcherLeaderProcess which would be started by the 
DefaultDispatcherRunner; the latter one maintains the Dispatcher's leader 
election) as soon as the JobManager gets the leadership again.

> Flink Job stuck in suspend state after losing leadership in HA Mode
> ---
>
> Key: FLINK-34007
> URL: https://issues.apache.org/jira/browse/FLINK-34007
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.16.3, 1.17.2, 1.18.1, 1.18.2
>Reporter: Zhenqiu Huang
>Priority: Major
> Attachments: Debug.log, job-manager.log
>
>
> The observation is that Job manager goes to suspend state with a failed 
> container not able to register itself to resource manager after timeout.
> JM Log, see attached
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)