[jira] [Updated] (FLINK-33998) Flink Job Manager restarted after kube-apiserver connection intermittent

2024-01-04 Thread Xiangyan (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-33998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangyan updated FLINK-33998:
-
Description: 
We are running Flink on AWS EKS and experienced a Job Manager restart issue when 
the EKS control plane scaled up or in.

I can reproduce this issue in my local environment too.

Since I have no control over the EKS kube-apiserver, I built my own Kubernetes 
cluster with the following setup:
 * Two kube-apiserver instances, with only one running at a time;
 * Multiple Flink clusters deployed (with Flink Operator 1.4 and Flink 1.13);
 * Flink Job Manager HA enabled;
 * Job Manager leader election timeouts configured:

{code:yaml}
high-availability.kubernetes.leader-election.lease-duration: "60s"
high-availability.kubernetes.leader-election.renew-deadline: "60s"{code}
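The interaction between the 60s timeouts above and the leadership revocation can be sketched with a small, purely illustrative simulation (the helper below is hypothetical; the real logic lives in Flink's fabric8-based Kubernetes leader elector): a JM that resumes renewing its lease after the blip keeps leadership, while one whose renewals stop is revoked once the lease duration elapses.

```python
LEASE_DURATION_S = 60  # mirrors leader-election.lease-duration above


def leadership_at(now_s, renewals_s):
    """'leader' while the newest lease renewal at or before now_s is
    still within the lease duration, otherwise 'revoked'."""
    past = [t for t in renewals_s if t <= now_s]
    last = max(past, default=None)
    if last is None:
        return "revoked"
    return "leader" if now_s - last < LEASE_DURATION_S else "revoked"


# Healthy JM: renews every 10s and resumes right after the blip at t=0.
healthy = list(range(0, 121, 10))
# Failed JM: last successful renewal at the switch-over (t=0), none after.
failed = [0]

assert leadership_at(59, failed) == "leader"    # still inside the 60s lease
assert leadership_at(60, failed) == "revoked"   # one lease duration later
assert leadership_at(60, healthy) == "leader"   # renewals kept the lease alive
```

This matches the observed one-minute gap between the connection loss at 05:53:08 and the revocation at 05:54:08.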
For each test, I switch the running kube-apiserver from one instance to the 
other. While the kube-apiserver is switching over, I can see that some Job 
Managers restart while others keep running normally.

Here is an example. When the kube-apiserver switched over at 05:*53*:08, both 
JMs lost their connection to the kube-apiserver, but there were no further 
connection errors after a few seconds; I assume the connections recovered via 
retries.

However, one of the JMs (the second one in the attached screenshot) reported a 
"DefaultDispatcherRunner was revoked the leadership" error after the leader 
election timeout (at 05:*54*:08) and then restarted itself, while the other JM 
kept running normally.

From the kube-apiserver audit logs, the healthy JM was able to renew its leader 
lease after the interruption, but there was no lease renewal request from the 
failed JM until it restarted.
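As a rough way to cross-check audit logs like these (the helper and the ConfigMap name below are hypothetical; the field names follow the audit.k8s.io/v1 Event schema), one can pull the timestamps of update requests against the HA leader lock object out of a JSON-lines audit log:

```python
import json


def renewal_timestamps(audit_lines, lock_name):
    """Collect requestReceivedTimestamp of every 'update' request
    against the named lock object (ConfigMap or Lease) from a
    JSON-lines kube-apiserver audit log."""
    out = []
    for line in audit_lines:
        ev = json.loads(line)
        ref = ev.get("objectRef", {})
        if ev.get("verb") == "update" and ref.get("name") == lock_name:
            out.append(ev["requestReceivedTimestamp"])
    return out


# Synthetic sample: one renewal (update) and one unrelated read.
sample = [
    json.dumps({"verb": "update",
                "objectRef": {"resource": "configmaps",
                              "name": "myflink-dispatcher-leader"},
                "requestReceivedTimestamp": "2024-01-04T05:53:30Z"}),
    json.dumps({"verb": "get",
                "objectRef": {"resource": "configmaps",
                              "name": "myflink-dispatcher-leader"},
                "requestReceivedTimestamp": "2024-01-04T05:53:31Z"}),
]
assert renewal_timestamps(sample, "myflink-dispatcher-leader") == [
    "2024-01-04T05:53:30Z"]
```

A gap of more than the lease duration between consecutive timestamps for one JM's lock would confirm the missing renewals described above.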

 


> Flink Job Manager restarted after kube-apiserver connection intermittent
> 
>
> Key: FLINK-33998
> URL: https://issues.apache.org/jira/browse/FLINK-33998
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / Kubernetes
>Affects Versions: 1.13.6
> Environment: Kubernetes 1.24
> Flink Operator 1.4
> Flink 1.13.6
>Reporter: Xiangyan
>Priority: Major
> Attachments: audit-log-no-restart.txt, audit-log-restart.txt, 
> connection timeout.png, jm-no-restart4.log, jm-restart4.log
>
>
