Xiangyan created FLINK-33998:
--------------------------------

             Summary: Flink Job Manager restarts after intermittent kube-apiserver connection
                 Key: FLINK-33998
                 URL: https://issues.apache.org/jira/browse/FLINK-33998
             Project: Flink
          Issue Type: Bug
          Components: Deployment / Kubernetes
    Affects Versions: 1.13.6
         Environment: Kubernetes 1.24
Flink Operator 1.4
Flink 1.13.6
            Reporter: Xiangyan
         Attachments: audit-log-no-restart.txt, audit-log-restart.txt, connection timeout.png, jm-no-restart4.log, jm-restart4.log


We are running Flink on AWS EKS and experienced Job Manager restarts when the EKS control plane scaled up or in. I can reproduce this issue in my local environment too.

Since I have no control over the EKS kube-apiserver, I built a Kubernetes cluster of my own with the following setup:
* Two kube-apiserver instances, with only one running at a time;
* Multiple Flink clusters deployed (with Flink Operator 1.4 and Flink 1.13);
* Flink Job Manager HA enabled;
* Job Manager leader election timeouts configured:

high-availability.kubernetes.leader-election.lease-duration: "60s"
high-availability.kubernetes.leader-election.renew-deadline: "60s"

For each test, I switch the running kube-apiserver from one instance to the other. While the kube-apiserver is switching, I can see that some Job Managers restart while others keep running normally.

Here is an example. When the kube-apiserver switched over at 05:53:08, both JMs lost their connection to the kube-apiserver, but there were no further connection errors within a few seconds; I assume the connection recovered through retries. However, one of the JMs (the second one in the attached screenshot) reported a "leadership revoked" error after the leader election timeout expired (at 05:54:08) and then restarted itself, while the other JM kept running normally.

From the kube-apiserver audit logs, the healthy JM was able to renew its leader lease after the interruption, but there was no lease renew request at all from the failed JM until it restarted.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
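For context, the two leader-election options above sit alongside the standard Kubernetes HA settings. A minimal flink-conf.yaml sketch for a Flink 1.13 setup like the one described (the cluster-id and storage path are placeholder values, not taken from this report):

```yaml
# Kubernetes-based HA services (Flink 1.13 uses the factory class name)
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Placeholder: unique id for this Flink cluster's HA ConfigMaps
kubernetes.cluster-id: my-flink-cluster
# Placeholder: durable storage for HA metadata (job graphs, checkpoints pointers)
high-availability.storageDir: s3://my-bucket/flink-ha
# Leader election timeouts used in the reproduction above
high-availability.kubernetes.leader-election.lease-duration: "60s"
high-availability.kubernetes.leader-election.renew-deadline: "60s"
```

With these values, a JM that fails to renew its lease within the renew-deadline revokes its own leadership, which matches the "leadership revoked" error and restart observed here roughly one minute after the kube-apiserver switchover.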