Hi All,

Flink 1.13.0
I have a Session cluster deployed with StatefulSet + PVs and HA configured within a Kubernetes cluster. I have submitted jobs to it and it all works fine. Most of my jobs are long-running, typically consuming data from Kafka. I have noticed that after some time all of my JobManagers have restarted multiple times and can no longer recover. These are some of the logs I have seen across the JobManager instances (I've put the HA-related part of my config in a P.S. at the bottom of this mail for reference).

This first one doesn't seem harmful, right? It just means multiple JMs are trying to edit the ConfigMap at the same time to become the leader and it is locked? It is the only one marked as ERROR:

```
2021-05-24 23:07:30,962 ERROR io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector - Exception occurred while acquiring lock 'ConfigMapLock: ns - foo-restserver-leader (4a786be1-80e0-4fae-bf75-2dafc5f7526b)'
```

I have seen this one multiple times; it seems to be an issue with the Java version and the OkHttp version (https://github.com/fabric8io/kubernetes-client/pull/2176). We are using JDK 11:

```
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [ConfigMap] with name: [foo-restserver-leader] in namespace: [foo] failed.
```

I have created Roles for RBAC in the cluster, so there shouldn't be an issue with permissions:

```
rules:
  - verbs:
      - get
      - watch
      - list
      - delete
      - create
      - update
    apiGroups:
      - ''
    resources:
      - configmaps
```

This one seems to be a timeout watching ConfigMaps?

```
2021-05-25 01:00:21,470 WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Exec Failure
java.net.SocketTimeoutException: timeout
```

Then the JobManager loops endlessly with the following message. I have found the line it comes from (https://github.com/apache/flink/blob/80ad5b3b511a68cce19a53291000c9936e10db17/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L395), but what does this message really mean?

```
org.apache.flink.util.SerializedThrowable: The leading JobMaster id 94293ee005832f68401020e856274c84 did not match the received JobMaster id 8287ad63d2d10239c0839abe06dd4344. This indicates that a JobMaster leader change has happened.
```

This is the final log before the pod fails:

```
WARN org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - The connection was unexpectedly closed by the client.
```

Could this be caused by old HA ConfigMaps being left behind when redeploying the Flink Session cluster, so that it tries to recover those jobs? I have also seen that in some cases the leader address in the three ConfigMaps (Dispatcher, RestServer, and ResourceManager) can differ - is that expected?

Would really appreciate any feedback!

Many thanks,
Enrique
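P.S. For reference, this is roughly the HA-related part of my flink-conf.yaml. The cluster-id and storage path below are placeholders, not my real values:

```
# Kubernetes HA services (Flink 1.13) - cluster-id and storageDir are placeholders
kubernetes.cluster-id: foo
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Durable storage for job graphs etc.; the ConfigMaps only hold pointers into it
high-availability.storageDir: s3://my-bucket/flink/ha
```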
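P.P.S. Regarding my own question about old ConfigMaps: this is how I would inspect and clear the leftover HA ConfigMaps before a redeploy, assuming the app / configmap-type labels that Flink attaches to them (please correct me if the labels differ):

```
# List the HA ConfigMaps Flink created for this cluster-id
kubectl -n foo get configmaps -l app=foo,configmap-type=high-availability

# Delete them only once the cluster is fully shut down and the old jobs
# should NOT be recovered on the next deployment
kubectl -n foo delete configmaps -l app=foo,configmap-type=high-availability
```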