Hi All,

Flink 1.13.0
I have a Session cluster deployed with StatefulSet + PVs and HA configured within a Kubernetes cluster. I have submitted jobs to it and it all works fine. Most of my jobs are long-running, typically consuming data from Kafka. I have noticed that after some time all of my JobManagers have restarted multiple times and can no longer recover. These are some of the logs I have seen across the JobManager instances (I've put the HA-related part of my config in a P.S. at the bottom of this mail for reference).

This first one doesn't seem harmful, right? It just means multiple JMs are trying to edit the ConfigMap at the same time to become the leader and it is locked? It is the only one marked as ERROR:

```
2021-05-24 23:07:30,962 ERROR io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector - Exception occurred while acquiring lock 'ConfigMapLock: ns - foo-restserver-leader (4a786be1-80e0-4fae-bf75-2dafc5f7526b)'
```

I have seen this one multiple times; it seems to be an issue with the Java version and the OkHttp version (https://github.com/fabric8io/kubernetes-client/pull/2176). We are using JDK 11:

```
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get] for kind: [ConfigMap] with name: [foo-restserver-leader] in namespace: [foo] failed.
```

I have created Roles for RBAC in the cluster, so there shouldn't be an issue with permissions:

```
rules:
  - verbs:
      - get
      - watch
      - list
      - delete
      - create
      - update
    apiGroups:
      - ''
    resources:
      - configmaps
```

This one seems to be a timeout watching ConfigMaps?

```
2021-05-25 01:00:21,470 WARN io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager - Exec Failure
java.net.SocketTimeoutException: timeout
```

Then the JobManager loops endlessly with the following message. I have found the line it comes from (https://github.com/apache/flink/blob/80ad5b3b511a68cce19a53291000c9936e10db17/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L395), but what does this message really mean?

```
org.apache.flink.util.SerializedThrowable: The leading JobMaster id 94293ee005832f68401020e856274c84 did not match the received JobMaster id 8287ad63d2d10239c0839abe06dd4344. This indicates that a JobMaster leader change has happened.
```

This is the final log before the pod fails:

```
WARN org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - The connection was unexpectedly closed by the client.
```

Could this be caused by old HA ConfigMaps being left behind when redeploying the Flink Session cluster, so that it tries to recover those jobs? I have also seen that in some cases the leader address in the three ConfigMaps (Dispatcher, RestServer, and ResourceManager) can differ - is that expected?

Would really appreciate any feedback!

Many thanks,
Enrique
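P.S. For reference, this is roughly the HA-related part of my flink-conf.yaml. The cluster-id and storage path below are placeholders, not my real values:

```
# Kubernetes HA services (Flink 1.13) - cluster-id and storageDir are placeholders
kubernetes.cluster-id: foo
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Durable storage for job graphs etc.; the ConfigMaps only hold pointers into it
high-availability.storageDir: s3://my-bucket/flink/ha
```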
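P.P.S. Regarding my own question about old ConfigMaps: this is how I would inspect and clear the leftover HA ConfigMaps before a redeploy, assuming the app / configmap-type labels that Flink attaches to them (please correct me if the labels differ):

```
# List the HA ConfigMaps Flink created for this cluster-id
kubectl -n foo get configmaps -l app=foo,configmap-type=high-availability

# Delete them only once the cluster is fully shut down and the old jobs
# should NOT be recovered on the next deployment
kubectl -n foo delete configmaps -l app=foo,configmap-type=high-availability
```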