When the APIServer or etcd of your K8s cluster is under heavy load, the fabric8 Kubernetes client might time out when watching/renewing/getting the ConfigMap.
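
If you want to double-check what those HA ConfigMaps currently contain (e.g. which address each component has recorded as its leader, and whether the leader annotation is still being renewed), something along these lines should work. Note that the ConfigMap names below are only an assumption based on the default <cluster-id>-<component>-leader naming and may differ in your deployment:

```
# List the leader ConfigMaps created by Flink's Kubernetes HA services
kubectl -n <namespace> get configmaps | grep leader

# Dump one of them to inspect the recorded leader address and the
# leader-election annotation (the names here are placeholders)
kubectl -n <namespace> get configmap <cluster-id>-resourcemanager-leader -o yaml
```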
I think you could increase the read/connect timeouts of the HTTP client (the default is 10s) and give it a try:

env.java.opts: "-Dkubernetes.connection.timeout=30000 -Dkubernetes.request.timeout=30000"

as well as the leader-election timeouts:

high-availability.kubernetes.leader-election.lease-duration: 30s
high-availability.kubernetes.leader-election.renew-deadline: 30s

After you apply these configurations, I think the Flink cluster should be able to tolerate a "not-very-good" network environment. Moreover, if you could share the logs of the failed JobManager, it would be easier for the community to debug the issue (see the sketch below the quoted thread for one way to capture them).

Best,
Yang

Matthias Pohl <matth...@ververica.com> wrote on Fri, May 28, 2021 at 11:37 PM:

> Hi Enrique,
> thanks for reaching out to the community. I'm not 100% sure what problem you're facing. The log messages you're sharing could mean that the Flink cluster still behaves as normal, with some outages and the HA functionality kicking in.
>
> The behavior you're seeing with the leaders for the different actors (i.e. RestServer, Dispatcher, ResourceManager) being located on different hosts is fine and not an indication of something going wrong either.
>
> It might help to share the entire logs with us if you need assistance in investigating your issue.
>
> Best,
> Matthias
>
> On Thu, May 27, 2021 at 12:42 PM Enrique <enriquela...@gmail.com> wrote:
>
>> To add to my post: instead of using the pod IP for the `jobmanager.rpc.address` configuration, we start each JM pod with the fully qualified name `--host <pod-name>.<stateful-set-name>.ns.svc:8081`, and this address gets persisted to the ConfigMaps. In some scenarios, the leader address in the ConfigMaps might differ.
>>
>> For example, let's assume I have 3 JMs:
>>
>> jm-0.jm-statefulset.ns.svc:8081 <-- Leader
>> jm-1.jm-statefulset.ns.svc:8081
>> jm-2.jm-statefulset.ns.svc:8081
>>
>> I have seen the ConfigMaps in the following state:
>>
>> RestServer ConfigMap Address: jm-0.jm-statefulset.ns.svc:8081
>> Dispatcher ConfigMap Address: jm-1.jm-statefulset.ns.svc:8081
>> ResourceManager ConfigMap Address: jm-0.jm-statefulset.ns.svc:8081
>>
>> Is this the correct behaviour?
>>
>> I have then seen the TM pods fail to connect due to:
>>
>> ```
>> java.util.concurrent.CompletionException:
>> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token
>> not set: Ignoring message
>> RemoteFencedMessage(b870874c1c590d593178811f052a42c9,
>> RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time)))
>> sent to akka.tcp://fl...@jm-1.jm-statefulset.ns.svc:6123/user/rpc/resourcemanager_0
>> because the fencing token is null.
>> ```
>>
>> This is explained by Till in
>> https://issues.apache.org/jira/browse/FLINK-18367?focusedCommentId=17141070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17141070
>>
>> Has anyone else seen this?
>>
>> Thanks!
>>
>> Enrique
>>
>> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
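
For capturing the failed JobManager logs mentioned above, something like the following is usually enough (the pod and namespace names are placeholders; --previous only returns output if the container has restarted):

```
# Logs of the current JobManager container
kubectl -n <namespace> logs <jobmanager-pod> > jobmanager.log

# Logs of the previous container instance, if the pod restarted
kubectl -n <namespace> logs <jobmanager-pod> --previous > jobmanager-previous.log
```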