Re: New JM pod tries to connect to failed JM pod

huweihua Tue, 19 Apr 2022 04:45:09 -0700

Hi,
After the previous JobManager fails, Kubernetes starts a new JobManager, but the 
leader address stored in HA is still that of the old JobManager. When the 
Dispatcher retrieves this stale leader address, it tries to connect to it.


This error can be ignored. The cluster returns to normal after a while, once 
the new JobManager has become the leader.
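
To check whether a given warning is this harmless case, one option is to compare the address in the log line with the IPs of the JobManager pods that are currently running. The snippet below is only a minimal sketch of that check, not part of Flink: the function names are hypothetical, and the regular expression assumes the Akka message format shown in the quoted log.

```python
import re

# Assumption: the warning has the form seen in the quoted log,
# "Association with remote system [akka.tcp://flink@<ip>:<port>] has failed".
_ASSOCIATION = re.compile(
    r"Association with remote system\s+"
    r"\[akka\.tcp://flink@(\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def remote_address(log_line):
    """Return (ip, port) of the remote system named in the warning, or None."""
    match = _ASSOCIATION.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def targets_dead_jobmanager(log_line, live_jm_ips):
    """True if the warning names an IP that is not a currently running JM pod."""
    address = remote_address(log_line)
    return address is not None and address[0] not in live_jm_ips


# Hypothetical usage with the values from this thread: the new JM pod is
# 10.204.2.138, and the warning points at the failed pod 10.204.0.126.
line = (
    "2022-04-19T00:11:33.102Z Association with remote system "
    "[akka.tcp://flink@10.204.0.126:6123] has failed, "
    "address is now gated for [50] ms."
)
print(targets_dead_jobmanager(line, {"10.204.2.138"}))  # True
```

If the function returns True only for a limited time after a JobManager restart, that matches the behavior described above. If the warnings keep appearing long after the new JobManager has become the leader, a different cause is more likely.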

> On Apr 19, 2022, at 9:25 AM, Alexey Trenikhun <yen...@msn.com> wrote:
> 
> Hello,
> We are running Flink 1.13.6 in Kubernetes with k8s HA; the setup includes 1 
> JM and TM. Recently, in the jobmanager log, I started to see:
> 
> 2022-04-19T00:11:33.102Z Association with remote system 
> [akka.tcp://flink@10.204.0.126:6123] has failed, address is now gated for 
> [50] ms. Reason: [Association failed with 
> [akka.tcp://flink@10.204.0.126:6123]] Caused by: [No response from remote 
> for outbound association. Associate timed out after [20000 ms].]
> 
> I suspect that the root cause is some network issue. But what is very 
> strange is that this log is from pod gsp-jm-424--1-8v5qj (10.204.2.138), and 
> 10.204.0.126 is the IP address of the failed JM pod, gsp-jm-424--1-kdhqp, so 
> it looks like the newer instance of JM (10.204.2.138) is trying to connect to 
> the older, failed instance of JM (10.204.0.126).
> 
> Thanks,
> Alexey
