Hi,
We recently came across an issue where the JobMaster does not register with
the ResourceManager in a Flink high-availability setup.
Details are below.
Setup
* Flink 1.7.1
* K8s
* High-availability mode with a single JobManager and 3 ZooKeeper nodes in
quorum.
Scenario
* ZooKeeper pods are disrupted by K8s, which resets the leadership of the
JobMaster & ResourceManager and restarts the Flink job.
Observations
* After the first disruption of ZooKeeper, the JobMaster and ResourceManager
were reset and were able to register with each other, and the Flink job
restarted successfully. A few log lines that confirm this:
org.apache.flink.runtime.jobmaster.JobMaster - Connecting to
ResourceManager....
o.a.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job
manager....
o.a.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job
manager....
org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully
registered at ResourceManager...
* After another disruption later on the same Flink cluster, the JobMaster &
ResourceManager did not connect. The logs below appear and the scheduler
eventually times out.
org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot
request, no ResourceManager connected.
………
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate all requires slots within timeout of 300000 ms……
* I can confirm from the logs that both the JobMaster & ResourceManager were
running. The JobMaster was trying to recover the job and the ResourceManager
registered the TaskManagers.
* The odd thing is that the "Connecting to ResourceManager" log line from the
JobMaster is missing this time. So I assume the JobMaster didn’t try to
connect to the ResourceManager.
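In case it helps anyone check their own logs for the same pattern, here is a
rough Python sketch. It counts how many "no ResourceManager connected" lines
appear after the last "Connecting to ResourceManager" line. The function name
is made up for illustration, the marker strings are copied from the log lines
quoted above, and it assumes each log entry sits on a single line in the real
log file (the lines above are wrapped by the mail client).

```python
# Marker substrings copied from the log lines quoted in this mail.
CONNECTING = "Connecting to ResourceManager"
NO_RM = "Cannot serve slot request, no ResourceManager connected"


def no_rm_after_last_connect(lines):
    """Count NO_RM lines seen after the most recent CONNECTING line.

    A non-zero result means the SlotPool complained about a missing
    ResourceManager and no later connection attempt was logged. This is
    only a hint, not proof, that the JobMaster never tried to reconnect.
    """
    count = 0
    for line in lines:
        if CONNECTING in line:
            # A new connection attempt was logged; start counting afresh.
            count = 0
        elif NO_RM in line:
            count += 1
    return count
```

It can be run over a JobManager log with something like
`no_rm_after_last_connect(open(path, encoding="utf-8", errors="replace"))`,
where `path` is wherever your JobManager log lives.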
I can share more logs if required.
Has anyone noticed similar behavior, or is this a known issue with Flink 1.7.1?
Any recommendations or suggestions for a fix or workaround?
Appreciate your time and help here.
~ Abhinav Bajaj