Thanks Xintong for pointing that out.

I will dig deeper and get back with my findings.

~ Abhinav Bajaj

From: Xintong Song <[email protected]>
Date: Tuesday, March 3, 2020 at 7:36 PM
To: "Bajaj, Abhinav" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: JobMaster does not register with ResourceManager in high 
availability setup

Hi Abhinav,

The JobMaster log "Connecting to ResourceManager ..." is printed after the 
JobMaster retrieves the ResourceManager address from ZooKeeper. In your case, I 
assume there is some ZK problem that prevents the JM from resolving the RM address.



Have you confirmed whether the ZK pods recovered after the second 
disruption? And did the address change?



You can also try to enable debug logs for the following components, to see if 
there's any useful information.

org.apache.flink.runtime.jobmaster

org.apache.flink.runtime.resourcemanager

org.apache.flink.runtime.highavailability

org.apache.flink.runtime.leaderretrieval

org.apache.zookeeper
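
As a rough sketch, assuming the cluster still uses the log4j 1.x properties 
file that ships with Flink 1.7 (conf/log4j.properties, or 
conf/log4j-console.properties when logging to the console, which is an 
assumption about your image), the loggers above could be raised to DEBUG 
like this:

```properties
# Assumes a log4j 1.x properties file; adapt if you use logback instead.
log4j.logger.org.apache.flink.runtime.jobmaster=DEBUG
log4j.logger.org.apache.flink.runtime.resourcemanager=DEBUG
log4j.logger.org.apache.flink.runtime.highavailability=DEBUG
log4j.logger.org.apache.flink.runtime.leaderretrieval=DEBUG
log4j.logger.org.apache.zookeeper=DEBUG
```

The JobManager pod has to be restarted for the change to take effect.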



Thank you~

Xintong Song


On Wed, Mar 4, 2020 at 5:42 AM Bajaj, Abhinav <[email protected]> wrote:
Hi,

We recently came across an issue where JobMaster does not register with 
ResourceManager in a Flink high availability setup.
Let me share the details below.

Setup

  *   Flink 1.7.1
  *   K8s
  *   High availability mode with a single JobManager and 3 ZooKeeper nodes in 
quorum.

Scenario

  *   ZooKeeper pods are disrupted by K8s, which resets the leadership of 
JobMaster & ResourceManager and restarts the Flink job.

Observations

  *   After the first disruption of ZooKeeper, JobMaster and ResourceManager 
were reset & were able to register with each other. Here are a few log lines 
that confirm that. The Flink job restarted successfully.

org.apache.flink.runtime.jobmaster.JobMaster  - Connecting to 
ResourceManager....

o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering job 
manager....

o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registered job 
manager....

org.apache.flink.runtime.jobmaster.JobMaster  - JobManager successfully 
registered at ResourceManager...

  *   After a later disruption on the same Flink cluster, JobMaster & 
ResourceManager did not connect. The logs below appear, and the scheduler 
eventually times out.
org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Cannot serve slot 
request, no ResourceManager connected.

       ………

        
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not allocate all requires slots within timeout of 300000 ms……

  *   I can confirm from the logs that both JobMaster & ResourceManager were 
running. JobMaster was trying to recover the job, and ResourceManager registered 
the TaskManagers.
  *   The odd thing is that the log for JobMaster trying to connect to 
ResourceManager is missing. So I assume JobMaster didn’t try to connect to 
ResourceManager.
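
To make the comparison between the two disruptions easier to repeat, here is a 
minimal Python sketch that counts the messages quoted above in a JobManager 
log. The message fragments are taken from this thread. The function name and 
the marker keys are made up for illustration, and the sketch assumes each log 
event sits on a single line.

```python
# Message fragments quoted in this thread; the keys are illustrative names.
MARKERS = {
    "jm_connecting": "Connecting to ResourceManager",
    "rm_registering": "Registering job manager",
    "rm_registered": "Registered job manager",
    "jm_registered": "JobManager successfully registered at ResourceManager",
    "no_rm": "Cannot serve slot request, no ResourceManager connected",
}


def count_markers(lines):
    """Count how many log lines contain each known message fragment."""
    counts = {name: 0 for name in MARKERS}
    for line in lines:
        # Collapse runs of whitespace so spacing differences do not matter.
        normalized = " ".join(line.split())
        for name, fragment in MARKERS.items():
            if fragment in normalized:
                counts[name] += 1
    return counts
```

Passing an open log file (or the lines between two leadership changes) to 
count_markers shows whether "jm_connecting" stays at zero while "no_rm" 
keeps growing, which is the pattern described for the second disruption.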

I can share more logs if required.

Has anyone noticed similar behavior, or is this a known issue with Flink 1.7.1?
Any recommendations or suggestions for a fix or workaround?

Appreciate your time and help here.

~ Abhinav Bajaj

