Re: JobMaster does not register with ResourceManager in high availability setup

Bajaj, Abhinav Wed, 04 Mar 2020 15:30:52 -0800

While I setup to reproduce the issue with debug logs, I would like to share 
more information I noticed in INFO logs.


Below is the sequence of events/exceptions I notice during the time zookeeper 
was disrupted.
I apologize in advance as they are a bit verbose.


  *   Zookeeper seems to be down and leader election is disrupted –



·         2020-02-27 06:28:23.572 [Curator-ConnectionStateManager-0] level=WARN 
 o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Connection to 
ZooKeeper suspended. The contender 
akka.tcp://flink@FOO_BAR:6126/user/resourcemanager no longer participates in 
the leader election.

·         2020-02-27 06:28:23.573 [Curator-ConnectionStateManager-0] level=INFO 
 org.apache.flink.runtime.jobmaster.JobManagerRunner  - JobManager for job 
FOO_BAR (5a910928a71b469a091be168b0e74722) was revoked leadership at 
akka.tcp://flink@ FOO_BAR:6126/user/jobmanager_1.

·         2020-02-27 06:28:23.573 [flink-akka.actor.default-dispatcher-9897] 
level=INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - 
ResourceManager akka.tcp://flink@ FOO_BAR:6126/user/resourcemanager was revoked 
leadership. Clearing fencing token.

·         2020-02-27 06:28:23.574 [flink-akka.actor.default-dispatcher-9897] 
level=INFO  o.a.flink.runtime.resourcemanager.slotmanager.SlotManager  - 
Suspending the SlotManager.

·         2020-02-27 06:28:53.577 [Curator-Framework-0] level=ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Fatal error occurred 
in the cluster entrypoint.

org.apache.flink.runtime.dispatcher.DispatcherException: Received an error from 
the LeaderElectionService.

        at 
org.apache.flink.runtime.dispatcher.Dispatcher.handleError(Dispatcher.java:941)

        at 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:416)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:576)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:572)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)

        at 
org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:571)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:740)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)

        at java.util.concurrent.FutureTask.run(FutureTask.java:266)

        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)

        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)

        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748)

Caused by: org.apache.flink.util.FlinkException: Unhandled error in 
ZooKeeperLeaderElectionService: Background operation retry gave up

        ... 18 common frames omitted

Caused by: 
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException$ConnectionLossException:
 KeeperErrorCode = ConnectionLoss

        at 
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException.create(KeeperException.java:99)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)

        ... 10 common frames omitted



  *   ClusterEntrypoint restarts and tries to connect to Zookeeper. It seems 
its fails for some time but able to connect later -



·         2020-02-27 06:28:56.467 [main] level=INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -  Starting 
StandaloneSessionClusterEntrypoint (Version: 1.7.1, Rev:89eafb4, 
Date:14.12.2018 @ 15:48:34 GMT)

·         2020-02-27 06:29:16.477 [main] level=ERROR 
o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Connection timed 
out for connection string (ZOO_BAR_0:2181, ZOO_BAR_1:2181, ZOO_BAR_2:2181) and 
timeout (15000) / elapsed (15969)

org.apache.flink.shaded.curator.org.apache.curator.CuratorConnectionLossException:
 KeeperErrorCode = ConnectionLoss

        at 
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)

        at 
org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)

        at 
org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)

        at 
org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.fixForNamespace(CuratorFrameworkImpl.java:594)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:158)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)

        at 
org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)

        at 
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.start(ZooKeeperLeaderElectionService.java:134)

        at 
org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.startInternal(WebMonitorEndpoint.java:712)

        at 
org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:218)

        at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:145)

        at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:215)

        at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:163)

        at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)

        at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:162)

        at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:517)

        at 
org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)

·         2020-02-27 06:30:01.643 [main] level=INFO  
o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting 
ZooKeeperLeaderElectionService 
ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.

·         2020-02-27 06:30:01.655 [main] level=INFO  
o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting 
ZooKeeperLeaderElectionService 
ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.

·         2020-02-27 06:30:01.677 [flink-akka.actor.default-dispatcher-16] 
level=INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - 
Dispatcher akka.tcp://flink@FOO_BAR:6126/user/dispatcher was granted leadership 
with fencing token 113d78b5-6c33-401b-9f47-2f7a1d6dfefc

·         2020-02-27 06:30:01.677 [flink-akka.actor.default-dispatcher-5] 
level=INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - 
ResourceManager akka.tcp://flink@FOO_BAR:6126/user/resourcemanager was granted 
leadership with fencing token a2c453481ea4e0c7722cab1e4dd741db

·         2020-02-27 06:30:06.251 [main-EventThread] level=INFO  
org.apache.flink.runtime.jobmaster.JobManagerRunner  - JobManager runner for 
job HHW_WEATHER_PROCESSOR (5a910928a71b469a091be168b0e74722) was granted 
leadership with session id 32637c78-cfd4-4e44-8c10-c551bac40742 at 
akka.tcp://flink@FOO_BAR:6126/user/jobmanager_0.


My zookeeper knowledge is a bit limited but I do notice that the below log from 
zookeeper instance came back up and joined the quorum as follower before the 
above highlighted logs on jobmanager side.

  *   2020-02-27 06:29:53,732 [myid:1] - INFO 
[QuorumPeer[myid=1]/0.0.0.0:2181:Follower@64] - FOLLOWING - LEADER ELECTION 
TOOK - 25069
  *   2020-02-27 06:29:53,766 [myid:1] - INFO 
[QuorumPeer[myid=1]/0.0.0.0:2181:Learner@332] - Getting a diff from the leader 
0x200002bf6



I will setup to reproduce this issue and get debug logs as well.


But in meantime, does the above hightlighted logs confirm that zookeeper become 
available around that time?
I don’t see any logs from JobMaster complaining for not being able to connect 
to zookeeper after that.

~ Abhinav Bajaj

From: "Bajaj, Abhinav" <abhinav.ba...@here.com>
Date: Wednesday, March 4, 2020 at 12:01 PM
To: Xintong Song <tonysong...@gmail.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: JobMaster does not register with ResourceManager in high 
availability setup

Thanks Xintong for pointing that out.

I will dig deeper and get back with my findings.

~ Abhinav Bajaj

From: Xintong Song <tonysong...@gmail.com>
Date: Tuesday, March 3, 2020 at 7:36 PM
To: "Bajaj, Abhinav" <abhinav.ba...@here.com>
Cc: "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: JobMaster does not register with ResourceManager in high 
availability setup

Hi Abhinav,

The JobMaster log "Connecting to ResourceManager ..." is printed after 
JobMaster retrieve ResourceManager address from ZooKeeper. In your case, I 
assume there's some ZK problem that JM cannot resolve RM address.



Have you confirmed whether the ZK pods are recovered after the second 
disruption? And does the address changed?



You can also try to enable debug logs for the following components, to see if 
there's any useful information.

org.apache.flink.runtime.jobmaster

org.apache.flink.runtime.resourcemanager

org.apache.flink.runtime.highavailability

org.apache.flink.runtime.leaderretrieval

org.apache.zookeeper



Thank you~

Xintong Song


On Wed, Mar 4, 2020 at 5:42 AM Bajaj, Abhinav 
<abhinav.ba...@here.com<mailto:abhinav.ba...@here.com>> wrote:
Hi,

We recently came across an issue where JobMaster does not register with 
ResourceManager in Fink high availability setup.
Let me share the details below.

Setup

  *   Flink 1.7.1
  *   K8s
  *   High availability mode with a single Jobmanager and 3 zookeeper nodes in 
quorum.

Scenario

  *   Zookeeper pods are disrupted by K8s that leads to resetting of leadership 
of JobMaster & ResourceManager and restart of the Flink job.

Observations

  *   After the first disruption of Zookeeper, JobMaster and ResourceManager 
were reset & were able to register with each other. Sharing few logs that 
confirm that. Flink job restarted successfully.

org.apache.flink.runtime.jobmaster.JobMaster  - Connecting to 
ResourceManager....

o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering job 
manager....

o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registered job 
manager....

org.apache.flink.runtime.jobmaster.JobMaster  - JobManager successfully 
registered at ResourceManager...

  *    After another disruption later on the same Flink cluster, JobMaster & 
ResourceManager were not connected and below logs can be noticed and eventually 
scheduler times out.
org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Cannot serve slot 
request, no ResourceManager connected.

       ………

        
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not allocate all requires slots within timeout of 300000 ms……

  *   I can confirm from the logs that both JobMaster & ResourceManager were 
running. JobMaster was trying to recover the job and ResourceManager registered 
the taskmanagers.
  *   The odd thing is that the log for JobMaster trying to connect to 
ResourceManager is missing. So I assume JobMaster didn’t try to connect to 
ResourceManager.

I can share more logs if required.

Has anyone noticed similar behavior or is this a known issue with Flink 1.7.1?
Any recommendations or suggestions on fix or workaround?

Appreciate your time and help here.

~ Abhinav Bajaj

Re: JobMaster does not register with ResourceManager in high availability setup

Reply via email to