While I set up to reproduce the issue with debug logs, I would like to share
more information I noticed in the INFO logs.
Below is the sequence of events/exceptions I noticed during the time ZooKeeper
was disrupted.
I apologize in advance as they are a bit verbose.
* ZooKeeper seems to be down and leader election is disrupted:
· 2020-02-27 06:28:23.572 [Curator-ConnectionStateManager-0] level=WARN
o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to
ZooKeeper suspended. The contender
akka.tcp://flink@FOO_BAR:6126/user/resourcemanager no longer participates in
the leader election.
· 2020-02-27 06:28:23.573 [Curator-ConnectionStateManager-0] level=INFO
org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManager for job
FOO_BAR (5a910928a71b469a091be168b0e74722) was revoked leadership at
akka.tcp://flink@FOO_BAR:6126/user/jobmanager_1.
· 2020-02-27 06:28:23.573 [flink-akka.actor.default-dispatcher-9897]
level=INFO o.a.flink.runtime.resourcemanager.StandaloneResourceManager -
ResourceManager akka.tcp://flink@FOO_BAR:6126/user/resourcemanager was revoked
leadership. Clearing fencing token.
· 2020-02-27 06:28:23.574 [flink-akka.actor.default-dispatcher-9897]
level=INFO o.a.flink.runtime.resourcemanager.slotmanager.SlotManager -
Suspending the SlotManager.
· 2020-02-27 06:28:53.577 [Curator-Framework-0] level=ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred
in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Received an error from the LeaderElectionService.
    at org.apache.flink.runtime.dispatcher.Dispatcher.handleError(Dispatcher.java:941)
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:416)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:576)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:572)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
    at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:571)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:740)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderElectionService: Background operation retry gave up
    ... 18 common frames omitted
Caused by: org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)
    ... 10 common frames omitted
* ClusterEntrypoint restarts and tries to connect to ZooKeeper. It seems it
fails for some time but is able to connect later (a note on the connection
timeout follows these logs):
· 2020-02-27 06:28:56.467 [main] level=INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting
StandaloneSessionClusterEntrypoint (Version: 1.7.1, Rev:89eafb4,
Date:14.12.2018 @ 15:48:34 GMT)
· 2020-02-27 06:29:16.477 [main] level=ERROR
o.a.flink.shaded.curator.org.apache.curator.ConnectionState - Connection timed
out for connection string (ZOO_BAR_0:2181, ZOO_BAR_1:2181, ZOO_BAR_2:2181) and
timeout (15000) / elapsed (15969)
org.apache.flink.shaded.curator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
    at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
    at org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
    at org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.fixForNamespace(CuratorFrameworkImpl.java:594)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:158)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)
    at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.start(ZooKeeperLeaderElectionService.java:134)
    at org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.startInternal(WebMonitorEndpoint.java:712)
    at org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:218)
    at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:145)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:215)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:163)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:162)
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:517)
    at org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)
· 2020-02-27 06:30:01.643 [main] level=INFO
o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting
ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
· 2020-02-27 06:30:01.655 [main] level=INFO
o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting
ZooKeeperLeaderElectionService
ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
· 2020-02-27 06:30:01.677 [flink-akka.actor.default-dispatcher-16]
level=INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher -
Dispatcher akka.tcp://flink@FOO_BAR:6126/user/dispatcher was granted leadership
with fencing token 113d78b5-6c33-401b-9f47-2f7a1d6dfefc
· 2020-02-27 06:30:01.677 [flink-akka.actor.default-dispatcher-5]
level=INFO o.a.flink.runtime.resourcemanager.StandaloneResourceManager -
ResourceManager akka.tcp://flink@FOO_BAR:6126/user/resourcemanager was granted
leadership with fencing token a2c453481ea4e0c7722cab1e4dd741db
· 2020-02-27 06:30:06.251 [main-EventThread] level=INFO
org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManager runner for
job HHW_WEATHER_PROCESSOR (5a910928a71b469a091be168b0e74722) was granted
leadership with session id 32637c78-cfd4-4e44-8c10-c551bac40742 at
akka.tcp://flink@FOO_BAR:6126/user/jobmanager_0.
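A side note on the 15000 ms in the ConnectionState error above: that is the
Curator connection timeout, which (as far as I understand) Flink exposes in
flink-conf.yaml; 15000 ms is the default, and raising it may be an option if
ZooKeeper takes longer to come back, e.g.:

high-availability.zookeeper.client.connection-timeout: 30000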
My ZooKeeper knowledge is a bit limited, but I do notice from the logs below
that the ZooKeeper instance came back up and joined the quorum as a follower
before the above highlighted logs appeared on the jobmanager side.
* 2020-02-27 06:29:53,732 [myid:1] - INFO
[QuorumPeer[myid=1]/0.0.0.0:2181:Follower@64] - FOLLOWING - LEADER ELECTION
TOOK - 25069
* 2020-02-27 06:29:53,766 [myid:1] - INFO
[QuorumPeer[myid=1]/0.0.0.0:2181:Learner@332] - Getting a diff from the leader
0x200002bf6
I will set up to reproduce this issue and get debug logs as well.
But in the meantime, do the above highlighted logs confirm that ZooKeeper
became available around that time?
I don't see any logs from JobMaster complaining about not being able to connect
to ZooKeeper after that.
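To double-check availability from the ZooKeeper side, I can also run the
four-letter-word commands against each node (a quick sketch, assuming they are
enabled on the servers; ZOO_BAR_0 is the placeholder host from the logs above):

echo ruok | nc ZOO_BAR_0 2181    # a healthy server answers "imok"
echo stat | nc ZOO_BAR_0 2181    # prints Mode: leader/follower and connected clients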
~ Abhinav Bajaj
From: "Bajaj, Abhinav" <[email protected]>
Date: Wednesday, March 4, 2020 at 12:01 PM
To: Xintong Song <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: JobMaster does not register with ResourceManager in high
availability setup
Thanks Xintong for pointing that out.
I will dig deeper and get back with my findings.
~ Abhinav Bajaj
From: Xintong Song <[email protected]>
Date: Tuesday, March 3, 2020 at 7:36 PM
To: "Bajaj, Abhinav" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: JobMaster does not register with ResourceManager in high
availability setup
Hi Abhinav,
The JobMaster log "Connecting to ResourceManager ..." is printed after the
JobMaster retrieves the ResourceManager address from ZooKeeper. In your case, I
assume there is some ZK problem such that the JM cannot resolve the RM address.
Have you confirmed whether the ZK pods recovered after the second
disruption? And did the address change?
You can also try to enable debug logs for the following components, to see if
there's any useful information.
org.apache.flink.runtime.jobmaster
org.apache.flink.runtime.resourcemanager
org.apache.flink.runtime.highavailability
org.apache.flink.runtime.leaderretrieval
org.apache.zookeeper
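If you are using the default log4j setup that Flink ships with, that amounts to
adding the following lines to conf/log4j.properties on the jobmanager (a
sketch; adjust accordingly if you use a custom logging backend):

log4j.logger.org.apache.flink.runtime.jobmaster=DEBUG
log4j.logger.org.apache.flink.runtime.resourcemanager=DEBUG
log4j.logger.org.apache.flink.runtime.highavailability=DEBUG
log4j.logger.org.apache.flink.runtime.leaderretrieval=DEBUG
log4j.logger.org.apache.zookeeper=DEBUG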
Thank you~
Xintong Song
On Wed, Mar 4, 2020 at 5:42 AM Bajaj, Abhinav
<[email protected]> wrote:
Hi,
We recently came across an issue where JobMaster does not register with
ResourceManager in Fink high availability setup.
Let me share the details below.
Setup
* Flink 1.7.1
* K8s
* High availability mode with a single JobManager and 3 ZooKeeper nodes in
quorum (a sketch of the relevant flink-conf.yaml settings follows).
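For reference, the relevant HA settings in our flink-conf.yaml look roughly
like this (hosts, storage path, and cluster id below are placeholders, not the
actual values):

high-availability: zookeeper
high-availability.zookeeper.quorum: ZOO_BAR_0:2181,ZOO_BAR_1:2181,ZOO_BAR_2:2181
high-availability.storageDir: hdfs:///flink/ha/
high-availability.cluster-id: /flink-cluster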
Scenario
* ZooKeeper pods are disrupted by K8s, which leads to the leadership of
JobMaster & ResourceManager being reset and the Flink job restarting.
Observations
* After the first disruption of ZooKeeper, JobMaster and ResourceManager
were reset & were able to register with each other. Sharing a few logs that
confirm that. The Flink job restarted successfully.
org.apache.flink.runtime.jobmaster.JobMaster - Connecting to
ResourceManager....
o.a.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job
manager....
o.a.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job
manager....
org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully
registered at ResourceManager...
* After another disruption later on the same Flink cluster, JobMaster &
ResourceManager were not connected; the logs below can be noticed, and
eventually the scheduler times out (see the note on the timeout after these
observations).
org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot
request, no ResourceManager connected.
………
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate all requires slots within timeout of 300000 ms……
* I can confirm from the logs that both JobMaster & ResourceManager were
running. JobMaster was trying to recover the job and ResourceManager registered
the taskmanagers.
* The odd thing is that the log for JobMaster trying to connect to
ResourceManager is missing. So I assume JobMaster didn’t try to connect to
ResourceManager.
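From what I can tell, the 300000 ms in the NoResourceAvailableException above
corresponds to Flink's default slot request timeout, which could presumably be
raised in flink-conf.yaml to give recovery more headroom (the value below is
just an example):

slot.request.timeout: 600000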
I can share more logs if required.
Has anyone noticed similar behavior or is this a known issue with Flink 1.7.1?
Any recommendations or suggestions on a fix or workaround?
Appreciate your time and help here.
~ Abhinav Bajaj