Using Flink 1.11.2 on Java 11, session cluster with 16 jobs running on AWS ECS 
instances.  The cluster has 3 JMs and 3 TMs; a separate ZooKeeper cluster has 
3 nodes.

One of our TaskManagers crashed today, and the crash seems to be rooted in a 
ZooKeeper timeout.  We are wondering whether any tuning on our side might be 
causing this timeout, or could help avoid it.  Any help will be greatly 
appreciated.
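
For reference, these are the ZooKeeper client options in flink-conf.yaml that, 
as far as we understand, govern this behaviour.  The values shown are what we 
believe to be the documented Flink 1.11 defaults; they are listed only to show 
which knobs we are asking about, and are not a verified copy of our production 
configuration or a recommendation.

```yaml
# ZooKeeper client settings for Flink HA (values assumed to be the 1.11 defaults)
high-availability.zookeeper.client.session-timeout: 60000     # ms
high-availability.zookeeper.client.connection-timeout: 15000  # ms
high-availability.zookeeper.client.retry-wait: 5000           # ms between retries
high-availability.zookeeper.client.max-retry-attempts: 3
```

Note that the session timeout requested by the client is negotiated with the 
ZooKeeper servers, so the effective value may be lower than what is set here.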

The first sign of trouble in the log is the following:

2021-01-27 11:16:39,795 WARN  
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client 
session timed out, have not heard from server in 34951ms for sessionid 
0x1400000c01570036
2021-01-27 11:16:39,795 INFO  
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Client 
session timed out, have not heard from server in 34951ms for sessionid 
0x1400000c01570036, closing socket connection and attempting reconnect
2021-01-27 11:16:39,897 INFO  
org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager
 [] - State change: SUSPENDED
2021-01-27 11:16:39,969 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,969 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job 7613291aea3f4892a0deed0e7036e229 with leader id 
8959b1fb00fdd4e3d28daade48204e1f lost leadership.
2021-01-27 11:16:39,969 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,969 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job 3230dacf7fa0b8b8f9fe1c77ebdde2bb with leader id 
bccda87aa8ab14f23e98a4b6d2bf4081 lost leadership.
2021-01-27 11:16:39,969 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,969 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job 8f2ee940006ebb6d8f6d12e3db917da3 with leader id 
b72d64c2ec112d96cc3b93697d85478d lost leadership.
2021-01-27 11:16:39,969 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,969 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job aaec26e3924e81c12bd5a6d71f6c0d77 with leader id 
8d91fefd14539d11d60a16e0e5cd45b1 lost leadership.
2021-01-27 11:16:39,969 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,969 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job 2d5f912867ff70a58638aff51c7f6f33 with leader id 
b24724d3e03bee3486fdc5dc616b4a9c lost leadership.
2021-01-27 11:16:39,969 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,969 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job 29eb631a7a07aa6b2c0224972b9937bb with leader id 
8479de79b7eda73fca6593da93c04027 lost leadership.
2021-01-27 11:16:39,970 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,970 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job bc7688332e73f330f08c95428630b99e with leader id 
a541d5eb3b60d29afc3a16cab2f742e7 lost leadership.
2021-01-27 11:16:39,970 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,970 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,970 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job a70b0023b705c39fa66f47f1a666b65d with leader id 
a0bfc94c9ff40689a7143396cafe4ac7 lost leadership.
2021-01-27 11:16:39,970 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,970 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job 4c929f573971b8520a76ee1dfe5c3e35 with leader id 
922675f382f87225300696bae21841cc lost leadership.
2021-01-27 11:16:39,970 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,970 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job a6eb4833baac19216d7ffd189ec7be4d with leader id 
920ff4d6f778fcc5c0ad41e352914f46 lost leadership.
2021-01-27 11:16:39,970 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService [] - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2021-01-27 11:16:39,970 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - JobManager 
for job fcb8e204e9efb85c5af46cfdeb29c743 with leader id 
826bb52be9c8e80059eaf5f78c614252 lost leadership.
2021-01-27 11:16:40,723 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Close 
JobManager connection for job 7613291aea3f4892a0deed0e7036e229.
2021-01-27 11:16:40,724 INFO  org.apache.flink.runtime.taskmanager.Task         
           [] - Attempting to fail task externally EnrichTradeWithBlockSize -> 
LessThanBlockSize (4/4) (628f0445570d0df74ce62c2d0fb9b5c1).
2021-01-27 11:16:40,724 WARN  org.apache.flink.runtime.taskmanager.Task         
           [] - EnrichTradeWithBlockSize -> LessThanBlockSize (4/4) 
(628f0445570d0df74ce62c2d0fb9b5c1) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: JobManager responsible for 
7613291aea3f4892a0deed0e7036e229 lost the leadership.
        at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1415)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
        at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:173)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
        at 
org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:1852)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
        at java.util.Optional.ifPresent(Optional.java:183) ~[?:?]
        at 
org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:1851)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
        at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
        at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
        at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.actor.Actor.aroundReceive(Actor.scala:517) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.actor.Actor.aroundReceive$(Actor.scala:515) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.actor.ActorCell.invoke(ActorCell.scala:561) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.dispatch.Mailbox.run(Mailbox.scala:225) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.dispatch.Mailbox.exec(Mailbox.scala:235) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at 
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
        at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
