Hi Marcus, thanks for reaching out with your problem. I'm not very experienced with the HA setup, but Till (in CC) might be able to help you.
Best,
Fabian

2017-09-14 16:57 GMT+02:00 Marcus Clendenin <[email protected]>:

> Hi all,
>
> I am having an issue where one of our TaskManagers that is running in
> high availability mode is timing out on the connection to ZooKeeper. This
> is causing it to retry the connection to ZooKeeper, which succeeds. The
> issue is that once the TaskManager is reconnected to ZooKeeper, it is then
> unable to connect to the JobManager. Does anybody know why this is
> happening? This is on Flink 1.3.1 with checkpointing using RocksDB.
>
> Stack Trace:
>
> 2017-09-14 09:35:16,033 INFO org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 79531ms for sessionid 0x15e428f9953001f, closing socket connection and attempting reconnect
>
> 2017-09-14 09:35:17,170 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
>
> 2017-09-14 09:35:17,528 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>
> 2017-09-14 09:35:17,796 WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: unable to find LoginModule class: org.apache.kafka.common.security.plain.PlainLoginModule Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
>
> 2017-09-14 09:35:17,796 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper21-01/00.000.00.000:2181
>
> 2017-09-14 09:35:17,798 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to zookeeper21-01/00.000.00.000:2181, initiating session
>
> 2017-09-14 09:35:17,958 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Authentication failed
>
> 2017-09-14 09:35:18,261 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@jobmanager1:36491]
>
> 2017-09-14 09:35:18,433 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: LOST
>
> 2017-09-14 09:35:18,433 INFO org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, session 0x15e428f9953001f has expired, closing socket connection
>
> 2017-09-14 09:35:18,433 WARN org.apache.flink.shaded.org.apache.curator.ConnectionState - Session expired event received
>
> 2017-09-14 09:35:18,433 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper lost. Can no longer retrieve the leader from ZooKeeper.
>
> 2017-09-14 09:35:18,693 INFO org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=zookeeper21-01:2181,zookeeper21-02:2181,zookeeper21-03:2181,zookeeper22-01:2181,zookeeper22-02:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@781f10f2
>
> 2017-09-14 09:35:18,757 INFO org.apache.zookeeper.ClientCnxn - EventThread shut down
>
> 2017-09-14 09:35:19,354 WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: unable to find LoginModule class: org.apache.kafka.common.security.plain.PlainLoginModule Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
>
> 2017-09-14 09:35:19,354 INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server zookeeper1/00.000.00.000:2181
>
> 2017-09-14 09:35:19,354 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState - Authentication failed
>
> 2017-09-14 09:35:19,355 INFO org.apache.zookeeper.ClientCnxn - Socket connection established to zookeeper1/00.000.00.000:2181, initiating session
>
> 2017-09-14 09:35:19,358 INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server zookeeper1/00.000.00.000:2181, sessionid = 0x45e446247000012, negotiated timeout = 60000
>
> 2017-09-14 09:35:19,358 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
>
> 2017-09-14 09:35:19,359 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
>
> 2017-09-14 09:35:21,494 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@jobmanager1:36491/user/jobmanager: JobManager is no longer reachable
>
> 2017-09-14 09:35:21,724 INFO org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
>
> 2017-09-14 09:35:21,856 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Map (2/3) (13599aa15283f8c5af1df477cd290629).
>
> 2017-09-14 09:35:21,856 INFO org.apache.flink.runtime.taskmanager.Task - Map (2/3) (13599aa15283f8c5af1df477cd290629) switched from RUNNING to FAILED.
>
> java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@jobmanager1:36491/user/jobmanager: JobManager is no longer reachable
>     at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)
>     at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:311)
>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>     at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
>     at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>     at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:120)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>     at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
>     at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>     at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> 2017-09-14 09:35:21,861 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Map (2/3) (13599aa15283f8c5af1df477cd290629).
>
> 2017-09-14 09:35:21,861 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Timestamps/Watermarks (2/3) (9cf3d208a85e4d88fffd93d0b8152d83).
>
> 2017-09-14 09:35:21,861 INFO org.apache.flink.runtime.taskmanager.Task - Timestamps/Watermarks (2/3) (9cf3d208a85e4d88fffd93d0b8152d83) switched from RUNNING to FAILED.
>
> java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@jobmanager1:36491/user/jobmanager: JobManager is no longer reachable
>     at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)
