Hi Julio, If the single JobManager lost temporarily and reconnected later, it could be regranted leadership. And if you use Flink on Yarn, the Yarn RM (according to configuration) would start a new ApplicationMaster to act as a take-over JobManager.
Best, tison. Julio Biason <julio.bia...@azion.com> 于2018年9月28日周五 上午3:56写道: > Hey guys, > > I'm seeing a weird error happening here: We have our JobManager configured > in HA mode, but with a single JobManager in the cluster (the second one was > in another machine that start showing flaky network, so we removed it). > Everything is running in Standalone mode. > > Sometimes, the jobs are restarting and the JobManager logs shows this: > > org.apache.flink.util.FlinkException: JobManager responsible for > bbbae593c175e0c17c32718a56527ab9 lost the > leadership. > at > org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1167) > > at > org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:137) > at > org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1608) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158) > > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142) > at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > > at > akka.actor.Actor$class.aroundReceive(Actor.scala:502) > at > akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > > at > akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > at > akka.actor.ActorCell.invoke(ActorCell.scala:495) > > at > akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > at > akka.dispatch.Mailbox.run(Mailbox.scala:224) > > at > akka.dispatch.Mailbox.exec(Mailbox.scala:234) > > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > > Caused by: java.util.concurrent.TimeoutException: The heartbeat of > JobManager with id d4ca7942b20bdf87ccf9335f698a5029 timed > out. > at > org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1609) > > ... 15 > more > > > If there is a single JobManager in the cluster... who is taking the > leadership? Is that even possible? > > -- > *Julio Biason*, Sofware Engineer > *AZION* | Deliver. Accelerate. Protect. > Office: +55 51 3083 8101 <callto:+555130838101> | Mobile: +55 51 > <callto:+5551996209291>*99907 0554* >