Here is a more complete log:

....
2021-11-01 14:15:15,849 INFO [78-cluster-io-thread-1] org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:181) - Recovered JobGraph(jobId: dfced635fd8c224222a9cbaaf1c5054f).
2021-11-01 14:15:15,849 INFO [78-cluster-io-thread-1] org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:125) - Successfully recovered 1 persisted job graphs.
2021-11-01 14:15:15,856 INFO [78-cluster-io-thread-1] org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(AkkaRpcService.java:232) - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/rpc/dispatcher_1 .
2021-11-01 14:15:22,867 INFO [30-flink-akka.actor.default-dispatcher-3] org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.start(DefaultLeaderElectionService.java:93) - Starting DefaultLeaderElectionService with ZooKeeperLeaderElectionDriver{leaderPath='/leader/dfced635fd8c224222a9cbaaf1c5054f/job_manager_lock'}.
2021-11-01 14:15:22,892 ERROR [30-flink-akka.actor.default-dispatcher-3] org.apache.flink.runtime.entrypoint.ClusterEntrypoint.onFatalError(ClusterEntrypoint.java:454) - Fatal error occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: JobMaster for job dfced635fd8c224222a9cbaaf1c5054f failed.
    at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:873) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:459) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$3(Dispatcher.java:418) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_152]
    at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_152]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) ~[?:1.8.0_152]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
Caused by: org.apache.flink.runtime.jobmaster.JobNotFinishedException: The job (dfced635fd8c224222a9cbaaf1c5054f) has not been finished.
    at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.jobAlreadyDone(JobMasterServiceLeadershipRunner.java:288) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.verifyJobSchedulingStatusAndCreateJobMasterServiceProcess(JobMasterServiceLeadershipRunner.java:276) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$null$8(JobMasterServiceLeadershipRunner.java:262) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfValidLeader(JobMasterServiceLeadershipRunner.java:496) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$startJobMasterServiceProcessAsync$9(JobMasterServiceLeadershipRunner.java:258) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705) ~[?:1.8.0_152]
    at java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:717) ~[?:1.8.0_152]
    at java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2010) ~[?:1.8.0_152]
    at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.startJobMasterServiceProcessAsync(JobMasterServiceLeadershipRunner.java:256) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$grantLeadership$7(JobMasterServiceLeadershipRunner.java:249) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:464) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.grantLeadership(JobMasterServiceLeadershipRunner.java:248) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onGrantLeadership(DefaultLeaderElectionService.java:211) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.isLeader(ZooKeeperLeaderElectionDriver.java:166) ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:693) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:689) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:688) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:567) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.access$700(LeaderLatch.java:65) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:618) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:601) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
2021-11-01 14:15:22,896 INFO [27-StandaloneSessionClusterEntrypoint shutdown hook] org.apache.flink.runtime.entrypoint.ClusterEntrypoint.shutDownAsync(ClusterEntrypoint.java:481) - Shutting StandaloneSessionClusterEntrypoint down with application status UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
2021-11-01 14:15:22,897 INFO [27-StandaloneSessionClusterEntrypoint shutdown hook] org.apache.flink.runtime.rest.RestServerEndpoint.closeAsync(RestServerEndpoint.java:309) - Shutting down rest endpoint.
2021-11-01 14:15:22,923 INFO [52-BlobServer shutdown hook] org.apache.flink.runtime.blob.BlobServer.close(BlobServer.java:345) - Stopped BLOB server at 0.0.0.0:41066
2021-11-01 14:15:22,937 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.lambda$shutDownInternal$5(WebMonitorEndpoint.java:964) - Removing cache directory /tmp/flink-web-85060404-ac4d-44ff-8ffe-bc2235ff0acf/flink-web-ui
2021-11-01 14:15:22,937 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:101) - Stopping DefaultLeaderElectionService.
2021-11-01 14:15:22,938 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.close(ZooKeeperLeaderElectionDriver.java:132) - Closing ZooKeeperLeaderElectionDriver{leaderPath='/leader/rest_server_lock'}
2021-11-01 14:15:22,943 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.rest.RestServerEndpoint.lambda$closeAsync$1(RestServerEndpoint.java:317) - Shut down complete.
2021-11-01 14:15:22,943 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent.closeAsyncInternal(DispatcherResourceManagerComponent.java:162) - Closing components.
2021-11-01 14:15:22,943 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService.stop(DefaultLeaderRetrievalService.java:106) - Stopping DefaultLeaderRetrievalService.
2021-11-01 14:15:22,943 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.close(ZooKeeperLeaderRetrievalDriver.java:108) - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/dispatcher_lock'}.
2021-11-01 14:15:22,943 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService.stop(DefaultLeaderRetrievalService.java:106) - Stopping DefaultLeaderRetrievalService.
2021-11-01 14:15:22,943 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.close(ZooKeeperLeaderRetrievalDriver.java:108) - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}.
2021-11-01 14:15:22,943 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:101) - Stopping DefaultLeaderElectionService.
2021-11-01 14:15:22,943 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.close(ZooKeeperLeaderElectionDriver.java:132) - Closing ZooKeeperLeaderElectionDriver{leaderPath='/leader/dispatcher_lock'}
2021-11-01 14:15:22,944 INFO [100-ForkJoinPool.commonPool-worker-22] org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.closeInternal(AbstractDispatcherLeaderProcess.java:134) - Stopping SessionDispatcherLeaderProcess.
2021-11-01 14:15:22,945 INFO [29-flink-akka.actor.default-dispatcher-2] org.apache.flink.runtime.util.OperaInstanceMigrateManager.stopMigrateCheck(OperaInstanceMigrateManager.java:179) - Start to stop Migrate check...
2021-11-01 14:15:22,945 INFO [29-flink-akka.actor.default-dispatcher-2] org.apache.flink.runtime.util.OperaInstanceMigrateManager.stopMigrateCheck(OperaInstanceMigrateManager.java:184) - Start to stop jmHeartbeat report...
2021-11-01 14:15:22,946 INFO [29-flink-akka.actor.default-dispatcher-2] org.apache.flink.runtime.util.OperaInstanceMigrateManager.stopMigrateCheck(OperaInstanceMigrateManager.java:189) - Shutdown executorService...
2021-11-01 14:15:22,946 INFO [29-flink-akka.actor.default-dispatcher-2] org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager.close(DeclarativeSlotManager.java:240) - Closing the slot manager.
2021-11-01 14:15:22,947 INFO [29-flink-akka.actor.default-dispatcher-2] org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager.suspend(DeclarativeSlotManager.java:212) - Suspending the slot manager.
2021-11-01 14:15:22,950 INFO [29-flink-akka.actor.default-dispatcher-2] org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:101) - Stopping DefaultLeaderElectionService.
2021-11-01 14:15:22,950 INFO [29-flink-akka.actor.default-dispatcher-2] org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.close(ZooKeeperLeaderElectionDriver.java:132) - Closing ZooKeeperLeaderElectionDriver{leaderPath='/leader/resource_manager_lock'}

yidan zhao <hinobl...@gmail.com> wrote on Mon, Nov 1, 2021 at 2:25 PM:

> As the subject says. I have run into this problem before; at that time I emailed the list asking about the cluster restarting continuously.
> This time it is the same problem: the cluster keeps restarting, and my analysis points to the cause given in the subject. The relevant log excerpt is below:
>
> 2021-11-01 14:05:36,954 INFO [78-cluster-io-thread-1]
> org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:181)
> - Recovered JobGraph(jobId: dfced635fd8c224222a9cbaaf1c5054f).
> 2021-11-01 14:05:36,954 INFO [78-cluster-io-thread-1]
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:125)
> - Successfully recovered 1 persisted job graphs.
> 2021-11-01 14:05:36,962 INFO [78-cluster-io-thread-1]
> org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(AkkaRpcService.java:232)
> - Starting RPC endpoint for
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher at
> akka://flink/user/rpc/dispatcher_1 .
> 2021-11-01 14:05:44,810 INFO [94-flink-akka.actor.default-dispatcher-30]
> org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.start(DefaultLeaderElectionService.java:93)
> - Starting DefaultLeaderElectionService with
> ZooKeeperLeaderElectionDriver{leaderPath='/leader/dfced635fd8c224222a9cbaaf1c5054f/job_manager_lock'}.
> 2021-11-01 14:05:44,836 ERROR [94-flink-akka.actor.default-dispatcher-30]
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.onFatalError(ClusterEntrypoint.java:454)
> - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.util.FlinkException: JobMaster for job
> dfced635fd8c224222a9cbaaf1c5054f failed.
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:873)
> ~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
>
> As shown above, the JobGraph is recovered, leader election starts (it looks like the JobMaster's leader election service), and then the JobMaster fails.
>
> Given the above, I would like to know why a JobMaster failure causes the standalone JM process to fail.
> The JM process is shared by all jobs; even if a previously submitted job cannot be recovered after startup, that should not be a reason for the whole process to go down.