Failed to register at the JobManager within the defined maximum connect time
2019-03-14 06:19:19,092 | ERROR | [flink-akka.actor.default-dispatcher-17] | Failed to register at the JobManager within the defined maximum connect time. Shutting down … | org.apache.flink.yarn.YarnTaskManager (slf4j.scala:107)

What are possible solutions to this?
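One knob that directly controls this shutdown is the TaskManager registration timeout. A minimal flink-conf.yaml sketch, assuming a Flink/blink release that still supports the legacy `taskmanager.maximum-registration-duration` key (verify the key and its default against your version's configuration reference); raising it only buys time, the underlying reachability problem still needs fixing:

```yaml
# How long a TaskManager keeps retrying registration at the JobManager
# before giving up and shutting itself down. Raise this if JobManager
# failover is slow, but also check that the TaskManager can actually
# resolve and reach the leading JobManager.
taskmanager.maximum-registration-duration: 10 min

# With HA enabled, TaskManagers discover the leader via ZooKeeper;
# a hard-coded jobmanager.rpc.address (e.g. localhost) is then ignored
# for leader lookup but can still mask misconfiguration elsewhere.
high-availability: zookeeper
```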
Re: Re: blink ha, processes die right after startup
Thanks for your reply. My ZK cluster is working normally, and zk.quorum is set in the blink configuration, but I did not start ZK from blink/bin. Since ZK works fine, I have not been able to pinpoint the cause of this problem; could you please advise how to resolve it? Thank you.

From: ZiLi Chen
Sent: 2019-03-14 17:47
To: user-zh@flink.apache.org
Subject: Re: blink ha, processes die right after startup

Note this line:

ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed

Is your ZK cluster working normally, and has blink actually connected to it?

Best,
tison.

xiao...@chinaunicom.cn wrote on Wed, Mar 13, 2019 at 15:58:
> Hi, All
>
> I set up blink HA with JM on (node1, node2) and TM on (node3, node4, node5), but after startup the node1 process dies and the node2 process cannot start. The errors are as follows:
>
> node1 JobManager log:
> ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
>
> ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:196)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:646)
>
> node2 JobManager log:
> ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.runtime.dispatcher.DispatcherException: Could not start the added job a5ffe00b0bc5688d9a7de5c62b8150e6
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$onAddedJobGraph$31(Dispatcher.java:878)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>
> TaskManager log:
> ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
> java.lang.Exception: Reconnect to RM failed
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$closeResourceManagerConnection$3(TaskExecutor.java:1179)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>
> flink-conf.yaml:
> jobmanager.rpc.address: localhost
> jobmanager.rpc.port: 6123
> jobmanager.heap.mb: 4096
> taskmanager.heap.mb: 4096
> taskmanager.numberOfTaskSlots: 2
> parallelism.default: 6
> taskmanager.managed.memory.size: 256
> yarn.application-attempts: 10
> env.java.home: /opt/jdk1.8.0_171/
> fs.hdfs.hadoopconf: /app/cloudera/parcels/CDH-5.14.0-1.cdh5.14.0.p0.24/lib/hadoop/etc/hadoop/
> taskmanager.network.numberOfBuffers: 1024
> high-availability: zookeeper
> high-availability.storageDir: hdfs://ip:8020/blink/ha/zookeeper/storageDir/
> high-availability.zookeeper.quorum: ip:2181
> high-availability.filesystem.path.jobgraphs: /app/blinkTmp/TaskTmp/jobgraphs/
> state.backend: filesystem
> state.checkpoints.dir: hdfs://ip:8020/blink/flink-checkpoints
> state.backend.incremental: true
> rest.port: 8081
>
> masters:
> node1:8081
> node2:8081
>
> slaves:
> node3
> node4
> node5
>
> I am new to blink, and I suspect my configuration is wrong. Has anyone else deployed blink? Could you share your configuration, and how should I fix the problems in my environment?
>
> Thanks.
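The FlinkException above says the job graph node in ZooKeeper points at a state handle that is broken, and suggests cleaning the state handle store. A cleanup sketch along those lines, assuming the default ZooKeeper layout (`high-availability.zookeeper.path.root` of `/flink` plus cluster-id `default` — verify the real path with `ls` in zkCli.sh first) and the HDFS storageDir from the config above; the destructive commands are deliberately left commented out:

```shell
# Sketch: remove the broken submitted-job-graph entry so the JobManager
# can start with a clean HA store. Job ID taken from the error message.
JOB_ID="a5ffe00b0bc5688d9a7de5c62b8150e6"

# Assumed ZK layout: <path.root>/<cluster-id>/jobgraphs/<job-id>.
# Verify first with:  zkCli.sh -server ip:2181 ls /flink
ZK_JOBGRAPH_PATH="/flink/default/jobgraphs/${JOB_ID}"
echo "ZK node to delete: ${ZK_JOBGRAPH_PATH}"

# Stop the cluster, delete the stale node ('rmr' on ZK 3.4,
# 'deleteall' on 3.5+), then remove the matching state handle
# files from the HA storageDir:
#   bin/stop-cluster.sh
#   zkCli.sh -server ip:2181 rmr "${ZK_JOBGRAPH_PATH}"
#   hdfs dfs -rm -r "hdfs://ip:8020/blink/ha/zookeeper/storageDir/submittedJobGraph*"
```

Note this discards the submitted job, so it is only appropriate when the job graph is unrecoverable anyway, as in the log above.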
Re: blink ha, processes die right after startup
Note this line:

ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed

Is your ZK cluster working normally, and has blink actually connected to it?

Best,
tison.

xiao...@chinaunicom.cn wrote on Wed, Mar 13, 2019 at 15:58:
> [original message quoted in full above]
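Tison's question — is ZK healthy and is blink actually connecting — can be checked directly against the quorum. A diagnostics sketch, assuming the host/port from the flink-conf.yaml above (`high-availability.zookeeper.quorum: ip:2181`); commands against the live server are shown commented so they can be run selectively:

```shell
# Host/port of the ZooKeeper quorum from flink-conf.yaml.
ZK_HOST="ip"
ZK_PORT="2181"
echo "checking ZooKeeper at ${ZK_HOST}:${ZK_PORT}"

# 1) Liveness: a healthy server answers "imok" to the four-letter-word
#    command "ruok" (on ZK 3.5+ it may need whitelisting via
#    4lw.commands.whitelist in zoo.cfg):
#   echo ruok | nc "${ZK_HOST}" "${ZK_PORT}"

# 2) Check the ACL on Flink's root node; "'world,'anyone: cdrwa" means
#    open access, i.e. no client authentication should be required:
#   zkCli.sh -server "${ZK_HOST}:${ZK_PORT}" getAcl /flink

# 3) The Curator "Authentication failed" line typically points at
#    SASL/Kerberos: either give the client a valid JAAS setup via the
#    security.kerberos.login.* options in flink-conf.yaml, or, on an
#    open (non-secured) quorum, disable SASL on the client JVM with:
#   -Dzookeeper.sasl.client=false
```

If step 2 shows a restrictive ACL or step 3 applies, that would explain why the JobManager can reach ZK yet fail while reading the stored job graph.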