Thank you very much for the answer. The cause was the jobGraph of a failed job left behind in ZK: every cluster start tried to retrieve it. Deleting the residual nodes in ZK and restarting resolved the problem.
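The cleanup described above can be sketched with ZooKeeper's `zkCli.sh`. This is illustrative only: it assumes the default `high-availability.zookeeper.path.root` of `/flink` and the default cluster-id `default`, and the job id is the one from the error log below; check the actual paths with `ls` before deleting anything.

```shell
# List the job graph entries Flink keeps in ZooKeeper
# (paths assume the default /flink root and "default" cluster-id).
./zkCli.sh -server ip1:2181 ls /flink/default/jobgraphs

# Remove the residual entry for the failed job so the JobManager
# stops trying to recover it on every start.
./zkCli.sh -server ip1:2181 rmr /flink/default/jobgraphs/a5ffe00b0bc5688d9a7de5c62b8150e6
```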
Thank you for your reply!

From: [email protected]
Sent: 2019-03-26 09:40
To: user-zh
Subject: Re: Re: flink HA mode process hangs!!!

Could it be related to this access-control setting?
high-availability.zookeeper.client.acl: open

[email protected]

From: Han Xiao
Sent: 2019-03-26 09:33
To: [email protected]
Subject: Re: Re: flink HA mode process hangs!!!

Hi, good morning, and thanks for your reply. Here are my configuration items and parameters:

flink-conf.yaml

common:
jobmanager.rpc.address: test10
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 2
taskmanager.tmp.dirs: /app/tools/flink-1.7.2/tmp

High Availability:
high-availability: zookeeper
high-availability.storageDir: hdfs://test10:8020/flink/ha/   ## this directory is created normally, but contains no jobGraph-related subdirectory
high-availability.zookeeper.quorum: ip1:2181,ip2:2181,ip3:2181,ip4:2181,ip5:2181
high-availability.zookeeper.client.acl: open

Fault tolerance and checkpointing:
state.backend: filesystem
state.checkpoints.dir: hdfs://test10:8020/flink-checkpoints   ## this directory is not created

Web Frontend:
rest.port: 8081

masters:
test10:8081
test11:8082

slaves:
test12
test13
test14

Those are all my configuration items. The path in the error message below does not appear anywhere in my configuration, which puzzles me.

Thank you for your reply!

From: Zili Chen
Sent: 2019-03-25 19:57
To: [email protected]
Subject: Re: flink HA mode process hangs!!!

It looks like HDFS is going to the path /flink/ha/zookeeper/submittedJobGraphb05001535f91 to look for the submittedJobGraph, which does not look right.

Flink HA needs a ZK path configured, plus a file-system path where state is stored. You could try setting high-availability.storageDir to a valid HDFS path.

Best,
tison.

Zili Chen <[email protected]> wrote on Mon, Mar 25, 2019 at 19:53:
> Could you share your HA configuration? In particular high-availability.storageDir; I suspect it may not be set.
> Best,
> tison.
>
> Han Xiao <[email protected]> wrote on Mon, Mar 25, 2019 at 19:26:
>
>> Hi everyone, I am a Flink beginner and ran into some problems while deploying Flink HA; I would appreciate your help.
>> After starting Flink HA, the jobmanager process hangs immediately. I am using Flink 1.7.2; the log below contains the error "File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91".
>> What puzzles me is that neither my checkpoint directory nor my HA directory is that path, so why does it look there? No jobGraph is created under the directories I configured. It keeps trying to retrieve the job graph /a5ffe00b0bc5688d9a7de5c62b8150e6 and cannot find it. I deleted all the related configured paths and redeployed, but on startup it still tries to retrieve it. How can I stop Flink from retrieving this jobGraph so that my HA cluster can run healthily?
>>
>> Error log:
>> 2019-03-25 18:55:00,742 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
>> java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>     at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
>>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
>>     .......
>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
>>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
>>     ........
>> Caused by: java.io.FileNotFoundException: File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91
>>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
>>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2100)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2070)
>>     .......
>> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91
>>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
>>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2100)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2070)
>>     .......
>>
>> Thank you!
>
