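This was resolved this morning: zk still held leftover jobGraph entries from a failed job, and deleting them brought the cluster back. Thank you very much; I hope to keep learning from you.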
Thank you for your reply!

From: Zili Chen
Sent: 2019-03-26 09:46
To: [email protected]
Subject: Re: Re: flink HA mode process hang!!!

If you did not clean up the earlier zk data, it may be that you previously set high-availability.storageDir to /flink/ha/zookeeper and later cleaned up HDFS, while zk still holds stale state handle information.

Best,
tison.


Han Xiao <[email protected]> wrote on Tue, Mar 26, 2019 at 9:33 AM:

> Hi, good morning, and thank you for your reply. Here are my configuration files and parameters:
>
> flink-conf.yaml
> common:
> jobmanager.rpc.address: test10
> jobmanager.rpc.port: 6123
> jobmanager.heap.size: 1024m
> taskmanager.heap.size: 1024m
> taskmanager.numberOfTaskSlots: 2
> parallelism.default: 2
> taskmanager.tmp.dirs: /app/tools/flink-1.7.2/tmp
>
> High Availability:
> high-availability: zookeeper
> high-availability.storageDir: hdfs://test10:8020/flink/ha/
> ## this directory is created normally, but it contains no jobGraph-related directory
> high-availability.zookeeper.quorum: ip1:2181,ip2:2181,ip3:2181,ip4:2181,ip5:2181
> high-availability.zookeeper.client.acl: open
>
> Fault tolerance and checkpointing:
> state.backend: filesystem
> state.checkpoints.dir: hdfs://test10:8020/flink-checkpoints ## this directory was not created
>
> Web Frontend:
> rest.port: 8081
>
> masters:        slaves:
> test10:8081     test12
> test11 : 8082   test13
>                 test14
>
> That is my entire configuration. The path being looked up in the error below does not appear anywhere in it, which puzzles me.
>
> Thank you for your reply!
> From: Zili Chen
> Sent: 2019-03-25 19:57
> To: [email protected]
> Subject: Re: flink HA mode process hang!!!
> It looks like HDFS is looking for the submittedJobGraph under /flink/ha/zookeeper/submittedJobGraphb05001535f91, which does not look right.
>
> Flink HA needs both a zk path and a file system path for storing state. You could try setting high-availability.storageDir to a valid HDFS path.
>
> Best,
> tison.
>
>
> Zili Chen <[email protected]> wrote on Mon, Mar 25, 2019 at 7:53 PM:
>
> > Could you share your HA configuration, in particular high-availability.storageDir? I suspect it is not set.
> > Best,
> > tison.
> >
> >
> > Han Xiao <[email protected]> wrote on Mon, Mar 25, 2019 at 7:26 PM:
> >
> >> Hello everyone, I am new to Flink and ran into a problem while deploying Flink HA. I would appreciate your help.
> >> After starting Flink HA, the jobmanager process hangs right away. I am using Flink 1.7.2. The log below contains this error:
> >> File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91
> >> What puzzles me is that neither my checkpoint directory nor my HA directory is this path, so why does it look there? No JobGraph was created under the directories I configured. It keeps looking up the job graph /a5ffe00b0bc5688d9a7de5c62b8150e6 and cannot find it. I deleted all the related configured paths and rebuilt the setup, but it still looks for it on startup. How can I stop Flink from looking up this JobGraph so that my HA cluster runs healthily?
> >>
> >>
> >> Error log:
> >> 2019-03-25 18:55:00,742 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
> >> java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
> >>     at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
> >>     at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$2(FunctionUtils.java:74)
> >>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
> >> .......
> >> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
> >>     at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
> >>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:696)
> >>     at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:681)
> >> ........
> >> Caused by: java.io.FileNotFoundException: File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91
> >>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
> >>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
> >>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2100)
> >>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2070)
> >> .......
> >> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /flink/ha/zookeeper/submittedJobGraphb05001535f91
> >>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
> >>     at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
> >>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2100)
> >>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2070)
> >> .......
> >>
> >> Thanks!
> >>
> >
>
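
The fix reported at the top of this thread was to remove the stale job graph entry from ZooKeeper. As a rough sketch of how to locate that entry, the Python snippet below pulls the job ID out of the "state handle under /<id>" error message and builds the znode path where Flink would keep the job graph handle. The function names are hypothetical. The three path defaults are assumptions based on the Flink 1.7 configuration defaults, which apply here only because the posted flink-conf.yaml does not override them. Check them against your own setup before deleting anything.

```python
import re

# Assumed Flink 1.7 defaults, used when flink-conf.yaml does not set them.
DEFAULT_ROOT = "/flink"            # high-availability.zookeeper.path.root
DEFAULT_CLUSTER_ID = "/default"    # high-availability.cluster-id (standalone)
DEFAULT_JOBGRAPHS = "/jobgraphs"   # high-availability.zookeeper.path.jobgraphs

# Matches the 32-hex-digit job ID in the FlinkException message.
JOB_ID_PATTERN = re.compile(r"state handle under\s+/([0-9a-f]{32})")


def extract_job_id(log_text):
    """Return the job ID named in the error message, or None if absent."""
    match = JOB_ID_PATTERN.search(log_text)
    return match.group(1) if match else None


def jobgraph_znode(job_id, root=DEFAULT_ROOT, cluster_id=DEFAULT_CLUSTER_ID,
                   jobgraphs=DEFAULT_JOBGRAPHS):
    """Join the configured path segments into the job graph znode path."""
    parts = [p.strip("/") for p in (root, cluster_id, jobgraphs, job_id)]
    return "/" + "/".join(p for p in parts if p)


if __name__ == "__main__":
    log = ("Could not retrieve submitted JobGraph from state handle under\n"
           "/a5ffe00b0bc5688d9a7de5c62b8150e6. This indicates that the "
           "retrieved state handle is broken.")
    job_id = extract_job_id(log)
    if job_id is not None:
        print(jobgraph_znode(job_id))
```

With the JobManagers stopped, the resulting path can be inspected in zkCli.sh with `ls` and `get`, then removed with `rmr` (ZooKeeper 3.4) or `deleteall` (ZooKeeper 3.5 and later).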
