Hi,

从异常日志看,应该是因为你的state.checkpoints.dir 或者说 
statebackend的checkpoint目录配置成了本地目录,checkpoint保存到了本地机器上,所以在failover 
restore的时候,必须得让原task部署回原来的机器才能正常运行。将state backend的checkpoint目录更换为一个DFS目录即可。


祝好
唐云
________________________________
From: Carmen Free <ddjxuxi...@gmail.com>
Sent: Tuesday, January 12, 2021 18:14
To: user-zh@flink.apache.org <user-zh@flink.apache.org>
Subject: Re: rocksdb作为statebackend时,TM节点挂掉了,为何任务不能恢复呢?

你好,唐老师,谢谢解答。

不好意思,下面补充一下报错信息,刚才忘记说了。

主要报错信息如下,重新模拟了下:
2021-01-12 18:09:34,236 INFO
 org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
Custom Source -> Flat Map -> Timestamps/Watermarks (2/2)
(3b50a7ce56b408c2978260846b76a28a) switched from RUNNING to FAILED on
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@4c107a3a.

java.lang.Exception: Exception while creating StreamOperatorStateContext.
at
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191)
at
org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:255)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:989)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:453)
at
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:448)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:460)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
at java.lang.Thread.run(Thread.java:748)

Caused by: org.apache.flink.util.FlinkException: Could not restore operator
state backend for StreamSource_cbc357ccb763df2852fee8c4fc7d55f2_(2/2) from
any of the 1 provided restore options.
at
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
at
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.operatorStateBackend(StreamTaskStateInitializerImpl.java:252)
at
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:139)
... 9 more

Caused by: org.apache.flink.runtime.state.BackendBuildingException: Failed
when trying to restore operator state backend
at
org.apache.flink.runtime.state.DefaultOperatorStateBackendBuilder.build(DefaultOperatorStateBackendBuilder.java:86)
at
org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createOperatorStateBackend(RocksDBStateBackend.java:565)
at
org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$operatorStateBackend$0(StreamTaskStateInitializerImpl.java:243)
at
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at
org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
... 11 more

Caused by: java.io.FileNotFoundException:
/data/flink/checkpoints/b261de0447d59bb2f1db9c084f5b1a0b/chk-5/4a160901-8ddd-468e-a82d-6efcb8a9dff9
(No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at
org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
at
org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
at
org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:85)
at
org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
at
org.apache.flink.runtime.state.OperatorStreamStateHandle.openInputStream(OperatorStreamStateHandle.java:66)
at
org.apache.flink.runtime.state.OperatorStateRestoreOperation.restore(OperatorStateRestoreOperation.java:73)
at
org.apache.flink.runtime.state.DefaultOperatorStateBackendBuilder.build(DefaultOperatorStateBackendBuilder.java:83)
... 15 more

这个文件夹在A节点(JM)上是有的,难道是访问权限问题吗?B节点无法访问A节点吗,有点奇怪啊,配置了ssh免密的啊,文件夹/data/flink/checkpoints访问权限也设置成了777

Yun Tang <myas...@live.com> 于2021年1月12日周二 下午5:46写道:

> Hi
>
> Flink的容错机制是可以保证TM lost时候会尝试重启作业,“为何任务不能恢复”是需要看完整异常栈的,简单描述是无法帮助排查问题的。
>
> 祝好
> 唐云
> ________________________________
> From: Carmen Free <ddjxuxi...@gmail.com>
> Sent: Tuesday, January 12, 2021 15:52
> To: user-zh@flink.apache.org <user-zh@flink.apache.org>
> Subject: rocksdb作为statebackend时,TM节点挂掉了,为何任务不能恢复呢?
>
> hi,
>
> rocksdb作为statebackend时,TM节点挂掉了,为何任务不能恢复呢?
>
> 1、环境说明
>
> flink版本:1.10.2
> 操作系统:centos 7
>
> 2、集群说明(当前模拟了2节点)
>
>                     节点A           |      节点B
> 角色        |   JM、TM        |        TM
> taskslot   |       4               |         4
>
> 3、statebackend配置
>
> # rocksdb作为状态后备
> state.backend: rocksdb
>
>
> # 存储快照的目录(暂时使用的本地目录)
> state.checkpoints.dir: file:///data/flink/checkpoints
>
>
> 4、启动任务后,任务自动分配在A节点的TM上,运行一段时间后,检查点快照正常。接着,仅停掉A节点TM(JM仍正常运行),任务被自动调度至B节点的TM上,但是此时任务一直重启,无法恢复,这是为什么呢?
>
>
> 5、如果我启动A节点,任务依旧无法恢复(此时任务仍在B节点运行),直到我停掉B节点TM,此时任务调度至A节点,任务可以正常恢复。所以有点疑问,4中的场景为何不能恢复任务呢?为什么只有在A节点上才可以进行任务恢复呢?最初以为是访问路径的问题,但是仔细想了想,检查点相关的操作一直都是JM进行的,我觉得只要JM没有挂掉,应该就可以将任务进行恢复啊,是我的理解有偏差吗?
>

回复