Re: Checkpoint failures in a stream processing job

Robert.Zhang Mon, 24 Aug 2020 09:59:01 -0700

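I looked at the logs: some checkpoints timed out before completing, and in the web UI the checkpoint of the iteration source never completes.
The official documentation has no further detail on checkpointing in iterative streams. I can understand that data inside the loop may be lost, but I do not really understand why the checkpoint cannot succeed at all.
As I understand the Chandy-Lamport algorithm, the barrier from an upstream operator is passed directly to the downstream operator, so there should be no case where a barrier cannot be received. I do not know what is causing this.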
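
To make the expectation above concrete, here is a toy sketch of aligned barrier handling. It is plain Java, not Flink code; the Operator class and the channel names are made up for illustration. It only models the rule that an operator takes its snapshot once a barrier has arrived on each of its input channels, and it says nothing about how Flink treats the feedback edge of an iteration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class BarrierAlignmentSketch {

    /** Hypothetical operator that snapshots once every input channel has delivered the barrier. */
    static final class Operator {
        final String name;
        final Set<String> inputs;
        final Set<String> barriersSeen = new HashSet<>();
        boolean snapshotTaken = false;

        Operator(String name, String... inputs) {
            this.name = name;
            this.inputs = new HashSet<>(Arrays.asList(inputs));
        }

        /** Returns true if this barrier completed the alignment and triggered the snapshot. */
        boolean onBarrier(String fromChannel) {
            if (!inputs.contains(fromChannel)) {
                throw new IllegalArgumentException("unknown input channel: " + fromChannel);
            }
            barriersSeen.add(fromChannel);
            if (!snapshotTaken && barriersSeen.equals(inputs)) {
                snapshotTaken = true; // state would be written here, then the barrier forwarded
                return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Operator join = new Operator("join", "sourceA", "sourceB");
        System.out.println(join.onBarrier("sourceA")); // still waiting for sourceB
        System.out.println(join.onBarrier("sourceB")); // all inputs aligned, snapshot taken
    }
}
```

In this model the snapshot can only stall if some input channel never delivers a barrier, which is the situation I would not expect in a plain upstream-to-downstream pipeline.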


---Original Message---
From: "Congxian Qiu"<[email protected]>
Sent: Mon, Aug 24, 2020, 8:21 PM
To: "user-zh"<[email protected]>;
Subject: Re: Checkpoint failures in a stream processing job


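Hi
   Judging from the error "Exceeded checkpoint tolerable failure threshold", your checkpoints kept failing, and that is what caused the job to fail. You need to find out why the checkpoints fail; this article [1] may be of some help.
   Also, from your configuration, you have enabled unaligned checkpoints, which that article does not cover yet.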
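
As a rough illustration of how that threshold leads to a job failure, here is a toy model in plain Java, not Flink's actual code. It assumes that consecutive checkpoint failures are counted, that a successful checkpoint resets the count, and that the job is failed once the count exceeds the tolerable number; which failure reasons are counted is not modeled. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class ToleranceSketch {

    /**
     * Returns the 1-based index of the checkpoint at which the job would be failed,
     * or -1 if the threshold is never exceeded.
     * outcomes: true = checkpoint succeeded, false = checkpoint failed.
     */
    static int failingCheckpoint(List<Boolean> outcomes, int tolerable) {
        int consecutiveFailures = 0;
        for (int i = 0; i < outcomes.size(); i++) {
            if (outcomes.get(i)) {
                consecutiveFailures = 0; // assumption: a success resets the counter
            } else if (++consecutiveFailures > tolerable) {
                return i + 1; // threshold exceeded, the job would be failed here
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // With a tolerable number of 2, the third failure in a row exceeds the threshold.
        System.out.println(failingCheckpoint(Arrays.asList(false, false, false), 2));
        // A success in between resets the count, so the threshold is never exceeded.
        System.out.println(failingCheckpoint(Arrays.asList(false, false, true, false, false), 2));
    }
}
```

Under these assumptions, a tolerable number of 2 means a job whose checkpoints never succeed would be failed at its third failed checkpoint.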

[1] https://zhuanlan.zhihu.com/p/87131964
Best,
Congxian


Robert.Zhang <[email protected]> wrote on Fri, Aug 21, 2020 at 6:31 PM:

> Hello all,
> I have run into a problem using checkpoints in an iterative stream job.
> I configured checkpointing according to the documentation, but during testing checkpoints almost never succeed.
> The test state is tiny, only a few KB, and checkpoints still fail. The job reports
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
>
>
> The configuration is as follows:
> env.enableCheckpointing(10000, CheckpointingMode.EXACTLY_ONCE, true);
> CheckpointConfig checkpointConfig = env.getCheckpointConfig();
> checkpointConfig.setCheckpointTimeout(600000);
> checkpointConfig.setMinPauseBetweenCheckpoints(60000);
> checkpointConfig.setMaxConcurrentCheckpoints(4);
>
> checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
> checkpointConfig.setPreferCheckpointForRecovery(true);
> checkpointConfig.setTolerableCheckpointFailureNumber(2);
> checkpointConfig.enableUnalignedCheckpoints();
>
>
> The job processes only a few records and there is no backpressure. Has anyone run into a similar problem?