Hi Qihua,
From the second picture you provided, checkpoint 53 timed out because of the
subtask whose id is 6.
Could you please provide the taskmanager.log of subtask 6? With it we could try to
find the specific reason for the checkpoint 53 failure.
Besides, you said the checkpoint failures appear every 20~25 minutes. Is
every failure due to a timeout?

Best regards,
JING ZHANG

Qihua Yang <yang...@gmail.com> wrote on Fri, Jun 25, 2021 at 6:21 AM:

> Hi,
> We are using Flink to consume data from Kafka topics and push it to an
> Elasticsearch cluster. We have an issue: checkpoints succeed 9 times and fail 2
> times. Those failures cause the job manager to restart. That pattern
> repeats every 20 ~ 25 minutes.
> The Flink job has 72 subtasks. For every failed checkpoint, a
> few subtasks didn't acknowledge the checkpoint.
> Flink pod CPU and memory usage are pretty low.
> Elasticsearch node CPU and memory usage are also pretty low.
>
> Does anyone know why? And how to fix it?
> Attached are the graphs
>
> Thanks,
> Qihua
>
