Hi Qihua,
It seems that the job failed because of the checkpoint timeout (10 min)
shown in the second picture. The checkpoint failure happens because one
of your own custom sources did not acknowledge the checkpoint.
So I think you could add some logging to your source to figure out what
is happening at that moment.
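
For example, here is a minimal sketch of where such logging could go,
assuming your source is a legacy SourceFunction that also implements
CheckpointedFunction (the class name MySource and the offset state are
hypothetical placeholders, not your actual code):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MySource implements SourceFunction<String>, CheckpointedFunction {
    private static final Logger LOG = LoggerFactory.getLogger(MySource.class);

    private volatile boolean running = true;
    // Hypothetical piece of source state; stands in for whatever your
    // source actually checkpoints.
    private transient ListState<Long> offsetState;
    private long offset;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // Emit records under the checkpoint lock so snapshots see a
            // consistent offset. If this loop holds the lock for a long
            // time, snapshotState() cannot run and the checkpoint times out.
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect("record-" + offset);
                offset++;
            }
            Thread.sleep(10); // pacing for the sketch only
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // If this line never appears for a failed checkpoint, the task
        // most likely could not acquire the checkpoint lock, i.e. run()
        // is blocked while holding it.
        LOG.info("Starting snapshot for checkpoint {} at offset {}",
                context.getCheckpointId(), offset);
        offsetState.clear();
        offsetState.add(offset);
        LOG.info("Finished snapshot for checkpoint {}", context.getCheckpointId());
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        offsetState = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("offset", Long.class));
        for (Long restored : offsetState.get()) {
            offset = restored;
        }
    }
}

If "Starting snapshot" is logged but "Finished snapshot" is not, the
snapshot itself is slow; if neither appears, the task never got to the
snapshot at all, which is exactly what the extra logging should help
you pin down.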
Best,
Guowei


On Fri, Jun 25, 2021 at 6:21 AM Qihua Yang <yang...@gmail.com> wrote:

> Hi,
> We are using Flink to consume data from Kafka topics and push it to an
> Elasticsearch cluster. We hit an issue: the checkpoint succeeds 9 times
> and then fails 2 times, and those failures cause the job manager to
> restart. That pattern repeats every 20 ~ 25 minutes.
> The Flink job has 72 subtasks. For every failed checkpoint, there are a
> few subtasks that didn't acknowledge the checkpoint.
> Flink pod CPU and memory usage are pretty low.
> Elasticsearch node CPU and memory usage are also pretty low.
>
> Does anyone know why this happens, and how to fix it?
> Attached are the graphs.
>
> Thanks,
> Qihua
>
