Hi Qihua,

From the second picture, it seems the job fails because of the checkpoint timeout (10 min). I found that the checkpoint fails because one of your own custom sources does not acknowledge the checkpoint. So I think you could add some logging in your source to figure out what is happening at that moment, e.g. along the lines of the sketch below.

Best,
Guowei
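[Editor's note] A minimal sketch of such diagnostic logging, assuming the source is a legacy RichSourceFunction that also implements CheckpointedFunction; class, field, and helper names here are illustrative, not Qihua's actual code. The idea is to log both entry and exit of snapshotState() with the checkpoint ID: if "entered" never shows up, the task never reached the snapshot (for a legacy source, often because run() is holding the checkpoint lock); if it enters but finishes slowly, the snapshot itself is the bottleneck.

    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class MyCustomSource extends RichSourceFunction<String>
            implements CheckpointedFunction {

        private static final Logger LOG = LoggerFactory.getLogger(MyCustomSource.class);
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            while (running) {
                // Do blocking I/O OUTSIDE the checkpoint lock; snapshotState()
                // needs this lock, so holding it for long stretches here delays
                // the checkpoint acknowledgement.
                String record = fetchNextRecord();
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(record);
                }
            }
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            long start = System.currentTimeMillis();
            LOG.info("snapshotState entered for checkpoint {}", context.getCheckpointId());
            // ... write the source's state to the state backend here ...
            LOG.info("snapshotState finished for checkpoint {} in {} ms",
                    context.getCheckpointId(), System.currentTimeMillis() - start);
        }

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            // ... restore the source's state here ...
        }

        @Override
        public void cancel() {
            running = false;
        }

        private String fetchNextRecord() {
            return "record"; // placeholder for the actual fetch logic
        }
    }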
On Fri, Jun 25, 2021 at 6:21 AM Qihua Yang <yang...@gmail.com> wrote:
> Hi,
> We are using Flink to consume data from Kafka topics and push it to an
> Elasticsearch cluster. We have run into an issue: checkpoints succeed 9
> times and then fail 2 times, and those failures cause the job manager to
> restart. That pattern repeats every 20-25 minutes.
> The Flink job has 72 subtasks. For every failed checkpoint, there are a
> few subtasks that didn't acknowledge the checkpoint.
> Flink pod CPU and memory usage are pretty low.
> Elasticsearch node CPU and memory usage are also pretty low.
>
> Does anyone know why? And how to fix it?
> Attached are the graphs.
>
> Thanks,
> Qihua
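[Editor's note] On the restart behavior described above: in the Flink versions of that era, a failed checkpoint fails the job by default (the tolerable failure count is 0). Both the 10 min timeout and that tolerance are configurable via CheckpointConfig. A minimal sketch with illustrative values, not a recommendation; this only treats the symptom, while Guowei's logging suggestion above targets the root cause:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointTuningExample {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a checkpoint every 60 s (interval is illustrative).
            env.enableCheckpointing(60_000);

            // Raise the checkpoint timeout from the default seen in the
            // logs (10 min) to 20 min.
            env.getCheckpointConfig().setCheckpointTimeout(20 * 60 * 1000L);

            // Tolerate a few consecutive checkpoint failures before failing
            // the job, so sporadic timeouts do not trigger a restart.
            env.getCheckpointConfig().setTolerableCheckpointFailureNumber(2);
        }
    }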