Hi Dan,

Flink already has a tool integrated in the web UI for monitoring detailed checkpoint statistics [1]. It shows the time consumed in each part of a checkpoint and by each task, so it can be used to debug the checkpoint timeout.
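If you prefer something closer to a CLI, the same statistics should also be reachable through the JobManager REST API. Below is a minimal Python sketch, not a definitive recipe. It assumes the `/jobs/<job-id>/checkpoints` endpoint and the `history`, `id`, `status`, `end_to_end_duration` and `state_size` field names as I remember them from the REST API docs, so please verify them against your Flink version. The helper names are made up for illustration.

```python
import json
from urllib.request import urlopen


def fetch_checkpoint_stats(base_url, job_id):
    """GET <base_url>/jobs/<job_id>/checkpoints and return the parsed JSON.

    The endpoint path is an assumption; check it against your Flink version.
    """
    with urlopen(f"{base_url}/jobs/{job_id}/checkpoints") as resp:
        return json.load(resp)


def summarize_history(stats):
    """Return (id, status, duration_ms, state_size_bytes) tuples, slowest first.

    Missing fields default to 0 so that partially reported (for example
    failed) checkpoints do not break the sort.
    """
    rows = [
        (
            c.get("id"),
            c.get("status"),
            c.get("end_to_end_duration", 0),
            c.get("state_size", 0),
        )
        for c in stats.get("history", [])
    ]
    return sorted(rows, key=lambda r: r[2], reverse=True)
```

For example, `summarize_history(fetch_checkpoint_stats("http://localhost:8081", "<job-id>"))` would list the recent checkpoints with the slowest one first. The address here is only a placeholder for wherever your JobManager REST endpoint is exposed.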
Best,
Yun

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/monitoring/checkpoint_monitoring/

------------------ Original Mail ------------------
Sender: Dan Hill <quietgol...@gmail.com>
Send Date: Sat Jun 12 09:15:50 2021
Recipients: user <user@flink.apache.org>
Subject: Checkpoint is timing out - inspecting state

Hi. We're doing something bad with our Flink state. We just launched a feature that creates very big values (lists of objects that we append to) in MapState.

Our checkpoints time out (10 minutes). I'm assuming the values are too big. Backpressure is okay and cpu+memory metrics look okay.

Questions
1. Is there an easy tool for inspecting the Flink state? I found this post about drilling into Flink state. I was hoping for something more like a CLI.
2. Is there a way to break down the time spent during a checkpoint if it times out?

Thanks!
- Dan