Hi Dongwoo,

When the Changelog state backend is enabled, state changelogs are uploaded to durable storage continuously. In other words, data is also persisted **outside the checkpoint phase**, and an exception raised at that point fails the job directly. Only exceptions raised during the checkpoint phase are counted as checkpoint failures.
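To make the distinction concrete, here is a toy Java model of the behavior described above. It is only a sketch and not Flink source code: the class `ChangelogFailureSketch`, its enums, and its methods are all hypothetical names made up for illustration, and the counting rule (the job fails once consecutive checkpoint failures exceed the tolerable number) is a simplified assumption.

```java
/** Toy model, NOT Flink source: every name in this class is hypothetical. */
final class ChangelogFailureSketch {

    /** Where the changelog persist failure happened. */
    enum Phase { CONTINUOUS_UPLOAD, CHECKPOINT }

    /** What happens to the job after the failure. */
    enum Outcome { JOB_CONTINUES, JOB_FAILS }

    // Stands in for execution.checkpointing.tolerable-failed-checkpoints.
    private final int tolerableFailedCheckpoints;

    // Assumption: only consecutive checkpoint failures are counted.
    private int consecutiveCheckpointFailures = 0;

    ChangelogFailureSketch(int tolerableFailedCheckpoints) {
        this.tolerableFailedCheckpoints = tolerableFailedCheckpoints;
    }

    Outcome onPersistFailure(Phase phase) {
        if (phase == Phase.CONTINUOUS_UPLOAD) {
            // Outside the checkpoint phase the tolerance setting is never consulted.
            return Outcome.JOB_FAILS;
        }
        // Inside the checkpoint phase the failure counts against the tolerance.
        consecutiveCheckpointFailures++;
        return consecutiveCheckpointFailures > tolerableFailedCheckpoints
                ? Outcome.JOB_FAILS
                : Outcome.JOB_CONTINUES;
    }

    /** A successful checkpoint resets the consecutive-failure counter. */
    void onCheckpointSuccess() {
        consecutiveCheckpointFailures = 0;
    }
}
```

With a tolerance of 5, this model survives five consecutive checkpoint-phase failures and fails on the sixth, while a single failure during continuous upload fails the job immediately, which matches what you observed.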
Dongwoo Kim <dongwoo7....@gmail.com> wrote on Tue, Jun 20, 2023 at 18:31:
>
> Hello all, I have a question about changelog persist failure.
> When changelog persist process fails due to an S3 timeout, it seems to lead
> to the job failure regardless of our
> "execution.checkpointing.tolerable-failed-checkpoints" configuration being
> set to 5 with this log.
>
> Caused by: java.io.IOException: The upload for 522 has already failed
> previously
>
> Upon digging into the source code, I observed that Flink consistently checks
> the sequence number against the latest failed sequence number, resulting in
> an IOException. I am curious about the reasoning behind this check as it
> seems to interfere with the "tolerable-failed-checkpoint" configuration
> working as expected.
> Can anyone explain the goal behind this design?
> Additionally, I'd like to propose a potential solution: What if we adjusted
> this section to allow failed changelogs to be uploaded on subsequent
> attempts, up to the "tolerable-failed-checkpoint" limit, before declaring the
> job as failed?
>
> Thanks in advance
>
> Best regards
> dongwoo

--
Best,
Yanfei