Re: Changelog fail leads to job fail regardless of tolerable-failed-checkpoints config

Yanfei Lei Tue, 20 Jun 2023 20:38:55 -0700

Hi Dongwoo,

When the Changelog state backend is enabled, state changes are
uploaded to durable storage continuously. In other words, data is also
persisted **outside the checkpoint phase**, and an exception raised at
that point fails the job directly. Only exceptions raised during the
checkpoint phase are counted as checkpoint failures.
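
To make the distinction concrete, here is a minimal toy model of the two
failure paths. It is not Flink code: the class and method names are
hypothetical, and the "job fails once consecutive checkpoint failures
exceed the limit" rule is my simplified reading of how
tolerable-failed-checkpoints behaves.

```java
import java.io.IOException;

// Toy model only; none of these names exist in Flink.
public class FailureHandlingSketch {
    private final int tolerableFailedCheckpoints;
    private int consecutiveFailedCheckpoints = 0;
    private boolean jobFailed = false;

    public FailureHandlingSketch(int tolerableFailedCheckpoints) {
        this.tolerableFailedCheckpoints = tolerableFailedCheckpoints;
    }

    // Exception raised during the checkpoint phase: counted against the limit.
    public void onCheckpointFailure(Exception cause) {
        consecutiveFailedCheckpoints++;
        if (consecutiveFailedCheckpoints > tolerableFailedCheckpoints) {
            jobFailed = true;
        }
    }

    // A successful checkpoint resets the consecutive-failure counter.
    public void onCheckpointSuccess() {
        consecutiveFailedCheckpoints = 0;
    }

    // Exception raised by the continuous upload, outside the checkpoint
    // phase: the counter is never consulted and the job fails directly.
    public void onChangelogUploadFailure(IOException cause) {
        jobFailed = true;
    }

    public boolean isJobFailed() {
        return jobFailed;
    }

    public static void main(String[] args) {
        FailureHandlingSketch tolerated = new FailureHandlingSketch(5);
        for (int i = 0; i < 5; i++) {
            tolerated.onCheckpointFailure(new RuntimeException("checkpoint timeout"));
        }
        System.out.println("after 5 checkpoint failures: " + tolerated.isJobFailed());

        FailureHandlingSketch direct = new FailureHandlingSketch(5);
        direct.onChangelogUploadFailure(new IOException("S3 timeout"));
        System.out.println("after 1 upload failure: " + direct.isJobFailed());
    }
}
```

In this model, five checkpoint failures are absorbed by a limit of 5,
while a single upload failure outside the checkpoint phase fails the
job no matter what the limit is.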


Dongwoo Kim <dongwoo7....@gmail.com> wrote on Tue, Jun 20, 2023 at 18:31:
>
> Hello all, I have a question about changelog persist failures.
> When the changelog persist process fails due to an S3 timeout, it seems to
> fail the job, with the log below, even though
> "execution.checkpointing.tolerable-failed-checkpoints" is set to 5.
>
> Caused by: java.io.IOException: The upload for 522 has already failed 
> previously
>
> Upon digging into the source code, I observed that Flink always checks
> the sequence number against the latest failed sequence number, which
> results in an IOException. I am curious about the reasoning behind this
> check, as it seems to prevent the "tolerable-failed-checkpoints"
> configuration from working as expected.
> Can anyone explain the goal behind this design?
> Additionally, I'd like to propose a potential solution: what if we adjusted
> this section to allow failed changelogs to be uploaded on subsequent
> attempts, up to the "tolerable-failed-checkpoints" limit, before declaring
> the job as failed?
>
> Thanks in advance
>
> Best regards
> dongwoo


-- 
Best,
Yanfei
