Hi team,

Can anyone shed some light on this?

On Sat, May 14, 2022 at 8:56 AM tao xiao <xiaotao...@gmail.com> wrote:

> Hi team,
>
> Does anyone have any ideas?
>
> On Thu, May 12, 2022 at 9:20 PM tao xiao <xiaotao...@gmail.com> wrote:
>
>> Forgot to mention: the Flink version is 1.13.2 and we use Kubernetes
>> native mode.
>>
>> On Thu, May 12, 2022 at 9:18 PM tao xiao <xiaotao...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> I ran into a weird issue when a job tries to recover from a JM failure.
>>> The last successful checkpoint before the JM crashed is 41205:
>>>
>>> ```
>>> {"log":"2022-05-10 14:55:40,663 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 41205 for job 00000000000000000000000000000000 (9453840 bytes in 1922 ms).\n","stream":"stdout","time":"2022-05-10T14:55:40.663286893Z"}
>>> ```
>>>
>>> However, the JM tries to recover the job from an older checkpoint, 41051,
>>> which no longer exists. This leaves the job in an unrecoverable state:
>>>
>>> ```
>>> "2022-05-10 14:59:38,949 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 41051.\n"
>>> ```
>>>
>>> Full log attached.
>>>
>>> --
>>> Regards,
>>> Tao
>>
>> --
>> Regards,
>> Tao
>
> --
> Regards,
> Tao

--
Regards,
Tao