Re: Incorrect checkpoint id used when job is recovering

tao xiao Thu, 19 May 2022 06:16:54 -0700

Hi team,

Can anyone shed some light?
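
In case it helps anyone check their own JM log for the same symptom, here is a minimal sketch that flags a restore attempt targeting a checkpoint older than the newest completed one. It assumes each log entry sits on a single unwrapped line and relies only on the two message formats quoted below; the function name `find_stale_restores` is hypothetical and not part of Flink.

```python
import re

# Message formats copied from the two log excerpts quoted in this thread.
COMPLETED = re.compile(r"Completed checkpoint (\d+) for job")
RETRIEVE = re.compile(r"Trying to retrieve checkpoint (\d+)")


def find_stale_restores(lines):
    """Return (retrieved_id, latest_completed_id) pairs where recovery asked
    for a checkpoint older than the newest completed one seen so far."""
    latest_completed = None
    stale = []
    for line in lines:
        m = COMPLETED.search(line)
        if m:
            cid = int(m.group(1))
            # Track the highest completed checkpoint id seen so far.
            if latest_completed is None or cid > latest_completed:
                latest_completed = cid
            continue
        m = RETRIEVE.search(line)
        if m and latest_completed is not None:
            rid = int(m.group(1))
            if rid < latest_completed:
                stale.append((rid, latest_completed))
    return stale


# The two entries from this thread, joined onto single lines.
sample = [
    "2022-05-10 14:55:40,663 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 41205 for job 00000000000000000000000000000000 (9453840 bytes in 1922 ms).",
    "2022-05-10 14:59:38,949 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 41051.",
]
print(find_stale_restores(sample))  # [(41051, 41205)]
```

This only detects the mismatch from the log; it does not explain why the older checkpoint id was selected.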


On Sat, May 14, 2022 at 8:56 AM tao xiao <xiaotao...@gmail.com> wrote:

> Hi team,
>
> Does anyone have any ideas?
>
> On Thu, May 12, 2022 at 9:20 PM tao xiao <xiaotao...@gmail.com> wrote:
>
>> Forgot to mention: the Flink version is 1.13.2 and we run in Kubernetes
>> native mode.
>>
>> On Thu, May 12, 2022 at 9:18 PM tao xiao <xiaotao...@gmail.com> wrote:
>>
>>> Hi team,
>>>
>>> I hit a weird issue when a job tries to recover from a JM failure. The
>>> last successful checkpoint before the JM crashed was 41205:
>>>
>>> ```
>>>
>>> {"log":"2022-05-10 14:55:40,663 INFO  
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
>>> checkpoint 41205 for job 00000000000000000000000000000000 (9453840 bytes in 
>>> 1922 ms).\n","stream":"stdout","time":"2022-05-10T14:55:40.663286893Z"}
>>>
>>> ```
>>>
>>> However, the JM tries to recover the job from an older checkpoint, 41051,
>>> which does not exist, and this leaves the job in an unrecoverable state:
>>>
>>> ```
>>>
>>> "2022-05-10 14:59:38,949 INFO  
>>> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - 
>>> Trying to retrieve checkpoint 41051.\n"
>>>
>>> ```
>>>
>>> Full log attached
>>>
>>> --
>>> Regards,
>>> Tao
>>>
>>
>>
>> --
>> Regards,
>> Tao
>>
>
>
> --
> Regards,
> Tao
>


-- 
Regards,
Tao
