Re: Savepoint incomplete when job was killed after a cancel timeout

Till Rohrmann Tue, 29 Sep 2020 07:30:22 -0700

Thanks for sharing the logs with me. It looks as if the total size of the
savepoint is 335 kB for a job with a parallelism of 60 and a total of 120
tasks. Hence, the average state size per task is roughly between 2.8 kB
and 5.6 kB (335 kB divided by 120 tasks or by the parallelism of 60). I
think that the state size threshold refers to the size of the per-task
state. Hence, I believe that the _metadata file should contain all of your
state. Have you tried restoring from this savepoint?
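
The arithmetic behind that estimate can be sketched as follows. This is only
a rough illustration: the function names are hypothetical, and the threshold
of 20 kB is an assumed value for `state.backend.fs.memory-threshold`, so
substitute whatever your configuration actually sets. It also takes the
assumption above at face value, namely that the threshold is compared
against the per-task state size.

```python
def avg_state_per_task_kb(total_kb: float, n_tasks: int) -> float:
    """Average state size per task, in kB."""
    return total_kb / n_tasks


def likely_inlined(total_kb: float, n_tasks: int, threshold_kb: float) -> bool:
    """True if the average per-task state is below the threshold,
    i.e. the state would plausibly be stored inside _metadata."""
    return avg_state_per_task_kb(total_kb, n_tasks) < threshold_kb


TOTAL_KB = 335          # savepoint size reported in the logs
THRESHOLD_KB = 20       # ASSUMPTION: replace with your configured threshold

low = avg_state_per_task_kb(TOTAL_KB, 120)   # spread over all 120 tasks
high = avg_state_per_task_kb(TOTAL_KB, 60)   # spread over parallelism 60
fits = likely_inlined(TOTAL_KB, 60, THRESHOLD_KB)

print(f"per-task state: {low:.2f} kB to {high:.2f} kB, inlined: {fits}")
```

If the configured threshold is lower than the per-task figure, the same
check would return False and separate state files would be expected next to
_metadata.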


Cheers,
Till

On Tue, Sep 29, 2020 at 3:47 PM Paul Lam <paullin3...@gmail.com> wrote:

> Hi Till,
>
> Thanks for your quick reply.
>
> The checkpoint/savepoint size would be around 2MB, which is larger than
> `state.backend.fs.memory-threshold`.
>
> The JobManager logs are attached; they look normal to me.
>
> Thanks again!
>
> Best,
> Paul Lam
>
> On Tue, Sep 29, 2020 at 8:32 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Paul,
>>
>> could you share with us the logs of the JobManager? They might help to
>> better understand in which order each operation occurred.
>>
>> How big are you expecting the size of the state to be? If it is smaller
>> than state.backend.fs.memory-threshold, then the state data will be stored
>> in the _metadata file.
>>
>> Cheers,
>> Till
>>
>> On Tue, Sep 29, 2020 at 1:52 PM Paul Lam <paullin3...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have a Flink job that was stopped erroneously with no available
>>> checkpoint/savepoint to restore from,
>>> and we are looking for some help to narrow down the problem.
>>>
>>> How we ran into this problem:
>>>
>>> We stopped the job using the cancel-with-savepoint command (for
>>> compatibility reasons), but the command
>>> timed out after 1 min because there was some backpressure. So we force
>>> killed the job with the yarn kill command.
>>> Usually, this would not cause trouble because we can still use the last
>>> checkpoint to restore the job.
>>>
>>> But this time, the last checkpoint dir had been cleaned up and was empty
>>> (the number of retained checkpoints was 1).
>>> According to zookeeper and the logs, the savepoint finished (job master
>>> logged “Savepoint stored in …”)
>>> right after the cancel timeout. However, the savepoint directory
>>> contains only the _metadata file, and the other
>>> state files referenced by the metadata are absent.
>>>
>>> Environment & Config:
>>> - Flink 1.11.0
>>> - YARN job cluster
>>> - HA via zookeeper
>>> - FsStateBackend
>>> - Aligned non-incremental checkpoint
>>>
>>> Any comments and suggestions are appreciated! Thanks!
>>>
>>> Best,
>>> Paul Lam
>>>
>>>
