Glad to hear that your job data was not lost!

Cheers,
Till
On Tue, Sep 29, 2020 at 7:28 PM Paul Lam <paullin3...@gmail.com> wrote:

> Hi Till,
>
> Thanks a lot for the pointer! I tried to restore the job using the
> savepoint in a dry run, and it worked!
>
> Guess I'd misunderstood the configuration option and was confused by the
> non-existent paths that the metadata contains.
>
> Best,
> Paul Lam
>
> On Tue, Sep 29, 2020 at 10:30 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Thanks for sharing the logs with me. It looks as if the total size of
>> the savepoint is 335kb for a job with a parallelism of 60 and a total of
>> 120 tasks. Hence, the average size of the state per task is between
>> 2.5kb and 5kb. I think that the state size threshold refers to the size
>> of the per-task state. Hence, I believe that the _metadata file should
>> contain all of your state. Have you tried restoring from this savepoint?
>>
>> Cheers,
>> Till
>>
>> On Tue, Sep 29, 2020 at 3:47 PM Paul Lam <paullin3...@gmail.com> wrote:
>>
>>> Hi Till,
>>>
>>> Thanks for your quick reply.
>>>
>>> The checkpoint/savepoint size would be around 2MB, which is larger than
>>> `state.backend.fs.memory-threshold`.
>>>
>>> The JobManager logs are attached; they look normal to me.
>>>
>>> Thanks again!
>>>
>>> Best,
>>> Paul Lam
>>>
>>> On Tue, Sep 29, 2020 at 8:32 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>>> Hi Paul,
>>>>
>>>> Could you share the JobManager logs with us? They might help us better
>>>> understand the order in which each operation occurred.
>>>>
>>>> How big do you expect the state to be? If it is smaller than
>>>> state.backend.fs.memory-threshold, then the state data will be stored
>>>> in the _metadata file.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Tue, Sep 29, 2020 at 1:52 PM Paul Lam <paullin3...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We have a Flink job that was stopped erroneously with no available
>>>>> checkpoint/savepoint to restore from, and we are looking for some
>>>>> help to narrow down the problem.
>>>>>
>>>>> How we ran into this problem:
>>>>>
>>>>> We stopped the job using the cancel-with-savepoint command (due to a
>>>>> compatibility issue), but the command timed out after 1 min because
>>>>> there was some backpressure. So we force-killed the job with the yarn
>>>>> kill command. Usually this would not cause trouble, because we can
>>>>> still use the last checkpoint to restore the job.
>>>>>
>>>>> But this time, the last checkpoint dir had been cleaned up and was
>>>>> empty (the retained checkpoint number was 1). According to ZooKeeper
>>>>> and the logs, the savepoint finished (the job master logged
>>>>> "Savepoint stored in …") right after the cancel timeout. However, the
>>>>> savepoint directory contains only the _metadata file, and the other
>>>>> state files referred to by the metadata are absent.
>>>>>
>>>>> Environment & Config:
>>>>> - Flink 1.11.0
>>>>> - YARN job cluster
>>>>> - HA via ZooKeeper
>>>>> - FsStateBackend
>>>>> - Aligned non-incremental checkpoint
>>>>>
>>>>> Any comments and suggestions are appreciated! Thanks!
>>>>>
>>>>> Best,
>>>>> Paul Lam
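
---

For reference, a rough sketch of the operations and configuration keys discussed
in this thread. The savepoint paths, job ID, application ID, and jar name below
are placeholders, not values from Paul's setup, and the threshold value shown is
illustrative only:

    # Cancel with savepoint (the command Paul used; `flink stop` is the newer alternative)
    flink cancel -s hdfs:///flink/savepoints <jobId>

    # Force-kill the YARN application if the command times out
    yarn application -kill <applicationId>

    # Dry-run restore from the savepoint directory containing _metadata
    flink run -s hdfs:///flink/savepoints/savepoint-xxxx myJob.jar

    # flink-conf.yaml: state handles smaller than this threshold are stored
    # inline in the checkpoint/savepoint _metadata file rather than as
    # separate files, which is why a _metadata-only directory can still be
    # a complete, restorable savepoint for small per-task state.
    state.backend.fs.memory-threshold: 20kb
    state.checkpoints.num-retained: 1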