Thanks for sharing the logs with me. It looks as if the total size of the savepoint is 335 KB for a job with a parallelism of 60 and a total of 120 tasks. Hence, the average state size per task is roughly 3 KB to 6 KB (335 KB / 120 ≈ 2.8 KB, 335 KB / 60 ≈ 5.6 KB). I think that the state size threshold (`state.backend.fs.memory-threshold`) refers to the size of the per-task state, so I believe that the _metadata file should contain all of your state. Have you tried restoring from this savepoint?
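
To make the arithmetic concrete, here is a minimal sketch that divides the savepoint size by the number of tasks and compares the result with a threshold. The function names are made up for illustration, and the 20 KB threshold is only an assumed value; please check what `state.backend.fs.memory-threshold` is actually set to in your configuration. It also assumes that the threshold applies to the per-task state, which is my reading and not something I have verified.

```python
def avg_state_per_task_kb(total_kb: float, num_tasks: int) -> float:
    """Average state size per task in KB (hypothetical helper)."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return total_kb / num_tasks


def fits_inline(state_kb: float, threshold_kb: float) -> bool:
    """True if state of this size is below the threshold and would
    therefore be stored inline in the _metadata file."""
    return state_kb < threshold_kb


TOTAL_SAVEPOINT_KB = 335    # savepoint size seen in the logs
ASSUMED_THRESHOLD_KB = 20   # assumption, not read from the job's config

if __name__ == "__main__":
    # 60 is the parallelism, 120 the total number of tasks.
    for num_tasks in (60, 120):
        avg = avg_state_per_task_kb(TOTAL_SAVEPOINT_KB, num_tasks)
        inline = fits_inline(avg, ASSUMED_THRESHOLD_KB)
        print(f"{num_tasks} tasks: {avg:.1f} KB per task, inline={inline}")
```

With either task count the average stays below the assumed threshold, which is why a savepoint directory containing only the _metadata file would not be surprising.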

Cheers,
Till

On Tue, Sep 29, 2020 at 3:47 PM Paul Lam <paullin3...@gmail.com> wrote:

> Hi Till,
>
> Thanks for your quick reply.
>
> The checkpoint/savepoint size would be around 2MB, which is larger than
> `state.backend.fs.memory-threshold`.
>
> The jobmanager logs are attached, which looks normal to me.
>
> Thanks again!
>
> Best,
> Paul Lam
>
> On Tue, Sep 29, 2020 at 8:32 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> Hi Paul,
>>
>> could you share with us the logs of the JobManager? They might help to
>> better understand in which order each operation occurred.
>>
>> How big are you expecting the size of the state to be? If it is smaller
>> than state.backend.fs.memory-threshold, then the state data will be stored
>> in the _metadata file.
>>
>> Cheers,
>> Till
>>
>> On Tue, Sep 29, 2020 at 1:52 PM Paul Lam <paullin3...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> We have a Flink job that was stopped erroneously with no available
>>> checkpoint/savepoint to restore,
>>> and are looking for some help to narrow down the problem.
>>>
>>> How we ran into this problem:
>>>
>>> We stopped the job using cancel with savepoint command (for
>>> compatibility issue), but the command
>>> timed out after 1 min because there was some backpressure. So we force
>>> kill the job by yarn kill command.
>>> Usually, this would not cause troubles because we can still use the last
>>> checkpoint to restore the job.
>>>
>>> But at this time, the last checkpoint dir was cleaned up and empty (the
>>> retained checkpoint number was 1).
>>> According to zookeeper and the logs, the savepoint finished (job master
>>> logged “Savepoint stored in …”)
>>> right after the cancel timeout. However, the savepoint directory
>>> contains only _metadata file, and other
>>> state files referred by metadata are absent.
>>>
>>> Environment & Config:
>>> - Flink 1.11.0
>>> - YARN job cluster
>>> - HA via zookeeper
>>> - FsStateBackend
>>> - Aligned non-incremental checkpoint
>>>
>>> Any comments and suggestions are appreciated! Thanks!
>>>
>>> Best,
>>> Paul Lam