I tried to reproduce the issue and I see that the folder grows (because of the underlying FS) but the files under shared/ are removed. With large state, it takes quite some time though. Do you see any errors/warnings in the logs while stopping the job?
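One way to tell whether files under shared/ are actually being removed, independent of the folder size the file system reports, is to snapshot the directory before and after stopping the job and compare. Below is a minimal sketch, assuming the checkpoint directory is reachable as a local or mounted path. The function names are hypothetical helpers and not part of Flink.

```python
import os

def snapshot_shared(shared_dir):
    """Return {relative_path: (size_bytes, mtime)} for every file under shared_dir."""
    files = {}
    for root, _dirs, names in os.walk(shared_dir):
        for name in names:
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except FileNotFoundError:
                # The file vanished between listing and stat; cleanup is asynchronous.
                continue
            files[os.path.relpath(path, shared_dir)] = (st.st_size, st.st_mtime)
    return files

def diff_snapshots(before, after):
    """Report which files were removed, added, or kept between two snapshots."""
    return {
        "removed": sorted(set(before) - set(after)),
        "added": sorted(set(after) - set(before)),
        "kept": sorted(set(before) & set(after)),
    }

def total_bytes(snapshot):
    """Sum of file sizes in a snapshot, as seen by stat (not by the FS quota)."""
    return sum(size for size, _mtime in snapshot.values())
```

Taking one snapshot before stopping the job and another some time after (since removal is asynchronous) shows whether old files end up in `removed`. Files that stay in `kept` with an old modification time would be candidates for leftovers.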
Could you please share:
- the commands or API you use to start and stop the job
- the Flink version
- the API you use to choose the job ID?

Regards,
Roman

On Tue, Aug 31, 2021 at 10:07 PM Alexey Trenikhun <yen...@msn.com> wrote:
>
> I'm running Flink in Application Mode and set jobId explicitly
>
> ________________________________
> From: Khachatryan Roman <khachatryan.ro...@gmail.com>
> Sent: Monday, August 30, 2021 7:16 AM
> To: Alexey Trenikhun <yen...@msn.com>
> Cc: Matthias Pohl <matth...@ververica.com>; Flink User Mail List
> <user@flink.apache.org>; sjwies...@gmail.com <sjwies...@gmail.com>
> Subject: Re: checkpoints/.../shared cleanup
>
> Hi,
>
> I think the documentation is correct. Once the job is stopped with
> savepoint, any of its "regular" checkpoints are discarded, and as a
> result any shared state gets unreferenced and is also discarded.
> Savepoints currently do not have shared state.
>
> Furthermore, the new job should have a new ID and therefore a new folder.
> Are you referring to the old folders?
>
> However, the removal process is asynchronous and the client doesn't
> wait for all the artifacts to be removed.
> Then the cluster will wait for removal to complete before termination.
> Are you running Flink in session mode?
>
> Regards,
> Roman
>
> On Fri, Aug 27, 2021 at 8:05 AM Alexey Trenikhun <yen...@msn.com> wrote:
> >
> > "the shared subfolder still grows" - while upgrading the job, we cancel
> > the job with savepoint. My expectation is that Flink will clean the
> > checkpoint including the shared directory, since checkpoints are not
> > retained. Then we start the upgraded job from the savepoint; however,
> > when I look into the shared folder I see older files from the previous
> > version of the job.
> > This upgrade process is repeated again; as a result, the shared
> > subfolder grows and grows
> >
> > Thanks,
> > Alexey
> > ________________________________
> > From: Alexey Trenikhun <yen...@msn.com>
> > Sent: Thursday, August 26, 2021 6:37:27 PM
> > To: Matthias Pohl <matth...@ververica.com>
> > Cc: Flink User Mail List <user@flink.apache.org>; sjwies...@gmail.com
> > <sjwies...@gmail.com>
> > Subject: Re: checkpoints/.../shared cleanup
> >
> > Hi Matthias,
> >
> > I don't use externalized checkpoints (from Flink UI Persist Checkpoints
> > Externally: Disabled), so why do you think checkpoint(s) should be retained?
> > It kind of contradicts the documentation [1] - Checkpoints are by default
> > not retained and are only used to resume a job from failures.
> >
> > [1] -
> > https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#retained-checkpoints
> >
> > Thanks,
> > Alexey
> > ________________________________
> > From: Matthias Pohl <matth...@ververica.com>
> > Sent: Thursday, August 26, 2021 5:42 AM
> > To: Alexey Trenikhun <yen...@msn.com>
> > Cc: Flink User Mail List <user@flink.apache.org>; sjwies...@gmail.com
> > <sjwies...@gmail.com>
> > Subject: Re: checkpoints/.../shared cleanup
> >
> > Hi Alexey,
> > thanks for reaching out to the community. I have a question: What do you
> > mean by "the shared subfolder still grows"? As far as I understand, the
> > shared folder contains the state of incremental checkpoints.
> > If you cancel the corresponding job and start a new job from one of the
> > retained incremental checkpoints, it is required for the shared folder of
> > the previous job to be still around since it contains the state. The new job
> > would then create its own shared subfolder. Any new incremental checkpoints
> > will write their state into the new job's shared subfolder while still
> > relying on shared state of the previous job for older data. The RocksDB
> > Backend is in charge of consolidating the incremental state.
> >
> > Hence, you should be careful with removing the shared folder in case you're
> > planning to restart the job later on.
> >
> > I'm adding Seth to this thread. He might have more insights and/or correct
> > my limited knowledge of the incremental checkpoint process.
> >
> > Best,
> > Matthias
> >
> > On Wed, Aug 25, 2021 at 1:39 AM Alexey Trenikhun <yen...@msn.com> wrote:
> >
> > Hello,
> > I use incremental checkpoints, not externalized. Should the content of
> > checkpoint/.../shared be removed when I cancel the job (or cancel with
> > savepoint)? Looks like in our case shared continues to grow...
> >
> > Thanks,
> > Alexey
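
The behavior discussed in this thread is that shared files stay around as long as some checkpoint references them, and become eligible for deletion once the last referencing checkpoint is discarded. A toy reference-counting model can illustrate this. It is a sketch of the idea only, not Flink's actual implementation, and the class and method names are hypothetical.

```python
from collections import Counter

class SharedStateModel:
    """Toy model: shared files are reference-counted across checkpoints."""

    def __init__(self):
        self.refs = Counter()    # file name -> number of checkpoints referencing it
        self.checkpoints = {}    # checkpoint id -> set of shared files it uses

    def register_checkpoint(self, cp_id, files):
        """Record a checkpoint and bump the reference count of each file it uses."""
        files = set(files)
        self.checkpoints[cp_id] = files
        for f in files:
            self.refs[f] += 1

    def discard_checkpoint(self, cp_id):
        """Discard a checkpoint; return the files no checkpoint references anymore."""
        unreferenced = []
        for f in self.checkpoints.pop(cp_id):
            self.refs[f] -= 1
            if self.refs[f] == 0:
                del self.refs[f]
                unreferenced.append(f)
        return sorted(unreferenced)

    def live_files(self):
        """Files still referenced by at least one checkpoint."""
        return sorted(self.refs)
```

In this model, discarding every checkpoint of a job (as happens when checkpoints are not retained and the job is stopped) leaves `live_files()` empty. Files that remain on disk after that point would therefore be leftovers rather than state that is still needed.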