shared cleanup

Roman Khachatryan Mon, 06 Sep 2021 16:21:20 -0700

I tried to reproduce the issue and I see that the folder grows
(because of the underlying FS) but the files under shared/ are
removed. With large state, it takes quite some time though. Do you see
any errors/warnings in the logs while stopping the job?
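
One way to check whether the files under shared/ are eventually removed after the job stops is to poll the directory for a while. The following is only a minimal sketch: it assumes the checkpoint storage is reachable as a local or mounted path and that the layout is <checkpoint-dir>/<job-id>/shared. The names dir_stats and wait_until_empty are hypothetical helpers for illustration, not part of Flink.

```python
import os
import time


def dir_stats(path):
    """Return (file_count, total_bytes) for all regular files under path."""
    count = 0
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
                count += 1
            except OSError:
                # The file may be deleted between listing and stat; skip it.
                pass
    return count, total


def wait_until_empty(path, timeout_s=600.0, interval_s=5.0):
    """Poll path until it contains no files or the timeout expires.

    Returns True if the directory is empty (or does not exist),
    False if files are still present when the timeout is reached.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        count, total = dir_stats(path)
        print(f"{path}: {count} files, {total} bytes")
        if count == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling wait_until_empty on the old job's shared directory right after stopping the job shows whether removal is merely slow (the counts keep dropping) or not happening at all (the counts stay flat until the timeout).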


Could you please share:
- the commands or API you use to start and stop the job
- the Flink version
- the API you use to set the job ID


Regards,
Roman

On Tue, Aug 31, 2021 at 10:07 PM Alexey Trenikhun <yen...@msn.com> wrote:
>
> I'm running Flink in Application Mode and set jobId explicitly
>
> ________________________________
> From: Khachatryan Roman <khachatryan.ro...@gmail.com>
> Sent: Monday, August 30, 2021 7:16 AM
> To: Alexey Trenikhun <yen...@msn.com>
> Cc: Matthias Pohl <matth...@ververica.com>; Flink User Mail List 
> <user@flink.apache.org>; sjwies...@gmail.com <sjwies...@gmail.com>
> Subject: Re: checkpoints/.../shared cleanup
>
> Hi,
>
> I think the documentation is correct. Once the job is stopped with
> savepoint, any of its "regular" checkpoints are discarded, and as a
> result any shared state gets unreferenced and is also discarded.
> Savepoints currently do not have shared state.
>
> Furthermore, the new job should have a new ID and therefore a new folder.
> Are you referring to the old folders?
>
> However, the removal process is asynchronous and the client doesn't
> wait for all the artifacts to be removed.
> The cluster, on the other hand, waits for removal to complete before terminating.
> Are you running Flink in session mode?
>
> Regards,
> Roman
>
> On Fri, Aug 27, 2021 at 8:05 AM Alexey Trenikhun <yen...@msn.com> wrote:
> >
> > "the shared subfolder still grows" - while upgrading the job, we cancel it 
> > with a savepoint. My expectation is that Flink will clean up the checkpoint, 
> > including the shared directory, since checkpoints are not retained. Then we 
> > start the upgraded job from the savepoint; however, when I look into the 
> > shared folder I see older files from the previous version of the job. This 
> > upgrade process is repeated again and again, and as a result the shared 
> > subfolder grows and grows.
> >
> > Thanks,
> > Alexey
> > ________________________________
> > From: Alexey Trenikhun <yen...@msn.com>
> > Sent: Thursday, August 26, 2021 6:37:27 PM
> > To: Matthias Pohl <matth...@ververica.com>
> > Cc: Flink User Mail List <user@flink.apache.org>; sjwies...@gmail.com 
> > <sjwies...@gmail.com>
> > Subject: Re: checkpoints/.../shared cleanup
> >
> > Hi Matthias,
> >
> > I don't use externalized checkpoints (the Flink UI shows Persist Checkpoints 
> > Externally: Disabled), so why do you think checkpoint(s) should be retained? 
> > It seems to contradict the documentation [1]: "Checkpoints are by default 
> > not retained and are only used to resume a job from failures."
> >
> > [1] - 
> > https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/state/checkpoints/#retained-checkpoints
> >
> > Thanks,
> > Alexey
> > ________________________________
> > From: Matthias Pohl <matth...@ververica.com>
> > Sent: Thursday, August 26, 2021 5:42 AM
> > To: Alexey Trenikhun <yen...@msn.com>
> > Cc: Flink User Mail List <user@flink.apache.org>; sjwies...@gmail.com 
> > <sjwies...@gmail.com>
> > Subject: Re: checkpoints/.../shared cleanup
> >
> > Hi Alexey,
> > thanks for reaching out to the community. I have a question: What do you 
> > mean by "the shared subfolder still grows"? As far as I understand, the 
> > shared folder contains the state of incremental checkpoints. If you cancel 
> > the corresponding job and start a new job from one of the retained 
> > incremental checkpoints, it is required for the shared folder of the 
> > previous job to be still around since it contains the state. The new job 
> > would then create its own shared subfolder. Any new incremental checkpoints 
> > will write their state into the new job's shared subfolder while still 
> > relying on shared state of the previous job for older data. The RocksDB 
> > Backend is in charge of consolidating the incremental state.
> >
> > Hence, you should be careful with removing the shared folder in case you're 
> > planning to restart the job later on.
> >
> > I'm adding Seth to this thread. He might have more insights and/or correct 
> > my limited knowledge of the incremental checkpoint process.
> >
> > Best,
> > Matthias
> >
> > On Wed, Aug 25, 2021 at 1:39 AM Alexey Trenikhun <yen...@msn.com> wrote:
> >
> > Hello,
> > I use incremental checkpoints, not externalized. Should the content of 
> > checkpoint/.../shared be removed when I cancel the job (or cancel with 
> > savepoint)? It looks like in our case shared continues to grow...
> >
> > Thanks,
> > Alexey
