Hey Yun, thanks for the answer! How would you analyze the checkpoint metadata?
Would you build a program with the State Processor API library, or is there a
better way to do it? I believe the option you mention would indeed facilitate
cleaning: it would still be manual (since we can't set up a periodic deletion),
but at least we could safely remove old folders.
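
For context, here is the kind of (untested) sketch I had in mind, reading a
checkpoint's _metadata file directly with the internal flink-runtime classes
rather than the State Processor API. Everything in it is an assumption on my
side: the bucket path is a placeholder, these classes are internal rather than
a stable public API, and the loadCheckpointMetadata signature differs between
Flink versions (this is the 1.13-era one):

import java.io.DataInputStream;

import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.checkpoint.Checkpoints;
import org.apache.flink.runtime.checkpoint.OperatorState;
import org.apache.flink.runtime.checkpoint.OperatorSubtaskState;
import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;
import org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle;
import org.apache.flink.runtime.state.KeyedStateHandle;
import org.apache.flink.runtime.state.filesystem.FileStateHandle;

public class CheckpointMetadataDump {

    public static void main(String[] args) throws Exception {
        // Placeholder: point this at the _metadata file of a retained checkpoint.
        Path metadataFile =
                new Path("s3://my-bucket/checkpoints/<job-id>/chk-42/_metadata");

        try (DataInputStream in =
                new DataInputStream(metadataFile.getFileSystem().open(metadataFile))) {

            // Deserialize the checkpoint metadata (internal API, version-dependent).
            CheckpointMetadata metadata = Checkpoints.loadCheckpointMetadata(
                    in,
                    CheckpointMetadataDump.class.getClassLoader(),
                    metadataFile.toString());

            for (OperatorState operatorState : metadata.getOperatorStates()) {
                for (OperatorSubtaskState subtask : operatorState.getStates()) {
                    for (KeyedStateHandle handle : subtask.getManagedKeyedState()) {
                        // Incremental RocksDB state references files in /shared.
                        if (handle instanceof IncrementalRemoteKeyedStateHandle) {
                            ((IncrementalRemoteKeyedStateHandle) handle)
                                    .getSharedState()
                                    .values()
                                    .forEach(h -> {
                                        if (h instanceof FileStateHandle) {
                                            System.out.println(
                                                    ((FileStateHandle) h).getFilePath());
                                        }
                                    });
                        }
                    }
                }
            }
        }
    }
}

If something like this printed every /shared file still referenced by the
newest checkpoint, we could then delete anything in the older directories that
is not on that list.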
Thanks,
Robin

On Fri, Sep 3, 2021 at 18:21, Yun Tang <myas...@live.com> wrote:

> Hi Robin,
>
> It's not easy to clean incremental checkpoints, as different job instances
> have different checkpoint sub-directories (due to different job ids). You
> could analyze your checkpoint metadata to see which files in older
> checkpoint directories are still useful.
>
> BTW, I am also thinking of a possible solution: provide the ability to
> re-upload all files under a specific configuration option, so that a new
> job gets decoupled from older checkpoints. Do you think that could resolve
> your case?
>
> Best
> Yun Tang
> ------------------------------
> *From:* Robin Cassan <robin.cas...@contentsquare.com>
> *Sent:* Wednesday, September 1, 2021 17:38
> *To:* Robert Metzger <rmetz...@apache.org>
> *Cc:* user <user@flink.apache.org>
> *Subject:* Re: Cleaning old incremental checkpoint files
>
> Thanks Robert for your answer, this seems to be what we observed the first
> time we tried to delete: Flink complained about missing files.
> I'm wondering, then, how people are cleaning their storage for incremental
> checkpoints. When using TTLs, is there any guarantee that after the TTL
> has expired, no file older than the TTL will still be needed in the shared
> folder?
>
> On Tue, Aug 3, 2021 at 13:29, Robert Metzger <rmetz...@apache.org> wrote:
>
> Hi Robin,
>
> Let's say you have two checkpoints #1 and #2, where #1 has been created by
> an old version of your job, and #2 has been created by the new version.
> When can you delete #1?
> In #1, there's a directory "/shared" that contains data that is also used
> by #2, because of the incremental nature of the checkpoints.
>
> You can not delete the data in the /shared directory, as this data is
> potentially still in use.
>
> I know this is only a partial answer to your question. I'll try to find
> out more details and extend my answer later.
>
>
> On Thu, Jul 29, 2021 at 2:31 PM Robin Cassan <
> robin.cas...@contentsquare.com> wrote:
>
> Hi all!
>
> We've happily been running a Flink job in production for a year now, with
> the RocksDB state backend and incremental retained checkpointing on S3. We
> often release new versions of our jobs, which means we cancel the running
> one and submit another while restoring the previous jobId's last retained
> checkpoint.
>
> This works fine, but we also need to clean old files from S3, which are
> starting to pile up. We are wondering two things:
> - once the newer job has restored the older job's checkpoint, is it safe
> to delete it? Or will the newer job's checkpoints reference files from the
> older job, in which case deleting the old checkpoints might cause errors
> during the next restore?
> - also, since all our state has a 7-day TTL, is it safe to set a 7- or
> 8-day retention policy on S3 which would automatically clean old files, or
> could we still need to retain files older than 7 days even with the TTL?
>
> Don't hesitate to ask me if anything is not clear enough!
>
> Thanks,
> Robin