Hey Yun, thanks for the answer! How would you analyze the checkpoint metadata?
Would you build a program with the State Processor API library, or is there a
better way to do it? I believe the option you mention would indeed facilitate
cleaning: it would still be manual (since we can't set up a periodic deletion),
but at least we could safely remove old folders.
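
For context, here is the kind of (untested) sketch I had in mind, reading a
checkpoint's _metadata file directly with the internal flink-runtime classes
rather than the State Processor API. Everything in it is an assumption on my
side: the bucket path is a placeholder, these classes are internal rather than
a stable public API, and the loadCheckpointMetadata signature differs between
Flink versions (this is the 1.13-era one):

import java.io.DataInputStream;

import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.checkpoint.Checkpoints;
import org.apache.flink.runtime.checkpoint.OperatorState;
import org.apache.flink.runtime.checkpoint.OperatorSubtaskState;
import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;
import org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle;
import org.apache.flink.runtime.state.KeyedStateHandle;
import org.apache.flink.runtime.state.filesystem.FileStateHandle;

public class CheckpointMetadataDump {

    public static void main(String[] args) throws Exception {
        // Placeholder: point this at the _metadata file of a retained checkpoint.
        Path metadataFile =
                new Path("s3://my-bucket/checkpoints/<job-id>/chk-42/_metadata");

        try (DataInputStream in =
                new DataInputStream(metadataFile.getFileSystem().open(metadataFile))) {

            // Deserialize the checkpoint metadata (internal API, version-dependent).
            CheckpointMetadata metadata = Checkpoints.loadCheckpointMetadata(
                    in,
                    CheckpointMetadataDump.class.getClassLoader(),
                    metadataFile.toString());

            for (OperatorState operatorState : metadata.getOperatorStates()) {
                for (OperatorSubtaskState subtask : operatorState.getStates()) {
                    for (KeyedStateHandle handle : subtask.getManagedKeyedState()) {
                        // Incremental RocksDB state references files in /shared.
                        if (handle instanceof IncrementalRemoteKeyedStateHandle) {
                            ((IncrementalRemoteKeyedStateHandle) handle)
                                    .getSharedState()
                                    .values()
                                    .forEach(h -> {
                                        if (h instanceof FileStateHandle) {
                                            System.out.println(
                                                    ((FileStateHandle) h).getFilePath());
                                        }
                                    });
                        }
                    }
                }
            }
        }
    }
}

If something like this printed every /shared file still referenced by the
newest checkpoint, we could then delete anything in the older directories that
is not on that list.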
Thanks,
Robin

On Fri, Sep 3, 2021 at 18:21, Yun Tang <myas...@live.com> wrote:

> Hi Robin,
>
> It's not easy to clean incremental checkpoints, as different job instances
> have different checkpoint sub-directories (due to different job ids). You
> could analyze your checkpoint metadata to see which files in older
> checkpoint directories are still useful.
>
> BTW, I am also thinking of a possible solution: provide the ability to
> re-upload all files under a specific configuration option, so that a new
> job gets decoupled from older checkpoints. Do you think that could resolve
> your case?
>
> Best
> Yun Tang
> ------------------------------
> *From:* Robin Cassan <robin.cas...@contentsquare.com>
> *Sent:* Wednesday, September 1, 2021 17:38
> *To:* Robert Metzger <rmetz...@apache.org>
> *Cc:* user <user@flink.apache.org>
> *Subject:* Re: Cleaning old incremental checkpoint files
>
> Thanks Robert for your answer, this seems to be what we observed the first
> time we tried to delete: Flink complained about missing files.
> I'm wondering, then, how people are cleaning their storage for incremental
> checkpoints. When using TTLs, is there any guarantee that after the TTL
> has expired, no file older than the TTL will still be needed in the shared
> folder?
>
> On Tue, Aug 3, 2021 at 13:29, Robert Metzger <rmetz...@apache.org> wrote:
>
> Hi Robin,
>
> Let's say you have two checkpoints #1 and #2, where #1 has been created by
> an old version of your job, and #2 has been created by the new version.
> When can you delete #1?
> In #1, there's a directory "/shared" that contains data that is also used
> by #2, because of the incremental nature of the checkpoints.
>
> You can not delete the data in the /shared directory, as this data is
> potentially still in use.
>
> I know this is only a partial answer to your question. I'll try to find
> out more details and extend my answer later.
>
>
> On Thu, Jul 29, 2021 at 2:31 PM Robin Cassan <
> robin.cas...@contentsquare.com> wrote:
>
> Hi all!
>
> We've happily been running a Flink job in production for a year now, with
> the RocksDB state backend and incremental retained checkpointing on S3. We
> often release new versions of our jobs, which means we cancel the running
> one and submit another while restoring the previous jobId's last retained
> checkpoint.
>
> This works fine, but we also need to clean old files from S3, which are
> starting to pile up. We are wondering two things:
> - once the newer job has restored the older job's checkpoint, is it safe
> to delete it? Or will the newer job's checkpoints reference files from the
> older job, in which case deleting the old checkpoints might cause errors
> during the next restore?
> - also, since all our state has a 7-day TTL, is it safe to set a 7- or
> 8-day retention policy on S3 which would automatically clean old files, or
> could we still need to retain files older than 7 days even with the TTL?
>
> Don't hesitate to ask me if anything is not clear enough!
>
> Thanks,
> Robin