Hi Robin,

Let's say you have two checkpoints, #1 and #2, where #1 was created by an old version of your job and #2 by the new version. When can you delete #1? In #1, there is a directory "/shared" containing data that is also used by #2, because the checkpoints are incremental.
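
To make that dependency concrete, here is a minimal Python sketch. It is only a toy model and does not read Flink's actual checkpoint metadata: the file names and the `still_referenced` helper are made up for illustration, and it assumes each checkpoint can be described by the set of shared files it references.

```python
# Toy model of incremental checkpoints; all names here are hypothetical.
# Each checkpoint is represented by the set of shared files it references.
checkpoint_1 = {"shared/sst-a", "shared/sst-b"}  # written by the old job
checkpoint_2 = {"shared/sst-a", "shared/sst-b", "shared/sst-c"}  # new job, reuses a and b


def still_referenced(old, live_checkpoints):
    """Return the files of `old` that at least one live checkpoint still references."""
    live = set().union(*live_checkpoints)
    return old & live


# Files created for checkpoint #1 that checkpoint #2 still depends on.
in_use = still_referenced(checkpoint_1, [checkpoint_2])
print(sorted(in_use))  # ['shared/sst-a', 'shared/sst-b']
```

In this model, removing everything that belongs to #1 would also remove files that #2 needs for a restore.
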
You cannot delete the data in the /shared directory, as this data is potentially still in use.

I know this is only a partial answer to your question. I'll try to find out more details and extend my answer later.

On Thu, Jul 29, 2021 at 2:31 PM Robin Cassan <robin.cas...@contentsquare.com> wrote:

> Hi all!
>
> We've happily been running a Flink job in production for a year now, with
> the RocksDB state backend and incremental retained checkpointing on S3. We
> often release new versions of our jobs, which means we cancel the running
> one and submit another while restoring the previous jobId's last retained
> checkpoint.
>
> This works fine, but we also need to clean old files from S3 which are
> starting to pile up. We are wondering two things:
> - once the newer job has restored the older job's checkpoint, is it safe
> to delete it? Or will the newer job's checkpoints reference files from the
> older job, in which case deleting the old checkpoints might cause errors
> during the next restore?
> - also, since all our state has a 7 days TTL, is it safe to set a 7 or 8
> days retention policy on S3 which would automatically clean old files, or
> could we still need to retain files older than 7 days even with the TTL?
>
> Don't hesitate to ask me if anything is not clear enough!
>
> Thanks,
> Robin