Hello, I have some confusion about checkpoints vs savepoints, and how to use them effectively in my application.
I am working on an application that relies on Flink's fault-tolerance mechanism to ensure exactly-once semantics. I have enabled external checkpointing in my application as below:

    env.enableCheckpointing(CHECKPOINT_TIME_MS)
    env.setStateBackend(new RocksDBStateBackend(CHECKPOINT_LOCATION))
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(CHECKPOINT_MIN_PAUSE)
    env.getCheckpointConfig.setCheckpointTimeout(CHECKPOINT_TIMEOUT_MS)
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(CHECKPOINT_MAX_CONCURRENT)

Please correct me in case I am wrong, but the above ensures that if the application crashes, it is able to recover from the last known location. This, however, won't work if we cancel the application (for new deployments/restarts).

The article <https://data-artisans.com/blog/turning-back-time-savepoints> on savepoints hints that it is good practice to take savepoints at regular intervals (by cron <https://medium.com/@visualskyrim/try-out-the-save-point-in-apache-flink-88b0140b50cd> etc.) so that the application can be restarted from a last known location. It also points to using the command-line option (-s) when cancelling an application, so that the application stops after saving a savepoint.

Based on the above understanding, I have some questions below.

Questions:

1. It seems to me that checkpoints can be treated as Flink's internal recovery mechanism, while savepoints act more as user-defined recovery points. Would that be a correct assumption?

2. When cancelling an application with the -s option, one specifies the savepoint location. Is there a way during application startup to identify the last known savepoint in a folder automatically, and restart from there? Since I am saving my savepoints on S3, I want to avoid issues arising from the *ls* command on S3 due to the read-after-write consistency of S3.

3. Suppose my application has a checkpoint at point t1, and say I cancel this application sometime in the future before the next available checkpoint (say at t1+x). If I start the application without specifying a savepoint, it will start from the last known checkpoint (at t1), which won't contain the application state as of the cancellation at t1+x. Would this be a correct assumption?

4. Would using ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION be the same as manually saving regular savepoints?

Please let me know.

Thanks,
Vipul