Hello,

I have some confusion about checkpoints vs savepoints, and how to use them
effectively in my application.

I am working on an application which relies on Flink's fault tolerance
mechanism to ensure exactly-once semantics. I have enabled external
checkpointing in my application as below:

env.enableCheckpointing(CHECKPOINT_TIME_MS)

env.setStateBackend(new RocksDBStateBackend(CHECKPOINT_LOCATION))

env.getCheckpointConfig.setMinPauseBetweenCheckpoints(CHECKPOINT_MIN_PAUSE)
env.getCheckpointConfig.setCheckpointTimeout(CHECKPOINT_TIMEOUT_MS)
env.getCheckpointConfig.setMaxConcurrentCheckpoints(CHECKPOINT_MAX_CONCURRENT)

Please correct me in case I am wrong, but the above ensures that if the
application crashes, it is able to recover from the last known location.
This, however, won't work if we cancel the application (for new
deployments/restarts).

Reading this post <https://data-artisans.com/blog/turning-back-time-savepoints>
about savepoints suggests that it is good practice to take savepoints at
regular intervals (by cron
<https://medium.com/@visualskyrim/try-out-the-save-point-in-apache-flink-88b0140b50cd>,
etc.) so that the application can be restarted from a last known location.
It also points to using the command line option (-s) when cancelling an
application, so that the application stops after taking a savepoint. Based
on the above understanding I have some questions below:

Questions:

   1. It seems to me that checkpoints can be treated as Flink's internal
   recovery mechanism, and savepoints act more as user-defined recovery
   points. Would that be a correct assumption?
   2. When cancelling an application with the -s option, one specifies the
   savepoint location. Is there a way during application startup to identify
   the last known savepoint in a folder automatically, and restart from there?
   Since I am saving my savepoints on S3, I want to avoid issues arising from
   the *ls* command on S3 due to the read-after-write consistency of S3.
   3. Suppose my application has a checkpoint at point t1, and say I cancel
   this application sometime in the future, before the next checkpoint (say
   at t1+x). If I start the application without specifying a savepoint, it
   will start from the last known checkpoint (at t1), which won't have the
   latest application state saved, since I had cancelled the application.
   Would this be a correct assumption?
   4. Would using ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION be
   the same as manually saving regular savepoints?
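
For reference on question 4, this is the setting I mean. It is only a
minimal sketch, assuming the Scala DataStream API of the Flink 1.3/1.4
line; the 60-second interval is a placeholder, not my actual value:

```scala
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object RetainedCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Placeholder interval (ms); stands in for CHECKPOINT_TIME_MS above.
    env.enableCheckpointing(60000L)

    // Ask Flink to keep the externalized checkpoint when the job is
    // cancelled, instead of deleting it on cancellation.
    env.getCheckpointConfig.enableExternalizedCheckpoints(
      ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)

    // ... sources, operators and env.execute(...) would follow here.
  }
}
```

As far as I understand, in these versions the metadata directory for
externalized checkpoints also has to be set through state.checkpoints.dir
in flink-conf.yaml, but I may be wrong on that.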


Please let me know.

Thanks,
Vipul
