How to reliably verify Flink checkpoint completeness

Prateek Kohli Mon, 02 Feb 2026 06:57:22 -0800

Hi everyone,

I'm implementing active-passive HA across two sites for a Flink streaming
job with large RocksDB state. I continuously sync the checkpoint
directories from primary → secondary using an rsync library. On primary
failure, I want to auto-start the job on secondary from the latest complete
checkpoint. For that, I need a reliable way to check that a checkpoint is
fully complete before using it for job recovery.


*My Understanding & Concern*
As I understand it, the _metadata file is written last by the JobManager,
after all TaskManagers have acknowledged the checkpoint. But even if
_metadata exists, it may have been only partially written (a crash
mid-write, or rsync copying an intermediate, half-written file).
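
To make the concern concrete, here is a minimal sketch of the naive check in
question, assuming the usual chk-<n>/_metadata layout under the job's
checkpoint directory. The function name and the directory argument are
hypothetical, for illustration only. It only confirms that a non-empty
_metadata file exists, so it cannot tell a complete file from a truncated
one:

```python
import re
from pathlib import Path
from typing import Optional

# Matches checkpoint directory names such as "chk-42".
CHK_DIR = re.compile(r"^chk-(\d+)$")


def latest_checkpoint_with_metadata(job_checkpoint_dir: Path) -> Optional[Path]:
    """Return the highest-numbered chk-<n> directory holding a non-empty
    _metadata file, or None if there is no such directory.

    This is an existence check only: a partially written or partially
    synced _metadata file still passes it.
    """
    candidates = []
    for entry in job_checkpoint_dir.iterdir():
        match = CHK_DIR.match(entry.name)
        if match and entry.is_dir():
            candidates.append((int(match.group(1)), entry))

    # Sort numerically so that chk-10 ranks above chk-9.
    for _, chk in sorted(candidates, key=lambda c: c[0], reverse=True):
        metadata = chk / "_metadata"
        if metadata.is_file() and metadata.stat().st_size > 0:
            return chk
    return None
```

On the secondary site this would be called with the synced copy of the
job's checkpoint directory, and the returned path would be the candidate
passed to the restore step.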

*Questions*

   1. Is there a definitive way to verify checkpoint completeness,
   something beyond just checking that the _metadata file exists?
   2. If I start a job with an incomplete _metadata file:

   - Does Flink fail immediately during startup?
   - Or does it retry with multiple checkpoints before failing? (I tried
   corrupting the _metadata file and the job always failed immediately.
   Still, is there a scenario where it retries before failing?)

   3. Are there any other markers that indicate a checkpoint is fully
   completed and safe to resume from?


Thanks

Prateek
