Hi everyone, I'm implementing active-passive HA across two sites for a Flink streaming job with large RocksDB state. I sync checkpoint directories via rsync library from primary → secondary continuously. On primary failure, I want to auto-start the job on secondary using the latest complete checkpoint. For that, I need a reliable way to check if a checkpoint is fully complete before using it for job recovery.
*My Understanding & Concern* As I understand, the _metadata file is created last by JobManager after all TaskManagers acknowledge. But even if _metadata exists, there's a chance it was partially written (crash mid-write/rsync copied an intermittent half file). *Questions* 1. Is there a definitive way to verify checkpoint completeness? Something beyond just checking if _metadata file exists? 2. If I start a job with incomplete _metadata: - Does Flink fail immediately during startup? - Or does it retry multiple checkpoints before failing? (Tried to corrupt the _metadata file but always failed immediately, still, can there be a scenario of retrying before failing?) 3. Any other markers that indicate a checkpoint is fully completed and safe to resume from? Thanks Prateek
