Hello Flink Users,

We have enabled the Kubernetes HPA for our Flink applications (TaskManagers only). The applications run in Reactive Mode.

Whenever the Kubernetes controller scales our job's TaskManagers up or down, we get an alert for failed checkpoints. The failures are transient: subsequent checkpoints complete successfully and checkpointing continues to make progress.

We would like to identify what causes these failed checkpoints during HPA scale-up/scale-down activity. Below is an example log excerpt from around a checkpoint failure:
2022-12-13 05:08:22.339 [jobmanager-io-thread-1] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 40393 for job 00000000000000000000000000000000 (488170 bytes, checkpointDuration=2582 ms, finalizationTime=322 ms).
2022-12-13 05:08:28.083 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointFailureManager - Failed to trigger checkpoint for job 00000000000000000000000000000000 since Checkpoint triggering task Source: Custom Source -> Sink: Unnamed (1/79) of job 00000000000000000000000000000000 is not being executed at the moment. Aborting checkpoint. Failure reason: Not all required tasks are currently running..
2022-12-13 05:09:19.437 [Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 40394 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1670908159435 for job 00000000000000000000000000000000.
2022-12-13 05:09:25.208 [jobmanager-io-thread-1] INFO org.apache.flink.fs.s3.common.writer.S3Committer - Committing flink-ingest-sps-nv-consumer/2022-11-15T01:10:30Z/00000000000000000000000000000000/chk-40394/_metadata with MPU ID _3vKXSVBMuBM7207EpGvCXOTRQskAiPPj88DSTTn55Uzuc_76dnubmTAPBovyWbKBKU8Wxqz6SuFBJ8cZnAOH_PkGEP36KJzMFYYPmT.xZvmLnM.YX1oJSHN3VP1TXpJECY8y80psYvRWvbt2e8CMeoa9JiOWiGYGRmqLGRdlQA-
2022-12-13 05:09:25.747 [jobmanager-io-thread-1] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 40394 for job 00000000000000000000000000000000 (482850 bytes, checkpointDuration=5982 ms, finalizationTime=330 ms).

Varun