Failed Checkpoints when Flink HPA enabled on Kubernetes

Varun Narayanan Chakravarthy via user Tue, 24 Jan 2023 12:26:04 -0800

Hello Flink Users,
We have enabled the Kubernetes Horizontal Pod Autoscaler (HPA) for our Flink
applications (Task Managers only). The applications run in Reactive Mode.
Whenever the Kubernetes controller scales our job's Task Managers up or down,
we get an alert for failed checkpoints. The failure is transient: subsequent
checkpoints continue to complete successfully. We would like to identify what
causes these failed checkpoints during HPA scale-up/scale-down activity.
Below is an example log from around one such checkpoint failure:


2022-12-13 05:08:22.339 [jobmanager-io-thread-1] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 40393 for job 00000000000000000000000000000000 (488170 bytes, checkpointDuration=2582 ms, finalizationTime=322 ms).
2022-12-13 05:08:28.083 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointFailureManager  - Failed to trigger checkpoint for job 00000000000000000000000000000000 since Checkpoint triggering task Source: Custom Source -> Sink: Unnamed (1/79) of job 00000000000000000000000000000000 is not being executed at the moment. Aborting checkpoint. Failure reason: Not all required tasks are currently running..
2022-12-13 05:09:19.437 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Triggering checkpoint 40394 (type=CheckpointType{name='Checkpoint', sharingFilesStrategy=FORWARD_BACKWARD}) @ 1670908159435 for job 00000000000000000000000000000000.
2022-12-13 05:09:25.208 [jobmanager-io-thread-1] INFO  org.apache.flink.fs.s3.common.writer.S3Committer  - Committing flink-ingest-sps-nv-consumer/2022-11-15T01:10:30Z/00000000000000000000000000000000/chk-40394/_metadata with MPU ID _3vKXSVBMuBM7207EpGvCXOTRQskAiPPj88DSTTn55Uzuc_76dnubmTAPBovyWbKBKU8Wxqz6SuFBJ8cZnAOH_PkGEP36KJzMFYYPmT.xZvmLnM.YX1oJSHN3VP1TXpJECY8y80psYvRWvbt2e8CMeoa9JiOWiGYGRmqLGRdlQA-
2022-12-13 05:09:25.747 [jobmanager-io-thread-1] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed checkpoint 40394 for job 00000000000000000000000000000000 (482850 bytes, checkpointDuration=5982 ms, finalizationTime=330 ms).
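
One way to line these entries up against HPA scaling times is to pull the
checkpoint events out of the Job Manager log. Below is a minimal Python
sketch, assuming the log layout shown in the excerpt above. The function
names `join_wrapped` and `classify` and the event labels are made up for
illustration and are not part of Flink.

```python
import re
from datetime import datetime

# Assumption: every log entry starts with "<date> <time>.<millis> [thread]",
# as in the excerpt above. Lines that do not match are treated as
# continuations of the previous entry (e.g. after mail hard-wrapping).
ENTRY_START = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[")


def join_wrapped(lines):
    """Merge wrapped continuation lines into one string per log entry."""
    entries = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if ENTRY_START.match(text) or not entries:
            entries.append(text)
        else:
            entries[-1] += " " + text
    return entries


def classify(entry):
    """Return (timestamp, kind) for checkpoint-related entries, else None."""
    match = ENTRY_START.match(entry)
    if not match:
        return None
    ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
    # Message fragments copied from the log excerpt; labels are hypothetical.
    if "Failed to trigger checkpoint" in entry:
        return ts, "trigger_failed"
    if "Completed checkpoint" in entry:
        return ts, "completed"
    if "Triggering checkpoint" in entry:
        return ts, "triggered"
    return None


# Two entries from the excerpt, still hard-wrapped as in the mail.
SAMPLE = """\
2022-12-13 05:08:22.339 [jobmanager-io-thread-1] INFO
 org.apache.flink.runtime.checkpoint.CheckpointCoordinator  - Completed
checkpoint 40393 for job 00000000000000000000000000000000 (488170 bytes,
checkpointDuration=2582 ms, finalizationTime=322 ms).
2022-12-13 05:08:28.083 [Checkpoint Timer] INFO
 org.apache.flink.runtime.checkpoint.CheckpointFailureManager  - Failed to
trigger checkpoint for job 00000000000000000000000000000000 since
Checkpoint triggering task Source: Custom Source -> Sink: Unnamed (1/79) of
job 00000000000000000000000000000000 is not being executed at the moment.
""".splitlines()

events = [e for e in map(classify, join_wrapped(SAMPLE)) if e is not None]
for ts, kind in events:
    print(ts.isoformat(timespec="milliseconds"), kind)
```

The resulting timestamps can then be compared by hand with the times at
which the HPA changed the Task Manager replica count.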

Varun
