Hi community,

While using Flink's async I/O to interact with an external system, I got the following exception:

2021-11-06 10:38:35,270 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 54 (type=CHECKPOINT) @ 1636162715262 for job f168a44ea33198cd71783824d49f9554.
2021-11-06 10:38:47,031 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 54 for job f168a44ea33198cd71783824d49f9554 (11930992707 bytes, checkpointDuration=11722 ms, finalizationTime=47 ms).
2021-11-06 10:58:35,270 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 55 (type=CHECKPOINT) @ 1636163915262 for job f168a44ea33198cd71783824d49f9554.
2021-11-06 11:08:35,271 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 55 of job f168a44ea33198cd71783824d49f9554 expired before completing.
2021-11-06 11:08:35,287 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure.

- FYI, I'm using Flink 1.14.0 with unaligned checkpointing and buffer debloating enabled
- checkpoint 55 failed to complete within 10 minutes (the value of execution.checkpointing.timeout)
- the graph below shows that back-pressure skyrocketed around the time checkpoint 55 began

[image: image.png]

I suspect the capacity of the asynchronous operation is the cause, because a limited capacity can cause back-pressure once it is exhausted [1].

Although I could increase the capacity, I would first like to monitor the number of in-flight async I/O requests on Grafana, just like the back-pressure graph above. However, [2] does not list any system metric specific to async I/O.

Best,

Dongwon

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#async-io-api
[2] https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#system-metrics