We've noticed a spike in the start delays of our incremental checkpoints every 15 minutes. The Flink job seems to start out smooth, with checkpoints in the 15s range and negligible start delays. Then every 3rd or 4th checkpoint has a long start delay (~2-3 minutes). The checkpoints in between have negligible start delays and are fast. So:
- 2-3 fast with negligible start delay, total time 15-30s
- 1-2 slow with 2-3 minute start delay, total time 15-30s longer than the start delay

What could cause this? We have a couple output topics that are EXACTLY_ONCE, but I switched them to AT_LEAST_ONCE and continued to see the behavior.

Thanks.
Jai