We've noticed a spike in the start delays in our incremental checkpoints
every 15 minutes.  The Flink job seems to start out smooth, with
checkpoints in the 15s range and negligible start delays.  Then every
3rd or 4th checkpoint has a long start delay (~2-3 minutes).  The
checkpoints in between have negligible start delays and are fast.  So:

2-3 fast checkpoints with negligible start delay, total time 15-30s
1-2 slow checkpoints with a 2-3 minute start delay, total time 15-30s
longer than the start delay

What could cause this?  We have a couple of output topics that are
EXACTLY_ONCE, but I switched them to AT_LEAST_ONCE and still saw the
same behavior.

Thanks.
Jai
