Spike in checkpoint start delay every 15 minutes

Jai Patel Tue, 14 Jun 2022 15:26:47 -0700

We've noticed a spike in the start delays in our incremental checkpoints
every 15 minutes.  The Flink job seems to start out smooth, with
checkpoints in the 15s range and negligible start delays.  Then every
3rd or 4th checkpoint has a long start delay (~2-3 minutes).  The
checkpoints in between have negligible start delays and are fast.  So:


2-3 fast checkpoints with negligible start delay, total time 15-30s
1-2 slow checkpoints with a 2-3 minute start delay, total time 15-30s
longer than the start delay.

What could cause this?  We have a couple of output topics that are
EXACTLY_ONCE, but I switched them to AT_LEAST_ONCE and continued to see the
behavior.

Thanks.
Jai
