Hi Sam

I think the first step is to see which part of your Flink APP is blocking
the completion of Checkpoint. Specifically, you can refer to the
"Checkpoint Details" section of the document [1]. Using these methods, you
should be able to observe where the checkpoint is blocked, for example, it
may be an agg operator of the app, or it may be blocked on the sink of
kafka.
Once you know which operator is blocking, you can use FlameGraph [2] to see
where the bottleneck of the operator is. Then do specific operations.

Hope these help!
[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/monitoring/checkpoint_monitoring/#checkpoint-details
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/flame_graphs/

Best,
Guowei


On Fri, Apr 29, 2022 at 2:10 AM Sam Ch <samc...@gmail.com> wrote:

> Hello,
>
> I am running into checkpoint timeouts and am looking for guidance on
> troubleshooting. What should I be looking at? What configuration parameters
> would affect this? I am afraid I am a Flink newbie so I am still picking up
> the concepts. Additional notes are below, anything else I can provide?
> Thanks.
>
>
> The checkpoint size is small (less than 100kB)
> Multiple flink apps are running on a cluster, only one is running into
> checkpoint timeouts
> Timeout is set to 10 mins
> Tried aligned and unaligned checkpoints
> Tried clearing checkpoints to start fresh
> Plenty of disk space
> Dataflow: kafka source -> flink app -> kafka sink
>

Reply via email to