Hi, Oscar.
Just sharing my thoughts:
Benefits of more aggressive checkpointing:
1. Less recovery time, as you mentioned (this is also related to how much data
Flink has to roll back and reprocess)
2. Lower end-to-end latency for sinks that commit on checkpoint in exactly-once
mode
Costs of more aggressive checkpointing:
1. More resource usage, e.g. CPU and network
2. Performance degradation, as you mentioned (it becomes more obvious if
there is a resource bottleneck)

So if your job doesn't have strict requirements for the benefits above, you
could choose a larger checkpoint interval, e.g. 3 minutes.
Otherwise, you could keep it within 1 minute and decrease it until it
meets your requirements.
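
As a rough illustration of the trade-off above (this is not a formula from
Flink itself, just a back-of-the-envelope sketch), the Python snippet below
compares a few candidate intervals using the 2-second checkpoint duration you
reported. The function name checkpoint_tradeoff and both metrics are my own
assumptions: "duty_cycle" is the fraction of time a checkpoint is in progress,
and "worst_case_replay_s" is the span of data to reprocess if the job fails
just before the next checkpoint completes. It also assumes the checkpoint
duration stays constant as the interval grows, which may not hold with
incremental checkpoints.

```python
# Illustrative model only; the names and metrics here are assumptions,
# not part of any Flink API.

def checkpoint_tradeoff(interval_s: float, duration_s: float) -> dict:
    """Estimate cost and recovery exposure for a given checkpoint interval.

    interval_s: time between checkpoint triggers, in seconds
    duration_s: time one checkpoint takes to complete, in seconds
    """
    if interval_s <= 0 or duration_s < 0:
        raise ValueError("interval must be positive and duration non-negative")
    return {
        # Fraction of wall-clock time during which a checkpoint is running.
        "duty_cycle": duration_s / interval_s,
        # Failure just before the next checkpoint completes: the job restores
        # the previous checkpoint and replays one interval plus the time the
        # in-flight checkpoint had already been running.
        "worst_case_replay_s": interval_s + duration_s,
    }


if __name__ == "__main__":
    # Candidate intervals: the current 30 s, plus 1 min and 3 min.
    for interval in (30, 60, 180):
        result = checkpoint_tradeoff(interval, duration_s=2)
        print(
            f"interval={interval:>3}s  "
            f"duty_cycle={result['duty_cycle']:.1%}  "
            f"worst_case_replay={result['worst_case_replay_s']:.0f}s"
        )
```

If the replay window at a given interval is acceptable for your recovery and
latency requirements, the larger interval is the cheaper choice; if not, step
down and re-measure the actual checkpoint duration at each setting.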

On Tue, Dec 5, 2023 at 6:56 PM Oscar Perez via user <user@flink.apache.org> wrote:

> Hi,
> We are tuning some of the Flink jobs we have in production and we would
> like to know what the best numbers/considerations are for the checkpoint
> interval. We have set a default of 30 seconds for the checkpoint interval and
> the checkpoint operation takes around 2 seconds.
> We have also enabled incremental checkpoints. I understand there is a
> tradeoff between recovery time after a failure and the performance degradation
> of an aggressive checkpoint policy, but I would like to know what
> you think is a good compromise.
> I read this article as a reference:
> https://shopify.engineering/optimizing-apache-flink-applications-tips
> But what I would like is some formula or recipe in order to find out the
> best value for checkpoint interval.
> Regards,
> Oscar

