Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-11-17 文章 David Anderson
Rui,

I don't have any direct experience with this topic, but given the
motivation you shared, the proposal makes sense to me. Given that the new
default feels more complex than the current behavior, if we decide to do
this I think it will be important to include the rationale you've shared in
the documentation.

David

On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:

> Hi dear flink users and devs:
>
> FLIP-364[1] intends to make some improvements to restart-strategy
> and discuss updating some of the default values of exponential-delay,
> and whether exponential-delay can be used as the default restart-strategy.
> After discussing at dev mail list[2], we hope to collect more feedback
> from Flink users.
>
> # Why does the default restart-strategy need to be updated?
>
> If checkpointing is enabled, the default value is fixed-delay with
> Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
> the job will restart infinitely with high frequency when a job
> continues to fail.
>
> When the Kafka cluster fails, a large number of flink jobs will be
> restarted frequently. After the kafka cluster is recovered, a large
> number of high-frequency restarts of flink jobs may cause the
> kafka cluster to avalanche again.
>
> Considering the exponential-delay as the default strategy with
> a couple of reasons:
>
> - The exponential-delay can reduce the restart frequency when
>   a job continues to fail.
> - It can restart a job quickly when a job fails occasionally.
> - The restart-strategy.exponential-delay.jitter-factor can avoid r
>   estarting multiple jobs at the same time. It’s useful to prevent
>   avalanches.
>
> # What are the current default values[4] of exponential-delay?
>
> restart-strategy.exponential-delay.initial-backoff : 1s
> restart-strategy.exponential-delay.backoff-multiplier : 2.0
> restart-strategy.exponential-delay.jitter-factor : 0.1
> restart-strategy.exponential-delay.max-backoff : 5 min
> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
>
> backoff-multiplier=2 means that the delay time of each restart
> will be doubled. The delay times are:
> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
>
> The delay time is increased rapidly, it will affect the recover
> time for flink jobs.
>
> # Option improvements
>
> We think the backoff-multiplier between 1 and 2 is more sensible,
> such as:
>
> restart-strategy.exponential-delay.backoff-multiplier : 1.2
> restart-strategy.exponential-delay.max-backoff : 1 min
>
> After updating, the delay times are:
>
> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
> 22.186s, 26.623s, 31.948s, 38.337s, etc
>
> They achieve the following goals:
> - When restarts are infrequent in a short period of time, flink can
>   quickly restart the job. (For example: the retry delay time when
>   restarting 5 times is 2.073s)
> - When restarting frequently in a short period of time, flink can
>   slightly reduce the restart frequency to prevent avalanches.
>   (For example: the retry delay time when retrying 10 times is 5.1 s,
>   and the retry delay time when retrying 20 times is 38s, which is not very
> large.)
>
> As @Mingliang Liu   mentioned at dev mail list: the
> one-size-fits-all
> default values do not exist. So our goal is that the default values
> can be suitable for most jobs.
>
> Looking forward to your thoughts and feedback, thanks~
>
> [1] https://cwiki.apache.org/confluence/x/uJqzDw
> [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
> [3]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
> [4]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
>
> Best,
> Rui
>


[ANNOUNCE] Apache Flink 1.15.1 released

2022-07-07 文章 David Anderson
The Apache Flink community is very happy to announce the release of Apache
Flink 1.15.1, which is the first bugfix release for the Apache Flink 1.15
series.

Apache Flink® is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data streaming
applications.

The release is available for download at:
https://flink.apache.org/downloads.html

Please check out the release blog post for an overview of the improvements
for this bugfix release:

https://flink.apache.org/news/2022/07/06/release-1.15.1.html

The full release notes are available in Jira:


https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12351546

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Regards,
David Anderson