Hi dear flink users and devs:

FLIP-364[1] intends to make some improvements to restart-strategy
and discuss updating some of the default values of exponential-delay,
and whether exponential-delay can be used as the default restart-strategy.
After discussing at dev mail list[2], we hope to collect more feedback
from Flink users.

# Why does the default restart-strategy need to be updated?

If checkpointing is enabled, the default value is fixed-delay with
Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
the job will restart infinitely with high frequency when a job
continues to fail.

When the Kafka cluster fails, a large number of flink jobs will be
restarted frequently. After the kafka cluster is recovered, a large
number of high-frequency restarts of flink jobs may cause the
kafka cluster to avalanche again.

Considering the exponential-delay as the default strategy with
a couple of reasons:

- The exponential-delay can reduce the restart frequency when
  a job continues to fail.
- It can restart a job quickly when a job fails occasionally.
- The restart-strategy.exponential-delay.jitter-factor can avoid r
  estarting multiple jobs at the same time. It’s useful to prevent
  avalanches.

# What are the current default values[4] of exponential-delay?

restart-strategy.exponential-delay.initial-backoff : 1s
restart-strategy.exponential-delay.backoff-multiplier : 2.0
restart-strategy.exponential-delay.jitter-factor : 0.1
restart-strategy.exponential-delay.max-backoff : 5 min
restart-strategy.exponential-delay.reset-backoff-threshold : 1h

backoff-multiplier=2 means that the delay time of each restart
will be doubled. The delay times are:
1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.

The delay time is increased rapidly, it will affect the recover
time for flink jobs.

# Option improvements

We think the backoff-multiplier between 1 and 2 is more sensible,
such as:

restart-strategy.exponential-delay.backoff-multiplier : 1.2
restart-strategy.exponential-delay.max-backoff : 1 min

After updating, the delay times are:

1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
22.186s, 26.623s, 31.948s, 38.337s, etc

They achieve the following goals:
- When restarts are infrequent in a short period of time, flink can
  quickly restart the job. (For example: the retry delay time when
  restarting 5 times is 2.073s)
- When restarting frequently in a short period of time, flink can
  slightly reduce the restart frequency to prevent avalanches.
  (For example: the retry delay time when retrying 10 times is 5.1 s,
  and the retry delay time when retrying 20 times is 38s, which is not very
large.)

As @Mingliang Liu <lium...@apache.org>  mentioned at dev mail list: the
one-size-fits-all
default values do not exist. So our goal is that the default values
can be suitable for most jobs.

Looking forward to your thoughts and feedback, thanks~

[1] https://cwiki.apache.org/confluence/x/uJqzDw
[2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
[3]
https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
[4]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy

Best,
Rui

回复