Re: [DISCUSS] Change the default restart-strategy to exponential-delay

Rui Fan Mon, 11 Dec 2023 23:19:29 -0800

Hi Maximilian and Mason,

Thanks a lot for your feedback!


After an offline consultation with Max, I guess I understand your
concern for now: when flink job restarts, it will make a bunch of
calls to the Kubernetes API, e.g. read/write to config maps, create
task managers. Currently, the default restart strategy is fixed-delay
with 1s delay time, so flink will restart jobs with high frequency
even if flink jobs cannot be started. It will cause the Kubernetes
cluster became unstable.

That's why I propose changing the default restart strategy to
exponential-delay. It can achieve: restarts happen quickly
enough unless there are consecutive failures. It is helpful for
the stability of external components.

After discussing with Max and Zhu Zhu at the PR comment[1],
Max suggested using 1.5 as the default value of backoff-multiplier
instead of 1.2. The 1.2 is a little small(delay time is too short).
This picture[2] is the relationship between restart-attempts and
retry-delay-time when backoff-multiplier is 1.2 and 1.5:

- The delay-time will reach 1 min after 12 attempts when backoff-multiplier
is 1.5
- The delay-time will reach 1 min after 24 attempts when backoff-multiplier
is 1.2

Is there any other suggestion? Looking forward to more feedback, thanks~

BTW, as Zhu said in the comment[1], if we update the default value,
a new vote is needed for this default value. So I will pause
FLINK-33736[1] first, and the rest of the JIRAs of FLIP-364 will be
continued.

To Mason:

If I understand your concerns correctly, I still don't know how
to benchmark. The kubernetes cluster instability only happens
when one cluster has a lot of jobs. In general, the test cannot
reproduce the pressure. Could you elaborate on how to
benchmark for this?

After this FLIP, the default restart frequency will be reduced
significantly. Especially when a job fails consecutively.
Do you think the benchmark is necessary?

Looking forward to your feedback, thanks~

[1] https://github.com/apache/flink/pull/23247#discussion_r1422626734
[2]
https://github.com/apache/flink/assets/38427477/642c57e0-b415-4326-af05-8b506c5fbb3a
[3] https://issues.apache.org/jira/browse/FLINK-33736

Best,
Rui

On Thu, Dec 7, 2023 at 10:57 PM Maximilian Michels <[email protected]> wrote:

> Hey Rui,
>
> +1 for changing the default restart strategy to exponential-delay.
> This is something all users eventually run into. They end up changing
> the restart strategy to exponential-delay. I think the current
> defaults are quite balanced. Restarts happen quickly enough unless
> there are consecutive failures where I think it makes sense to double
> the waiting time up till the max.
>
> -Max
>
>
> On Wed, Dec 6, 2023 at 12:51 AM Mason Chen <[email protected]> wrote:
> >
> > Hi Rui,
> >
> > Sorry for the late reply. I was suggesting that perhaps we could do some
> > testing with Kubernetes wrt configuring values for the exponential
> restart
> > strategy. We've noticed that the default strategy in 1.17 caused a lot of
> > requests to the K8s API server for unstable deployments.
> >
> > However, people in different Kubernetes setups will have different limits
> > so it would be challenging to provide a general benchmark. Another thing
> I
> > found helpful in the past is to refer to Kubernetes--for example, the
> > default strategy is exponential for pod restarts and we could draw
> > inspiration from what they have set as a general purpose default config.
> >
> > Best,
> > Mason
> >
> > On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <[email protected]> wrote:
> >
> > > Hi David and Mason,
> > >
> > > Thanks for your feedback!
> > >
> > > To David:
> > >
> > > > Given that the new default feels more complex than the current
> behavior,
> > > if we decide to do this I think it will be important to include the
> > > rationale you've shared in the documentation.
> > >
> > > Sounds make sense to me, I will add the related doc if we
> > > update the default strategy.
> > >
> > > To Mason:
> > >
> > > > I suppose we could do some benchmarking on what works well for the
> > > resource providers that Flink relies on e.g. Kubernetes. Based on
> > > conferences and blogs,
> > > > it seems most people are relying on Kubernetes to deploy Flink and
> the
> > > restart strategy has a large dependency on how well Kubernetes can
> scale to
> > > requests to redeploy the job.
> > >
> > > Sorry, I didn't understand what type of benchmarking
> > > we should do, could you elaborate on it? Thanks a lot.
> > >
> > > Best,
> > > Rui
> > >
> > > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen <[email protected]>
> wrote:
> > >
> > >> Hi Rui,
> > >>
> > >> I suppose we could do some benchmarking on what works well for the
> > >> resource providers that Flink relies on e.g. Kubernetes. Based on
> > >> conferences and blogs, it seems most people are relying on Kubernetes
> to
> > >> deploy Flink and the restart strategy has a large dependency on how
> well
> > >> Kubernetes can scale to requests to redeploy the job.
> > >>
> > >> Best,
> > >> Mason
> > >>
> > >> On Fri, Nov 17, 2023 at 10:07 AM David Anderson <[email protected]
> >
> > >> wrote:
> > >>
> > >>> Rui,
> > >>>
> > >>> I don't have any direct experience with this topic, but given the
> > >>> motivation you shared, the proposal makes sense to me. Given that
> the new
> > >>> default feels more complex than the current behavior, if we decide
> to do
> > >>> this I think it will be important to include the rationale you've
> shared in
> > >>> the documentation.
> > >>>
> > >>> David
> > >>>
> > >>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <[email protected]>
> wrote:
> > >>>
> > >>>> Hi dear flink users and devs:
> > >>>>
> > >>>> FLIP-364[1] intends to make some improvements to restart-strategy
> > >>>> and discuss updating some of the default values of
> exponential-delay,
> > >>>> and whether exponential-delay can be used as the default
> > >>>> restart-strategy.
> > >>>> After discussing at dev mail list[2], we hope to collect more
> feedback
> > >>>> from Flink users.
> > >>>>
> > >>>> # Why does the default restart-strategy need to be updated?
> > >>>>
> > >>>> If checkpointing is enabled, the default value is fixed-delay with
> > >>>> Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
> > >>>> the job will restart infinitely with high frequency when a job
> > >>>> continues to fail.
> > >>>>
> > >>>> When the Kafka cluster fails, a large number of flink jobs will be
> > >>>> restarted frequently. After the kafka cluster is recovered, a large
> > >>>> number of high-frequency restarts of flink jobs may cause the
> > >>>> kafka cluster to avalanche again.
> > >>>>
> > >>>> Considering the exponential-delay as the default strategy with
> > >>>> a couple of reasons:
> > >>>>
> > >>>> - The exponential-delay can reduce the restart frequency when
> > >>>>   a job continues to fail.
> > >>>> - It can restart a job quickly when a job fails occasionally.
> > >>>> - The restart-strategy.exponential-delay.jitter-factor can avoid r
> > >>>>   estarting multiple jobs at the same time. It’s useful to prevent
> > >>>>   avalanches.
> > >>>>
> > >>>> # What are the current default values[4] of exponential-delay?
> > >>>>
> > >>>> restart-strategy.exponential-delay.initial-backoff : 1s
> > >>>> restart-strategy.exponential-delay.backoff-multiplier : 2.0
> > >>>> restart-strategy.exponential-delay.jitter-factor : 0.1
> > >>>> restart-strategy.exponential-delay.max-backoff : 5 min
> > >>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
> > >>>>
> > >>>> backoff-multiplier=2 means that the delay time of each restart
> > >>>> will be doubled. The delay times are:
> > >>>> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
> > >>>>
> > >>>> The delay time is increased rapidly, it will affect the recover
> > >>>> time for flink jobs.
> > >>>>
> > >>>> # Option improvements
> > >>>>
> > >>>> We think the backoff-multiplier between 1 and 2 is more sensible,
> > >>>> such as:
> > >>>>
> > >>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
> > >>>> restart-strategy.exponential-delay.max-backoff : 1 min
> > >>>>
> > >>>> After updating, the delay times are:
> > >>>>
> > >>>> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
> > >>>> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
> > >>>> 22.186s, 26.623s, 31.948s, 38.337s, etc
> > >>>>
> > >>>> They achieve the following goals:
> > >>>> - When restarts are infrequent in a short period of time, flink can
> > >>>>   quickly restart the job. (For example: the retry delay time when
> > >>>>   restarting 5 times is 2.073s)
> > >>>> - When restarting frequently in a short period of time, flink can
> > >>>>   slightly reduce the restart frequency to prevent avalanches.
> > >>>>   (For example: the retry delay time when retrying 10 times is 5.1
> s,
> > >>>>   and the retry delay time when retrying 20 times is 38s, which is
> not
> > >>>> very
> > >>>> large.)
> > >>>>
> > >>>> As @Mingliang Liu <[email protected]>  mentioned at dev mail
> list: the
> > >>>> one-size-fits-all
> > >>>> default values do not exist. So our goal is that the default values
> > >>>> can be suitable for most jobs.
> > >>>>
> > >>>> Looking forward to your thoughts and feedback, thanks~
> > >>>>
> > >>>> [1] https://cwiki.apache.org/confluence/x/uJqzDw
> > >>>> [2]
> https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
> > >>>> [3]
> > >>>>
> > >>>>
> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
> > >>>> [4]
> > >>>>
> > >>>>
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
> > >>>>
> > >>>> Best,
> > >>>> Rui
> > >>>>
> > >>>
>

Re: [DISCUSS] Change the default restart-strategy to exponential-delay

Reply via email to