Re: [DISCUSS] Change the default restart-strategy to exponential-delay

Maximilian Michels Thu, 07 Dec 2023 06:59:20 -0800

Hey Rui,

+1 for changing the default restart strategy to exponential-delay.
This is something all users eventually run into. They end up changing
the restart strategy to exponential-delay. I think the current
defaults are quite balanced. Restarts happen quickly enough unless
there are consecutive failures, in which case I think it makes sense to
double the waiting time up to the max.


-Max


On Wed, Dec 6, 2023 at 12:51 AM Mason Chen <mas.chen6...@gmail.com> wrote:
>
> Hi Rui,
>
> Sorry for the late reply. I was suggesting that perhaps we could do some
> testing with Kubernetes wrt configuring values for the exponential restart
> strategy. We've noticed that the default strategy in 1.17 caused a lot of
> requests to the K8s API server for unstable deployments.
>
> However, people in different Kubernetes setups will have different limits
> so it would be challenging to provide a general benchmark. Another thing I
> found helpful in the past is to refer to Kubernetes--for example, the
> default strategy is exponential for pod restarts and we could draw
> inspiration from what they have set as a general purpose default config.
>
> Best,
> Mason
>
> On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi David and Mason,
> >
> > Thanks for your feedback!
> >
> > To David:
> >
> > > Given that the new default feels more complex than the current behavior,
> > if we decide to do this I think it will be important to include the
> > rationale you've shared in the documentation.
> >
> > That makes sense to me. I will add the related documentation if we
> > update the default strategy.
> >
> > To Mason:
> >
> > > I suppose we could do some benchmarking on what works well for the
> > resource providers that Flink relies on e.g. Kubernetes. Based on
> > conferences and blogs,
> > > it seems most people are relying on Kubernetes to deploy Flink and the
> > restart strategy has a large dependency on how well Kubernetes can scale to
> > requests to redeploy the job.
> >
> > Sorry, I didn't understand what type of benchmarking
> > we should do. Could you elaborate on it? Thanks a lot.
> >
> > Best,
> > Rui
> >
> > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen <mas.chen6...@gmail.com> wrote:
> >
> >> Hi Rui,
> >>
> >> I suppose we could do some benchmarking on what works well for the
> >> resource providers that Flink relies on e.g. Kubernetes. Based on
> >> conferences and blogs, it seems most people are relying on Kubernetes to
> >> deploy Flink and the restart strategy has a large dependency on how well
> >> Kubernetes can scale to requests to redeploy the job.
> >>
> >> Best,
> >> Mason
> >>
> >> On Fri, Nov 17, 2023 at 10:07 AM David Anderson <dander...@apache.org>
> >> wrote:
> >>
> >>> Rui,
> >>>
> >>> I don't have any direct experience with this topic, but given the
> >>> motivation you shared, the proposal makes sense to me. Given that the new
> >>> default feels more complex than the current behavior, if we decide to do
> >>> this I think it will be important to include the rationale you've shared
> >>> in the documentation.
> >>>
> >>> David
> >>>
> >>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:
> >>>
> >>>> Hi dear flink users and devs:
> >>>>
> >>>> FLIP-364[1] proposes some improvements to restart-strategy. It also
> >>>> discusses updating some of the default values of exponential-delay,
> >>>> and whether exponential-delay can be used as the default
> >>>> restart-strategy.
> >>>> After discussing this on the dev mailing list[2], we hope to collect
> >>>> more feedback from Flink users.
> >>>>
> >>>> # Why does the default restart-strategy need to be updated?
> >>>>
> >>>> If checkpointing is enabled, the default value is fixed-delay with
> >>>> Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
> >>>> the job will restart indefinitely and at high frequency for as
> >>>> long as it keeps failing.
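
To make "high frequency" concrete, here is a minimal Python sketch that counts how many restart delays fit into a 10-minute outage under the fixed-delay default and under the current exponential-delay defaults listed later in this message. It is a simplification, not Flink's implementation: the helper names are hypothetical, and it assumes the job fails again immediately after every restart, ignoring jitter and the time a restart itself takes.

```python
from itertools import repeat


def exponential_delays(initial_s, multiplier, max_s):
    """Yield an endless sequence of capped exponential delays (no jitter)."""
    delay = initial_s
    while True:
        yield min(delay, max_s)
        delay *= multiplier


def restarts_within(window_s, delays):
    """Count restarts whose waiting delay fully elapses inside the window."""
    elapsed = 0.0
    count = 0
    for delay in delays:
        elapsed += delay
        if elapsed > window_s:
            break
        count += 1
    return count


OUTAGE_S = 600.0  # hypothetical 10-minute outage of an external system

# fixed-delay default: 1 s between attempts, effectively unlimited attempts
fixed_count = restarts_within(OUTAGE_S, repeat(1.0))

# exponential-delay with initial-backoff 1 s, multiplier 2.0, max-backoff 5 min
exponential_count = restarts_within(OUTAGE_S, exponential_delays(1.0, 2.0, 300.0))

print(fixed_count, exponential_count)
```

Under these assumptions a single job restarts 600 times with fixed-delay and 9 times with exponential-delay during the same outage.
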
> >>>>
> >>>> When the Kafka cluster fails, a large number of Flink jobs will be
> >>>> restarted frequently. After the Kafka cluster has recovered, the
> >>>> high-frequency restarts of so many Flink jobs may cause the
> >>>> Kafka cluster to avalanche again.
> >>>>
> >>>> We are considering exponential-delay as the default strategy for
> >>>> a couple of reasons:
> >>>>
> >>>> - The exponential-delay can reduce the restart frequency when
> >>>>   a job continues to fail.
> >>>> - It can restart a job quickly when a job fails occasionally.
> >>>> - The restart-strategy.exponential-delay.jitter-factor can avoid
> >>>>   restarting multiple jobs at the same time. It's useful to prevent
> >>>>   avalanches.
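
A minimal Python sketch of how a jitter factor spreads out restarts. It assumes the jitter is a uniform random offset of at most jitter-factor times the current backoff in either direction; Flink's actual implementation may differ, and the function name is hypothetical.

```python
import random


def jittered_delay(base_delay_s, jitter_factor, rng):
    """Shift base_delay_s by a uniform random offset within +/- jitter_factor of it."""
    offset = base_delay_s * jitter_factor
    return base_delay_s + rng.uniform(-offset, offset)


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Ten hypothetical jobs that failed together and have all reached a 60 s backoff.
delays = [jittered_delay(60.0, 0.1, rng) for _ in range(10)]

for job_id, delay in enumerate(delays):
    print(f"job {job_id}: restart after {delay:.2f} s")
```

With a jitter factor of 0.1, every job restarts somewhere between 54 s and 66 s instead of all of them restarting at exactly 60 s.
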
> >>>>
> >>>> # What are the current default values[4] of exponential-delay?
> >>>>
> >>>> restart-strategy.exponential-delay.initial-backoff : 1s
> >>>> restart-strategy.exponential-delay.backoff-multiplier : 2.0
> >>>> restart-strategy.exponential-delay.jitter-factor : 0.1
> >>>> restart-strategy.exponential-delay.max-backoff : 5 min
> >>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
> >>>>
> >>>> backoff-multiplier=2 means that the delay time of each restart
> >>>> will be doubled. The delay times are:
> >>>> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
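
The sequence above can be reproduced with a short Python sketch. It only models initial-backoff, backoff-multiplier and max-backoff; jitter and reset-backoff-threshold are left out, and the function name is hypothetical rather than part of Flink.

```python
def backoff_schedule(initial_s, multiplier, max_s, restarts):
    """Return the delay before each of the first `restarts` consecutive restarts."""
    delays = []
    delay = initial_s
    for _ in range(restarts):
        delays.append(min(delay, max_s))  # never wait longer than max-backoff
        delay *= multiplier               # grow the backoff for the next failure
    return delays


# Current defaults: initial-backoff 1 s, multiplier 2.0, max-backoff 5 min (300 s)
current_defaults = backoff_schedule(1.0, 2.0, 300.0, 11)
print(current_defaults)
```
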
> >>>>
> >>>> The delay time increases rapidly, which lengthens the recovery
> >>>> time of Flink jobs.
> >>>>
> >>>> # Option improvements
> >>>>
> >>>> We think a backoff-multiplier between 1 and 2 is more sensible,
> >>>> such as:
> >>>>
> >>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
> >>>> restart-strategy.exponential-delay.max-backoff : 1 min
> >>>>
> >>>> After updating, the delay times are:
> >>>>
> >>>> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
> >>>> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
> >>>> 22.186s, 26.623s, 31.948s, 38.337s, etc
> >>>>
> >>>> They achieve the following goals:
> >>>> - When restarts are infrequent in a short period of time, flink can
> >>>>   quickly restart the job. (For example: the retry delay time when
> >>>>   restarting 5 times is 2.073s)
> >>>> - When restarts are frequent in a short period of time, Flink can
> >>>>   slightly reduce the restart frequency to prevent avalanches.
> >>>>   (For example: the delay before the 10th retry is about 5.2s, and
> >>>>   the delay before the 20th retry is about 32s, which is not very
> >>>>   large.)
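
The delays can also be checked with a closed form: ignoring jitter, the delay before the n-th consecutive restart is min(initial-backoff * backoff-multiplier^(n-1), max-backoff). The Python sketch below uses the proposed values as defaults; the function names are hypothetical and this is not Flink code.

```python
def delay_before_restart(n, initial_s=1.0, multiplier=1.2, max_s=60.0):
    """Delay in seconds before the n-th consecutive restart (n starts at 1)."""
    return min(initial_s * multiplier ** (n - 1), max_s)


def first_capped_restart(initial_s=1.0, multiplier=1.2, max_s=60.0):
    """Return the first restart number whose delay reaches max-backoff."""
    if multiplier <= 1.0:
        raise ValueError("the backoff only grows when multiplier > 1")
    n = 1
    while initial_s * multiplier ** (n - 1) < max_s:
        n += 1
    return n


print(round(delay_before_restart(5), 3))     # delay before the 5th restart
print(round(delay_before_restart(10), 3))    # delay before the 10th restart
print(first_capped_restart())                # proposed: multiplier 1.2, max 1 min
print(first_capped_restart(1.0, 2.0, 300.0)) # current: multiplier 2.0, max 5 min
```

With the proposed values the delay only reaches the 1-minute cap at the 24th consecutive restart, whereas the current defaults reach the 5-minute cap at the 10th.
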
> >>>>
> >>>> As @Mingliang Liu <lium...@apache.org> mentioned on the dev mailing
> >>>> list, one-size-fits-all default values do not exist. So our goal is
> >>>> for the default values to be suitable for most jobs.
> >>>>
> >>>> Looking forward to your thoughts and feedback, thanks~
> >>>>
> >>>> [1] https://cwiki.apache.org/confluence/x/uJqzDw
> >>>> [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
> >>>> [3] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
> >>>> [4] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
> >>>>
> >>>> Best,
> >>>> Rui
> >>>>
> >>>
