Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-12-19 Thread Rui Fan
Thanks everyone for the feedback!

There has been no further feedback here, so I have just started a
new vote[1] to update the default value of backoff-multiplier from
1.2 to 1.5.

[1] https://lists.apache.org/thread/0b1dcwb49owpm6v1j8rhrg9h0fvs5nkt

Best,
Rui

On Tue, Dec 12, 2023 at 7:14 PM Maximilian Michels  wrote:

> Thank you Rui! I think a 1.5 multiplier is a reasonable tradeoff
> between restarting fast but not putting too much pressure on the
> cluster due to restarts.
>
> -Max
>
> On Tue, Dec 12, 2023 at 8:19 AM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > Hi Maximilian and Mason,
> >
> > Thanks a lot for your feedback!
> >
> > After an offline consultation with Max, I guess I understand your
> > concern for now: when flink job restarts, it will make a bunch of
> > calls to the Kubernetes API, e.g. read/write to config maps, create
> > task managers. Currently, the default restart strategy is fixed-delay
> > with 1s delay time, so flink will restart jobs with high frequency
> > even if flink jobs cannot be started. It will cause the Kubernetes
> > cluster became unstable.
> >
> > That's why I propose changing the default restart strategy to
> > exponential-delay. It can achieve: restarts happen quickly
> > enough unless there are consecutive failures. It is helpful for
> > the stability of external components.
> >
> > After discussing with Max and Zhu Zhu at the PR comment[1],
> > Max suggested using 1.5 as the default value of backoff-multiplier
> > instead of 1.2. The 1.2 is a little small(delay time is too short).
> > This picture[2] is the relationship between restart-attempts and
> > retry-delay-time when backoff-multiplier is 1.2 and 1.5:
> >
> > - The delay-time will reach 1 min after 12 attempts when
> > backoff-multiplier is 1.5
> > - The delay-time will reach 1 min after 24 attempts when
> > backoff-multiplier is 1.2
> >
> > Is there any other suggestion? Looking forward to more feedback, thanks~
> >
> > BTW, as Zhu said in the comment[1], if we update the default value,
> > a new vote is needed for this default value. So I will pause
> > FLINK-33736[1] first, and the rest of the JIRAs of FLIP-364 will be
> > continued.
> >
> > To Mason:
> >
> > If I understand your concerns correctly, I still don't know how
> > to benchmark. The kubernetes cluster instability only happens
> > when one cluster has a lot of jobs. In general, the test cannot
> > reproduce the pressure. Could you elaborate on how to
> > benchmark for this?
> >
> > After this FLIP, the default restart frequency will be reduced
> > significantly. Especially when a job fails consecutively.
> > Do you think the benchmark is necessary?
> >
> > Looking forward to your feedback, thanks~
> >
> > [1] https://github.com/apache/flink/pull/23247#discussion_r1422626734
> > [2] https://github.com/apache/flink/assets/38427477/642c57e0-b415-4326-af05-8b506c5fbb3a
> > [3] https://issues.apache.org/jira/browse/FLINK-33736
> >
> > Best,
> > Rui
> >
> > On Thu, Dec 7, 2023 at 10:57 PM Maximilian Michels wrote:
> >>
> >> Hey Rui,
> >>
> >> +1 for changing the default restart strategy to exponential-delay.
> >> This is something all users eventually run into. They end up changing
> >> the restart strategy to exponential-delay. I think the current
> >> defaults are quite balanced. Restarts happen quickly enough unless
> >> there are consecutive failures where I think it makes sense to double
> >> the waiting time up till the max.
> >>
> >> -Max
> >>
> >>
> >> On Wed, Dec 6, 2023 at 12:51 AM Mason Chen wrote:
> >> >
> >> > Hi Rui,
> >> >
> >> > Sorry for the late reply. I was suggesting that perhaps we could do some
> >> > testing with Kubernetes wrt configuring values for the exponential restart
> >> > strategy. We've noticed that the default strategy in 1.17 caused a lot of
> >> > requests to the K8s API server for unstable deployments.
> >> >
> >> > However, people in different Kubernetes setups will have different limits
> >> > so it would be challenging to provide a general benchmark. Another thing I
> >> > found helpful in the past is to refer to Kubernetes--for example, the
> >> > default strategy is exponential for pod restarts and we could draw
> >> > inspiration from what they have set as a general purpose default config.
> >> >
> >> > Best,
> >> > Mason
> >> >
> >> > On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
> >> >
> >> > > Hi David and Mason,
> >> > >
> >> > > Thanks for your feedback!
> >> > >
> >> > > To David:
> >> > >
> >> > > > Given that the new default feels more complex than the current behavior,
> >> > > > if we decide to do this I think it will be important to include the
> >> > > > rationale you've shared in the documentation.
> >> > >
> >> > > Sounds make sense to me, I will add the related doc if we
> >> > > update the default strategy.
> >> > >
> >> > > To Mason:
> >> > >
> >> > > > I suppose we could do some benchmarking on what works well for the
> >> > > resource providers that 

Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-12-12 Thread Maximilian Michels
Thank you Rui! I think a 1.5 multiplier is a reasonable tradeoff
between restarting fast and not putting too much pressure on the
cluster due to restarts.

-Max

On Tue, Dec 12, 2023 at 8:19 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi Maximilian and Mason,
>
> Thanks a lot for your feedback!
>
> After an offline consultation with Max, I guess I understand your
> concern for now: when flink job restarts, it will make a bunch of
> calls to the Kubernetes API, e.g. read/write to config maps, create
> task managers. Currently, the default restart strategy is fixed-delay
> with 1s delay time, so flink will restart jobs with high frequency
> even if flink jobs cannot be started. It will cause the Kubernetes
> cluster became unstable.
>
> That's why I propose changing the default restart strategy to
> exponential-delay. It can achieve: restarts happen quickly
> enough unless there are consecutive failures. It is helpful for
> the stability of external components.
>
> After discussing with Max and Zhu Zhu at the PR comment[1],
> Max suggested using 1.5 as the default value of backoff-multiplier
> instead of 1.2. The 1.2 is a little small(delay time is too short).
> This picture[2] is the relationship between restart-attempts and
> retry-delay-time when backoff-multiplier is 1.2 and 1.5:
>
> - The delay-time will reach 1 min after 12 attempts when backoff-multiplier is 1.5
> - The delay-time will reach 1 min after 24 attempts when backoff-multiplier is 1.2
>
> Is there any other suggestion? Looking forward to more feedback, thanks~
>
> BTW, as Zhu said in the comment[1], if we update the default value,
> a new vote is needed for this default value. So I will pause
> FLINK-33736[1] first, and the rest of the JIRAs of FLIP-364 will be
> continued.
>
> To Mason:
>
> If I understand your concerns correctly, I still don't know how
> to benchmark. The kubernetes cluster instability only happens
> when one cluster has a lot of jobs. In general, the test cannot
> reproduce the pressure. Could you elaborate on how to
> benchmark for this?
>
> After this FLIP, the default restart frequency will be reduced
> significantly. Especially when a job fails consecutively.
> Do you think the benchmark is necessary?
>
> Looking forward to your feedback, thanks~
>
> [1] https://github.com/apache/flink/pull/23247#discussion_r1422626734
> [2] https://github.com/apache/flink/assets/38427477/642c57e0-b415-4326-af05-8b506c5fbb3a
> [3] https://issues.apache.org/jira/browse/FLINK-33736
>
> Best,
> Rui
>
> On Thu, Dec 7, 2023 at 10:57 PM Maximilian Michels  wrote:
>>
>> Hey Rui,
>>
>> +1 for changing the default restart strategy to exponential-delay.
>> This is something all users eventually run into. They end up changing
>> the restart strategy to exponential-delay. I think the current
>> defaults are quite balanced. Restarts happen quickly enough unless
>> there are consecutive failures where I think it makes sense to double
>> the waiting time up till the max.
>>
>> -Max
>>
>>
>> On Wed, Dec 6, 2023 at 12:51 AM Mason Chen  wrote:
>> >
>> > Hi Rui,
>> >
>> > Sorry for the late reply. I was suggesting that perhaps we could do some
>> > testing with Kubernetes wrt configuring values for the exponential restart
>> > strategy. We've noticed that the default strategy in 1.17 caused a lot of
>> > requests to the K8s API server for unstable deployments.
>> >
>> > However, people in different Kubernetes setups will have different limits
>> > so it would be challenging to provide a general benchmark. Another thing I
>> > found helpful in the past is to refer to Kubernetes--for example, the
>> > default strategy is exponential for pod restarts and we could draw
>> > inspiration from what they have set as a general purpose default config.
>> >
>> > Best,
>> > Mason
>> >
>> > On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
>> >
>> > > Hi David and Mason,
>> > >
>> > > Thanks for your feedback!
>> > >
>> > > To David:
>> > >
>> > > > Given that the new default feels more complex than the current behavior,
>> > > > if we decide to do this I think it will be important to include the
>> > > > rationale you've shared in the documentation.
>> > >
>> > > Sounds make sense to me, I will add the related doc if we
>> > > update the default strategy.
>> > >
>> > > To Mason:
>> > >
>> > > > I suppose we could do some benchmarking on what works well for the
>> > > > resource providers that Flink relies on e.g. Kubernetes. Based on
>> > > > conferences and blogs, it seems most people are relying on Kubernetes
>> > > > to deploy Flink and the restart strategy has a large dependency on how
>> > > > well Kubernetes can scale to requests to redeploy the job.
>> > >
>> > > Sorry, I didn't understand what type of benchmarking
>> > > we should do, could you elaborate on it? Thanks a lot.
>> > >
>> > > Best,
>> > > Rui
>> > >
>> > > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen wrote:
>> > >
>> > >> Hi Rui,
>> > >>
>> 

Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-12-11 Thread Rui Fan
Hi Maximilian and Mason,

Thanks a lot for your feedback!

After an offline consultation with Max, I think I now understand your
concern: when a Flink job restarts, it makes a number of calls to the
Kubernetes API, e.g. reading and writing config maps and creating
task managers. Currently, the default restart strategy is fixed-delay
with a 1s delay, so Flink restarts jobs at high frequency even if the
jobs cannot be started. This can make the Kubernetes cluster
unstable.

That's why I propose changing the default restart strategy to
exponential-delay. With it, restarts happen quickly enough unless
there are consecutive failures, which helps the stability of
external components.

After discussing with Max and Zhu Zhu in the PR comment[1],
Max suggested using 1.5 as the default value of backoff-multiplier
instead of 1.2, because 1.2 is a little small (the delay stays too short).
This picture[2] shows the relationship between restart attempts and
retry delay time when backoff-multiplier is 1.2 and 1.5:

- The delay time reaches 1 min after 12 attempts when backoff-multiplier
is 1.5
- The delay time reaches 1 min after 24 attempts when backoff-multiplier
is 1.2
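
As a quick check of these two numbers, here is a minimal Python sketch.
It assumes initial-backoff is 1s and max-backoff is 1 min, ignores
jitter, and is not taken from Flink's code. The helper name
attempts_until_cap is made up for this illustration.

```python
def attempts_until_cap(initial_s, multiplier, max_backoff_s):
    """Return the 1-based attempt whose uncapped delay first reaches max_backoff_s."""
    attempt = 1
    delay = initial_s
    while delay < max_backoff_s:
        delay *= multiplier  # each consecutive failure multiplies the delay
        attempt += 1
    return attempt


# Assumed values: initial-backoff = 1s, max-backoff = 60s.
attempts_15 = attempts_until_cap(1.0, 1.5, 60.0)
attempts_12 = attempts_until_cap(1.0, 1.2, 60.0)
print("multiplier 1.5:", attempts_15)
print("multiplier 1.2:", attempts_12)
```

With a larger multiplier the cap is reached in about half as many
attempts, which is the tradeoff being discussed here.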

Are there any other suggestions? Looking forward to more feedback, thanks~

BTW, as Zhu said in the comment[1], if we update the default value,
a new vote is needed for it. So I will pause FLINK-33736[3] first,
and the rest of the JIRAs of FLIP-364 will continue.

To Mason:

If I understand your concerns correctly, I still don't know how
to benchmark this. The Kubernetes cluster instability only happens
when one cluster runs a lot of jobs, so in general a test cannot
reproduce the pressure. Could you elaborate on how to
benchmark for this?

After this FLIP, the default restart frequency will be reduced
significantly, especially when a job fails consecutively.
Do you think the benchmark is necessary?

Looking forward to your feedback, thanks~

[1] https://github.com/apache/flink/pull/23247#discussion_r1422626734
[2] https://github.com/apache/flink/assets/38427477/642c57e0-b415-4326-af05-8b506c5fbb3a
[3] https://issues.apache.org/jira/browse/FLINK-33736

Best,
Rui

On Thu, Dec 7, 2023 at 10:57 PM Maximilian Michels  wrote:

> Hey Rui,
>
> +1 for changing the default restart strategy to exponential-delay.
> This is something all users eventually run into. They end up changing
> the restart strategy to exponential-delay. I think the current
> defaults are quite balanced. Restarts happen quickly enough unless
> there are consecutive failures where I think it makes sense to double
> the waiting time up till the max.
>
> -Max
>
>
> On Wed, Dec 6, 2023 at 12:51 AM Mason Chen  wrote:
> >
> > Hi Rui,
> >
> > Sorry for the late reply. I was suggesting that perhaps we could do some
> > testing with Kubernetes wrt configuring values for the exponential restart
> > strategy. We've noticed that the default strategy in 1.17 caused a lot of
> > requests to the K8s API server for unstable deployments.
> >
> > However, people in different Kubernetes setups will have different limits
> > so it would be challenging to provide a general benchmark. Another thing I
> > found helpful in the past is to refer to Kubernetes--for example, the
> > default strategy is exponential for pod restarts and we could draw
> > inspiration from what they have set as a general purpose default config.
> >
> > Best,
> > Mason
> >
> > On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
> >
> > > Hi David and Mason,
> > >
> > > Thanks for your feedback!
> > >
> > > To David:
> > >
> > > > Given that the new default feels more complex than the current behavior,
> > > > if we decide to do this I think it will be important to include the
> > > > rationale you've shared in the documentation.
> > >
> > > Sounds make sense to me, I will add the related doc if we
> > > update the default strategy.
> > >
> > > To Mason:
> > >
> > > > I suppose we could do some benchmarking on what works well for the
> > > > resource providers that Flink relies on e.g. Kubernetes. Based on
> > > > conferences and blogs, it seems most people are relying on Kubernetes
> > > > to deploy Flink and the restart strategy has a large dependency on how
> > > > well Kubernetes can scale to requests to redeploy the job.
> > >
> > > Sorry, I didn't understand what type of benchmarking
> > > we should do, could you elaborate on it? Thanks a lot.
> > >
> > > Best,
> > > Rui
> > >
> > > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen 
> wrote:
> > >
> > >> Hi Rui,
> > >>
> > >> I suppose we could do some benchmarking on what works well for the
> > >> resource providers that Flink relies on e.g. Kubernetes. Based on
> > >> conferences and blogs, it seems most people are relying on Kubernetes
> to
> > >> deploy Flink and the restart strategy has a large dependency on how
> well
> > >> Kubernetes can scale to requests to redeploy the job.
> > >>
> > >> Best,
> > >> Mason
> > >>
> > >> On 

Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-12-07 Thread Maximilian Michels
Hey Rui,

+1 for changing the default restart strategy to exponential-delay.
This is something all users eventually run into. They end up changing
the restart strategy to exponential-delay. I think the current
defaults are quite balanced. Restarts happen quickly enough unless
there are consecutive failures, in which case I think it makes sense
to double the waiting time up to the max.

-Max


On Wed, Dec 6, 2023 at 12:51 AM Mason Chen  wrote:
>
> Hi Rui,
>
> Sorry for the late reply. I was suggesting that perhaps we could do some
> testing with Kubernetes wrt configuring values for the exponential restart
> strategy. We've noticed that the default strategy in 1.17 caused a lot of
> requests to the K8s API server for unstable deployments.
>
> However, people in different Kubernetes setups will have different limits
> so it would be challenging to provide a general benchmark. Another thing I
> found helpful in the past is to refer to Kubernetes--for example, the
> default strategy is exponential for pod restarts and we could draw
> inspiration from what they have set as a general purpose default config.
>
> Best,
> Mason
>
> On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi David and Mason,
> >
> > Thanks for your feedback!
> >
> > To David:
> >
> > > Given that the new default feels more complex than the current behavior,
> > > if we decide to do this I think it will be important to include the
> > > rationale you've shared in the documentation.
> >
> > Sounds make sense to me, I will add the related doc if we
> > update the default strategy.
> >
> > To Mason:
> >
> > > I suppose we could do some benchmarking on what works well for the
> > > resource providers that Flink relies on e.g. Kubernetes. Based on
> > > conferences and blogs, it seems most people are relying on Kubernetes
> > > to deploy Flink and the restart strategy has a large dependency on how
> > > well Kubernetes can scale to requests to redeploy the job.
> >
> > Sorry, I didn't understand what type of benchmarking
> > we should do, could you elaborate on it? Thanks a lot.
> >
> > Best,
> > Rui
> >
> > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen  wrote:
> >
> >> Hi Rui,
> >>
> >> I suppose we could do some benchmarking on what works well for the
> >> resource providers that Flink relies on e.g. Kubernetes. Based on
> >> conferences and blogs, it seems most people are relying on Kubernetes to
> >> deploy Flink and the restart strategy has a large dependency on how well
> >> Kubernetes can scale to requests to redeploy the job.
> >>
> >> Best,
> >> Mason
> >>
> >> On Fri, Nov 17, 2023 at 10:07 AM David Anderson 
> >> wrote:
> >>
> >>> Rui,
> >>>
> >>> I don't have any direct experience with this topic, but given the
> >>> motivation you shared, the proposal makes sense to me. Given that the new
> >>> default feels more complex than the current behavior, if we decide to do
> >>> this I think it will be important to include the rationale you've shared 
> >>> in
> >>> the documentation.
> >>>
> >>> David
> >>>
> >>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:
> >>>
>  Hi dear flink users and devs:
> 
>  FLIP-364[1] intends to make some improvements to restart-strategy
>  and discuss updating some of the default values of exponential-delay,
>  and whether exponential-delay can be used as the default
>  restart-strategy.
>  After discussing at dev mail list[2], we hope to collect more feedback
>  from Flink users.
> 
>  # Why does the default restart-strategy need to be updated?
> 
>  If checkpointing is enabled, the default value is fixed-delay with
>  Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
>  the job will restart infinitely with high frequency when a job
>  continues to fail.
> 
>  When the Kafka cluster fails, a large number of flink jobs will be
>  restarted frequently. After the kafka cluster is recovered, a large
>  number of high-frequency restarts of flink jobs may cause the
>  kafka cluster to avalanche again.
> 
>  Considering the exponential-delay as the default strategy with
>  a couple of reasons:
> 
>  - The exponential-delay can reduce the restart frequency when
>    a job continues to fail.
>  - It can restart a job quickly when a job fails occasionally.
>  - The restart-strategy.exponential-delay.jitter-factor can avoid r
>    estarting multiple jobs at the same time. It’s useful to prevent
>    avalanches.
> 
>  # What are the current default values[4] of exponential-delay?
> 
>  restart-strategy.exponential-delay.initial-backoff : 1s
>  restart-strategy.exponential-delay.backoff-multiplier : 2.0
>  restart-strategy.exponential-delay.jitter-factor : 0.1
>  restart-strategy.exponential-delay.max-backoff : 5 min
>  restart-strategy.exponential-delay.reset-backoff-threshold : 1h
> 
>  backoff-multiplier=2 means 

Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-11-19 Thread Rui Fan
Hi David and Mason,

Thanks for your feedback!

To David:

> Given that the new default feels more complex than the current behavior,
> if we decide to do this I think it will be important to include the
> rationale you've shared in the documentation.

That makes sense to me. I will add the related documentation if we
update the default strategy.

To Mason:

> I suppose we could do some benchmarking on what works well for the
> resource providers that Flink relies on e.g. Kubernetes. Based on
> conferences and blogs, it seems most people are relying on Kubernetes to
> deploy Flink and the restart strategy has a large dependency on how well
> Kubernetes can scale to requests to redeploy the job.

Sorry, I didn't understand what type of benchmarking
we should do. Could you elaborate on it? Thanks a lot.

Best,
Rui

On Sat, Nov 18, 2023 at 3:32 AM Mason Chen  wrote:

> Hi Rui,
>
> I suppose we could do some benchmarking on what works well for the
> resource providers that Flink relies on e.g. Kubernetes. Based on
> conferences and blogs, it seems most people are relying on Kubernetes to
> deploy Flink and the restart strategy has a large dependency on how well
> Kubernetes can scale to requests to redeploy the job.
>
> Best,
> Mason
>
> On Fri, Nov 17, 2023 at 10:07 AM David Anderson 
> wrote:
>
>> Rui,
>>
>> I don't have any direct experience with this topic, but given the
>> motivation you shared, the proposal makes sense to me. Given that the new
>> default feels more complex than the current behavior, if we decide to do
>> this I think it will be important to include the rationale you've shared in
>> the documentation.
>>
>> David
>>
>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:
>>
>>> Hi dear flink users and devs:
>>>
>>> FLIP-364[1] intends to make some improvements to restart-strategy
>>> and discuss updating some of the default values of exponential-delay,
>>> and whether exponential-delay can be used as the default
>>> restart-strategy.
>>> After discussing at dev mail list[2], we hope to collect more feedback
>>> from Flink users.
>>>
>>> # Why does the default restart-strategy need to be updated?
>>>
>>> If checkpointing is enabled, the default value is fixed-delay with
>>> Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
>>> the job will restart infinitely with high frequency when a job
>>> continues to fail.
>>>
>>> When the Kafka cluster fails, a large number of flink jobs will be
>>> restarted frequently. After the kafka cluster is recovered, a large
>>> number of high-frequency restarts of flink jobs may cause the
>>> kafka cluster to avalanche again.
>>>
>>> Considering the exponential-delay as the default strategy with
>>> a couple of reasons:
>>>
>>> - The exponential-delay can reduce the restart frequency when
>>>   a job continues to fail.
>>> - It can restart a job quickly when a job fails occasionally.
>>> - The restart-strategy.exponential-delay.jitter-factor can avoid
>>>   restarting multiple jobs at the same time. It’s useful to prevent
>>>   avalanches.
>>>
>>> # What are the current default values[4] of exponential-delay?
>>>
>>> restart-strategy.exponential-delay.initial-backoff : 1s
>>> restart-strategy.exponential-delay.backoff-multiplier : 2.0
>>> restart-strategy.exponential-delay.jitter-factor : 0.1
>>> restart-strategy.exponential-delay.max-backoff : 5 min
>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
>>>
>>> backoff-multiplier=2 means that the delay time of each restart
>>> will be doubled. The delay times are:
>>> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
>>>
>>> The delay time is increased rapidly, it will affect the recover
>>> time for flink jobs.
>>>
>>> # Option improvements
>>>
>>> We think the backoff-multiplier between 1 and 2 is more sensible,
>>> such as:
>>>
>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
>>> restart-strategy.exponential-delay.max-backoff : 1 min
>>>
>>> After updating, the delay times are:
>>>
>>> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
>>> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
>>> 22.186s, 26.623s, 31.948s, 38.337s, etc
>>>
>>> They achieve the following goals:
>>> - When restarts are infrequent in a short period of time, flink can
>>>   quickly restart the job. (For example: the retry delay time when
>>>   restarting 5 times is 2.073s)
>>> - When restarting frequently in a short period of time, flink can
>>>   slightly reduce the restart frequency to prevent avalanches.
>>>   (For example: the retry delay time when retrying 10 times is 5.1 s,
>>>   and the retry delay time when retrying 20 times is 38s, which is not
>>>   very large.)
>>>
>>> As @Mingliang Liu mentioned at dev mail list: the one-size-fits-all
>>> default values do not exist. So our goal is that the default values
>>> can be suitable for most jobs.
>>>
>>> Looking forward to your thoughts and feedback, thanks~
>>>
>>> [1] 

Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-11-17 Thread David Anderson
Rui,

I don't have any direct experience with this topic, but given the
motivation you shared, the proposal makes sense to me. Given that the new
default feels more complex than the current behavior, if we decide to do
this I think it will be important to include the rationale you've shared in
the documentation.

David

On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:

> Hi dear flink users and devs:
>
> FLIP-364[1] intends to make some improvements to restart-strategy
> and discuss updating some of the default values of exponential-delay,
> and whether exponential-delay can be used as the default restart-strategy.
> After discussing at dev mail list[2], we hope to collect more feedback
> from Flink users.
>
> # Why does the default restart-strategy need to be updated?
>
> If checkpointing is enabled, the default value is fixed-delay with
> Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
> the job will restart infinitely with high frequency when a job
> continues to fail.
>
> When the Kafka cluster fails, a large number of flink jobs will be
> restarted frequently. After the kafka cluster is recovered, a large
> number of high-frequency restarts of flink jobs may cause the
> kafka cluster to avalanche again.
>
> Considering the exponential-delay as the default strategy with
> a couple of reasons:
>
> - The exponential-delay can reduce the restart frequency when
>   a job continues to fail.
> - It can restart a job quickly when a job fails occasionally.
> - The restart-strategy.exponential-delay.jitter-factor can avoid
>   restarting multiple jobs at the same time. It’s useful to prevent
>   avalanches.
>
> # What are the current default values[4] of exponential-delay?
>
> restart-strategy.exponential-delay.initial-backoff : 1s
> restart-strategy.exponential-delay.backoff-multiplier : 2.0
> restart-strategy.exponential-delay.jitter-factor : 0.1
> restart-strategy.exponential-delay.max-backoff : 5 min
> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
>
> backoff-multiplier=2 means that the delay time of each restart
> will be doubled. The delay times are:
> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
>
> The delay time is increased rapidly, it will affect the recover
> time for flink jobs.
>
> # Option improvements
>
> We think the backoff-multiplier between 1 and 2 is more sensible,
> such as:
>
> restart-strategy.exponential-delay.backoff-multiplier : 1.2
> restart-strategy.exponential-delay.max-backoff : 1 min
>
> After updating, the delay times are:
>
> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
> 22.186s, 26.623s, 31.948s, 38.337s, etc
>
> They achieve the following goals:
> - When restarts are infrequent in a short period of time, flink can
>   quickly restart the job. (For example: the retry delay time when
>   restarting 5 times is 2.073s)
> - When restarting frequently in a short period of time, flink can
>   slightly reduce the restart frequency to prevent avalanches.
>   (For example: the retry delay time when retrying 10 times is 5.1 s,
>   and the retry delay time when retrying 20 times is 38s, which is not
>   very large.)
>
> As @Mingliang Liu mentioned at dev mail list: the one-size-fits-all
> default values do not exist. So our goal is that the default values
> can be suitable for most jobs.
>
> Looking forward to your thoughts and feedback, thanks~
>
> [1] https://cwiki.apache.org/confluence/x/uJqzDw
> [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
> [3] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
> [4] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
>
> Best,
> Rui
>


[DISCUSS] Change the default restart-strategy to exponential-delay

2023-11-15 Thread Rui Fan
Dear Flink users and devs,

FLIP-364[1] proposes some improvements to restart-strategy. It also
discusses updating some of the default values of exponential-delay
and whether exponential-delay can become the default restart-strategy.
After the discussion on the dev mailing list[2], we hope to collect
more feedback from Flink users.

# Why does the default restart-strategy need to be updated?

If checkpointing is enabled, the default value is fixed-delay with
Integer.MAX_VALUE restart attempts and '1 s' delay[3]. This means
a job that keeps failing will restart indefinitely at high
frequency.

When a Kafka cluster fails, a large number of Flink jobs will be
restarted frequently. After the Kafka cluster recovers, the
high-frequency restarts of this large number of Flink jobs may cause
the Kafka cluster to avalanche again.

We are considering exponential-delay as the default strategy for
a few reasons:

- exponential-delay reduces the restart frequency when
  a job continues to fail.
- It can restart a job quickly when a job fails only occasionally.
- restart-strategy.exponential-delay.jitter-factor avoids
  restarting multiple jobs at the same time, which helps prevent
  avalanches.
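
To illustrate the jitter idea, here is a minimal Python sketch. It
assumes the jitter is a uniformly random offset of at most
jitter-factor times the current delay. Flink's real implementation may
differ in detail, and the function name jittered_delay is hypothetical.

```python
import random


def jittered_delay(base_delay_s, jitter_factor, rng):
    """Shift base_delay_s by a random offset within +/- jitter_factor * base_delay_s."""
    offset = rng.uniform(-jitter_factor, jitter_factor) * base_delay_s
    return base_delay_s + offset


# Five jobs that all failed at the same moment with an 8s base delay
# (assumed value) and jitter-factor 0.1 end up with slightly different delays.
rng = random.Random(42)
delays = [jittered_delay(8.0, 0.1, rng) for _ in range(5)]
for d in delays:
    print(f"{d:.3f}s")
```

Because each job draws its own offset, the restarts are spread over a
small window instead of all hitting the external system at the same
instant.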

# What are the current default values[4] of exponential-delay?

restart-strategy.exponential-delay.initial-backoff : 1s
restart-strategy.exponential-delay.backoff-multiplier : 2.0
restart-strategy.exponential-delay.jitter-factor : 0.1
restart-strategy.exponential-delay.max-backoff : 5 min
restart-strategy.exponential-delay.reset-backoff-threshold : 1h

backoff-multiplier=2 means that the delay time of each restart
will be doubled. The delay times are:
1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.

The delay time increases rapidly, which affects the recovery
time of Flink jobs.
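
The sequence above can be reproduced with a few lines of Python. This
is only a sketch of the arithmetic (no jitter, no reset of the backoff),
not Flink's implementation, and the helper name backoff_delays is made
up for this mail.

```python
def backoff_delays(initial_s, multiplier, max_backoff_s, count):
    """Return the first `count` restart delays in seconds, capped at max_backoff_s."""
    delays = []
    delay = initial_s
    for _ in range(count):
        delays.append(min(delay, max_backoff_s))
        delay *= multiplier  # grow the delay after every consecutive failure
    return delays


# Current defaults: initial-backoff 1s, backoff-multiplier 2.0, max-backoff 5 min.
current = backoff_delays(1.0, 2.0, 300.0, 11)
print(", ".join(f"{d:g}s" for d in current))
```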

# Option improvements

We think a backoff-multiplier between 1 and 2 is more sensible,
such as:

restart-strategy.exponential-delay.backoff-multiplier : 1.2
restart-strategy.exponential-delay.max-backoff : 1 min

After updating, the delay times are:

1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
22.186s, 26.623s, 31.948s, 38.337s, etc

These values achieve the following goals:
- When restarts are infrequent in a short period of time, Flink can
  quickly restart the job. (For example: the retry delay time at the
  5th restart is 2.073s.)
- When restarts are frequent in a short period of time, Flink can
  slightly reduce the restart frequency to prevent avalanches.
  (For example: the retry delay time at the 10th retry is 5.1s,
  and at the 20th retry it is about 32s, which is not very large.)
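
For a rough feeling of how much the restart frequency drops, the
following Python sketch counts restarts during 10 minutes of continuous
failure. It is a simplification under stated assumptions: every attempt
fails immediately (job start-up time is ignored), there is no jitter,
and the backoff never resets. The helper name restarts_within is
hypothetical.

```python
def restarts_within(window_s, initial_s, multiplier, max_backoff_s):
    """Count restarts triggered within window_s if every attempt fails immediately."""
    elapsed = 0.0
    delay = initial_s
    restarts = 0
    while elapsed + min(delay, max_backoff_s) <= window_s:
        elapsed += min(delay, max_backoff_s)  # wait for the (capped) delay
        restarts += 1
        delay *= multiplier
    return restarts


# fixed-delay with 1s is modelled as multiplier 1.0 with a 1s cap.
fixed = restarts_within(600, 1.0, 1.0, 1.0)
# Proposed values: backoff-multiplier 1.2, max-backoff 1 min.
proposed = restarts_within(600, 1.0, 1.2, 60.0)
print("fixed-delay 1s:", fixed)
print("exponential-delay 1.2 / 1 min:", proposed)
```

Under these assumptions the exponential strategy triggers far fewer
restarts in the same window, while the first few restarts still happen
within seconds.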

As @Mingliang Liu mentioned on the dev mailing list, one-size-fits-all
default values do not exist. So our goal is for the default values
to be suitable for most jobs.

Looking forward to your thoughts and feedback, thanks~

[1] https://cwiki.apache.org/confluence/x/uJqzDw
[2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
[3] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
[4] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy

Best,
Rui