Hi David and Mason,

Thanks for your feedback!

To David:

> Given that the new default feels more complex than the current behavior,
if we decide to do this I think it will be important to include the
rationale you've shared in the documentation.

Sounds make sense to me, I will add the related doc if we
update the default strategy.

To Mason:

> I suppose we could do some benchmarking on what works well for the
resource providers that Flink relies on e.g. Kubernetes. Based on
conferences and blogs,
> it seems most people are relying on Kubernetes to deploy Flink and the
restart strategy has a large dependency on how well Kubernetes can scale to
requests to redeploy the job.

Sorry, I didn't understand what type of benchmarking
we should do, could you elaborate on it? Thanks a lot.

Best,
Rui

On Sat, Nov 18, 2023 at 3:32 AM Mason Chen <mas.chen6...@gmail.com> wrote:

> Hi Rui,
>
> I suppose we could do some benchmarking on what works well for the
> resource providers that Flink relies on e.g. Kubernetes. Based on
> conferences and blogs, it seems most people are relying on Kubernetes to
> deploy Flink and the restart strategy has a large dependency on how well
> Kubernetes can scale to requests to redeploy the job.
>
> Best,
> Mason
>
> On Fri, Nov 17, 2023 at 10:07 AM David Anderson <dander...@apache.org>
> wrote:
>
>> Rui,
>>
>> I don't have any direct experience with this topic, but given the
>> motivation you shared, the proposal makes sense to me. Given that the new
>> default feels more complex than the current behavior, if we decide to do
>> this I think it will be important to include the rationale you've shared in
>> the documentation.
>>
>> David
>>
>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:
>>
>>> Hi dear flink users and devs:
>>>
>>> FLIP-364[1] intends to make some improvements to restart-strategy
>>> and discuss updating some of the default values of exponential-delay,
>>> and whether exponential-delay can be used as the default
>>> restart-strategy.
>>> After discussing at dev mail list[2], we hope to collect more feedback
>>> from Flink users.
>>>
>>> # Why does the default restart-strategy need to be updated?
>>>
>>> If checkpointing is enabled, the default value is fixed-delay with
>>> Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
>>> the job will restart infinitely with high frequency when a job
>>> continues to fail.
>>>
>>> When the Kafka cluster fails, a large number of flink jobs will be
>>> restarted frequently. After the kafka cluster is recovered, a large
>>> number of high-frequency restarts of flink jobs may cause the
>>> kafka cluster to avalanche again.
>>>
>>> Considering the exponential-delay as the default strategy with
>>> a couple of reasons:
>>>
>>> - The exponential-delay can reduce the restart frequency when
>>>   a job continues to fail.
>>> - It can restart a job quickly when a job fails occasionally.
>>> - The restart-strategy.exponential-delay.jitter-factor can avoid r
>>>   estarting multiple jobs at the same time. It’s useful to prevent
>>>   avalanches.
>>>
>>> # What are the current default values[4] of exponential-delay?
>>>
>>> restart-strategy.exponential-delay.initial-backoff : 1s
>>> restart-strategy.exponential-delay.backoff-multiplier : 2.0
>>> restart-strategy.exponential-delay.jitter-factor : 0.1
>>> restart-strategy.exponential-delay.max-backoff : 5 min
>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
>>>
>>> backoff-multiplier=2 means that the delay time of each restart
>>> will be doubled. The delay times are:
>>> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
>>>
>>> The delay time is increased rapidly, it will affect the recover
>>> time for flink jobs.
>>>
>>> # Option improvements
>>>
>>> We think the backoff-multiplier between 1 and 2 is more sensible,
>>> such as:
>>>
>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
>>> restart-strategy.exponential-delay.max-backoff : 1 min
>>>
>>> After updating, the delay times are:
>>>
>>> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
>>> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
>>> 22.186s, 26.623s, 31.948s, 38.337s, etc
>>>
>>> They achieve the following goals:
>>> - When restarts are infrequent in a short period of time, flink can
>>>   quickly restart the job. (For example: the retry delay time when
>>>   restarting 5 times is 2.073s)
>>> - When restarting frequently in a short period of time, flink can
>>>   slightly reduce the restart frequency to prevent avalanches.
>>>   (For example: the retry delay time when retrying 10 times is 5.1 s,
>>>   and the retry delay time when retrying 20 times is 38s, which is not
>>> very
>>> large.)
>>>
>>> As @Mingliang Liu <lium...@apache.org>  mentioned at dev mail list: the
>>> one-size-fits-all
>>> default values do not exist. So our goal is that the default values
>>> can be suitable for most jobs.
>>>
>>> Looking forward to your thoughts and feedback, thanks~
>>>
>>> [1] https://cwiki.apache.org/confluence/x/uJqzDw
>>> [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
>>> [3]
>>>
>>> https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
>>> [4]
>>>
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
>>>
>>> Best,
>>> Rui
>>>
>>

回复