Hey Rui,

+1 for changing the default restart strategy to exponential-delay. This is something all users eventually run into, and they end up switching the restart strategy to exponential-delay themselves.

I think the current exponential-delay defaults are quite balanced. Restarts happen quickly enough unless there are consecutive failures, in which case I think it makes sense to double the waiting time up to the max.
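
To make the doubling concrete, here is a minimal Python sketch of the delay sequence for the option values quoted further down in this thread. It is not Flink's implementation: the helper name `backoff_delays` is made up for illustration, and it deliberately ignores jitter-factor and reset-backoff-threshold.

```python
def backoff_delays(initial_s, multiplier, max_s, attempts):
    """Return the first `attempts` restart delays in seconds.

    Hypothetical helper for illustration only. It models just
    initial-backoff, backoff-multiplier and max-backoff, and ignores
    jitter-factor and reset-backoff-threshold.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        # Each delay is capped at max-backoff.
        delays.append(min(delay, max_s))
        delay *= multiplier
    return delays


# Current defaults quoted in this thread:
# initial-backoff 1s, backoff-multiplier 2.0, max-backoff 5 min.
current = backoff_delays(1.0, 2.0, 300.0, 11)

# Values proposed in this thread:
# backoff-multiplier 1.2, max-backoff 1 min.
proposed = backoff_delays(1.0, 1.2, 60.0, 25)

print([round(d, 3) for d in current])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0, 300.0, 300.0]
print([round(d, 3) for d in proposed])
```

Under these assumptions, a multiplier of 2.0 reaches the 5 min cap on the 10th consecutive restart, while a multiplier of 1.2 with a 1 min cap reaches it on the 24th.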
-Max

On Wed, Dec 6, 2023 at 12:51 AM Mason Chen <mas.chen6...@gmail.com> wrote:
>
> Hi Rui,
>
> Sorry for the late reply. I was suggesting that perhaps we could do some
> testing with Kubernetes wrt configuring values for the exponential restart
> strategy. We've noticed that the default strategy in 1.17 caused a lot of
> requests to the K8s API server for unstable deployments.
>
> However, people in different Kubernetes setups will have different limits
> so it would be challenging to provide a general benchmark. Another thing I
> found helpful in the past is to refer to Kubernetes--for example, the
> default strategy is exponential for pod restarts and we could draw
> inspiration from what they have set as a general purpose default config.
>
> Best,
> Mason
>
> On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi David and Mason,
> >
> > Thanks for your feedback!
> >
> > To David:
> >
> > > Given that the new default feels more complex than the current behavior,
> > > if we decide to do this I think it will be important to include the
> > > rationale you've shared in the documentation.
> >
> > Sounds make sense to me, I will add the related doc if we
> > update the default strategy.
> >
> > To Mason:
> >
> > > I suppose we could do some benchmarking on what works well for the
> > > resource providers that Flink relies on e.g. Kubernetes. Based on
> > > conferences and blogs, it seems most people are relying on Kubernetes
> > > to deploy Flink and the restart strategy has a large dependency on how
> > > well Kubernetes can scale to requests to redeploy the job.
> >
> > Sorry, I didn't understand what type of benchmarking
> > we should do, could you elaborate on it? Thanks a lot.
> >
> > Best,
> > Rui
> >
> > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen <mas.chen6...@gmail.com> wrote:
> >
> >> Hi Rui,
> >>
> >> I suppose we could do some benchmarking on what works well for the
> >> resource providers that Flink relies on e.g. Kubernetes. Based on
> >> conferences and blogs, it seems most people are relying on Kubernetes to
> >> deploy Flink and the restart strategy has a large dependency on how well
> >> Kubernetes can scale to requests to redeploy the job.
> >>
> >> Best,
> >> Mason
> >>
> >> On Fri, Nov 17, 2023 at 10:07 AM David Anderson <dander...@apache.org>
> >> wrote:
> >>
> >>> Rui,
> >>>
> >>> I don't have any direct experience with this topic, but given the
> >>> motivation you shared, the proposal makes sense to me. Given that the new
> >>> default feels more complex than the current behavior, if we decide to do
> >>> this I think it will be important to include the rationale you've shared
> >>> in the documentation.
> >>>
> >>> David
> >>>
> >>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:
> >>>
> >>>> Hi dear flink users and devs:
> >>>>
> >>>> FLIP-364[1] intends to make some improvements to restart-strategy
> >>>> and discuss updating some of the default values of exponential-delay,
> >>>> and whether exponential-delay can be used as the default
> >>>> restart-strategy.
> >>>> After discussing at dev mail list[2], we hope to collect more feedback
> >>>> from Flink users.
> >>>>
> >>>> # Why does the default restart-strategy need to be updated?
> >>>>
> >>>> If checkpointing is enabled, the default value is fixed-delay with
> >>>> Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
> >>>> the job will restart infinitely with high frequency when a job
> >>>> continues to fail.
> >>>>
> >>>> When the Kafka cluster fails, a large number of flink jobs will be
> >>>> restarted frequently. After the kafka cluster is recovered, a large
> >>>> number of high-frequency restarts of flink jobs may cause the
> >>>> kafka cluster to avalanche again.
> >>>>
> >>>> Considering the exponential-delay as the default strategy with
> >>>> a couple of reasons:
> >>>>
> >>>> - The exponential-delay can reduce the restart frequency when
> >>>>   a job continues to fail.
> >>>> - It can restart a job quickly when a job fails occasionally.
> >>>> - The restart-strategy.exponential-delay.jitter-factor can avoid
> >>>>   restarting multiple jobs at the same time. It’s useful to prevent
> >>>>   avalanches.
> >>>>
> >>>> # What are the current default values[4] of exponential-delay?
> >>>>
> >>>> restart-strategy.exponential-delay.initial-backoff : 1s
> >>>> restart-strategy.exponential-delay.backoff-multiplier : 2.0
> >>>> restart-strategy.exponential-delay.jitter-factor : 0.1
> >>>> restart-strategy.exponential-delay.max-backoff : 5 min
> >>>> restart-strategy.exponential-delay.reset-backoff-threshold : 1h
> >>>>
> >>>> backoff-multiplier=2 means that the delay time of each restart
> >>>> will be doubled. The delay times are:
> >>>> 1s, 2s, 4s, 8s, 16s, 32s, 64s, 128s, 256s, 300s, 300s, etc.
> >>>>
> >>>> The delay time is increased rapidly, it will affect the recover
> >>>> time for flink jobs.
> >>>>
> >>>> # Option improvements
> >>>>
> >>>> We think the backoff-multiplier between 1 and 2 is more sensible,
> >>>> such as:
> >>>>
> >>>> restart-strategy.exponential-delay.backoff-multiplier : 1.2
> >>>> restart-strategy.exponential-delay.max-backoff : 1 min
> >>>>
> >>>> After updating, the delay times are:
> >>>>
> >>>> 1s, 1.2s, 1.44s, 1.728s, 2.073s, 2.488s, 2.985s, 3.583s, 4.299s,
> >>>> 5.159s, 6.191s, 7.430s, 8.916s, 10.699s, 12.839s, 15.407s, 18.488s,
> >>>> 22.186s, 26.623s, 31.948s, 38.337s, etc
> >>>>
> >>>> They achieve the following goals:
> >>>> - When restarts are infrequent in a short period of time, flink can
> >>>>   quickly restart the job. (For example: the retry delay time when
> >>>>   restarting 5 times is 2.073s)
> >>>> - When restarting frequently in a short period of time, flink can
> >>>>   slightly reduce the restart frequency to prevent avalanches.
> >>>>   (For example: the retry delay time when retrying 10 times is 5.1 s,
> >>>>   and the retry delay time when retrying 20 times is 38s, which is not
> >>>>   very large.)
> >>>>
> >>>> As @Mingliang Liu <lium...@apache.org> mentioned at dev mail list: the
> >>>> one-size-fits-all default values do not exist. So our goal is that the
> >>>> default values can be suitable for most jobs.
> >>>>
> >>>> Looking forward to your thoughts and feedback, thanks~
> >>>>
> >>>> [1] https://cwiki.apache.org/confluence/x/uJqzDw
> >>>> [2] https://lists.apache.org/thread/5cgrft73kgkzkgjozf9zfk0w2oj7rjym
> >>>> [3] https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/deployment/config/#restart-strategy-type
> >>>> [4] https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#exponential-delay-restart-strategy
> >>>>
> >>>> Best,
> >>>> Rui