[ANNOUNCE] Apache Flink Kubernetes Operator 1.8.0 released

2024-03-25 Post by Maximilian Michels
The Apache Flink community is very happy to announce the release of
the Apache Flink Kubernetes Operator version 1.8.0.

The Flink Kubernetes Operator allows users to manage their Apache
Flink applications on Kubernetes through all aspects of their
lifecycle.

Release highlights:
- Flink Autotuning automatically adjusts TaskManager memory
- Flink Autoscaling metrics and decision accuracy improved
- Improve standalone Flink Autoscaling
- Savepoint trigger nonce for savepoint-based restarts
- Operator stability improvements for cluster shutdown

Blog post: 
https://flink.apache.org/2024/03/21/apache-flink-kubernetes-operator-1.8.0-release-announcement/

The release is available for download at:
https://flink.apache.org/downloads.html

Maven artifacts for Flink Kubernetes Operator can be found at:
https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator

Official Docker image for Flink Kubernetes Operator can be found at:
https://hub.docker.com/r/apache/flink-kubernetes-operator

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12353866&projectId=12315522

We would like to thank the Apache Flink community and its contributors
who made this release possible!

Cheers,
Max


Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-12-12 Post by Maximilian Michels
Thank you Rui! I think a 1.5 multiplier is a reasonable tradeoff
between restarting fast and not putting too much pressure on the
cluster due to restarts.

-Max

On Tue, Dec 12, 2023 at 8:19 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> Hi Maximilian and Mason,
>
> Thanks a lot for your feedback!
>
> After an offline consultation with Max, I think I now understand your
> concern: when a Flink job restarts, it makes a bunch of
> calls to the Kubernetes API, e.g. reading/writing config maps and creating
> task managers. Currently, the default restart strategy is fixed-delay
> with a 1s delay, so Flink restarts jobs at high frequency
> even if the jobs cannot be started. This can make the Kubernetes
> cluster unstable.
>
> That's why I propose changing the default restart strategy to
> exponential-delay. With it, restarts happen quickly enough
> unless there are consecutive failures, which helps the
> stability of external components.
>
> After discussing with Max and Zhu Zhu in the PR comments[1],
> Max suggested using 1.5 as the default value of backoff-multiplier
> instead of 1.2. A value of 1.2 is a little small (the delay time is too short).
> This picture[2] shows the relationship between restart-attempts and
> retry-delay-time when backoff-multiplier is 1.2 and 1.5:
>
> - The delay-time will reach 1 min after 12 attempts when backoff-multiplier is 1.5
> - The delay-time will reach 1 min after 24 attempts when backoff-multiplier is 1.2
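
The two attempt counts above can be checked with a few lines of arithmetic. This is only a sketch, not Flink code: the function name is hypothetical, and it assumes the first restart waits the 1s initial backoff, each later restart multiplies the previous delay by the backoff-multiplier, the delay is capped at the 5 min max-backoff, and jitter is ignored.

```python
def attempts_until_delay(target_s, multiplier, initial_s=1.0, max_s=300.0):
    """Return the 1-based restart attempt whose backoff first reaches target_s.

    Assumed model (not taken from Flink's source): attempt 1 waits initial_s,
    every later attempt waits multiplier times the previous delay, capped at max_s.
    """
    if multiplier <= 1.0 or initial_s <= 0:
        raise ValueError("need multiplier > 1 and a positive initial backoff")
    if target_s > max_s:
        raise ValueError("target exceeds max backoff, so it is never reached")
    delay, attempt = initial_s, 1
    while delay < target_s:
        delay = min(delay * multiplier, max_s)
        attempt += 1
    return attempt


# Compare the multipliers discussed in the thread against a 60 s target.
for m in (1.2, 1.5):
    print(f"multiplier {m}: 1 min reached at attempt {attempts_until_delay(60.0, m)}")
```

Under these assumptions the sketch reproduces the 12 and 24 attempts quoted above.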
>
> Is there any other suggestion? Looking forward to more feedback, thanks~
>
> BTW, as Zhu said in the comment[1], if we update the default value,
> a new vote is needed for this default value. So I will pause
> FLINK-33736[3] first, and the remaining JIRAs of FLIP-364 will
> continue.
>
> To Mason:
>
> If I understand your concerns correctly, I still don't know how
> to benchmark. The kubernetes cluster instability only happens
> when one cluster has a lot of jobs. In general, the test cannot
> reproduce the pressure. Could you elaborate on how to
> benchmark for this?
>
> After this FLIP, the default restart frequency will be reduced
> significantly, especially when a job fails consecutively.
> Do you think the benchmark is necessary?
>
> Looking forward to your feedback, thanks~
>
> [1] https://github.com/apache/flink/pull/23247#discussion_r1422626734
> [2] https://github.com/apache/flink/assets/38427477/642c57e0-b415-4326-af05-8b506c5fbb3a
> [3] https://issues.apache.org/jira/browse/FLINK-33736
>
> Best,
> Rui
>
> On Thu, Dec 7, 2023 at 10:57 PM Maximilian Michels wrote:
>>
>> Hey Rui,
>>
>> +1 for changing the default restart strategy to exponential-delay.
>> This is something all users eventually run into. They end up changing
>> the restart strategy to exponential-delay. I think the current
>> defaults are quite balanced. Restarts happen quickly enough unless
>> there are consecutive failures where I think it makes sense to double
>> the waiting time up till the max.
>>
>> -Max
>>
>>
>> On Wed, Dec 6, 2023 at 12:51 AM Mason Chen wrote:
>> >
>> > Hi Rui,
>> >
>> > Sorry for the late reply. I was suggesting that perhaps we could do some
>> > testing with Kubernetes wrt configuring values for the exponential restart
>> > strategy. We've noticed that the default strategy in 1.17 caused a lot of
>> > requests to the K8s API server for unstable deployments.
>> >
>> > However, people in different Kubernetes setups will have different limits
>> > so it would be challenging to provide a general benchmark. Another thing I
>> > found helpful in the past is to refer to Kubernetes--for example, the
>> > default strategy is exponential for pod restarts and we could draw
>> > inspiration from what they have set as a general purpose default config.
>> >
>> > Best,
>> > Mason
>> >
>> > On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
>> >
>> > > Hi David and Mason,
>> > >
>> > > Thanks for your feedback!
>> > >
>> > > To David:
>> > >
>> > > > Given that the new default feels more complex than the current behavior,
>> > > > if we decide to do this I think it will be important to include the
>> > > > rationale you've shared in the documentation.
>> > >
>> > > That makes sense to me. I will add the related docs if we
>> > > update the default strategy.
>> > >
>> > > To Mason:
>> > >

Re: [DISCUSS] Change the default restart-strategy to exponential-delay

2023-12-07 Post by Maximilian Michels
Hey Rui,

+1 for changing the default restart strategy to exponential-delay.
This is something all users eventually run into. They end up changing
the restart strategy to exponential-delay. I think the current
defaults are quite balanced. Restarts happen quickly enough unless
there are consecutive failures where I think it makes sense to double
the waiting time up till the max.

-Max


On Wed, Dec 6, 2023 at 12:51 AM Mason Chen wrote:
>
> Hi Rui,
>
> Sorry for the late reply. I was suggesting that perhaps we could do some
> testing with Kubernetes wrt configuring values for the exponential restart
> strategy. We've noticed that the default strategy in 1.17 caused a lot of
> requests to the K8s API server for unstable deployments.
>
> However, people in different Kubernetes setups will have different limits
> so it would be challenging to provide a general benchmark. Another thing I
> found helpful in the past is to refer to Kubernetes--for example, the
> default strategy is exponential for pod restarts and we could draw
> inspiration from what they have set as a general purpose default config.
>
> Best,
> Mason
>
> On Sun, Nov 19, 2023 at 9:43 PM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hi David and Mason,
> >
> > Thanks for your feedback!
> >
> > To David:
> >
> > > Given that the new default feels more complex than the current behavior,
> > > if we decide to do this I think it will be important to include the
> > > rationale you've shared in the documentation.
> >
> > That makes sense to me. I will add the related docs if we
> > update the default strategy.
> >
> > To Mason:
> >
> > > I suppose we could do some benchmarking on what works well for the
> > > resource providers that Flink relies on e.g. Kubernetes. Based on
> > > conferences and blogs, it seems most people are relying on Kubernetes to
> > > deploy Flink and the restart strategy has a large dependency on how well
> > > Kubernetes can scale to requests to redeploy the job.
> >
> > Sorry, I didn't understand what type of benchmarking
> > we should do. Could you elaborate on it? Thanks a lot.
> >
> > Best,
> > Rui
> >
> > On Sat, Nov 18, 2023 at 3:32 AM Mason Chen wrote:
> >
> >> Hi Rui,
> >>
> >> I suppose we could do some benchmarking on what works well for the
> >> resource providers that Flink relies on e.g. Kubernetes. Based on
> >> conferences and blogs, it seems most people are relying on Kubernetes to
> >> deploy Flink and the restart strategy has a large dependency on how well
> >> Kubernetes can scale to requests to redeploy the job.
> >>
> >> Best,
> >> Mason
> >>
> >> On Fri, Nov 17, 2023 at 10:07 AM David Anderson wrote:
> >>
> >>> Rui,
> >>>
> >>> I don't have any direct experience with this topic, but given the
> >>> motivation you shared, the proposal makes sense to me. Given that the new
> >>> default feels more complex than the current behavior, if we decide to do
> >>> this I think it will be important to include the rationale you've shared
> >>> in the documentation.
> >>>
> >>> David
> >>>
> >>> On Wed, Nov 15, 2023 at 10:17 PM Rui Fan <1996fan...@gmail.com> wrote:
> >>>
>  Hi dear flink users and devs:
> 
>  FLIP-364[1] intends to make some improvements to restart-strategy
>  and discuss updating some of the default values of exponential-delay,
>  and whether exponential-delay can be used as the default
>  restart-strategy.
>  After discussing on the dev mailing list[2], we hope to collect more feedback
>  from Flink users.
> 
>  # Why does the default restart-strategy need to be updated?
> 
>  If checkpointing is enabled, the default value is fixed-delay with
>  Integer.MAX_VALUE restart attempts and '1 s' delay[3]. It means
>  the job will restart infinitely with high frequency when a job
>  continues to fail.
> 
>  When the Kafka cluster fails, a large number of flink jobs will be
>  restarted frequently. After the kafka cluster is recovered, a large
>  number of high-frequency restarts of flink jobs may cause the
>  kafka cluster to avalanche again.
> 
>  We are considering exponential-delay as the default strategy for
>  a couple of reasons:
> 
>  - The exponential-delay can reduce the restart frequency when
>    a job continues to fail.
>  - It can restart a job quickly when a job fails occasionally.
>  - The restart-strategy.exponential-delay.jitter-factor can avoid
>    restarting multiple jobs at the same time. It's useful to prevent
>    avalanches.
> 
>  # What are the current default values[4] of exponential-delay?
> 
>  restart-strategy.exponential-delay.initial-backoff : 1s
>  restart-strategy.exponential-delay.backoff-multiplier : 2.0
>  restart-strategy.exponential-delay.jitter-factor : 0.1
>  restart-strategy.exponential-delay.max-backoff : 5 min
>  restart-strategy.exponential-delay.reset-backoff-threshold : 1h
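
To make the defaults listed above concrete, here is a rough sketch of the delay sequence they would produce for a job that keeps failing. It is an illustration only and not Flink's implementation: the function and parameter names are hypothetical, jitter is modeled as a uniform offset of up to plus or minus jitter_factor times the current delay (an assumption), and reset-backoff-threshold is not modeled.

```python
import random


def backoff_schedule(attempts, initial_s=1.0, multiplier=2.0, max_s=300.0,
                     jitter_factor=0.0, rng=None):
    """Return the delay in seconds before each of the first `attempts` restarts.

    Assumed model: the base delay starts at initial_s, is multiplied by
    `multiplier` after every restart, and is capped at max_s. Jitter is an
    assumed uniform offset within +/- jitter_factor of the base delay.
    """
    rng = rng or random.Random()
    delays = []
    base = initial_s
    for _ in range(attempts):
        offset = base * jitter_factor * rng.uniform(-1.0, 1.0)
        delays.append(base + offset)
        base = min(base * multiplier, max_s)  # grow, but never past max-backoff
    return delays


# Without jitter: the delay doubles from 1 s and is capped at 300 s (5 min).
print(backoff_schedule(11))

# With the 0.1 jitter factor, two jobs failing together get different delays.
print(backoff_schedule(5, jitter_factor=0.1, rng=random.Random(1)))
print(backoff_schedule(5, jitter_factor=0.1, rng=random.Random(2)))
```

The jittered runs show why the jitter factor helps against avalanches: jobs that fail at the same moment no longer retry at exactly the same moment.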
> 
>  backoff-multiplier=2 means 

Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.5.0 released

2023-05-23 Post by Maximilian Michels
Niceee. Thanks for managing the release, Gyula!

-Max

On Wed, May 17, 2023 at 8:25 PM Márton Balassi wrote:
>
> Thanks, awesome! :-)
>
> On Wed, May 17, 2023 at 2:24 PM Gyula Fóra wrote:
>>
>> The Apache Flink community is very happy to announce the release of Apache 
>> Flink Kubernetes Operator 1.5.0.
>>
>> The Flink Kubernetes Operator allows users to manage their Apache Flink 
>> applications and their lifecycle through native k8s tooling like kubectl.
>>
>> Release highlights:
>>  - Autoscaler improvements
>>  - Operator stability, observability improvements
>>
>> Release blogpost:
>> https://flink.apache.org/2023/05/17/apache-flink-kubernetes-operator-1.5.0-release-announcement/
>>
>> The release is available for download at: 
>> https://flink.apache.org/downloads.html
>>
>> Maven artifacts for Flink Kubernetes Operator can be found at: 
>> https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator
>>
>> Official Docker image for Flink Kubernetes Operator applications can be 
>> found at: https://hub.docker.com/r/apache/flink-kubernetes-operator
>>
>> The full release notes are available in Jira:
>> https://issues.apache.org/jira/projects/FLINK/versions/12352931
>>
>> We would like to thank all contributors of the Apache Flink community who 
>> made this release possible!
>>
>> Regards,
>> Gyula Fora