Hello Flink Community,

We run the Flink Kubernetes Operator 1.13.0 with Flink 2.2 (Application mode) 
and the built-in Autoscaler enabled on AWS EKS, using Karpenter for node 
provisioning and consolidation.

We're hitting a collision between Karpenter's consolidation and the Flink
Operator Autoscaler: when the Autoscaler rescales a job, nodes briefly appear
underutilized, so Karpenter consolidates them and evicts active
TaskManager/JobManager pods. The result is cluster instability and abnormal
behavior that does not seem to recover without manual intervention.

To mitigate the issue and avoid aggressive pod evictions (especially of
TaskManagers), we annotated the TM pods with karpenter.sh/do-not-disrupt:
"true". However, this introduced a new problem: some nodes are no longer
consolidated at all.

Questions:

  1. What is the recommended pattern for running the Autoscaler alongside
     Karpenter?
  2. JobManagers are currently configured with PodDisruptionBudgets. Should
     TaskManagers have PDBs as well?
  3. Would you recommend isolating Flink onto a dedicated Karpenter NodePool
     with consolidationPolicy: WhenEmpty?

We're currently considering do-not-disrupt annotations, PDBs, a dedicated
NodePool, and a larger stabilization window.

Your feedback on whether these align with community best practices would be 
very welcome.

Thanks!
Tamir
