Are there any special precautions that need to be taken when performing
regular K8s maintenance procedures such as migrating/upgrading clusters?

For the sake of concreteness: I'm running my jobs via the Flink K8s
Operator, and I'm finding that when I roll out new nodes and migrate my
jobs to them, in some cases the jobs get stuck and/or don't restart
properly, or restart multiple times, causing more downtime than expected.

As of now my migration/rollout process is as follows:

- Create new K8s nodes/instances
- Cordon the old ones to be replaced (where my jobs are running)
- Take savepoints (roughly as sketched below)
- Drain old nodes
- Wait until all jobs show up as RUNNING and STABLE
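
For the savepoint step I go through the operator rather than hitting the
Flink REST API directly, i.e. something along these lines in the job spec
(a trimmed sketch; the real deployments have more fields and actual nonce
values):

```
# Sketch only: savepoint-based upgrades, plus bumping the nonce to force
# a fresh savepoint right before draining the old nodes.
spec:
  job:
    upgradeMode: savepoint        # restore from the latest savepoint on redeploy
    savepointTriggerNonce: 1      # bumped manually before each drain
```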

Nothing special here, I would say. However, I wonder if there are any
Flink-specific best practices that help minimize downtime and potential
failures during these maintenance windows. Things such as tweaking pod
disruption budgets and/or pod affinities, or maybe considering an HA setup
with multiple JobManagers instead of just one. To be clear, all my jobs are
deployed like this:

```
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
...
spec:
  ...
  mode: native
```

and for what it's worth, their HA setup is based on the native Kubernetes
mode (vs ZooKeeper) with a single JobManager.
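
Concretely, the kind of changes I'm considering look roughly like the
sketch below: a standby JobManager, soft anti-affinity via the operator's
podTemplate to spread pods across nodes, and a PodDisruptionBudget so that
a node drain evicts only one TaskManager at a time. Field names are from my
reading of the v1beta1 CRD and the label selectors are placeholders, so
treat this as a sketch rather than something I've validated:

```
# Sketch: standby JobManager + soft anti-affinity to spread pods across nodes.
spec:
  flinkConfiguration:
    high-availability.type: kubernetes                    # native K8s HA, no ZooKeeper
    high-availability.storageDir: s3://<bucket>/flink-ha  # placeholder path
  jobManager:
    replicas: 2                   # one active + one standby for faster failover
  podTemplate:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: <my-deployment-name>             # placeholder label
---
# Plus a PodDisruptionBudget so a drain evicts at most one TaskManager pod at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: <my-deployment-name>-tm-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: <my-deployment-name>                           # placeholder label
      component: taskmanager
```

The standby JobManager in particular is what I meant above by multiple
JobManagers vs just one.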
