With the Flink k8s operator, you can set the YAML property
job.initialSavepointPath to the path you want to start your pipeline
from. That should be the full savepoint path, including the old job ID.
The restored job will then get a new ID generated for it.
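
As a rough sketch, the FlinkDeployment spec could look something like the
snippet below (the name, image, jar and S3 path are placeholders, and I've
trimmed it to the fields relevant here):

  apiVersion: flink.apache.org/v1beta1
  kind: FlinkDeployment
  metadata:
    name: my-job                                   # placeholder name
  spec:
    image: my-registry/my-flink-job:latest         # same image as the original deployment
    flinkVersion: v1_20
    job:
      jarURI: local:///opt/flink/usrlib/my-job.jar # placeholder jar
      upgradeMode: savepoint
      # full savepoint path, including the old job ID and savepoint directory
      initialSavepointPath: s3://<bucket>/savepoints/<old-job-id>/<savepoint-dir>

Deploying that manifest in the new cluster should start the job from the
referenced savepoint rather than from an empty state.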

To reduce the impact of maintenance issues like this one, a multi-node
cluster may help: k8s will try to spread the deployments across the
different nodes, and even if one node dies, the desired-state mechanism
will make sure everything gets rescheduled.



Att,
Pedro Mázala
Be awesome


On Thu, 12 Jun 2025 at 15:52, gustavo panizzo <g...@zumbi.com.ar> wrote:

> Hello
>
> I run Flink (v 1.20) on k8s using the native integration and the k8s
> operator (v 1.30); we keep savepoints and checkpoints in S3.
>
> We'd like to be able to continue running the same jobs (with the same
> config, same image, using the same sinks and sources, connecting to Kafka
> using the same credentials and groups, restoring the state from where the
> previous job left off) from another k8s cluster in the event of maintenance
> or simply failure of the k8s cluster, hence we need to restore the state
> from a savepoint or checkpoint.
>
> However, the problem we face is that the jobID is part of the path where
> checkpoints and savepoints are stored in S3, and it is generated dynamically
> every time a job (kind: flinkdeployments) is deployed into k8s.
>
> So I cannot re-create the same job in another k8s cluster to pick up where
> the previous job left off.
>
> I could copy files around in S3, but that feels racy and not really great.
> How do others move stateful jobs from one k8s cluster to another?
>
>
> cheers
>
>
