Hi all,

I hope this message finds you well.
I'm currently running PyFlink on Kubernetes with the following setup:

  *   Flink Operator version 1.20.1 is deployed in namespace-A
  *   FlinkDeployment is created in namespace-B
  *   FlinkSessionJob is used to run various jobs, with upgradeMode set to last-state (a trimmed manifest is sketched below)
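
For context, a session job is defined roughly like this; the job name, paths, and the flink-python jar location are illustrative rather than exact:

    apiVersion: flink.apache.org/v1beta1
    kind: FlinkSessionJob
    metadata:
      name: my-pyflink-job              # illustrative name
      namespace: namespace-B
    spec:
      deploymentName: session-cluster   # the FlinkDeployment in namespace-B
      job:
        # PyFlink jobs go through the Python driver in the flink-python jar
        jarURI: local:///opt/flink/opt/flink-python-1.20.1.jar
        entryClass: org.apache.flink.client.python.PythonDriver
        args: ["-py", "/mnt/scripts/my_job.py"]   # script lives on the PVC
        parallelism: 2
        upgradeMode: last-state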
To update the PyFlink script, I upload a new version to a folder on a PVC-backed volume. Then I delete the existing FlinkSessionJob so that ArgoCD re-creates it on the next sync.
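
Roughly, the update loop looks like this; the helper pod name and paths are illustrative:

    # copy the new script onto the PVC via a pod that mounts it
    kubectl cp my_job.py namespace-B/scripts-helper-pod:/mnt/scripts/my_job.py
    # delete the session job; ArgoCD re-creates it on the next sync
    kubectl delete flinksessionjob my-pyflink-job -n namespace-B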
However, I'm encountering several issues during this process:

  1.  A job with the same name continues to run, and multiple new jobs are created unexpectedly.
  2.  I tried changing restartNonce to a random value to force a restart (see the snippet after this list), but the new script isn't picked up; it appears the job isn't actually restarted.
  3.  I also attempted to cancel the job with flink cancel, but new jobs keep spawning and the previous job ID is never removed.
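
For reference, this is the only field I change when bumping restartNonce; the value itself is arbitrary, and any previously unused value should trigger a restart:

    spec:
      restartNonce: 1717000000   # any new value tells the operator to restart the job
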
It seems that Flink is not properly cleaning up the previous job or reloading 
the updated script as expected.
Questions:

  *   What is the correct way to restart a FlinkSessionJob to ensure the 
updated PyFlink script is used?
  *   Is there a recommended approach for managing the last-state upgrade mode to avoid duplicate or stuck jobs?
  *   Do I need to handle job IDs or state cleanup manually in this scenario?
Any insights or recommendations would be greatly appreciated.

Best regards,

Pachara Aryuyuen (Kas)
Cloud Platform Engineer
