Hi! I could not reproduce your issue; last-state suspend/restore seems to work as before. However, these two log lines look very suspicious:

2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] JobManager is being deployed

Looks like after suspending (and deleting the JobManager Deployment) somebody restarted the JobManager manually. Is that possible?

Cheers,
Gyula

On Mon, Sep 11, 2023 at 2:59 PM Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:

> Hi all!
> After updating the operator to version 1.6.0, suspended and resuming flink jobs stopped working.
> When job resumes, the high availability metadata is removed.
>
> Suspend job:
> 2023-09-11 06:01:41,548 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[job.state : running -> suspended]), starting reconciliation.
> 2023-09-11 06:01:41,548 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][rec-job/rec-job] Job is in running state, ready for upgrade with LAST_STATE
> 2023-09-11 06:01:41,558 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SUSPENDED | Suspending existing deployment.
> 2023-09-11 06:01:41,558 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting cluster with Foreground propagation
> 2023-09-11 06:01:41,558 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] Deleting JobManager deployment while preserving HA metadata.
> 2023-09-11 06:01:41,598 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
> 2023-09-11 06:01:45,667 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (5s)
> 2023-09-11 06:01:50,730 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (10s)
> 2023-09-11 06:01:55,837 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (15s)
> 2023-09-11 06:02:00,885 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (20s)
> 2023-09-11 06:02:01,895 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
> 2023-09-11 06:02:01,973 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | SUSPENDED | The resource (job) has been suspended
> 2023-09-11 06:02:01,981 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][rec-job/rec-job] Resource fully reconciled, nothing to do...
>
> Resume:
> 2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
> 2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] JobManager is being deployed
> 2023-09-11 06:02:07,563 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | SUSPENDED | The resource (job) has been suspended
> 2023-09-11 06:02:07,576 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[job.state : suspended -> running]), starting reconciliation.
> 2023-09-11 06:02:07,649 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | UPGRADING | The resource is being upgraded
> 2023-09-11 06:02:07,649 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][rec-job/rec-job] Deleting deployment with terminated application before new deployment
> 2023-09-11 06:02:07,649 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting cluster with Foreground propagation
> 2023-09-11 06:02:07,649 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] Deleting JobManager deployment and HA metadata.
> 2023-09-11 06:02:07,691 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting Kubernetes HA metadata
> 2023-09-11 06:02:07,820 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
> 2023-09-11 06:02:07,831 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
> 2023-09-11 06:02:07,975 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | UPGRADING | The resource is being upgraded
> 2023-09-11 06:02:07,987 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SUBMIT | Starting deployment
> 2023-09-11 06:02:07,987 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deploying application cluster requiring last-state from HA metadata
> 2023-09-11 06:02:07,999 o.a.f.k.o.c.FlinkDeploymentController [ERROR][rec-job/rec-job] Flink recovery failed
> 2023-09-11 06:02:08,012 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Warning | RESTOREFAILED | HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
> 2023-09-11 06:02:08,099 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Error | UPGRADING | {"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.","additionalMetadata":{},"throwableList":[]}
> 2023-09-11 06:02:08,193 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Status | Info | UPGRADING | The resource is being upgraded
> 2023-09-11 06:02:08,218 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] >>> Event | Info | SUBMIT | Starting deployment
> 2023-09-11 06:02:08,218 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deploying application cluster requiring last-state from HA metadata
> 2023-09-11 06:02:08,228 o.a.f.k.o.c.FlinkDeploymentController [ERROR][rec-job/rec-job] Flink recovery failed
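
For anyone who wants to check their own operator log for the same pattern (a JobManager deployment showing up while the resource is still suspended, before the operator has seen the spec change back to running), here is a minimal sketch. It assumes the log line layout and message strings exactly as they appear in this thread; other operator versions may word them differently. The names LINE_RE, SAMPLE and find_unexpected_deployments are made up for illustration and are not part of the operator.

```python
import re

# Assumed layout, based only on the lines quoted in this thread:
# "<date> <time> <logger> [LEVEL][namespace/name] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<logger>\S+)\s+"
    r"\[(?P<level>\w+)\s*\]\[(?P<resource>[^\]]+)\]\s+"
    r"(?P<msg>.*)$"
)

# A few lines copied from the thread, used as sample input.
SAMPLE = [
    "2023-09-11 06:02:01,973 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] "
    ">>> Status | Info | SUSPENDED | The resource (job) has been suspended",
    "2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] "
    "Observing JobManager deployment. Previous status: MISSING",
    "2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] "
    "JobManager is being deployed",
    "2023-09-11 06:02:07,576 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] "
    ">>> Event | Info | SPECCHANGED | UPGRADE change(s) detected "
    "(Diff: FlinkDeploymentSpec[job.state : suspended -> running]), starting reconciliation.",
]


def find_unexpected_deployments(lines):
    """Return (timestamp, message) pairs where a JobManager deployment is
    observed after the job was reported suspended and before a
    'suspended -> running' spec change was detected."""
    suspended = False
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like an operator log line
        # Collapse runs of whitespace so padded columns still match.
        msg = " ".join(match.group("msg").split())
        if "| SUSPENDED |" in msg and "has been suspended" in msg:
            suspended = True
        elif "SPECCHANGED" in msg and "suspended -> running" in msg:
            suspended = False
        elif suspended and "JobManager is being deployed" in msg:
            hits.append((match.group("ts"), msg))
    return hits


if __name__ == "__main__":
    for ts, msg in find_unexpected_deployments(SAMPLE):
        print(f"{ts}: {msg}")
```

A hit only says that something created the JobManager Deployment while the job was suspended; it does not say who or what did it.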