Hi!

I could not reproduce your issue; last-state suspend/restore seems to work
as before.
However, these two log lines look very suspicious:

2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO
][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO
][rec-job/rec-job] JobManager is being deployed

It looks like, after the suspend (which deleted the JobManager Deployment),
somebody restarted the JobManager manually. Is that possible?
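
If it helps with checking this on your side, here is a rough Python sketch
that scans operator log lines for a JobManager deployment observed while the
last spec change in the log left the job suspended. The function name and the
matching rules are my own assumptions for illustration and are not part of the
operator; the sample lines are taken from your logs, with the mail-wrapped
lines joined back together. A match only shows that a deployment appeared
before the resume was logged. It does not prove who created it.

```python
import re

# Matches the job.state transition reported in SPECCHANGED events.
STATE_CHANGE = re.compile(r"job\.state : (\w+) -> (\w+)")
# Message logged by ApplicationObserver in the logs above.
DEPLOYING = "JobManager is being deployed"


def find_unexpected_deployments(lines):
    """Return log lines where a JobManager deployment is observed while the
    most recent spec change in the log left the job suspended.

    Hypothetical helper for illustration only.
    """
    suspended = False
    flagged = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match:
            # Track the target state of the latest spec change.
            suspended = match.group(2) == "suspended"
        elif suspended and DEPLOYING in line:
            flagged.append(line)
    return flagged


# Sample lines from the quoted logs, rejoined onto single lines.
log = [
    "2023-09-11 06:01:41,548 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] "
    ">>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: "
    "FlinkDeploymentSpec[job.state : running -> suspended]), starting reconciliation.",
    "2023-09-11 06:02:01,895 o.a.f.k.o.s.AbstractFlinkService [INFO ]"
    "[rec-job/rec-job] Cluster shutdown completed.",
    "2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ]"
    "[rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING",
    "2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ]"
    "[rec-job/rec-job] JobManager is being deployed",
    "2023-09-11 06:02:07,576 o.a.f.k.o.l.AuditUtils [INFO ][rec-job/rec-job] "
    ">>> Event | Info | SPECCHANGED | UPGRADE change(s) detected (Diff: "
    "FlinkDeploymentSpec[job.state : suspended -> running]), starting reconciliation.",
]

flagged = find_unexpected_deployments(log)
for line in flagged:
    print(line)
```

On the lines above, this flags only the 06:02:07,488 entry, which is logged
before the suspended -> running spec change.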

Cheers,
Gyula

On Mon, Sep 11, 2023 at 2:59 PM Evgeniy Lyutikov <eblyuti...@avito.ru>
wrote:

> Hi all!
> After updating the operator to version 1.6.0, suspending and resuming
> Flink jobs stopped working.
> When the job resumes, the high availability metadata is removed.
>
> Suspend job:
> 2023-09-11 06:01:41,548 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Event  | Info    | SPECCHANGED     | UPGRADE
> change(s) detected (Diff: FlinkDeploymentSpec[job.state : running ->
> suspended]), starting reconciliation.
> 2023-09-11 06:01:41,548 o.a.f.k.o.r.d.AbstractJobReconciler [INFO
> ][rec-job/rec-job] Job is in running state, ready for upgrade with
> LAST_STATE
> 2023-09-11 06:01:41,558 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Event  | Info    | SUSPENDED       | Suspending
> existing deployment.
> 2023-09-11 06:01:41,558 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Deleting cluster with Foreground propagation
> 2023-09-11 06:01:41,558 o.a.f.k.o.s.NativeFlinkService [INFO
> ][rec-job/rec-job] Deleting JobManager deployment while preserving HA
> metadata.
> 2023-09-11 06:01:41,598 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Waiting for cluster shutdown...
> 2023-09-11 06:01:45,667 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Waiting for cluster shutdown... (5s)
> 2023-09-11 06:01:50,730 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Waiting for cluster shutdown... (10s)
> 2023-09-11 06:01:55,837 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Waiting for cluster shutdown... (15s)
> 2023-09-11 06:02:00,885 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Waiting for cluster shutdown... (20s)
> 2023-09-11 06:02:01,895 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Cluster shutdown completed.
> 2023-09-11 06:02:01,973 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Status | Info    | SUSPENDED       | The resource
> (job) has been suspended
> 2023-09-11 06:02:01,981 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler
> [INFO ][rec-job/rec-job] Resource fully reconciled, nothing to do...
>
> Resume:
> 2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO
> ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
> 2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO
> ][rec-job/rec-job] JobManager is being deployed
> 2023-09-11 06:02:07,563 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Status | Info    | SUSPENDED       | The resource
> (job) has been suspended
> 2023-09-11 06:02:07,576 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Event  | Info    | SPECCHANGED     | UPGRADE
> change(s) detected (Diff: FlinkDeploymentSpec[job.state : suspended ->
> running]), starting reconciliation.
> 2023-09-11 06:02:07,649 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource is
> being upgraded
> 2023-09-11 06:02:07,649 o.a.f.k.o.r.d.ApplicationReconciler [INFO
> ][rec-job/rec-job] Deleting deployment with terminated application before
> new deployment
> 2023-09-11 06:02:07,649 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Deleting cluster with Foreground propagation
> 2023-09-11 06:02:07,649 o.a.f.k.o.s.NativeFlinkService [INFO
> ][rec-job/rec-job] Deleting JobManager deployment and HA metadata.
> 2023-09-11 06:02:07,691 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Waiting for cluster shutdown...
> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Cluster shutdown completed.
> 2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Deleting Kubernetes HA metadata
> 2023-09-11 06:02:07,820 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Waiting for cluster shutdown...
> 2023-09-11 06:02:07,831 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Cluster shutdown completed.
> 2023-09-11 06:02:07,975 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource is
> being upgraded
> 2023-09-11 06:02:07,987 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Event  | Info    | SUBMIT          | Starting
> deployment
> 2023-09-11 06:02:07,987 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Deploying application cluster requiring last-state from
> HA metadata
> 2023-09-11 06:02:07,999 o.a.f.k.o.c.FlinkDeploymentController
> [ERROR][rec-job/rec-job] Flink recovery failed
> 2023-09-11 06:02:08,012 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Event  | Warning | RESTOREFAILED   | HA metadata not
> available to restore from last state. It is possible that the job has
> finished or terminally failed, or the configmaps have been deleted. Manual
> restore required.
> 2023-09-11 06:02:08,099 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Status | Error   | UPGRADING       |
> {"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA
> metadata not available to restore from last state. It is possible that the
> job has finished or terminally failed, or the configmaps have been deleted.
> Manual restore required.","additionalMetadata":{},"throwableList":[]}
> 2023-09-11 06:02:08,193 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource is
> being upgraded
> 2023-09-11 06:02:08,218 o.a.f.k.o.l.AuditUtils         [INFO
> ][rec-job/rec-job] >>> Event  | Info    | SUBMIT          | Starting
> deployment
> 2023-09-11 06:02:08,218 o.a.f.k.o.s.AbstractFlinkService [INFO
> ][rec-job/rec-job] Deploying application cluster requiring last-state from
> HA metadata
> 2023-09-11 06:02:08,228 o.a.f.k.o.c.FlinkDeploymentController
> [ERROR][rec-job/rec-job] Flink recovery failed
