Hi.

I patched my copy of the 1.6.0 operator with the changes from 
https://github.com/apache/flink-kubernetes-operator/pull/673
and this solved the problem.



________________________________
From: Tony Chen <tony.ch...@robinhood.com>
Sent: October 19, 2023, 4:18:36
To: Evgeniy Lyutikov
Cc: user@flink.apache.org; Gyula Fóra
Subject: Re: Flink kubernets operator delete HA metadata after resuming from suspend

Hi Evgeniy,

Did you roll back your operator version? If so, did you run into any issues?

I ran into the following exception in my flink-kubernetes-operator pod while 
rolling back, and I was wondering whether you encountered it as well.

2023-10-18 21:01:15,251 i.f.k.c.e.l.LeaderElector      [ERROR] Exception occurred while releasing lock 'LeaseLock: flink-kubernetes-operator - flink-operator-lease (flink-kubernetes-operator-74f9688dd-bcqr2)'
io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException: Unable to update LeaseLock
    at io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock.update(LeaseLock.java:102)
    at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.release(LeaderElector.java:139)
    at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.stopLeading(LeaderElector.java:120)
    at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$start$2(LeaderElector.java:104)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source)
    at io.fabric8.kubernetes.client.utils.Utils.lambda$null$12(Utils.java:523)
    at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.postComplete(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://10.241.0.1/apis/coordination.k8s.io/v1/namespaces/flink-kubernetes-operator/leases/flink-operator-lease. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "flink-operator-lease": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=flink-operator-lease, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "flink-operator-lease": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
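
For context, the 409 Conflict in the trace above is the Kubernetes API server's optimistic-concurrency check: an update that carries a stale resourceVersion is rejected, and the writer is expected to re-read the object and try again. The following is only a minimal sketch of that mechanism. LeaseStore, ConflictError and demo are hypothetical names, the store is an in-memory stand-in rather than the real API server, and nothing here describes how fabric8's LeaderElector actually reacts to the conflict.

```python
class ConflictError(Exception):
    """Stands in for an HTTP 409 Conflict response (hypothetical)."""


class LeaseStore:
    """Hypothetical in-memory stand-in for the API server's lease storage."""

    def __init__(self):
        self.holder = "operator-a"
        self.resource_version = 1

    def get(self):
        # Return a snapshot that includes the version it was read at.
        return {"holder": self.holder, "resourceVersion": self.resource_version}

    def update(self, lease):
        # Reject writes that are based on an outdated snapshot.
        if lease["resourceVersion"] != self.resource_version:
            raise ConflictError("the object has been modified")
        self.holder = lease["holder"]
        self.resource_version += 1


def demo():
    store = LeaseStore()

    # Writer A reads the lease at version 1.
    stale = store.get()

    # Writer B updates the lease in the meantime, bumping it to version 2.
    fresh = store.get()
    fresh["holder"] = "operator-b"
    store.update(fresh)

    # Writer A now tries to release the lease using its stale snapshot.
    stale["holder"] = None
    try:
        store.update(stale)
        first_attempt = "ok"
    except ConflictError:
        first_attempt = "conflict"

    # Re-reading the latest version and re-applying the change succeeds.
    latest = store.get()
    latest["holder"] = None
    store.update(latest)

    return first_attempt, store.holder, store.resource_version
```

Calling demo() walks through one rejected write followed by a successful retry against the re-read object.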

On Tue, Sep 12, 2023 at 5:51 AM Gyula Fóra <gyula.f...@gmail.com> wrote:
Hi!

I think this issue is the same as 
https://issues.apache.org/jira/browse/FLINK-33011
I am not sure what exactly the underlying cause is, as I could not reproduce it, 
but the fix should be simple.

Also, I believe it is not related to 1.6.0, unless a JOSDK/Fabric8 upgrade caused it.

Cheers,
Gyula


On Mon, Sep 11, 2023 at 7:47 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
You don’t need it, but you can really mess up clusters by rolling back CRD 
changes…

On Mon, 11 Sep 2023 at 19:42, Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:

Why do we need to use the latest CRD version with an older operator version?

________________________________
From: Gyula Fóra <gyula.f...@gmail.com>
Sent: September 12, 2023, 0:36:26
To: Evgeniy Lyutikov
Cc: user@flink.apache.org
Subject: Re: Flink kubernets operator delete HA metadata after resuming from suspend

Do not change the CRD, but I believe you can roll back the operator itself.

Gyula

On Mon, 11 Sep 2023 at 18:52, Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:

Is it safe to roll back the operator version and replace the CRDs with the old ones?

________________________________
From: Evgeniy Lyutikov <eblyuti...@avito.ru>
Sent: September 11, 2023, 23:50:26
To: Gyula Fóra
Cc: user@flink.apache.org
Subject: Re: Flink kubernets operator delete HA metadata after resuming from suspend


Hi!

No, no one could have restarted the JobManager.
I monitored the pods in real time; they were all deleted on suspend, as expected.



________________________________
From: Gyula Fóra <gyula.f...@gmail.com>
Sent: September 11, 2023, 20:34:52
To: Evgeniy Lyutikov
Cc: user@flink.apache.org
Subject: Re: Flink kubernets operator delete HA metadata after resuming from suspend

Hi!

I could not reproduce your issue; last-state suspend/restore seems to work as 
before.
However, these two log lines seem very suspicious:

2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] JobManager is being deployed

It looks like, after the suspend (which deletes the JobManager Deployment), 
somebody restarted the JobManager manually. Is that possible?

Cheers,
Gyula

On Mon, Sep 11, 2023 at 2:59 PM Evgeniy Lyutikov <eblyuti...@avito.ru> wrote:

Hi all!
After updating the operator to version 1.6.0, suspending and resuming Flink jobs 
stopped working.
When a job resumes, its high availability metadata is deleted.

Suspend job:
2023-09-11 06:01:41,548 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SPECCHANGED     | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[job.state : running -> suspended]), starting reconciliation.
2023-09-11 06:01:41,548 o.a.f.k.o.r.d.AbstractJobReconciler [INFO ][rec-job/rec-job] Job is in running state, ready for upgrade with LAST_STATE
2023-09-11 06:01:41,558 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SUSPENDED       | Suspending existing deployment.
2023-09-11 06:01:41,558 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting cluster with Foreground propagation
2023-09-11 06:01:41,558 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] Deleting JobManager deployment while preserving HA metadata.
2023-09-11 06:01:41,598 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
2023-09-11 06:01:45,667 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (5s)
2023-09-11 06:01:50,730 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (10s)
2023-09-11 06:01:55,837 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (15s)
2023-09-11 06:02:00,885 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown... (20s)
2023-09-11 06:02:01,895 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
2023-09-11 06:02:01,973 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | SUSPENDED       | The resource (job) has been suspended
2023-09-11 06:02:01,981 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][rec-job/rec-job] Resource fully reconciled, nothing to do...

Resume:
2023-09-11 06:02:07,481 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] Observing JobManager deployment. Previous status: MISSING
2023-09-11 06:02:07,488 o.a.f.k.o.o.d.ApplicationObserver [INFO ][rec-job/rec-job] JobManager is being deployed
2023-09-11 06:02:07,563 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | SUSPENDED       | The resource (job) has been suspended
2023-09-11 06:02:07,576 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SPECCHANGED     | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[job.state : suspended -> running]), starting reconciliation.
2023-09-11 06:02:07,649 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource is being upgraded
2023-09-11 06:02:07,649 o.a.f.k.o.r.d.ApplicationReconciler [INFO ][rec-job/rec-job] Deleting deployment with terminated application before new deployment
2023-09-11 06:02:07,649 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting cluster with Foreground propagation
2023-09-11 06:02:07,649 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] Deleting JobManager deployment and HA metadata.
2023-09-11 06:02:07,691 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
2023-09-11 06:02:07,763 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deleting Kubernetes HA metadata
2023-09-11 06:02:07,820 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Waiting for cluster shutdown...
2023-09-11 06:02:07,831 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Cluster shutdown completed.
2023-09-11 06:02:07,975 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource is being upgraded
2023-09-11 06:02:07,987 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SUBMIT          | Starting deployment
2023-09-11 06:02:07,987 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deploying application cluster requiring last-state from HA metadata
2023-09-11 06:02:07,999 o.a.f.k.o.c.FlinkDeploymentController [ERROR][rec-job/rec-job] Flink recovery failed
2023-09-11 06:02:08,012 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Warning | RESTOREFAILED   | HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
2023-09-11 06:02:08,099 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Error   | UPGRADING       | {"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.","additionalMetadata":{},"throwableList":[]}
2023-09-11 06:02:08,193 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Status | Info    | UPGRADING       | The resource is being upgraded
2023-09-11 06:02:08,218 o.a.f.k.o.l.AuditUtils         [INFO ][rec-job/rec-job] >>> Event  | Info    | SUBMIT          | Starting deployment
2023-09-11 06:02:08,218 o.a.f.k.o.s.AbstractFlinkService [INFO ][rec-job/rec-job] Deploying application cluster requiring last-state from HA metadata
2023-09-11 06:02:08,228 o.a.f.k.o.c.FlinkDeploymentController [ERROR][rec-job/rec-job] Flink recovery failed
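
The pattern in the logs above (HA metadata preserved on suspend, then deleted on resume) can be searched for mechanically in an operator log. The following is a minimal sketch, not an operator feature. It assumes each log record sits on a single line and uses the message texts and the "[LEVEL][namespace/name]" prefix exactly as they appear in this thread. find_ha_metadata_loss and SAMPLE are hypothetical names introduced for illustration.

```python
import re

# Message texts copied from the operator log lines quoted in this thread.
PRESERVE = "Deleting JobManager deployment while preserving HA metadata"
DELETE = "Deleting JobManager deployment and HA metadata"

# Matches the "[INFO ][namespace/name]" style prefix and captures the resource.
RESOURCE = re.compile(r"\[(?:INFO|WARN|ERROR)\s*\]\[([^\]]+)\]")


def find_ha_metadata_loss(lines):
    """Return resources whose HA metadata was preserved and later deleted."""
    preserved = set()
    flagged = []
    for line in lines:
        match = RESOURCE.search(line)
        if not match:
            continue
        resource = match.group(1)
        if PRESERVE in line:
            # Suspend path: metadata was kept for a later last-state restore.
            preserved.add(resource)
        elif DELETE in line and resource in preserved:
            # The next deletion removed the metadata that had been kept.
            flagged.append(resource)
            preserved.discard(resource)
    return flagged


SAMPLE = [
    "2023-09-11 06:01:41,558 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] "
    "Deleting JobManager deployment while preserving HA metadata.",
    "2023-09-11 06:02:07,649 o.a.f.k.o.s.NativeFlinkService [INFO ][rec-job/rec-job] "
    "Deleting JobManager deployment and HA metadata.",
]
```

Running find_ha_metadata_loss(SAMPLE) flags rec-job/rec-job, the resource from the logs above.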






--




Tony Chen

Software Engineer

Menlo Park, CA

