I am afraid we do not currently handle the scenario where the JobManager deployment is deleted externally.
Best,
Yang

Őrhidi Mátyás <matyas.orh...@gmail.com> wrote on Mon, May 2, 2022 at 16:52:

> I filed a Jira for tracking this issue:
> https://issues.apache.org/jira/browse/FLINK-27468
>
> On Mon, May 2, 2022 at 10:31 AM Őrhidi Mátyás <matyas.orh...@gmail.com>
> wrote:
>
>> This can be reproduced simply by deleting the Kubernetes deployment. The
>> operator cannot recover from this state automatically, but defining a
>> restartNonce on the deployment should recover the state.
>>
>> Regards,
>> Matyas
>>
>> On Mon, May 2, 2022 at 10:00 AM Márton Balassi <balassi.mar...@gmail.com>
>> wrote:
>>
>>> Hi ChangZhuo,
>>>
>>> Thanks for reporting this, I think I have just run into this myself too.
>>> Will try to reproduce it, but I do not fully comprehend it yet. If anyone
>>> has a way to reproduce it, that is more than welcome. :-)
>>>
>>> On Fri, Apr 29, 2022 at 12:16 PM ChangZhuo Chen (陳昌倬) <czc...@czchen.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We found that flink operator [0] sometimes cannot start jobmanager after
>>>> upgrading FlinkDeployment. We need to recreate FlinkDeployment to fix
>>>> the problem. Has anyone else seen this issue?
>>>>
>>>> The following is a redacted log from the flink operator. After the status
>>>> becomes MISSING, it stays in MISSING for at least 15 minutes.
>>>>
>>>>
>>>> 2022-04-29 09:41:15,141 o.a.f.c.d.a.c.ApplicationClusterDeployer
>>>> [INFO ][namespace/flink-deployment-name] Submitting application in
>>>> 'Application Mode'.
>>>> 2022-04-29 09:41:15,145 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
>>>> ][namespace/flink-deployment-name] The derived from fraction jvm overhead
>>>> memory (2.400gb (2576980416 bytes)) is greater than its max value
>>>> 1024.000mb (1073741824 bytes), max value will be used instead
>>>> 2022-04-29 09:41:15,146 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
>>>> ][namespace/flink-deployment-name] The derived from fraction jvm overhead
>>>> memory (5.200gb (5583457568 bytes)) is greater than its max value
>>>> 1024.000mb (1073741824 bytes), max value will be used instead
>>>> 2022-04-29 09:41:15,146 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
>>>> ][namespace/flink-deployment-name] The derived from fraction network memory
>>>> (5.050gb (5422396292 bytes)) is greater than its max value 4.000gb
>>>> (4294967296 bytes), max value will be used instead
>>>> 2022-04-29 09:41:15,237 o.a.f.k.u.KubernetesUtils [INFO
>>>> ][namespace/flink-deployment-name] Kubernetes deployment requires a fixed
>>>> port. Configuration high-availability.jobmanager.port will be set to 6123
>>>> 2022-04-29 09:41:15,508 o.a.f.k.KubernetesClusterDescriptor [WARN
>>>> ][namespace/flink-deployment-name] Please note that Flink client
>>>> operations(e.g. cancel, list, stop, savepoint, etc.) won't work from
>>>> outside the Kubernetes cluster since 'kubernetes.rest-service.exposed.type'
>>>> has been set to ClusterIP.
>>>> 2022-04-29 09:41:15,508 o.a.f.k.KubernetesClusterDescriptor [INFO
>>>> ][namespace/flink-deployment-name] Create flink application cluster
>>>> flink-deployment-name successfully, JobManager Web Interface:
>>>> http://flink-deployment-name.namespace:8081
>>>> 2022-04-29 09:41:15,510 o.a.f.k.o.s.FlinkService [INFO
>>>> ][namespace/flink-deployment-name] Application cluster successfully
>>>> deployed
>>>> 2022-04-29 09:41:15,583 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:41:15,684 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:41:15,686 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>> 2022-04-29 09:41:15,792 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>> 2022-04-29 09:41:15,792 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:41:20,795 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:41:20,797 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>> 2022-04-29 09:41:20,896 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>> 2022-04-29 09:41:20,897 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:41:25,899 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:41:25,901 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>> 2022-04-29 09:41:25,997 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>> 2022-04-29 09:41:25,998 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:41:29,518 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:41:29,520 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>> 2022-04-29 09:41:30,631 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>> 2022-04-29 09:41:30,631 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:41:35,639 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:41:35,640 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>> 2022-04-29 09:41:35,756 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>> 2022-04-29 09:41:35,756 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:41:40,759 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:41:40,760 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>> 2022-04-29 09:41:40,864 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>> 2022-04-29 09:41:40,864 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:41:45,867 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:41:45,868 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>> 2022-04-29 09:41:45,870 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment port is ready,
>>>> waiting for the Flink REST API...
>>>> 2022-04-29 09:41:45,870 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:41:55,901 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:41:55,902 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYED_NOT_READY
>>>> 2022-04-29 09:41:55,902 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment is ready
>>>> 2022-04-29 09:41:55,902 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing job status
>>>> 2022-04-29 09:41:56,294 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] No job found on cluster yet
>>>> 2022-04-29 09:41:56,294 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:41:58,443 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:41:58,445 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing job status
>>>> 2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver
>>>> [ERROR][namespace/flink-deployment-name] Exception while listing jobs
>>>> 2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: READY
>>>> 2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment does not exist
>>>> 2022-04-29 09:42:10,490 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:42:25,521 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:42:25,522 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: MISSING
>>>> 2022-04-29 09:42:25,522 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment does not exist
>>>> 2022-04-29 09:42:25,522 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> 2022-04-29 09:42:40,526 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 09:42:40,527 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: MISSING
>>>> 2022-04-29 09:42:40,527 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment does not exist
>>>> 2022-04-29 09:42:40,527 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>> ...
>>>>
>>>> 2022-04-29 10:00:55,862 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>> 2022-04-29 10:00:55,863 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: MISSING
>>>> 2022-04-29 10:00:55,863 o.a.f.k.o.o.JobObserver [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment does not exist
>>>> 2022-04-29 10:00:55,863 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>
>>>>
>>>> [0] https://github.com/apache/flink-kubernetes-operator
>>>>
>>>>
>>>> --
>>>> ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
>>>> http://czchen.info/
>>>> Key fingerprint = BA04 346D C2E1 FE63 C790 8793 CC65 B0CD EC27 5D5B
>>>>
>>>
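
For anyone trying the restartNonce workaround mentioned in the thread, here is a minimal sketch of building the change as a JSON merge patch. It assumes the field lives at spec.restartNonce on the FlinkDeployment resource and that any changed value is enough to trigger a redeploy; the helper name restart_nonce_patch is hypothetical and not part of the operator.

```python
import json


def restart_nonce_patch(current_nonce=None):
    """Build a JSON merge patch that bumps spec.restartNonce by one.

    Assumption: the nonce is an integer under spec.restartNonce and the
    operator reacts to any change in its value.
    """
    next_nonce = 1 if current_nonce is None else current_nonce + 1
    return {"spec": {"restartNonce": next_nonce}}


if __name__ == "__main__":
    # No nonce set yet on the deployment, so start at 1.
    print(json.dumps(restart_nonce_patch()))
```

The printed JSON could then be applied with something like `kubectl patch flinkdeployment flink-deployment-name -n namespace --type merge -p '<json>'`, where the resource name and namespace are the redacted placeholders from the log above. Whether this recovers a MISSING deployment depends on the operator version.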