I am afraid we do not handle the scenario in which the JobManager deployment
is deleted externally.
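
For anyone hitting this before a fix lands, the restartNonce workaround mentioned in the quoted thread below could be sketched roughly as follows. This is only a minimal sketch, not the operator's own tooling: the helper names are hypothetical, the resource name and namespace are the redacted placeholders from the log, and it assumes that `spec.restartNonce` is set to a value different from the current one. The script only builds and prints a `kubectl patch` command; it does not run it.

```python
import json
import shlex
from typing import List


def restart_nonce_patch(nonce: int) -> str:
    """Return a JSON merge patch that sets spec.restartNonce (hypothetical helper)."""
    return json.dumps({"spec": {"restartNonce": nonce}}, separators=(",", ":"))


def kubectl_patch_command(name: str, namespace: str, nonce: int) -> List[str]:
    """Build, but do not run, a kubectl command that applies the patch."""
    return [
        "kubectl", "patch", "flinkdeployment", name,
        "-n", namespace,
        "--type", "merge",
        "-p", restart_nonce_patch(nonce),
    ]


if __name__ == "__main__":
    # Placeholder name and namespace taken from the redacted log in this thread.
    cmd = kubectl_patch_command("flink-deployment-name", "namespace", 1)
    # Print a shell-safe command line for manual review before running it.
    print(shlex.join(cmd))
```

Printing the command instead of executing it keeps the sketch side-effect free; whether the operator actually recovers the deployment this way is what FLINK-27468 tracks.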

Best,
Yang

On Mon, May 2, 2022 at 16:52, Őrhidi Mátyás <matyas.orh...@gmail.com> wrote:

> I filed a Jira for tracking this issue:
> https://issues.apache.org/jira/browse/FLINK-27468
>
> On Mon, May 2, 2022 at 10:31 AM Őrhidi Mátyás <matyas.orh...@gmail.com>
> wrote:
>
>> This can be reproduced simply by deleting the Kubernetes deployment. The
>> operator cannot recover from this state automatically, but defining a
>> restartNonce on the deployment should recover it.
>>
>> Regards,
>> Matyas
>>
>> On Mon, May 2, 2022 at 10:00 AM Márton Balassi <balassi.mar...@gmail.com>
>> wrote:
>>
>>> Hi ChangZhuo,
>>>
>>> Thanks for reporting this. I think I have just run into this myself too.
>>> I will try to reproduce it, but I do not fully understand it yet. If anyone
>>> has a way to reproduce it, that would be more than welcome. :-)
>>>
>>> On Fri, Apr 29, 2022 at 12:16 PM ChangZhuo Chen (陳昌倬) <czc...@czchen.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> We found that the Flink operator [0] sometimes cannot start the JobManager
>>>> after upgrading a FlinkDeployment. We need to recreate the FlinkDeployment
>>>> to fix the problem. Has anyone else run into this issue?
>>>>
>>>> The following is a redacted log from the Flink operator. After the status
>>>> becomes MISSING, it stays in MISSING for at least 15 minutes.
>>>>
>>>>
>>>>     2022-04-29 09:41:15,141 o.a.f.c.d.a.c.ApplicationClusterDeployer
>>>> [INFO ][namespace/flink-deployment-name] Submitting application in
>>>> 'Application Mode'.
>>>>     2022-04-29 09:41:15,145 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
>>>> ][namespace/flink-deployment-name] The derived from fraction jvm overhead
>>>> memory (2.400gb (2576980416 bytes)) is greater than its max value
>>>> 1024.000mb (1073741824 bytes), max value will be used instead
>>>>     2022-04-29 09:41:15,146 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
>>>> ][namespace/flink-deployment-name] The derived from fraction jvm overhead
>>>> memory (5.200gb (5583457568 bytes)) is greater than its max value
>>>> 1024.000mb (1073741824 bytes), max value will be used instead
>>>>     2022-04-29 09:41:15,146 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
>>>> ][namespace/flink-deployment-name] The derived from fraction network memory
>>>> (5.050gb (5422396292 bytes)) is greater than its max value 4.000gb
>>>> (4294967296 bytes), max value will be used instead
>>>>     2022-04-29 09:41:15,237 o.a.f.k.u.KubernetesUtils      [INFO
>>>> ][namespace/flink-deployment-name] Kubernetes deployment requires a fixed
>>>> port. Configuration high-availability.jobmanager.port will be set to 6123
>>>>     2022-04-29 09:41:15,508 o.a.f.k.KubernetesClusterDescriptor [WARN
>>>> ][namespace/flink-deployment-name] Please note that Flink client
>>>> operations(e.g. cancel, list, stop, savepoint, etc.) won't work from
>>>> outside the Kubernetes cluster since 'kubernetes.rest-service.exposed.type'
>>>> has been set to ClusterIP.
>>>>     2022-04-29 09:41:15,508 o.a.f.k.KubernetesClusterDescriptor [INFO
>>>> ][namespace/flink-deployment-name] Create flink application cluster
>>>> flink-deployment-name successfully, JobManager Web Interface:
>>>> http://flink-deployment-name.namespace:8081
>>>>     2022-04-29 09:41:15,510 o.a.f.k.o.s.FlinkService       [INFO
>>>> ][namespace/flink-deployment-name] Application cluster successfully 
>>>> deployed
>>>>     2022-04-29 09:41:15,583 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:41:15,684 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:41:15,686 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>>     2022-04-29 09:41:15,792 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>>     2022-04-29 09:41:15,792 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:41:20,795 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:41:20,797 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>>     2022-04-29 09:41:20,896 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>>     2022-04-29 09:41:20,897 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:41:25,899 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:41:25,901 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>>     2022-04-29 09:41:25,997 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>>     2022-04-29 09:41:25,998 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:41:29,518 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:41:29,520 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>>     2022-04-29 09:41:30,631 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>>     2022-04-29 09:41:30,631 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:41:35,639 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:41:35,640 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>>     2022-04-29 09:41:35,756 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>>     2022-04-29 09:41:35,756 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:41:40,759 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:41:40,760 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>>     2022-04-29 09:41:40,864 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager is being deployed
>>>>     2022-04-29 09:41:40,864 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:41:45,867 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:41:45,868 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYING
>>>>     2022-04-29 09:41:45,870 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment port is ready,
>>>> waiting for the Flink REST API...
>>>>     2022-04-29 09:41:45,870 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:41:55,901 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:41:55,902 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: DEPLOYED_NOT_READY
>>>>     2022-04-29 09:41:55,902 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment is ready
>>>>     2022-04-29 09:41:55,902 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing job status
>>>>     2022-04-29 09:41:56,294 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] No job found on cluster yet
>>>>     2022-04-29 09:41:56,294 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:41:58,443 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:41:58,445 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing job status
>>>>     2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver
>>>> [ERROR][namespace/flink-deployment-name] Exception while listing jobs
>>>>     2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: READY
>>>>     2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment does not exist
>>>>     2022-04-29 09:42:10,490 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:42:25,521 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:42:25,522 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: MISSING
>>>>     2022-04-29 09:42:25,522 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment does not exist
>>>>     2022-04-29 09:42:25,522 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     2022-04-29 09:42:40,526 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 09:42:40,527 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: MISSING
>>>>     2022-04-29 09:42:40,527 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment does not exist
>>>>     2022-04-29 09:42:40,527 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>     ...
>>>>
>>>>     2022-04-29 10:00:55,862 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Starting reconciliation
>>>>     2022-04-29 10:00:55,863 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] Observing JobManager deployment.
>>>> Previous status: MISSING
>>>>     2022-04-29 10:00:55,863 o.a.f.k.o.o.JobObserver        [INFO
>>>> ][namespace/flink-deployment-name] JobManager deployment does not exist
>>>>     2022-04-29 10:00:55,863 o.a.f.k.o.c.FlinkDeploymentController [INFO
>>>> ][namespace/flink-deployment-name] Reconciliation successfully completed
>>>>
>>>>
>>>> [0] https://github.com/apache/flink-kubernetes-operator
>>>>
>>>>
>>>> --
>>>> ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
>>>> http://czchen.info/
>>>> Key fingerprint = BA04 346D C2E1 FE63 C790  8793 CC65 B0CD EC27 5D5B
>>>>
>>>
