Ruibin Xing created FLINK-33011:
-----------------------------------

             Summary: Operator deletes HA data unexpectedly
                 Key: FLINK-33011
                 URL: https://issues.apache.org/jira/browse/FLINK-33011
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.6.0, 1.17.1
         Environment: Flink: 1.17.1
                      Flink Kubernetes Operator: 1.6.0
            Reporter: Ruibin Xing
         Attachments: flink_operator_logs_0831.csv

We encountered a problem where the operator unexpectedly deleted HA data.

The timeline is as follows:

12:08 We submitted the first spec, which suspended the job with savepoint 
upgrade mode.

12:08 The job was suspended and the HA data was preserved. The log showed the 
observed job deployment status as MISSING.

12:10 We submitted the second spec, which deployed the job with the last state 
upgrade mode.

12:10 Logs showed the operator deleted both the Flink deployment and the HA 
data again.

12:10 The job failed to start because the HA data was missing.

According to the log, the deletion was triggered by 
https://github.com/apache/flink-kubernetes-operator/blob/a728ba768e20236184e2b9e9e45163304b8b196c/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/ApplicationReconciler.java#L168

I think this would only be triggered if the job deployment status wasn't 
MISSING. But the log before the deletion showed the observed job status was 
MISSING at that moment.
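
To make the expectation above concrete, here is a minimal sketch of the guard as I understand it. It is a hypothetical model, not the operator's actual code: the class name, the method name and the reduced set of states are assumptions made for illustration, and only the MISSING state is taken from the logs.

```java
// Hypothetical model of the expected guard; NOT the operator's real implementation.
final class HaCleanupGuard {

    // Illustrative subset of deployment states; only MISSING appears in the report.
    enum Status { READY, DEPLOYING, MISSING }

    /**
     * Expected behaviour per this report: deleting the deployment together
     * with the HA metadata should happen only while a JobManager deployment
     * is still observed, i.e. when the observed status is not MISSING.
     */
    static boolean shouldDeleteDeploymentAndHaData(Status observed) {
        return observed != Status.MISSING;
    }

    public static void main(String[] args) {
        // Print the decision for each modelled state.
        for (Status s : Status.values()) {
            System.out.println(s + " -> delete: " + shouldDeleteDeploymentAndHaData(s));
        }
    }
}
```

Under this model a status of MISSING should never lead to the "Deleting JobManager deployment and HA metadata" step, which is why the logged behaviour looks wrong to me.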

Related logs:

{code:java}
2023-08-30 12:08:48.190 +0000 o.a.f.k.o.s.AbstractFlinkService [INFO ][default/pipeline-pipeline-se-3] Cluster shutdown completed.
2023-08-30 12:10:27.010 +0000 o.a.f.k.o.o.d.ApplicationObserver [INFO ][default/pipeline-pipeline-se-3] Observing JobManager deployment. Previous status: MISSING
2023-08-30 12:10:27.533 +0000 o.a.f.k.o.l.AuditUtils         [INFO ][default/pipeline-pipeline-se-3] >>> Event  | Info    | SPECCHANGED     | UPGRADE change(s) detected (Diff: FlinkDeploymentSpec[image : docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:0835137c-362 -> docker-registry.randomcompany.com/octopus/pipeline-pipeline-online:23db7ae8-365, podTemplate.metadata.labels.app.kubernetes.io~1version : 0835137cd803b7258695eb53a6ec520cb62a48a7 -> 23db7ae84bdab8d91fa527fe2f8f2fce292d0abc, job.state : suspended -> running, job.upgradeMode : last-state -> savepoint, restartNonce : 1545 -> 1547]), starting reconciliation.
2023-08-30 12:10:27.679 +0000 o.a.f.k.o.s.NativeFlinkService [INFO ][default/pipeline-pipeline-se-3] Deleting JobManager deployment and HA metadata.
{code}
A more complete log file is attached. Thanks.


