Hi Flink Community,

I am currently running flink-kubernetes-operator 1.6-patched (
https://github.com/apache/flink-kubernetes-operator/commit/3f0dc2ee5534084bc162e6deaded36e93bb5e384),
and I have 3 flink-kubernetes-operator pods running. Recently, I deployed
around 110 new FlinkDeployments, and the initial deployment went through
without issues. However, when I updated the container image on all 110 of
these FlinkDeployments concurrently, the flink-kubernetes-operator pods
appeared to conflict with each other constantly.

For example, before the spec change, FlinkDeployment rh-flinkdeployment-01
was RUNNING (status.jobStatus.state) and STABLE
(status.lifecycleState). After the FlinkDeployment spec is updated,
rh-flinkdeployment-01 goes through FINISHED (status.jobStatus.state) and
UPGRADING (status.lifecycleState), and then RECONCILING
(status.jobStatus.state) and DEPLOYED (status.lifecycleState). It reaches
RUNNING and STABLE again, but then, for no reason I can identify, it goes
back to FINISHED and UPGRADING, and the newly created jobmanager pod is
deleted and recreated. rh-flinkdeployment-01 is effectively stuck in a loop
where it becomes stable and is then redeployed by the operator.
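
To make the loop concrete, the transitions described above can be counted
with a small script. This is only a sketch: the function name
count_upgrade_cycles and the sample observations are hypothetical (they are
not taken from my cluster), and it assumes the (status.jobStatus.state,
status.lifecycleState) pairs have already been collected in order, for
example by polling the resource.

```python
from typing import Iterable, Tuple

# Each observation is (status.jobStatus.state, status.lifecycleState).
Observation = Tuple[str, str]


def count_upgrade_cycles(observations: Iterable[Observation]) -> int:
    """Count how often a deployment leaves RUNNING/STABLE for UPGRADING."""
    cycles = 0
    was_stable = False
    for job_state, lifecycle_state in observations:
        if job_state == "RUNNING" and lifecycle_state == "STABLE":
            was_stable = True
        elif lifecycle_state == "UPGRADING" and was_stable:
            # The deployment was stable and is being upgraded again.
            cycles += 1
            was_stable = False
    return cycles


# Illustrative sequence only: one spec change followed by one extra cycle.
sample = [
    ("RUNNING", "STABLE"),
    ("FINISHED", "UPGRADING"),
    ("RECONCILING", "DEPLOYED"),
    ("RUNNING", "STABLE"),
    ("FINISHED", "UPGRADING"),
    ("RECONCILING", "DEPLOYED"),
    ("RUNNING", "STABLE"),
]
cycles = count_upgrade_cycles(sample)
```

A single spec change should produce exactly one such cycle, so a count
greater than one for one change matches the behavior I am seeing.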

This doesn't happen to all 110 FlinkDeployments, but it happens to around
30 of them concurrently.

Below are some logs from one of the operator pods for one of the affected
FlinkDeployments. I have marked the messages that look suspicious to me
with asterisks. I will try to gather more logs and send them tomorrow.

For now, to mitigate this, I had to delete all of these FlinkDeployments
and run them with the deprecated GoogleCloudPlatform operator. I'm hoping
to resolve this soon so that I don't have to run anything on the
GoogleCloudPlatform operator anymore.

Thanks!
Tony


2023-11-02 05:26:02,132 i.j.o.p.e.ReconciliationDispatcher
[ERROR][<namespace>/<flinkdeployment>] Error during event processing
ExecutionScope{ resource id: ResourceID{name='<flinkdeployment>',
namespace='<namespace>'}, version: 17772349729} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException:
org.apache.flink.kubernetes.operator.exception.StatusConflictException:
Status have been modified externally in version 17772349851 Previous:
<REDACTED>
...
2023-11-02 05:27:25,945 o.a.f.k.o.o.d.ApplicationObserver [WARN
][<namespace>/<flinkdeployment>] *Running deployment generation -1 doesn't
match upgrade target generation 2.*
2023-11-02 05:27:25,946 o.a.f.c.Configuration          [WARN
][<namespace>/<flinkdeployment>] Config uses deprecated configuration key
'high-availability' instead of proper key 'high-availability.type'
2023-11-02 05:27:26,034 o.a.f.k.o.l.AuditUtils         [INFO
][<namespace>/<flinkdeployment>] >>> Status | Info    | UPGRADING       |
The resource is being upgraded
2023-11-02 05:27:26,057 o.a.f.k.o.l.AuditUtils         [INFO
][<namespace>/<flinkdeployment>] >>> Event  | Info    | SUBMIT          |
Starting deployment
2023-11-02 05:27:26,057 o.a.f.k.o.s.AbstractFlinkService [INFO
][<namespace>/<flinkdeployment>] Deploying application cluster requiring
last-state from HA metadata
2023-11-02 05:27:26,057 o.a.f.c.Configuration          [WARN
][<namespace>/<flinkdeployment>] Config uses deprecated configuration key
'high-availability' instead of proper key 'high-availability.type'
2023-11-02 05:27:26,084 o.a.f.c.Configuration          [WARN
][<namespace>/<flinkdeployment>] Config uses deprecated configuration key
'high-availability' instead of proper key 'high-availability.type'
2023-11-02 05:27:26,110 o.a.f.k.o.s.NativeFlinkService [INFO
][<namespace>/<flinkdeployment>] Deploying application cluster
2023-11-02 05:27:26,110 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO
][<namespace>/<flinkdeployment>] Submitting application in 'Application
Mode'.
2023-11-02 05:27:26,112 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
][<namespace>/<flinkdeployment>] The derived from fraction jvm overhead
memory (1.000gb (1073741840 bytes)) is greater than its max value
1024.000mb (1073741824 bytes), max value will be used instead
2023-11-02 05:27:26,112 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
][<namespace>/<flinkdeployment>] The derived from fraction jvm overhead
memory (1.000gb (1073741840 bytes)) is greater than its max value
1024.000mb (1073741824 bytes), max value will be used instead
2023-11-02 05:27:26,163 o.a.f.k.o.s.AbstractFlinkService [INFO
][<namespace>/<flinkdeployment>] Waiting for cluster shutdown... (30s)
2023-11-02 05:27:26,193 o.a.f.k.o.l.AuditUtils         [INFO
][<namespace>/<flinkdeployment>] >>> Event  | Warning |
*CLUSTERDEPLOYMENTEXCEPTION
| The Flink cluster <flinkdeployment> already exists.*
2023-11-02 05:27:26,193 o.a.f.k.o.r.ReconciliationUtils [WARN
][<namespace>/<flinkdeployment>] Attempt count: 0, last attempt: false
2023-11-02 05:27:26,277 o.a.f.k.o.l.AuditUtils         [INFO
][<namespace>/<flinkdeployment>] *>>> Status | Error   | UPGRADING       |
{"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException:
The Flink cluster <flinkdeployment> already
exists.","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"The
Flink cluster <flinkdeployment> already exists.","additionalMetadata":{}}]}*


-- 

<http://www.robinhood.com/>

Tony Chen

Software Engineer

Menlo Park, CA
