Flink Kubernetes Operator problems

Nikola Milutinovic Sat, 13 Sep 2025 01:34:00 -0700

Hi all.

We are seeing problems when we create a group of Flink Session Jobs at once. 
The Operator first runs into some timeouts, and most jobs go into the 
RECONCILING state. They do eventually reach the RUNNING state, but when we 
inspect the Flink UI, we can see that some of the jobs are duplicates. The K8s 
Operator seems to have mixed up the jobs.


So, for instance, we would see:


% kubectl get FlinkSessionJobs
NAME                                    JOB STATUS   LIFECYCLE STATE
cache-cdc-dynamic-ims-config-entities   RUNNING      STABLE
cache-cdc-equipment                     RUNNING      STABLE
cache-cdc-floc                          RUNNING      STABLE
cache-cdc-maintenance-entities          RUNNING      STABLE
cache-cdc-notification-entities         RUNNING      STABLE
cache-cdc-schedule                      RUNNING      STABLE
cache-cdc-static-entities               RUNNING      STABLE
cache-cdc-task-list-entities            RUNNING      STABLE
cache-cdc-work-order-entities           RUNNING      STABLE

But listing the jobs shows duplicates.


# flink list
------------------ Running/Restarting Jobs -------------------
10.09.2025 09:36:26 : d0f18a972db81a49763dfddc59eff21d : CACHE - CDC Static Entities (RUNNING)
10.09.2025 09:36:28 : b045747e9f83a5dfb99f306838d23346 : CACHE - CDC Equipment (RUNNING)
10.09.2025 09:36:42 : 57874e342a9386ad5c022d15da947508 : CACHE - CDC Notification entities (RUNNING)
10.09.2025 09:36:49 : 8dce7b2ffb701b619938ab4989f18aba : CACHE - CDC Maintenance entities (RUNNING)
10.09.2025 09:36:51 : bac0c76a4c48e207b065576f80750e8d : CACHE - CDC FLOC (RUNNING)
10.09.2025 09:37:21 : 4e065fa31fcd640c18b8d2f9d832a9ea : CACHE - CDC Maintenance entities (RUNNING)
10.09.2025 09:37:30 : 3ff775f7bda5b83d26b43ecfaddbf030 : CACHE - CDC Schedule (RUNNING)
10.09.2025 09:38:02 : cc7c25a0ed3c1dbf4ea28c9b31ae328d : CACHE - CDC Schedule (RUNNING)
10.09.2025 09:38:42 : 7ea9c0ff50499003e669e27d9d15767f : CACHE - CDC Work Order entities (RUNNING)
10.09.2025 09:39:07 : 119ef8358eebfbe7ff0fa7223ab2e22d : CACHE - CDC Work Order entities (RUNNING)
10.09.2025 09:39:17 : c00161bf832557f7afac66cd8d32b3ac : CACHE - CDC Work Order entities (RUNNING)
--------------------------------------------------------------
No scheduled jobs.

The job whose name is bolded is the correct one (the K8s resource and the Flink 
job match); the other two jobs are attached to the wrong K8s resources. For 
instance, FlinkSessionJob/cache-cdc-schedule references the correct parameters 
in its “spec” section, but its “status” section shows another job entirely. 
There is some sort of mix-up.
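
To make the duplicates easier to spot, here is a minimal Python sketch that groups the job IDs from `flink list` output by job name. It assumes each job is printed on a single line in the "date time : id : name (STATE)" form; the function name `find_duplicate_jobs` is only illustrative and not part of Flink or the Operator.

```python
import re
from collections import defaultdict

# One `flink list` entry: "<date> <time> : <32-hex job id> : <job name> (<STATE>)"
JOB_LINE = re.compile(
    r"^(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2}) : ([0-9a-f]{32}) : (.+) \((\w+)\)$"
)


def find_duplicate_jobs(flink_list_output: str) -> dict[str, list[str]]:
    """Return {job name: [job ids]} for every name that appears more than once."""
    ids_by_name = defaultdict(list)
    for line in flink_list_output.splitlines():
        match = JOB_LINE.match(line.strip())
        if match:  # separator and "No scheduled jobs." lines are skipped
            _, job_id, name, _ = match.groups()
            ids_by_name[name].append(job_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Three entries copied from the listing in this mail.
sample = """\
10.09.2025 09:36:51 : bac0c76a4c48e207b065576f80750e8d : CACHE - CDC FLOC (RUNNING)
10.09.2025 09:37:30 : 3ff775f7bda5b83d26b43ecfaddbf030 : CACHE - CDC Schedule (RUNNING)
10.09.2025 09:38:02 : cc7c25a0ed3c1dbf4ea28c9b31ae328d : CACHE - CDC Schedule (RUNNING)
"""
print(find_duplicate_jobs(sample))
```

For the three sample lines, only "CACHE - CDC Schedule" is reported, with both of its job IDs.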

What can we do to at least mitigate this? It seems that creating K8s resources 
one-by-one is OK.
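
As a rough sketch of that one-by-one workaround, the Python below applies each manifest in turn and waits until the resource reports RUNNING before moving on. It assumes `kubectl` is on the PATH and pointed at the right namespace, and that `.status.jobStatus.state` is the field behind the JOB STATUS column; the manifest file names, the timeout, and the polling interval are hypothetical.

```python
import subprocess
import time


def kubectl(*args: str) -> str:
    """Run kubectl and return its stdout (raises if kubectl exits non-zero)."""
    result = subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def job_state(name: str, run=kubectl) -> str:
    # Assumption: this status field backs the JOB STATUS column of `kubectl get`.
    return run(
        "get", "flinksessionjob", name, "-o", "jsonpath={.status.jobStatus.state}"
    )


def create_one_by_one(manifests, run=kubectl, timeout_s=300.0, poll_s=5.0,
                      sleep=time.sleep):
    """Apply each {resource name: manifest path} pair in order.

    The next manifest is applied only after the previous job reports RUNNING.
    """
    for name, path in manifests.items():
        run("apply", "-f", path)
        waited = 0.0
        while job_state(name, run) != "RUNNING":
            if waited >= timeout_s:
                raise TimeoutError(
                    f"{name} did not reach RUNNING within {timeout_s}s"
                )
            sleep(poll_s)
            waited += poll_s
```

It could be called as `create_one_by_one({"cache-cdc-schedule": "cache-cdc-schedule.yaml"})`, with one entry per FlinkSessionJob. This only serializes creation on the client side; it does not change how the Operator reconciles.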
Can we set the operator config option 
“kubernetes.operator.reconcile.parallelism” to 1 to force one-by-one launching?

Nikola.
