Hello,

We have a Flink session cluster deployed in Kubernetes with around 100
TaskManagers. It currently runs around 20-30 Kafka source jobs. All the
jobs use the same jar and differ only in the SQL query and the UDFs they
use. We are using the official flink:1.14.3 image.

We observed that one specific TaskManager has been doing much more garbage
collection than the others. So much, in fact, that at a specific hour of
the day it pauses execution to do GC, which causes a huge consumer lag to
build up. By garbage collection, I mean GC of the Young Generation; the
Old Generation GC looks fine.
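(For context, the numbers we are comparing are the standard per-collector
JVM counters, the same ones Flink surfaces as
Status.JVM.GarbageCollector.<Collector>.Count/Time. A minimal, illustrative
way to read them from inside a TaskManager JVM would be something like the
sketch below; the "G1 Young Generation" / "G1 Old Generation" collector
names are an assumption based on the default G1 collector, not our exact
tooling.)

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcSnapshot {
        public static void main(String[] args) {
            // Print cumulative collection count and time for each collector,
            // e.g. "G1 Young Generation" and "G1 Old Generation" under G1.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: count=%d, time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }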

We checked our other running Flink clusters and found that most of them
show the same behaviour: there are always 2-3 TaskManagers which seem to be
doing noticeably more GC than the others.

Is this a known issue? Our clusters run long-running Kafka source to Kafka
sink jobs, so we wanted to know if this can happen because of that.

Would appreciate any kind of guidance.
-- 
Regards,
Meghajit
