Hi,

I am trying to understand the following behavior in our Flink application
cluster. Any assistance would be appreciated.

We are running a Flink application cluster with 5 task managers, each with
the following configuration:

   - jobManagerMemory: 12g
   - taskManagerMemory: 20g
   - taskManagerMemoryHeapSize: 12g
   - taskManagerMemoryNetworkMax: 4g
   - taskManagerMemoryNetworkMin: 1g
   - taskManagerMemoryManagedSize: 50m
   - taskManagerMemoryOffHeapSize: 2g
   - taskManagerMemoryNetworkFraction: 0.2
   - taskManagerNetworkMemorySegmentSize: 4mb
   - taskManagerMemoryFloatingBuffersPerGate: 64
   - taskmanager.memory.jvm-overhead.min: 256mb
   - taskmanager.memory.jvm-overhead.max: 2g
   - taskmanager.memory.jvm-overhead.fraction: 0.1
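
For reference, the names above come from our deployment values file; I believe they correspond roughly to the standard Flink options below (the mapping is my assumption, so please correct me if it is off):

    jobmanager.memory.process.size: 12g
    taskmanager.memory.process.size: 20g
    taskmanager.memory.task.heap.size: 12g
    taskmanager.memory.network.max: 4g
    taskmanager.memory.network.min: 1g
    taskmanager.memory.managed.size: 50m
    taskmanager.memory.task.off-heap.size: 2g
    taskmanager.memory.network.fraction: 0.2
    taskmanager.memory.segment-size: 4mb
    taskmanager.network.memory.floating-buffers-per-gate: 64
    (plus the taskmanager.memory.jvm-overhead.* settings listed above)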

Our pipeline includes stateful transformations, and we are verifying that
we clear the state once it is no longer needed.
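
To illustrate what I mean by clearing state: the cleanup follows the usual keyed-state pattern of a per-key processing-time timer plus state.clear() in onTimer once the key has been idle. This is only a simplified sketch; the class, state, and timeout names are illustrative, not our actual code:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keeps a running total per key and clears all per-key state after one hour
// of inactivity. Our real transformations are more involved, but the
// clear() pattern is the same.
public class RunningTotalFn
        extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

    private static final long IDLE_TIMEOUT_MS = 60 * 60 * 1000L; // illustrative value

    private transient ValueState<Long> totalState;
    private transient ValueState<Long> cleanupTimerState;

    @Override
    public void open(Configuration parameters) {
        totalState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("total", Long.class));
        cleanupTimerState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("cleanupTimer", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> in, Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        Long total = totalState.value();
        totalState.update((total == null ? 0L : total) + in.f1);

        // Push the cleanup timer forward on every element for this key.
        Long pending = cleanupTimerState.value();
        if (pending != null) {
            ctx.timerService().deleteProcessingTimeTimer(pending);
        }
        long cleanupAt = ctx.timerService().currentProcessingTime() + IDLE_TIMEOUT_MS;
        ctx.timerService().registerProcessingTimeTimer(cleanupAt);
        cleanupTimerState.update(cleanupAt);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        Long total = totalState.value();
        if (total != null) {
            out.collect(Tuple2.of(ctx.getCurrentKey(), total));
        }
        // The key is no longer needed: release everything we keep for it.
        totalState.clear();
        cleanupTimerState.clear();
    }
}

Our real functions do the equivalent for every piece of keyed state they register.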

Through the Flink UI, we observe that the heap size increases and decreases
during the job lifecycle.

However, there is a noticeable delay between clearing the state and the
corresponding drop in heap usage, which I assume is related to garbage
collection frequency.

What is puzzling is the task manager pod memory usage: it increases
intermittently and is not released. We verified the various state metrics
and confirmed that they change according to our cleanup logic.

Additionally, if some state were never released, I would expect to see the
heap usage growing steadily as well.

Any insights or ideas?

Thanks,

Sigalit
