Hi,

> In addition to your comments, what are the items retained by
> NetworkEnvironment? They grew seems like indefinitely, do they ever reduce?

Mostly the network buffers, which should be ok. They are always recycled and should not be released until the network environment is GCed.

> I think there is a GC issue because my task manager is killed somehow after a
> job run. The duration correlates to the volume of Kafka topics. More volume
> TM dies quickly. Do you have any tips to debug it?

What killed your task manager? For example, do you see a java.lang.OutOfMemoryError, or is the process killed by the OS’s OOM killer? In the case of the OOM killer, you might need to grant the process more memory, or reduce the memory that you have configured for Flink so that it stays below the threshold at which the process gets killed. What exactly do you mean by "volume" of Kafka topics?

To debug, I suggest that you first figure out why the process is killed. Maybe your thresholds are simply too low and memory consumption can exceed them with your Flink configuration. Then you should figure out what is actually growing more than you expect, e.g. is the problem triggered by heap space or native memory? Depending on the answer, heap dumps, for example, could help to spot the problematic objects.

Best,
Stefan
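
P.S. In case it helps with the heap vs. native memory question, here is a minimal sketch that uses only the standard java.lang.management API. The class name MemoryProbe, the sample count and the interval are made up for illustration and are not part of Flink. It reports the memory of the JVM it runs in, so to say anything about the task manager it would have to run inside that JVM (or you read the same MXBeans remotely via JMX). It also only sees what the JVM tracks: memory allocated by native libraries does not show up here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not a Flink class: samples JVM memory usage.
class MemoryProbe {

    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    // Builds a one-line summary of heap, non-heap and NIO buffer pool usage.
    static String snapshot() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();

        // getMax() returns -1 when no maximum is defined; keep that visible.
        long heapMax = heap.getMax() < 0 ? -1 : toMiB(heap.getMax());

        StringBuilder sb = new StringBuilder(String.format(
                "heap used=%d MiB, committed=%d MiB, max=%d MiB; "
                        + "non-heap used=%d MiB, committed=%d MiB",
                toMiB(heap.getUsed()), toMiB(heap.getCommitted()), heapMax,
                toMiB(nonHeap.getUsed()), toMiB(nonHeap.getCommitted())));

        // NIO buffer pools (typically "direct" and "mapped") live outside the heap.
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            sb.append(String.format("; buffer pool '%s' used=%d MiB, count=%d",
                    pool.getName(), toMiB(pool.getMemoryUsed()), pool.getCount()));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        int samples = 3;          // assumption: a few samples for a quick look
        long intervalMs = 1000L;  // assumption: one sample per second
        for (int i = 0; i < samples; i++) {
            System.out.println(snapshot());
            Thread.sleep(intervalMs);
        }
    }
}
```

If the heap numbers keep climbing until the process dies, a heap dump is the next step; the JVM option -XX:+HeapDumpOnOutOfMemoryError writes one automatically on an OutOfMemoryError. If heap stays flat while the process size reported by the OS keeps growing, the growth is more likely in direct or native memory.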