Hi,
> In addition to your comments, what are the items retained by 
> NetworkEnvironment? They grew seems like indefinitely, do they ever reduce?
> 

Mostly the network buffers, which should be OK. They are always recycled and 
should not be released until the network environment is garbage collected.

> I think there is a GC issue because my task manager is killed somehow after a 
> job run. The duration correlates to the volume of Kafka topics. More volume 
> TM dies quickly. Do you have any tips to debug it?
> 
What killed your task manager? For example, do you see a 
java.lang.OutOfMemoryError, or is the process killed by the OS's OOM killer? In 
the case of the OOM killer, you might need to grant the process more memory or 
reduce the memory that you have configured for Flink, so that it stays below the 
threshold at which the process is killed. What exactly do you mean by the 
"volume" of Kafka topics?

To debug, I suggest that you first figure out why the process is killed. Maybe 
your thresholds are simply too low and memory consumption can exceed them with 
your Flink configuration. Then you should figure out what is actually growing 
more than you expect: is the problem triggered by heap space or by native 
memory? Depending on the answer, heap dumps, for example, could help to spot the 
problematic objects.
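
As a rough starting point for the heap vs. native question, here is a minimal 
sketch that uses only the standard java.lang.management API. The class name 
MemorySnapshot and its methods are made up for illustration and are not part of 
Flink. It only reports on the JVM it runs in, so it would have to be called 
from inside the task manager process (or the same MXBeans read remotely over 
JMX). Native allocations that the JVM does not account for will not show up 
here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper (not a Flink class): summarizes JVM memory areas.
class MemorySnapshot {

    // Returns one line each for heap, non-heap, and every NIO buffer pool
    // (the "direct" pool covers direct ByteBuffers allocated off-heap).
    static String describe() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();

        StringBuilder sb = new StringBuilder();
        sb.append(String.format("heap used=%d MiB committed=%d MiB%n",
                toMiB(heap.getUsed()), toMiB(heap.getCommitted())));
        sb.append(String.format("non-heap used=%d MiB committed=%d MiB%n",
                toMiB(nonHeap.getUsed()), toMiB(nonHeap.getCommitted())));

        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            sb.append(String.format("buffer pool '%s' used=%d MiB capacity=%d MiB%n",
                    pool.getName(),
                    toMiB(pool.getMemoryUsed()),
                    toMiB(pool.getTotalCapacity())));
        }
        return sb.toString();
    }

    // Converts a byte count to whole mebibytes (rounding down).
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Prints a single snapshot; log it periodically to see which area grows.
        System.out.print(describe());
    }
}
```

If the heap line keeps growing until the process dies, a heap dump is the 
natural next step. If the heap stays flat while the process size reported by 
the OS grows, the problem is more likely on the native side.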

Best,
Stefan
