Hi,

If the TM is not responding, check the TM logs for long gaps. There are three main reasons for such gaps:
1. Machine is swapping - set up/configure your machine and processes so that the machine never swaps (best to disable swap altogether).
2. Long GC full stops - look into how to analyse those, either by printing GC logs or by attaching a profiler to the JVM (a config sketch for enabling GC logs follows at the bottom of this mail).
3. Network issues - but these usually shouldn't cause gaps in the logs.

Piotrek

> On 16 Nov 2017, at 17:48, Hao Sun <ha...@zendesk.com> wrote:
>
> Sorry, by "killed" I mean here that the JM lost the TM. The TM instance is still
> running inside Kubernetes, but it is not responding to any requests, probably
> due to high load. And on the JM side, the JM lost heartbeat tracking of the TM, so
> it marked the TM as dead.
>
> By the „volume“ of Kafka topics I mean the volume of messages for a topic,
> e.g. 10000 msg/sec; I have not checked the size of the messages yet.
> But overall, as you suggested, I think I need more tuning of my TM params
> so it can maintain a reasonable load. I am not sure which params to look for,
> but I will do my research first.
>
> Always thanks for your help, Stefan.
>
> On Thu, Nov 16, 2017 at 8:27 AM Stefan Richter <s.rich...@data-artisans.com
> <mailto:s.rich...@data-artisans.com>> wrote:
> Hi,
>
>> In addition to your comments, what are the items retained by
>> NetworkEnvironment? They seem to grow indefinitely; do they ever shrink?
>>
>
> Mostly the network buffers, which should be ok. They are always recycled and
> should not be released until the network environment is GCed.
>
>> I think there is a GC issue because my task manager is killed somehow after
>> a job run. The duration correlates with the volume of the Kafka topics: the more
>> volume, the more quickly the TM dies. Do you have any tips to debug it?
>>
>
> What killed your task manager? For example, do you see a
> java.lang.OutOfMemoryError, or is the process killed by the OS’s OOM killer?
> In the case of the OOM killer, you might need to grant more process memory or
> reduce the memory that you have configured for Flink, to stay below the
> configured threshold that would kill the process. What exactly do you mean by
> „volume“ of Kafka topics?
>
> To debug, I suggest that you first figure out why the process is killed;
> maybe your thresholds are simply too low and the consumption can go beyond
> them with your configuration of Flink. Then you should figure out what is actually
> growing more than you expect, e.g. is the problem triggered by heap space or
> native memory? Depending on the answer, heap dumps, for example, could help to spot
> the problematic objects.
>
> Best,
> Stefan
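P.S. Regarding point 2: one way to get GC logs out of the TaskManagers is to pass the usual JVM GC-logging flags via env.java.opts.taskmanager in flink-conf.yaml (or env.java.opts if your Flink version does not have the per-process key). A minimal sketch, assuming a Java 8 JVM; the log path is just an example, adjust it for your deployment:

    # flink-conf.yaml
    env.java.opts.taskmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/flink/taskmanager-gc.log

Long "Total time for which application threads were stopped" entries in that log are a good indicator of the full stops that would explain heartbeat timeouts.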
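P.P.S. For Stefan's question about what actually killed the process (and to check point 1), a rough sketch of diagnostics to run on the TM host; the pod name, PID and file paths below are placeholders:

    # Did the kernel OOM killer terminate the JVM?
    dmesg | grep -i -E "killed process|out of memory"

    # In Kubernetes, an OOM-killed container is also visible in the pod status:
    kubectl describe pod <taskmanager-pod> | grep -i -A 2 "last state"

    # Is the machine swapping? (point 1 above)
    free -m
    sudo swapoff -a   # disable swap altogether, as suggested

    # Take a heap dump of a live TaskManager JVM, to inspect with e.g. Eclipse MAT
    jmap -dump:live,format=b,file=/tmp/tm-heap.hprof <taskmanager-pid>

If nothing in dmesg or the pod status points to the OS, the gap is more likely a long GC pause or heavy swapping rather than an actual kill.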