Nawaid Shamim created FLINK-11205:
-------------------------------------

             Summary: Task Manager Metaspace Memory Leak 
                 Key: FLINK-11205
                 URL: https://issues.apache.org/jira/browse/FLINK-11205
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.7.0, 1.6.2, 1.5.5
            Reporter: Nawaid Shamim
         Attachments: Screenshot 2018-12-18 at 12.14.11.png

Job restarts cause the task manager to dynamically load duplicate classes. 
Metaspace is unbounded and grows with every restart. YARN aggressively kills 
such containers, but the effect is then immediately seen on a different task 
manager, which results in a death spiral.
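
One way to confirm this kind of growth from inside a JVM is to read the 
Metaspace memory pool and the class-loading counters through the standard 
{{java.lang.management}} API. The following is only a minimal sketch, not part 
of Flink: the class name {{MetaspaceProbe}} is hypothetical, and it assumes a 
HotSpot JVM, where the pool is named "Metaspace".

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.OptionalLong;

// Hypothetical helper for illustration; not a Flink class.
public class MetaspaceProbe {

    /** Used Metaspace in bytes, or empty if this JVM reports no pool named "Metaspace". */
    public static OptionalLong metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot naming. Other JVMs may use different pool names.
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
                if (usage != null) {
                    return OptionalLong.of(usage.getUsed());
                }
            }
        }
        return OptionalLong.empty();
    }

    /** Number of classes currently loaded in this JVM. */
    public static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Number of classes unloaded since this JVM started. */
    public static long unloadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getUnloadedClassCount();
    }

    public static void main(String[] args) {
        OptionalLong used = metaspaceUsedBytes();
        if (used.isPresent()) {
            System.out.println("Metaspace used (bytes): " + used.getAsLong());
        } else {
            System.out.println("No pool named Metaspace reported by this JVM");
        }
        System.out.println("Classes loaded:   " + loadedClassCount());
        System.out.println("Classes unloaded: " + unloadedClassCount());
    }
}
```

If these values were sampled after each job restart, a loaded-class count that 
keeps rising while the unloaded count stays flat would be consistent with user 
code classloaders not being garbage collected.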

!Screenshot 2018-12-18 at 12.14.11.png!width=480!

The Task Manager uses dynamic classloading as described in 
[https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html]
{quote}
*YARN*

YARN classloading differs between single job deployments and sessions:
 * When submitting a Flink job/application directly to YARN (via {{bin/flink 
run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are started 
for that job. Those JVMs have both Flink framework classes and user code 
classes in the Java classpath. That means that there is _no dynamic 
classloading_ involved in that case.

 * When starting a YARN session, the JobManagers and TaskManagers are started 
with the Flink framework classes in the classpath. The classes from all jobs 
that are submitted against the session are loaded dynamically.
{quote}

The above is not entirely true, especially when you set {{-yD 
classloader.resolve-order=parent-first}}. We also observed the above 
behaviour when submitting a Flink job/application directly to YARN (via 
{{bin/flink run -m yarn-cluster ...}}).




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
