Alex Smirnov created FLINK-9132:
-----------------------------------

             Summary: Cluster runs out of task slots when a job falls into 
restart loop
                 Key: FLINK-9132
                 URL: https://issues.apache.org/jira/browse/FLINK-9132
             Project: Flink
          Issue Type: Bug
    Affects Versions: 1.4.2
         Environment: env.java.opts in flink-conf.yaml file:

env.java.opts: -Xloggc:/home/user/flink/log/flinkServer-gc.log  -verbose:gc 
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseG1GC 
-XX:MaxGCPauseMillis=150 -XX:InitiatingHeapOccupancyPercent=55 
-XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=2 -XX:-ResizePLAB 
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=100M
            Reporter: Alex Smirnov
         Attachments: FailedJob.java, jconsole-classes.png

If a job is restarting in a loop, the Task Manager hosting it goes down after 
some time. The Job Manager automatically assigns the job to another Task 
Manager, and the new Task Manager goes down as well. After some time, all 
Task Managers are gone and the cluster becomes paralyzed.

I attached jconsole to the TaskManager's Java process and noticed that the 
number of loaded classes increases dramatically when a job is in a restart 
loop and restores from a checkpoint.
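
For anyone who wants to track the same numbers without jconsole, the sketch 
below reads the JVM's class-loading counters through the standard 
java.lang.management API. It is only an illustration: the class name 
ClassLoadProbe and the one-second sampling interval are my own assumptions, and 
as written it reports on the JVM it runs in, so it would have to run inside the 
TaskManager process (or be adapted to a remote JMX connection) to observe the 
growth described above.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Flink: samples class-loading counters
// of the JVM it is running in.
final class ClassLoadProbe {

    // Returns one line with the current, cumulative and unloaded class counts.
    static String snapshot() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return String.format(
                "loaded=%d totalLoaded=%d unloaded=%d",
                bean.getLoadedClassCount(),       // classes currently loaded
                bean.getTotalLoadedClassCount(),  // classes loaded since JVM start
                bean.getUnloadedClassCount());    // classes unloaded since JVM start
    }

    public static void main(String[] args) throws InterruptedException {
        // Number of samples to print; defaults to 3 if no argument is given.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 3;
        for (int i = 0; i < samples; i++) {
            System.out.println(snapshot());
            Thread.sleep(1000L); // assumed sampling interval of one second
        }
    }
}
```

If the loaded count keeps climbing across restarts while the unloaded count 
stays flat, that would be consistent with classes from earlier job attempts 
never being released.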

See the attachment for the graph with G1GC enabled for the node. The standard 
GC performs even worse: the Task Manager shuts down within 20 minutes of the 
restart loop starting.

I've also attached a minimal program to reproduce the problem.

Please let me know if additional information is required from me.



