[ 
https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Umesh Prasad updated YARN-4852:
-------------------------------
    Description: 
The Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut itself 
down.

GC-related settings:
 
-XX:CMSInitiatingOccupancyFraction=75
-XX:+CMSParallelRemarkEnabled
-XX:InitialTenuringThreshold=1
-XX:+ManagementServer
-XX:InitialHeapSize=611042752
-XX:MaxHeapSize=8589934592
-XX:MaxNewSize=348966912
-XX:MaxTenuringThreshold=1
-XX:OldPLABSize=16
-XX:ParallelGCThreads=4
-XX:SurvivorRatio=8
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
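
A rough reading of the key sizing flags above (our interpretation, not taken from GC 
logs; the JVM's actual ergonomics may round these slightly):

  -XX:MaxHeapSize=8589934592 bytes  = 8 GiB
  -XX:MaxNewSize=348966912 bytes    ~ 333 MiB
  old generation capacity           ~ 8 GiB - 333 MiB ~ 7.7 GiB
  CMS initiating occupancy (75%)    ~ 5.8 GiB

With -XX:MaxTenuringThreshold=1, surviving objects are promoted to the old generation 
after one or two minor collections, so anything retained for more than a moment ends 
up in the old generation.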

Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% of 
the heap. Digging deeper, there are around 0.5 million UpdatedContainerInfo objects 
(from the nodeUpdateQueue inside RMNodeImpl). These in turn contain around 1.7 million 
objects of YarnProtos$ContainerIdProto, ContainerStatusProto, 
ApplicationAttemptIdProto, and ApplicationIdProto, each of which retains around 1 GB 
of heap.
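
To make the retention path concrete, below is a minimal sketch of the producer/consumer 
pattern we believe is involved. It is not the actual RMNodeImpl source: the class and 
field names only mirror the real ones, and ContainerStatusProto here is a stand-in for 
the generated protobuf types. The point is that node heartbeats enqueue 
UpdatedContainerInfo into nodeUpdateQueue and the scheduler is expected to drain it; 
if draining stops or falls behind, everything reachable from the queue stays live.

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Stand-in for the generated protobuf type; the real one carries container id,
// application attempt id, application id, diagnostics, etc.
class ContainerStatusProto { }

class UpdatedContainerInfo {
    final List<ContainerStatusProto> newlyLaunchedContainers;
    final List<ContainerStatusProto> completedContainers;

    UpdatedContainerInfo(List<ContainerStatusProto> launched,
                         List<ContainerStatusProto> completed) {
        this.newlyLaunchedContainers = launched;
        this.completedContainers = completed;
    }
}

class NodeUpdateModel {
    // One such queue per node in the RM; ~1200 nodes each holding a few hundred
    // undrained entries would account for the ~0.5 million UpdatedContainerInfo
    // objects seen in the heap dump.
    private final Queue<UpdatedContainerInfo> nodeUpdateQueue =
        new ConcurrentLinkedQueue<>();

    // Producer: every node heartbeat appends the container statuses it carried.
    void onStatusUpdate(List<ContainerStatusProto> launched,
                        List<ContainerStatusProto> completed) {
        nodeUpdateQueue.add(new UpdatedContainerInfo(launched, completed));
    }

    // Consumer: the scheduler drains the queue on node updates. If this is not
    // called, or cannot keep up with the heartbeat rate, every enqueued proto
    // stays strongly reachable from the node object and the old generation fills.
    List<UpdatedContainerInfo> pullContainerUpdates() {
        List<UpdatedContainerInfo> drained = new ArrayList<>();
        UpdatedContainerInfo info;
        while ((info = nodeUpdateQueue.poll()) != null) {
            drained.add(info);
        }
        return drained;
    }
}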

Back-to-back full GCs kept happening, but they were unable to recover any heap, and 
the RM went OOM. The JVM dumped the heap before quitting, and we analyzed that dump.

The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 
20 minutes and went OOM.

There was no spike in job submissions or container counts at the time the issue 
occurred.


  was:
The Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut itself 
down.

Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% of 
the heap. Digging deeper, there are around 0.5 million UpdatedContainerInfo objects 
(from the nodeUpdateQueue inside RMNodeImpl). These in turn contain around 1.7 million 
objects of YarnProtos$ContainerIdProto, ContainerStatusProto, 
ApplicationAttemptIdProto, and ApplicationIdProto, each of which retains around 1 GB 
of heap.

Full GC was triggered multiple times when the RM went OOM, and only 300 MB of heap 
was released, so all these objects appear to be live.

The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 
20 minutes and went OOM.

There was no spike in job submissions or container counts at the time the issue 
occurred.



> Resource Manager Ran Out of Memory
> ----------------------------------
>
>                 Key: YARN-4852
>                 URL: https://issues.apache.org/jira/browse/YARN-4852
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Gokul
>         Attachments: threadDump.log
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
