[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206164#comment-15206164 ]

Gokul commented on YARN-4852:
-----------------------------

Hi [~rohithsharma],

* During the sudden heap size increase we didn't notice any NM going down or 
restarting. But once the heap reached its max we noticed a couple of NMs going 
down, which might be because the RM was already in an OOM state.
{quote}Any NM got restarted? If so how many and how many containers were 
running in each NM.?{quote}

* Not sure whether the CapacityScheduler was deadlocked, but the thread dump 
shows three IPC Handler threads waiting on *CapacityScheduler.getQueueInfo*, 
while the *Resource Manager Event Processor* thread was printing the log line 
"Null Container Completed". Millions of such log lines appear in the 
ResourceManager log around the time the issue occurred (see the sketch below 
for where this line is assumed to come from).
{quote}Was there RM heavily loaded or any deadlock in scheduler where most of 
the node heart beat was not processed by scheduler?{quote}
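
For reference, this log line is assumed to come from the scheduler's 
completed-container path, which drops completion events that no longer map to 
a live RMContainer. A minimal, self-contained sketch of that guard (not the 
actual Hadoop source; the class and field names here are simplified for 
illustration):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch (not verbatim Hadoop source) of the guard assumed to
// emit "Null Container Completed": a completion event whose container
// is no longer tracked is logged and dropped.
class CompletedContainerGuardSketch {
  private final Map<String, Object> liveContainers = new ConcurrentHashMap<>();

  void completedContainer(String containerId) {
    Object rmContainer = liveContainers.remove(containerId);
    if (rmContainer == null) {
      // Duplicate or stale completion: nothing to release, just log.
      // Millions of these lines would suggest the same completions are
      // being re-reported on every node heartbeat rather than acked once.
      System.out.println("Null Container Completed: containerId=" + containerId);
      return;
    }
    // ...normal container release and queue accounting would go here...
  }
}
{code}
If that is indeed the path, the flood of identical lines points at completions 
being re-delivered rather than at genuinely new container churn.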

* I will attach the thread dump shortly.
{quote}Do you have Jstack report for RM while memory is increasing?{quote}
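
For context on the heap dump findings in the description below: RMNodeImpl 
buffers per-heartbeat container updates in nodeUpdateQueue, and the scheduler 
is expected to drain that queue via pullContainerUpdates() when it processes 
the NODE_UPDATE event. A simplified sketch of that producer/consumer pattern 
(assumed from the 2.6.0 code; types and names are reduced for illustration):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Simplified sketch (not verbatim Hadoop source) of how RMNodeImpl is
// assumed to buffer UpdatedContainerInfo between node heartbeats.
class RMNodeSketch {
  // Producer side: each NM heartbeat enqueues its launched/completed
  // container info here.
  private final ConcurrentLinkedQueue<Object> nodeUpdateQueue =
      new ConcurrentLinkedQueue<>();

  void onHeartbeat(Object updatedContainerInfo) {
    nodeUpdateQueue.add(updatedContainerInfo);
  }

  // Consumer side: the scheduler drains the queue while handling
  // NODE_UPDATE. If the scheduler event thread stalls, this is never
  // called and every UpdatedContainerInfo stays strongly reachable,
  // which would match the ~0.5 million instances seen in the heap dump.
  List<Object> pullContainerUpdates() {
    List<Object> updates = new ArrayList<>();
    Object info;
    while ((info = nodeUpdateQueue.poll()) != null) {
      updates.add(info);
    }
    return updates;
  }
}
{code}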

> Resource Manager Ran Out of Memory
> ----------------------------------
>
>                 Key: YARN-4852
>                 URL: https://issues.apache.org/jira/browse/YARN-4852
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Gokul
>
> Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut 
> itself down. 
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 
> 86% of memory. Digging deeper, there are around 0.5 million objects of 
> UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn 
> contain around 1.7 million objects each of YarnProtos$ContainerIdProto, 
> ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, 
> each class retaining around 1 GB of heap.
> Full GC was triggered multiple times when the RM went OOM, but only about 
> 300 MB of heap was released, so all these objects appear to be live.
> The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB 
> within 20 minutes and went OOM.
> There was no spike in job submissions or container counts at the time the 
> issue occurred. 


