[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208246#comment-15208246 ]

Rohith Sharma K S commented on YARN-4852:
-----------------------------------------

To be more clear,
*Flow-1* : Each AM heartbeat or application submission tries to acquire the CS 
(CapacityScheduler) lock. In your cluster, the 93 concurrently running apps each 
send resource requests to the RM in their AM heartbeats, so that many AM 
heartbeats race to obtain the CS lock.
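
For illustration only, here is a minimal standalone sketch of Flow-1 (the class 
and method names are made up, not the actual YARN code): many AM heartbeat 
threads funnel into one synchronized scheduler call, so only one of them is 
served at a time.

{code:java}
// Standalone sketch of Flow-1 (hypothetical names, not real YARN classes):
// many AM heartbeat threads contend for one scheduler-wide lock.
public class AmHeartbeatContention {

    // Stand-in for the scheduler: allocate() is synchronized on the
    // scheduler instance, so only one AM heartbeat is handled at a time.
    static class ToyScheduler {
        synchronized void allocate(int appId) {
            // placeholder for handling the resource request under the CS lock
            System.out.println("allocate() for app " + appId);
        }
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        // 93 concurrently running apps, each heartbeating into allocate()
        for (int appId = 1; appId <= 93; appId++) {
            final int id = appId;
            new Thread(() -> scheduler.allocate(id), "am-heartbeat-" + id).start();
        }
    }
}
{code}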

*Flow-2* : On the other hand, the scheduler event processing thread dispatches 
events one by one, so at any point in time only one nodeUpdate event is being 
processed. That nodeUpdate event also tries to acquire the CS lock, so it is in 
the same race (from your thread dump, nodeUpdate had acquired the CS lock, as I 
mentioned in my previous comment).
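
Again as a rough standalone sketch (hypothetical names, not the real dispatcher 
code): a single thread drains the scheduler event queue, and each NODE_UPDATE 
handler needs the same scheduler-wide lock, so it waits behind the AM heartbeats.

{code:java}
// Standalone sketch of Flow-2 (hypothetical names, not real YARN classes):
// a single dispatcher thread takes scheduler events off a queue one by one,
// and each NODE_UPDATE handler needs the same scheduler-wide lock as Flow-1.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SchedulerEventDispatcherSketch {

    enum EventType { NODE_UPDATE, APP_ADDED }

    static final Object CS_LOCK = new Object();   // stand-in for the CS lock
    static final BlockingQueue<EventType> eventQueue = new LinkedBlockingQueue<>();

    public static void main(String[] args) {
        Thread dispatcher = new Thread(() -> {
            try {
                while (true) {
                    EventType event = eventQueue.take();   // one event at a time
                    if (event == EventType.NODE_UPDATE) {
                        synchronized (CS_LOCK) {
                            // container allocation for the node happens here;
                            // this blocks while AM heartbeats hold the lock
                        }
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "scheduler-event-dispatcher");
        dispatcher.setDaemon(true);
        dispatcher.start();

        eventQueue.add(EventType.NODE_UPDATE);    // a queued node heartbeat
    }
}
{code}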

Consider the worst case where the AM heartbeats always win the chance to acquire 
the CS lock; then the nodeUpdate call is delayed. As I said, the scheduler event 
processor handles events one by one, so the other node update events pile up. 
Note that the scheduler node status event is triggered from RMNodeImpl. A delay 
in scheduler event processing does not block the NodeManagers' heartbeats, so the 
NodeManagers keep sending node heartbeats and keep adding entries to 
RMNodeImpl#nodeUpdateQueue.
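
A minimal standalone sketch of that pile-up (hypothetical names, not the real 
RMNodeImpl code): heartbeats enqueue container updates unconditionally, while the 
queue is only drained when the scheduler eventually processes a node update, so a 
starved scheduler means unbounded growth.

{code:java}
// Standalone sketch of the pile-up (hypothetical names, not real YARN code):
// NM heartbeats keep enqueueing container updates, but the queue is only
// drained when the scheduler gets around to processing a nodeUpdate event.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

public class NodeUpdateQueueSketch {

    // stand-in for RMNodeImpl#nodeUpdateQueue holding UpdatedContainerInfo
    static final ConcurrentLinkedQueue<String> nodeUpdateQueue = new ConcurrentLinkedQueue<>();

    // called on every NM heartbeat; never blocked by scheduler lock contention
    static void onNodeHeartbeat(int heartbeatId) {
        nodeUpdateQueue.add("UpdatedContainerInfo#" + heartbeatId);
    }

    // called only when the scheduler processes a NODE_UPDATE event
    static List<String> pullContainerUpdates() {
        List<String> drained = new ArrayList<>();
        String info;
        while ((info = nodeUpdateQueue.poll()) != null) {
            drained.add(info);
        }
        return drained;
    }

    public static void main(String[] args) {
        // heartbeats arrive steadily; if pullContainerUpdates() is starved
        // (scheduler busy with AM heartbeats), the queue grows without bound
        for (int i = 0; i < 1_000; i++) {
            onNodeHeartbeat(i);
        }
        System.out.println("queued updates: " + nodeUpdateQueue.size());
    }
}
{code}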

> Resource Manager Ran Out of Memory
> ----------------------------------
>
>                 Key: YARN-4852
>                 URL: https://issues.apache.org/jira/browse/YARN-4852
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Gokul
>         Attachments: threadDump.log
>
>
> Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut 
> itself down. 
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% 
> of the memory. Digging deeper, there are around 0.5 million objects of 
> UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn 
> contain around 1.7 million objects each of YarnProtos$ContainerIdProto, 
> ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of 
> which retains around 1 GB of heap.
> Full GC was triggered multiple times when the RM went OOM and only 300 MB of 
> heap was released, so all these objects appear to be live.
> The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 
> 20 minutes and went OOM.
> There was no spike in job submissions or container counts at the time the issue 
> occurred. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
