[ 
https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207936#comment-15207936
 ] 

Gokul commented on YARN-4852:
-----------------------------

Hi [~jianhe],

The previous thread dump may look like we hit 
[YARN-3487|https://issues.apache.org/jira/browse/YARN-3487], but that one was 
extracted from the heap dump. I'm now attaching a thread dump taken the 
second time the issue occurred (and removing the one I attached earlier, which 
was incomplete). In it, the CS thread (Resource Manager Event Processor), which 
is supposed to consume the UpdatedContainerInfo queue, is not in a blocked 
state. Still, the queue filled up and the issue recurred. Any pointers here? 
One common observation is that there was a huge number of the log lines I 
mentioned above both times the issue occurred.
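
For a rough sense of scale, the heap dump figures from the description can be 
turned into a per-node backlog. This is only a back-of-the-envelope sketch: the 
function names are made up for illustration, the per-tick rates in the toy 
model are hypothetical, and the average assumes the queued entries are spread 
evenly across nodes, which the heap dump does not confirm.

```python
# Figures taken from the issue description (both are approximate).
RM_NODE_INSTANCES = 1200   # RMNodeImpl instances seen in the heap dump
QUEUED_UPDATES = 500_000   # UpdatedContainerInfo objects across all nodeUpdateQueues


def average_backlog_per_node(queued_updates: int, nodes: int) -> float:
    """Average number of unconsumed UpdatedContainerInfo entries per node.

    Assumes an even spread across nodes (an assumption, not a measurement).
    """
    return queued_updates / nodes


def simulate_backlog(produced_per_tick: int, consumed_per_tick: int, ticks: int) -> int:
    """Toy model of an unbounded queue, not the actual RM code path.

    Each tick adds `produced_per_tick` entries and then removes up to
    `consumed_per_tick`. Returns the number of entries left after `ticks`.
    """
    backlog = 0
    for _ in range(ticks):
        backlog += produced_per_tick
        backlog = max(0, backlog - consumed_per_tick)
    return backlog


if __name__ == "__main__":
    per_node = average_backlog_per_node(QUEUED_UPDATES, RM_NODE_INSTANCES)
    print(f"average queued updates per node: {per_node:.0f}")
    # Hypothetical rates: a consumer that is only slightly slower than the
    # producer still lets the backlog grow without bound.
    print("backlog after 1000 ticks:", simulate_backlog(3, 2, 1000))
```

The point of the toy model is that the consumer does not have to be blocked for 
the queue to grow; it only has to drain more slowly than the heartbeats fill it, 
which would be consistent with the thread dump showing the CS thread running.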

> Resource Manager Ran Out of Memory
> ----------------------------------
>
>                 Key: YARN-4852
>                 URL: https://issues.apache.org/jira/browse/YARN-4852
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Gokul
>
> Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut 
> itself down. 
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 
> 86% of memory. Digging deeper, there are around 0.5 million objects of 
> UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn 
> contain around 1.7 million objects of YarnProtos$ContainerIdProto, 
> ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each 
> of which retains around 1 GB of heap.
> Full GC was triggered multiple times when the RM went OOM, and only 300 MB of 
> heap was released, so all these objects appear to be live.
> The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB 
> within 20 minutes and went OOM.
> There was no spike in job submissions or container counts at the time the 
> issue occurred. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
