[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208204#comment-15208204 ]

Gokul commented on YARN-4852:
-----------------------------

Agreed, 7 threads are waiting to lock CapacityScheduler.getQueueInfo. What is 
the impact of that many threads waiting on this lock during the application 
submission phase? Could that be the cause of RMNodeImpl.nodeUpdateQueue piling 
up? If so, YARN-3487 should fix the issue. Otherwise there must be some other 
reason, e.g. the consumer of that queue (RMNodeImpl.nodeUpdateQueue), which is 
the ResourceManager Event Processor thread, is stuck on something and is not 
draining it.
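
To make that last scenario concrete, here is a minimal, self-contained sketch 
of the pattern. It is NOT the actual RMNodeImpl/scheduler code, just a model of 
an unbounded ConcurrentLinkedQueue whose single consumer stops draining: the 
producer (heartbeats) never changes its rate, yet the queue, and therefore the 
heap, keeps growing.

{code:java}
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone model, not the actual YARN code. It only shows how
// an unbounded producer/consumer queue grows once the single consumer thread
// stops (or slows down) draining it.
public class NodeUpdateQueueModel {

    // Stand-in for RMNodeImpl.nodeUpdateQueue: unbounded, so nothing ever
    // pushes back on the producer.
    private final ConcurrentLinkedQueue<String> nodeUpdateQueue = new ConcurrentLinkedQueue<>();

    // Producer side: models an NM heartbeat enqueueing a container update.
    void onHeartbeat(int heartbeatId) {
        nodeUpdateQueue.offer("UpdatedContainerInfo-" + heartbeatId);
    }

    // Consumer side: models the scheduler pulling all accumulated updates.
    int pullContainerUpdates() {
        int drained = 0;
        while (nodeUpdateQueue.poll() != null) {
            drained++;
        }
        return drained;
    }

    public static void main(String[] args) throws InterruptedException {
        NodeUpdateQueueModel node = new NodeUpdateQueueModel();

        // Producer: keeps enqueueing one update per millisecond.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 20_000; i++) {
                node.onHeartbeat(i);
                try { TimeUnit.MILLISECONDS.sleep(1); } catch (InterruptedException e) { return; }
            }
        });

        // Consumer: drains normally for 2 seconds, then "gets stuck" and never
        // drains again, e.g. the single dispatcher thread is busy elsewhere.
        Thread consumer = new Thread(() -> {
            long start = System.currentTimeMillis();
            while (System.currentTimeMillis() - start < 2_000) {
                node.pullContainerUpdates();
                try { TimeUnit.MILLISECONDS.sleep(100); } catch (InterruptedException e) { return; }
            }
        });

        producer.start();
        consumer.start();
        consumer.join();

        // Once the consumer is gone, the queue size only moves in one direction.
        while (producer.isAlive()) {
            System.out.println("nodeUpdateQueue size = " + node.nodeUpdateQueue.size());
            TimeUnit.SECONDS.sleep(2);
        }
    }
}
{code}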

Also, the thread that does nodeUpdate (the ResourceManager Event Processor) is 
not in a blocked state; it is still runnable.

There are around 1200 NMs in the cluster, and 93 apps were running when the 
issue occurred. 17803 containers were allocated and 63422 were pending. The job 
submission rate was roughly 6 per minute.
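
As a rough sanity check on those numbers, assuming the default NM heartbeat 
interval of 1000 ms (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms) 
and on the order of one queued update per heartbeat per node (both are 
assumptions, not measurements):

  1200 NMs x ~1 heartbeat/sec ≈ 1200 queued entries/sec if nothing is drained
  0.5 million entries / 1200 entries/sec ≈ 420 sec ≈ 7 minutes

So the dispatcher only needs to stop (or nearly stop) draining for a few 
minutes, well within the reported 20-minute spike window, to account for 
roughly 0.5 million UpdatedContainerInfo objects without any spike in job 
submissions or container counts.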

> Resource Manager Ran Out of Memory
> ----------------------------------
>
>                 Key: YARN-4852
>                 URL: https://issues.apache.org/jira/browse/YARN-4852
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Gokul
>         Attachments: threadDump.log
>
>
> Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut 
> itself down. 
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 
> 86% of memory. Digging deeper, there are around 0.5 million objects of 
> UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn 
> contain around 1.7 million objects each of YarnProtos$ContainerIdProto, 
> ContainerStatusProto, ApplicationAttemptIdProto and ApplicationIdProto, each 
> of which retains around 1 GB of heap.
> Full GC was triggered multiple times when the RM went OOM, and only 300 MB of 
> heap was released, so all of these objects appear to be live.
> The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB 
> within 20 minutes and went OOM.
> There was no spike in job submissions or container counts at the time the 
> issue occurred. 


