[ 
https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208172#comment-15208172
 ] 

Rohith Sharma K S commented on YARN-4852:
-----------------------------------------

Looking at your attached thread dump, I feel the root cause of your issue is 
YARN-3487. Maybe you can check whether it is recurring regularly.

From the thread dump,
I see that 8 threads are waiting for the CS lock, 7 of which are in 
{{CapacityScheduler.getQueueInfo}}, called while validating the resource 
request either during application submission (for the AM resource request) or 
for an AM heartbeat request. 
At this time, nodeUpdate is holding the CS lock. Processing the container 
statuses there can take a few milliseconds or more if many are reported.
{code}
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1190)
        - locked <0x00000005d4cfe5c8> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
        at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:951)
        - locked <0x00000005d4cfe5c8> (a 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
{code}
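
To make the contention concrete, here is a minimal sketch (simplified names and 
signatures, not the actual CapacityScheduler code) of the pattern visible in the 
dump: both entry points synchronize on the same scheduler instance, so a slow 
nodeUpdate serializes every queue-info lookup behind it.
{code}
import java.util.List;

// Minimal sketch (not the real CapacityScheduler) of the monitor contention in
// the thread dump: both methods lock the same scheduler instance, so a slow
// nodeUpdate() blocks every getQueueInfo() caller.
public class ToyScheduler {

  // NM heartbeat path: holds the scheduler monitor while every reported
  // container status is processed (completedContainer(), allocation, ...).
  public synchronized void nodeUpdate(List<String> containerStatuses) {
    for (String status : containerStatuses) {
      // per-container work happens here, still under the lock
    }
  }

  // App submission / AM heartbeat path: validates the requested queue.
  // Callers block here until nodeUpdate() releases the monitor.
  public synchronized String getQueueInfo(String queueName) {
    return "info for " + queueName;
  }
}
{code}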


In a larger cluster, if many ApplicationMasters are running concurrently and the 
application submission rate is very high, nodeUpdate can be blocked for a 
significant time waiting for the CS lock. The reason for the blocking is 
YARN-3487. And the more NodeManagers there are, the longer it takes to process 
each node update, which lets container statuses pile up internally and might be 
causing the OOM.
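
For illustration, a rough sketch of that pile-up mechanism (class and field names 
simplified from RMNodeImpl; this is not the actual YARN code): NM heartbeats keep 
enqueuing UpdatedContainerInfo objects, and they are only drained when the 
scheduler's nodeUpdate runs, so a blocked scheduler lets the queue, and everything 
it references, grow without bound.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative sketch of the accumulation pattern: heartbeats enqueue container
// updates; the queue is drained only from the scheduler's nodeUpdate() path.
// If nodeUpdate() is stuck waiting for the CS lock, entries (and the container
// statuses they retain) pile up -- matching the heap dump in this issue.
public class ToyRMNode {

  static class UpdatedContainerInfo {
    final List<String> newlyLaunched;
    final List<String> completed;
    UpdatedContainerInfo(List<String> launched, List<String> completed) {
      this.newlyLaunched = launched;
      this.completed = completed;
    }
  }

  private final ConcurrentLinkedQueue<UpdatedContainerInfo> nodeUpdateQueue =
      new ConcurrentLinkedQueue<>();

  // Producer side: called on every NM heartbeat.
  public void handleHeartbeat(List<String> launched, List<String> completed) {
    nodeUpdateQueue.add(new UpdatedContainerInfo(launched, completed));
  }

  // Consumer side: called from the scheduler's nodeUpdate(), i.e. only after
  // the CS lock is acquired. A blocked scheduler means no draining.
  public List<UpdatedContainerInfo> pullContainerUpdates() {
    List<UpdatedContainerInfo> latest = new ArrayList<>();
    UpdatedContainerInfo info;
    while ((info = nodeUpdateQueue.poll()) != null) {
      latest.add(info);
    }
    return latest;
  }
}
{code}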

Just for info: how many NodeManagers are there in the cluster? How many AMs run 
concurrently, and how many tasks per job? What is the job submission rate?


> Resource Manager Ran Out of Memory
> ----------------------------------
>
>                 Key: YARN-4852
>                 URL: https://issues.apache.org/jira/browse/YARN-4852
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Gokul
>         Attachments: threadDump.log
>
>
> Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut 
> itself down. 
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% 
> of the memory. Digging deeper, there are around 0.5 million objects of 
> UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn 
> contain around 1.7 million objects of YarnProtos$ContainerIdProto, 
> ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, each of 
> which retains around 1 GB of heap.
> Full GC was triggered multiple times when the RM went OOM, and only 300 MB of 
> heap was released, so all these objects look like live objects.
> The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 
> 20 minutes and went OOM.
> There was no spike in job submissions or container counts at the time the issue 
> occurred. 


