[ 
https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209520#comment-15209520
 ] 

Sharad Agarwal commented on YARN-4852:
--------------------------------------

[~rohithsharma] the slowness in the schedulers still does not explain the 
build-up of UpdatedContainerInfo to 0.5 million objects in a short span. 
UpdatedContainerInfo should only be created for newly launched or completed 
containers. 
Looking at the code in RMNodeImpl.StatusUpdateWhenHealthyTransition (branch 
2.6.0):
{code}
  // Process running containers
  if (remoteContainer.getState() == ContainerState.RUNNING) {
    if (!rmNode.launchedContainers.contains(containerId)) {
      // Just launched container. RM knows about it the first time.
      rmNode.launchedContainers.add(containerId);
      newlyLaunchedContainers.add(remoteContainer);
    }
  } else {
    // A finished container
    rmNode.launchedContainers.remove(containerId);
    completedContainers.add(remoteContainer);
  }
} // end of the loop over the reported container statuses (header not shown in this excerpt)
if (newlyLaunchedContainers.size() != 0
    || completedContainers.size() != 0) {
  rmNode.nodeUpdateQueue.add(
      new UpdatedContainerInfo(newlyLaunchedContainers, completedContainers));
}
{code}

The UpdatedContainerInfo above seems to be created every time the reported 
container statuses include a completed container (there is no check for whether 
one was already created for it on a previous update). Wouldn't this lead to a 
lot of duplicate UpdatedContainerInfo objects, putting unnecessary stress on 
the scheduler?
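
To illustrate the concern, here is a minimal, self-contained sketch of the kind 
of guard that would avoid queuing the same completed container more than once. 
It is not the actual RMNodeImpl code: the classes HeartbeatDedupSketch, Status 
and Update, the method onHeartbeat, and the reportedCompleted set are all 
hypothetical stand-ins, and it assumes the node keeps reporting a finished 
container on several consecutive heartbeats.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model of a node's heartbeat handling (not Hadoop code). */
class HeartbeatDedupSketch {
    enum State { RUNNING, COMPLETE }

    /** Simplified stand-in for a reported container status. */
    static final class Status {
        final String containerId;
        final State state;

        Status(String containerId, State state) {
            this.containerId = containerId;
            this.state = state;
        }
    }

    /** Simplified stand-in for UpdatedContainerInfo. */
    static final class Update {
        final List<Status> launched;
        final List<Status> completed;

        Update(List<Status> launched, List<Status> completed) {
            this.launched = launched;
            this.completed = completed;
        }
    }

    private final Set<String> launchedContainers = new HashSet<>();
    // Assumption: remember which completed containers were already queued.
    private final Set<String> reportedCompleted = new HashSet<>();
    final List<Update> nodeUpdateQueue = new ArrayList<>();

    void onHeartbeat(List<Status> statuses) {
        List<Status> newlyLaunched = new ArrayList<>();
        List<Status> completed = new ArrayList<>();
        for (Status s : statuses) {
            if (s.state == State.RUNNING) {
                // Set.add returns true only the first time the id is seen.
                if (launchedContainers.add(s.containerId)) {
                    newlyLaunched.add(s);
                }
            } else {
                launchedContainers.remove(s.containerId);
                // Queue a completion only once, even if it is reported again.
                if (reportedCompleted.add(s.containerId)) {
                    completed.add(s);
                }
            }
        }
        if (!newlyLaunched.isEmpty() || !completed.isEmpty()) {
            nodeUpdateQueue.add(new Update(newlyLaunched, completed));
        }
    }
}
```

With this guard, three identical heartbeats that each report the same finished 
container leave one entry in nodeUpdateQueue instead of three. In this sketch 
reportedCompleted grows without bound; a real change would also have to drop an 
id once the node stops reporting that container.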


> Resource Manager Ran Out of Memory
> ----------------------------------
>
>                 Key: YARN-4852
>                 URL: https://issues.apache.org/jira/browse/YARN-4852
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Gokul
>         Attachments: threadDump.log
>
>
> Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut 
> itself down. 
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 
> 86% of memory. Digging deeper, there are around 0.5 million 
> UpdatedContainerInfo objects (nodeUpdateQueue inside RMNodeImpl). These in 
> turn contain around 1.7 million objects of YarnProtos$ContainerIdProto, 
> ContainerStatusProto, ApplicationAttemptIdProto, ApplicationIdProto, each of 
> which retains around 1 GB of heap.
> Back-to-back full GCs kept happening. GC wasn't able to recover any heap 
> and the RM went OOM. The JVM dumped the heap before quitting, and we analyzed 
> it. RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB 
> within 20 minutes and went OOM.
> There was no spike in job submissions or container counts at the time of the 
> issue. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
