[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15633063#comment-15633063 ] Rohith Sharma K S commented on YARN-4852:

[~sharadag] [~slukog] As per the analysis in the earlier discussion, YARN-4862 was identified as the reason for the OOM. Can this JIRA be resolved now that the YARN-4862 patch has been committed? Does YARN-4862 need to be backported to Hadoop 2.6?

> Resource Manager Ran Out of Memory
> ----------------------------------
>
> Key: YARN-4852
> URL: https://issues.apache.org/jira/browse/YARN-4852
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: Gokul
> Attachments: threadDump.log
>
> Resource Manager went out of memory (max heap size: 8 GB, CMS GC) and shut itself down.
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% of memory. Digging deeper, there are around 0.5 million objects of UpdatedContainerInfo (nodeUpdateQueue inside RMNodeImpl). These in turn contain around 1.7 million objects each of YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto, and ApplicationIdProto, which retain around 1 GB of heap.
> Back-to-back full GCs kept happening. The GC wasn't able to recover any heap, and the RM went OOM. The JVM dumped the heap before quitting, and we analyzed the heap.
> The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 20 minutes and went OOM.
> There was no spike in job submissions or container numbers at the time the issue occurred.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209726#comment-15209726 ] Rohith Sharma K S commented on YARN-4852:

Raised YARN-4862 to handle the duplicated container status check.
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209715#comment-15209715 ] Rohith Sharma K S commented on YARN-4852:

I will raise a new ticket for this. Thanks :-)
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209689#comment-15209689 ] Sharad Agarwal commented on YARN-4852:

Thanks Rohith. Should we consider adding a duplicate check on the RM side for completed containers as well, as we do for launched ones? That would make it more foolproof and eliminate scenarios like resync where the NM might still send duplicates. We can open a new ticket for this.
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209609#comment-15209609 ] Rohith Sharma K S commented on YARN-4852:

Adding to the above point: since NM->RM is a push design, already-sent container statuses are not supposed to be sent again unless there is a RESYNC command from the RM. So it should be a bug in the NodeManager.
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209583#comment-15209583 ] Rohith Sharma K S commented on YARN-4852:

Thanks for pointing out the duplicated container statuses stored in UpdatedContainerInfo. This brings to mind YARN-2997, which is already resolved. The scenario: the NM keeps completed containers in NMContext until the RM's heartbeat response tells the NM to remove them. On every heartbeat these container statuses (pendingCompletedContainers) are sent to the RM, and so can be duplicated! But on the RM side, no validation for duplicate entries is done while creating UpdatedContainerInfo. These keep accumulating when scheduler event processing is slow.
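The NM-side retention described above can be sketched as a toy model (hypothetical names and numbers only, not the actual NM/RM classes): a completed-container status stays in the pending set until the RM acknowledges it, so every heartbeat in between re-sends the same status.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: a completed-container status is kept in the NM's
// pending set until the RM's heartbeat response removes it, so the RM
// receives the same status once per heartbeat in between.
public class DuplicateStatusDemo {

    // How many times the RM receives one completed container's status
    // when the removal notification arrives only after `heartbeats` beats.
    static int timesReceivedByRm(int heartbeats) {
        Set<String> pendingCompletedContainers = new HashSet<>();
        pendingCompletedContainers.add("container_1_0001_01_000042");

        int received = 0;
        for (int hb = 0; hb < heartbeats; hb++) {
            // every still-pending status is sent again on each heartbeat
            received += pendingCompletedContainers.size();
        }
        pendingCompletedContainers.clear(); // removal notification arrives
        return received;
    }

    public static void main(String[] args) {
        // Without an RM-side duplicate check, each of these arrivals
        // becomes another queued UpdatedContainerInfo entry.
        System.out.println(timesReceivedByRm(3)); // prints 3
    }
}
```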
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209535#comment-15209535 ] Sharad Agarwal commented on YARN-4852:

Further analysis shows an exceptionally high number of "Null container completed..." log lines, somewhere between 100k and 200k every minute. This could be related to a lot of duplicate UpdatedContainerInfo objects for completed containers.
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209520#comment-15209520 ] Sharad Agarwal commented on YARN-4852:

[~rohithsharma] the slowness in the schedulers still does not explain the build-up of UpdatedContainerInfo to 0.5 million objects in a short span. UpdatedContainerInfo should only be created for newly launched/completed containers. Looking at the code in RMNodeImpl.StatusUpdateWhenHealthyTransition (branch 2.6.0):

{code}
      // Process running containers
      if (remoteContainer.getState() == ContainerState.RUNNING) {
        if (!rmNode.launchedContainers.contains(containerId)) {
          // Just launched container. RM knows about it the first time.
          rmNode.launchedContainers.add(containerId);
          newlyLaunchedContainers.add(remoteContainer);
        }
      } else {
        // A finished container
        rmNode.launchedContainers.remove(containerId);
        completedContainers.add(remoteContainer);
      }
    }
    if (newlyLaunchedContainers.size() != 0 || completedContainers.size() != 0) {
      rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(newlyLaunchedContainers,
          completedContainers));
    }
{code}

An UpdatedContainerInfo seems to be created every time there are completed containers in the container status (there is no check whether one was already created from a previous update). Wouldn't this lead to a lot of duplicate UpdatedContainerInfo objects and put unnecessary stress on the scheduler?
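A duplicate check of the kind this comment asks for could look roughly like the following sketch. This is an illustration only, with hypothetical names; the actual fix landed as YARN-4862 and may differ in detail. The idea is to remember which completed containers have already been queued for the scheduler and drop repeats before creating another UpdatedContainerInfo.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only (hypothetical names): remember which completed containers
// were already queued for the scheduler, and drop repeats before
// creating another UpdatedContainerInfo-like entry.
public class CompletedContainerDedup {
    private final Set<String> completedContainersQueued = new HashSet<>();

    // Returns only the statuses that are genuinely new this heartbeat.
    public List<String> filterNewlyCompleted(List<String> reported) {
        List<String> fresh = new ArrayList<>();
        for (String containerId : reported) {
            // Set.add() returns false if the id was already recorded
            if (completedContainersQueued.add(containerId)) {
                fresh.add(containerId);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        CompletedContainerDedup dedup = new CompletedContainerDedup();
        List<String> hb1 = List.of("c_01", "c_02");
        List<String> hb2 = List.of("c_01", "c_02", "c_03"); // c_01, c_02 re-sent

        System.out.println(dedup.filterNewlyCompleted(hb1)); // [c_01, c_02]
        System.out.println(dedup.filterNewlyCompleted(hb2)); // [c_03]
    }
}
```

In a real implementation the tracking set would have to be pruned once the scheduler has processed the container, or it would itself grow without bound.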
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208314#comment-15208314 ] Rohith Sharma K S commented on YARN-4852:

bq. By the way, what is the heartbeat interval from AM to RM in which it acquires the CS lock?

The MRAppMaster heartbeat interval is 1 second by default. And the CS lock is acquired only if there are resource requests (asks) in the heartbeat.
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208301#comment-15208301 ] Gokul commented on YARN-4852:

Thanks [~rohithsharma], this gives some perspective on the starvation of the Scheduler Event Processor thread. Maybe YARN-3487 would bring down the probability of this issue. It took more than 30 minutes for the heap to double and go OOM, so the Scheduler Event Processor should have gotten to process at least some nodeUpdate events. But the heap kept growing continuously and never came down. That's why I am not fully convinced that YARN-3487 would solve the issue. By the way, what is the heartbeat interval from AM to RM in which it acquires the CS lock?
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208283#comment-15208283 ] Rohith Sharma K S commented on YARN-4852:

I was explaining how the absence of YARN-3487 might cause the OOM. There could be other reasons causing the nodeUpdate queue to pile up, which need to be analysed. To rule out YARN-3487 as a suspect, apply the patch in the cluster. If the issue occurs again, it will be easier to focus on a particular area.
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208246#comment-15208246 ] Rohith Sharma K S commented on YARN-4852:

To be more clear:

*Flow-1*: Each AM heartbeat or application submission tries to acquire the CS lock. In your cluster, 93 concurrently running apps send resource requests in their AM heartbeats to the RM. All of these AM heartbeats race to obtain the CS lock.

*Flow-2*: On the other hand, the scheduler event processing thread dispatches events one by one, so at any point in time only one nodeUpdate event is processed. This nodeUpdate event also tries to acquire the CS lock and is in the same race (from your thread dump, nodeUpdate has acquired the CS lock, as I mentioned in my previous comment).

Consider the worst case where the AM heartbeats always get the chance to acquire the CS lock; then the nodeUpdate calls are delayed. Since the scheduler event processor handles events one by one, the other node update events pile up. Note that the scheduler node status event is triggered from RMNodeImpl. A delay in scheduler event processing does not block NodeManager heartbeats, so the NodeManagers keep sending heartbeats and adding to RMNodeImpl#nodeUpdateQueue.
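The two flows amount to many producers feeding an unbounded per-node queue while a single, lock-contended consumer drains it. A back-of-envelope model (hypothetical drain rate; NM heartbeats assumed at the 1-second default) shows how a modest shortfall in drain rate can reach a backlog on the order of the ~0.5 million queued updates seen in the heap dump:

```java
// Minimal model of the pileup: NMs enqueue one update per heartbeat,
// while the single scheduler event thread (contending with AM heartbeats
// for the CS lock) drains at a lower rate. Numbers are hypothetical.
public class NodeUpdateBacklog {

    static long backlogAfter(int nodes, int seconds, double drainedPerSecond) {
        long produced = (long) nodes * seconds;          // 1 update/NM/second
        long drained = (long) (drainedPerSecond * seconds);
        return Math.max(0, produced - drained);          // queue is unbounded
    }

    public static void main(String[] args) {
        // 1200 NMs over 30 minutes, consumer managing only ~900 updates/sec
        System.out.println(backlogAfter(1200, 1800, 900.0)); // prints 540000
    }
}
```

Under these assumed numbers the backlog reaches 540,000 entries in half an hour, the same ballpark as the 0.5 million UpdatedContainerInfo objects reported; the point is only that a sustained production/drain gap, not a burst, is enough.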
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208204#comment-15208204 ] Gokul commented on YARN-4852:

Agreed, 7 threads are waiting to lock CapacityScheduler.getQueueInfo. What is the impact of that many threads waiting on this lock in the application submission phase? Would it cause RMNodeImpl.nodeUpdateQueue to pile up? If yes, then YARN-3487 will fix the issue. Otherwise there must be some other reason, such as the consumer of the queue (RMNodeImpl.nodeUpdateQueue), the ResourceManager Event Processor thread, being stuck on something and not draining the queue. Note, though, that the thread doing nodeUpdate (ResourceManager Event Processor) is not in a blocked state; it is still runnable.

There are around 1200 NMs in the cluster. 93 apps were running when the issue occurred. 17803 containers were allocated and 63422 were pending. The job submission rate was roughly 6 per minute.
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208172#comment-15208172 ] Rohith Sharma K S commented on YARN-4852:

Looking at your attached thread dump, I feel the root cause of your issue is YARN-3487. You could try its patch if the problem recurs regularly.

From the thread dump, I see that 8 threads are waiting for the CS lock, of which 7 are in {{CapacityScheduler.getQueueInfo}}, which is called while validating resource requests either during application submission (for the AM resource request) or for AM heartbeat requests. At that moment, nodeUpdate is holding the CS lock; it can take a few millis to process the container statuses if there are many of them.

{code}
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:1190)
- locked <0x0005d4cfe5c8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:951)
- locked <0x0005d4cfe5c8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
{code}

In a larger cluster, if many ApplicationMasters are running concurrently and the application submission rate is very high, nodeUpdate will be significantly blocked waiting for the CS lock. The reason for the blocking is YARN-3487. So the more NodeManagers there are, the longer each node update takes to process, which in turn piles up the container statuses and might be causing the OOM.

Just for info: how many NodeManagers are in the cluster? How many AMs run concurrently, and how many tasks per job? What is the job submission rate?
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207936#comment-15207936 ] Gokul commented on YARN-4852:

Hi [~jianhe], the previous thread dump may have made it look like we hit [YARN-3487|https://issues.apache.org/jira/browse/YARN-3487], but that one was extracted from the heap dump. I'm now attaching the thread dump taken when the issue occurred the second time (and removing the incomplete one I attached before). There, the CS thread (Resource Manager Event Processor) that is supposed to consume UpdatedContainerInfo is not in a blocked state. Still, the queue filled up and the issue recurred. Any pointers here? One common observation is that both times the issue occurred there were a huge number of the log lines I mentioned above.
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206873#comment-15206873 ] Jian He commented on YARN-4852:

I suspect you may have run into YARN-3487, in which case the CS lock is held and UpdatedContainerInfo gets piled up. [~slukog], how often do you see this? Would you like to apply the YARN-3487 patch and give it a try?
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206761#comment-15206761 ] Gokul commented on YARN-4852:

Sure. Here are the log messages seen in the RM logs before it goes OOM. There are millions of such entries.

2016-03-22 20:04:40,443 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed...
2016-03-22 20:04:40,443 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed...
2016-03-22 20:04:40,443 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed...
2016-03-22 20:04:40,443 INFO capacity.CapacityScheduler (CapacityScheduler.java:completedContainer(1190)) - Null container completed...
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206613#comment-15206613 ] Daniel Templeton commented on YARN-4852: - [~slukog], could you post an excerpt from the log so we can see the exact log messages?
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206334#comment-15206334 ] Gokul commented on YARN-4852: - The cause of the OOM looks like a sudden flood of "Null container completed" log messages. Do we know in what scenarios this occurs?
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206164#comment-15206164 ] Gokul commented on YARN-4852: - Hi [~rohithsharma],
* During the sudden heap size increase we didn't notice any NM going down or restarting. But once the heap reached its max, we noticed a couple of NMs going down; this might be due to the RM being in an OOM state. {quote}Any NM got restarted? If so how many and how many containers were running in each NM.?{quote}
* Not sure whether the CapacityScheduler was deadlocked, but the thread dump shows three IPC Handler threads waiting on *CapacityScheduler.getQueueInfo* while the *Resource Manager Event Processor* thread was printing the log line "Null Container Completed". Millions of such log lines appear in the ResourceManager log around the time of the issue. {quote}Was there RM heavily loaded or any deadlock in scheduler where most of the node heart beat was not processed by scheduler?{quote}
* Will attach the thread dump in a while. {quote}Do you have Jstack report for RM while memory is increasing?{quote}
[jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
[ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15206113#comment-15206113 ] Rohith Sharma K S commented on YARN-4852: - [~slukog] Can you provide more information to help verify why there was such a sudden spike?
# Any NM got restarted? If so how many and how many containers were running in each NM.?
# Was there RM heavily loaded or any deadlock in scheduler where most of the node heart beat was not processed by scheduler?
# Do you have Jstack report for RM while memory is increasing?
These container statuses are cleared from nodeUpdateQueue when the node heartbeat is processed by the scheduler. If the scheduler is slow or has an issue, these statuses pile up and cause OOM.
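The pile-up described above can be sketched as a producer/consumer imbalance. The toy model below borrows names from the heap-dump description (nodeUpdateQueue, UpdatedContainerInfo-like entries) purely for illustration — it is not Hadoop code — and shows how an unbounded queue retains live objects when the drain side (scheduler) falls behind the enqueue side (NM heartbeats):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class NodeUpdateQueueDemo {
    // Stand-in for the per-node nodeUpdateQueue inside RMNodeImpl (unbounded)
    static final Queue<String> nodeUpdateQueue = new ArrayDeque<>();

    public static void main(String[] args) {
        int heartbeats = 1000;
        int statusesPerHeartbeat = 5;    // NM reports several container statuses per beat
        int drainedPerSchedulerTick = 2; // scheduler is slow and falls behind

        for (int i = 0; i < heartbeats; i++) {
            // NM heartbeat: enqueue UpdatedContainerInfo-like entries
            for (int j = 0; j < statusesPerHeartbeat; j++) {
                nodeUpdateQueue.add("containerStatus-" + i + "-" + j);
            }
            // Scheduler tick: drains fewer entries than arrive
            for (int k = 0; k < drainedPerSchedulerTick && !nodeUpdateQueue.isEmpty(); k++) {
                nodeUpdateQueue.poll();
            }
        }
        // Backlog grows by (5 - 2) per heartbeat: 3000 entries GC cannot reclaim
        System.out.println("retained updates: " + nodeUpdateQueue.size());
    }
}
```

Scaled up to 1200 nodes and real UpdatedContainerInfo objects (each referencing several protobuf objects), this kind of backlog is consistent with the ~0.5 million queued entries seen in the heap dump.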