Kanwaljeet Sachdev created YARN-8242:
----------------------------------------

             Summary: YARN NM: OOM error while reading back the state store on 
recovery
                 Key: YARN-8242
                 URL: https://issues.apache.org/jira/browse/YARN-8242
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: yarn
    Affects Versions: 3.2.0
            Reporter: Kanwaljeet Sachdev


On startup the NM reads its state store and builds a list of the applications 
in the state store to process. If the number of applications in the state store 
is large and each has a lot of "state" attached to it, the NM can run out of 
memory (OOM) and never reach the point where it can start processing the 
recovery.
Since recovery never starts, the NM has no way to get past this point. Getting 
the NM started then requires increasing the heap size.
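
The failure mode described above can be illustrated with a minimal, 
self-contained sketch. It is not NodeManager code: RecoverySketch, Record, 
scan, loadAll and recoverStreaming are hypothetical names invented for this 
example, and the streaming variant is only one possible direction, assumed 
here for contrast rather than taken from this report.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names exist in Hadoop.
class RecoverySketch {

    /** Stand-in for one decoded container record read from the state store. */
    static final class Record {
        final int id;
        final byte[] payload; // models the "state" attached to the record

        Record(int id, int payloadBytes) {
            this.id = id;
            this.payload = new byte[payloadBytes];
        }
    }

    /** Stand-in for a state store scan: decodes one record per next() call. */
    static Iterator<Record> scan(final int count, final int payloadBytes) {
        return new Iterator<Record>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < count;
            }

            @Override
            public Record next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return new Record(next++, payloadBytes);
            }
        };
    }

    /** Eager pattern: every record stays reachable until the whole list is built. */
    static List<Record> loadAll(Iterator<Record> it) {
        List<Record> all = new ArrayList<>();
        while (it.hasNext()) {
            all.add(it.next());
        }
        return all;
    }

    /** Streaming pattern: each record is handled, then becomes garbage-collectable. */
    static int recoverStreaming(Iterator<Record> it, Consumer<Record> handler) {
        int processed = 0;
        while (it.hasNext()) {
            handler.accept(it.next());
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        int count = 1000;
        int payloadBytes = 1024;

        // Peak heap use grows with count * payloadBytes.
        List<Record> all = loadAll(scan(count, payloadBytes));
        System.out.println("eager: " + all.size() + " records held at once");

        // Peak heap use stays near one record, as long as the handler keeps no reference.
        int processed = recoverStreaming(scan(count, payloadBytes), r -> { });
        System.out.println("streaming: " + processed + " records handled one at a time");
    }
}
```

With the eager pattern the heap must hold every record before any recovery 
work begins, which matches the situation described above. The streaming 
pattern only helps if the handler does not itself retain each record.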

 

The stack trace follows:
{code:java}
at java.lang.OutOfMemoryError.<init> (OutOfMemoryError.java:48)
at com.google.protobuf.ByteString.copyFrom (ByteString.java:192)
at com.google.protobuf.CodedInputStream.readBytes (CodedInputStream.java:324)
at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto.<init> (YarnProtos.java:47069)
at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto.<init> (YarnProtos.java:47014)
at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom (YarnProtos.java:47102)
at org.apache.hadoop.yarn.proto.YarnProtos$StringStringMapProto$1.parsePartialFrom (YarnProtos.java:47097)
at com.google.protobuf.CodedInputStream.readMessage (CodedInputStream.java:309)
at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.<init> (YarnProtos.java:41016)
at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto.<init> (YarnProtos.java:40942)
at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom (YarnProtos.java:41080)
at org.apache.hadoop.yarn.proto.YarnProtos$ContainerLaunchContextProto$1.parsePartialFrom (YarnProtos.java:41075)
at com.google.protobuf.CodedInputStream.readMessage (CodedInputStream.java:309)
at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.<init> (YarnServiceProtos.java:24517)
at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.<init> (YarnServiceProtos.java:24464)
at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom (YarnServiceProtos.java:24568)
at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto$1.parsePartialFrom (YarnServiceProtos.java:24563)
at com.google.protobuf.AbstractParser.parsePartialFrom (AbstractParser.java:141)
at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:176)
at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:188)
at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:193)
at com.google.protobuf.AbstractParser.parseFrom (AbstractParser.java:49)
at org.apache.hadoop.yarn.proto.YarnServiceProtos$StartContainerRequestProto.parseFrom (YarnServiceProtos.java:24739)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainerState (NMLeveldbStateStoreService.java:217)
at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.loadContainersState (NMLeveldbStateStoreService.java:170)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover (ContainerManagerImpl.java:253)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit (ContainerManagerImpl.java:237)
at org.apache.hadoop.service.AbstractService.init (AbstractService.java:163)
at org.apache.hadoop.service.CompositeService.serviceInit (CompositeService.java:107)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit (NodeManager.java:255)
at org.apache.hadoop.service.AbstractService.init (AbstractService.java:163)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager (NodeManager.java:474)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main (NodeManager.java:521)
{code}



