[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199866#comment-14199866 ]
zhihai xu commented on YARN-2816:
---------------------------------

[levelDB document|http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/]:
LevelDB never writes in place: it always appends to a log file, or merges existing files together to produce new ones. So an OS crash will cause a partially written log record (or a few partially written log records). LevelDB recovery code uses checksums to detect this and will skip the incomplete records.

Based on the above, if the incomplete record is the CONTAINER_REQUEST_KEY_SUFFIX record, which stores the container's startRequest, this issue will occur.
The NM cannot protect against an OS crash. This means we must add error handling so that the NM does not shut down because of the NPE, which justifies the patch.

> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>
>                 Key: YARN-2816
>                 URL: https://issues.apache.org/jira/browse/YARN-2816
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2816.000.patch
>
>
> The NM fails to start with an NPE during container recovery.
> We saw the following crash:
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> The cause is that some DB files used by NMLeveldbStateStoreService, under /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state, were accidentally deleted to save disk space. This left some incomplete container records that have no CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When such a container is recovered in ContainerManagerImpl#recoverContainer, the following code throws a NullPointerException, which shuts down the NM:
> {code}
> StartContainerRequest req = rcs.getStartRequest();
> ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}
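
For illustration only, the kind of error handling argued for above could look roughly like the sketch below. It is not the attached patch and does not use the real YARN classes: RecoverySketch, StartRequest, RecoveredState and recoverAll are hypothetical stand-ins, and the launch context is reduced to a String. The only point it makes is that a missing startRequest is checked for and the container is skipped, so one incomplete record does not abort recovery of the others.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecoverySketch {

    /** Hypothetical stand-in for StartContainerRequest. */
    static final class StartRequest {
        private final String launchContext;

        StartRequest(String launchContext) {
            this.launchContext = launchContext;
        }

        String getContainerLaunchContext() {
            return launchContext;
        }
    }

    /** Hypothetical stand-in for the recovered container state ("rcs"). */
    static final class RecoveredState {
        private final String containerId;
        // null models a lost CONTAINER_REQUEST_KEY_SUFFIX record
        private final StartRequest startRequest;

        RecoveredState(String containerId, StartRequest startRequest) {
            this.containerId = containerId;
            this.startRequest = startRequest;
        }

        String getContainerId() {
            return containerId;
        }

        StartRequest getStartRequest() {
            return startRequest;
        }
    }

    /**
     * Returns the launch contexts of the containers that could be recovered.
     * Containers without a start request are logged and skipped instead of
     * being dereferenced, which is where the NPE would otherwise occur.
     */
    static List<String> recoverAll(List<RecoveredState> states) {
        List<String> recovered = new ArrayList<>();
        for (RecoveredState rcs : states) {
            StartRequest req = rcs.getStartRequest();
            if (req == null) {
                System.err.println("Skipping container " + rcs.getContainerId()
                    + ": no start request found in the state store");
                continue;
            }
            recovered.add(req.getContainerLaunchContext());
        }
        return recovered;
    }

    public static void main(String[] args) {
        List<RecoveredState> states = Arrays.asList(
            new RecoveredState("container_A", new StartRequest("ctx-A")),
            new RecoveredState("container_B", null)); // incomplete record
        System.out.println(recoverAll(states));
    }
}
```

Whether an incomplete container should be skipped silently, logged, or also cleaned out of the state store is a separate design question that this sketch does not try to answer.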