[
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199866#comment-14199866
]
zhihai xu commented on YARN-2816:
---------------------------------
From the [LevelDB document|http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/]:
LevelDB never writes in place: it always appends to a log file, or merges
existing files together to produce new ones. So an OS crash will cause a
partially written log record (or a few partially written log records). LevelDB
recovery code uses checksums to detect this and will skip the incomplete
records.
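To illustrate the quoted behavior, here is a minimal sketch of checksum-based log recovery. The record framing (4-byte CRC32, 4-byte length, payload) and the names LogRecoverySketch, frame and recover are assumptions made for this example; this is not LevelDB's actual on-disk format or recovery code.

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

final class LogRecoverySketch {
  // Hypothetical framing, NOT LevelDB's format:
  // [4-byte CRC32 of payload][4-byte payload length][payload]
  static byte[] frame(byte[] payload) {
    CRC32 crc = new CRC32();
    crc.update(payload);
    return ByteBuffer.allocate(8 + payload.length)
        .putInt((int) crc.getValue())
        .putInt(payload.length)
        .put(payload)
        .array();
  }

  // Returns every record whose checksum verifies and stops at the
  // first truncated or corrupt record, so the partial tail is dropped.
  static List<byte[]> recover(byte[] log) {
    List<byte[]> records = new ArrayList<>();
    ByteBuffer buf = ByteBuffer.wrap(log);
    while (buf.remaining() >= 8) {
      int storedCrc = buf.getInt();
      int length = buf.getInt();
      if (length < 0 || length > buf.remaining()) {
        break; // partially written record: payload is cut short
      }
      byte[] payload = new byte[length];
      buf.get(payload);
      CRC32 crc = new CRC32();
      crc.update(payload);
      if ((int) crc.getValue() != storedCrc) {
        break; // checksum mismatch: record was not fully written
      }
      records.add(payload);
    }
    return records;
  }
}
{code}

In this sketch a record cut short by a crash is silently absent after recovery, so a reader of the store sees a missing entry rather than an error.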
Based on the above information, if the incomplete record is the
CONTAINER_REQUEST_KEY_SUFFIX record used to store the container startRequest,
this issue will happen. The NM cannot protect against an OS crash, so we must
add error handling code to avoid an NM shutdown due to the NPE. This justifies
the patch.
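As a rough illustration of that kind of error handling (not the attached patch), the sketch below skips a recovered container whose start request is missing instead of dereferencing it. RecoveryGuardSketch, RecoveredState and recoverAll are hypothetical stand-ins for this example, not the actual NodeManager classes.

{code:java}
import java.util.ArrayList;
import java.util.List;

final class RecoveryGuardSketch {
  // Hypothetical stand-in for the NM's recovered container state.
  static final class RecoveredState {
    final String containerId;
    final String startRequest; // null when the start-request record was lost

    RecoveredState(String containerId, String startRequest) {
      this.containerId = containerId;
      this.startRequest = startRequest;
    }
  }

  // Recovers only containers with a complete record; incomplete ones
  // are logged and skipped rather than failing the whole recovery.
  static List<String> recoverAll(List<RecoveredState> states) {
    List<String> recovered = new ArrayList<>();
    for (RecoveredState rcs : states) {
      if (rcs.startRequest == null) {
        System.err.println("Skipping container " + rcs.containerId
            + ": no start request found in the state store");
        continue;
      }
      recovered.add(rcs.containerId);
    }
    return recovered;
  }
}
{code}

Whether an incomplete record should be skipped at recovery time or removed when the state is loaded is a design choice this sketch does not settle.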
> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>
> Key: YARN-2816
> URL: https://issues.apache.org/jira/browse/YARN-2816
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Attachments: YARN-2816.000.patch
>
>
> The NM fails to start with an NPE during container recovery.
> We saw the following crash:
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> The reason is that some DB files used in NMLeveldbStateStoreService, under
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state, were accidentally deleted to
> save disk space. This leaves some incomplete container records that have no
> CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When such a
> container is recovered in ContainerManagerImpl#recoverContainer, a
> NullPointerException in the following code causes the NM to shut down.
> {code}
> StartContainerRequest req = rcs.getStartRequest();
> ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}