[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200241#comment-14200241 ]
Jason Lowe commented on YARN-2816:
----------------------------------
This seems like a dubious use case. If something comes along and deletes
(i.e., corrupts) the leveldb database, then in general the NM will not be able
to recover properly. Patching up one particular scenario won't cover
the rest: containers could "leak" (i.e., be forgotten even though they're
still running), container start requests could be lost, etc.
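
To make the "patch up one scenario" point concrete, here is a minimal sketch of the kind of narrow guard under discussion. The class and field names (RecoveryGuardSketch, RecoveredState, startRequest) are hypothetical stand-ins for illustration only, not the real NM classes or the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for recovered container state; NOT the real NM classes.
final class RecoveryGuardSketch {
  static final class RecoveredState {
    final String containerId;
    final String startRequest; // null when the start-request entry is missing

    RecoveredState(String containerId, String startRequest) {
      this.containerId = containerId;
      this.startRequest = startRequest;
    }
  }

  // Returns the IDs of containers whose records are complete, skipping
  // records that lack a start request instead of dereferencing null.
  static List<String> recoverable(List<RecoveredState> states) {
    List<String> ok = new ArrayList<>();
    for (RecoveredState rcs : states) {
      if (rcs.startRequest == null) {
        // Incomplete record: log and skip rather than throw an NPE.
        System.err.println("Skipping incomplete record for " + rcs.containerId);
        continue;
      }
      ok.add(rcs.containerId);
    }
    return ok;
  }
}
```

Note that a skipped container may still be running, so a guard like this trades a startup crash for a potential leak and does nothing for the other kinds of damage a deleted file can cause.
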

As for the OS crash scenario, if the OS crashes then there's nothing left for
the NM to recover. If we really want to protect against OS crashes, a much
better way is to perform synchronous writes to leveldb. However, this is _much_
slower than asynchronous writes and could easily impact NM performance. Given
that there's nothing to recover after an OS crash, it doesn't seem
worth worrying about that case.

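
As a rough illustration of the sync-versus-async trade-off, the sketch below uses plain java.nio as a stand-in. It is not the leveldb or NM state store API, and SyncWriteSketch and its methods are hypothetical names. A synced append blocks until the OS reports the bytes are on stable storage, while an unsynced append returns once the data is in the OS cache.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for a state store log; NOT the leveldb or NM API.
final class SyncWriteSketch {
  // Appends one record. When sync is true, blocks until the OS reports the
  // bytes have been forced to the storage device.
  static void append(Path log, String record, boolean sync) throws IOException {
    try (FileChannel ch = FileChannel.open(log,
        StandardOpenOption.CREATE,
        StandardOpenOption.WRITE,
        StandardOpenOption.APPEND)) {
      ByteBuffer buf =
          ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
      while (buf.hasRemaining()) {
        ch.write(buf);
      }
      if (sync) {
        ch.force(true); // force file data and metadata to the device
      }
    }
  }

  // Performs n appends and returns the elapsed time in nanoseconds.
  static long timeAppends(Path log, int n, boolean sync) throws IOException {
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      append(log, "container_" + i, sync);
    }
    return System.nanoTime() - start;
  }
}
```

Comparing timeAppends(path, n, true) against timeAppends(path, n, false) on the target disk gives a feel for the per-write cost of syncing; the actual ratio depends heavily on the hardware and filesystem.
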
The real issue in the reported scenario is that the leveldb database location
is a poor choice for the way that system is configured, since something is
coming along and corrupting the database. Either the leveldb database needs to
be moved somewhere else or the file cleanup procedure needs to exclude the
leveldb database.

> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>
> Key: YARN-2816
> URL: https://issues.apache.org/jira/browse/YARN-2816
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.5.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Attachments: YARN-2816.000.patch
>
>
> NM fails to start with NPE during container recovery.
> We saw the following crash happen:
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> The reason is that some DB files used by NMLeveldbStateStoreService were
> accidentally deleted to save disk space at
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete
> container records that lack the CONTAINER_REQUEST_KEY_SUFFIX (startRequest)
> entry in the DB. When such a container is recovered in
> ContainerManagerImpl#recoverContainer,
> the NullPointerException in the following code causes the NM to shut down.
> {code}
> StartContainerRequest req = rcs.getStartRequest();
> ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}