[jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery

Jason Lowe (JIRA) Thu, 06 Nov 2014 06:41:58 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200241#comment-14200241
 ]


Jason Lowe commented on YARN-2816:
----------------------------------

This seems like a dubious use case.  If something comes along and deletes 
(i.e., corrupts) the leveldb database then in general the NM will not be able 
to recover properly.  Patching up one particular scenario won't cover 
the rest: containers could "leak" (i.e., be forgotten even though they're 
still running), container start requests could be lost, etc.

As for the OS crash scenario, if the OS crashes then the containers die with 
it and there's nothing left for the NM to recover.  If we really want to 
protect against OS crashes then a much better way is to perform synchronous 
writes to leveldb.  However, this is _much_ slower than asynchronous writes 
and could easily impact NM performance.  Given that there's nothing to recover 
in the OS crash scenario, it doesn't seem worth worrying about that case.
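
To make the sync vs. async cost concrete, here is a minimal plain-Java sketch. 
It does not use leveldb at all: FileChannel.force() stands in for a 
synchronous write, and the class name, record format and record count are 
made up for illustration.  leveldb exposes the same choice as a per-write 
sync option.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncWriteSketch {

    /**
     * Writes `count` small records to `file` and returns the elapsed nanoseconds.
     * When `sync` is true, every record is forced to the storage device before
     * the next one is written (an assumed stand-in for a synchronous DB write).
     */
    static long writeRecords(Path file, int count, boolean sync) throws IOException {
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < count; i++) {
                byte[] bytes = ("record-" + i + "\n").getBytes(StandardCharsets.UTF_8);
                ByteBuffer buf = ByteBuffer.wrap(bytes);
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                if (sync) {
                    // Block until data and metadata have reached the device.
                    ch.force(true);
                }
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path asyncFile = Files.createTempFile("async-sketch", ".log");
        Path syncFile = Files.createTempFile("sync-sketch", ".log");
        try {
            long asyncNanos = writeRecords(asyncFile, 200, false);
            long syncNanos = writeRecords(syncFile, 200, true);
            System.out.println("async: " + asyncNanos + " ns, sync: " + syncNanos + " ns");
        } finally {
            Files.deleteIfExists(asyncFile);
            Files.deleteIfExists(syncFile);
        }
    }
}
```

Both runs produce identical file contents; only the durability guarantee 
after each record, and the time taken, differ.  Actual timings depend 
heavily on the disk and filesystem.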

The real issue for the reported scenario is that the leveldb database location 
is a poor one for the way that system is configured, since something is coming 
along and corrupting the database.  Either the leveldb database needs to be 
moved somewhere else or the file cleanup procedure needs to exclude the leveldb 
database.
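
As a rough illustration of the location problem, the sketch below checks 
whether a state store directory sits under a root that cleanup jobs are 
likely to sweep.  The class, the helper and the choice of /tmp as the only 
cleanup-prone root are assumptions for the example, not NM code; in a real 
deployment the directory is set through yarn.nodemanager.recovery.dir.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RecoveryDirCheck {

    /** Returns true if `dir` lies under (or equals) any of the given roots. */
    static boolean isUnderAny(Path dir, Path... roots) {
        Path normalized = dir.toAbsolutePath().normalize();
        for (Path root : roots) {
            // Path.startsWith compares whole name elements, so /tmpfoo is not under /tmp.
            if (normalized.startsWith(root.toAbsolutePath().normalize())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Location from the reported scenario.
        Path stateDir = Paths.get("/tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state");
        // Assumed list of directories that a disk-space cleanup might purge.
        Path cleanupRoot = Paths.get("/tmp");

        if (isUnderAny(stateDir, cleanupRoot)) {
            System.out.println("WARNING: " + stateDir
                + " is under a cleanup-prone directory; move it or exclude it from cleanup");
        }
    }
}
```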

> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>
>                 Key: YARN-2816
>                 URL: https://issues.apache.org/jira/browse/YARN-2816
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-2816.000.patch
>
>
> NM fails to start with NPE during container recovery.
> We saw the following crash happen:
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
>       at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>       at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
>       at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>       at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> The reason is that some DB files used by NMLeveldbStateStoreService were 
> accidentally deleted to save disk space at 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This left some incomplete 
> container records which don't have a CONTAINER_REQUEST_KEY_SUFFIX (startRequest) 
> entry in the DB. When such a container is recovered in 
> ContainerManagerImpl#recoverContainer, the start request is null, and the 
> resulting NullPointerException in the following code causes the NM to shut down.
> {code}
>     StartContainerRequest req = rcs.getStartRequest();
>     ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}
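
For readers following the trace, the failure mode in the quoted snippet can 
be reproduced in isolation.  The sketch below uses hypothetical stand-in 
classes (RecoveredState, StartRequest, LaunchContext), not the real YARN 
types, and shows one possible way to surface the problem with a descriptive 
error instead of a bare NPE.  Whether recovery should abort, skip the 
container, or do something else in this situation is exactly what is under 
discussion on this issue; the sketch does not settle that.

```java
public class RecoverContainerSketch {

    // Hypothetical stand-ins for the YARN classes named in the report.
    static class LaunchContext { }

    static class StartRequest {
        private final LaunchContext context = new LaunchContext();
        LaunchContext getContainerLaunchContext() { return context; }
    }

    static class RecoveredState {
        private final StartRequest startRequest;
        RecoveredState(StartRequest startRequest) { this.startRequest = startRequest; }
        // Returns null when the start-request entry is missing from the store.
        StartRequest getStartRequest() { return startRequest; }
    }

    /** Returns the launch context, or fails with a message naming the container. */
    static LaunchContext launchContextOf(String containerId, RecoveredState rcs) {
        StartRequest req = rcs.getStartRequest();
        if (req == null) {
            // Without this check, the next line would throw a bare NullPointerException.
            throw new IllegalStateException("Recovered state for container " + containerId
                + " has no start request; the state store may be incomplete or corrupted");
        }
        return req.getContainerLaunchContext();
    }

    public static void main(String[] args) {
        // Complete record: recovery can proceed.
        launchContextOf("container_ok", new RecoveredState(new StartRequest()));

        // Incomplete record, as in the reported scenario.
        try {
            launchContextOf("container_incomplete", new RecoveredState(null));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```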



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
