[ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202177#comment-14202177
 ] 

Jason Lowe commented on YARN-2816:
----------------------------------

bq.  It won't cause containers leaks. Because container start request is always 
the first entry to store(startContainerInternal) in the levelDB for each 
container records and it is always the first entry to remove (removeContainer) 
in the levelDB for each container records.

I don't understand this statement.  If the start container request is the first 
record to be lost, what happens if we write the start container request, launch 
(or maybe don't launch) the container, then restart?  If we lost the container 
start record but the container had not completed before the restart, didn't we 
just lose track of it upon recovery?
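
To make the scenario concrete, here is a minimal sketch using an in-memory 
sorted map as a stand-in for the leveldb store. The class name, key suffixes, 
and recoverTracked helper are hypothetical and are not the real 
NMLeveldbStateStoreService code; the sketch only assumes that recovery skips 
any container whose start request record is missing.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for the NM state store: a sorted key/value map.
class LostStartRecordSketch {
    // Illustrative key suffixes (assumptions, not the real store constants).
    static final String REQUEST = "/request";
    static final String LAUNCHED = "/launched";

    // Returns the container IDs that recovery would track if it only
    // recognizes containers that still have a start request record.
    static Set<String> recoverTracked(Map<String, String> db) {
        Set<String> tracked = new TreeSet<>();
        for (String key : db.keySet()) {
            if (key.endsWith(REQUEST)) {
                tracked.add(key.substring(0, key.length() - REQUEST.length()));
            }
        }
        return tracked;
    }

    public static void main(String[] args) {
        Map<String, String> db = new TreeMap<>();
        db.put("container_1" + REQUEST, "start request bytes");
        db.put("container_1" + LAUNCHED, "");

        // Corruption: the start record is lost while the container still runs.
        db.remove("container_1" + REQUEST);

        // The launched record remains, but recovery no longer tracks the
        // container, so a still-running process would be leaked.
        System.out.println("tracked after restart: " + recoverTracked(db));
    }
}
```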

Anyway this shouldn't make things much worse than the NM failing to start up if 
this specific instance of database corruption occurs.  I just think we need to 
realize that there are _many_ other ways the database could be corrupted and 
this only works around a very specific instance of it.  Comments on the patch:

{noformat}
+      LOG.info("Remove container " + containerId +
+          " with incomplete records");
{noformat}

The above needs to be logged at least at the warn level, if not error.  We have 
very likely leaked a container.  Also, rather than just forgetting the 
container, the code should look for the pid file, try to kill the process if 
one is found, and return a recovered container status of killed/lost or 
something similar.  We shouldn't just pretend the container didn't exist when 
returning recovered containers.
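
A rough sketch of the handling I have in mind is below.  It is not the patch 
and does not use the real NM classes: RecoveredStatus, ProcessKiller, and 
recoverIncomplete are hypothetical names, and it assumes the pid file holds a 
single decimal pid.  In the NM the kill would go through the container 
executor rather than a callback.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.logging.Logger;

class IncompleteRecordRecoverySketch {
    private static final Logger LOG =
        Logger.getLogger("IncompleteRecordRecoverySketch");

    // Hypothetical status reported for a container with incomplete records.
    enum RecoveredStatus { KILLED, LOST }

    // Hypothetical hook for signalling a process; returns true on success.
    interface ProcessKiller {
        boolean kill(long pid);
    }

    // Reads a pid from the file, or returns empty if the file is missing
    // or does not contain a parseable number.
    static Optional<Long> readPid(Path pidFile) {
        if (pidFile == null || !Files.isRegularFile(pidFile)) {
            return Optional.empty();
        }
        try {
            String text = new String(Files.readAllBytes(pidFile),
                StandardCharsets.UTF_8).trim();
            return Optional.of(Long.parseLong(text));
        } catch (IOException | NumberFormatException e) {
            LOG.warning("Unable to read pid from " + pidFile + ": " + e);
            return Optional.empty();
        }
    }

    // Warns about the incomplete records, tries to kill the process named
    // in the pid file, and reports a status instead of silently dropping
    // the container.
    static RecoveredStatus recoverIncomplete(String containerId, Path pidFile,
            ProcessKiller killer) {
        LOG.warning("Container " + containerId
            + " has incomplete records and may have leaked");
        Optional<Long> pid = readPid(pidFile);
        if (pid.isPresent() && killer.kill(pid.get())) {
            return RecoveredStatus.KILLED;
        }
        return RecoveredStatus.LOST;
    }
}
```

The point is that the caller always gets a status back for the container, so 
the recovered container list still reflects that it existed.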

{noformat}
-        LOG.info("Creating state database at " + dbfile);
+        LOG.info("Creating state database at " + dbfile, e);
{noformat}

Why was this change made?  I don't see the point of logging the exception 
showing the database didn't exist when we already checked for that condition in 
this code path.

> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>
>                 Key: YARN-2816
>                 URL: https://issues.apache.org/jira/browse/YARN-2816
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2816.000.patch, leveldb_records.txt
>
>
> NM fails to start with NPE during container recovery.
> We saw the following crash happen:
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
>       at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>       at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
>       at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>       at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> The cause is that some DB files used by NMLeveldbStateStoreService were 
> accidentally deleted from /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state to 
> save disk space. This left some incomplete container records that have no 
> CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When such a 
> container is recovered in ContainerManagerImpl#recoverContainer, the 
> following code throws a NullPointerException, which causes the NM to shut 
> down.
> {code}
>     StartContainerRequest req = rcs.getStartRequest();
>     ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
