[ 
https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202595#comment-14202595
 ] 

zhihai xu commented on YARN-2816:
---------------------------------

Hi [~jlowe],

thanks for the review.
Sorry I misunderstands the containers leaks you talk about. My containers leaks 
means if the container has container start record, it must have complete 
records, which is for my error case in the attached levelDB files. So if we 
remove all these containers without container start record, the remaining 
containers in levelDB will have complete records.

I fully agree to your comment. I attached a new patch which addressed all the 
comments except one:
{quote}
We have very likely leaked a container. Also the code should do much more than 
just forget the container and instead look for the pid file, try to kill it if 
found, and return a recovered container status of killed/lost or something 
similar.
{quote}

I added todo comment in the code for this. 
        // TODO: kill and cleanup the leaked container
I will create a separate JIRA to address this:
Because it need do a lot of stuff and also it can be shared by all other 
container leakage cases in the NM:
For example, when AM asks a container status which doesn't exist in the NM 
containers' list, which most likely is a leaked container.

It is a great pleasure to discuss the issue with you, I learned a lot from your 
comment.


> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>
>                 Key: YARN-2816
>                 URL: https://issues.apache.org/jira/browse/YARN-2816
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2816.000.patch, leveldb_records.txt
>
>
> NM fail to start with NPE during container recovery.
> We saw the following crash happen:
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> The reason is some DB files used in NMLeveldbStateStoreService are 
> accidentally deleted to save disk space at 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete 
> container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) 
> entry in the DB. When container is recovered at 
> ContainerManagerImpl#recoverContainer, 
> The NullPointerException at the following code cause NM shutdown.
> {code}
>     StartContainerRequest req = rcs.getStartRequest();
>     ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to