[ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202595#comment-14202595 ]
zhihai xu commented on YARN-2816:
---------------------------------

Hi [~jlowe], thanks for the review.
Sorry, I misunderstood the container leak you were talking about. By "container leak" I meant that if a container has a container start record, it must have complete records, which is what my error case in the attached levelDB files is about. So if we remove all the containers without a container start record, the remaining containers in levelDB will have complete records.
I fully agree with your comment. I attached a new patch which addresses all the comments except one:
{quote}
We have very likely leaked a container. Also the code should do much more than just forget the container and instead look for the pid file, try to kill it if found, and return a recovered container status of killed/lost or something similar.
{quote}
I added a TODO comment in the code for this:
// TODO: kill and cleanup the leaked container
I will create a separate JIRA to address this, because it needs a lot of work and it can also be shared by all the other container leak cases in the NM. For example, when an AM asks for the status of a container which doesn't exist in the NM's container list, that container is most likely leaked.
It is a great pleasure to discuss the issue with you; I learned a lot from your comments.

> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>
>                 Key: YARN-2816
>                 URL: https://issues.apache.org/jira/browse/YARN-2816
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2816.000.patch, leveldb_records.txt
>
>
> NM fail to start with NPE during container recovery.
> We saw the following crash happen:
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> The reason is that some DB files used in NMLeveldbStateStoreService were accidentally deleted to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete container records which don't have a CONTAINER_REQUEST_KEY_SUFFIX(startRequest) entry in the DB. When such a container is recovered in ContainerManagerImpl#recoverContainer, the NullPointerException at the following code causes the NM to shut down.
> {code}
> StartContainerRequest req = rcs.getStartRequest();
> ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
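
For illustration only (this is not the attached patch): a minimal, self-contained sketch of the kind of guard the comment describes, where a recovered record without a start request is skipped instead of dereferenced. The classes RecoverySketch, RecoveredState and StartRequest are hypothetical stand-ins and are not the real NodeManager types; the actual fix may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RecoverySketch {
    /** Hypothetical stand-in for the start request stored in the state store. */
    static final class StartRequest {
        final String launchContext;
        StartRequest(String launchContext) { this.launchContext = launchContext; }
    }

    /** Hypothetical stand-in for a recovered record; startRequest is null when the entry is missing. */
    static final class RecoveredState {
        final String containerId;
        final StartRequest startRequest;
        RecoveredState(String containerId, StartRequest startRequest) {
            this.containerId = containerId;
            this.startRequest = startRequest;
        }
    }

    /** Returns the ids of containers whose records are complete enough to recover. */
    static List<String> recoverAll(List<RecoveredState> states) {
        List<String> recovered = new ArrayList<>();
        for (RecoveredState rcs : states) {
            StartRequest req = rcs.startRequest;
            if (req == null) {
                // Incomplete record: skip it rather than hit an NPE on req.
                // TODO: kill and cleanup the leaked container
                System.err.println("Skipping " + rcs.containerId + ": no start request record");
                continue;
            }
            recovered.add(rcs.containerId);
        }
        return recovered;
    }

    public static void main(String[] args) {
        List<RecoveredState> states = Arrays.asList(
            new RecoveredState("container_1", new StartRequest("ctx")),
            new RecoveredState("container_2", null)); // simulates the deleted DB entry
        System.out.println(recoverAll(states));
    }
}
```

The point of the sketch is only that one incomplete record no longer aborts recovery of the others; cleaning up the possibly leaked container is left as the TODO mentioned in the comment.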