zhihai xu commented on YARN-2816:

Hi [~jlowe], thanks for the review,
Based on your comment, it look like the the issue is not as serious as what I 
thought previously, I lowered the issue from Critical to Major.
move the levelDB database location from /tmp directory to other safe directory 
is a very good suggestion.

But I still think the patch is good for better error handling and NM recovery.

1. If an OS crash will cause a few partially written log record in the levelDB, 
we can't restart the NM due to NPE.
To restart the NM, we need manually delete all these stateStore files. I think 
it will be bad for the user.
Also with the patch, we still can recover most of the containers in the NM 
instead of losing all these container information, which will cause all these 
containers to be reallocated by RM.

2. The patch is to delete all these containers which don't have container start 
requests. It won't cause containers leaks. Because container start request is 
always the first entry to store(startContainerInternal) in the levelDB for each 
container records and it is always the first entry to remove (removeContainer) 
in the levelDB for each container records. Also when I debugged this problem, 
the levelDB stored most latest records in a LRU cache(memory), when the levelDB 
is closed, it will store these latest records from cache to the disk and our 
use case for levelDB is always to write key-value record and no read after NM 
initialization. The data record lost on disk will always the old record, so 
container start request record is more likely lost than other container 
records. This makes the patch meaningful for our use case.
I attached a LevelDB container key record which has this issue.

3. Missing containers record is not a problem if we can recover most containers 
in NM.
When AM calls ContainerManagementProtocol#getContainerStatuses, the NM will put 
the missing containers in failed requests of GetContainerStatusesResponse, then 
the AM will notify RM to release these missing containers in 
ApplicationMasterProtocol#allocate(AllocateRequest.getReleaseList). So the 
missing containers will be recovered quickly by AM.
With the patch, we still can recover all these containers with complete record 
in levelDB.
The less containers we missed to recover, the better.

4. The patch is small and safe, It won't cause any side effect.

> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>                 Key: YARN-2816
>                 URL: https://issues.apache.org/jira/browse/YARN-2816
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-2816.000.patch
> NM fail to start with NPE during container recovery.
> We saw the following crash happen:
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
>  failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>       at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> The reason is some DB files used in NMLeveldbStateStoreService are 
> accidentally deleted to save disk space at 
> /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete 
> container record which don't have CONTAINER_REQUEST_KEY_SUFFIX(startRequest) 
> entry in the DB. When container is recovered at 
> ContainerManagerImpl#recoverContainer, 
> The NullPointerException at the following code cause NM shutdown.
> {code}
>     StartContainerRequest req = rcs.getStartRequest();
>     ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}

This message was sent by Atlassian JIRA

Reply via email to