[
https://issues.apache.org/jira/browse/YARN-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900399#comment-14900399
]
Sunil G commented on YARN-4152:
-------------------------------
Thanks [~bibinchundatt].
Yes, it looks like the container was not present in the context. This happened
on the CONTAINER_FINISHED event, so the absent-container scenario can be handled
with this check. It also looks like this case is already handled for other
events; maybe you could double-check that and make sure similar incidents are
covered for the other events as well.
Otherwise the patch looks good to me.
> NM crash with NPE when LogAggregationService#stopContainer called for absent
> container
> --------------------------------------------------------------------------------------
>
> Key: YARN-4152
> URL: https://issues.apache.org/jira/browse/YARN-4152
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Priority: Critical
> Attachments: 0001-YARN-4152.patch, 0002-YARN-4152.patch,
> 0003-YARN-4152.patch
>
>
> The NM crashes during log aggregation.
> Ran a Pi job with 500 containers and killed the application midway.
> *Logs*
> {code}
> 2015-09-12 18:44:25,597 WARN
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code
> from container container_e51_1442063466801_0001_01_000099 is : 143
> 2015-09-12 18:44:25,670 WARN
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> Event EventType: KILL_CONTAINER sent to absent container
> container_e51_1442063466801_0001_01_000101
> 2015-09-12 18:44:25,670 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl:
> Removing container_e51_1442063466801_0001_01_000101 from application
> application_1442063466801_0001
> 2015-09-12 18:44:25,670 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher:
> Error in dispatcher thread
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.stopContainer(LogAggregationService.java:422)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:745)
> 2015-09-12 18:44:25,692 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got
> event CONTAINER_STOP for appId application_1442063466801_0001
> 2015-09-12 18:44:25,692 INFO org.apache.hadoop.yarn.event.AsyncDispatcher:
> Exiting, bbye..
> 2015-09-12 18:44:25,692 INFO
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=dsperf
> OPERATION=Container Finished - Succeeded TARGET=ContainerImpl
> RESULT=SUCCESS APPID=application_1442063466801_0001
> CONTAINERID=container_e51_1442063466801_0001_01_000100
> {code}
> *Analysis*
> It looks like {{stopContainer}} is also called for an absent container:
> {code}
> case CONTAINER_FINISHED:
>   LogHandlerContainerFinishedEvent containerFinishEvent =
>       (LogHandlerContainerFinishedEvent) event;
>   stopContainer(containerFinishEvent.getContainerId(),
>       containerFinishEvent.getExitCode());
>   break;
> {code}
> *Event EventType: KILL_CONTAINER sent to absent container
> container_e51_1442063466801_0001_01_000101*
> The call should be skipped when {{null==context.getContainers().get(containerId)}}
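
For illustration only, the skip condition described in the analysis could look
roughly like the Java sketch below. This is not the attached patch: the class
AbsentContainerGuard, the nested Container type, and the in-memory containers
map are hypothetical stand-ins for the NM context, and only the null check on
the looked-up container reflects what the issue proposes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AbsentContainerGuard {

    // Hypothetical stand-in for a container tracked by the NM context.
    static final class Container {
        final String type;

        Container(String type) {
            this.type = type;
        }
    }

    // Hypothetical stand-in for context.getContainers().
    static final Map<String, Container> containers = new ConcurrentHashMap<>();

    /**
     * Handles a finished container. Returns true if the container was known
     * and processed, false if it was absent and therefore skipped.
     */
    static boolean stopContainer(String containerId, int exitCode) {
        Container container = containers.get(containerId);
        if (container == null) {
            // Without this guard, dereferencing the lookup result would throw
            // a NullPointerException on the calling (dispatcher) thread.
            System.out.println("Skipping absent container " + containerId);
            return false;
        }
        System.out.println("Stopping " + container.type + " container "
            + containerId + " with exit code " + exitCode);
        return true;
    }

    public static void main(String[] args) {
        containers.put("container_known", new Container("TASK"));
        stopContainer("container_known", 0);     // known container, processed
        stopContainer("container_absent", 143);  // absent container, skipped
    }
}
```

The design choice here is to return early instead of throwing, since the
dispatcher log in this issue shows that an uncaught exception in the handler
takes down the dispatcher thread.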