[ https://issues.apache.org/jira/browse/YARN-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957101#comment-16957101 ]
Tarun Parimi commented on YARN-9928:
------------------------------------

The issue occurs because the container returned in the code snippet below is null.
{code:java}
private void publishContainerCreatedEvent(ContainerEvent event) {
  if (publishNMContainerEvents) {
    ContainerId containerId = event.getContainerID();
    ContainerEntity entity = createContainerEntity(containerId);
    // get() returns null if the container is no longer in the context.
    Container container = context.getContainers().get(containerId);
    // Throws NullPointerException when container is null.
    Resource resource = container.getResource();
{code}
This does not usually happen, because ContainerManagerImpl performs the same null check before calling the publisher.
{code:java}
Map<ContainerId,Container> containers =
    ContainerManagerImpl.this.context.getContainers();
Container c = containers.get(event.getContainerID());
if (c != null) {
  c.handle(event);
  if (nmMetricsPublisher != null) {
    nmMetricsPublisher.publishContainerEvent(event);
  }
{code}
But in a heavily loaded production cluster, with many events queued in the ContainerManager dispatcher and the NM resyncing with the RM at the same time in a separate NM dispatcher thread, the resync can suddenly remove all the completed containers. The container can then be gone by the time the publisher looks it up, even though the earlier check passed. So an additional null check on the container is needed to cover these scenarios.

> ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
> --------------------------------------------------------------------------
>
>                 Key: YARN-9928
>                 URL: https://issues.apache.org/jira/browse/YARN-9928
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: ATSv2
>    Affects Versions: 3.1.0
>            Reporter: Tarun Parimi
>            Assignee: Tarun Parimi
>            Priority: Major
>
> Encountered the below FATAL error in the NodeManager which was under heavy
> load and was also resyncing with RM at the same time. This caused the NM to go
> down.
> {code:java}
> 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216)
>         at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>         at java.lang.Thread.run(Thread.java:748)
> {code}
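
The additional null check described in the comment could look roughly like the sketch below. This is a self-contained illustration, not the actual YARN patch: the class NullSafePublishSketch, the simplified Container type, the String container IDs, and the static containers map are all hypothetical stand-ins for NMTimelinePublisher, the NodeManager's Container, ContainerId, and context.getContainers(). It only shows the pattern of checking the lookup result before dereferencing it, so that a container removed by a concurrent resync causes the event to be skipped instead of raising a NullPointerException in the dispatcher thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullSafePublishSketch {

  /** Hypothetical, simplified stand-in for the NodeManager's Container. */
  static final class Container {
    private final String resource;

    Container(String resource) {
      this.resource = resource;
    }

    String getResource() {
      return resource;
    }
  }

  /** Hypothetical stand-in for context.getContainers(), shared across threads. */
  static final Map<String, Container> containers = new ConcurrentHashMap<>();

  /**
   * Publishes a "container created" event if the container is still known.
   *
   * @return true if the event was published, false if it was skipped
   *         because the container had already been removed
   */
  static boolean publishContainerCreatedEvent(String containerId) {
    Container container = containers.get(containerId);
    if (container == null) {
      // The container may have been removed by another thread (for example
      // during a resync) after the caller's own null check succeeded.
      System.out.println("Skipping created event, container not found: "
          + containerId);
      return false;
    }
    // Safe to dereference: we hold a local, non-null reference.
    String resource = container.getResource();
    System.out.println("Publishing created event for " + containerId
        + " with resource " + resource);
    return true;
  }

  public static void main(String[] args) {
    containers.put("container_1", new Container("memory=1024, vcores=1"));
    publishContainerCreatedEvent("container_1");

    // Simulate the resync removing completed containers before the publish.
    containers.clear();
    publishContainerCreatedEvent("container_1");
  }
}
```

Reading the container into a local variable once and checking that variable avoids a second lookup that could race with the removal. Whether the real fix should log, skip silently, or publish a partial entity is a design decision this sketch does not settle.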