[
https://issues.apache.org/jira/browse/YARN-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957101#comment-16957101
]
Tarun Parimi commented on YARN-9928:
------------------------------------
The issue occurs because the container returned in the code snippet below is
null.
{code:java}
private void publishContainerCreatedEvent(ContainerEvent event) {
  if (publishNMContainerEvents) {
    ContainerId containerId = event.getContainerID();
    ContainerEntity entity = createContainerEntity(containerId);
    // get() returns null if the container has already been removed
    Container container = context.getContainers().get(containerId);
    Resource resource = container.getResource(); // NPE when container is null
{code}
This does not usually happen because ContainerManagerImpl performs a null
check on the same container before calling the publisher.
{code:java}
Map<ContainerId,Container> containers =
    ContainerManagerImpl.this.context.getContainers();
Container c = containers.get(event.getContainerID());
if (c != null) {
  c.handle(event);
  if (nmMetricsPublisher != null) {
    nmMetricsPublisher.publishContainerEvent(event);
  }
{code}
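But in a heavily loaded prod cluster, with lots of events queued in the
ContainerManager dispatcher, the NM can be resyncing with the RM at the same
time in a separate NM dispatcher thread. The resync can suddenly remove all
the completed containers, so a container that passed the null check in
ContainerManagerImpl may already be gone when NMTimelinePublisher looks it up
again.
So an additional null check is needed for the container in
NMTimelinePublisher for these scenarios.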
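A rough, self-contained sketch of the kind of guard I have in mind is below.
It is not the actual patch: NullCheckSketch, FakeContainer and the String
container ids are hypothetical stand-ins for NMTimelinePublisher, Container
and ContainerId, and the static map only imitates context.getContainers().

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullCheckSketch {
    /** Hypothetical stand-in for the NM Container; only carries a resource string. */
    static final class FakeContainer {
        private final String resource;
        FakeContainer(String resource) { this.resource = resource; }
        String getResource() { return resource; }
    }

    /** Stand-in for context.getContainers(), shared between dispatcher threads. */
    static final Map<String, FakeContainer> containers = new ConcurrentHashMap<>();

    /**
     * Returns true if the event was published, false if it was skipped because
     * the container had already been removed from the map.
     */
    static boolean publishContainerCreatedEvent(String containerId) {
        FakeContainer container = containers.get(containerId);
        if (container == null) {
            // Removed concurrently (for example by a resync): skip the event
            // instead of letting an NPE reach the dispatcher thread.
            return false;
        }
        String resource = container.getResource();
        System.out.println("published " + containerId + " with " + resource);
        return true;
    }

    public static void main(String[] args) {
        containers.put("container_1", new FakeContainer("<memory:1024, vCores:1>"));
        boolean first = publishContainerCreatedEvent("container_1");
        containers.clear(); // imitates the resync removing completed containers
        boolean second = publishContainerCreatedEvent("container_1");
        System.out.println(first + " " + second);
    }
}
```

Skipping the event loses one timeline entry for a container that is already
gone, which seems preferable to taking the whole NM down.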
> ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
> --------------------------------------------------------------------------
>
> Key: YARN-9928
> URL: https://issues.apache.org/jira/browse/YARN-9928
> Project: Hadoop YARN
> Issue Type: Bug
> Components: ATSv2
> Affects Versions: 3.1.0
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
>
> Encountered the below FATAL error in the NodeManager, which was under heavy
> load and was also resyncing with the RM at the same time. This caused the
> NM to go down.
> {code:java}
> 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216)
>     at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:748)
> {code}