[jira] [Commented] (YARN-9928) ATSv2 can make NM go down with a FATAL error while it is resyncing with RM

Tarun Parimi (Jira) Tue, 22 Oct 2019 07:07:48 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957101#comment-16957101
 ]


Tarun Parimi commented on YARN-9928:
------------------------------------

The issue occurs because the container returned in the code snippet below is 
null.

{code:java}
  private void publishContainerCreatedEvent(ContainerEvent event) {
    if (publishNMContainerEvents) {
      ContainerId containerId = event.getContainerID();
      ContainerEntity entity = createContainerEntity(containerId);
      // May return null if the container was already removed from the context.
      Container container = context.getContainers().get(containerId);
      // Throws the NullPointerException when container is null.
      Resource resource = container.getResource();
{code}

This does not usually happen because ContainerManagerImpl already performs a 
null check on the same container before calling the publisher.

{code:java}
      Map<ContainerId,Container> containers =
        ContainerManagerImpl.this.context.getContainers();
      Container c = containers.get(event.getContainerID());
      if (c != null) {
        c.handle(event);
        if (nmMetricsPublisher != null) {
          nmMetricsPublisher.publishContainerEvent(event);
        }
{code}

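But in a heavily loaded production cluster, the ContainerManager dispatcher 
can have many events queued. If the NM is resyncing with the RM at the same 
time, the resync runs in a separate NM dispatcher thread and can remove all 
the completed containers from the context after the null check above has 
passed but before NMTimelinePublisher looks the container up again.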

So an additional null check is needed for the container in these scenarios.
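
A minimal, self-contained sketch of the kind of null check proposed above. It 
is not the actual patch: the class NullSafePublishSketch, its simplified 
Container type, and the String container ids are hypothetical stand-ins for 
the real NodeManager classes, and the resync is simulated by clearing the map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullSafePublishSketch {

  /** Hypothetical stand-in for the NM Container; carries only a resource. */
  public static final class Container {
    private final String resource;

    public Container(String resource) {
      this.resource = resource;
    }

    public String getResource() {
      return resource;
    }
  }

  // Stand-in for context.getContainers(): a map shared by several threads.
  public static final Map<String, Container> containers =
      new ConcurrentHashMap<>();

  /**
   * Looks the container up once and bails out if it is gone.
   * Returns true if the event was published, false if it was skipped.
   */
  public static boolean publishContainerCreatedEvent(String containerId) {
    Container container = containers.get(containerId);
    if (container == null) {
      // The container was removed (for example by a resync) after the
      // caller's own null check, so skip publishing instead of throwing.
      return false;
    }
    String resource = container.getResource();
    System.out.println("Publishing created event for " + containerId
        + " with resource " + resource);
    return true;
  }

  public static void main(String[] args) {
    containers.put("container_1", new Container("memory=1024"));
    // Container is present: the event is published.
    System.out.println(publishContainerCreatedEvent("container_1"));
    // Simulate the resync removing completed containers.
    containers.clear();
    // Container is gone: the event is skipped without an exception.
    System.out.println(publishContainerCreatedEvent("container_1"));
  }
}
```

The lookup result is held in a local variable and checked once, so a removal 
that happens after the check cannot affect the rest of the method.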




> ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
> --------------------------------------------------------------------------
>
>                 Key: YARN-9928
>                 URL: https://issues.apache.org/jira/browse/YARN-9928
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: ATSv2
>    Affects Versions: 3.1.0
>            Reporter: Tarun Parimi
>            Assignee: Tarun Parimi
>            Priority: Major
>
> Encountered the FATAL error below in a NodeManager that was under heavy 
> load and was also resyncing with the RM at the same time. This caused the 
> NM to go down. 
> {code:java}
> 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
>     at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216)
>     at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>     at java.lang.Thread.run(Thread.java:748)
> {code}


