[ 
https://issues.apache.org/jira/browse/YARN-8649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lujie reassigned YARN-8649:
---------------------------

      Assignee: lujie
    Attachment: YARN-8649.patch

Hi [~jlowe], [~pradeepambati],[~$iddhe$h]

I have restudied the bug according to the logs.

*The root cause:*
 # When the NM shuts down, it sends KILL_CONTAINER to the container. The log 
shows this event:

{code:java}
2018-08-21 20:11:08,316 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl:
 Container container_1534853453424_0001_01_000001 transitioned from LOCALIZING 
to KILLING
{code}
This causes KillBeforeRunningTransition to execute.
 # KillBeforeRunningTransition calls "container.cleanup()", and the 
"cleanup" function sends a "ContainerLocalizationCleanupEvent".
 # ContainerLocalizationCleanupEvent causes 
ResourceLocalizationService.handleCleanupContainerResources to execute, and 
"handleCleanupContainerResources" sends a "ResourceReleaseEvent".
 # ResourceReleaseEvent causes LocalResourcesTrackerImpl.handle to execute, 
and handle (at line 199 in the source code) calls removeResource:

{code:java}
if (event.getType() == ResourceEventType.RELEASE) {
    if (rsrc.getState() == ResourceState.DOWNLOADING &&
        rsrc.getRefCount() <= 0 &&
        rsrc.getRequest().getVisibility() != LocalResourceVisibility.PUBLIC) {
        removeResource(req);
    }
}
{code}



 # removeResource then does:

{code:java}
LocalizedResource rsrc = localrsrc.remove(req);
{code}

 # When a localizer heartbeat comes in, 
LocalResourcesTrackerImpl.getPathForLocalization does:

{code:java}
Path localPath = new Path(rPath, req.getPath().getName());
LocalizedResource rsrc = localrsrc.get(req); // rsrc is null
rsrc.setLocalPath(localPath); // NPE
{code}
NPE happens!
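
The sequence above can be modelled in isolation. The following is only a minimal sketch, not Hadoop code: the classes Resource and Tracker, the method names, and the "job.jar" and "/tmp/filecache" values are hypothetical stand-ins for LocalizedResource, LocalResourcesTrackerImpl and a real request. It shows a release that removes the map entry, followed by a heartbeat lookup that dereferences the missing entry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LocalizerRaceSketch {

    /** Hypothetical stand-in for LocalizedResource; it only holds a local path. */
    static final class Resource {
        private String localPath;
        void setLocalPath(String path) { this.localPath = path; }
        String getLocalPath() { return localPath; }
    }

    /** Hypothetical stand-in for LocalResourcesTrackerImpl, keyed by a request name. */
    static final class Tracker {
        private final Map<String, Resource> localrsrc = new ConcurrentHashMap<>();

        // A container asks for the resource; it is tracked while downloading.
        void request(String req) { localrsrc.putIfAbsent(req, new Resource()); }

        // Models the RELEASE branch: the entry of a still-downloading resource is dropped.
        void release(String req) { localrsrc.remove(req); }

        // Models getPathForLocalization without a null check.
        String getPathForLocalization(String req, String dir) {
            String localPath = dir + "/" + req;
            Resource rsrc = localrsrc.get(req); // null once release() has run
            rsrc.setLocalPath(localPath);       // throws NullPointerException
            return localPath;
        }
    }

    /** Replays the event order from the log and reports whether the NPE occurred. */
    static boolean heartbeatAfterReleaseThrows() {
        Tracker tracker = new Tracker();
        tracker.request("job.jar");   // container is LOCALIZING
        tracker.release("job.jar");   // KILL_CONTAINER leads to cleanup and release
        try {
            tracker.getPathForLocalization("job.jar", "/tmp/filecache"); // heartbeat
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE after release: " + heartbeatAfterReleaseThrows());
    }
}
```

In this simplified model the failure depends only on the order of the two calls, which is why a heartbeat arriving during NM shutdown is enough to hit it.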

*Unit test:*


The patch for YARN-4355 added the test 
"testLocalizerHeartbeatWhenAppCleaningUp" in class 
"TestResourceLocalizationService".

That test also sends the "ContainerLocalizationCleanupEvent", but it 
does not cover a heartbeat arriving at this moment.

In this patch, we change "testLocalizerHeartbeatWhenAppCleaningUp" to cover 
this situation. The changed test triggers the bug.

 

*Fixing:*

To fix the NPE, we only add a null check; I think that is suitable here.
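
For illustration only, a null-check guard of this kind might look like the sketch below. It is not the content of YARN-8649.patch: the classes Resource and GuardedTracker are hypothetical stand-ins for LocalizedResource and LocalResourcesTrackerImpl, and returning null for an already released resource is an assumption about how the caller would be told to skip it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class NullCheckSketch {

    /** Hypothetical stand-in for LocalizedResource. */
    static final class Resource {
        private String localPath;
        void setLocalPath(String path) { this.localPath = path; }
        String getLocalPath() { return localPath; }
    }

    /** Hypothetical tracker whose lookup tolerates a concurrently released resource. */
    static final class GuardedTracker {
        private final Map<String, Resource> localrsrc = new ConcurrentHashMap<>();

        void request(String req) { localrsrc.putIfAbsent(req, new Resource()); }

        void release(String req) { localrsrc.remove(req); }

        // Returns the local path, or null if the resource is no longer tracked.
        // Assumption: the caller treats null as "nothing to localize".
        String getPathForLocalization(String req, String dir) {
            Resource rsrc = localrsrc.get(req);
            if (rsrc == null) {
                // The resource was released before the heartbeat arrived.
                return null;
            }
            String localPath = dir + "/" + req;
            rsrc.setLocalPath(localPath);
            return localPath;
        }
    }

    /** Replays release followed by heartbeat; no exception is expected. */
    static String heartbeatAfterRelease() {
        GuardedTracker tracker = new GuardedTracker();
        tracker.request("job.jar");
        tracker.release("job.jar");
        return tracker.getPathForLocalization("job.jar", "/tmp/filecache");
    }

    public static void main(String[] args) {
        System.out.println("path after release: " + heartbeatAfterRelease());
    }
}
```

The guard keeps the normal path unchanged and only short-circuits when the entry is already gone, which matches the intent of adding nothing but a null check.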

> Similar as YARN-4355:NPE while processing localizer heartbeat
> -------------------------------------------------------------
>
>                 Key: YARN-8649
>                 URL: https://issues.apache.org/jira/browse/YARN-8649
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.1
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Major
>         Attachments: YARN-8649.patch, hadoop-hires-nodemanager-hadoop11.log
>
>
> I have noticed that a nodemanager was getting NPEs while tearing down. The 
> reason may be similar to YARN-4355, which was reported by Jason Lowe. 



