[ 
https://issues.apache.org/jira/browse/YARN-4354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15004462#comment-15004462
 ] 

Jason Lowe commented on YARN-4354:
----------------------------------

Looks like this can cause nodemanagers to crash as well:
{noformat}
2015-11-13 17:22:51,063 [AsyncDispatcher event handler] FATAL event.AsyncDispatcher: Error in dispatcher thread
java.lang.NullPointerException
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourcesTrackerImpl.getPathForLocalization(LocalResourcesTrackerImpl.java:448)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:802)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:704)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.handle(ResourceLocalizationService.java:646)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

I think it was trying to look up a resource that it assumed was still there but 
that had already been removed.

bq. I think a check for resource visibility should suffice. What do you think ?

What worries me about that approach is the late-event case. If we somehow allow 
a heartbeat from a localizer to come in just after we cleaned up a resource 
because a container happened to be released, and the localization completes 
just after we removed it, then we get the same kind of badness. We may still 
want a null check just in case we get a late event for a removed resource.
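
As a rough illustration of that null check, here is a minimal sketch. It is not 
the actual LocalResourcesTrackerImpl code: the class TrackerSketch, its methods, 
and the dropped-event counter are all hypothetical names made up for this 
example. The point is only that the handler tolerates an event for a resource 
that has already been removed instead of dereferencing null.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a resource tracker; NOT the real YARN class.
class TrackerSketch {
    // Hypothetical per-resource state kept by the tracker.
    static final class TrackedResource {
        volatile boolean localized = false;
    }

    private final Map<String, TrackedResource> resources = new ConcurrentHashMap<>();
    private int droppedEvents = 0;

    // Start tracking a resource, e.g. when a container requests it.
    void track(String key) {
        resources.putIfAbsent(key, new TrackedResource());
    }

    // Stop tracking a resource, e.g. when the last container using it is released.
    void release(String key) {
        resources.remove(key);
    }

    // Handle a "localization finished" event that may arrive late.
    // Returns true if the event was applied, false if it was ignored.
    synchronized boolean onLocalized(String key) {
        TrackedResource rsrc = resources.get(key);
        if (rsrc == null) {
            // Late event for a resource that was already removed:
            // count it and ignore it rather than throwing an NPE.
            droppedEvents++;
            return false;
        }
        rsrc.localized = true;
        return true;
    }

    synchronized int droppedEvents() {
        return droppedEvents;
    }
}
```

With a guard like this, a late event is dropped quietly and the dispatcher 
thread keeps running. Whether dropping is enough, or whether the downloaded 
files also need cleanup, is a separate question this sketch does not address.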

> Public resource localization fails with NPE
> -------------------------------------------
>
>                 Key: YARN-4354
>                 URL: https://issues.apache.org/jira/browse/YARN-4354
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.2
>            Reporter: Jason Lowe
>            Priority: Blocker
>         Attachments: YARN-4354-unittest.patch
>
>
> I saw public localization on nodemanagers get stuck because it was constantly 
> rejecting requests to the thread pool executor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
