[ 
https://issues.apache.org/jira/browse/YARN-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-6837:
--------------------------------

    Assignee: Jinjiang Ling

Thanks for the report and the patch!  Looking at the patch, I'm not a fan of 
letting an NPE occur, then catching it and assuming we know where the NPE came 
from.  It's error-prone for maintenance, since someone could accidentally 
introduce another NPE and we would then be catching and suppressing it for the 
wrong reason, making things harder to debug.

Speaking of suppressing exceptions, the patch simply logs a warning when the 
resource has no visibility, but then it just continues.  What will happen to 
the resource after that?  It doesn't look like we add it to any localizer list, 
and therefore I think the container will just hang waiting for a resource that 
will never be localized.

A better way to handle this is to sanity-check the container launch request in 
ContainerManagerImpl#startContainerInternal and throw an exception if the 
request is malformed.  This has the benefit of propagating the error back to 
the client making the bad request, so they know both that the request was bad 
and that the corresponding container will not be launched.  This looks similar 
to YARN-6403, and the resource visibility check was missed in that change.
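
To illustrate the kind of up-front check I mean, here is a minimal, 
self-contained sketch.  It deliberately does not use the real YARN classes: 
Visibility, Resource, InvalidRequestException and validateLocalResources are 
all hypothetical stand-ins, so treat it as an outline of the approach rather 
than the actual change to ContainerManagerImpl.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LaunchRequestValidation {
    /** Hypothetical stand-in for the visibility enum of a local resource. */
    enum Visibility { PUBLIC, PRIVATE, APPLICATION }

    /** Hypothetical stand-in for a local resource; only the fields needed here. */
    static final class Resource {
        final String url;
        final Visibility visibility; // null when the client forgot to set it

        Resource(String url, Visibility visibility) {
            this.url = url;
            this.visibility = visibility;
        }
    }

    /** Hypothetical stand-in for the exception propagated back to the client. */
    static final class InvalidRequestException extends Exception {
        InvalidRequestException(String message) {
            super(message);
        }
    }

    /**
     * Rejects a launch request up front if any resource is missing or has no
     * visibility, instead of letting a later stage hit an NPE.
     */
    static void validateLocalResources(Map<String, Resource> resources)
            throws InvalidRequestException {
        for (Map.Entry<String, Resource> entry : resources.entrySet()) {
            Resource resource = entry.getValue();
            if (resource == null) {
                throw new InvalidRequestException(
                    "Null resource for local resource " + entry.getKey());
            }
            if (resource.visibility == null) {
                throw new InvalidRequestException(
                    "Null resource visibility for local resource " + entry.getKey());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Resource> resources = new LinkedHashMap<>();
        resources.put("job.jar", new Resource("hdfs:///tmp/job.jar", null));
        try {
            validateLocalResources(resources);
            System.out.println("request accepted");
        } catch (InvalidRequestException e) {
            // The client sees the reason; nothing is handed to the dispatcher.
            System.out.println("request rejected: " + e.getMessage());
        }
    }
}
```

Because the check runs before any event is dispatched, a malformed request 
fails on the caller's side and never reaches the state machine where the NPE 
was thrown.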

> When the LocalResource's visibility is null, the NodeManager will shutdown
> --------------------------------------------------------------------------
>
>                 Key: YARN-6837
>                 URL: https://issues.apache.org/jira/browse/YARN-6837
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Jinjiang Ling
>            Assignee: Jinjiang Ling
>         Attachments: YARN-6837.patch
>
>
> When I was writing a YARN application, I created a LocalResource like this:
> {quote}
> LocalResource resource = Records.newRecord(LocalResource.class);
> {quote}
> Because I forgot to set its visibility, the job failed when I 
> submitted it.
> But the NodeManagers also shut down one by one at the same time, and there 
> was a NullPointerException in the NodeManager's log:
> {quote}
> 2017-07-18 17:54:09,289 INFO 
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop       
> IP=10.43.156.177        OPERATION=Start Container Request       
> TARGET=ContainerManageImpl      RESULT=SUCCESS  
> APPID=application_1499221670783_0067    
> CONTAINERID=container_1499221670783_0067_02_000003
> 2017-07-18 17:54:09,292 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.addResources(ResourceSet.java:84)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:868)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:819)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1684)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:96)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1418)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1411)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-07-18 17:54:09,292 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
>  Start request for container_1499221670783_0067_02_000002 by user hadoop
> {quote}
> Then I changed my code but still set the visibility to null:
> {quote}
> LocalResource resource = LocalResource.newInstance(
>         URL.fromURI(dst.toUri()),
>         LocalResourceType.FILE,
>         {color:red}null{color},
>         fileStatus.getLen(),
>         fileStatus.getModificationTime());
> {quote}
> The error still happened.
> Finally, I set the visibility to a valid value, and the error did not happen 
> again.
> So I think a null LocalResource visibility causes the NodeManager to shut 
> down.


