[ https://issues.apache.org/jira/browse/YARN-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe reassigned YARN-6837:
--------------------------------

    Assignee: Jinjiang Ling

Thanks for the report and the patch!

Looking at the patch, I'm not a fan of letting an NPE occur, then catching it and assuming we know where the NPE came from. It's error-prone for maintenance, since someone could accidentally introduce another NPE problem and then we would be catching and suppressing for the wrong reason, making things harder to debug.

Speaking of suppressing exceptions, this simply logs a warning when we have no visibility, but then it just continues. What will happen to the resource after that? It doesn't look like we add it to any localizer list, and therefore I think the container will just hang waiting for a resource to localize that never will.

A better way to handle this is to sanity-check the container launch request in ContainerManagerImpl#startContainerInternal and throw an exception if the request is malformed. This has the benefit of propagating the error back to the client who is making the bad request, so they know both that the request was bad and that the corresponding container will not be launched. This looks similar to YARN-6403, and the resource visibility was missed in that change.

> When the LocalResource's visibility is null, the NodeManager will shutdown
> --------------------------------------------------------------------------
>
>                 Key: YARN-6837
>                 URL: https://issues.apache.org/jira/browse/YARN-6837
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Jinjiang Ling
>            Assignee: Jinjiang Ling
>         Attachments: YARN-6837.patch
>
>
> When I wrote a YARN application, I created a LocalResource like this:
> {quote}
> LocalResource resource = Records.newRecord(LocalResource.class);
> {quote}
> Because I forgot to set its visibility, the job failed when I submitted it.
> But the NodeManagers also shut down one by one at the same time, and there is a NullPointerException in the NodeManager's log:
> {quote}
> 2017-07-18 17:54:09,289 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop IP=10.43.156.177 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1499221670783_0067 CONTAINERID=container_1499221670783_0067_02_000003
> 2017-07-18 17:54:09,292 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.addResources(ResourceSet.java:84)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:868)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:819)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1684)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:96)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1418)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1411)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-07-18 17:54:09,292 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1499221670783_0067_02_000002 by user hadoop
> {quote}
> Then I changed my code and still set the visibility to null:
> {quote}
> LocalResource resource = LocalResource.newInstance(
>     URL.fromURI(dst.toUri()),
>     LocalResourceType.FILE,
>     {color:red}null{color},
>     fileStatus.getLen(),
>     fileStatus.getModificationTime());
> {quote}
> The error still happened.
> Finally I set the visibility to a correct value, and the error did not happen again.
> So I think a null LocalResource visibility causes the NodeManager to shut down.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
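
For illustration only, the up-front sanity check suggested in the comment could look roughly like the sketch below. It is not the actual patch. LaunchRequestValidator, Resource and Visibility are hypothetical stand-ins for the YARN types (LocalResource, LocalResourceVisibility) so that the example compiles on its own, and IllegalArgumentException stands in for whatever exception the NodeManager would really propagate back to the client.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, self-contained sketch; none of these types are real YARN classes.
final class LaunchRequestValidator {

  // Stand-in for LocalResourceVisibility.
  enum Visibility { PUBLIC, PRIVATE, APPLICATION }

  // Stand-in for LocalResource, reduced to the two fields checked here.
  static final class Resource {
    final String url;            // stand-in for LocalResource#getResource()
    final Visibility visibility; // stand-in for LocalResource#getVisibility()

    Resource(String url, Visibility visibility) {
      this.url = url;
      this.visibility = visibility;
    }
  }

  // Rejects a malformed request before any event is dispatched, so the
  // error goes back to the caller instead of surfacing later as an NPE.
  static void validate(Map<String, Resource> localResources) {
    for (Map.Entry<String, Resource> entry : localResources.entrySet()) {
      String key = entry.getKey();
      Resource rsrc = entry.getValue();
      if (rsrc == null) {
        throw new IllegalArgumentException("Null resource for key " + key);
      }
      if (rsrc.url == null) {
        throw new IllegalArgumentException("Null resource URL for key " + key);
      }
      if (rsrc.visibility == null) {
        throw new IllegalArgumentException("Null resource visibility for key " + key);
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Resource> request = new LinkedHashMap<>();
    // Mirrors the report: a resource whose visibility was never set.
    request.put("app.jar", new Resource("hdfs:///apps/app.jar", null));
    try {
      validate(request);
      System.out.println("Request accepted");
    } catch (IllegalArgumentException ex) {
      System.out.println("Request rejected: " + ex.getMessage());
    }
  }
}
```

Failing the whole request at submission time, rather than catching the NPE later in the dispatcher, keeps the dispatcher thread alive and tells the client exactly which resource was malformed.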