[
https://issues.apache.org/jira/browse/YARN-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe reassigned YARN-6837:
--------------------------------
Assignee: Jinjiang Ling
Thanks for the report and the patch! Looking at the patch, I'm not a fan of
letting an NPE occur, then catching it and assuming we know where it came
from. That is error-prone to maintain: someone could accidentally introduce
another NPE in the same code path, and we would then catch and suppress it for
the wrong reason, making things harder to debug.
Speaking of suppressing exceptions, the patch simply logs a warning when the
visibility is null and then continues. What will happen to the resource after
that? It doesn't look like we add it to any localizer list, so I think the
container will just hang waiting for a resource that will never be
localized.
A better way to handle this is to sanity-check the container launch request in
ContainerManagerImpl#startContainerInternal and throw an exception if the
request is malformed. This has the benefit of propagating the error back to
the client who is making the bad request so they know both that the request was
bad and the corresponding container will not be launched. This looks similar
to YARN-6403, and the resource visibility was missed in that change.
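A minimal, stand-alone sketch of the kind of up-front check described above.
It does not use the real YARN classes: Resource, Visibility,
InvalidLaunchRequestException and validate() are hypothetical stand-ins for
illustration, and the actual fix in ContainerManagerImpl may look quite
different (for example, it would likely throw a checked YarnException).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Stand-alone sketch of validating a launch request before it is dispatched.
 * All nested types are simplified stand-ins, NOT the real YARN records.
 */
final class LaunchRequestValidationSketch {

    /** Stand-in for YARN's LocalResourceVisibility. */
    enum Visibility { PUBLIC, PRIVATE, APPLICATION }

    /** Stand-in for a LocalResource; only the fields checked here. */
    static final class Resource {
        final String url;
        final Visibility visibility;

        Resource(String url, Visibility visibility) {
            this.url = url;
            this.visibility = visibility;
        }
    }

    /** Hypothetical exception type, unchecked only to keep the sketch short. */
    static final class InvalidLaunchRequestException extends RuntimeException {
        InvalidLaunchRequestException(String message) {
            super(message);
        }
    }

    /**
     * Rejects the whole request if any local resource is null or has a null
     * visibility, so the error reaches the caller instead of surfacing later
     * as an NPE on an event-dispatcher thread.
     */
    static void validate(Map<String, Resource> localResources) {
        for (Map.Entry<String, Resource> entry : localResources.entrySet()) {
            Resource resource = entry.getValue();
            if (resource == null) {
                throw new InvalidLaunchRequestException(
                    "Null resource for local resource '" + entry.getKey() + "'");
            }
            if (resource.visibility == null) {
                throw new InvalidLaunchRequestException(
                    "Null visibility for local resource '" + entry.getKey() + "'");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Resource> request = new LinkedHashMap<>();
        request.put("app.jar",
            new Resource("hdfs:///apps/app.jar", Visibility.APPLICATION));
        // Mirrors the reported mistake: visibility was never set.
        request.put("data.txt", new Resource("hdfs:///apps/data.txt", null));
        try {
            validate(request);
            System.out.println("request accepted");
        } catch (InvalidLaunchRequestException e) {
            System.out.println("request rejected: " + e.getMessage());
        }
    }
}
```

The point of failing in validate() rather than downstream is that the client
gets an explicit rejection naming the bad resource, and nothing malformed is
ever handed to the state machine.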
> When the LocalResource's visibility is null, the NodeManager will shutdown
> --------------------------------------------------------------------------
>
> Key: YARN-6837
> URL: https://issues.apache.org/jira/browse/YARN-6837
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.0.0-alpha4
> Reporter: Jinjiang Ling
> Assignee: Jinjiang Ling
> Attachments: YARN-6837.patch
>
>
> When I wrote a YARN application, I created a LocalResource like this:
> {quote}
> LocalResource resource = Records.newRecord(LocalResource.class);
> {quote}
> Because I forgot to set its visibility, the job failed when I
> submitted it.
> But the NodeManagers also shut down one by one at the same time, and there is
> a NullPointerException in the NodeManager's log:
> {quote}
> 2017-07-18 17:54:09,289 INFO
> org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=hadoop
> IP=10.43.156.177 OPERATION=Start Container Request
> TARGET=ContainerManageImpl RESULT=SUCCESS
> APPID=application_1499221670783_0067
> CONTAINERID=container_1499221670783_0067_02_000003
> 2017-07-18 17:54:09,292 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher:
> Error in dispatcher thread
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceSet.addResources(ResourceSet.java:84)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:868)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:819)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1684)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:96)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1418)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1411)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> 2017-07-18 17:54:09,292 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
> Start request for container_1499221670783_0067_02_000002 by user hadoop
> {quote}
> Then I changed my code, still setting the visibility to null:
> {quote}
> LocalResource resource = LocalResource.newInstance(
>     URL.fromURI(dst.toUri()),
>     LocalResourceType.FILE,
>     null, // visibility (highlighted in red in the original report)
>     fileStatus.getLen(),
>     fileStatus.getModificationTime());
> {quote}
> The error still happens.
> Finally I set the visibility to a correct value, and the error did not
> happen again.
> So I think a null LocalResource visibility causes the NodeManager to shut
> down.