Tao Yang created YARN-6403:
------------------------------
Summary: Invalid local resource request can raise NPE and make NM exit
Key: YARN-6403
URL: https://issues.apache.org/jira/browse/YARN-6403
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.8.0
Reporter: Tao Yang
We recently hit this problem in our testing environment. The application that
triggered it added an invalid local resource request (one with no location) to
the ContainerLaunchContext, like this:
{code}
localResources.put("test", LocalResource.newInstance(location,
    LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
    System.currentTimeMillis()));
ContainerLaunchContext amContainer =
    ContainerLaunchContext.newInstance(localResources, environment,
        vargsFinal, null, securityTokens, acls);
{code}
The value of location was actually null, which the application did not expect.
This mistake caused several NMs to exit with the NPE below, and they could not
be restarted until the NM recovery directories were deleted.
{code}
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
    at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
    at java.lang.Thread.run(Thread.java:745)
{code}
The NPE occurred while creating the LocalResourceRequest instance for the
invalid resource request:
{code}
public LocalResourceRequest(LocalResource resource)
    throws URISyntaxException {
  this(resource.getResource().toPath(), // NPE occurred here
      resource.getTimestamp(),
      resource.getType(),
      resource.getVisibility(),
      resource.getPattern());
}
{code}
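As a rough illustration of a guard inside the constructor, here is a minimal
sketch. It uses stand-in types, not the real YARN classes: SketchUrl,
SketchLocalResource and SketchLocalResourceRequest are hypothetical names
invented for this example. It assumes that reporting the problem through the
URISyntaxException the constructor already declares is acceptable, since
callers must already handle that checked exception.

```java
import java.net.URISyntaxException;

// Hypothetical stand-in for the URL object returned by getResource().
final class SketchUrl {
  private final String path;
  SketchUrl(String path) { this.path = path; }
  String toPath() { return path; }
}

// Hypothetical stand-in for LocalResource; getResource() may return null.
final class SketchLocalResource {
  private final SketchUrl resource;
  SketchLocalResource(SketchUrl resource) { this.resource = resource; }
  SketchUrl getResource() { return resource; }
}

// Hypothetical stand-in for LocalResourceRequest with a null guard.
final class SketchLocalResourceRequest {
  private final String path;

  SketchLocalResourceRequest(SketchLocalResource resource)
      throws URISyntaxException {
    this(checkedPath(resource));
  }

  private SketchLocalResourceRequest(String path) {
    this.path = path;
  }

  // Fails with a descriptive checked exception instead of an NPE.
  private static String checkedPath(SketchLocalResource resource)
      throws URISyntaxException {
    if (resource == null || resource.getResource() == null) {
      throw new URISyntaxException("null", "local resource has no location");
    }
    return resource.getResource().toPath();
  }

  String getPath() { return path; }
}
```

With this shape, a resource whose location is null surfaces as a checked
exception at construction time rather than as an unchecked NPE.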
We cannot currently guarantee that a local resource request is valid, but we
can keep an invalid one from damaging the cluster. Perhaps we could validate
the resource in both ContainerLaunchContext and LocalResourceRequest?
Suggestions are welcome.
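One possible shape for the up-front check, as a minimal sketch and not the
actual fix: scan the local resources map when the launch context is received
and reject the whole request if any entry has no location. ResourceStub,
InvalidResourceException and LocalResourceValidator are hypothetical names
used only for illustration. Real code would inspect LocalResource and the
value returned by getResource().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for LocalResource; only the location matters here.
final class ResourceStub {
  private final String location; // null when the client made a mistake
  ResourceStub(String location) { this.location = location; }
  String getLocation() { return location; }
}

// Hypothetical checked exception used to reject a bad request.
class InvalidResourceException extends Exception {
  InvalidResourceException(String message) { super(message); }
}

final class LocalResourceValidator {
  private LocalResourceValidator() {}

  // Returns the keys of all entries whose resource or location is missing.
  static List<String> findInvalid(Map<String, ResourceStub> localResources) {
    List<String> invalid = new ArrayList<>();
    for (Map.Entry<String, ResourceStub> e : localResources.entrySet()) {
      ResourceStub r = e.getValue();
      if (r == null || r.getLocation() == null) {
        invalid.add(e.getKey());
      }
    }
    return invalid;
  }

  // Rejects the request up front instead of failing later with an NPE.
  static void validate(Map<String, ResourceStub> localResources)
      throws InvalidResourceException {
    List<String> invalid = findInvalid(localResources);
    if (!invalid.isEmpty()) {
      throw new InvalidResourceException(
          "Local resources without a location: " + invalid);
    }
  }
}
```

Calling validate on a map that contains the "test" entry from the report
(with a null location) would throw and name that key, so the client sees the
mistake before the request reaches the NM state machine.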