[
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe reassigned YARN-6403:
--------------------------------
Assignee: Tao Yang
Target Version/s: 2.8.1
Submitting patch so Jenkins can comment on it as well.
bq. For the client-side change, IIUIC the generated protobuf code won't throws
NPE for this case actually.
The generated protobuf code won't throw NPE for _this_ particular case, but it
does throw NPE for _other_ fields that you try to set directly to null. For
example, if one tried to call setLocalResources(null) on the protobuf (not the
PBImpl) then the generated protobuf code explicitly throws NPE. As such, I
believe it's appropriate to throw NPE in our client check code as well rather
than a generic RuntimeException. It's a minor point since the net effect will
be similar for the client in either case.
The {{localResources != null}} check in checkLocalResources is not necessary
since the calling code explicitly checks for it already.
The error message should be a bit more specific. It just logs the local
resource as a string, but unfortunately that won't log the fact that the
resource itself is null. We should change "Got invalid local resource" to
something like "Null resource URL for local resource". Throwing NPE instead of
RuntimeException would at least hint to the user that there's a problem with a
null field here, but we should also be more explicit in the error message to
help them along.
This test code:
{code}
boolean throwsException = false;
try {
[....]
containerLaunchContext.setLocalResources(localResources);
} catch (Throwable e) {
throwsException = true;
Assert.assertTrue(e.getMessage().contains("Got invalid local resource"));
}
Assert.assertTrue(throwsException);
{code}
can be simplified to this:
{code}
try {
[....]
containerLaunchContext.setLocalResources(localResources);
Assert.fail("setting an invalid local resource should be an error!");
} catch (RuntimeException e) {
Assert.assertTrue(e.getMessage().contains("Got invalid local resource"));
}
{code}
Note that we should be checking for the specific exception type we are throwing
in the test rather than Throwable, since this is essentially part of the client
API.
testClientFailureWithInvalidResource does not belong in ContainerManagerImpl
since it has nothing to do with ContainerManagerImpl. It's really a test for
ContainerLaunchContextPBImpl and should be moved to an appropriate place in the
yarn-common project. TestApplicationClientProtocolRecords looks like a decent
place since it's already has another test for ContainerLaunchContextPBImpl
there. The unit test method should be renamed to something more appropriate
when moved there, like testCLCPBImplNullResource.
> Invalid local resource request can raise NPE and make NM exit
> -------------------------------------------------------------
>
> Key: YARN-6403
> URL: https://issues.apache.org/jira/browse/YARN-6403
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.8.0
> Reporter: Tao Yang
> Assignee: Tao Yang
> Attachments: YARN-6403.001.patch, YARN-6403.002.patch
>
>
> Recently we found this problem on our testing environment. The app that
> caused this problem added a invalid local resource request(have no location)
> into ContainerLaunchContext like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
> LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
> System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
> ContainerLaunchContext.newInstance(localResources, environment,
> vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null although app doesn't expect that. This
> mistake cause several NMs exited with the NPE below and can't restart until
> the nm recovery dirs were deleted.
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> NPE occured when created LocalResourceRequest instance for invalid resource
> request.
> {code}
> public LocalResourceRequest(LocalResource resource)
> throws URISyntaxException {
> this(resource.getResource().toPath(), //NPE occurred here
> resource.getTimestamp(),
> resource.getType(),
> resource.getVisibility(),
> resource.getPattern());
> }
> {code}
> We can't guarantee the validity of local resource request now, but we could
> avoid damaging the cluster. Perhaps we can verify the resource both in
> ContainerLaunchContext and LocalResourceRequest? Please feel free to give
> your suggestions.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]