[ https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe reassigned YARN-6403: -------------------------------- Assignee: Tao Yang Target Version/s: 2.8.1 Submitting patch so Jenkins can comment on it as well. bq. For the client-side change, IIUIC the generated protobuf code won't throws NPE for this case actually. The generated protobuf code won't throw NPE for _this_ particular case, but it does throw NPE for _other_ fields that you try to set directly to null. For example, if one tried to call setLocalResources(null) on the protobuf (not the PBImpl) then the generated protobuf code explicitly throws NPE. As such, I believe it's appropriate to throw NPE in our client check code as well rather than a generic RuntimeException. It's a minor point since the net effect will be similar for the client in either case. The {{localResources != null}} check in checkLocalResources is not necessary since the calling code explicitly checks for it already. The error message should be a bit more specific. It just logs the local resource as a string, but unfortunately that won't log the fact that the resource itself is null. We should change "Got invalid local resource" to something like "Null resource URL for local resource". Throwing NPE instead of RuntimeException would at least hint to the user that there's a problem with a null field here, but we should also be more explicit in the error message to help them along. This test code: {code} boolean throwsException = false; try { [....] containerLaunchContext.setLocalResources(localResources); } catch (Throwable e) { throwsException = true; Assert.assertTrue(e.getMessage().contains("Got invalid local resource")); } Assert.assertTrue(throwsException); {code} can be simplified to this: {code} try { [....] containerLaunchContext.setLocalResources(localResources); Assert.fail("setting an invalid local resource should be an error!"); } catch (RuntimeException e) { Assert.assertTrue(e.getMessage().contains("Got invalid local resource")); } {code} Note that we should be checking for the specific exception type we are throwing in the test rather than Throwable, since this is essentially part of the client API. testClientFailureWithInvalidResource does not belong in ContainerManagerImpl since it has nothing to do with ContainerManagerImpl. It's really a test for ContainerLaunchContextPBImpl and should be moved to an appropriate place in the yarn-common project. TestApplicationClientProtocolRecords looks like a decent place since it's already has another test for ContainerLaunchContextPBImpl there. The unit test method should be renamed to something more appropriate when moved there, like testCLCPBImplNullResource. > Invalid local resource request can raise NPE and make NM exit > ------------------------------------------------------------- > > Key: YARN-6403 > URL: https://issues.apache.org/jira/browse/YARN-6403 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Affects Versions: 2.8.0 > Reporter: Tao Yang > Assignee: Tao Yang > Attachments: YARN-6403.001.patch, YARN-6403.002.patch > > > Recently we found this problem on our testing environment. The app that > caused this problem added a invalid local resource request(have no location) > into ContainerLaunchContext like this: > {code} > localResources.put("test", LocalResource.newInstance(location, > LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100, > System.currentTimeMillis())); > ContainerLaunchContext amContainer = > ContainerLaunchContext.newInstance(localResources, environment, > vargsFinal, null, securityTokens, acls); > {code} > The actual value of location was null although app doesn't expect that. This > mistake cause several NMs exited with the NPE below and can't restart until > the nm recovery dirs were deleted. > {code} > FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110) > at java.lang.Thread.run(Thread.java:745) > {code} > NPE occured when created LocalResourceRequest instance for invalid resource > request. > {code} > public LocalResourceRequest(LocalResource resource) > throws URISyntaxException { > this(resource.getResource().toPath(), //NPE occurred here > resource.getTimestamp(), > resource.getType(), > resource.getVisibility(), > resource.getPattern()); > } > {code} > We can't guarantee the validity of local resource request now, but we could > avoid damaging the cluster. Perhaps we can verify the resource both in > ContainerLaunchContext and LocalResourceRequest? Please feel free to give > your suggestions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org