[
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950419#comment-15950419
]
Tao Yang commented on YARN-6403:
--------------------------------
[~jlowe] Thanks for your time!
{quote}
I believe it's appropriate to throw NPE in our client check code as well rather
than a generic RuntimeException. It's a minor point since the net effect will
be similar for the client in either case.
{quote}
Makes sense, sorry for missing that point earlier.
{quote}
TestApplicationClientProtocolRecords looks like a decent place since it
already has another test for ContainerLaunchContextPBImpl there.
{quote}
TestApplicationClientProtocolRecords does not exist in branch-2.8, so is it OK
to place the client-side unit test in
TestPBImplRecords#testContainerLaunchContextPBImpl?
In addition, I will improve the error message and the unit test code in the
next patch.
A single patch won't apply to all branches, so should I submit separate patches
for 2.9 (branch-2) and 3.0.0-alpha3 (trunk)?
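
For reference, a minimal sketch of the kind of client-side check discussed
above, throwing an NPE rather than a generic RuntimeException. It is not the
patch itself: the class LocalResourceCheckSketch, the stand-in Resource type,
and the method name checkLocalResources are hypothetical and use no Hadoop
classes, so the snippet runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalResourceCheckSketch {

    /** Hypothetical stand-in for a local resource; only the location matters here. */
    static final class Resource {
        final String location; // stands in for the resource URL; may be null

        Resource(String location) {
            this.location = location;
        }
    }

    /**
     * Rejects any entry whose resource or location is null.
     * Throws NullPointerException, naming the offending key in the message.
     */
    static void checkLocalResources(Map<String, Resource> localResources) {
        for (Map.Entry<String, Resource> entry : localResources.entrySet()) {
            Resource resource = entry.getValue();
            if (resource == null || resource.location == null) {
                throw new NullPointerException(
                    "Null resource location for local resource " + entry.getKey());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Resource> localResources = new HashMap<>();
        localResources.put("test", new Resource(null)); // the invalid request
        try {
            checkLocalResources(localResources);
            System.out.println("accepted");
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A unit test for such a check would build a map containing one entry with a
null location and assert that the NPE is thrown when the map is validated.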
> Invalid local resource request can raise NPE and make NM exit
> -------------------------------------------------------------
>
> Key: YARN-6403
> URL: https://issues.apache.org/jira/browse/YARN-6403
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.8.0
> Reporter: Tao Yang
> Assignee: Tao Yang
> Attachments: YARN-6403.001.patch, YARN-6403.002.patch
>
>
> We recently hit this problem in our testing environment. The app that
> caused it added an invalid local resource request (one with no location)
> to the ContainerLaunchContext, like this:
> {code}
> localResources.put("test", LocalResource.newInstance(location,
> LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
> System.currentTimeMillis()));
> ContainerLaunchContext amContainer =
> ContainerLaunchContext.newInstance(localResources, environment,
> vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, which the app did not expect. This
> mistake caused several NMs to exit with the NPE below, and they could not
> restart until the NM recovery dirs were deleted.
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
> at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The NPE occurred while creating the LocalResourceRequest instance for the
> invalid resource request.
> {code}
> public LocalResourceRequest(LocalResource resource)
> throws URISyntaxException {
> this(resource.getResource().toPath(), //NPE occurred here
> resource.getTimestamp(),
> resource.getType(),
> resource.getVisibility(),
> resource.getPattern());
> }
> {code}
> We can't currently guarantee that a local resource request is valid, but we
> can keep an invalid one from damaging the cluster. Perhaps we could verify
> the resource in both ContainerLaunchContext and LocalResourceRequest? Please
> feel free to give your suggestions.
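
One way the NodeManager-side half of the verification suggested in the
description could look is sketched below. It is only an illustration under
assumptions: ResourceRequestSketch, Request, and tryRequest are hypothetical
names, and the stand-in constructor takes a plain String instead of a
LocalResource. The idea is to turn a missing location into the checked
URISyntaxException that the constructor already declares, so the caller can
fail a single request instead of letting an NPE escape on the dispatcher
thread.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ResourceRequestSketch {

    /** Hypothetical stand-in for the request object built on the NodeManager side. */
    static final class Request {
        final URI path;

        Request(String location) throws URISyntaxException {
            if (location == null) {
                // Report the problem as a checked exception the caller must
                // handle, instead of dereferencing null.
                throw new URISyntaxException("null", "Local resource has no location");
            }
            this.path = new URI(location);
        }
    }

    /** Returns true if the request could be built, false if it was rejected. */
    static boolean tryRequest(String location) {
        try {
            new Request(location);
            return true;
        } catch (URISyntaxException e) {
            // A real caller would fail only the affected container here.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRequest("hdfs://nn/app/job.jar"));
        System.out.println(tryRequest(null));
    }
}
```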