[ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950419#comment-15950419
 ] 

Tao Yang commented on YARN-6403:
--------------------------------

[~jlowe] Thanks for your time! 
{quote}
I believe it's appropriate to throw NPE in our client check code as well rather 
than a generic RuntimeException. It's a minor point since the net effect will 
be similar for the client in either case.
{quote}
Makes sense; sorry for missing that point earlier.
{quote}
TestApplicationClientProtocolRecords looks like a decent place since it 
already has another test for ContainerLaunchContextPBImpl there.
{quote}
TestApplicationClientProtocolRecords does not exist in branch-2.8, so is it OK to 
place the client-side UT in 
TestPBImplRecords#testContainerLaunchContextPBImpl?
In addition, I will improve the error message and the unit test code in the next 
patch.
One patch can't apply to all branches, so perhaps it's necessary to submit separate 
patches for 2.9 (branch-2) and 3.0.0-alpha3 (trunk)?
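
For illustration only, here is a minimal sketch of the kind of client-side check discussed above: reject a local resource that has no location by throwing an NPE with a descriptive message, instead of letting it reach the NM. SketchResource and LocalResourceChecks are hypothetical stand-ins, not the actual YARN classes or the code in the attached patches.

```java
import java.util.Map;

// Hypothetical stand-in for a YARN LocalResource; only the field relevant here.
final class SketchResource {
    private final String location; // stands in for the resource URL; may be null

    SketchResource(String location) {
        this.location = location;
    }

    String getLocation() {
        return location;
    }
}

// Hypothetical helper; the real check would live in the client-side record code.
final class LocalResourceChecks {
    private LocalResourceChecks() {
    }

    // Throws NullPointerException naming the first entry whose resource or location is null.
    static void checkLocalResources(Map<String, SketchResource> localResources) {
        if (localResources == null) {
            return; // nothing to validate
        }
        for (Map.Entry<String, SketchResource> entry : localResources.entrySet()) {
            SketchResource resource = entry.getValue();
            if (resource == null || resource.getLocation() == null) {
                throw new NullPointerException(
                    "Null resource location for local resource '" + entry.getKey() + "'");
            }
        }
    }
}
```

With a check like this, the submitting client would see the failure immediately, with the offending resource name in the message.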

> Invalid local resource request can raise NPE and make NM exit
> -------------------------------------------------------------
>
>                 Key: YARN-6403
>                 URL: https://issues.apache.org/jira/browse/YARN-6403
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>         Attachments: YARN-6403.001.patch, YARN-6403.002.patch
>
>
> Recently we found this problem in our testing environment. The app that 
> caused it added an invalid local resource request (one with no location) 
> to the ContainerLaunchContext like this:
> {code}
>     localResources.put("test", LocalResource.newInstance(location,
>         LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
>         System.currentTimeMillis()));
>     ContainerLaunchContext amContainer =
>         ContainerLaunchContext.newInstance(localResources, environment,
>           vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app did not expect that. 
> This mistake caused several NMs to exit with the NPE below, and they could not 
> restart until the NM recovery dirs were deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The NPE occurred while creating a LocalResourceRequest instance for the 
> invalid resource request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>       throws URISyntaxException {
>     this(resource.getResource().toPath(),  //NPE occurred here
>         resource.getTimestamp(),
>         resource.getType(),
>         resource.getVisibility(),
>         resource.getPattern());
>   }
> {code}
> We can't currently guarantee the validity of a local resource request, but we 
> can at least keep an invalid one from damaging the cluster. Perhaps we can 
> verify the resource both in ContainerLaunchContext and LocalResourceRequest? 
> Please feel free to give your suggestions.
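
As a rough sketch of the server-side half of that suggestion, the code below validates each resource before building a request and collects diagnostics, so that a caller could fail only the offending container instead of letting an exception escape the dispatcher thread. ResourceRequestSketch and both of its methods are hypothetical; this is one possible shape under those assumptions, not the NM's actual code or the fix in the attached patches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the NM-side handling of local resource requests.
final class ResourceRequestSketch {
    private ResourceRequestSketch() {
    }

    // Stands in for the LocalResourceRequest constructor: rejects a missing
    // location with a descriptive exception instead of dereferencing null.
    static String toRequestPath(String name, String location) {
        if (location == null) {
            throw new IllegalArgumentException(
                "Local resource '" + name + "' has no location");
        }
        return location;
    }

    // Stands in for the resource-request transition: returns one diagnostic
    // per invalid resource rather than propagating the exception.
    static List<String> validate(Map<String, String> resources) {
        List<String> diagnostics = new ArrayList<>();
        for (Map.Entry<String, String> entry : resources.entrySet()) {
            try {
                toRequestPath(entry.getKey(), entry.getValue());
            } catch (IllegalArgumentException e) {
                diagnostics.add(e.getMessage());
            }
        }
        return diagnostics;
    }
}
```

A non-empty diagnostics list would then mark the container as failed with those messages, while the NM keeps running.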



