[ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951182#comment-15951182
 ] 

Jason Lowe commented on YARN-6403:
----------------------------------

Thanks for updating the patch!

bq. TestApplicationClientProtocolRecords does not exist in branch-2.8, so is it 
ok to place the client-side UT in 
TestPBImplRecords#testContainerLaunchContextPBImpl?

I'd rather the change appear in the same file on every branch, so that any 
subsequent modifications to the code can be cherry-picked cleanly.  Therefore 
I agree we need a new patch for branch-2.8 so it can add the new 
TestApplicationClientProtocolRecords file.  Alternatively, we can go with just 
one patch that adds a new TestContainerLaunchContextPBImpl file containing 
the test.

Otherwise changes in the 2.8 patch look good.  There will need to be a patch 
for trunk at a minimum.  We'll need a separate one for branch-2.8 if the test 
goes in TestApplicationClientProtocolRecords instead of a new 
TestContainerLaunchContextPBImpl file.  Either works for me.

> Invalid local resource request can raise NPE and make NM exit
> -------------------------------------------------------------
>
>                 Key: YARN-6403
>                 URL: https://issues.apache.org/jira/browse/YARN-6403
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>         Attachments: YARN-6403.001.patch, YARN-6403.002.patch, 
> YARN-6403.branch-2.8.003.patch
>
>
> Recently we found this problem in our testing environment. The app that 
> caused it added an invalid local resource request (one with no location) 
> to the ContainerLaunchContext like this:
> {code}
>     localResources.put("test", LocalResource.newInstance(location,
>         LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
>         System.currentTimeMillis()));
>     ContainerLaunchContext amContainer =
>         ContainerLaunchContext.newInstance(localResources, environment,
>           vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null, although the app didn't expect that. 
> This mistake caused several NMs to exit with the NPE below, and they couldn't 
> restart until the NM recovery dirs were deleted.
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> The NPE occurred while creating a LocalResourceRequest instance for the 
> invalid resource request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>       throws URISyntaxException {
>     this(resource.getResource().toPath(),  //NPE occurred here
>         resource.getTimestamp(),
>         resource.getType(),
>         resource.getVisibility(),
>         resource.getPattern());
>   }
> {code}
> We can't guarantee the validity of local resource requests now, but we could 
> avoid damaging the cluster. Perhaps we can verify the resource both in 
> ContainerLaunchContext and LocalResourceRequest? Please feel free to give 
> your suggestions.
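
For illustration only, the kind of up-front check suggested in the description could look roughly like the sketch below. It is not the attached patch: StubResource, checkLocalResources, and the choice of IllegalArgumentException are hypothetical stand-ins so the example runs without Hadoop on the classpath. The idea is simply to reject a request whose resource has no location before it reaches the code that dereferences it.

```java
import java.util.HashMap;
import java.util.Map;

class LocalResourceValidationSketch {

    /** Hypothetical stand-in for a local resource; only the location matters here. */
    static final class StubResource {
        private final String location; // may be null, as in the reported case

        StubResource(String location) {
            this.location = location;
        }

        String getLocation() {
            return location;
        }
    }

    /**
     * Rejects the whole map if any entry is null or has no location.
     * Failing here surfaces the error to the caller instead of letting a
     * later dereference throw an NPE on an internal thread.
     */
    static void checkLocalResources(Map<String, StubResource> localResources) {
        if (localResources == null) {
            return; // nothing to validate
        }
        for (Map.Entry<String, StubResource> entry : localResources.entrySet()) {
            StubResource resource = entry.getValue();
            if (resource == null || resource.getLocation() == null) {
                throw new IllegalArgumentException(
                    "Invalid local resource '" + entry.getKey() + "': no location");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, StubResource> localResources = new HashMap<>();
        localResources.put("app.jar", new StubResource("hdfs://nn/apps/app.jar"));
        checkLocalResources(localResources); // passes: every entry has a location

        localResources.put("test", new StubResource(null)); // mirrors the bad request
        try {
            checkLocalResources(localResources);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Running the same kind of check in both places mentioned above (when the launch context is built and again before the request object is constructed) would cover clients that bypass the first check; whether both are needed is a design question for the patch itself.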


