[ 
https://issues.apache.org/jira/browse/YARN-6403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-6403:
--------------------------------

            Assignee: Tao Yang
    Target Version/s: 2.8.1

Submitting patch so Jenkins can comment on it as well.

bq. For the client-side change, IIUIC the generated protobuf code won't throws 
NPE for this case actually.

The generated protobuf code won't throw NPE for _this_ particular case, but it 
does throw NPE for _other_ fields that you try to set directly to null.  For 
example, if one tried to call setLocalResources(null) on the protobuf (not the 
PBImpl) then the generated protobuf code explicitly throws NPE.  As such, I 
believe it's appropriate to throw NPE in our client check code as well rather 
than a generic RuntimeException.  It's a minor point since the net effect will 
be similar for the client in either case.

The {{localResources != null}} check in checkLocalResources is not necessary 
since the calling code explicitly checks for it already.

The error message should be a bit more specific.  It just logs the local 
resource as a string, but unfortunately that won't log the fact that the 
resource itself is null.  We should change "Got invalid local resource" to 
something like "Null resource URL for local resource".  Throwing NPE instead of 
RuntimeException would at least hint to the user that there's a problem with a 
null field here, but we should also be more explicit in the error message to 
help them along.

This test code:
{code}
    boolean throwsException = false;
    try {
[....]
      containerLaunchContext.setLocalResources(localResources);
    } catch (Throwable e) {
      throwsException = true;
      Assert.assertTrue(e.getMessage().contains("Got invalid local resource"));
    }
    Assert.assertTrue(throwsException);
{code}
can be simplified to this:
{code}
    try {
[....]
      containerLaunchContext.setLocalResources(localResources);
      Assert.fail("setting an invalid local resource should be an error!");
    } catch (RuntimeException e) {
      Assert.assertTrue(e.getMessage().contains("Got invalid local resource"));
    }
{code}
Note that we should be checking for the specific exception type we are throwing 
in the test rather than Throwable, since this is essentially part of the client 
API.

testClientFailureWithInvalidResource does not belong in ContainerManagerImpl 
since it has nothing to do with ContainerManagerImpl.  It's really a test for 
ContainerLaunchContextPBImpl and should be moved to an appropriate place in the 
yarn-common project.  TestApplicationClientProtocolRecords looks like a decent 
place since it's already has another test for ContainerLaunchContextPBImpl 
there.  The unit test method should be renamed to something more appropriate 
when moved there, like testCLCPBImplNullResource.


> Invalid local resource request can raise NPE and make NM exit
> -------------------------------------------------------------
>
>                 Key: YARN-6403
>                 URL: https://issues.apache.org/jira/browse/YARN-6403
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>         Attachments: YARN-6403.001.patch, YARN-6403.002.patch
>
>
> Recently we found this problem on our testing environment. The app that 
> caused this problem added a invalid local resource request(have no location) 
> into ContainerLaunchContext like this:
> {code}
>     localResources.put("test", LocalResource.newInstance(location,
>         LocalResourceType.FILE, LocalResourceVisibility.PRIVATE, 100,
>         System.currentTimeMillis()));
>     ContainerLaunchContext amContainer =
>         ContainerLaunchContext.newInstance(localResources, environment,
>           vargsFinal, null, securityTokens, acls);
> {code}
> The actual value of location was null although app doesn't expect that. This 
> mistake cause several NMs exited with the NPE below and can't restart until 
> the nm recovery dirs were deleted. 
> {code}
> FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalResourceRequest.<init>(LocalResourceRequest.java:46)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:711)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$RequestResourcesTransition.transition(ContainerImpl.java:660)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1320)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:88)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1293)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1286)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> NPE occured when created LocalResourceRequest instance for invalid resource 
> request.
> {code}
>   public LocalResourceRequest(LocalResource resource)
>       throws URISyntaxException {
>     this(resource.getResource().toPath(),  //NPE occurred here
>         resource.getTimestamp(),
>         resource.getType(),
>         resource.getVisibility(),
>         resource.getPattern());
>   }
> {code}
> We can't guarantee the validity of local resource request now, but we could 
> avoid damaging the cluster. Perhaps we can verify the resource both in 
> ContainerLaunchContext and LocalResourceRequest? Please feel free to give 
> your suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to