[ 
https://issues.apache.org/jira/browse/YARN-7998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486638#comment-16486638
 ] 

Wilfred Spiegelenburg commented on YARN-7998:
---------------------------------------------

I don't think we should fail restore of a running application at all when the 
ACL was changed. Logging the failure is good but just killing the application 
is not the right thing to do. We should either not start up at all and tell the 
end user to fix the configuration or allow the application to be restored and 
finish. The ACL change when made on a running RM is also not triggering a 
running application review.  You do not kill any running application that is 
not allowed by the ACL when it gets changed. Restore should not behave any 
different.

Based on the details in YARN-7913 I think we need to close this as a duplicate 
and come up with a general fix that handles all these cases and not do one of 
changes to fix a specific corner case.

> RM crashes with NPE during recovering if ACL configuration was changed
> ----------------------------------------------------------------------
>
>                 Key: YARN-7998
>                 URL: https://issues.apache.org/jira/browse/YARN-7998
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: Oleksandr Shevchenko
>            Assignee: Oleksandr Shevchenko
>            Priority: Major
>         Attachments: YARN-7998.000.patch, YARN-7998.001.patch, 
> YARN-7998.002.patch, YARN-7998.003.patch
>
>
> RM crashes with NPE during failover because ACL configurations were changed 
> as a result we no longer have a rights to submit an application to a queue.
> Scenario:
>  # Submit an application
>  # Change ACL configuration for a queue that accepted the application so that 
> an owner of the application will no longer have a rights to submit this 
> application.
>  # Restart RM.
> As a result, we get NPE:
> 2018-02-27 18:14:00,968 INFO org.apache.hadoop.service.AbstractService: 
> Service ResourceManager failed in state STARTED; cause: 
> java.lang.NullPointerException
> java.lang.NullPointerException
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:738)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1286)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:116)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1098)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1044)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to