[ 
https://issues.apache.org/jira/browse/YARN-7701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313300#comment-16313300
 ] 

Rohith Sharma K S commented on YARN-7701:
-----------------------------------------

Got the complete RM logs. The cluster's code base roughly matches 2.8. 
# I suspect that *ClientRMService#getDelegationToken* makes a synchronous call 
to RMStateStore to store passwords. If the RMStateStore is fenced, the RM is 
moved to standby within this synchronous call. In a secure cluster, the 
transition to standby therefore happens in the context of callerUgi. When the 
RM transitions to standby, service initialization and the elector reset run in 
the context of the callerUgi that invoked _getDelegationToken_. As a result, 
any subsequent call from the elector to become active or standby carries the 
callerUgi context and fails the ACL check. 
# The log trace below hints that the transition to standby happens inside the 
ClientRMService#getDelegationToken call, which runs in the context of 
callerUgi.
{noformat}
2017-12-20 11:55:01,302 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: State 
store operation failed 
org.apache.hadoop.yarn.server.resourcemanager.recovery.StoreFencedException: 
RMStateStore has been fenced
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1213)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeRMDelegationTokenState(ZKRMStateStore.java:752)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreRMDTTransition.transition(RMStateStore.java:345)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreRMDTTransition.transition(RMStateStore.java:330)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:960)
        at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeRMDelegationToken(RMStateStore.java:775)
        at 
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.storeNewToken(RMDelegationTokenSecretManager.java:110)
        at 
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.storeNewToken(RMDelegationTokenSecretManager.java:47)
        at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.storeToken(AbstractDelegationTokenSecretManager.java:272)
        at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.createPassword(AbstractDelegationTokenSecretManager.java:391)
        at 
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.createPassword(AbstractDelegationTokenSecretManager.java:47)
        at org.apache.hadoop.security.token.Token.<init>(Token.java:62)
        at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:968)
        at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:296)
        at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:433)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
2017-12-20 11:55:01,303 WARN 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: 
State-store fenced ! Transitioning RM to standby
2017-12-20 11:55:01,398 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: 
RMStateStore state change from ACTIVE to FENCED
2017-12-20 11:55:01,398 INFO 
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: 
RMStateStore has been fenced
2017-12-20 11:55:01,404 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning RM 
to Standby mode
2017-12-20 11:55:01,404 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to 
standby state
2017-12-20 11:55:03,114 INFO 
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to 
standby state
2017-12-20 11:55:03,115 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Yielding from election
2017-12-20 11:55:03,116 INFO org.apache.hadoop.ha.ActiveStandbyElector: 
Terminating ZK connection for elector id=693267461 
2017-12-20 11:55:03,231 WARN 
org.apache.hadoop.yarn.server.resourcemanager.AdminService: User odsuser 
doesn't have permission to call 'refreshAdminAcls'
2017-12-20 11:55:03,231 WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=odsuser  
OPERATION=refreshAdminAcls      TARGET=AdminService     RESULT=FAILURE  
DESCRIPTION=Unauthorized user   PERMISSIONS=
2017-12-20 11:55:03,231 ERROR 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService: RM could 
not transition to Standby
org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls
        at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:346)
        at 
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:147)
        at 
org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:970)
        at 
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:480)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: 
org.apache.hadoop.security.AccessControlException: User odsuser doesn't have 
permission to call 'refreshAdminAcls'
        at 
org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
        at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:239)
        at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:476)
        at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:344)
        ... 5 more
Caused by: org.apache.hadoop.security.AccessControlException: User odsuser 
doesn't have permission to call 'refreshAdminAcls'
        at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:191)
        at 
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:157)
        at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:232)
        at 
org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:237)
        ... 7 more
{noformat}
In trunk I see that YARN-6061 and YARN-3742 have been committed. They put an 
async dispatcher in between, i.e. RMStateStore sends an event to the 
dispatcher, and the dispatcher thread spawns another thread to transition to 
standby. I am not sure whether this is sufficient, or whether we need a 
further safeguard by doing doAs with rmLoginUgi for the transition to standby. 

RM#transitionToActive is executed using doAs, but RM#transitionToStandby is 
not. I think it is better to execute the transition to standby as a privileged 
action as well. 
cc: [~jianhe] [~jlowe], could you please share your opinion on doing doAs for 
transitionToStandby?
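
To make the proposal concrete, here is a minimal, self-contained sketch of the idea. It is not Hadoop code: a thread-local string stands in for the UGI context, and the class name, the user name "yarn" for the RM login user, and the helper method names are all assumptions made for illustration. In the real code the equivalent would presumably be wrapping the transition in rmLoginUgi.doAs(...) with a PrivilegedExceptionAction.

```java
import java.util.Set;
import java.util.function.Supplier;

/** Simplified stand-in for UGI-style caller context; hypothetical, not Hadoop code. */
final class DoAsSketch {
    // Assumed name of the RM login user; the real value depends on the cluster.
    static final String RM_LOGIN_USER = "yarn";
    // Only the RM login user is an admin in this sketch.
    static final Set<String> ADMIN_ACL = Set.of(RM_LOGIN_USER);

    // The "current user" of the running thread, like the UGI attached to an RPC handler.
    private static final ThreadLocal<String> CURRENT_USER =
            ThreadLocal.withInitial(() -> RM_LOGIN_USER);

    /** Runs the action as the given user and restores the previous user afterwards. */
    static <T> T doAs(String user, Supplier<T> action) {
        String previous = CURRENT_USER.get();
        CURRENT_USER.set(user);
        try {
            return action.get();
        } finally {
            CURRENT_USER.set(previous);
        }
    }

    /** ACL-checked admin operation; returns the user it ran as. */
    static String refreshAdminAcls() {
        String user = CURRENT_USER.get();
        if (!ADMIN_ACL.contains(user)) {
            throw new SecurityException(
                    "User " + user + " doesn't have permission to call 'refreshAdminAcls'");
        }
        return user;
    }

    /** Inherits whatever context the calling thread has (the problematic case). */
    static String transitionToStandbyUnguarded() {
        return refreshAdminAcls();
    }

    /** Always switches to the RM login user first (the proposed safeguard). */
    static String transitionToStandbyGuarded() {
        return doAs(RM_LOGIN_USER, DoAsSketch::refreshAdminAcls);
    }

    public static void main(String[] args) {
        // Simulate an RPC handler running as the application user.
        String ranAs = doAs("odsuser", DoAsSketch::transitionToStandbyGuarded);
        System.out.println("guarded transition ran as: " + ranAs);
        try {
            doAs("odsuser", DoAsSketch::transitionToStandbyUnguarded);
        } catch (SecurityException e) {
            System.out.println("unguarded transition failed: " + e.getMessage());
        }
    }
}
```

The unguarded variant fails the ACL check when it is reached from a caller's context, while the guarded variant succeeds regardless of who triggered it. Whether the async dispatcher alone already gives the same isolation in trunk is exactly the open question above.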

> Both RM are in standby in secure cluster
> ----------------------------------------
>
>                 Key: YARN-7701
>                 URL: https://issues.apache.org/jira/browse/YARN-7701
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.9.0, 2.8.3, 3.0.0
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>            Priority: Critical
>
> Both RMs were running perfectly fine for many days and had switched multiple 
> times. At some point, when the RM switched from ACTIVE -> STANDBY, either the 
> UGI information changed or a new user was added to the subject.  
> As a result, UGI#getShortUserName() returns the wrong user, which causes the 
> transition to ACTIVE to fail with an AccessControlException!
> {code}Caused by: org.apache.hadoop.security.AccessControlException: User 
> odsuser doesn't have permission to call 'refreshAdminAcls' 
> {code}
> _odsuser_ is the user who submitted the application. 


