[
https://issues.apache.org/jira/browse/YARN-7701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313300#comment-16313300
]
Rohith Sharma K S commented on YARN-7701:
-----------------------------------------
Got the complete RM logs. The cluster's code base roughly matches 2.8.
# My suspicion is that *ClientRMService#getDelegationToken* makes a synchronous
call to RMStateStore to store passwords. If the RMStateStore is fenced, the RM
is moved to standby during this synchronous call. In a secure cluster, the
transition to standby therefore happens in the context of callerUgi. When the RM
transitions to standby, service initialization and the elector reset happen in
the context of the callerUgi who invoked _getDelegationToken_. As a result, any
subsequent call from the elector to become active or standby carries the
callerUgi context and fails the ACL check.
# Below is the log trace, which hints that the transition to standby happened
inside the ClientRMService#getDelegationToken call, i.e. in the context of
callerUgi.
{noformat}
2017-12-20 11:55:01,302 ERROR
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: State
store operation failed
org.apache.hadoop.yarn.server.resourcemanager.recovery.StoreFencedException:
RMStateStore has been fenced
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1213)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doStoreMultiWithRetries(ZKRMStateStore.java:995)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeRMDelegationTokenState(ZKRMStateStore.java:752)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreRMDTTransition.transition(RMStateStore.java:345)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreRMDTTransition.transition(RMStateStore.java:330)
at
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:960)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeRMDelegationToken(RMStateStore.java:775)
at
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.storeNewToken(RMDelegationTokenSecretManager.java:110)
at
org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager.storeNewToken(RMDelegationTokenSecretManager.java:47)
at
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.storeToken(AbstractDelegationTokenSecretManager.java:272)
at
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.createPassword(AbstractDelegationTokenSecretManager.java:391)
at
org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager.createPassword(AbstractDelegationTokenSecretManager.java:47)
at org.apache.hadoop.security.token.Token.<init>(Token.java:62)
at
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:968)
at
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getDelegationToken(ApplicationClientProtocolPBServiceImpl.java:296)
at
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:433)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
2017-12-20 11:55:01,303 WARN
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
State-store fenced ! Transitioning RM to standby
2017-12-20 11:55:01,398 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
RMStateStore state change from ACTIVE to FENCED
2017-12-20 11:55:01,398 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
RMStateStore has been fenced
2017-12-20 11:55:01,404 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning RM
to Standby mode
2017-12-20 11:55:01,404 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioning to
standby state
2017-12-20 11:55:03,114 INFO
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Transitioned to
standby state
2017-12-20 11:55:03,115 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Yielding from election
2017-12-20 11:55:03,116 INFO org.apache.hadoop.ha.ActiveStandbyElector:
Terminating ZK connection for elector id=693267461
2017-12-20 11:55:03,231 WARN
org.apache.hadoop.yarn.server.resourcemanager.AdminService: User odsuser
doesn't have permission to call 'refreshAdminAcls'
2017-12-20 11:55:03,231 WARN
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=odsuser
OPERATION=refreshAdminAcls TARGET=AdminService RESULT=FAILURE
DESCRIPTION=Unauthorized user PERMISSIONS=
2017-12-20 11:55:03,231 ERROR
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService: RM could
not transition to Standby
org.apache.hadoop.ha.ServiceFailedException: Can not execute refreshAdminAcls
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:346)
at
org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:147)
at
org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:970)
at
org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:480)
at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException:
org.apache.hadoop.security.AccessControlException: User odsuser doesn't have
permission to call 'refreshAdminAcls'
at
org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:239)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAdminAcls(AdminService.java:476)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:344)
... 5 more
Caused by: org.apache.hadoop.security.AccessControlException: User odsuser
doesn't have permission to call 'refreshAdminAcls'
at
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:191)
at
org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.verifyAdminAccess(RMServerUtils.java:157)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAccess(AdminService.java:232)
at
org.apache.hadoop.yarn.server.resourcemanager.AdminService.checkAcls(AdminService.java:237)
... 7 more
{noformat}
In trunk I see that YARN-6061 and YARN-3742 were committed, which bring in an
async dispatcher in between, i.e. RMStateStore triggers an event to the
dispatcher, and the dispatcher thread spawns another thread to transition to
standby. I am not sure whether this is sufficient or whether we need to
safeguard further by doing doAs with rmLoginUgi for the transition to standby.
Method RM#transitionToActive is executed using doAs, but method
RM#transitionToStandby is not. I think it is better to execute the transition to
standby as a privileged action as well.
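To illustrate the idea, here is a minimal, self-contained sketch. It does not use Hadoop classes: the thread-local CURRENT_USER, the doAs helper, the login user name "yarn", and the class name StandbyContextSketch are all hypothetical stand-ins for UserGroupInformation and rmLoginUgi. It only models the assumption that an admin ACL check sees whichever user context the calling code happens to run in.

```java
import java.util.concurrent.Callable;

// Simplified model only; none of these names are Hadoop APIs.
final class StandbyContextSketch {
    // Hypothetical stand-in for the RM login user (rmLoginUgi).
    static final String RM_LOGIN_USER = "yarn";

    // Hypothetical stand-in for the UGI attached to the current thread.
    private static final ThreadLocal<String> CURRENT_USER =
            ThreadLocal.withInitial(() -> RM_LOGIN_USER);

    // Runs the action with the given user as the current context,
    // then restores the previous context.
    static <T> T doAs(String user, Callable<T> action) throws Exception {
        String previous = CURRENT_USER.get();
        CURRENT_USER.set(user);
        try {
            return action.call();
        } finally {
            CURRENT_USER.set(previous);
        }
    }

    // Models an admin ACL check: only the RM login user is allowed.
    static String refreshAdminAcls() {
        String user = CURRENT_USER.get();
        if (!RM_LOGIN_USER.equals(user)) {
            throw new SecurityException("User " + user
                    + " doesn't have permission to call 'refreshAdminAcls'");
        }
        return "refreshed as " + user;
    }

    // Unguarded transition: inherits whatever context the caller runs in.
    static String transitionToStandbyUnguarded() {
        return refreshAdminAcls();
    }

    // Guarded transition: always switches to the RM login user first.
    static String transitionToStandbyGuarded() throws Exception {
        return doAs(RM_LOGIN_USER, StandbyContextSketch::refreshAdminAcls);
    }

    public static void main(String[] args) throws Exception {
        // Simulates an RPC handler running as the application user.
        try {
            doAs("odsuser", StandbyContextSketch::transitionToStandbyUnguarded);
        } catch (SecurityException e) {
            System.out.println("unguarded: " + e.getMessage());
        }
        System.out.println("guarded: "
                + doAs("odsuser", StandbyContextSketch::transitionToStandbyGuarded));
    }
}
```

In this model the unguarded transition is rejected when triggered from the odsuser context, while the guarded one passes the check because it wraps the work in doAs with the login user. Whether the same wrapping is sufficient in the real RM depends on where the elector and its threads are created, which this sketch does not model.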
cc: [~jianhe] [~jlowe] Could you please share your opinion on doing doAs
for transitionToStandby?
> Both RM are in standby in secure cluster
> ----------------------------------------
>
> Key: YARN-7701
> URL: https://issues.apache.org/jira/browse/YARN-7701
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.9.0, 2.8.3, 3.0.0
> Reporter: Rohith Sharma K S
> Assignee: Rohith Sharma K S
> Priority: Critical
>
> Both RMs were running perfectly fine for many days and switched over multiple
> times. At some point, when an RM switched from ACTIVE -> STANDBY, the UGI
> information either got changed or a new user got added to the subject.
> As a result, UGI#getShortUserName() returns the wrong user, which causes
> the transition to ACTIVE to fail with AccessControlException!
> {code}Caused by: org.apache.hadoop.security.AccessControlException: User
> odsuser doesn't have permission to call 'refreshAdminAcls'
> {code}
> _odsuser_ is the user who submitted the application.