[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813716#comment-13813716 ]
Bikas Saha commented on YARN-1222:
----------------------------------

Should this be @Private?
{code}
+  public static String getConfValueForRMInstance(String prefix,
{code}

If the RM is the one creating the root znode, how can someone else's ACLs be present on that znode? I.e., how can the ACLs on the root znode have any other entries?

My concern is that we only add new ACLs every time we fail over but never delete them. Is it possible that we end up creating too many ACLs on the root znode and hit ZK issues?
{code}
+    Id rmId = new Id(zkRootNodeAuthScheme,
+        DigestAuthenticationProvider.generateDigest(
+            zkRootNodeUsername + ":" + zkRootNodePassword));
+    zkRootNodeAcl.add(new ACL(CREATE_DELETE_PERMS, rmId));
+    return zkRootNodeAcl;
{code}

For both of the above, can we use well-known prefixes for the root znode ACLs (rm-admin-acl and rm-cd-acl)? When fencing, we don't touch the rm-admin-acl but remove all rm-cd-acl entries. We then add a new rm-cd-acl for ourselves. We don't touch any other ACL.

Where is the shared rm-admin-acl being set such that both RMs have admin access to the root znode?

How is the following case going to work? How can the root node ACL be set in the conf? Upon becoming active, we have to remove the old RM's cd-acl and set our own cd-acl. That cannot be statically set in the conf, right?
{code}
     if (HAUtil.isHAEnabled(conf)) {
+      String zkRootNodeAclConf = HAUtil.getConfValueForRMInstance
+          (YarnConfiguration.ZK_RM_STATE_STORE_ROOT_NODE_ACL, conf);
+      if (zkRootNodeAclConf != null) {
+        zkRootNodeAclConf = ZKUtil.resolveConfIndirection(zkRootNodeAclConf);
+        try {
+          zkRootNodeAcl = ZKUtil.parseACLs(zkRootNodeAclConf);
+        } catch (ZKUtil.BadAclFormatException bafe) {
+          LOG.error("Invalid format for " +
+              YarnConfiguration.ZK_RM_STATE_STORE_ROOT_NODE_ACL);
+          throw bafe;
+        }
+      }
{code}

The test should probably create separate copies of conf for the 2 RMs.

Won't we get an exception/error from this?
{code}
+    rmService.submitApplication(SubmitApplicationRequest.newInstance(asc));
{code}
Let's add a comment saying that this triggers a state store operation that makes rm1 realize it is no longer the master because it got fenced by the store.

This and other similar places need an @Private annotation.
{code}
+  @VisibleForTesting
+  public void createWithRetries(
{code}

Can you please specify in comments which operations are exempt from multi-operation? It looks like only "write" operations go through multi, the exceptions being initial znode creation and fence-on-active. Right?

Can we move this logic into the common RMStateStore and notify it about HA state loss via a standard HA exception? Will the null return make the state store crash?
{code}
+    } catch (KeeperException.NoAuthException nae) {
+      if (HAUtil.isHAEnabled(getConfig())) {
+        // Transition to standby
+        RMHAServiceTarget target = new RMHAServiceTarget(
+            (YarnConfiguration)getConfig());
+        target.getProxy(getConfig(), 1000).transitionToStandby(
+            new HAServiceProtocol.StateChangeRequestInfo(
+                HAServiceProtocol.RequestSource.REQUEST_BY_USER_FORCED));
+        return null;
+      }
{code}

> Make improvements in ZKRMStateStore for fencing
> -----------------------------------------------
>
>                 Key: YARN-1222
>                 URL: https://issues.apache.org/jira/browse/YARN-1222
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Karthik Kambatla
>         Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, yarn-1222-4.patch
>
>
> Using multi-operations for every ZK interaction.
> In every operation, automatically creating/deleting a lock znode that is the child of the root znode. This is to achieve fencing by modifying the create/delete permissions on the root znode.
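
To make the prefix-based fencing suggestion from the comment concrete (leave rm-admin-acl alone, remove every rm-cd-acl, add a new rm-cd-acl for ourselves, touch nothing else), here is a rough sketch. It is a plain-Java model and not the patch's code: the {{AclEntry}} class, the {{fenceRootAcls}} method, the permission labels, and the idea that the prefix is carried in the entry's id string are all assumptions made for illustration. A real implementation would work on ZooKeeper ACL objects and write the result back to the root znode.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the proposal; none of these names come from the patch.
class RootAclFencingSketch {
    // Well-known prefixes named in the review comment.
    static final String ADMIN_PREFIX = "rm-admin-acl";
    static final String CD_PREFIX = "rm-cd-acl";

    /** Simplified stand-in for a ZooKeeper ACL entry: an id string plus a permission label. */
    static final class AclEntry {
        final String id;
        final String perms;

        AclEntry(String id, String perms) {
            this.id = id;
            this.perms = perms;
        }

        @Override
        public String toString() {
            return id + "=" + perms;
        }
    }

    /**
     * Builds the ACL list the newly active RM would set on the root znode:
     * every existing rm-cd-acl entry is dropped, everything else (including
     * rm-admin-acl) is kept as-is, and one rm-cd-acl entry for this RM is added.
     */
    static List<AclEntry> fenceRootAcls(List<AclEntry> current, String myRmId) {
        List<AclEntry> updated = new ArrayList<>();
        for (AclEntry entry : current) {
            if (!entry.id.startsWith(CD_PREFIX)) {
                updated.add(entry); // admin and unrelated ACLs are untouched
            }
        }
        // Assumed naming scheme: prefix + "-" + RM id.
        updated.add(new AclEntry(CD_PREFIX + "-" + myRmId, "create-delete"));
        return updated;
    }

    public static void main(String[] args) {
        List<AclEntry> before = new ArrayList<>();
        before.add(new AclEntry(ADMIN_PREFIX + "-shared", "admin"));
        before.add(new AclEntry(CD_PREFIX + "-rm1", "create-delete"));
        before.add(new AclEntry("world:anyone", "read"));

        // rm2 becomes active and fences rm1.
        System.out.println(fenceRootAcls(before, "rm2"));
    }
}
```

Because the old rm-cd-acl entries are removed before the new one is added, the list does not grow across repeated failovers, which is the concern raised above about accumulating ACLs on the root znode.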