[ https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819895#comment-13819895 ]
Bikas Saha commented on YARN-1222: ---------------------------------- Looks good. Some minor comments. In default fencing, we rely on the default zkacl to allow shared admin access for both RM's right? Can we document that please? Should these and others in handleStoreEvent() be inside the try block. If we caught an exception then we should not be informing the apps that their store operation is complete. We are probably getting away with it because everything happens on the async dispatcher thread. Hence the RM transitions to standby before these notification events are processed. {code} } } catch (Exception e) { LOG.error("Error storing app: " + appId, e); - storedException = e; - } finally { - if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { - notifyDoneStoringApplication(appId, storedException); - } else { - notifyDoneUpdatingApplication(appId, storedException); - } + notifyStoreOperationFailed(e); + } + if (event.getType().equals(RMStateStoreEventType.STORE_APP)) { + notifyDoneStoringApplication(appId, storedException); + } else { + notifyDoneUpdatingApplication(appId, storedException); } {code} Log in main and catch block and before ExitUtil would help in debugging. {code}+ if (haService.haEnabled) { + try { + // Transition to standby and reinit active services + haService.transitionToStandby(true); + return; + } catch (Exception e) { + // Do nothing. RM is going to shutdown + } {code} Why is there a new synchronized block here? What needs to be guarded against concurrent modification? {code} - if (shouldRetry(ke.code()) && ++retry < numRetries) { - continue; + synchronized (ZKRMStateStore.this) { + if (shouldRetry(ke.code()) && ++retry < numRetries) { + continue; + } {code} The test should probably create separate copies of conf for the 2 RM's In the test, lets put a comment saying, triggering a state store operation that makes rm1 realize that its not the master because it got fenced by the store. > Make improvements in ZKRMStateStore for fencing > ----------------------------------------------- > > Key: YARN-1222 > URL: https://issues.apache.org/jira/browse/YARN-1222 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Bikas Saha > Assignee: Karthik Kambatla > Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, > yarn-1222-4.patch, yarn-1222-5.patch, yarn-1222-6.patch, yarn-1222-7.patch, > yarn-1222-8.patch, yarn-1222-8.patch > > > Using multi-operations for every ZK interaction. > In every operation, automatically creating/deleting a lock znode that is the > child of the root znode. This is to achieve fencing by modifying the > create/delete permissions on the root znode. -- This message was sent by Atlassian JIRA (v6.1#6144)