[ 
https://issues.apache.org/jira/browse/YARN-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819895#comment-13819895
 ] 

Bikas Saha commented on YARN-1222:
----------------------------------

Looks good.

Some minor comments.

In default fencing, we rely on the default zkacl to allow shared admin access 
for both RM's right? Can we document that please?

Should these and others in handleStoreEvent() be inside the try block. If we 
caught an exception then we should not be informing the apps that their store 
operation is complete. We are probably getting away with it because everything 
happens on the async dispatcher thread. Hence the RM transitions to standby 
before these notification events are processed.
{code}
         }
       } catch (Exception e) {
         LOG.error("Error storing app: " + appId, e);
-        storedException = e;
-      } finally {
-        if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
-          notifyDoneStoringApplication(appId, storedException);
-        } else {
-          notifyDoneUpdatingApplication(appId, storedException);
-        }
+        notifyStoreOperationFailed(e);
+      }
+      if (event.getType().equals(RMStateStoreEventType.STORE_APP)) {
+        notifyDoneStoringApplication(appId, storedException);
+      } else {
+        notifyDoneUpdatingApplication(appId, storedException);
       }
{code}

Log in main and catch block and before ExitUtil would help in debugging.
{code}+          if (haService.haEnabled) {
+            try {
+              // Transition to standby and reinit active services
+              haService.transitionToStandby(true);
+              return;
+            } catch (Exception e) {
+              // Do nothing. RM is going to shutdown
+            }
{code}

Why is there a new synchronized block here? What needs to be guarded against 
concurrent modification?
{code}
-          if (shouldRetry(ke.code()) && ++retry < numRetries) {
-            continue;
+          synchronized (ZKRMStateStore.this) {
+            if (shouldRetry(ke.code()) && ++retry < numRetries) {
+              continue;
+            }
{code}

The test should probably create separate copies of conf for the 2 RM's

In the test, lets put a comment saying, triggering a state store operation that 
makes rm1 realize that its not the master because it got fenced by the store.



> Make improvements in ZKRMStateStore for fencing
> -----------------------------------------------
>
>                 Key: YARN-1222
>                 URL: https://issues.apache.org/jira/browse/YARN-1222
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Karthik Kambatla
>         Attachments: yarn-1222-1.patch, yarn-1222-2.patch, yarn-1222-3.patch, 
> yarn-1222-4.patch, yarn-1222-5.patch, yarn-1222-6.patch, yarn-1222-7.patch, 
> yarn-1222-8.patch, yarn-1222-8.patch
>
>
> Using multi-operations for every ZK interaction. 
> In every operation, automatically creating/deleting a lock znode that is the 
> child of the root znode. This is to achieve fencing by modifying the 
> create/delete permissions on the root znode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to