[ 
https://issues.apache.org/jira/browse/YARN-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762268#comment-13762268
 ] 

Bikas Saha commented on YARN-540:
---------------------------------

getIsUnregistered() instead of getUnregistered() ?

Better grammar/more clear. e.g. is expected to retry until this flag becomes 
true.
{code}
Note: This flag only needed for RM recovery purpose. If RM recovery is
+ * enabled, user is expected to retry this flag until it becomes true.
{code}

Why is the default false? 
{code}
+  optional bool unregistered = 1 [default = false];
{code}

Very large value of sleep. 100ms? The log should be before the sleep.
{code}
+      while (true) {
+        FinishApplicationMasterResponse response =
+            rmClient.finishApplicationMaster(request);
+        if (response.getUnregistered()) {
+          break;
+        }
+        Thread.sleep(1000);
+        LOG.info("Waiting for application to be successfully unregistered.");
{code}

Instead of checking for an exact state I think it we should check for all 
terminal states of an RMApp. This will make the code more resilient to future 
changes in the state machines. So we check for FINISHING, FINISHED. FAILED, 
KILLED. This will also allow us to not special case the unmanaged AM in the 
latter half of the same function. Also, this is open to race conditions. e.g. 
someone kills the app before the app is removed from the store. We should 
probably make this an RMApp method like RMApp.isAppRemovedFromStore(). In this 
method we can either check the state or some boolean that we can set when the 
App_Removed event comes.
{code}
+      // Application state has been removed from RMStateStore, if it's in
+      // FINISHING state
+      if (rmContext.getRMApps().get(applicationAttemptId.getApplicationId())
+        .getState().equals(RMAppState.FINISHING)) {
+        return FinishApplicationMasterResponse.newInstance(true);
+      }
{code}

Is there a version of delete that will not fail if the file does not exist? OR 
we can have a boolean in RMApp to show that the removal request has already 
been sent and not send it multiple times. Lets try to avoid 2 remote HDFS calls 
in the common case.
{code}
+    if(!fs.exists(deletePath))
+      return;
     if(!fs.delete(deletePath, true)) {
{code}




                
> Race condition causing RM to potentially relaunch already unregistered AMs on 
> RM restart
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-540
>                 URL: https://issues.apache.org/jira/browse/YARN-540
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-540.1.patch, YARN-540.2.patch, YARN-540.3.patch, 
> YARN-540.4.patch, YARN-540.5.patch, YARN-540.patch, YARN-540.patch
>
>
> When job succeeds and successfully call finishApplicationMaster, RM shutdown 
> and restart-dispatcher is stopped before it can process REMOVE_APP event. The 
> next time RM comes back, it will reload the existing state files even though 
> the job is succeeded

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to