[jira] [Commented] (YARN-534) AM max attempts is not checked when RM restart and try to recover attempts

Bikas Saha (JIRA) Wed, 10 Apr 2013 10:32:16 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628015#comment-13628015
 ]


Bikas Saha commented on YARN-534:
---------------------------------

This code needs a comment noting that the logic must change once 
work-preserving restart is implemented. At that point, if 
attemptCount==maxAttempts the job still needs to be recovered, because the 
last attempt may still be running. I raised this in an earlier comment above.
{code}
+      if(appState.getAttemptCount() >= maxAppAttempts) {
+        LOG.info("Not recovering application " + appState.getAppId() +
+            " due to recovering attempt is beyond maxAppAttempt limit");
+        shouldRecover = false;
+      }
{code}
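
To make the concern concrete, here is a minimal sketch of how the boundary 
could differ between the two restart modes. RecoveryDecision and the 
workPreservingRestart flag are hypothetical names used only for illustration; 
they are not part of the patch or of YARN.

```java
// Hypothetical sketch; names are illustrative and not taken from the YARN-534 patch.
final class RecoveryDecision {
    private RecoveryDecision() {}

    /**
     * Decides whether an application should be recovered after an RM restart.
     *
     * @param attemptCount          attempts recorded in the state store
     * @param maxAppAttempts        configured maximum number of AM attempts
     * @param workPreservingRestart assumed flag: true if running attempts survive the restart
     */
    static boolean shouldRecover(int attemptCount, int maxAppAttempts,
                                 boolean workPreservingRestart) {
        if (workPreservingRestart) {
            // The last attempt may still be running, so an app that has used
            // exactly maxAppAttempts attempts must still be recovered.
            return attemptCount <= maxAppAttempts;
        }
        // Without work-preserving restart, recovery needs a fresh attempt,
        // so there must be at least one attempt left (same as the patch's check).
        return attemptCount < maxAppAttempts;
    }
}
```

With attemptCount == maxAppAttempts the two modes disagree, which is the case 
the comment in the patch should call out.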

Since DEFAULT_RM_AM_MAX_ATTEMPTS is currently 1, how do we know whether the 
unmanaged AM is skipped during recovery because it is unmanaged or because its 
max-attempt limit has been reached?
{code}
+    RMApp appUnmanaged = rm1.submitApp(200, "someApp", "someUser", null, true,
+        null, conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
+          YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));
{code}
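
The ambiguity can be shown with a small stand-alone sketch. SkipReasons and 
its Reason enum are hypothetical and only model the two conditions discussed 
here; it assumes an unmanaged AM is never recovered.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the two reasons an app might be skipped on recovery.
final class SkipReasons {
    enum Reason { UNMANAGED_AM, MAX_ATTEMPTS_REACHED }

    private SkipReasons() {}

    /** Returns every reason that would prevent recovery of the given app. */
    static Set<Reason> whyNotRecovered(boolean unmanaged, int attemptCount,
                                       int maxAppAttempts) {
        Set<Reason> reasons = EnumSet.noneOf(Reason.class);
        if (unmanaged) {
            reasons.add(Reason.UNMANAGED_AM);
        }
        if (attemptCount >= maxAppAttempts) {
            reasons.add(Reason.MAX_ATTEMPTS_REACHED);
        }
        return reasons;
    }
}
```

With one attempt and a limit of 1, both reasons apply and the test cannot tell 
them apart. Submitting the unmanaged app with a limit above 1 would leave 
UNMANAGED_AM as the only reason.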

I think this also needs to check that app2 exists in the store and app1 does 
not. We don't want the RM to mistakenly keep storing apps like app1 that it is 
not going to recover. The specific bug in this case would be skipping 
recover() but also forgetting to call store.removeApplication().
{code}
+    // verify that app2 exists and app1 is removed
+    Assert.assertEquals(1, rm2.getRMContext().getRMApps().size());
+    Assert.assertNotNull(rm2.getRMContext().getRMApps()
+        .get(app2.getApplicationId()));
+    Assert.assertNull(rm2.getRMContext().getRMApps()
+        .get(app1.getApplicationId()));
+
+    // stop the RM
{code}
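
As a rough illustration of the extra check, the sketch below models the state 
store as a plain map from application id to attempt count. StoreCleanupSketch 
is hypothetical and is not the RMStateStore API; it only shows the invariant 
that an app which is not recovered must also leave the store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model; not the real RMStateStore API.
final class StoreCleanupSketch {
    private StoreCleanupSketch() {}

    /**
     * Recovers apps that are below the attempt limit and removes the rest
     * from the store, returning the ids of the recovered apps.
     */
    static List<String> recover(Map<String, Integer> store, int maxAppAttempts) {
        List<String> recovered = new ArrayList<>();
        // Iterate over a copy of the keys so removing entries is safe.
        for (String appId : new ArrayList<>(store.keySet())) {
            if (store.get(appId) >= maxAppAttempts) {
                // Not recovering: the entry must not linger in the store.
                store.remove(appId);
            } else {
                recovered.add(appId);
            }
        }
        return recovered;
    }
}
```

A test in this style would assert on both the recovered list and the contents 
of the store after recovery, not only on the recovered list.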
                
> AM max attempts is not checked when RM restart and try to recover attempts
> --------------------------------------------------------------------------
>
>                 Key: YARN-534
>                 URL: https://issues.apache.org/jira/browse/YARN-534
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
>             Fix For: 2.0.5-beta
>
>         Attachments: YARN-534.1.patch, YARN-534.2.patch, YARN-534.3.patch, 
> YARN-534.4.patch
>
>
> Currently, AM max attempts is only checked when the current attempt fails, 
> to decide whether to create a new attempt. If the RM restarts before the 
> last allowed attempt fails, it does not clean the state store, so when the 
> RM comes back it retries the attempt again.

