[
https://issues.apache.org/jira/browse/YARN-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13628015#comment-13628015
]
Bikas Saha commented on YARN-534:
---------------------------------
This check needs a comment noting that the logic must change once
work-preserving restart is supported: at that point, an application with
attemptCount == maxAttempts still has to be recovered, because its last attempt
may still be running. I made the same point in an earlier comment above.
{code}
+    if(appState.getAttemptCount() >= maxAppAttempts) {
+      LOG.info("Not recovering application " + appState.getAppId() +
+          " due to recovering attempt is beyond maxAppAttempt limit");
+      shouldRecover = false;
+    }
{code}
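To illustrate the change I have in mind, here is a minimal sketch, not the patch's code. The class RecoveryPolicySketch and the workPreservingRestart flag are hypothetical names, and treating attemptCount > maxAppAttempts as "never recover" is my assumption.

```java
/** Hypothetical sketch of the recovery decision; not actual YARN code. */
final class RecoveryPolicySketch {
    private RecoveryPolicySketch() {}

    /**
     * Decides whether a stored application should be recovered after an RM restart.
     *
     * @param attemptCount          number of attempts recorded in the state store
     * @param maxAppAttempts        configured attempt limit for the application
     * @param workPreservingRestart hypothetical flag: true if attempts can survive a restart
     */
    static boolean shouldRecover(int attemptCount, int maxAppAttempts,
                                 boolean workPreservingRestart) {
        if (attemptCount < maxAppAttempts) {
            // Attempts remain, so the app can be recovered and retried.
            return true;
        }
        if (attemptCount == maxAppAttempts) {
            // Without work-preserving restart the last attempt is lost and no
            // attempts remain. With it, the last attempt may still be running,
            // so the app still has to be recovered.
            return workPreservingRestart;
        }
        // Assumption: more recorded attempts than the limit is never recoverable.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(shouldRecover(1, 1, false)); // current behaviour at the limit
        System.out.println(shouldRecover(1, 1, true));  // behaviour once attempts survive restart
    }
}
```

The only case that differs between the two modes is attemptCount == maxAppAttempts, which is exactly the case the comment in the patch should call out.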
Since DEFAULT_RM_AM_MAX_ATTEMPTS is currently 1, how do we know whether the
unmanaged AM is skipped during recovery because it is unmanaged or because its
max-attempt limit has been reached?
{code}
+    RMApp appUnmanaged = rm1.submitApp(200, "someApp", "someUser", null, true,
+        null, conf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS,
+        YarnConfiguration.DEFAULT_RM_AM_MAX_ATTEMPTS));
{code}
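A small sketch of why the test is ambiguous, assuming the unmanaged app has already recorded one attempt when the RM restarts. SkipReasonSketch, SkipReason and skipReasons are hypothetical names used only for illustration, not YARN APIs.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; lists every reason an app would be skipped on recovery. */
final class SkipReasonSketch {
    enum SkipReason { UNMANAGED_AM, MAX_ATTEMPTS_REACHED }

    private SkipReasonSketch() {}

    static Set<SkipReason> skipReasons(boolean unmanagedAM, int attemptCount,
                                       int maxAppAttempts) {
        Set<SkipReason> reasons = EnumSet.noneOf(SkipReason.class);
        if (unmanagedAM) {
            reasons.add(SkipReason.UNMANAGED_AM);
        }
        if (attemptCount >= maxAppAttempts) {
            reasons.add(SkipReason.MAX_ATTEMPTS_REACHED);
        }
        return reasons;
    }

    public static void main(String[] args) {
        // Limit of 1 with one recorded attempt: both reasons apply, so the
        // test cannot tell which check rejected the app.
        System.out.println(skipReasons(true, 1, 1));
        // A limit above the recorded attempt count leaves only the unmanaged check.
        System.out.println(skipReasons(true, 1, 2));
    }
}
```

Submitting the unmanaged app with a max-attempts value greater than 1 would leave "unmanaged" as the only possible reason for skipping it.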
I think this also needs to check that app2 exists in the store and app1 does
not. We don't want the RM to mistakenly keep storing apps like app1 that it is
not going to recover. The specific bug in this case would be skipping
recover() but also forgetting to call store.removeApplication().
{code}
+    // verify that app2 exists app1 is removed
+    Assert.assertEquals(1, rm2.getRMContext().getRMApps().size());
+    Assert.assertNotNull(rm2.getRMContext().getRMApps()
+        .get(app2.getApplicationId()));
+    Assert.assertNull(rm2.getRMContext().getRMApps()
+        .get(app1.getApplicationId()));
+
+    // stop the RM
{code}
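The extra check could look roughly like the sketch below. InMemoryStore is a hypothetical stand-in for the RM state store (a map from app id to recorded attempt count), not the real store API, and the attempt counts are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: skipped apps must also be removed from the store. */
final class StoreCleanupSketch {
    private StoreCleanupSketch() {}

    /** Hypothetical stand-in for the state store: appId -> recorded attempt count. */
    static final class InMemoryStore {
        private final Map<String, Integer> apps = new LinkedHashMap<>();

        void storeApplication(String appId, int attemptCount) {
            apps.put(appId, attemptCount);
        }

        void removeApplication(String appId) {
            apps.remove(appId);
        }

        boolean contains(String appId) {
            return apps.containsKey(appId);
        }

        /** Copy of the contents, so callers can iterate while removing entries. */
        Map<String, Integer> snapshot() {
            return new LinkedHashMap<>(apps);
        }
    }

    /** Returns the ids of recovered apps; apps that are skipped are removed from the store. */
    static Set<String> recover(InMemoryStore store, int maxAppAttempts) {
        Set<String> recovered = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : store.snapshot().entrySet()) {
            if (entry.getValue() >= maxAppAttempts) {
                // Skipping recovery without this removal is the bug described above.
                store.removeApplication(entry.getKey());
            } else {
                recovered.add(entry.getKey());
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.storeApplication("app1", 2); // already at the limit
        store.storeApplication("app2", 1); // one attempt left
        System.out.println(recover(store, 2));
        System.out.println(store.contains("app1") + " " + store.contains("app2"));
    }
}
```

A test that only inspects getRMApps() passes even if the removal line is dropped, which is why the store contents need their own assertion.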
> AM max attempts is not checked when RM restart and try to recover attempts
> --------------------------------------------------------------------------
>
> Key: YARN-534
> URL: https://issues.apache.org/jira/browse/YARN-534
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Jian He
> Assignee: Jian He
> Fix For: 2.0.5-beta
>
> Attachments: YARN-534.1.patch, YARN-534.2.patch, YARN-534.3.patch,
> YARN-534.4.patch
>
>
> Currently, AM max attempts is only checked when the current attempt fails and
> the RM decides whether to create a new attempt. If the RM restarts before the
> last allowed attempt fails, it does not clean the state store, and when the RM
> comes back it retries the attempt again.