Prabhu Joseph created YARN-10871:
------------------------------------

             Summary: Aborted AM is considered as App Failure when user sets 
MaxAttempts as 1
                 Key: YARN-10871
                 URL: https://issues.apache.org/jira/browse/YARN-10871
             Project: Hadoop YARN
          Issue Type: Bug
          Components: RM
    Affects Versions: 3.3.1
            Reporter: Prabhu Joseph
            Assignee: Prabhu Joseph


When an AM container is ABORTED due to node decommissioning, the AppAttempt 
failure is not counted towards the maximum number of attempts. But if the user 
sets the maximum number of attempts to 1, YARN still treats the ABORTED AM as a 
failure and fails the application:

{code}
      int numberOfFailure = app.getNumFailedAppAttempts();
      if (app.maxAppAttempts == 1) {
        // If the user explicitly set the attempts to 1 then there are likely
        // correctness issues if the AM restarts for any reason.
        LOG.info("Max app attempts is 1 for " + app.applicationId
            + ", preventing further attempts.");
        numberOfFailure = app.maxAppAttempts;
      } 
{code}
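To illustrate the effect, the following is a simplified, self-contained model of this decision. It is only a sketch, not the actual RM code. The class, the `ExitReason` enum and both methods are hypothetical, and the model assumes that ABORTED attempts are the only ones excluded from the failure count.

{code:java}
import java.util.List;

/** Hypothetical, simplified model of how the RM decides whether to fail an app. */
final class AttemptFailureModel {

  /** Hypothetical exit reasons; only the ABORTED vs. other distinction matters. */
  enum ExitReason { APP_ERROR, ABORTED }

  /** ABORTED attempts (e.g. node decommissioned) are not counted as failures. */
  static int countFailedAttempts(List<ExitReason> attempts) {
    int failed = 0;
    for (ExitReason reason : attempts) {
      if (reason != ExitReason.ABORTED) {
        failed++;
      }
    }
    return failed;
  }

  /** Returns true if the application should be failed instead of retried. */
  static boolean shouldFailApp(List<ExitReason> attempts, int maxAppAttempts) {
    int numberOfFailure = countFailedAttempts(attempts);
    if (maxAppAttempts == 1) {
      // Mirrors the excerpt above: with a single allowed attempt, the failure
      // count is forced to the maximum, regardless of why the AM exited.
      numberOfFailure = maxAppAttempts;
    }
    return numberOfFailure >= maxAppAttempts;
  }
}
{code}

With `maxAppAttempts = 2`, a history containing one ABORTED attempt yields `false`, so a new attempt is launched. With `maxAppAttempts = 1`, the same history fails the application.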

Livy sets the number of attempts to 1 because its RPC server does not yet 
support multiple connections for the same registered app. But in our case the AM 
was ABORTED before it even started (the AM container was in ACQUIRED state).

Users usually won't decommission a node where the AM container is in RUNNING 
state (i.e. where the session is already established). But decommissioning can 
happen on nodes where the AM container is still in ACQUIRED or ALLOCATED state.
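This distinction can be sketched as a small predicate. The sketch is illustrative only and is not existing YARN code. The enum lists just the three container states mentioned here, and `isSafeToRetryAfterAbort` is a hypothetical name.

{code:java}
/** Hypothetical helper: is a retry harmless after the AM container was ABORTED? */
final class AbortedAmRetryCheck {

  /** Only the container states discussed in this report. */
  enum AmContainerState { ALLOCATED, ACQUIRED, RUNNING }

  static boolean isSafeToRetryAfterAbort(AmContainerState stateWhenAborted) {
    switch (stateWhenAborted) {
      case ALLOCATED:
      case ACQUIRED:
        // The AM process never started, so no client session was registered.
        return true;
      case RUNNING:
      default:
        // The AM may already have registered a session (e.g. with Livy's RPC server).
        return false;
    }
  }
}
{code}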

I suggest exposing a config with which the user can decide whether an ABORTED 
AM should be considered a failure when the maximum number of attempts is 1.
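The following is a minimal sketch of how such a config could be consulted. It assumes a hypothetical property key (no such key exists in YARN today) and uses a plain `Map<String, String>` to stand in for Hadoop's `Configuration`. The default of `true` keeps the current behavior, so only users who opt in (e.g. Livy deployments) would get a retry after an ABORTED single attempt.

{code:java}
import java.util.Map;

/** Hypothetical sketch of the proposed opt-out for single-attempt apps. */
final class SingleAttemptPolicy {

  /** Hypothetical property key; not an existing YARN setting. */
  static final String COUNT_ABORTED_AS_FAILURE_KEY =
      "yarn.resourcemanager.am.single-attempt.count-aborted-as-failure";

  /** Default preserves today's behavior: the single attempt ending fails the app. */
  static final boolean COUNT_ABORTED_AS_FAILURE_DEFAULT = true;

  /**
   * @param numFailedAttempts attempts counted as failures (ABORTED ones excluded)
   * @param lastAttemptAborted whether the latest AM container exited as ABORTED
   */
  static boolean shouldFailApp(int numFailedAttempts, int maxAppAttempts,
      boolean lastAttemptAborted, Map<String, String> conf) {
    boolean countAborted = Boolean.parseBoolean(conf.getOrDefault(
        COUNT_ABORTED_AS_FAILURE_KEY, String.valueOf(COUNT_ABORTED_AS_FAILURE_DEFAULT)));
    int numberOfFailure = numFailedAttempts;
    // Only skip the "max attempts is 1" short-circuit when the user opted out
    // and the attempt ended because its AM container was ABORTED.
    if (maxAppAttempts == 1 && (countAborted || !lastAttemptAborted)) {
      numberOfFailure = maxAppAttempts;
    }
    return numberOfFailure >= maxAppAttempts;
  }
}
{code}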



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
