[ 
https://issues.apache.org/jira/browse/YARN-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-10871:
---------------------------------
    Description: 
When an AM container is ABORTED due to node decommission, the AppAttempt 
failure is not counted. But if the user sets the maximum number of attempts 
to 1, YARN treats the ABORTED AM as a failure. 

{code}
      int numberOfFailure = app.getNumFailedAppAttempts();
      if (app.maxAppAttempts == 1) {
        // If the user explicitly set the attempts to 1 then there are likely
        // correctness issues if the AM restarts for any reason.
        LOG.info("Max app attempts is 1 for " + app.applicationId
            + ", preventing further attempts.");
        numberOfFailure = app.maxAppAttempts;
      } 
{code}

Livy sets the number of attempts to 1 because its RPC server does not yet 
support multiple connections for the same registered app. But in our case the 
AM container was ABORTED before the AM even started (it was in ACQUIRED state).

Usually users won't decommission a node where the container is in RUNNING 
state (where the session is established). But a decommission can happen on 
nodes where the container is in ACQUIRED or ALLOCATED state. 

I suggest exposing a config option that lets the user decide whether to count 
this as a failure. 
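
A minimal sketch of how such an option could gate the override, not the actual
RMAppImpl change. The class name, the method shouldFailApp, and the flag
treatAbortAsFailure are hypothetical names used only for illustration. The
final comparison (failures >= max attempts) is an assumption about how the
result of the snippet above is used. Passing treatAbortAsFailure = true
reproduces the current behavior.

```java
// Hypothetical, simplified model of the decision; these are not YARN's real types.
class AbortedAmPolicySketch {

    /**
     * Returns true if the application should be failed after its latest attempt ended.
     *
     * @param maxAppAttempts      maximum attempts configured for the app
     * @param countedFailures     failures already counted (an ABORTED AM is not counted)
     * @param lastAttemptAborted  true if the latest AM container was ABORTED,
     *                            for example because its node was decommissioned
     * @param treatAbortAsFailure hypothetical config flag; true keeps today's behavior
     */
    static boolean shouldFailApp(int maxAppAttempts, int countedFailures,
                                 boolean lastAttemptAborted, boolean treatAbortAsFailure) {
        int numberOfFailure = countedFailures;
        if (maxAppAttempts == 1) {
            // Existing override: with a single allowed attempt, any ended attempt
            // is treated as a failure. The flag lets an ABORTED AM opt out.
            boolean skipOverride = lastAttemptAborted && !treatAbortAsFailure;
            if (!skipOverride) {
                numberOfFailure = maxAppAttempts;
            }
        }
        // Assumption: the app is failed once failures reach the configured maximum.
        return numberOfFailure >= maxAppAttempts;
    }

    public static void main(String[] args) {
        // Single attempt, AM aborted while its container was still ACQUIRED.
        System.out.println("flag on : fail app = " + shouldFailApp(1, 0, true, true));
        System.out.println("flag off: fail app = " + shouldFailApp(1, 0, true, false));
    }
}
```

With the flag off, an abort in ACQUIRED or ALLOCATED state would lead to a new
attempt, while a genuine AM failure with one allowed attempt still fails the app.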

  was:
When an AM Container is ABORTED due to Node Decommission, the AppAttempt 
failure is not counted. But if user sets number of attempts as 1, then YARN 
considers the ABORTED AM as a failure. 

{code}
      int numberOfFailure = app.getNumFailedAppAttempts();
      if (app.maxAppAttempts == 1) {
        // If the user explicitly set the attempts to 1 then there are likely
        // correctness issues if the AM restarts for any reason.
        LOG.info("Max app attempts is 1 for " + app.applicationId
            + ", preventing further attempts.");
        numberOfFailure = app.maxAppAttempts;
      } 
{code}

Livy sets the number of attempts as 1 since it's Rpc Server does not yet 
support multiple connections for the same registered app. But in our case AM is 
ABORTED before even the AM starts (AM was in ACAUIRED state)

Usually users won't decommission the node where the Container is in RUNNING 
state (where the session is established). But the decommission can happen on 
nodes where the container is in ACQUIRED or ALLOCATED state. 

Will suggest to expose an config where user can decide whether to consider this 
as a failure or not. 


> Aborted AM is considered as App Failure when user sets MaxAttempts as 1
> -----------------------------------------------------------------------
>
>                 Key: YARN-10871
>                 URL: https://issues.apache.org/jira/browse/YARN-10871
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: RM
>    Affects Versions: 3.3.1
>            Reporter: Prabhu Joseph
>            Assignee: Srinivas S T
>            Priority: Major
>
> When an AM container is ABORTED due to node decommission, the AppAttempt 
> failure is not counted. But if the user sets the maximum number of attempts 
> to 1, YARN treats the ABORTED AM as a failure. 
> {code}
>       int numberOfFailure = app.getNumFailedAppAttempts();
>       if (app.maxAppAttempts == 1) {
>         // If the user explicitly set the attempts to 1 then there are likely
>         // correctness issues if the AM restarts for any reason.
>         LOG.info("Max app attempts is 1 for " + app.applicationId
>             + ", preventing further attempts.");
>         numberOfFailure = app.maxAppAttempts;
>       } 
> {code}
> Livy sets the number of attempts to 1 because its RPC server does not yet 
> support multiple connections for the same registered app. But in our case the 
> AM container was ABORTED before the AM even started (it was in ACQUIRED state).
> Usually users won't decommission a node where the container is in RUNNING 
> state (where the session is established). But a decommission can happen on 
> nodes where the container is in ACQUIRED or ALLOCATED state. 
> I suggest exposing a config option that lets the user decide whether to count 
> this as a failure. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
