Yesha Vora created YARN-8580:
--------------------------------

             Summary: yarn.resourcemanager.am.max-attempts is not respected for yarn services
                 Key: YARN-8580
                 URL: https://issues.apache.org/jira/browse/YARN-8580
             Project: Hadoop YARN
          Issue Type: Bug
          Components: yarn-native-services
    Affects Versions: 3.1.1
            Reporter: Yesha Vora


1) Set the maximum number of AM attempts to 100 on all nodes (including the gateway).
{code}
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>100</value>
</property>
{code}
2) Start a YARN service application (HBase tarball).
3) Kill the AM 20 times.

The application then fails with the diagnostics below, even though the configured limit of 100 attempts has not been reached.

{code}
bash-4.2$ /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status application_1532481557746_0001
18/07/25 18:43:34 INFO client.AHSProxy: Connecting to Application History server at xxx/xxx:10200
18/07/25 18:43:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
18/07/25 18:43:34 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.0.0.0-1634/0/resource-types.xml
Application Report : 
        Application-Id : application_1532481557746_0001
        Application-Name : hbase-tarball-lr
        Application-Type : yarn-service
        User : hbase
        Queue : default
        Application Priority : 0
        Start-Time : 1532481864863
        Finish-Time : 1532522943103
        Progress : 100%
        State : FAILED
        Final-State : FAILED
        Tracking-URL : https://xxx:8090/cluster/app/application_1532481557746_0001
        RPC Port : -1
        AM Host : N/A
        Aggregate Resource Allocation : 252150112 MB-seconds, 164141 vcore-seconds
        Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
        Log Aggregation Status : SUCCEEDED
        Diagnostics : Application application_1532481557746_0001 failed 20 times (global limit =100; local limit is =20) due to AM Container for appattempt_1532481557746_0001_000020 exited with  exitCode: 137
Failing this attempt.Diagnostics: [2018-07-25 12:49:00.784]Container killed on request. Exit code is 137
[2018-07-25 12:49:03.045]Container exited with a non-zero exit code 137.
[2018-07-25 12:49:03.045]Killed by external signal
For more detailed output, check the application tracking page: https://xxx:8090/cluster/app/application_1532481557746_0001 Then click on links to logs of each attempt.
. Failing the application.
        Unmanaged Application : false
        Application Node Label Expression : <Not set>
        AM container Node Label Expression : <DEFAULT_PARTITION>
        TimeoutType : LIFETIME  ExpiryTime : 2018-07-25T22:26:15.419+0000  RemainingTime : 0seconds
{code}
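
The diagnostics report a global limit of 100 and a local limit of 20, and the application failed on its 20th attempt, which suggests that the per-application limit took precedence over yarn.resourcemanager.am.max-attempts. Below is a minimal sketch of that presumed rule. It assumes the per-application value wins whenever it is positive and does not exceed the global one. The class and method names are hypothetical and are not Hadoop APIs.

```java
// Hypothetical sketch of the presumed limit selection; not Hadoop source code.
class AmAttemptLimit {

    /**
     * Returns the attempt limit assumed to apply to a single application.
     *
     * @param globalLimit value of yarn.resourcemanager.am.max-attempts
     * @param perAppLimit limit requested for the application; 0 or less is
     *                    treated as "not set" (an assumption of this sketch)
     */
    static int effectiveMaxAttempts(int globalLimit, int perAppLimit) {
        // An unset or too-large per-application limit falls back to the global one.
        if (perAppLimit <= 0 || perAppLimit > globalLimit) {
            return globalLimit;
        }
        // Otherwise the smaller, per-application limit applies.
        return perAppLimit;
    }

    public static void main(String[] args) {
        // Values from the diagnostics above: global limit 100, local limit 20.
        System.out.println(effectiveMaxAttempts(100, 20));
        // With no per-application limit, the global limit applies.
        System.out.println(effectiveMaxAttempts(100, 0));
    }
}
```

If this assumption holds, raising yarn.resourcemanager.am.max-attempts alone would not change the outcome; the local limit of 20 would also have to be raised.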


