Yesha Vora created YARN-8580:
--------------------------------
Summary: yarn.resourcemanager.am.max-attempts is not respected for yarn services
Key: YARN-8580
URL: https://issues.apache.org/jira/browse/YARN-8580
Project: Hadoop YARN
Issue Type: Bug
Components: yarn-native-services
Affects Versions: 3.1.1
Reporter: Yesha Vora
1) Max AM attempts is set to 100 on all nodes (including the gateway).
{code}
<property>
<name>yarn.resourcemanager.am.max-attempts</name>
<value>100</value>
</property>
{code}
2) Start a YARN service application (HBase tarball).
3) Kill the AM 20 times.
The application then fails with the diagnostics below, after 20 attempts rather than the configured 100.
{code}
bash-4.2$ /usr/hdp/current/hadoop-yarn-client/bin/yarn application -status application_1532481557746_0001
18/07/25 18:43:34 INFO client.AHSProxy: Connecting to Application History server at xxx/xxx:10200
18/07/25 18:43:34 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
18/07/25 18:43:34 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.0.0.0-1634/0/resource-types.xml
Application Report :
Application-Id : application_1532481557746_0001
Application-Name : hbase-tarball-lr
Application-Type : yarn-service
User : hbase
Queue : default
Application Priority : 0
Start-Time : 1532481864863
Finish-Time : 1532522943103
Progress : 100%
State : FAILED
Final-State : FAILED
Tracking-URL : https://xxx:8090/cluster/app/application_1532481557746_0001
RPC Port : -1
AM Host : N/A
Aggregate Resource Allocation : 252150112 MB-seconds, 164141 vcore-seconds
Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
Log Aggregation Status : SUCCEEDED
Diagnostics : Application application_1532481557746_0001 failed 20
times (global limit =100; local limit is =20) due to AM Container for
appattempt_1532481557746_0001_000020 exited with exitCode: 137
Failing this attempt.Diagnostics: [2018-07-25 12:49:00.784]Container killed on
request. Exit code is 137
[2018-07-25 12:49:03.045]Container exited with a non-zero exit code 137.
[2018-07-25 12:49:03.045]Killed by external signal
For more detailed output, check the application tracking page:
https://xxx:8090/cluster/app/application_1532481557746_0001 Then click on links
to logs of each attempt.
. Failing the application.
Unmanaged Application : false
Application Node Label Expression : <Not set>
AM container Node Label Expression : <DEFAULT_PARTITION>
TimeoutType : LIFETIME ExpiryTime : 2018-07-25T22:26:15.419+0000
RemainingTime : 0seconds
{code}
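The diagnostics above report two limits, "global limit =100" and "local limit is =20", and the application failed at the local one. The following is a minimal sketch of how the two values appear to combine, assuming the ResourceManager uses the per-application limit when it is positive and does not exceed the global limit, and falls back to the global limit otherwise. The function names are hypothetical and are not YARN API. Where the local value of 20 comes from is also an assumption here: the YARN service framework documents a separate setting, yarn.service.am-restart.max-attempts, with a default of 20, which would explain why raising yarn.resourcemanager.am.max-attempts alone has no effect.

```python
import re


def effective_max_attempts(global_limit: int, local_limit: int) -> int:
    """Hypothetical helper: combine the cluster-wide and per-application limits.

    Assumption: a non-positive or too-large local limit falls back to the
    global limit; otherwise the local limit applies.
    """
    if local_limit <= 0 or local_limit > global_limit:
        return global_limit
    return local_limit


def parse_limits(diagnostics: str):
    """Extract (global, local) limits from a diagnostics string, or None."""
    match = re.search(
        r"global limit\s*=\s*(\d+);\s*local limit is\s*=\s*(\d+)", diagnostics
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Fragment of the diagnostics from this report, joined onto one line.
diag = "failed 20 times (global limit =100; local limit is =20) due to AM Container"
global_limit, local_limit = parse_limits(diag)
print(effective_max_attempts(global_limit, local_limit))  # prints 20
```

Under this assumption, the global value of 100 only acts as an upper bound; the application is failed once the smaller local limit is reached.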