Prabhu Joseph created YARN-11355:
------------------------------------
Summary: YARN Client Failovers immediately to rm2 but takes
~30000ms to rm3
Key: YARN-11355
URL: https://issues.apache.org/jira/browse/YARN-11355
Project: Hadoop YARN
Issue Type: Bug
Components: client
Affects Versions: 3.4.0
Reporter: Prabhu Joseph
Assignee: Prabhu Joseph
The YARN client fails over to rm2 immediately but takes ~30000 ms to fail over
to rm3 during the initial retry.
*Repro:*
{code:java}
1. YARN cluster with three master nodes: rm1, rm2, and rm3
2. rm3 is active
3. yarn node -list, or any other YARN client call, takes more than 30 seconds.
{code}
The initial failover to rm2 is immediate, but the failover to rm3 happens only
after ~30000 ms. The current RetryPolicy does not take the number of master
nodes into account. It should perform at least one immediate failover to every
RM.
{code:java}
2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing
over to rm2
2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler:
java.net.ConnectException: Call From local to remote:8032 failed on connection
exception: java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover
attempts. Trying to failover after sleeping for 21139ms.
{code}
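The requested behaviour can be sketched as a small delay calculation. This is only an illustration, not the Hadoop RetryPolicies code: the class and method names are hypothetical, and the jittered, capped exponential backoff is an assumption that happens to be consistent with the 21139ms sleep in the log above.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch; names and the backoff formula are assumptions,
// not the actual Hadoop implementation.
final class FailoverDelaySketch {

    // Behaviour described in this report: only the very first failover
    // (failovers == 0) is immediate, every later one sleeps.
    static long currentDelayMs(int failovers, long baseMs, long maxMs) {
        return failovers == 0 ? 0L : backoffMs(failovers, baseMs, maxMs);
    }

    // Requested behaviour: fail over immediately until every RM has been
    // tried once, i.e. for the first (rmCount - 1) failovers.
    static long proposedDelayMs(int failovers, int rmCount, long baseMs, long maxMs) {
        return failovers < rmCount - 1 ? 0L : backoffMs(failovers, baseMs, maxMs);
    }

    // Assumed backoff: exponential in the failover count, capped at maxMs,
    // then scaled by a random factor in [0.5, 1.5).
    static long backoffMs(int failovers, long baseMs, long maxMs) {
        long capped = Math.min(baseMs * (1L << Math.min(failovers, 30)), maxMs);
        return (long) (capped * (ThreadLocalRandom.current().nextDouble() + 0.5));
    }

    public static void main(String[] args) {
        int rmCount = 3;          // rm1, rm2, rm3
        long intervalMs = 30000L; // retry interval from this report
        for (int failovers = 0; failovers < rmCount; failovers++) {
            System.out.printf("failovers=%d current=%dms proposed=%dms%n",
                failovers,
                currentDelayMs(failovers, intervalMs, intervalMs),
                proposedDelayMs(failovers, rmCount, intervalMs, intervalMs));
        }
    }
}
```

With three RMs, the proposed variant returns 0 for the failovers to rm2 and rm3 and only starts sleeping once the client wraps around to rm1 again.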
*Workaround:*
Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to a small
value such as 100. The failover to rm3 then happens almost immediately, but the
client will retry far too often when no ResourceManager is active.
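
To get a rough feel for the cost of the workaround, the sketch below estimates how many failover attempts a client would make while no ResourceManager is active. It assumes the attempt budget is approximately the total connect wait time divided by the retry interval, and it assumes a total wait of 900000 ms; both are assumptions to check against your own configuration, and the class and method names are hypothetical.

```java
// Hypothetical sketch; the attempts formula and the 900000 ms total wait
// are assumptions, not values taken from this report.
final class RetryBudgetSketch {

    // Approximate number of failover attempts before the client gives up.
    static long approxAttempts(long maxWaitMs, long retryIntervalMs) {
        return maxWaitMs / retryIntervalMs;
    }

    public static void main(String[] args) {
        long maxWaitMs = 900_000L; // assumed total connect wait time
        System.out.println("interval 30000 ms: " + approxAttempts(maxWaitMs, 30_000L) + " attempts");
        System.out.println("interval 100 ms: " + approxAttempts(maxWaitMs, 100L) + " attempts");
    }
}
```

Under these assumptions the attempt count grows from 30 to 9000, which is why lowering the interval is only a stopgap.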