[ 
https://issues.apache.org/jira/browse/YARN-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-11355:
---------------------------------
    Description: 
The YARN client fails over to rm2 immediately but takes ~30000 ms to fail over 
to rm3 during the initial retry.

*Repro:*
{code:java}
1. YARN cluster with three master nodes: rm1, rm2 and rm3
2. rm3 is active
3. "yarn node -list" or any other YARN client call takes more than 30 seconds.
 {code}
The initial failover to rm2 is immediate, but the failover to rm3 happens only 
after ~30000 ms. The current RetryPolicy does not take the number of master 
nodes into account. It should perform at least one immediate failover to every RM.
{code:java}
2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing 
over to rm2
2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: 
java.net.ConnectException: Call From local to remote:8032 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover 
attempts. Trying to failover after sleeping for 21139ms.
{code}
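
The requested behaviour can be sketched as follows. This is a minimal, hypothetical illustration and not Hadoop's actual RetryPolicy code: the class FailoverDelaySketch and the method sleepBeforeFailoverMs are invented names, and a fixed delay stands in for the randomized back-off visible in the log above. The only idea it shows is that the sleep stays at zero until every configured RM has been tried once.

```java
// Hypothetical sketch; not part of Hadoop. All names are illustrative only.
class FailoverDelaySketch {

    /**
     * Returns how long to sleep before the next failover.
     *
     * @param failoversSoFar  failovers already performed for this call
     * @param rmCount         number of configured ResourceManagers
     * @param retryIntervalMs delay to apply once every RM has been tried
     */
    static long sleepBeforeFailoverMs(int failoversSoFar, int rmCount, long retryIntervalMs) {
        if (failoversSoFar < 0 || rmCount < 1 || retryIntervalMs < 0) {
            throw new IllegalArgumentException(
                "failoversSoFar and retryIntervalMs must be >= 0, rmCount must be >= 1");
        }
        // The client starts on one RM, so rmCount - 1 failovers are needed to
        // reach every other RM once. Do not sleep until that has happened.
        if (failoversSoFar < rmCount - 1) {
            return 0L;
        }
        // Every RM has been tried at least once: back off before cycling again.
        return retryIntervalMs;
    }

    public static void main(String[] args) {
        int rmCount = 3;                 // rm1, rm2 and rm3, as in the repro
        long retryIntervalMs = 30_000L;  // retry interval from the report
        for (int failovers = 0; failovers < 4; failovers++) {
            System.out.println("failovers so far=" + failovers + " -> sleep "
                + sleepBeforeFailoverMs(failovers, rmCount, retryIntervalMs) + " ms");
        }
    }
}
```

With three RMs, the first two failovers (rm1 to rm2, rm2 to rm3) would not sleep under this sketch; only the failover that wraps around to rm1 waits for the retry interval.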
 

*Workaround:*

Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to a small 
value such as 100. This makes the failover to rm3 effectively immediate, but it 
causes too many retries when there is no active ResourceManager.
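
The cost of the workaround can be estimated with a rough calculation. The sketch below assumes that the client's failover attempt budget is approximately the maximum connect wait divided by the retry interval, and that the maximum wait (yarn.resourcemanager.connect.max-wait.ms) is left at 900000 ms; both are assumptions for illustration and should be checked against the Hadoop version in use. RetryBudgetSketch and approxAttempts are hypothetical names.

```java
// Hypothetical sketch; the budget formula and the 900000 ms value are assumptions.
class RetryBudgetSketch {

    /** Approximate number of failover attempts before the client gives up. */
    static long approxAttempts(long maxWaitMs, long retryIntervalMs) {
        if (maxWaitMs < 0 || retryIntervalMs <= 0) {
            throw new IllegalArgumentException(
                "maxWaitMs must be >= 0 and retryIntervalMs must be > 0");
        }
        return maxWaitMs / retryIntervalMs;
    }

    public static void main(String[] args) {
        long maxWaitMs = 900_000L; // assumed value of connect.max-wait.ms
        System.out.println("interval 30000 ms -> ~"
            + approxAttempts(maxWaitMs, 30_000L) + " attempts");
        System.out.println("interval 100 ms   -> ~"
            + approxAttempts(maxWaitMs, 100L) + " attempts");
    }
}
```

Under these assumptions, lowering the interval from 30000 to 100 multiplies the number of attempts made against an unavailable cluster by 300, which is the retry overhead the workaround trades for a faster failover.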
 

 

  was:
YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3 during 
initial retry.

*Repro:*
{code:java}
1. YARN Cluster with three master nodes rm1,rm2 and rm3
2. rm3 is active
3. yarn node -list or any other yarn client calls takes more than 30 seconds.
 {code}
The initial failover to rm2 is immediate but then the failover to rm3 is after 
~30000 ms. Current RetryPolicy does not honor the number of master nodes. It 
has to perform atleast one immediate failover to every rm.
{code:java}
2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing 
over to rm2
2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: 
java.net.ConnectException: Call From local to remote:8032 failed on connection 
exception: java.net.ConnectException: Connection refused; For more details see: 
 http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover 
attempts. Trying to failover after sleeping for 21139ms.
{code}
 

*{*}Workaround:{*}*

Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to like 100. 
This will do immediate failover to rm3 but there will be too many retries when 
there is no active resourcemanager.
 

 


> YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
> ------------------------------------------------------------------
>
>                 Key: YARN-11355
>                 URL: https://issues.apache.org/jira/browse/YARN-11355
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 3.4.0
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Major
>
> The YARN client fails over to rm2 immediately but takes ~30000 ms to fail 
> over to rm3 during the initial retry.
> *Repro:*
> {code:java}
> 1. YARN cluster with three master nodes: rm1, rm2 and rm3
> 2. rm3 is active
> 3. "yarn node -list" or any other YARN client call takes more than 30 seconds.
>  {code}
> The initial failover to rm2 is immediate, but the failover to rm3 happens 
> only after ~30000 ms. The current RetryPolicy does not take the number of 
> master nodes into account. It should perform at least one immediate failover 
> to every RM.
> {code:java}
> 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: 
> Failing over to rm2
> 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: 
> java.net.ConnectException: Call From local to remote:8032 failed on 
> connection exception: java.net.ConnectException: Connection refused; For more 
> details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 
> failover attempts. Trying to failover after sleeping for 21139ms.
> {code}
>  
> *Workaround:*
> Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to a small 
> value such as 100. This makes the failover to rm3 effectively immediate, but 
> it causes too many retries when there is no active ResourceManager.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
