[ https://issues.apache.org/jira/browse/YARN-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Botong Huang updated YARN-6093:
-------------------------------
    Description: 
AMRMProxy uses an expired AMRMToken to talk to the RM, leading to an "Invalid 
AMRMToken" exception. The bug is triggered when both conditions are met: 
1. The RM rolls its master key and renews the AMRMToken for a running AM.
2. The existing RPC connection between AMRMProxy and the RM drops, and we 
attempt to reconnect via failover in FederationRMFailoverProxyProvider. 

Here's what happens: 

In DefaultRequestInterceptor.init(), we create a proxy ugi, load it with the 
initial AMRMToken issued by the RM, and use it to instantiate rmClient. Then we 
arrive at FederationRMFailoverProxyProvider.init(), where a full copy of the 
ugi's tokens is saved locally, the actual RM proxy is created, and the RPC 
connection is set up. 
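For illustration, here is a minimal sketch of that init flow. The class and 
method shapes are mine, not the exact AMRMProxy source; only the 
UserGroupInformation and ClientRMProxy calls are real Hadoop APIs. 

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.api.ApplicationMasterProtocol;
import org.apache.hadoop.yarn.client.ClientRMProxy;
import org.apache.hadoop.yarn.security.AMRMTokenIdentifier;

public class InitFlowSketch {
  // Create the proxy ugi, load the initial AMRMToken, and build rmClient
  // as that user, as DefaultRequestInterceptor.init() does.
  static ApplicationMasterProtocol initRmClient(final Configuration conf,
      String user, Token<AMRMTokenIdentifier> initialAMRMToken)
      throws Exception {
    UserGroupInformation proxyUgi =
        UserGroupInformation.createRemoteUser(user);
    proxyUgi.addToken(initialAMRMToken);
    // FederationRMFailoverProxyProvider.init() runs inside this call: it
    // copies the ugi's current tokens into its local originalTokens and
    // sets up the RPC connection to the RM.
    return proxyUgi.doAs(
        new PrivilegedExceptionAction<ApplicationMasterProtocol>() {
          @Override
          public ApplicationMasterProtocol run() throws Exception {
            return ClientRMProxy.createRMProxy(conf,
                ApplicationMasterProtocol.class);
          }
        });
  }
}
{code}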

Later, when the RM rolls its master key and issues a new AMRMToken, 
DefaultRequestInterceptor.updateAMRMToken() adds it to the proxy ugi. 

However, the new token is never used until the existing RPC connection between 
AMRMProxy and the RM drops for some other reason (say, the master RM crashes). 

When we try to reconnect, the service name of the new AMRMToken has not been 
set correctly in DefaultRequestInterceptor.updateAMRMToken(), so the RPC layer 
finds no valid AMRMToken when setting up the new connection. We first hit a 
"Client cannot authenticate via:[TOKEN]" exception. This is expected. 
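A minimal sketch of the update step and the missing piece, assuming a 
simplified method shape (not the exact source): 

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.client.ClientRMProxy;
import org.apache.hadoop.yarn.security.AMRMTokenIdentifier;

public class TokenUpdateSketch {
  // The renewed token is added to the proxy ugi, but its service field is
  // left empty, so the RPC client cannot match it against the RM address
  // when a brand-new connection is attempted.
  static void updateAMRMToken(UserGroupInformation proxyUgi,
      Configuration conf, Token<AMRMTokenIdentifier> renewedToken) {
    proxyUgi.addToken(renewedToken);
    // The missing step: bind the token to the AMRM service address so the
    // RPC client can select it, e.g.
    // renewedToken.setService(ClientRMProxy.getAMRMTokenService(conf));
  }
}
{code}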

Next, FederationRMFailoverProxyProvider fails over: we reset the token service 
via ClientRMProxy.getRMAddress() and reconnect. This should have worked. 

However, since DefaultRequestInterceptor does not use the proxy user for later 
calls to rmClient, we are not running as the proxy user when 
FederationRMFailoverProxyProvider performs the failover. Currently the code 
works around this by reloading the current ugi with all the tokens saved 
locally in originalTokens, in method addOriginalTokens(). The problem is that 
the original AMRMToken it loads is no longer accepted by the RM, so we keep 
hitting the "Invalid AMRMToken" exception until the AM fails. 
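Paraphrased as code, the current workaround looks roughly like this (the field 
and method names follow the description above; the body is a sketch, not the 
actual source): 

{code:java}
import java.util.Collection;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.security.token.TokenIdentifier;

public class FailoverSketch {
  // Tokens copied from the proxy ugi back in init(). The AMRMToken in here
  // goes stale as soon as the RM rolls its master key.
  private final Collection<Token<? extends TokenIdentifier>> originalTokens;

  FailoverSketch(Collection<Token<? extends TokenIdentifier>> tokensAtInit) {
    this.originalTokens = tokensAtInit;
  }

  // On failover we are no longer running as the proxy user, so the current
  // ugi is reloaded with the locally saved tokens, including the original,
  // now-expired AMRMToken, which the RM keeps rejecting.
  private void addOriginalTokens(UserGroupInformation currentUser) {
    for (Token<? extends TokenIdentifier> token : originalTokens) {
      currentUser.addToken(token);
    }
  }
}
{code}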

The correct approach is to save the original ugi itself rather than copies of 
its tokens. Every time we perform a failover and create a new RM proxy, we use 
the original ugi, which is always loaded with the up-to-date AMRMToken. 
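A sketch of this proposed approach, with illustrative field and method names: 

{code:java}
import java.io.IOException;
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.ApplicationMasterProtocol;
import org.apache.hadoop.yarn.client.ClientRMProxy;

public class FixSketch {
  // Saved once in init(): the proxy ugi itself, not a copy of its tokens.
  // updateAMRMToken() keeps this ugi loaded with the latest AMRMToken.
  private final UserGroupInformation originalUgi;

  FixSketch(UserGroupInformation originalUgi) {
    this.originalUgi = originalUgi;
  }

  // Every failover builds the new RM proxy as the original ugi, so the
  // connection is always attempted with the up-to-date AMRMToken.
  ApplicationMasterProtocol createRMProxy(final Configuration conf)
      throws IOException, InterruptedException {
    return originalUgi.doAs(
        new PrivilegedExceptionAction<ApplicationMasterProtocol>() {
          @Override
          public ApplicationMasterProtocol run() throws IOException {
            return ClientRMProxy.createRMProxy(conf,
                ApplicationMasterProtocol.class);
          }
        });
  }
}
{code}

With this shape there are no locally saved token copies to go stale: the 
single ugi that updateAMRMToken() refreshes is the same one every new proxy is 
created under. 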


> Invalid AMRMToken exception when RM renews AMRMToken and 
> FederationRMFailoverProxyProvider fails over
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-6093
>                 URL: https://issues.apache.org/jira/browse/YARN-6093
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: federation
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Minor
>             Fix For: YARN-2915
>


