Kevin Wikant created YARN-11210:
-----------------------------------

             Summary: Fix YARN RMAdminCLI retry logic for non-retryable 
kerberos configuration exception
                 Key: YARN-11210
                 URL: https://issues.apache.org/jira/browse/YARN-11210
             Project: Hadoop YARN
          Issue Type: Bug
          Components: client
            Reporter: Kevin Wikant


h2. Description of Problem

Applications that call the YARN RMAdminCLI (i.e. the YARN ResourceManager client) synchronously can be blocked for up to 15 minutes with the default configuration of "yarn.resourcemanager.connect.max-wait.ms". This is not an issue in and of itself, but a non-retryable IllegalArgumentException thrown within the YARN ResourceManager client is getting swallowed & treated as a retryable "connection exception", meaning it gets retried for 15 minutes.

The purpose of this JIRA (and PR) is to modify the YARN client so that it does 
not retry on this non-retryable exception.
h2. Background Information

The YARN ResourceManager client treats connection exceptions as retryable and, with the default value of "yarn.resourcemanager.connect.max-wait.ms", will attempt to connect to the ResourceManager for up to 15 minutes when facing "connection exceptions". This arguably makes sense because connection exceptions are in some cases transient & can be recovered from without any action needed from the client. See the example below, where the YARN ResourceManager client was able to recover from connection issues caused by the ResourceManager process being down.
{quote}> yarn rmadmin -refreshNodes

22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at 
/0.0.0.0:8033
22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:40:37 INFO retry.RetryInvocationHandler: java.net.ConnectException: 
Your endpoint configuration is wrong; For more details see:  
http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
1 failover attempts. Trying to failover after sleeping for 41061ms.
22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:41:28 INFO retry.RetryInvocationHandler: java.net.ConnectException: 
Your endpoint configuration is wrong; For more details see:  
http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
2 failover attempts. Trying to failover after sleeping for 25962ms.

** Success is silent in client logs, but can be seen in the ResourceManager 
logs **
{quote}
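For reference, the 15 minute window seen above is controlled by "yarn.resourcemanager.connect.max-wait.ms" together with "yarn.resourcemanager.connect.retry-interval.ms". Below is a minimal sketch (assuming the corresponding YarnConfiguration constants) of how a caller could shrink that window; the values are illustrative only, not recommendations.
{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public final class ConnectWindowExample {

  /** Builds a configuration with a much shorter RM connect window. */
  public static YarnConfiguration shortConnectWindow() {
    YarnConfiguration conf = new YarnConfiguration();
    // Total time spent retrying "connection exceptions" (default is 15 minutes).
    conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 60_000L);
    // Delay between rounds of connection attempts (default is 30 seconds).
    conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS, 10_000L);
    return conf;
  }

  private ConnectWindowExample() {
  }
}
{code}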
Then there are cases where the YARN ResourceManager client will stop retrying because it has encountered a non-retryable exception. Some examples:
 * client is configured with SIMPLE auth when the ResourceManager is configured with KERBEROS auth
 ** this RemoteException is not a transient failure & will not recover without the client taking action to modify its configuration, which is why it fails immediately
 ** the exception comes from the ResourceManager server-side & will occur once the client successfully calls the ResourceManager

{quote}> yarn rmadmin -refreshNodes

22/07/12 15:20:33 INFO client.RMProxy: Connecting to ResourceManager at 
/10.0.0.106:8033
refreshNodes: org.apache.hadoop.security.AccessControlException: SIMPLE 
authentication is not enabled.  Available:[KERBEROS]
{quote} * client & server are configured with KERBEROS auth but the client has not run kinit
 ** this SaslException is not a transient failure & will not recover without the client taking action to modify its configuration, which is why it fails immediately
 ** the exception comes from the client-side & will occur before the client even attempts to call the ResourceManager

{quote}> yarn rmadmin -refreshNodes

22/07/12 15:20:33 INFO client.RMProxy: Connecting to ResourceManager at 
/10.0.0.106:8033
22/07/12 15:20:33 WARN ipc.Client: Exception encountered while connecting to 
the server
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
        at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
        at 
org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
        at 
org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
        at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:820)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:820)
        at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617)
        at org.apache.hadoop.ipc.Client.call(Client.java:1448)
        at org.apache.hadoop.ipc.Client.call(Client.java:1401)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
        at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
        at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed 
to find any Kerberos tgt)
        at 
sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:162)
        at 
sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
        at 
sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:189)
        at 
sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
        at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
        at 
sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
        at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
        ... 34 more
refreshNodes: Failed on local exception: java.io.IOException: 
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]; Host Details : local host is: "ip-10-0-0-106/10.0.0.106"; destination 
host is: "ip-10-0-0-106.us-west-2.compute.internal":8033;
{quote}
h2. The Problem

When the client has:
 * kerberos enabled by setting "hadoop.security.authentication = kerberos" in 
"core-site.xml"
 * a bad kerberos configuration where "yarn.resourcemanager.principal" is unset 
or malformed in "yarn-site.xml"

A client with this bad configuration can never successfully connect to the ResourceManager, so the failure should be non-retryable.
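The misconfiguration can be reproduced with a client-side configuration like the following hypothetical sketch (property names as described above), where Kerberos is enabled but the ResourceManager principal is never provided:
{code:java}
import org.apache.hadoop.conf.Configuration;

public final class BadKerberosConfigExample {

  /**
   * Builds the broken client configuration described above: Kerberos auth is
   * enabled, but "yarn.resourcemanager.principal" is left unset, so the SASL
   * client cannot determine the server's Kerberos principal name and throws
   * an IllegalArgumentException.
   */
  public static Configuration badClientConfig() {
    Configuration conf = new Configuration();
    // core-site.xml equivalent
    conf.set("hadoop.security.authentication", "kerberos");
    // yarn-site.xml equivalent: "yarn.resourcemanager.principal" intentionally
    // left unset (or malformed), which is the non-recoverable condition.
    return conf;
  }

  private BadKerberosConfigExample() {
  }
}
{code}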

When the YARN ResourceManager client has this bad configuration, an IllegalArgumentException is thrown (in org.apache.hadoop.security.SaslRpcClient) but is then swallowed by an IOException (in org.apache.hadoop.ipc.Client) that gets treated as a retryable failure & is therefore retried for 15 minutes:
{quote}> yarn rmadmin -refreshNodes

22/06/28 14:23:45 INFO client.RMProxy: Connecting to ResourceManager at 
/0.0.0.0:8033
22/06/28 14:23:46 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:23:47 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:23:48 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
22/06/28 14:23:54 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 8 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:23:55 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:23:56 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:23:57 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:23:58 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
22/06/28 14:24:04 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 8 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:24:05 INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is 
RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
22/06/28 14:24:05 INFO retry.RetryInvocationHandler: java.net.ConnectException: 
Your endpoint configuration is wrong; For more details see:  
http://wiki.apache.org/hadoop/UnsetHostnameOrPort, while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
1 failover attempts. Trying to failover after sleeping for 27166ms.
22/06/28 14:24:32 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "ip-10-0-0-4/10.0.0.4"; 
destination host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
2 failover attempts. Trying to failover after sleeping for 22291ms.
22/06/28 14:24:54 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "ip-10-0-0-4/10.0.0.4"; 
destination host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
3 failover attempts. Trying to failover after sleeping for 24773ms.
22/06/28 14:25:19 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "ip-10-0-0-4/10.0.0.4"; 
destination host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
4 failover attempts. Trying to failover after sleeping for 39187ms.
...
22/06/28 14:36:50 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "ip-10-0-0-4/10.0.0.4"; 
destination host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
26 failover attempts. Trying to failover after sleeping for 26235ms.
22/06/28 14:37:16 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "ip-10-0-0-4/10.0.0.4"; 
destination host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
27 failover attempts. Trying to failover after sleeping for 40535ms.
22/06/28 14:37:57 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "ip-10-0-0-4/10.0.0.4"; 
destination host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
28 failover attempts. Trying to failover after sleeping for 26721ms.
22/06/28 14:38:23 INFO retry.RetryInvocationHandler: java.io.IOException: 
Failed on local exception: java.io.IOException: Couldn't set up IO streams: 
java.lang.IllegalArgumentException: Failed to specify server's Kerberos 
principal name; Host Details : local host is: "ip-10-0-0-4/10.0.0.4"; 
destination host is: "0.0.0.0":8033; , while invoking 
ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null after 
29 failover attempts. Trying to failover after sleeping for 27641ms.
refreshNodes: Failed on local exception: java.io.IOException: Couldn't set up 
IO streams: java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name; Host Details : local host is: "ip-10-0-0-4/10.0.0.4"; 
destination host is: "0.0.0.0":8033;
{quote}
This non-retryable failure should not be treated as a retryable "connection failure".
h2. The Solution

Surface the IllegalArgumentException to the RetryInvocationHandler & have YARN RMProxy treat IllegalArgumentException as non-retryable.

Note that surfacing the IllegalArgumentException has the side-effect of causing the [command usage to be printed here|https://github.com/apache/hadoop/blob/c0bdba8face85fbd40f5d7ba46af11e24a8ef25b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java#L790]:
{quote}...
refreshNodes: Failed to specify server's Kerberos principal name
Usage: yarn rmadmin [-refreshNodes [-g|graceful [timeout in seconds] 
-client|server]]

Generic options supported are:
-conf <configuration file>        specify an application configuration file
-D <property=value>               define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, 
overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port>  specify a ResourceManager
-files <file1,...>                specify a comma-separated list of files to be 
copied to the map reduce cluster
-libjars <jar1,...>               specify a comma-separated list of jar files 
to be included in the classpath
-archives <archive1,...>          specify a comma-separated list of archives to 
be unarchived on the compute machines

The general command line syntax is:
command [genericOptions] [commandOptions]
{quote}
To resolve this issue, the IllegalArgumentException is wrapped in & surfaced as a KerberosAuthException; this was chosen because KerberosAuthException is already treated as non-retryable in FailoverOnNetworkExceptionRetry.
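A minimal sketch of the idea (with assumed helper names, not the exact patch) is shown below: where the SASL connection setup currently lets the IllegalArgumentException escape and be wrapped in a generic, retryable IOException, it is instead wrapped in a KerberosAuthException so the RetryInvocationHandler fails fast.
{code:java}
import java.io.IOException;

import org.apache.hadoop.security.KerberosAuthException;

public final class SaslSetupSketch {

  /** Hypothetical callback standing in for the real SASL connection setup. */
  interface SaslSetup {
    void connect() throws IOException;
  }

  static void setupSaslConnection(SaslSetup setup) throws IOException {
    try {
      setup.connect();
    } catch (IllegalArgumentException e) {
      // e.g. "Failed to specify server's Kerberos principal name".
      // Previously this surfaced as "Couldn't set up IO streams" inside a
      // generic IOException and was retried for up to 15 minutes; wrapping it
      // in KerberosAuthException lets FailoverOnNetworkExceptionRetry fail fast.
      throw new KerberosAuthException("Bad Kerberos configuration: " + e.getMessage(), e);
    }
  }

  private SaslSetupSketch() {
  }
}
{code}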

Note that in terms of RetryPolicy:
 * non-HA YARN ResourceManager client should use 
OtherThanRemoteExceptionDependentRetry (but because of a bug uses 
FailoverOnNetworkExceptionRetry)
 * HA YARN ResourceManager client uses FailoverOnNetworkExceptionRetry

The result of this change is a much quicker failure when the YARN client is 
misconfigured:
 * non-HA YARN ResourceManager client 

{quote}> yarn rmadmin -refreshNodes

22/07/13 17:36:03 INFO client.RMProxy: Connecting to ResourceManager at 
/10.0.200.11:8033
22/07/13 17:36:03 WARN ipc.Client: Exception encountered while connecting to 
the server
javax.security.sasl.SaslException: Bad Kerberos server principal configuration 
[Caused by java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name]
        at 
org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:237)
        at 
org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
        at 
org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:397)
        at 
org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:630)
        at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:424)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:821)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
        at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:424)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1612)
        at org.apache.hadoop.ipc.Client.call(Client.java:1442)
        at org.apache.hadoop.ipc.Client.call(Client.java:1395)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
        at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
        at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
Caused by: java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name
        at 
org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:332)
        at 
org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:233)
        ... 35 more
refreshNodes: Failed on local exception: java.io.IOException: 
javax.security.sasl.SaslException: Bad Kerberos server principal configuration 
[Caused by java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name]; Host Details : local host is: 
"ip-10-0-200-11/10.0.200.11"; destination host is: 
"ip-10-0-200-11.us-west-2.compute.internal":8033;
{quote} * HA YARN ResourceManager client

{quote}> yarn rmadmin -refreshNodes

22/07/13 17:37:50 INFO client.RMProxy: Connecting to ResourceManager at 
/10.0.200.11:8033
22/07/13 17:37:50 WARN ipc.Client: Exception encountered while connecting to 
the server
javax.security.sasl.SaslException: Bad Kerberos server principal configuration 
[Caused by java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name]
        at 
org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:237)
        at 
org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
        at 
org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:397)
        at 
org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:630)
        at org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:424)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
        at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:821)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
        at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
        at org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:424)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1612)
        at org.apache.hadoop.ipc.Client.call(Client.java:1442)
        at org.apache.hadoop.ipc.Client.call(Client.java:1395)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
        at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
        at 
org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
        at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
        at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
        at org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at 
org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
Caused by: java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name
        at 
org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:332)
        at 
org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:233)
        ... 35 more
refreshNodes: Failed on local exception: java.io.IOException: 
javax.security.sasl.SaslException: Bad Kerberos server principal configuration 
[Caused by java.lang.IllegalArgumentException: Failed to specify server's 
Kerberos principal name]; Host Details : local host is: 
"ip-10-0-200-11/10.0.200.11"; destination host is: 
"ip-10-0-200-11.us-west-2.compute.internal":8033;
{quote}
h2. Other Notes

The YARN RMProxy will return separate RetryPolicies for HA & non-HA, but the 
YARN client will always use the HA policy because a configuration related to 
[Federation 
Failover|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L102]
 is [enabled by 
default|https://github.com/apache/hadoop/blob/e044a46f97dcc7998dc0737f15cf3956dca170c4/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java#L3901].
 This is presumably a bug because YARN Federation is not enabled for the 
cluster I am testing on.

The fix is to modify HAUtil.isFederationFailoverEnabled to also check "yarn.federation.enabled" (default false), in addition to checking "yarn.federation.failover.enabled" (default true).


