[
https://issues.apache.org/jira/browse/YARN-11210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan updated YARN-11210:
------------------------------
Hadoop Flags: Reviewed
Target Version/s: 3.4.0
Affects Version/s: 3.4.0
> Fix YARN RMAdminCLI retry logic for non-retryable kerberos configuration
> exception
> ----------------------------------------------------------------------------------
>
> Key: YARN-11210
> URL: https://issues.apache.org/jira/browse/YARN-11210
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Affects Versions: 3.4.0
> Reporter: Kevin Wikant
> Assignee: Kevin Wikant
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 1h 40m
> Remaining Estimate: 0h
>
> h2. Description of Problem
> Applications that call the YARN RMAdminCLI (i.e. the YARN ResourceManager
> client) synchronously can be blocked for up to 15 minutes with the default
> value of "yarn.resourcemanager.connect.max-wait.ms". This is not an issue in
> and of itself, but a non-retryable IllegalArgumentException thrown within the
> YARN ResourceManager client is being swallowed & treated as a retryable
> "connection exception", meaning it gets retried for 15 minutes.
> The purpose of this JIRA (and PR) is to modify the YARN client so that it
> does not retry on this non-retryable exception.
> h2. Background Information
> The YARN ResourceManager client treats connection exceptions as retryable &,
> with the default value of "yarn.resourcemanager.connect.max-wait.ms", will
> attempt to connect to the ResourceManager for up to 15 minutes. This arguably
> makes sense because connection exceptions are in some cases transient & can
> be recovered from without any action needed from the client. See the example
> below, where the YARN ResourceManager client was able to recover from
> connection issues caused by the ResourceManager process being down.
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:40:17 INFO client.RMProxy: Connecting to ResourceManager at
> /0.0.0.0:8033
> 22/06/28 14:40:18 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:40:19 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:40:20 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> ...
> 22/06/28 14:40:27 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:40:28 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:40:29 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> ...
> 22/06/28 14:40:37 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:40:37 INFO retry.RetryInvocationHandler:
> java.net.ConnectException: Your endpoint configuration is wrong; For more
> details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over
> null after 1 failover attempts. Trying to failover after sleeping for 41061ms.
> 22/06/28 14:41:19 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:41:20 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> ...
> 22/06/28 14:41:28 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:41:28 INFO retry.RetryInvocationHandler:
> java.net.ConnectException: Your endpoint configuration is wrong; For more
> details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over
> null after 2 failover attempts. Trying to failover after sleeping for 25962ms.
> >> Success is silent in client logs, but can be seen in the ResourceManager
> >> logs <<
> {quote}
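The fixed-sleep retry behavior visible in the logs above can be sketched as follows. This is an illustrative stand-in, not Hadoop's actual `RetryUpToMaximumCountWithFixedSleep` implementation; the class and method names here are hypothetical:

```java
// Hypothetical sketch of a fixed-sleep retry loop like the one the IPC client
// applies to connection failures (maxRetries=10, sleepTime=1000 ms in the logs).
public class FixedSleepRetry {
    static final int MAX_RETRIES = 10;
    static final long SLEEP_MS = 1000;

    interface Op { void run() throws Exception; }

    static void runWithRetries(Op op) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                op.run();
                return;                                  // success: stop retrying
            } catch (Exception e) {
                if (attempt >= MAX_RETRIES) throw e;     // exhausted: surface the failure
                System.out.println("Already tried " + attempt + " time(s)");
                Thread.sleep(SLEEP_MS);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Succeeds on the third attempt, as a transient connection failure might.
        runWithRetries(() -> {
            if (++calls[0] < 3) throw new java.net.ConnectException("refused");
        });
        System.out.println("succeeded after " + calls[0] + " attempts");
    }
}
```

Note that every exception reaching this loop is retried; the bug described below arises when a permanent misconfiguration error is funneled through this same path.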
> Then there are cases where the YARN ResourceManager client will stop retrying
> because it has encountered a non-retryable exception. Some examples:
> * the client is configured with SIMPLE auth while the ResourceManager is
> configured with KERBEROS auth
> ** this RemoteException is not a transient failure & will not recover without
> the client modifying its configuration, which is why it fails immediately
> ** the exception is thrown server-side by the ResourceManager & occurs once
> the client successfully calls the ResourceManager
>
> {quote}> yarn rmadmin -refreshNodes
> 22/07/12 15:20:33 INFO client.RMProxy: Connecting to ResourceManager at
> /0.0.0.0:8033
> refreshNodes: org.apache.hadoop.security.AccessControlException: SIMPLE
> authentication is not enabled. Available:[KERBEROS]
> {quote}
>
> * the client & server are both configured with KERBEROS auth but the client
> has not run kinit
> ** this SaslException is not a transient failure & will not recover without
> the client modifying its configuration, which is why it fails immediately
> ** the exception is thrown client-side & occurs before the client even
> attempts to call the ResourceManager
> {quote}> yarn rmadmin -refreshNodes
> 22/07/12 15:20:33 INFO client.RMProxy: Connecting to ResourceManager at
> /0.0.0.0:8033
> 22/07/12 15:20:33 WARN ipc.Client: Exception encountered while connecting to
> the server
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to find
> any Kerberos tgt)]
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:408)
> at
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:629)
> at
> org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:423)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:820)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:820)
> at
> org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:423)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617)
> at org.apache.hadoop.ipc.Client.call(Client.java:1448)
> at org.apache.hadoop.ipc.Client.call(Client.java:1401)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
> at
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
> Caused by: GSSException: No valid credentials provided (Mechanism level:
> Failed to find any Kerberos tgt)
> at
> sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:162)
> at
> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
> at
> sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:189)
> at
> sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
> at
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
> at
> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:192)
> ... 34 more
> refreshNodes: Failed on local exception: java.io.IOException:
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to find
> any Kerberos tgt)]; Host Details : local host is: "0.0.0.0/0.0.0.0";
> destination host is: "0.0.0.0":8033;
> {quote}
> h2. The Problem
> When the client has:
> * kerberos enabled by setting "hadoop.security.authentication = kerberos" in
> "core-site.xml"
> * a bad kerberos configuration where "yarn.resourcemanager.principal" is
> unset or malformed in "yarn-site.xml"
> This bad configuration can never successfully connect to the ResourceManager
> & therefore should result in a non-retryable failure.
> When the YARN ResourceManager client has this bad configuration, an
> IllegalArgumentException gets thrown (in
> org.apache.hadoop.security.SaslRpcClient) but is then swallowed by an
> IOException (in org.apache.hadoop.ipc.Client) that gets treated as a
> retryable failure & is therefore retried for 15 minutes:
> {quote}> yarn rmadmin -refreshNodes
> 22/06/28 14:23:45 INFO client.RMProxy: Connecting to ResourceManager at
> /0.0.0.0:8033
> 22/06/28 14:23:46 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:23:47 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:23:48 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> ...
> 22/06/28 14:23:54 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 8 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:23:55 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:23:56 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 0 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:23:57 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 1 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:23:58 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 2 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> ...
> 22/06/28 14:24:04 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 8 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:24:05 INFO ipc.Client: Retrying connect to server:
> 0.0.0.0/0.0.0.0:8033. Already tried 9 time(s); retry policy is
> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> MILLISECONDS)
> 22/06/28 14:24:05 INFO retry.RetryInvocationHandler:
> java.net.ConnectException: Your endpoint configuration is wrong; For more
> details see: [http://wiki.apache.org/hadoop/UnsetHostnameOrPort], while
> invoking ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over
> null after 1 failover attempts. Trying to failover after sleeping for 27166ms.
> 22/06/28 14:24:32 INFO retry.RetryInvocationHandler: java.io.IOException:
> Failed on local exception: java.io.IOException: Couldn't set up IO streams:
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination
> host is: "0.0.0.0":8033; , while invoking
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null
> after 2 failover attempts. Trying to failover after sleeping for 22291ms.
> 22/06/28 14:24:54 INFO retry.RetryInvocationHandler: java.io.IOException:
> Failed on local exception: java.io.IOException: Couldn't set up IO streams:
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination
> host is: "0.0.0.0":8033; , while invoking
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null
> after 3 failover attempts. Trying to failover after sleeping for 24773ms.
> 22/06/28 14:25:19 INFO retry.RetryInvocationHandler: java.io.IOException:
> Failed on local exception: java.io.IOException: Couldn't set up IO streams:
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination
> host is: "0.0.0.0":8033; , while invoking
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null
> after 4 failover attempts. Trying to failover after sleeping for 39187ms.
> ...
> 22/06/28 14:36:50 INFO retry.RetryInvocationHandler: java.io.IOException:
> Failed on local exception: java.io.IOException: Couldn't set up IO streams:
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination
> host is: "0.0.0.0":8033; , while invoking
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null
> after 26 failover attempts. Trying to failover after sleeping for 26235ms.
> 22/06/28 14:37:16 INFO retry.RetryInvocationHandler: java.io.IOException:
> Failed on local exception: java.io.IOException: Couldn't set up IO streams:
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination
> host is: "0.0.0.0":8033; , while invoking
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null
> after 27 failover attempts. Trying to failover after sleeping for 40535ms.
> 22/06/28 14:37:57 INFO retry.RetryInvocationHandler: java.io.IOException:
> Failed on local exception: java.io.IOException: Couldn't set up IO streams:
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination
> host is: "0.0.0.0":8033; , while invoking
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null
> after 28 failover attempts. Trying to failover after sleeping for 26721ms.
> 22/06/28 14:38:23 INFO retry.RetryInvocationHandler: java.io.IOException:
> Failed on local exception: java.io.IOException: Couldn't set up IO streams:
> java.lang.IllegalArgumentException: Failed to specify server's Kerberos
> principal name; Host Details : local host is: "0.0.0.0/0.0.0.0"; destination
> host is: "0.0.0.0":8033; , while invoking
> ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes over null
> after 29 failover attempts. Trying to failover after sleeping for 27641ms.
> refreshNodes: Failed on local exception: java.io.IOException: Couldn't set up
> IO streams: java.lang.IllegalArgumentException: Failed to specify server's
> Kerberos principal name; Host Details : local host is: "0.0.0.0/0.0.0.0";
> destination host is: "0.0.0.0":8033;
> {quote}
> This non-retryable failure should not be treated as a retryable "connection
> failure".
> h2. The Solution
> Surface the IllegalArgumentException to the RetryInvocationHandler & have
> YARN RMProxy treat IllegalArgumentException as non-retryable.
> Note that surfacing IllegalArgumentException has the side-effect of causing
> the [command usage to be printed
> here|https://github.com/apache/hadoop/blob/c0bdba8face85fbd40f5d7ba46af11e24a8ef25b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/cli/RMAdminCLI.java#L790]
> {quote}...
> refreshNodes: Failed to specify server's Kerberos principal name
> Usage: yarn rmadmin [-refreshNodes [-g|graceful [timeout in seconds]
> -client|server]]
> Generic options supported are:
> -conf <configuration file> specify an application configuration file
> -D <property=value> define a value for a given property
> -fs <[file:///]|hdfs://namenode:port> specify default filesystem URL to use,
> overrides 'fs.defaultFS' property from configurations.
> -jt <local|resourcemanager:port> specify a ResourceManager
> -files <file1,...> specify a comma-separated list of files to
> be copied to the map reduce cluster
> -libjars <jar1,...> specify a comma-separated list of jar files
> to be included in the classpath
> -archives <archive1,...> specify a comma-separated list of archives
> to be unarchived on the compute machines
> The general command line syntax is:
> command [genericOptions] [commandOptions]
> {quote}
> To resolve this issue, the IllegalArgumentException is caught & surfaced as a
> KerberosAuthException; this exception type was chosen because it is already
> treated as non-retryable in FailoverOnNetworkExceptionRetry.
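The wrapping described above can be sketched as a self-contained example. This is not Hadoop's actual code; `KerberosAuthException` here is a stand-in for `org.apache.hadoop.security.KerberosAuthException`, and the helper methods are hypothetical:

```java
import java.io.IOException;

public class SurfaceKerberosMisconfig {
    // Stand-in for org.apache.hadoop.security.KerberosAuthException.
    static class KerberosAuthException extends IOException {
        KerberosAuthException(String msg, Throwable cause) { super(msg, cause); }
    }

    static String getServerPrincipal(String configuredPrincipal) {
        if (configuredPrincipal == null || configuredPrincipal.isEmpty()) {
            throw new IllegalArgumentException(
                "Failed to specify server's Kerberos principal name");
        }
        return configuredPrincipal;
    }

    static void createSaslClient(String principalConf) throws IOException {
        try {
            getServerPrincipal(principalConf);
        } catch (IllegalArgumentException e) {
            // The fix: re-throw as a KerberosAuthException so the retry
            // policy can recognize the failure as non-retryable.
            throw new KerberosAuthException(
                "Bad Kerberos server principal configuration", e);
        }
    }

    // FailoverOnNetworkExceptionRetry-style decision: auth-config errors fail fast.
    static boolean shouldRetry(IOException e) {
        return !(e instanceof KerberosAuthException);
    }

    public static void main(String[] args) {
        try {
            createSaslClient(null);   // yarn.resourcemanager.principal unset
        } catch (IOException e) {
            System.out.println("retryable=" + shouldRetry(e));  // prints retryable=false
        }
    }
}
```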
> Note that in terms of RetryPolicy:
> * non-HA YARN ResourceManager client should use
> OtherThanRemoteExceptionDependentRetry (but because of a bug uses
> FailoverOnNetworkExceptionRetry)
> * HA YARN ResourceManager client uses FailoverOnNetworkExceptionRetry
> The result of this change is a much quicker failure when the YARN client is
> misconfigured:
> * non-HA YARN ResourceManager client
>
> {quote}> yarn rmadmin -refreshNodes
> 22/07/13 17:36:03 INFO client.RMProxy: Connecting to ResourceManager at
> /0.0.0.0:8033
> 22/07/13 17:36:03 WARN ipc.Client: Exception encountered while connecting to
> the server
> javax.security.sasl.SaslException: Bad Kerberos server principal
> configuration [Caused by java.lang.IllegalArgumentException: Failed to
> specify server's Kerberos principal name]
> at
> org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:237)
> at
> org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
> at
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:397)
> at
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:630)
> at
> org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:424)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:821)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
> at
> org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:424)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1612)
> at org.apache.hadoop.ipc.Client.call(Client.java:1442)
> at org.apache.hadoop.ipc.Client.call(Client.java:1395)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
> at
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
> Caused by: java.lang.IllegalArgumentException: Failed to specify server's
> Kerberos principal name
> at
> org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:332)
> at
> org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:233)
> ... 35 more
> refreshNodes: Failed on local exception: java.io.IOException:
> javax.security.sasl.SaslException: Bad Kerberos server principal
> configuration [Caused by java.lang.IllegalArgumentException: Failed to
> specify server's Kerberos principal name]; Host Details : local host is:
> "0.0.0.0/0.0.0.0"; destination host is: "0.0.0.0":8033;
> {quote}
> * HA YARN ResourceManager client
>
>
> {quote}> yarn rmadmin -refreshNodes
> 22/07/13 17:37:50 INFO client.RMProxy: Connecting to ResourceManager at
> /0.0.0.0:8033
> 22/07/13 17:37:50 WARN ipc.Client: Exception encountered while connecting to
> the server
> javax.security.sasl.SaslException: Bad Kerberos server principal
> configuration [Caused by java.lang.IllegalArgumentException: Failed to
> specify server's Kerberos principal name]
> at
> org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:237)
> at
> org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:159)
> at
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:397)
> at
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:630)
> at
> org.apache.hadoop.ipc.Client$Connection.access$2200(Client.java:424)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:825)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:821)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
> at
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:821)
> at
> org.apache.hadoop.ipc.Client$Connection.access$3700(Client.java:424)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1612)
> at org.apache.hadoop.ipc.Client.call(Client.java:1442)
> at org.apache.hadoop.ipc.Client.call(Client.java:1395)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
> at com.sun.proxy.$Proxy7.refreshNodes(Unknown Source)
> at
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceManagerAdministrationProtocolPBClientImpl.refreshNodes(ResourceManagerAdministrationProtocolPBClientImpl.java:145)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
> at com.sun.proxy.$Proxy8.refreshNodes(Unknown Source)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:349)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.refreshNodes(RMAdminCLI.java:423)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.handleRefreshNodes(RMAdminCLI.java:917)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.run(RMAdminCLI.java:816)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
> at
> org.apache.hadoop.yarn.client.cli.RMAdminCLI.main(RMAdminCLI.java:1027)
> Caused by: java.lang.IllegalArgumentException: Failed to specify server's
> Kerberos principal name
> at
> org.apache.hadoop.security.SaslRpcClient.getServerPrincipal(SaslRpcClient.java:332)
> at
> org.apache.hadoop.security.SaslRpcClient.createSaslClient(SaslRpcClient.java:233)
> ... 35 more
> refreshNodes: Failed on local exception: java.io.IOException:
> javax.security.sasl.SaslException: Bad Kerberos server principal
> configuration [Caused by java.lang.IllegalArgumentException: Failed to
> specify server's Kerberos principal name]; Host Details : local host is:
> "0.0.0.0/0.0.0.0"; destination host is: "0.0.0.0":8033;
> {quote}
> h2. Other Notes
> The YARN RMProxy will return separate RetryPolicies for HA & non-HA, but the
> YARN client will always use the HA policy because a configuration related to
> [Federation
> Failover|https://github.com/apache/hadoop/blob/d336227e5c63a70db06ac26697994c96ed89d230/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L102]
> is [enabled by
> default|https://github.com/apache/hadoop/blob/e044a46f97dcc7998dc0737f15cf3956dca170c4/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java#L3901].
> This is presumably a bug because YARN Federation is not enabled for the
> cluster I am testing on.
> The fix is to modify HAUtil.isFederationFailoverEnabled to check whether
> "yarn.federation.enabled" (default false) is set, in addition to checking
> "yarn.federation.failover.enabled" (default true).
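A minimal sketch of the proposed check, assuming a simple map-backed configuration; the config keys match the description above, but the helper shape is illustrative, not HAUtil's actual signature:

```java
import java.util.HashMap;
import java.util.Map;

public class FederationFailoverCheck {
    static final Map<String, Boolean> conf = new HashMap<>();

    static boolean get(String key, boolean defaultValue) {
        return conf.getOrDefault(key, defaultValue);
    }

    // Proposed behavior: federation failover only counts as enabled when
    // federation itself is enabled, not merely because the failover flag
    // defaults to true.
    static boolean isFederationFailoverEnabled() {
        return get("yarn.federation.enabled", false)          // default false
            && get("yarn.federation.failover.enabled", true); // default true
    }

    public static void main(String[] args) {
        // With an out-of-the-box configuration, federation failover is off,
        // so the non-HA RetryPolicy is selected as intended.
        System.out.println(isFederationFailoverEnabled());  // prints false
    }
}
```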
--
This message was sent by Atlassian Jira
(v8.20.10#820010)