[
https://issues.apache.org/jira/browse/YARN-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928312#comment-16928312
]
Bibin A Chundatt commented on YARN-9823:
----------------------------------------
[~lichaojacobs] YARN-8434 should help you.
> NodeManager cannot get right ResourceTrack address in Federation mode
> ---------------------------------------------------------------------
>
> Key: YARN-9823
> URL: https://issues.apache.org/jira/browse/YARN-9823
> Project: Hadoop YARN
> Issue Type: Bug
> Components: federation, nodemanager
> Affects Versions: 2.9.2
> Environment: h2. Hadoop:
> Hadoop 2.9.2 (some line numbers may be off because we have merged some
> 3.0+ patches)
> Security with Kerberos
> configured following
> [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html]
> h2. Java:
> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
> Kerberos:
>
>
> Reporter: qiwei huang
> Priority: Major
>
> {{The NM retries indefinitely against the wrong RM's resource tracker port:}}
> {quote}{{INFO [main:RetryInvocationHandler@411] - java.net.ConnectException:
> Call From standby.rm.server/10.122.138.139 to }}{{standby.rm.server}}{{:8031
> failed on connection exception: java.net.ConnectException: Connection
> refused; For more details see:
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
> ResourceTrackerPBClientImpl.registerNodeManager over dev1 after 19 failover
> attempts. Trying to failover after sleeping for 40497ms.}}
> {quote}
>
> {{After changing *yarn.client.failover-proxy-provider* to
> *org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider*,
> the NodeManager cannot find the right ResourceTracker address:}}
> {quote}{{getRMHAId:233, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfKeyForRMInstance:294, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfValueForRMInstance:302, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfValueForRMInstance:314, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getSocketAddr:3341, YarnConfiguration (org.apache.hadoop.yarn.conf)}}
> {{getRMAddress:77, ServerRMProxy (org.apache.hadoop.yarn.server.api)}}
> {{run:144, FederationRMFailoverProxyProvider$1
> (org.apache.hadoop.yarn.server.federation.failover)}}
> {{doPrivileged:-1, AccessController (java.security)}}
> {{doAs:422, Subject (javax.security.auth)}}
> {{doAs:1893, UserGroupInformation (org.apache.hadoop.security)}}
> {{getProxyInternal:141, FederationRMFailoverProxyProvider
> (org.apache.hadoop.yarn.server.federation.failover)}}
> {{performFailover:192, FederationRMFailoverProxyProvider
> (org.apache.hadoop.yarn.server.federation.failover)}}
> {{failover:217, RetryInvocationHandler$ProxyDescriptor
> (org.apache.hadoop.io.retry)}}
> {{processRetryInfo:149, RetryInvocationHandler$Call
> (org.apache.hadoop.io.retry)}}
> {{processWaitTimeAndRetryInfo:142, RetryInvocationHandler$Call
> (org.apache.hadoop.io.retry)}}
> {{invokeOnce:107, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)}}
> {{invoke:359, RetryInvocationHandler (org.apache.hadoop.io.retry)}}
> {{registerNodeManager:-1, $Proxy85 (com.sun.proxy)}}
> {{registerWithRM:378, NodeStatusUpdaterImpl
> (org.apache.hadoop.yarn.server.nodemanager)}}
> {{serviceStart:252, NodeStatusUpdaterImpl
> (org.apache.hadoop.yarn.server.nodemanager)}}
> {{start:194, AbstractService (org.apache.hadoop.service)}}
> {{serviceStart:121, CompositeService (org.apache.hadoop.service)}}
> {{start:194, AbstractService (org.apache.hadoop.service)}}
> {{initAndStartNodeManager:864, NodeManager
> (org.apache.hadoop.yarn.server.nodemanager)}}
> {{main:931, NodeManager (org.apache.hadoop.yarn.server.nodemanager)}}
> {quote}
> {{The provider tries to find the main RM address at }}*{{getRMHAId:233,}}*
> {{but it cannot find the right address because that code can only return the
> local address:}}
> {quote}{{if (!s.isUnresolved() && NetUtils.isLocalAddress(s.getAddress())) {}}
> {{ currentRMId = rmId.trim();}}
> {{ found++;}}
> {{}}}
> {quote}
> {{If the NM and an RM run on the same node and that RM is in standby
> state, the NM will retry RPC calls to it indefinitely.}}
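
The local-address match quoted in the description can be reproduced outside Hadoop. The following is a minimal sketch, not the actual HAUtil code: the class name LocalRmIdSketch, both method names, the map of RM ids to addresses, and the 192.0.2.10 address are hypothetical, the local-address check only approximates what NetUtils.isLocalAddress is assumed to do, and the multiple-match guard is an assumption. It shows that when one of the configured RMs runs on the same host as the NM, that RM's id is the one selected, whether or not it is the active RM.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.LinkedHashMap;
import java.util.Map;

public class LocalRmIdSketch {

  // Approximation (assumption) of a local-address check: wildcard, loopback,
  // or an address bound to one of this host's network interfaces.
  public static boolean isLocalAddress(InetAddress addr) {
    if (addr.isAnyLocalAddress() || addr.isLoopbackAddress()) {
      return true;
    }
    try {
      return NetworkInterface.getByInetAddress(addr) != null;
    } catch (SocketException e) {
      return false;
    }
  }

  // Mirrors the quoted snippet: the RM id whose address is local wins.
  // Nothing here asks whether that RM is active or standby.
  public static String findLocalRmId(Map<String, InetSocketAddress> rmAddresses) {
    String currentRMId = null;
    int found = 0;
    for (Map.Entry<String, InetSocketAddress> entry : rmAddresses.entrySet()) {
      InetSocketAddress s = entry.getValue();
      if (!s.isUnresolved() && isLocalAddress(s.getAddress())) {
        currentRMId = entry.getKey().trim();
        found++;
      }
    }
    if (found > 1) {
      // Assumed guard: more than one local match is ambiguous.
      throw new IllegalStateException("multiple RM addresses match the local node");
    }
    return currentRMId; // null when no configured RM is local
  }

  public static void main(String[] args) throws Exception {
    Map<String, InetSocketAddress> rms = new LinkedHashMap<>();
    // Hypothetical layout: rm1 is the standby RM co-located with the NM,
    // rm2 is the active RM on another host (192.0.2.10 is a documentation address).
    rms.put("rm1", new InetSocketAddress(InetAddress.getLoopbackAddress(), 8031));
    rms.put("rm2", new InetSocketAddress(
        InetAddress.getByAddress(new byte[] {(byte) 192, 0, 2, 10}), 8031));
    System.out.println("NM would pick: " + findLocalRmId(rms));
  }
}
```

On a host where 192.0.2.10 is not bound to any interface, this selects rm1, the co-located RM, which matches the reported behaviour of the NM repeatedly calling a standby RM on its own node.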
--
This message was sent by Atlassian Jira
(v8.3.2#803003)