[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672567#comment-16672567 ] Rahul Anand commented on YARN-7592: --- Thanks [~bibinchundatt] and [~subru] for the comment. Yes, this works well with both HA and non-HA scenarios, and for the *yarn.federation.failover.enabled* flag, IIUC, we can remove that flag too. > yarn.federation.failover.enabled missing in yarn-default.xml > > > Key: YARN-7592 > URL: https://issues.apache.org/jira/browse/YARN-7592 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.0.0-beta1 >Reporter: Gera Shegalov >Priority: Major > Attachments: IssueReproduce.patch > > > yarn.federation.failover.enabled should be documented in yarn-default.xml. I > am also not sure why it should be true by default and force the HA retry > policy in {{RMProxy#createRMProxy}}
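For reference, a minimal sketch of where the flag lands on the client side, assuming the YarnConfiguration constants FEDERATION_FAILOVER_ENABLED and DEFAULT_FEDERATION_FAILOVER_ENABLED exist with the current default of true (worth double-checking against trunk):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederationFailoverFlagCheck {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // Assumed constants: yarn.federation.failover.enabled, default true.
    // A true value is what pushes RMProxy#createRMProxy onto the HA retry policy
    // even when RM HA itself is not configured.
    boolean failoverEnabled = conf.getBoolean(
        YarnConfiguration.FEDERATION_FAILOVER_ENABLED,
        YarnConfiguration.DEFAULT_FEDERATION_FAILOVER_ENABLED);
    System.out.println("yarn.federation.failover.enabled = " + failoverEnabled);
  }
}
{code}

If the flag is removed as proposed, this read (and the corresponding branch in retry-policy creation) would go away with it.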
[jira] [Comment Edited] (YARN-6900) ZooKeeper based implementation of the FederationStateStore
[ https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672561#comment-16672561 ] Rahul Anand edited comment on YARN-6900 at 11/2/18 4:48 AM: Thanks [~subru] and [~elgoiri] for the comment. So right now we have two properties in the yarn-site for setting up the policy manager and its related params, i.e. *yarn.federation.policy-manager-params* and *yarn.federation.policy-manager*. According to the code, we should set this params property with bytes, but obviously that is not what a user expects. So for discussion, the first thing is the format of this input. Here is how things look in the znode when I tried to put policy configurations in the statestore for the default queue * using PriorityBroadcastPolicyManager *get /federationstore/policies/** {code:java} *Xorg.apache.hadoop.yarn.server.federation.policies.manager.PriorityBroadcastPolicyManager�{"routerPolicyWeights":{"entry":[{"key":{"id":"cluster-bb"},"value":"1.0"},{"key":{"id":"cluster-ra"},"value":"3.0"}]},"amrmPolicyWeights":null,"headroomAlpha":"0.0"} {code} So overall we have the following things to put in the params field: 1. routerPolicyWeights (weight of a particular subcluster to be used for the router policy) 2. amrmPolicyWeights (weight of a particular subcluster to be used for the AMRM policy) 3. headroomAlpha was (Author: rahulanand90): Thanks [~subru] and [~elgoiri] for the comment. So right now we have two properties in the yarn-site for setting up the policy manager and its related params, i.e. *yarn.federation.policy-manager-params* and *yarn.federation.policy-manager*. According to the code, we should set this params property with bytes, but obviously that is not what a user expects. So for discussion, the first thing is the format of this input. Here is how things look in the znode when I tried to put policy configurations in the statestore for the default queue * using PriorityBroadcastPolicyManager *get /federationstore/policies/** {code:java} *Xorg.apache.hadoop.yarn.server.federation.policies.manager.PriorityBroadcastPolicyManager�{"routerPolicyWeights":{"entry":[{"key":{"id":"cluster-bb"},"value":"1.0"},{"key":{"id":"cluster-ra"},"value":"3.0"}]},"amrmPolicyWeights":null,"headroomAlpha":"0.0"} {code} So overall we have the following things to put in the params field: 1. amrmPolicyWeights (weight of a particular subcluster to be used for the router policy) 2. routerPolicyWeights (weight of a particular subcluster to be used for the AMRM policy) 3. headroomAlpha > ZooKeeper based implementation of the FederationStateStore > -- > > Key: YARN-6900 > URL: https://issues.apache.org/jira/browse/YARN-6900 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation, nodemanager, resourcemanager >Reporter: Subru Krishnan >Assignee: Íñigo Goiri >Priority: Major > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: YARN-6900-002.patch, YARN-6900-003.patch, > YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, > YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, > YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, > YARN-6900-YARN-2915-001.patch > > > YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only > support SQL based stores, this JIRA tracks adding a ZooKeeper based > implementation for simplifying deployment as it's already popularly used for > {{RMStateStore}}.
[jira] [Commented] (YARN-6900) ZooKeeper based implementation of the FederationStateStore
[ https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672561#comment-16672561 ] Rahul Anand commented on YARN-6900: --- Thanks [~subru] and [~elgoiri] for the comment. So right now we have two properties in the yarn-site for setting up the policy manager and its related params, i.e. *yarn.federation.policy-manager-params* and *yarn.federation.policy-manager*. According to the code, we should set this params property with bytes, but obviously that is not what a user expects. So for discussion, the first thing is the format of this input. Here is how things look in the znode when I tried to put policy configurations in the statestore for the default queue * using PriorityBroadcastPolicyManager *get /federationstore/policies/** {code:java} *Xorg.apache.hadoop.yarn.server.federation.policies.manager.PriorityBroadcastPolicyManager�{"routerPolicyWeights":{"entry":[{"key":{"id":"cluster-bb"},"value":"1.0"},{"key":{"id":"cluster-ra"},"value":"3.0"}]},"amrmPolicyWeights":null,"headroomAlpha":"0.0"} {code} So overall we have the following things to put in the params field: 1. amrmPolicyWeights (weight of a particular subcluster to be used for the router policy) 2. routerPolicyWeights (weight of a particular subcluster to be used for the AMRM policy) 3. headroomAlpha > ZooKeeper based implementation of the FederationStateStore > -- > > Key: YARN-6900 > URL: https://issues.apache.org/jira/browse/YARN-6900 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation, nodemanager, resourcemanager >Reporter: Subru Krishnan >Assignee: Íñigo Goiri >Priority: Major > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: YARN-6900-002.patch, YARN-6900-003.patch, > YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, > YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, > YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, > YARN-6900-YARN-2915-001.patch > > > YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only > support SQL based stores, this JIRA tracks adding a ZooKeeper based > implementation for simplifying deployment as it's already popularly used for > {{RMStateStore}}.
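To make the params discussion concrete, here is a hedged sketch of how the payload shown in the znode dump above could be produced programmatically rather than by hand-crafting bytes. It assumes the WeightedPolicyInfo DAO, SubClusterIdInfo, and SubClusterPolicyConfiguration records from the federation code base with their current method names (setRouterPolicyWeights, setHeadroomAlpha, toByteBuffer, newInstance); treat it as an illustration, not a verified recipe:

{code:java}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.server.federation.policies.dao.WeightedPolicyInfo;
import org.apache.hadoop.yarn.server.federation.policies.manager.PriorityBroadcastPolicyManager;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterIdInfo;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterPolicyConfiguration;

public class PolicyParamsSketch {
  public static void main(String[] args) throws Exception {
    // Router weights matching the dump above: cluster-ra gets 3x the share of cluster-bb.
    Map<SubClusterIdInfo, Float> routerWeights = new HashMap<>();
    routerWeights.put(new SubClusterIdInfo("cluster-bb"), 1.0f);
    routerWeights.put(new SubClusterIdInfo("cluster-ra"), 3.0f);

    WeightedPolicyInfo info = new WeightedPolicyInfo();
    info.setRouterPolicyWeights(routerWeights);
    info.setHeadroomAlpha(0.0f);
    // amrmPolicyWeights left unset, matching "amrmPolicyWeights":null in the dump.

    // Serialize to the opaque byte payload that the params property (and the znode) carries.
    ByteBuffer params = info.toByteBuffer();

    SubClusterPolicyConfiguration policyConf = SubClusterPolicyConfiguration.newInstance(
        "*", PriorityBroadcastPolicyManager.class.getName(), params);
    System.out.println("queue=" + policyConf.getQueue() + ", type=" + policyConf.getType());
  }
}
{code}

A user-facing format for *yarn.federation.policy-manager-params* would presumably be a readable encoding of exactly these three fields, which the framework could then convert into the byte form stored in the state store.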
[jira] [Updated] (YARN-8855) Application fails if one of the sublcluster is down.
[ https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rahul Anand updated YARN-8855: -- Summary: Application fails if one of the sublcluster is down. (was: Application submission fails if one of the sublcluster is down.) > Application fails if one of the sublcluster is down. > > > Key: YARN-8855 > URL: https://issues.apache.org/jira/browse/YARN-8855 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Rahul Anand >Priority: Major > > If one of the sub-clusters is down, then the application keeps on retrying multiple times > and then it fails. About 30 failover attempts were found in the logs. Below is the > detailed exception. > {code:java} > 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container > container_e03_1538297667953_0005_01_01 transitioned from > CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093 > 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing > container_e03_1538297667953_0005_01_01 from application > application_1538297667953_0005 | ApplicationImpl.java:512 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping > resource-monitoring for container_e03_1538297667953_0005_01_01 | > ContainersMonitorImpl.java:932 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering > container container_e03_1538297667953_0005_01_01 for log-aggregation | > AppLogAggregatorImpl.java:538 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event > CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350 > 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping > container container_e03_1538297667953_0005_01_01 | > YarnShuffleService.java:295 > 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find > container container_e03_1538297667953_0005_01_01 while processing > FINISH_CONTAINERS event | ContainerManagerImpl.java:1660 > 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed > containers from NM context: [container_e03_1538297667953_0005_01_01] | > NodeStatusUpdaterImpl.java:696 > 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the > ResourceManager for SubClusterId: cluster2 | > FederationRMFailoverProxyProvider.java:124 > 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from > cache and rehydrating from store, most likely on account of RM failover. | > FederationStateStoreFacade.java:258 > 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to > /192.168.0.25:8032 subClusterId cluster2 with protocol > ApplicationClientProtocol as user root (auth:SIMPLE) | > FederationRMFailoverProxyProvider.java:145 > 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | > java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to > node-master1-IYTxR:8032 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after > 28 failover attempts. Trying to failover after sleeping for 15261ms. 
| > RetryInvocationHandler.java:411 > 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the > ResourceManager for SubClusterId: cluster2 | > FederationRMFailoverProxyProvider.java:124 > 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from > cache and rehydrating from store, most likely on account of RM failover. | > FederationStateStoreFacade.java:258 > 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to > /192.168.0.25:8032 subClusterId cluster2 with protocol > ApplicationClientProtocol as user root (auth:SIMPLE) | > FederationRMFailoverProxyProvider.java:145 > 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | > java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to > node-master1-IYTxR:8032 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused, while invoking > ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after > 29 failover attempts. Trying to failover after sleeping for 21175ms. | > RetryInvocationHandler.java:411 > 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the > ResourceManager for SubClusterId: cluster2 | > FederationRMFailoverProxyProvider.java:124 > 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from > cache and rehydrating from store, most likely on account of RM failover. | > FederationStateStoreFacade.java:258 > 2018-10-08 14:22:03,186 | INFO |
[jira] [Created] (YARN-8855) Application submission fails if one of the sublcluster is down.
Rahul Anand created YARN-8855: - Summary: Application submission fails if one of the sublcluster is down. Key: YARN-8855 URL: https://issues.apache.org/jira/browse/YARN-8855 Project: Hadoop YARN Issue Type: Bug Reporter: Rahul Anand If one of the sub-clusters is down, then the application keeps on retrying multiple times and then it fails. About 30 failover attempts were found in the logs. Below is the detailed exception. {code:java} 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container container_e03_1538297667953_0005_01_01 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing container_e03_1538297667953_0005_01_01 from application application_1538297667953_0005 | ApplicationImpl.java:512 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping resource-monitoring for container_e03_1538297667953_0005_01_01 | ContainersMonitorImpl.java:932 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering container container_e03_1538297667953_0005_01_01 for log-aggregation | AppLogAggregatorImpl.java:538 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping container container_e03_1538297667953_0005_01_01 | YarnShuffleService.java:295 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find container container_e03_1538297667953_0005_01_01 while processing FINISH_CONTAINERS event | ContainerManagerImpl.java:1660 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed containers from NM context: [container_e03_1538297667953_0005_01_01] | NodeStatusUpdaterImpl.java:696 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 28 failover attempts. Trying to failover after sleeping for 15261ms. | RetryInvocationHandler.java:411 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. 
| FederationStateStoreFacade.java:258 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 29 failover attempts. Trying to failover after sleeping for 21175ms. | RetryInvocationHandler.java:411 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the ResourceManager for SubClusterId: cluster2 | FederationRMFailoverProxyProvider.java:124 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from cache and rehydrating from store, most likely on account of RM failover. | FederationStateStoreFacade.java:258 2018-10-08 14:22:03,186 | INFO | pool-16-thread-1 | Connecting to /192.168.0.25:8032 subClusterId cluster2 with protocol ApplicationClientProtocol as user root (auth:SIMPLE) | FederationRMFailoverProxyProvider.java:145 2018-10-08 14:22:03,189 | ERROR | pool-16-thread-1 | Failed to register application master: cluster2 Application: appattempt_1538297667953_0005_01 | FederationInterceptor.java:1106 java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032
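The roughly 30 failover attempts and growing sleep intervals in the log above are governed by the standard client failover/connect retry settings rather than anything federation-specific. A hedged sketch of bounding them while debugging, assuming the usual YarnConfiguration constants for the yarn.client.failover-* and yarn.resourcemanager.connect.* properties (the values below are purely illustrative):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FailoverRetryTuning {
  public static Configuration boundedFailoverConf() {
    Configuration conf = new YarnConfiguration();
    // Cap how many times the failover proxy retries an unreachable sub-cluster RM
    // (illustrative values only; property names are the standard client ones).
    conf.setInt(YarnConfiguration.CLIENT_FAILOVER_MAX_ATTEMPTS, 3);
    conf.setLong(YarnConfiguration.CLIENT_FAILOVER_SLEEPTIME_BASE_MS, 1000L);
    conf.setLong(YarnConfiguration.CLIENT_FAILOVER_SLEEPTIME_MAX_MS, 5000L);
    // Bound the overall time spent trying to reach the RM address.
    conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_MAX_WAIT_MS, 30 * 1000L);
    conf.setLong(YarnConfiguration.RESOURCEMANAGER_CONNECT_RETRY_INTERVAL_MS, 5 * 1000L);
    return conf;
  }
}
{code}

Tightening these only shortens the failure; the underlying issue tracked here is that the application as a whole fails instead of tolerating the dead sub-cluster.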
[jira] [Comment Edited] (YARN-6900) ZooKeeper based implementation of the FederationStateStore
[ https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637795#comment-16637795 ] Rahul Anand edited comment on YARN-6900 at 10/4/18 12:47 PM: - [~elgoiri] [~subru] I would like to know how one can configure the znode so as to specify the queue policies. I mean how can I specify the router, amrmpolicy, associated weights etc in the znode and in which format. I am unable to find this anywhere. Alternatively, if not that, then can you help me in setting the params in yarn.federation.policy-manager-params so that I can set policy weights. was (Author: rahulanand90): [~elgoiri] [~subru] I would like to know how one can configure the znode so as to specify the queue policies. I mean how can I specify the router, amrmpolicy, associated weights etc in the znode and in which format. I am unable to find this anywhere. > ZooKeeper based implementation of the FederationStateStore > -- > > Key: YARN-6900 > URL: https://issues.apache.org/jira/browse/YARN-6900 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation, nodemanager, resourcemanager >Reporter: Subru Krishnan >Assignee: Íñigo Goiri >Priority: Major > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: YARN-6900-002.patch, YARN-6900-003.patch, > YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, > YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, > YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, > YARN-6900-YARN-2915-001.patch > > > YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only > support SQL based stores, this JIRA tracks adding a ZooKeeper based > implementation for simplifying deployment as it's already popularly used for > {{RMStateStore}}.
[jira] [Commented] (YARN-6900) ZooKeeper based implementation of the FederationStateStore
[ https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637795#comment-16637795 ] Rahul Anand commented on YARN-6900: --- [~elgoiri] [~subru] I would like to know how one can configure the znode so as to specify the queue policies. I mean how can I specify the router, amrmpolicy, associated weights etc in the znode and in which format. I am unable to find this anywhere. > ZooKeeper based implementation of the FederationStateStore > -- > > Key: YARN-6900 > URL: https://issues.apache.org/jira/browse/YARN-6900 > Project: Hadoop YARN > Issue Type: Sub-task > Components: federation, nodemanager, resourcemanager >Reporter: Subru Krishnan >Assignee: Íñigo Goiri >Priority: Major > Fix For: 2.9.0, 3.0.0-beta1 > > Attachments: YARN-6900-002.patch, YARN-6900-003.patch, > YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, > YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, > YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, > YARN-6900-YARN-2915-001.patch > > > YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only > support SQL based stores, this JIRA tracks adding a ZooKeeper based > implementation for simplifying deployment as it's already popularly used for > {{RMStateStore}}.
[jira] [Comment Edited] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625870#comment-16625870 ] Rahul Anand edited comment on YARN-7592 at 9/24/18 3:46 PM: Thanks [~bibinchundatt] and [~subru]. Removing *yarn.federation.enabled* from yarn-site.xml can solve this issue but would definitely create confusion. So, instead of changing/removing a meaningful federation flag or updating the doc, an alternative solution can be the creation of a {{FederationCustomClientRMProxy}} which can override the {{ClientRMProxy#createRMProxy}} in {{AMRMClientUtils}} to always select the *proxy provider* as {{FederationRMFailoverProxyProvider}} for federation. {code:java} public static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, UserGroupInformation user, final Token<? extends TokenIdentifier> token) throws IOException { ... return FederationCustomClientRMProxy.createRMProxy(configuration, protocol); } ... } } {code} After this, we can remove the {{isFederationEnabled}} check from {{RMProxy.java}} as before. {code:java} protected static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, RMProxy<T> instance) throws IOException { ... RetryPolicy retryPolicy = createRetryPolicy(conf, (HAUtil.isHAEnabled(conf))); ... } {code} {code:java} protected static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, RMProxy<T> instance, final long retryTime, final long retryInterval) throws IOException { ... RetryPolicy retryPolicy = createRetryPolicy(conf, retryTime, retryInterval, HAUtil.isHAEnabled(conf)); ... } {code} With this change we don't need to separately specify the *proxy provider* for HA and non-HA scenarios in the case of federation, while other non-federation settings continue as they are. was (Author: rahulanand90): Thanks [~bibinchundatt] and [~subru]. Removing *yarn.federation.enabled* from yarn-site.xml can solve this issue but would definitely create confusion. So, instead of changing/removing a meaningful federation flag or updating the doc, an alternative solution can be the creation of a {{FederationCustomClientRMProxy}} which can override the {{ClientRMProxy#createRMProxy}} in {{AMRMClientUtils}} to always select the *proxy provider* as {{FederationRMFailoverProxyProvider}} for federation. {code:java} public static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, UserGroupInformation user, final Token<? extends TokenIdentifier> token) throws IOException { ... return FederationCustomClientRMProxy.createRMProxy(configuration, protocol); } ... } } {code} After this, we can remove the {{isFederationEnabled}} check from {{RMProxy.java}} as before. {code:java} protected static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, RMProxy<T> instance) throws IOException { ... RetryPolicy retryPolicy = createRetryPolicy(conf, (HAUtil.isHAEnabled(conf))); ... } {code} {code:java} protected static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, RMProxy<T> instance, final long retryTime, final long retryInterval) throws IOException { ... RetryPolicy retryPolicy = createRetryPolicy(conf, retryTime, retryInterval, HAUtil.isHAEnabled(conf)); ... } {code} With this change, we don't need to separately specify the *proxy provider* for HA and non-HA scenarios. 
> yarn.federation.failover.enabled missing in yarn-default.xml > > > Key: YARN-7592 > URL: https://issues.apache.org/jira/browse/YARN-7592 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.0.0-beta1 >Reporter: Gera Shegalov >Priority: Major > Attachments: IssueReproduce.patch > > > yarn.federation.failover.enabled should be documented in yarn-default.xml. I > am also not sure why it should be true by default and force the HA retry > policy in {{RMProxy#createRMProxy}}
[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625870#comment-16625870 ] Rahul Anand commented on YARN-7592: --- Thanks [~bibinchundatt] and [~subru]. Removing *yarn.federation.enabled* from yarn-site.xml can solve this issue but would definitely create confusion. So, instead of changing/removing a meaningful federation flag or updating the doc, an alternative solution can be the creation of a {{FederationCustomClientRMProxy}} which can override the {{ClientRMProxy#createRMProxy}} in {{AMRMClientUtils}} to always select the *proxy provider* as {{FederationRMFailoverProxyProvider}} for federation. {code:java} public static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, UserGroupInformation user, final Token<? extends TokenIdentifier> token) throws IOException { ... return FederationCustomClientRMProxy.createRMProxy(configuration, protocol); } ... } } {code} After this, we can remove the {{isFederationEnabled}} check from {{RMProxy.java}} as before. {code:java} protected static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, RMProxy<T> instance) throws IOException { ... RetryPolicy retryPolicy = createRetryPolicy(conf, (HAUtil.isHAEnabled(conf))); ... } {code} {code:java} protected static <T> T createRMProxy(final Configuration configuration, final Class<T> protocol, RMProxy<T> instance, final long retryTime, final long retryInterval) throws IOException { ... RetryPolicy retryPolicy = createRetryPolicy(conf, retryTime, retryInterval, HAUtil.isHAEnabled(conf)); ... } {code} With this change, we don't need to separately specify the *proxy provider* for HA and non-HA scenarios. > yarn.federation.failover.enabled missing in yarn-default.xml > > > Key: YARN-7592 > URL: https://issues.apache.org/jira/browse/YARN-7592 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.0.0-beta1 >Reporter: Gera Shegalov >Priority: Major > Attachments: IssueReproduce.patch > > > yarn.federation.failover.enabled should be documented in yarn-default.xml. I > am also not sure why it should be true by default and force the HA retry > policy in {{RMProxy#createRMProxy}}
[jira] [Comment Edited] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606769#comment-16606769 ] Rahul Anand edited comment on YARN-7592 at 9/7/18 9:42 AM: --- As per my understanding, for a non-HA setup with the default configuration, this will always create a problem. I have listed my analysis below. NodeManager registration starts from {{NodeManager#main}} and eventually invokes {{NodeStatusUpdaterImpl#serviceStart}} {code:java} protected void serviceStart() throws Exception { ... this.resourceTracker = getRMClient(); .. } catch (Exception e) { String errorMessage = "Unexpected error starting NodeStatusUpdater"; LOG.error(errorMessage, e); throw new YarnRuntimeException(e); } } {code} Then, NodeStatusUpdaterImpl#getRMClient tries to create an RM proxy for the ResourceTracker protocol. Now, the federation-enabled check in RMProxy#newProxyInstance {code:java} if (HAUtil.isHAEnabled(conf) || HAUtil.isFederationEnabled(conf)) { RMFailoverProxyProvider<T> provider = instance.createRMFailoverProxyProvider(conf, protocol);{code} is failing the registration of the NodeManager. By default, RMProxy#createRMFailoverProxyProvider will always select ConfiguredRMFailoverProxyProvider {code:java} RMFailoverProxyProvider<T> provider = ReflectionUtils.newInstance( conf.getClass(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER, defaultProviderClass, RMFailoverProxyProvider.class), conf); provider.init(conf, (RMProxy<T>) this, protocol);{code} and eventually, it will try to get the RM IDs from ConfiguredRMFailoverProxyProvider#init {code:java} Collection<String> rmIds = HAUtil.getRMHAIds(conf); {code} which would have been set only in the case of an HA setup, according to ResourceManager#serviceInit. {code} this.rmContext.setHAEnabled(HAUtil.isHAEnabled(this.conf)); if (this.rmContext.isHAEnabled()) { HAUtil.verifyAndSetConfiguration(this.conf); } {code} When I tried to run with the proxy provider set to FederationRMFailoverProxyProvider, it started the NodeManager, but this would realistically only work in the case of a single RM. {code:xml} <property> <name>yarn.client.failover-proxy-provider</name> <value>org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider</value> </property> {code} Please correct me if I am wrong at any point. was (Author: rahulanand90): As per my understanding, for a non-HA setup with the default configuration, this will always create a problem. I have listed my analysis below. NodeManager registration starts from {{NodeManager#main}} and eventually invokes {{NodeStatusUpdaterImpl#serviceStart}} {code:java} protected void serviceStart() throws Exception { ... this.resourceTracker = getRMClient(); .. } catch (Exception e) { String errorMessage = "Unexpected error starting NodeStatusUpdater"; LOG.error(errorMessage, e); throw new YarnRuntimeException(e); } } {code} Then, NodeStatusUpdaterImpl#getRMClient tries to create an RM proxy for the ResourceTracker protocol. Now, the federation-enabled check in RMProxy#newProxyInstance {code:java} if (HAUtil.isHAEnabled(conf) || HAUtil.isFederationEnabled(conf)) { RMFailoverProxyProvider<T> provider = instance.createRMFailoverProxyProvider(conf, protocol);{code} is failing the registration of the NodeManager. By default, RMProxy#createRMFailoverProxyProvider will always select ConfiguredRMFailoverProxyProvider {code:java} RMFailoverProxyProvider<T> provider = ReflectionUtils.newInstance( conf.getClass(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER, defaultProviderClass, RMFailoverProxyProvider.class), conf); provider.init(conf, (RMProxy<T>) this, protocol);{code} and eventually, it will try to get the RM IDs from ConfiguredRMFailoverProxyProvider#init {code:java} Collection<String> rmIds = HAUtil.getRMHAIds(conf); which would have been set only in the case of an HA setup, according to ResourceManager#serviceInit. this.rmContext.setHAEnabled(HAUtil.isHAEnabled(this.conf)); if (this.rmContext.isHAEnabled()) { HAUtil.verifyAndSetConfiguration(this.conf); } {code} When I tried to run with the proxy provider set to FederationRMFailoverProxyProvider, it started the NodeManager, but this would realistically only work in the case of a single RM. {code:xml} <property> <name>yarn.client.failover-proxy-provider</name> <value>org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider</value> </property> {code} Please correct me if I am wrong at any point. > yarn.federation.failover.enabled missing in yarn-default.xml > > > Key: YARN-7592 > URL: https://issues.apache.org/jira/browse/YARN-7592 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.0.0-beta1 >Reporter: Gera Shegalov >Priority: Major > Attachments: IssueReproduce.patch > > >
[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16606769#comment-16606769 ] Rahul Anand commented on YARN-7592: --- As per my understanding, for a non-HA setup with the default configuration, this will always create a problem. I have listed my analysis below. NodeManager registration starts from {{NodeManager#main}} and eventually invokes {{NodeStatusUpdaterImpl#serviceStart}} {code:java} protected void serviceStart() throws Exception { ... this.resourceTracker = getRMClient(); .. } catch (Exception e) { String errorMessage = "Unexpected error starting NodeStatusUpdater"; LOG.error(errorMessage, e); throw new YarnRuntimeException(e); } } {code} Then, NodeStatusUpdaterImpl#getRMClient tries to create an RM proxy for the ResourceTracker protocol. Now, the federation-enabled check in RMProxy#newProxyInstance {code:java} if (HAUtil.isHAEnabled(conf) || HAUtil.isFederationEnabled(conf)) { RMFailoverProxyProvider<T> provider = instance.createRMFailoverProxyProvider(conf, protocol);{code} is failing the registration of the NodeManager. By default, RMProxy#createRMFailoverProxyProvider will always select ConfiguredRMFailoverProxyProvider {code:java} RMFailoverProxyProvider<T> provider = ReflectionUtils.newInstance( conf.getClass(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER, defaultProviderClass, RMFailoverProxyProvider.class), conf); provider.init(conf, (RMProxy<T>) this, protocol);{code} and eventually, it will try to get the RM IDs from ConfiguredRMFailoverProxyProvider#init {code:java} Collection<String> rmIds = HAUtil.getRMHAIds(conf); which would have been set only in the case of an HA setup, according to ResourceManager#serviceInit. this.rmContext.setHAEnabled(HAUtil.isHAEnabled(this.conf)); if (this.rmContext.isHAEnabled()) { HAUtil.verifyAndSetConfiguration(this.conf); } {code} When I tried to run with the proxy provider set to FederationRMFailoverProxyProvider, it started the NodeManager, but this would realistically only work in the case of a single RM. {code:xml} <property> <name>yarn.client.failover-proxy-provider</name> <value>org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider</value> </property> {code} Please correct me if I am wrong at any point. > yarn.federation.failover.enabled missing in yarn-default.xml > > > Key: YARN-7592 > URL: https://issues.apache.org/jira/browse/YARN-7592 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.0.0-beta1 >Reporter: Gera Shegalov >Priority: Major > Attachments: IssueReproduce.patch > > > yarn.federation.failover.enabled should be documented in yarn-default.xml. I > am also not sure why it should be true by default and force the HA retry > policy in {{RMProxy#createRMProxy}}
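Building on the yarn.client.failover-proxy-provider workaround above, a hedged sketch of forcing the federation proxy provider per created client rather than globally in yarn-site.xml; ClientRMProxy, ApplicationClientProtocol, and YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER are the real classes/constants, but whether this override is sufficient for a non-HA federation setup is exactly what is being questioned in this thread:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.client.ClientRMProxy;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FederationProxySketch {
  // Force the federation failover provider only for this proxy's configuration,
  // leaving other (non-federation) clients on the default provider.
  public static ApplicationClientProtocol createFederationClientProxy(Configuration base)
      throws IOException {
    Configuration conf = new Configuration(base);
    conf.set(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER,
        "org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider");
    return ClientRMProxy.createRMProxy(conf, ApplicationClientProtocol.class);
  }
}
{code}

Usage would simply be FederationProxySketch.createFederationClientProxy(new YarnConfiguration()); the proposed FederationCustomClientRMProxy in the comments above would bake this selection into the proxy creation path instead of relying on configuration.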