[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-11-01 Thread Rahul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672567#comment-16672567
 ] 

Rahul Anand commented on YARN-7592:
---

Thanks [~bibinchundatt] and [~subru] for the comment.

Yes, this works well in both HA and non-HA scenarios, and for the 
*yarn.federation.failover.enabled* flag, IIUC, we can remove that flag too.

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}






[jira] [Comment Edited] (YARN-6900) ZooKeeper based implementation of the FederationStateStore

2018-11-01 Thread Rahul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672561#comment-16672561
 ] 

Rahul Anand edited comment on YARN-6900 at 11/2/18 4:48 AM:


Thanks [~subru] and [~elgoiri] for the comment.

Right now we have two properties in yarn-site.xml for setting up the policy 
manager and its related params, i.e. *yarn.federation.policy-manager* and 
*yarn.federation.policy-manager-params*. According to the code, the params 
property should be set as bytes, but that is obviously not what a user expects. 
So, for discussion, the first thing is the format of this input.

Here is how things look in the znode when I put the policy configuration in the 
state store for the default queue (*) using PriorityBroadcastPolicyManager:

*get /federationstore/policies/**
{code:java}
*Xorg.apache.hadoop.yarn.server.federation.policies.manager.PriorityBroadcastPolicyManager�{"routerPolicyWeights":{"entry":[{"key":{"id":"cluster-bb"},"value":"1.0"},{"key":{"id":"cluster-ra"},"value":"3.0"}]},"amrmPolicyWeights":null,"headroomAlpha":"0.0"}
{code}
So overall we have these things to put in the params field (a serialization 
sketch follows the list):
 1. routerPolicyWeights (weight of a particular subcluster, used for the router 
policy)
 2. amrmPolicyWeights (weight of a particular subcluster, used for the AMRM 
policy)
 3. headroomAlpha
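
A minimal sketch of how these values could be built and serialized into that 
byte payload, assuming the {{WeightedPolicyInfo}} DAO and {{SubClusterIdInfo}} 
record from hadoop-yarn-server-common (method names as I read them from the 
code, so treat the exact signatures as illustrative):
{code:java}
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.server.federation.policies.dao.WeightedPolicyInfo;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterIdInfo;

public class PolicyParamsSketch {
  public static void main(String[] args) throws Exception {
    // Router weights: which subcluster receives an application, and how often.
    Map<SubClusterIdInfo, Float> routerWeights = new HashMap<>();
    routerWeights.put(new SubClusterIdInfo("cluster-bb"), 1.0f);
    routerWeights.put(new SubClusterIdInfo("cluster-ra"), 3.0f);

    WeightedPolicyInfo info = new WeightedPolicyInfo();
    info.setRouterPolicyWeights(routerWeights);
    // amrmPolicyWeights would be set the same way via setAMRMPolicyWeights().
    info.setHeadroomAlpha(0.0f);

    // This buffer is the byte payload that ends up in the params field and,
    // through the state store, in the policies znode shown above.
    ByteBuffer params = info.toByteBuffer();
    System.out.println("serialized params: " + params.remaining() + " bytes");
  }
}
{code}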


was (Author: rahulanand90):
Thanks [~subru] and [~elgoiri] for the comment.

Right now we have two properties in yarn-site.xml for setting up the policy 
manager and its related params, i.e. *yarn.federation.policy-manager* and 
*yarn.federation.policy-manager-params*. According to the code, the params 
property should be set as bytes, but that is obviously not what a user expects. 
So, for discussion, the first thing is the format of this input.

Here is how things look in the znode when I put the policy configuration in the 
state store for the default queue (*) using PriorityBroadcastPolicyManager:

*get /federationstore/policies/**
{code:java}
*Xorg.apache.hadoop.yarn.server.federation.policies.manager.PriorityBroadcastPolicyManager�{"routerPolicyWeights":{"entry":[{"key":{"id":"cluster-bb"},"value":"1.0"},{"key":{"id":"cluster-ra"},"value":"3.0"}]},"amrmPolicyWeights":null,"headroomAlpha":"0.0"}
{code}
So overall we have these things to put in the params field:
 1. amrmpolicyweights (weight of a particular subcluster, used for the router 
policy)
 2. routerpolicyweights (weight of a particular subcluster, used for the amrm 
policy)
 3. headroomAlpha

> ZooKeeper based implementation of the FederationStateStore
> --
>
> Key: YARN-6900
> URL: https://issues.apache.org/jira/browse/YARN-6900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation, nodemanager, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6900-002.patch, YARN-6900-003.patch, 
> YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, 
> YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, 
> YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, 
> YARN-6900-YARN-2915-001.patch
>
>
> YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only 
> support SQL-based stores; this JIRA tracks adding a ZooKeeper-based 
> implementation to simplify deployment, as ZooKeeper is already popularly used 
> for {{RMStateStore}}.






[jira] [Commented] (YARN-6900) ZooKeeper based implementation of the FederationStateStore

2018-11-01 Thread Rahul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672561#comment-16672561
 ] 

Rahul Anand commented on YARN-6900:
---

Thanks [~subru] and [~elgoiri] for the comment.

Right now we have two properties in yarn-site.xml for setting up the policy 
manager and its related params, i.e. *yarn.federation.policy-manager* and 
*yarn.federation.policy-manager-params*. According to the code, the params 
property should be set as bytes, but that is obviously not what a user expects. 
So, for discussion, the first thing is the format of this input.

Here is how things look in the znode when I put the policy configuration in the 
state store for the default queue (*) using PriorityBroadcastPolicyManager:

*get /federationstore/policies/**
{code:java}
*Xorg.apache.hadoop.yarn.server.federation.policies.manager.PriorityBroadcastPolicyManager�{"routerPolicyWeights":{"entry":[{"key":{"id":"cluster-bb"},"value":"1.0"},{"key":{"id":"cluster-ra"},"value":"3.0"}]},"amrmPolicyWeights":null,"headroomAlpha":"0.0"}
{code}
So overall we have these things to put in the params field:
 1. amrmpolicyweights (weight of a particular subcluster, used for the router 
policy)
 2. routerpolicyweights (weight of a particular subcluster, used for the amrm 
policy)
 3. headroomAlpha

> ZooKeeper based implementation of the FederationStateStore
> --
>
> Key: YARN-6900
> URL: https://issues.apache.org/jira/browse/YARN-6900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation, nodemanager, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6900-002.patch, YARN-6900-003.patch, 
> YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, 
> YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, 
> YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, 
> YARN-6900-YARN-2915-001.patch
>
>
> YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only 
> support SQL-based stores; this JIRA tracks adding a ZooKeeper-based 
> implementation to simplify deployment, as ZooKeeper is already popularly used 
> for {{RMStateStore}}.






[jira] [Updated] (YARN-8855) Application fails if one of the subclusters is down.

2018-10-08 Thread Rahul Anand (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rahul Anand updated YARN-8855:
--
Summary: Application fails if one of the subclusters is down.  (was: 
Application submission fails if one of the subclusters is down.)

> Application fails if one of the subclusters is down.
> 
>
> Key: YARN-8855
> URL: https://issues.apache.org/jira/browse/YARN-8855
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rahul Anand
>Priority: Major
>
> If one of the subclusters is down, the application keeps retrying multiple 
> times and then fails. About 30 failover attempts were found in the logs. Below 
> is the detailed exception.
> {code:java}
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container 
> container_e03_1538297667953_0005_01_01 transitioned from 
> CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
> 2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing 
> container_e03_1538297667953_0005_01_01 from application 
> application_1538297667953_0005 | ApplicationImpl.java:512
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
> resource-monitoring for container_e03_1538297667953_0005_01_01 | 
> ContainersMonitorImpl.java:932
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering 
> container container_e03_1538297667953_0005_01_01 for log-aggregation | 
> AppLogAggregatorImpl.java:538
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event 
> CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350
> 2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
> container container_e03_1538297667953_0005_01_01 | 
> YarnShuffleService.java:295
> 2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find 
> container container_e03_1538297667953_0005_01_01 while processing 
> FINISH_CONTAINERS event | ContainerManagerImpl.java:1660
> 2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed 
> containers from NM context: [container_e03_1538297667953_0005_01_01] | 
> NodeStatusUpdaterImpl.java:696
> 2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on account of RM failover. | 
> FederationStateStoreFacade.java:258
> 2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to 
> /192.168.0.25:8032 subClusterId cluster2 with protocol 
> ApplicationClientProtocol as user root (auth:SIMPLE) | 
> FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | 
> java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to 
> node-master1-IYTxR:8032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 
> 28 failover attempts. Trying to failover after sleeping for 15261ms. | 
> RetryInvocationHandler.java:411
> 2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on account of RM failover. | 
> FederationStateStoreFacade.java:258
> 2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to 
> /192.168.0.25:8032 subClusterId cluster2 with protocol 
> ApplicationClientProtocol as user root (auth:SIMPLE) | 
> FederationRMFailoverProxyProvider.java:145
> 2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | 
> java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to 
> node-master1-IYTxR:8032 failed on connection exception: 
> java.net.ConnectException: Connection refused; For more details see: 
> http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
> ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 
> 29 failover attempts. Trying to failover after sleeping for 21175ms. | 
> RetryInvocationHandler.java:411
> 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the 
> ResourceManager for SubClusterId: cluster2 | 
> FederationRMFailoverProxyProvider.java:124
> 2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from 
> cache and rehydrating from store, most likely on account of RM failover. | 
> FederationStateStoreFacade.java:258
> 2018-10-08 14:22:03,186 | INFO | 

[jira] [Created] (YARN-8855) Application submission fails if one of the subclusters is down.

2018-10-08 Thread Rahul Anand (JIRA)
Rahul Anand created YARN-8855:
-

 Summary: Application submission fails if one of the subclusters is 
down.
 Key: YARN-8855
 URL: https://issues.apache.org/jira/browse/YARN-8855
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Rahul Anand


If one of the subclusters is down, the application keeps retrying multiple 
times and then fails. About 30 failover attempts were found in the logs. Below 
is the detailed exception.
{code:java}
2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container 
container_e03_1538297667953_0005_01_01 transitioned from 
CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing 
container_e03_1538297667953_0005_01_01 from application 
application_1538297667953_0005 | ApplicationImpl.java:512
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
resource-monitoring for container_e03_1538297667953_0005_01_01 | 
ContainersMonitorImpl.java:932
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering 
container container_e03_1538297667953_0005_01_01 for log-aggregation | 
AppLogAggregatorImpl.java:538
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event 
CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping 
container container_e03_1538297667953_0005_01_01 | 
YarnShuffleService.java:295
2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find container 
container_e03_1538297667953_0005_01_01 while processing FINISH_CONTAINERS 
event | ContainerManagerImpl.java:1660
2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed 
containers from NM context: [container_e03_1538297667953_0005_01_01] | 
NodeStatusUpdaterImpl.java:696
2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the 
ResourceManager for SubClusterId: cluster2 | 
FederationRMFailoverProxyProvider.java:124
2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from 
cache and rehydrating from store, most likely on account of RM failover. | 
FederationStateStoreFacade.java:258
2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to 
/192.168.0.25:8032 subClusterId cluster2 with protocol 
ApplicationClientProtocol as user root (auth:SIMPLE) | 
FederationRMFailoverProxyProvider.java:145
2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | java.net.ConnectException: 
Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 28 
failover attempts. Trying to failover after sleeping for 15261ms. | 
RetryInvocationHandler.java:411
2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the 
ResourceManager for SubClusterId: cluster2 | 
FederationRMFailoverProxyProvider.java:124
2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from 
cache and rehydrating from store, most likely on account of RM failover. | 
FederationStateStoreFacade.java:258
2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to 
/192.168.0.25:8032 subClusterId cluster2 with protocol 
ApplicationClientProtocol as user root (auth:SIMPLE) | 
FederationRMFailoverProxyProvider.java:145
2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | java.net.ConnectException: 
Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking 
ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 29 
failover attempts. Trying to failover after sleeping for 21175ms. | 
RetryInvocationHandler.java:411
2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the 
ResourceManager for SubClusterId: cluster2 | 
FederationRMFailoverProxyProvider.java:124
2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from 
cache and rehydrating from store, most likely on account of RM failover. | 
FederationStateStoreFacade.java:258
2018-10-08 14:22:03,186 | INFO | pool-16-thread-1 | Connecting to 
/192.168.0.25:8032 subClusterId cluster2 with protocol 
ApplicationClientProtocol as user root (auth:SIMPLE) | 
FederationRMFailoverProxyProvider.java:145
2018-10-08 14:22:03,189 | ERROR | pool-16-thread-1 | Failed to register 
application master: cluster2 Application: appattempt_1538297667953_0005_01 
| FederationInterceptor.java:1106
java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to 
node-master1-IYTxR:8032 

[jira] [Comment Edited] (YARN-6900) ZooKeeper based implementation of the FederationStateStore

2018-10-04 Thread Rahul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637795#comment-16637795
 ] 

Rahul Anand edited comment on YARN-6900 at 10/4/18 12:47 PM:
-

[~elgoiri] [~subru]

I would like to know how one can configure the znode to specify the queue 
policies, i.e. how I can specify the router policy, the AMRM policy, the 
associated weights, etc. in the znode, and in which format. I am unable to find 
this documented anywhere.

Alternatively, if not that, could you help me set the params in 
*yarn.federation.policy-manager-params* so that I can set the policy weights?
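
For reference, this is the yarn-site.xml side I am experimenting with (property 
names taken from the code; the params value below is only a placeholder, since 
its expected format is exactly what is unclear):
{code:xml}
<property>
  <name>yarn.federation.policy-manager</name>
  <value>org.apache.hadoop.yarn.server.federation.policies.manager.PriorityBroadcastPolicyManager</value>
</property>
<property>
  <name>yarn.federation.policy-manager-params</name>
  <!-- placeholder: the expected serialized format of this value is the question -->
  <value>...</value>
</property>
{code}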


was (Author: rahulanand90):
[~elgoiri] [~subru]

I would like to know how one can configure the znode to specify the queue 
policies, i.e. how I can specify the router policy, the AMRM policy, the 
associated weights, etc. in the znode, and in which format. I am unable to find 
this documented anywhere.

> ZooKeeper based implementation of the FederationStateStore
> --
>
> Key: YARN-6900
> URL: https://issues.apache.org/jira/browse/YARN-6900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation, nodemanager, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6900-002.patch, YARN-6900-003.patch, 
> YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, 
> YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, 
> YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, 
> YARN-6900-YARN-2915-001.patch
>
>
> YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only 
> support SQL-based stores; this JIRA tracks adding a ZooKeeper-based 
> implementation to simplify deployment, as ZooKeeper is already popularly used 
> for {{RMStateStore}}.






[jira] [Commented] (YARN-6900) ZooKeeper based implementation of the FederationStateStore

2018-10-03 Thread Rahul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637795#comment-16637795
 ] 

Rahul Anand commented on YARN-6900:
---

[~elgoiri] [~subru]

I would like to know how one can configure the znode to specify the queue 
policies, i.e. how I can specify the router policy, the AMRM policy, the 
associated weights, etc. in the znode, and in which format. I am unable to find 
this documented anywhere.

> ZooKeeper based implementation of the FederationStateStore
> --
>
> Key: YARN-6900
> URL: https://issues.apache.org/jira/browse/YARN-6900
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: federation, nodemanager, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Íñigo Goiri
>Priority: Major
> Fix For: 2.9.0, 3.0.0-beta1
>
> Attachments: YARN-6900-002.patch, YARN-6900-003.patch, 
> YARN-6900-004.patch, YARN-6900-005.patch, YARN-6900-006.patch, 
> YARN-6900-007.patch, YARN-6900-008.patch, YARN-6900-009.patch, 
> YARN-6900-010.patch, YARN-6900-011.patch, YARN-6900-YARN-2915-000.patch, 
> YARN-6900-YARN-2915-001.patch
>
>
> YARN-5408 defines the unified {{FederationStateStore}} API. Currently we only 
> support SQL-based stores; this JIRA tracks adding a ZooKeeper-based 
> implementation to simplify deployment, as ZooKeeper is already popularly used 
> for {{RMStateStore}}.






[jira] [Comment Edited] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-09-24 Thread Rahul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625870#comment-16625870
 ] 

Rahul Anand edited comment on YARN-7592 at 9/24/18 3:46 PM:


Thanks [~bibinchundatt] and [~subru].

Removing *yarn.federation.enabled* from yarn-site.xml could solve this issue but 
would definitely create confusion. So, instead of changing/removing a 
meaningful federation flag or only updating the docs, an alternative solution is 
to create a {{FederationCustomClientRMProxy}} that overrides 
{{ClientRMProxy#createRMProxy}} in {{AMRMClientUtils}} to always select 
{{FederationRMFailoverProxyProvider}} as the *proxy provider* for federation.
{code:java}
public static <T> T createRMProxy(final Configuration configuration,
    final Class<T> protocol, UserGroupInformation user,
    final Token<? extends TokenIdentifier> token) throws IOException {
  ...
  return FederationCustomClientRMProxy.createRMProxy(configuration, protocol);
  ...
}
{code}
After this, we can remove the {{isFederationEnabled}} check from 
{{RMProxy.java}}, restoring the earlier behavior:
{code:java}
protected static <T> T createRMProxy(final Configuration configuration,
    final Class<T> protocol, RMProxy<T> instance) throws IOException {
  ...
  RetryPolicy retryPolicy = createRetryPolicy(conf,
      HAUtil.isHAEnabled(conf));
  ...
}
{code}
{code:java}
protected static <T> T createRMProxy(final Configuration configuration,
    final Class<T> protocol, RMProxy<T> instance, final long retryTime,
    final long retryInterval) throws IOException {
  ...
  RetryPolicy retryPolicy = createRetryPolicy(conf, retryTime, retryInterval,
      HAUtil.isHAEnabled(conf));
  ...
}
{code}
With this change, we don't need to separately specify the *proxy provider* for 
HA and non-HA federation scenarios, while other non-federation settings 
continue as they are.
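
Concretely, with this proposal in place, a manual override like the following 
(which today has to be added for non-HA federation; sketch only) would no 
longer be needed:
{code:xml}
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider</value>
</property>
{code}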


was (Author: rahulanand90):
Thanks [~bibinchundatt] and [~subru]. 

Removing *yarn.federation.enabled* from yarn-site.xml could solve this issue but 
would definitely create confusion. So, instead of changing/removing a 
meaningful federation flag or only updating the docs, an alternative solution is 
to create a {{FederationCustomClientRMProxy}} that overrides 
{{ClientRMProxy#createRMProxy}} in {{AMRMClientUtils}} to always select 
{{FederationRMFailoverProxyProvider}} as the *proxy provider* for federation.
{code:java}
public static <T> T createRMProxy(final Configuration configuration,
    final Class<T> protocol, UserGroupInformation user,
    final Token<? extends TokenIdentifier> token) throws IOException {
  ...
  return FederationCustomClientRMProxy.createRMProxy(configuration, protocol);
  ...
}
{code}
After this, we can remove the {{isFederationEnabled}} check from 
{{RMProxy.java}}, restoring the earlier behavior:
{code:java}
protected static <T> T createRMProxy(final Configuration configuration,
    final Class<T> protocol, RMProxy<T> instance) throws IOException {
  ...
  RetryPolicy retryPolicy = createRetryPolicy(conf,
      HAUtil.isHAEnabled(conf));
  ...
}
{code}
{code:java}
protected static <T> T createRMProxy(final Configuration configuration,
    final Class<T> protocol, RMProxy<T> instance, final long retryTime,
    final long retryInterval) throws IOException {
  ...
  RetryPolicy retryPolicy = createRetryPolicy(conf, retryTime, retryInterval,
      HAUtil.isHAEnabled(conf));
  ...
}
{code}
With this change, we don't need to separately specify the *proxy provider* for 
HA and non-HA scenarios.

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}






[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-09-24 Thread Rahul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625870#comment-16625870
 ] 

Rahul Anand commented on YARN-7592:
---

Thanks [~bibinchundatt] and [~subru]. 

Removing *yarn.federation.enabled* from yarn-site.xml could solve this issue but 
would definitely create confusion. So, instead of changing/removing a 
meaningful federation flag or only updating the docs, an alternative solution is 
to create a {{FederationCustomClientRMProxy}} that overrides 
{{ClientRMProxy#createRMProxy}} in {{AMRMClientUtils}} to always select 
{{FederationRMFailoverProxyProvider}} as the *proxy provider* for federation.
{code:java}
public static <T> T createRMProxy(final Configuration configuration,
    final Class<T> protocol, UserGroupInformation user,
    final Token<? extends TokenIdentifier> token) throws IOException {
  ...
  return FederationCustomClientRMProxy.createRMProxy(configuration, protocol);
  ...
}
{code}
After this, we can remove the {{isFederationEnabled}} check from 
{{RMProxy.java}}, restoring the earlier behavior:
{code:java}
protected static <T> T createRMProxy(final Configuration configuration,
    final Class<T> protocol, RMProxy<T> instance) throws IOException {
  ...
  RetryPolicy retryPolicy = createRetryPolicy(conf,
      HAUtil.isHAEnabled(conf));
  ...
}
{code}
{code:java}
protected static <T> T createRMProxy(final Configuration configuration,
    final Class<T> protocol, RMProxy<T> instance, final long retryTime,
    final long retryInterval) throws IOException {
  ...
  RetryPolicy retryPolicy = createRetryPolicy(conf, retryTime, retryInterval,
      HAUtil.isHAEnabled(conf));
  ...
}
{code}
With this change, we don't need to separately specify the *proxy provider* for 
HA and non-HA scenarios.

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}






[jira] [Comment Edited] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-09-07 Thread Rahul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606769#comment-16606769
 ] 

Rahul Anand edited comment on YARN-7592 at 9/7/18 9:42 AM:
---

As per my understanding, for a non-HA setup with the default configuration, 
this will always create a problem. I have laid out my analysis below.

NodeManager registration starts from {{NodeManager#main}} and eventually invokes 
{{NodeStatusUpdaterImpl#serviceStart}}:
{code:java}
protected void serviceStart() throws Exception {
  try {
    ...
    this.resourceTracker = getRMClient();
    ...
  } catch (Exception e) {
    String errorMessage = "Unexpected error starting NodeStatusUpdater";
    LOG.error(errorMessage, e);
    throw new YarnRuntimeException(e);
  }
}
 {code}
Then, NodeStatusUpdaterImpl#getRMClient tries to create an RM proxy for the 
resource tracker protocol. Now, the federation-enabled check in 
RMProxy#newProxyInstance
{code:java}
if (HAUtil.isHAEnabled(conf) || HAUtil.isFederationEnabled(conf)) {
  RMFailoverProxyProvider<T> provider =
      instance.createRMFailoverProxyProvider(conf, protocol);
{code}
is failing the registration of the NodeManager. By default, 
RMProxy#createRMFailoverProxyProvider will always select 
ConfiguredRMFailoverProxyProvider:
{code:java}
RMFailoverProxyProvider<T> provider = ReflectionUtils.newInstance(
    conf.getClass(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER,
        defaultProviderClass, RMFailoverProxyProvider.class), conf);
provider.init(conf, (RMProxy<T>) this, protocol);
{code}
and eventually it will try to get the RM IDs from 
ConfiguredRMFailoverProxyProvider#init:
{code:java}
Collection<String> rmIds = HAUtil.getRMHAIds(conf);
{code}
which would have been set only in the case of an HA setup, according to 
ResourceManager#serviceInit:
{code}
this.rmContext.setHAEnabled(HAUtil.isHAEnabled(this.conf));
if (this.rmContext.isHAEnabled()) {
  HAUtil.verifyAndSetConfiguration(this.conf);
}
  {code}
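
In other words, a minimal non-HA federation configuration like the following (a 
sketch, with everything else left at defaults) always hits this path:
{code:xml}
<property>
  <name>yarn.federation.enabled</name>
  <value>true</value>
</property>
<!-- yarn.resourcemanager.ha.enabled stays false, so yarn.resourcemanager.ha.rm-ids
     is never set and ConfiguredRMFailoverProxyProvider#init cannot resolve any RM id -->
{code}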
 

When I tried to run with the proxy provider set to 
FederationRMFailoverProxyProvider, the NodeManager started, but realistically 
this works only in the case of a single RM.
{code:xml}
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider</value>
</property>
{code}
Please correct me if I am wrong at any point.

 


was (Author: rahulanand90):
As per my understanding, for a non-HA setup with the default configuration, 
this will always create a problem. I have laid out my analysis below.

NodeManager registration starts from {{NodeManager#main}} and eventually invokes 
{{NodeStatusUpdaterImpl#serviceStart}}:
{code:java}
protected void serviceStart() throws Exception {
  try {
    ...
    this.resourceTracker = getRMClient();
    ...
  } catch (Exception e) {
    String errorMessage = "Unexpected error starting NodeStatusUpdater";
    LOG.error(errorMessage, e);
    throw new YarnRuntimeException(e);
  }
}
 {code}
Then, NodeStatusUpdaterImpl#getRMClient tries to create an RM proxy for the 
resource tracker protocol. Now, the federation-enabled check in 
RMProxy#newProxyInstance
{code:java}
if (HAUtil.isHAEnabled(conf) || HAUtil.isFederationEnabled(conf)) {
  RMFailoverProxyProvider<T> provider =
      instance.createRMFailoverProxyProvider(conf, protocol);
{code}
is failing the registration of the NodeManager. By default, 
RMProxy#createRMFailoverProxyProvider will always select 
ConfiguredRMFailoverProxyProvider:
{code:java}
RMFailoverProxyProvider<T> provider = ReflectionUtils.newInstance(
    conf.getClass(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER,
        defaultProviderClass, RMFailoverProxyProvider.class), conf);
provider.init(conf, (RMProxy<T>) this, protocol);
{code}
and eventually it will try to get the RM IDs from 
ConfiguredRMFailoverProxyProvider#init:
{code:java}
Collection<String> rmIds = HAUtil.getRMHAIds(conf);
{code}
which would have been set only in the case of an HA setup, according to 
ResourceManager#serviceInit:
{code:java}
this.rmContext.setHAEnabled(HAUtil.isHAEnabled(this.conf));
if (this.rmContext.isHAEnabled()) {
  HAUtil.verifyAndSetConfiguration(this.conf);
}
{code}
 

When I tried to run with the proxy provider set to 
FederationRMFailoverProxyProvider, the NodeManager started, but realistically 
this works only in the case of a single RM.
{code:xml}
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider</value>
</property>
{code}
Please correct me if I am wrong at any point.

 

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> 

[jira] [Commented] (YARN-7592) yarn.federation.failover.enabled missing in yarn-default.xml

2018-09-07 Thread Rahul Anand (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606769#comment-16606769
 ] 

Rahul Anand commented on YARN-7592:
---

As per my understanding, for a non-HA setup with the default configuration, 
this will always create a problem. I have laid out my analysis below.

NodeManager registration starts from {{NodeManager#main}} and eventually invokes 
{{NodeStatusUpdaterImpl#serviceStart}}:
{code:java}
protected void serviceStart() throws Exception {
  try {
    ...
    this.resourceTracker = getRMClient();
    ...
  } catch (Exception e) {
    String errorMessage = "Unexpected error starting NodeStatusUpdater";
    LOG.error(errorMessage, e);
    throw new YarnRuntimeException(e);
  }
}
 {code}
Then, NodeStatusUpdaterImpl#getRMClient tries to create an RM proxy for the 
resource tracker protocol. Now, the federation-enabled check in 
RMProxy#newProxyInstance
{code:java}
if (HAUtil.isHAEnabled(conf) || HAUtil.isFederationEnabled(conf)) {
  RMFailoverProxyProvider<T> provider =
      instance.createRMFailoverProxyProvider(conf, protocol);
{code}
is failing the registration of the NodeManager. By default, 
RMProxy#createRMFailoverProxyProvider will always select 
ConfiguredRMFailoverProxyProvider:
{code:java}
RMFailoverProxyProvider<T> provider = ReflectionUtils.newInstance(
    conf.getClass(YarnConfiguration.CLIENT_FAILOVER_PROXY_PROVIDER,
        defaultProviderClass, RMFailoverProxyProvider.class), conf);
provider.init(conf, (RMProxy<T>) this, protocol);
{code}
and eventually it will try to get the RM IDs from 
ConfiguredRMFailoverProxyProvider#init:
{code:java}
Collection<String> rmIds = HAUtil.getRMHAIds(conf);
{code}
which would have been set only in the case of an HA setup, according to 
ResourceManager#serviceInit:
{code:java}
this.rmContext.setHAEnabled(HAUtil.isHAEnabled(this.conf));
if (this.rmContext.isHAEnabled()) {
  HAUtil.verifyAndSetConfiguration(this.conf);
}
{code}
 

When I tried to run with the proxy provider set to 
FederationRMFailoverProxyProvider, the NodeManager started, but realistically 
this works only in the case of a single RM.
{code:xml}
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider</value>
</property>
{code}
Please correct me if I am wrong at any point.

 

> yarn.federation.failover.enabled missing in yarn-default.xml
> 
>
> Key: YARN-7592
> URL: https://issues.apache.org/jira/browse/YARN-7592
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: federation
>Affects Versions: 3.0.0-beta1
>Reporter: Gera Shegalov
>Priority: Major
> Attachments: IssueReproduce.patch
>
>
> yarn.federation.failover.enabled should be documented in yarn-default.xml. I 
> am also not sure why it should be true by default and force the HA retry 
> policy in {{RMProxy#createRMProxy}}


