[jira] [Assigned] (YARN-10890) Node Attributes in Distributed mapping misses update to scheduler when node gets decommissioned/recommissioned

2021-08-18 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi reassigned YARN-10890:
---

Assignee: Tarun Parimi

> Node Attributes in Distributed mapping misses update to scheduler when node 
> gets decommissioned/recommissioned
> --
>
> Key: YARN-10890
> URL: https://issues.apache.org/jira/browse/YARN-10890
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0, 3.2.1
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> The NodeAttributesManagerImpl maintains the node-to-attribute mapping, but it 
> doesn't remove the mapping when a node goes down. This makes sense for 
> centralized mapping, since the attribute mapping is centralized in the RM, so 
> a node going down doesn't affect the mapping.
> In distributed mapping, the node attribute mapping is updated via the NM 
> heartbeat to the RM, so these node attributes are only valid as long as the 
> node is heartbeating. But when a node is decommissioned or lost, the node 
> attribute entry still remains in NodeAttributesManagerImpl.
> After the performance improvement done in YARN-8925, we only update 
> distributed node attributes when necessary. However, when a previously 
> decommissioned node is recommissioned, NodeAttributesManagerImpl still has 
> the old mapping entry belonging to the old SchedulerNode instance that was 
> decommissioned.
> This results in ResourceTrackerService#updateNodeAttributesIfNecessary 
> skipping the update, since it compares against the attributes belonging to 
> the old decommissioned node instance.
> {code:java}
>       if (!NodeLabelUtil
>           .isNodeAttributesEquals(nodeAttributes, currentNodeAttributes)) {
>         this.rmContext.getNodeAttributesManager()
>             .replaceNodeAttributes(NodeAttribute.PREFIX_DISTRIBUTED,
>                 ImmutableMap.of(nodeId.getHost(), nodeAttributes));
>       } else if (LOG.isDebugEnabled()) {
>         LOG.debug("Skip updating node attributes since there is no change for "
>             + nodeId + " : " + nodeAttributes);
>       }
> {code}
> We should remove the distributed node attributes whenever a node gets 
> deactivated to avoid this issue. These attributes will then be added back 
> properly in the scheduler whenever the node becomes active again and 
> registers/heartbeats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10890) Node Attributes in Distributed mapping misses update to scheduler when node gets decommissioned/recommissioned

2021-08-18 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10890:
---

 Summary: Node Attributes in Distributed mapping misses update to 
scheduler when node gets decommissioned/recommissioned
 Key: YARN-10890
 URL: https://issues.apache.org/jira/browse/YARN-10890
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.2.1, 3.3.0
Reporter: Tarun Parimi


The NodeAttributesManagerImpl maintains the node-to-attribute mapping, but it 
doesn't remove the mapping when a node goes down. This makes sense for 
centralized mapping, since the attribute mapping is centralized in the RM, so a 
node going down doesn't affect the mapping.

In distributed mapping, the node attribute mapping is updated via the NM 
heartbeat to the RM, so these node attributes are only valid as long as the 
node is heartbeating. But when a node is decommissioned or lost, the node 
attribute entry still remains in NodeAttributesManagerImpl.

After the performance improvement done in YARN-8925, we only update 
distributed node attributes when necessary. However, when a previously 
decommissioned node is recommissioned, NodeAttributesManagerImpl still has the 
old mapping entry belonging to the old SchedulerNode instance that was 
decommissioned.

This results in ResourceTrackerService#updateNodeAttributesIfNecessary skipping 
the update, since it compares against the attributes belonging to the old 
decommissioned node instance.
{code:java}
    if (!NodeLabelUtil
        .isNodeAttributesEquals(nodeAttributes, currentNodeAttributes)) {
      this.rmContext.getNodeAttributesManager()
          .replaceNodeAttributes(NodeAttribute.PREFIX_DISTRIBUTED,
              ImmutableMap.of(nodeId.getHost(), nodeAttributes));
    } else if (LOG.isDebugEnabled()) {
      LOG.debug("Skip updating node attributes since there is no change for "
          + nodeId + " : " + nodeAttributes);
    }
{code}

We should remove the distributed node attributes whenever a node gets 
deactivated to avoid this issue. These attributes will then be added back 
properly in the scheduler whenever the node becomes active again and 
registers/heartbeats.
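
For illustration, a minimal sketch of the direction proposed above (not the 
actual patch): clear the distributed attribute mapping of a host when its node 
is deactivated, so the next registration/heartbeat repopulates it. The helper 
class, its method, and the import paths are assumptions for this example; only 
the replaceNodeAttributes call mirrors the snippet above.
{code:java}
import java.io.IOException;
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.NodeAttribute;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.nodelabels.NodeAttributesManager;

public final class DistributedAttributeCleanup {
  private DistributedAttributeCleanup() {
  }

  // Hypothetical helper: replace the host's distributed attributes with an
  // empty set when the node is deactivated, so a stale entry no longer masks
  // the update when the node is recommissioned.
  public static void clearDistributedAttributes(NodeAttributesManager manager,
      NodeId nodeId) throws IOException {
    manager.replaceNodeAttributes(NodeAttribute.PREFIX_DISTRIBUTED,
        Collections.singletonMap(nodeId.getHost(),
            Collections.<NodeAttribute>emptySet()));
  }
}
{code}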






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9907) Make YARN Service AM RPC port configurable

2021-07-30 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390541#comment-17390541
 ] 

Tarun Parimi commented on YARN-9907:


[~pbacsko], yes, you are right. We can close this as a duplicate now.

> Make YARN Service AM RPC port configurable
> --
>
> Key: YARN-9907
> URL: https://issues.apache.org/jira/browse/YARN-9907
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9907.001.patch
>
>
> The YARN Service AM uses a random ephemeral port for the ClientAMService RPC. 
> In environments where firewalls block unnecessary ports by default, it is 
> useful to have a configuration that specifies the port range, similar to what 
> we have for MapReduce with {{yarn.app.mapreduce.am.job.client.port-range}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-07-14 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380533#comment-17380533
 ] 

Tarun Parimi commented on YARN-10789:
-

[~snemeth], it looks like the build hasn't been triggered yet for some reason. 
Was there an issue in Jenkins? TestZKConfigurationStore#testDisableAuditLogs 
is passing. The other test failures are unrelated to the patch.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.
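
For illustration, a minimal sketch of the kind of guard described here, using 
the plain Curator API rather than the actual ZKConfigurationStore or 
ZKCuratorManager code: treat NodeExists as benign (with a warning) so that 
concurrent initialization by two RMs does not fail serviceInit. The class and 
method names are only for this example.
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class ConfStoreZnodeInit {
  private static final Logger LOG =
      LoggerFactory.getLogger(ConfStoreZnodeInit.class);

  private ConfStoreZnodeInit() {
  }

  // Create the znode only if it is absent; if the other RM created it in the
  // meantime, log a warning instead of failing with NodeExists.
  public static void createIfAbsent(CuratorFramework curator, String path)
      throws Exception {
    try {
      if (curator.checkExists().forPath(path) == null) {
        curator.create().creatingParentsIfNeeded().forPath(path);
      }
    } catch (KeeperException.NodeExistsException e) {
      LOG.warn("Znode " + path + " already exists; it was likely created by "
          + "the other RM during a concurrent startup", e);
    }
  }
}
{code}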



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-07-07 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: (was: YARN-10789.branch-3.2.001.patch)

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-07-07 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: YARN-10789.branch-3.2.001.patch

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-25 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369280#comment-17369280
 ] 

Tarun Parimi commented on YARN-10789:
-

[~snemeth], reattaching the branch-3.2 patch to trigger the build. It looks 
like the retrigger didn't happen for some reason.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10828) Backport YARN-9789 to branch-3.2

2021-06-25 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369282#comment-17369282
 ] 

Tarun Parimi commented on YARN-10828:
-

Thanks [~snemeth] for reviewing and committing this.

> Backport YARN-9789 to branch-3.2
> 
>
> Key: YARN-10828
> URL: https://issues.apache.org/jira/browse/YARN-10828
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.2.3
>
> Attachments: YARN-10828.branch-3.2.001.patch
>
>
> The YARN-9789 fix is missing in branch-3.2, which causes the unit test 
> TestZKConfigurationStore#testDisableAuditLogs to fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-25 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: (was: YARN-10789.branch-3.2.001.patch)

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-25 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: YARN-10789.branch-3.2.001.patch

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10828) Backport YARN-9789 to branch-3.2

2021-06-24 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368738#comment-17368738
 ] 

Tarun Parimi commented on YARN-10828:
-

The test failures are not related to this patch.

> Backport YARN-9789 to branch-3.2
> 
>
> Key: YARN-10828
> URL: https://issues.apache.org/jira/browse/YARN-10828
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10828.branch-3.2.001.patch
>
>
> The YARN-9789 fix is missing in branch-3.2, which causes the unit test 
> TestZKConfigurationStore#testDisableAuditLogs to fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10828) Backport YARN-9789 to branch-3.2

2021-06-22 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367601#comment-17367601
 ] 

Tarun Parimi commented on YARN-10828:
-

[~snemeth], please review this when you get time. Thanks.

> Backport YARN-9789 to branch-3.2
> 
>
> Key: YARN-10828
> URL: https://issues.apache.org/jira/browse/YARN-10828
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10828.branch-3.2.001.patch
>
>
> The YARN-9789 fix is missing in branch-3.2, which causes the unit test 
> TestZKConfigurationStore#testDisableAuditLogs to fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-22 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367600#comment-17367600
 ] 

Tarun Parimi commented on YARN-10789:
-

[~snemeth], I have created YARN-10828 to backport YARN-9789 to branch-3.2.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10828) Backport YARN-9789 to branch-3.2

2021-06-22 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi reassigned YARN-10828:
---

Assignee: Tarun Parimi

Submitting a backport patch for branch-3.2. Validated that related unit tests 
pass.

> Backport YARN-9789 to branch-3.2
> 
>
> Key: YARN-10828
> URL: https://issues.apache.org/jira/browse/YARN-10828
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10828.branch-3.2.001.patch
>
>
> The YARN-9789 fix is missing in branch-3.2, which causes the unit test 
> TestZKConfigurationStore#testDisableAuditLogs to fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10828) Backport YARN-9789 to branch-3.2

2021-06-22 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10828:

Attachment: YARN-10828.branch-3.2.001.patch

> Backport YARN-9789 to branch-3.2
> 
>
> Key: YARN-10828
> URL: https://issues.apache.org/jira/browse/YARN-10828
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Tarun Parimi
>Priority: Major
> Attachments: YARN-10828.branch-3.2.001.patch
>
>
> The YARN-9789 fix is missing in branch-3.2, which causes the unit test 
> TestZKConfigurationStore#testDisableAuditLogs to fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10828) Backport YARN-9789 to branch-3.2

2021-06-22 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10828:
---

 Summary: Backport YARN-9789 to branch-3.2
 Key: YARN-10828
 URL: https://issues.apache.org/jira/browse/YARN-10828
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.2.0
Reporter: Tarun Parimi


The YARN-9789 fix is missing in branch-3.2, which causes the unit test 
TestZKConfigurationStore#testDisableAuditLogs to fail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-22 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367553#comment-17367553
 ] 

Tarun Parimi commented on YARN-10789:
-

[~snemeth], the failing test in TestZKConfigurationStore is 
testDisableAuditLogs. This unit test was added in YARN-9789, but the YARN-9789 
fix is missing in branch-3.2. It looks like only the unit test part of 
YARN-9789 somehow got backported to branch-3.2, without the corresponding fix. 
To fix this test, we need to backport the YARN-9789 patch to branch-3.2.


> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-15 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: YARN-10789.branch-3.2.001.patch

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-15 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363576#comment-17363576
 ] 

Tarun Parimi commented on YARN-10789:
-

Reattached the patch for branch-3.2, since Jenkins was triggered only for the 
branch-3.3 patch.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-15 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: (was: YARN-10789.branch-3.2.001.patch)

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-14 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: YARN-10789.branch-3.3.001.patch
YARN-10789.branch-3.2.001.patch

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, 
> YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-06-14 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362820#comment-17362820
 ] 

Tarun Parimi commented on YARN-10789:
-

Thanks [~snemeth] for the review and commit. Thanks [~bteke], [~zhuqi] for 
your reviews.

We can backport it to the 3.3/3.2 branches. The trunk patch applies cleanly on 
3.3. I will add a patch for 3.2.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple

2021-06-14 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362752#comment-17362752
 ] 

Tarun Parimi commented on YARN-10816:
-

Thanks [~snemeth] for the review and commit.

> Avoid doing delegation token ops when 
> yarn.timeline-service.http-authentication.type=simple
> ---
>
> Key: YARN-10816
> URL: https://issues.apache.org/jira/browse/YARN-10816
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.4.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10816.001.patch, YARN-10816.002.patch
>
>
> YARN-10339 introduced changes to ensure that PseudoAuthenticationHandler is 
> used in TimelineClient when 
> yarn.timeline-service.http-authentication.type=simple.
> PseudoAuthenticationHandler doesn't support delegation token ops like get, 
> renew, and cancel, since those ops strictly require SPNEGO auth to work. We 
> don't use timeline delegation tokens when simple auth is used.
> Prior to YARN-10339, timeline delegation tokens were unnecessarily used when 
> yarn.timeline-service.http-authentication.type=simple but Hadoop security 
> was enabled. After YARN-10339, the tokens are not used when 
> yarn.timeline-service.http-authentication.type=simple.
> In a rolling upgrade scenario, a client that doesn't have the YARN-10339 
> changes can submit an application and request a timeline delegation token 
> even when yarn.timeline-service.http-authentication.type=simple. The RM, on 
> the other hand, can have the YARN-10339 changes, so renewing the token with 
> PseudoAuthenticationHandler will result in an error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple

2021-06-10 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360708#comment-17360708
 ] 

Tarun Parimi commented on YARN-10816:
-

[~snemeth], please review this when you get some time.

> Avoid doing delegation token ops when 
> yarn.timeline-service.http-authentication.type=simple
> ---
>
> Key: YARN-10816
> URL: https://issues.apache.org/jira/browse/YARN-10816
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.4.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10816.001.patch, YARN-10816.002.patch
>
>
> YARN-10339 introduced changes to ensure that PseudoAuthenticationHandler is 
> used in TimelineClient when 
> yarn.timeline-service.http-authentication.type=simple.
> PseudoAuthenticationHandler doesn't support delegation token ops like get, 
> renew, and cancel, since those ops strictly require SPNEGO auth to work. We 
> don't use timeline delegation tokens when simple auth is used.
> Prior to YARN-10339, timeline delegation tokens were unnecessarily used when 
> yarn.timeline-service.http-authentication.type=simple but Hadoop security 
> was enabled. After YARN-10339, the tokens are not used when 
> yarn.timeline-service.http-authentication.type=simple.
> In a rolling upgrade scenario, a client that doesn't have the YARN-10339 
> changes can submit an application and request a timeline delegation token 
> even when yarn.timeline-service.http-authentication.type=simple. The RM, on 
> the other hand, can have the YARN-10339 changes, so renewing the token with 
> PseudoAuthenticationHandler will result in an error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple

2021-06-10 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10816:

Attachment: YARN-10816.002.patch

> Avoid doing delegation token ops when 
> yarn.timeline-service.http-authentication.type=simple
> ---
>
> Key: YARN-10816
> URL: https://issues.apache.org/jira/browse/YARN-10816
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.4.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10816.001.patch, YARN-10816.002.patch
>
>
> YARN-10339 introduced changes to ensure that PseudoAuthenticationHandler is 
> used in TimelineClient when 
> yarn.timeline-service.http-authentication.type=simple.
> PseudoAuthenticationHandler doesn't support delegation token ops like get, 
> renew, and cancel, since those ops strictly require SPNEGO auth to work. We 
> don't use timeline delegation tokens when simple auth is used.
> Prior to YARN-10339, timeline delegation tokens were unnecessarily used when 
> yarn.timeline-service.http-authentication.type=simple but Hadoop security 
> was enabled. After YARN-10339, the tokens are not used when 
> yarn.timeline-service.http-authentication.type=simple.
> In a rolling upgrade scenario, a client that doesn't have the YARN-10339 
> changes can submit an application and request a timeline delegation token 
> even when yarn.timeline-service.http-authentication.type=simple. The RM, on 
> the other hand, can have the YARN-10339 changes, so renewing the token with 
> PseudoAuthenticationHandler will result in an error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple

2021-06-10 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10816:

Attachment: YARN-10816.001.patch

> Avoid doing delegation token ops when 
> yarn.timeline-service.http-authentication.type=simple
> ---
>
> Key: YARN-10816
> URL: https://issues.apache.org/jira/browse/YARN-10816
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.4.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10816.001.patch
>
>
> YARN-10339 introduced changes to ensure that PseudoAuthenticationHandler is 
> used in TimelineClient when 
> yarn.timeline-service.http-authentication.type=simple.
> PseudoAuthenticationHandler doesn't support delegation token ops like get, 
> renew, and cancel, since those ops strictly require SPNEGO auth to work. We 
> don't use timeline delegation tokens when simple auth is used.
> Prior to YARN-10339, timeline delegation tokens were unnecessarily used when 
> yarn.timeline-service.http-authentication.type=simple but Hadoop security 
> was enabled. After YARN-10339, the tokens are not used when 
> yarn.timeline-service.http-authentication.type=simple.
> In a rolling upgrade scenario, a client that doesn't have the YARN-10339 
> changes can submit an application and request a timeline delegation token 
> even when yarn.timeline-service.http-authentication.type=simple. The RM, on 
> the other hand, can have the YARN-10339 changes, so renewing the token with 
> PseudoAuthenticationHandler will result in an error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple

2021-06-09 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10816:
---

 Summary: Avoid doing delegation token ops when 
yarn.timeline-service.http-authentication.type=simple
 Key: YARN-10816
 URL: https://issues.apache.org/jira/browse/YARN-10816
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineclient
Affects Versions: 3.4.0
Reporter: Tarun Parimi
Assignee: Tarun Parimi


YARN-10339 introduced changes to ensure that PseudoAuthenticationHandler is 
used in TimelineClient when 
yarn.timeline-service.http-authentication.type=simple.

PseudoAuthenticationHandler doesn't support delegation token ops like get, 
renew, and cancel, since those ops strictly require SPNEGO auth to work. We 
don't use timeline delegation tokens when simple auth is used.

Prior to YARN-10339, timeline delegation tokens were unnecessarily used when 
yarn.timeline-service.http-authentication.type=simple but Hadoop security was 
enabled. After YARN-10339, the tokens are not used when 
yarn.timeline-service.http-authentication.type=simple.

In a rolling upgrade scenario, a client that doesn't have the YARN-10339 
changes can submit an application and request a timeline delegation token even 
when yarn.timeline-service.http-authentication.type=simple. The RM, on the 
other hand, can have the YARN-10339 changes, so renewing the token with 
PseudoAuthenticationHandler will result in an error.
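
For illustration, a minimal client-side sketch of the idea, assuming the 
literal configuration key from this description; the helper class, its name, 
and the exact gating are hypothetical and not the actual patch.
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier;

public final class TimelineTokenHelper {
  private TimelineTokenHelper() {
  }

  // Hypothetical helper: fetch a timeline delegation token only when security
  // is enabled and the timeline HTTP authentication type is not "simple",
  // since PseudoAuthenticationHandler cannot serve token get/renew/cancel ops.
  public static Token<TimelineDelegationTokenIdentifier> maybeGetToken(
      Configuration conf, TimelineClient client, String renewer)
      throws IOException, YarnException {
    String authType =
        conf.get("yarn.timeline-service.http-authentication.type", "simple");
    if (!UserGroupInformation.isSecurityEnabled()
        || "simple".equals(authType)) {
      return null; // no delegation token is needed or supported
    }
    return client.getDelegationToken(renewer);
  }
}
{code}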



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-05-31 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354360#comment-17354360
 ] 

Tarun Parimi commented on YARN-10789:
-

Thanks [~snemeth]. Please also take a look at this when you get time.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-05-27 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352401#comment-17352401
 ] 

Tarun Parimi commented on YARN-10789:
-

Thanks [~sunilg]. Added a warn log in the latest patch.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-05-27 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: YARN-10789.002.patch

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch
>
>
> We are observing the below error randomly during Hadoop installation and 
> initial RM startup when HA is enabled and 
> yarn.scheduler.configuration.store.class=zk is configured. This causes one of 
> the RMs to fail to start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the 
> ZKConfigurationStore. The problem is that ZKConfigurationStore is initialized 
> when CapacityScheduler does a serviceInit, and serviceInit is done by both 
> the Active and the Standby RM. So we can run into a race condition where both 
> the Active and the Standby try to create the same znode when both RMs are 
> started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating 
> the znodes only after serviceStart. serviceStart happens only for the active 
> RM that won the leader election, unlike serviceInit, which happens 
> irrespective of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-05-26 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17352290#comment-17352290
 ] 

Tarun Parimi commented on YARN-10789:
-

Tested this patch only manually, with a stability check with RM HA enabled and 
yarn.scheduler.configuration.store.class=zk configured. This race condition is 
tough to reproduce, so it is not possible to write a reliable unit test to cover 
this scenario.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10789.001.patch
>
>
> We are observing below error randomly during hadoop install and RM initial 
> startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
> configured. This causes one of the RMs to not startup.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We are trying to create the znode /confstore/CONF_STORE when we initialize 
> the ZKConfigurationStore. But the problem is that the ZKConfigurationStore is 
> initialized when CapacityScheduler does a serviceInit. This serviceInit is 
> done by both Active and Standby RM. So we can run into a race condition when 
> both Active and Standby try to create the same znode when both RM are started 
> at same time.
> ZKRMStateStore on the other hand avoids such race conditions, by creating the 
> znodes only after serviceStart. serviceStart only happens for the active RM 
> which won the leader election, unlike serviceInit which happens irrespective 
> of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-05-26 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Attachment: YARN-10789.001.patch

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10789.001.patch
>
>
> We are observing below error randomly during hadoop install and RM initial 
> startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
> configured. This causes one of the RMs to not startup.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We are trying to create the znode /confstore/CONF_STORE when we initialize 
> the ZKConfigurationStore. But the problem is that the ZKConfigurationStore is 
> initialized when CapacityScheduler does a serviceInit. This serviceInit is 
> done by both Active and Standby RM. So we can run into a race condition when 
> both Active and Standby try to create the same znode when both RM are started 
> at same time.
> ZKRMStateStore on the other hand avoids such race conditions, by creating the 
> znodes only after serviceStart. serviceStart only happens for the active RM 
> which won the leader election, unlike serviceInit which happens irrespective 
> of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-05-26 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10789:

Description: 
We are observing below error randomly during hadoop install and RM initial 
startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
configured. This causes one of the RMs to not startup.

{code:java}
2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: Service 
RMActiveServices failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists for /confstore/CONF_STORE
{code}

We are trying to create the znode /confstore/CONF_STORE when we initialize the 
ZKConfigurationStore. But the problem is that the ZKConfigurationStore is 
initialized when CapacityScheduler does a serviceInit. This serviceInit is done 
by both Active and Standby RM. So we can run into a race condition when both 
Active and Standby try to create the same znode when both RM are started at 
same time.

ZKRMStateStore on the other hand avoids such race conditions, by creating the 
znodes only after serviceStart. serviceStart only happens for the active RM 
which won the leader election, unlike serviceInit which happens irrespective of 
leader election.

  was:
We are observing below error randomly during hadoop install and RM initial 
startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
configured. This cause one of the RM's to not startup.

{code:java}
2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: Service 
RMActiveServices failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists for /confstore/CONF_STORE
{code}

We are trying to create the znode /confstore/CONF_STORE when we initialize the 
ZKConfigurationStore. But the problem is that the ZKConfigurationStore is 
initialized when CapacityScheduler does a serviceInit. This serviceInit is done 
by both Active and Standby RM. So we can run into a race condition when both 
Active and Standby try to create the same znode when both RM are started at 
same time.

ZKRMStateStore on the other hand avoids such race conditions, by creating the 
znodes only after serviceStart. serviceStart only happens for the active RM 
which won the leader election, unlike serviceInit which happens irrespective of 
leader election.


> RM HA startup can fail due to race conditions in ZKConfigurationStore
> -
>
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> We are observing below error randomly during hadoop install and RM initial 
> startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
> configured. This causes one of the RMs to not startup.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: 
> Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists for /confstore/CONF_STORE
> {code}
> We are trying to create the znode /confstore/CONF_STORE when we initialize 
> the ZKConfigurationStore. But the problem is that the ZKConfigurationStore is 
> initialized when CapacityScheduler does a serviceInit. This serviceInit is 
> done by both Active and Standby RM. So we can run into a race condition when 
> both Active and Standby try to create the same znode when both RM are started 
> at same time.
> ZKRMStateStore on the other hand avoids such race conditions, by creating the 
> znodes only after serviceStart. serviceStart only happens for the active RM 
> which won the leader election, unlike serviceInit which happens irrespective 
> of leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore

2021-05-26 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10789:
---

 Summary: RM HA startup can fail due to race conditions in 
ZKConfigurationStore
 Key: YARN-10789
 URL: https://issues.apache.org/jira/browse/YARN-10789
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Tarun Parimi
Assignee: Tarun Parimi


We are observing below error randomly during hadoop install and RM initial 
startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is 
configured. This cause one of the RM's to not startup.

{code:java}
2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: Service 
RMActiveServices failed in state INITED
org.apache.hadoop.service.ServiceStateException: java.io.IOException: 
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists for /confstore/CONF_STORE
{code}

We are trying to create the znode /confstore/CONF_STORE when we initialize the 
ZKConfigurationStore. But the problem is that the ZKConfigurationStore is 
initialized when CapacityScheduler does a serviceInit. This serviceInit is done 
by both Active and Standby RM. So we can run into a race condition when both 
Active and Standby try to create the same znode when both RM are started at 
same time.

ZKRMStateStore on the other hand avoids such race conditions, by creating the 
znodes only after serviceStart. serviceStart only happens for the active RM 
which won the leader election, unlike serviceInit which happens irrespective of 
leader election.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8564) Add queue level application lifetime monitor in FairScheduler

2021-05-18 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346849#comment-17346849
 ] 

Tarun Parimi commented on YARN-8564:


[~zhuqi], Any reason this jira got resolved? I don't see this patch committed 
anywhere. And it doesn't seem to be a duplicate.

> Add queue level application lifetime monitor in FairScheduler 
> --
>
> Key: YARN-8564
> URL: https://issues.apache.org/jira/browse/YARN-8564
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-8564.001.patch, test1~3.jpg, test4.jpg
>
>
> I wish to have a queue-level application lifetime monitor in FairScheduler. 
> In our large YARN cluster, there are sometimes too many small jobs in one 
> minor queue that may run for too long, which can affect our high priority and 
> very important queues. It would help to have a queue-level application 
> lifetime monitor and to set a small lifetime for the minor queue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10007) YARN logs contain environment variables, which is a security risk

2020-12-15 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10007:

Issue Type: New Feature  (was: Bug)

> YARN logs contain environment variables, which is a security risk
> -
>
> Key: YARN-10007
> URL: https://issues.apache.org/jira/browse/YARN-10007
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: john lilley
>Priority: Major
>
> In most environments it is standard practice to relay "secrets" via 
> environment variables when spawning a process, because the alternatives 
> (command-line args or storing in a file) are insecure.  However, in a YARN 
> application, this also appears to be insecure because the environment is 
> logged.  While YARN has the ability to relay delegation tokens in the launch 
> context, it is unclear how to use this facility for generalized "secrets" 
> that may not conform to security-token structure.  
> For example, the RPDM_KEYSTORE_PASSWORDS env var is found in the aggregated 
> YARN logs:
> {{Container: container_e06_1574362398372_0023_01_01 on 
> node6..com_45454}}
> {{LogAggregationType: AGGREGATED}}
> {{}}
> {{LogType:launch_container.sh}}
> {{LogLastModifiedTime:Sat Nov 23 14:58:12 -0700 2019}}
> {{LogLength:4043}}
> {{LogContents:}}
> {{#!/bin/bash}}{{set -o pipefail -e}}
> {{[...]export 
> HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-"/usr/hdp/2.6.5.1175-1/hadoop-yarn"}}}
> {{export 
> RPDM_KEYSTORE_PASSWORDS="eyJnZW5lcmFsIjoiZmtQZllubmVLRVo4c1Z0V0REQ3gxaHJzRnVjdVN5b1NBTE9OUTF1dEZpZ1x1MDAzZCJ9"}}
>  
>  
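
As an illustration of the launch-context facility mentioned above, a hedged sketch of relaying a generic secret through the container's credentials instead of an environment variable; the alias and helper class below are illustrative, not an existing API beyond the Hadoop classes shown:

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

// Hedged sketch: place a secret in the launch context's credentials so it is
// not written into launch_container.sh or the aggregated logs.
final class SecretRelaySketch {
  static void addSecretToLaunchContext(ContainerLaunchContext ctx,
      String alias, byte[] secret) throws IOException {
    Credentials credentials = new Credentials();
    credentials.addSecretKey(new Text(alias), secret);
    DataOutputBuffer dob = new DataOutputBuffer();
    credentials.writeTokenStorageToStream(dob);
    // The NM hands these credentials to the container process instead of
    // echoing them into the launch script.
    ctx.setTokens(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
  }
}
{code}

Inside the container, the secret would then typically be read back from the current user's Credentials (getSecretKey with the same alias) rather than from the environment.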



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools

2020-10-13 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10458:

Description: 
While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
that the queue creation fails because ACL submit application check couldn't 
succeed.

We tried setting acl_submit_applications to '*' for managed parent queues. For 
static queues, this worked but failed for dynamic queues. Also tried setting 
the below property but it didn't help either.
yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.

RM error log shows the following :

2020-09-18 01:08:40,579 INFO 
org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
 Application application_1600399068816_0460 user user1 mapping [default] to 
[queue1] override false
2020-09-18 01:08:40,579 WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from 
application tag does not have access to  queue 'user1'. The placement is done 
for user 'hive'
 

Checking the code, scheduler#checkAccess() bails out even before checking the 
ACL permissions for that particular queue because the CSQueue is null.


{code:java}
public boolean checkAccess(UserGroupInformation callerUGI,
    QueueACL acl, String queueName) {
  CSQueue queue = getQueue(queueName);
  if (queue == null) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("ACL not found for queue access-type " + acl + " for queue "
          + queueName);
    }
    return false; // <-- the method returns false here.
  }
  return queue.hasAccess(acl, callerUGI);
}
{code}


As this is an auto-created queue, CSQueue may be null in this case. Maybe 
scheduler#checkAccess() should have logic to differentiate the case when CSQueue 
is null: if queue mapping is involved, check whether the parent queue exists and 
is a managed parent, and if so, check whether the parent queue has valid ACLs 
instead of returning false?
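
As a rough sketch of that idea (getParentQueueNameFromPlacement is a hypothetical helper, and this is not a committed change):

{code:java}
public boolean checkAccess(UserGroupInformation callerUGI,
    QueueACL acl, String queueName) {
  CSQueue queue = getQueue(queueName);
  if (queue == null) {
    // Hypothetical helper: resolve the parent queue from the queue mapping.
    String parentName = getParentQueueNameFromPlacement(queueName);
    CSQueue parent = parentName == null ? null : getQueue(parentName);
    if (parent instanceof ManagedParentQueue) {
      // Auto-created leaf queues inherit the leaf-queue-template ACLs, so the
      // managed parent's ACL is a reasonable stand-in before the queue exists.
      return parent.hasAccess(acl, callerUGI);
    }
    if (LOG.isDebugEnabled()) {
      LOG.debug("ACL not found for queue access-type " + acl + " for queue "
          + queueName);
    }
    return false;
  }
  return queue.hasAccess(acl, callerUGI);
}
{code}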

Thanks

  was:
Recently, one of our customers created dynamic queues based on placement rules 
in CDP Private Cloud Base 71.2 to run their Hive on Tez queries but the job 
failed because of not submitting to the appropriate queue.

Analyzing the Resource Manager log, we could see that the queue creation fails 
because ACL submit application check couldn't succeed.

We tried setting acl_submit_applications to '*' for managed parent queues. For 
static queues, this worked but failed for dynamic queues. Also tried setting 
the below property but it didn't help either.
yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.

RM error log shows the following :

2020-09-18 01:08:40,579 INFO 
org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule:
 Application application_1600399068816_0460 user user1 mapping [default] to 
[queue1] override false
2020-09-18 01:08:40,579 WARN 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from 
application tag does not have access to  queue 'user1'. The placement is done 
for user 'hive'
 

Checking the code, scheduler#checkAccess() bails out even before checking the 
ACL permissions for that particular queue because the CSQueue is null.

public boolean checkAccess(UserGroupInformation callerUGI,
QueueACL acl, String queueName) {
CSQueue queue = getQueue(queueName);
if (queue == null) {
if (LOG.isDebugEnabled())

{ LOG.debug("ACL not found for queue access-type " + acl + " for queue " + 
queueName); }
return false;*<-- the method returns false here.*
}
return queue.hasAccess(acl, callerUGI);
}

As this is an auto created queue, CSQueue may be null in this case. May be 
scheduler#checkAccess() should have a logic to differentiate when CSQueue is 
null and if queue mapping is involved and if so, check if the parent queue 
exists and is a managed parent and if so, check if the parent queue has valid 
ACL's instead of returning false ?

Thanks


> Hive On Tez queries fails upon submission to dynamically created pools
> --
>
> Key: YARN-10458
> URL: https://issues.apache.org/jira/browse/YARN-10458
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Reporter: Anand Srinivasan
>Priority: Major
>
> While using Dynamic Auto-Creation and Management of Leaf Queues, we could see 
> that the queue creation fails because ACL submit application check couldn't 
> succeed.
> We tried setting acl_submit_applications to '*' for managed parent queues. 
> For static queues, this worked but failed for dynamic queues. Also tried 
> setting the below property but it didn't help either.
> yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*.
> RM error log shows the following :
> 

[jira] [Created] (YARN-10446) Capacity Scheduler page displays incorrect Configured Capacity

2020-09-23 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10446:
---

 Summary: Capacity Scheduler page displays incorrect Configured 
Capacity
 Key: YARN-10446
 URL: https://issues.apache.org/jira/browse/YARN-10446
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.0
Reporter: Tarun Parimi
 Attachments: configured-capacity.png

The Capacity Scheduler UI always shows Configured Capacity as  
!configured-capacity.png!

The effective capacity value is, however, calculated correctly. This issue seems 
to be because we are displaying the configured min resources, which are only set 
when we use *Absolute Resource Configuration*. When *Percentage based 
configuration* is used, this always displays  .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal

2020-09-21 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi resolved YARN-10440.
-
Resolution: Duplicate

Seems to be similar to YARN-8513. The default config change in YARN-8896 fixes 
it. Try setting 
{noformat}
yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments=100{noformat}
Reopen with a jstack dump if the issue reoccurs with the config change.

> resource manager hangs,and i cannot submit any new jobs,but rm and nm 
> processes are normal
> --
>
> Key: YARN-10440
> URL: https://issues.apache.org/jira/browse/YARN-10440
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.1.1
>Reporter: jufeng li
>Priority: Blocker
>
> RM hangs and I cannot submit any new jobs, but the RM and NM processes are normal. 
> I can open x:8088/cluster/apps/RUNNING but not 
> x:8088/cluster/scheduler. The apps already submitted cannot finish and new 
> apps cannot be submitted. Everything hangs except the RM and NM servers. How can 
> I fix this? Help me, please!
>  
> here is the log:
> {code:java}
> ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang 
> clusterResource= type=NODE_LOCAL 
> requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,679 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,679 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  allocator.AbstractContainerAllocator 
> (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - 
> assignedContainer application attempt=appattempt_1600074574138_66297_01 
> container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition=
> 2020-09-17 00:22:25,680 INFO  capacity.CapacityScheduler 
> (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation 
> proposal
> 2020-09-17 00:22:25,680 INFO  

[jira] [Commented] (YARN-10159) TimelineConnector does not destroy the jersey client

2020-09-04 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190741#comment-17190741
 ] 

Tarun Parimi commented on YARN-10159:
-

[~prabhujoseph]. This issue is present even for the ATS v1 client in branch-2.8, so 
I want to backport it to branch-2.8. Attached the branch-2.8 patch. Can you 
review it when you get time?

> TimelineConnector does not destroy the jersey client
> 
>
> Key: YARN-10159
> URL: https://issues.apache.org/jira/browse/YARN-10159
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Tanu Ajmera
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10159-001.patch, YARN-10159-002.patch, 
> YARN-10159-branch-2.8.001.patch
>
>
> TimelineConnector does not destroy the jersey client. This method must be 
> called when there are not responses pending otherwise undefined behavior will 
> occur.
> http://javadox.com/com.sun.jersey/jersey-client/1.8/com/sun/jersey/api/client/Client.html#destroy()
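
For illustration, a minimal sketch of destroying the client when the connector service stops; the class and field names are assumed for the example and this is not the committed patch:

{code:java}
import com.sun.jersey.api.client.Client;
import org.apache.hadoop.service.AbstractService;

// Hedged sketch: release the Jersey client's resources when the service stops.
class TimelineConnectorSketch extends AbstractService {
  private Client client;

  TimelineConnectorSketch() {
    super(TimelineConnectorSketch.class.getName());
  }

  @Override
  protected void serviceStop() throws Exception {
    if (client != null) {
      // Per the Jersey javadoc, destroy() should only be called once no
      // responses are pending.
      client.destroy();
      client = null;
    }
    super.serviceStop();
  }
}
{code}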



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10159) TimelineConnector does not destroy the jersey client

2020-09-04 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10159:

Attachment: YARN-10159-branch-2.8.001.patch

> TimelineConnector does not destroy the jersey client
> 
>
> Key: YARN-10159
> URL: https://issues.apache.org/jira/browse/YARN-10159
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Tanu Ajmera
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10159-001.patch, YARN-10159-002.patch, 
> YARN-10159-branch-2.8.001.patch
>
>
> TimelineConnector does not destroy the jersey client. This method must be 
> called when there are not responses pending otherwise undefined behavior will 
> occur.
> http://javadox.com/com.sun.jersey/jersey-client/1.8/com/sun/jersey/api/client/Client.html#destroy()



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications

2020-08-04 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17171258#comment-17171258
 ] 

Tarun Parimi commented on YARN-10377:
-

Thanks for the review and commit [~prabhujoseph]

> Clicking on queue in Capacity Scheduler legacy ui does not show any 
> applications
> 
>
> Key: YARN-10377
> URL: https://issues.apache.org/jira/browse/YARN-10377
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot 
> 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch
>
>
> The issue is in the capacity scheduler 
> [http://rm-host:port/clustter/scheduler] page 
>  If I click on the root queue, I am able to see the applications.
>  !Screenshot 2020-07-29 at 12.01.28 PM.png!
> But the application disappears when I click on the leaf queue -> default. 
> This issue is not present in the older 2.7.0 versions and I am able to see 
> apps normally filtered by the leaf queue when clicking on it.
> !Screenshot 2020-07-29 at 12.01.36 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications

2020-08-03 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170087#comment-17170087
 ] 

Tarun Parimi commented on YARN-10377:
-

Thanks [~prabhujoseph] . I have tested it manually and it works fine.

> Clicking on queue in Capacity Scheduler legacy ui does not show any 
> applications
> 
>
> Key: YARN-10377
> URL: https://issues.apache.org/jira/browse/YARN-10377
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot 
> 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch
>
>
> The issue is in the capacity scheduler 
> [http://rm-host:port/cluster/scheduler] page 
>  If I click on the root queue, I am able to see the applications.
>  !Screenshot 2020-07-29 at 12.01.28 PM.png!
> But the application disappears when I click on the leaf queue -> default. 
> This issue is not present in the older 2.7.0 versions and I am able to see 
> apps normally filtered by the leaf queue when clicking on it.
> !Screenshot 2020-07-29 at 12.01.36 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications

2020-08-03 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10377:

Attachment: YARN-10377.001.patch

> Clicking on queue in Capacity Scheduler legacy ui does not show any 
> applications
> 
>
> Key: YARN-10377
> URL: https://issues.apache.org/jira/browse/YARN-10377
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot 
> 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch
>
>
> The issue is in the capacity scheduler 
> [http://rm-host:port/cluster/scheduler] page 
>  If I click on the root queue, I am able to see the applications.
>  !Screenshot 2020-07-29 at 12.01.28 PM.png!
> But the application disappears when I click on the leaf queue -> default. 
> This issue is not present in the older 2.7.0 versions and I am able to see 
> apps normally filtered by the leaf queue when clicking on it.
> !Screenshot 2020-07-29 at 12.01.36 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications

2020-08-03 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi reassigned YARN-10377:
---

Assignee: Tarun Parimi

> Clicking on queue in Capacity Scheduler legacy ui does not show any 
> applications
> 
>
> Key: YARN-10377
> URL: https://issues.apache.org/jira/browse/YARN-10377
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot 
> 2020-07-29 at 12.01.36 PM.png
>
>
> The issue is in the capacity scheduler 
> [http://rm-host:port/cluster/scheduler] page 
>  If I click on the root queue, I am able to see the applications.
>  !Screenshot 2020-07-29 at 12.01.28 PM.png!
> But the application disappears when I click on the leaf queue -> default. 
> This issue is not present in the older 2.7.0 versions and I am able to see 
> apps normally filtered by the leaf queue when clicking on it.
> !Screenshot 2020-07-29 at 12.01.36 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-10378) When NM goes down and comes back up, PC allocation tags are not removed for completed containers

2020-07-30 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi resolved YARN-10378.
-
Resolution: Duplicate

Looks like YARN-10034 fixes this issue for the NM-going-down scenario as well. Closing 
as duplicate.

> When NM goes down and comes back up, PC allocation tags are not removed for 
> completed containers
> 
>
> Key: YARN-10378
> URL: https://issues.apache.org/jira/browse/YARN-10378
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> We are using placement constraints anti-affinity in an application along with 
> a node label. The application requests two containers with anti-affinity on the 
> node label containing only two nodes.
> So two containers will be allocated in the two nodes, one on each node 
> satisfying anti-affinity.
> When one nodemanager goes down for some time, the node is marked as lost by 
> RM and then it will kill all containers in that node.
> The AM will now have one pending container request, since the previous 
> container got killed.
> When the NodeManager comes back up after some time, the pending container is not 
> getting allocated in that node again and the application has to wait forever 
> for that container.
> If the ResourceManager is restarted, this issue disappears and the container 
> gets allocated on the NodeManager which came back up recently.
> This seems to be an issue with the allocation tags not removed.
> The allocation tag is added for the container 
> container_e68_1595886973474_0005_01_03 .
> {code:java}
> 2020-07-28 17:02:04,091 DEBUG constraint.AllocationTagsManager 
> (AllocationTagsManager.java:addContainer(355)) - Added 
> container=container_e68_1595886973474_0005_01_03 with tags=[hbase]\
> {code}
> However, the allocation tag is not removed when the container 
> container_e68_1595886973474_0005_01_03 is released. There is no 
> equivalent DEBUG message seen for removing tags. This means that the tags are 
> not getting removed. If the tag is not removed, then scheduler will not 
> allocate in the same node due to anti-affinity resulting in the issue 
> observed.
> {code:java}
> 2020-07-28 17:19:34,353 DEBUG scheduler.AbstractYarnScheduler 
> (AbstractYarnScheduler.java:updateCompletedContainers(1038)) - Container 
> FINISHED: container_e68_1595886973474_0005_01_03
> 2020-07-28 17:19:34,353 INFO  scheduler.AbstractYarnScheduler 
> (AbstractYarnScheduler.java:completedContainer(669)) - Container 
> container_e68_1595886973474_0005_01_03 completed with event FINISHED, but 
> corresponding RMContainer doesn't exist.
> {code}
> This seems to be due to changes done in YARN-8511 . Change here was made to 
> remove the tags only after NM confirms container is released. However, in our 
> scenario this is not happening. So the tag will never get removed until RM 
> restart.
> Reverting YARN-8511 fixes this particular issue and tags are getting removed. 
> But this is not a valid solution since the problem that YARN-8511 solves is 
> also valid. We need to find a solution which does not break YARN-8511 and 
> also fixes this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10378) When NM goes down and comes back up, PC allocation tags are not removed for completed containers

2020-07-30 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10378:

Description: 
We are using placement constraints anti-affinity in an application along with 
a node label. The application requests two containers with anti-affinity on the 
node label containing only two nodes.

So two containers will be allocated in the two nodes, one on each node 
satisfying anti-affinity.

When one nodemanager goes down for some time, the node is marked as lost by RM 
and then it will kill all containers in that node.

The AM will now have one pending container request, since the previous 
container got killed.

When the NodeManager comes back up after some time, the pending container is not 
getting allocated in that node again and the application has to wait forever 
for that container.

If the ResourceManager is restarted, this issue disappears and the container 
gets allocated on the NodeManager which came back up recently.

This seems to be an issue with the allocation tags not removed.

The allocation tag is added for the container 
container_e68_1595886973474_0005_01_03 .
{code:java}
2020-07-28 17:02:04,091 DEBUG constraint.AllocationTagsManager 
(AllocationTagsManager.java:addContainer(355)) - Added 
container=container_e68_1595886973474_0005_01_03 with tags=[hbase]\
{code}
However, the allocation tag is not removed when the container 
container_e68_1595886973474_0005_01_03 is released. There is no equivalent 
DEBUG message seen for removing tags. This means that the tags are not getting 
removed. If the tag is not removed, then scheduler will not allocate in the 
same node due to anti-affinity resulting in the issue observed.
{code:java}
2020-07-28 17:19:34,353 DEBUG scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:updateCompletedContainers(1038)) - Container 
FINISHED: container_e68_1595886973474_0005_01_03
2020-07-28 17:19:34,353 INFO  scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:completedContainer(669)) - Container 
container_e68_1595886973474_0005_01_03 completed with event FINISHED, but 
corresponding RMContainer doesn't exist.
{code}
This seems to be due to changes done in YARN-8511 . Change here was made to 
remove the tags only after NM confirms container is released. However, in our 
scenario this is not happening. So the tag will never get removed until RM 
restart.

Reverting YARN-8511 fixes this particular issue and tags are getting removed. 
But this is not a valid solution since the problem that YARN-8511 solves is 
also valid. We need to find a solution which does not break YARN-8511 and also 
fixes this issue.

  was:
We are using placement constaints anti-affinity in an application along with 
node label. The application requests two containers with anti affinity on the 
node label containing only two nodes.

So two containers will be allocated in the two nodes, one on each node 
satisfying anti-affinity.

When one nodemanager goes down for some time, the node is marked as lost by RM 
and then it will kill all containers in that node.

The AM will now have one pending container request, since the previous 
container got killed.

When the Nodemanager becomes up after some time, the pending container is not 
getting allocated in that node again and the application has to wait forever 
for that container.

If the ResourceManager is restarted, this issue disappears and the container 
gets allocated on the NodeManager which came back up recently.

This seems to be an issue with the allocation tags not removed.

The allocation tag is added for the container 
container_e68_1595886973474_0005_01_03 .
{code:java}
2020-07-28 17:02:04,091 DEBUG constraint.AllocationTagsManager 
(AllocationTagsManager.java:addContainer(355)) - Added 
container=container_e68_1595886973474_0005_01_03 with tags=[hbase]\
{code}
However, the allocation tag is not removed when the container 
container_e68_1595886973474_0005_01_03 is released. There is no equivalent 
DEBUG message seen for removing tags. This means that the tags are not getting 
removed. If the tag is not removed, then scheduler will not allocate in the 
same node resulting in the issue observed.
{code:java}
2020-07-28 17:19:34,353 DEBUG scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:updateCompletedContainers(1038)) - Container 
FINISHED: container_e68_1595886973474_0005_01_03
2020-07-28 17:19:34,353 INFO  scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:completedContainer(669)) - Container 
container_e68_1595886973474_0005_01_03 completed with event FINISHED, but 
corresponding RMContainer doesn't exist.
{code}
This seems to be due to changes done in YARN-8511 . Change here was made to 
remove the tags only after NM confirms container is released. However, in our 
scenario this is not happening. So the tag will never get removed until RM restart.

[jira] [Created] (YARN-10378) When NM goes down and comes back up, PC allocation tags are not removed for completed containers

2020-07-30 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10378:
---

 Summary: When NM goes down and comes back up, PC allocation tags 
are not removed for completed containers
 Key: YARN-10378
 URL: https://issues.apache.org/jira/browse/YARN-10378
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacity scheduler
Affects Versions: 3.1.1, 3.2.0
Reporter: Tarun Parimi
Assignee: Tarun Parimi


We are using placement constraints anti-affinity in an application along with 
a node label. The application requests two containers with anti-affinity on the 
node label containing only two nodes.

So two containers will be allocated in the two nodes, one on each node 
satisfying anti-affinity.

When one nodemanager goes down for some time, the node is marked as lost by RM 
and then it will kill all containers in that node.

The AM will now have one pending container request, since the previous 
container got killed.

When the NodeManager comes back up after some time, the pending container is not 
getting allocated in that node again and the application has to wait forever 
for that container.

If the ResourceManager is restarted, this issue disappears and the container 
gets allocated on the NodeManager which came back up recently.

This seems to be an issue with the allocation tags not removed.

The allocation tag is added for the container 
container_e68_1595886973474_0005_01_03 .
{code:java}
2020-07-28 17:02:04,091 DEBUG constraint.AllocationTagsManager 
(AllocationTagsManager.java:addContainer(355)) - Added 
container=container_e68_1595886973474_0005_01_03 with tags=[hbase]\
{code}
However, the allocation tag is not removed when the container 
container_e68_1595886973474_0005_01_03 is released. There is no equivalent 
DEBUG message seen for removing tags. This means that the tags are not getting 
removed. If the tag is not removed, then scheduler will not allocate in the 
same node resulting in the issue observed.
{code:java}
2020-07-28 17:19:34,353 DEBUG scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:updateCompletedContainers(1038)) - Container 
FINISHED: container_e68_1595886973474_0005_01_03
2020-07-28 17:19:34,353 INFO  scheduler.AbstractYarnScheduler 
(AbstractYarnScheduler.java:completedContainer(669)) - Container 
container_e68_1595886973474_0005_01_03 completed with event FINISHED, but 
corresponding RMContainer doesn't exist.
{code}
This seems to be due to changes done in YARN-8511 . Change here was made to 
remove the tags only after NM confirms container is released. However, in our 
scenario this is not happening. So the tag will never get removed until RM 
restart.

Reverting YARN-8511 fixes this particular issue and tags are getting removed. 
But this is not a valid solution since the problem that YARN-8511 solves is 
also valid. We need to find a solution which does not break YARN-8511 and also 
fixes this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications

2020-07-29 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10377:
---

 Summary: Clicking on queue in Capacity Scheduler legacy ui does 
not show any applications
 Key: YARN-10377
 URL: https://issues.apache.org/jira/browse/YARN-10377
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.0
Reporter: Tarun Parimi
 Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot 
2020-07-29 at 12.01.36 PM.png

The issue is in the capacity scheduler [http://rm-host:port/cluster/scheduler] 
page 
 If I click on the root queue, I am able to see the applications.
 !Screenshot 2020-07-29 at 12.01.28 PM.png!

But the application disappears when I click on the leaf queue -> default. This 
issue is not present in the older 2.7.0 versions and I am able to see apps 
normally filtered by the leaf queue when clicking on it.

!Screenshot 2020-07-29 at 12.01.36 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-17 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159782#comment-17159782
 ] 

Tarun Parimi commented on YARN-10339:
-

Thanks for the review [~prabhujoseph]

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10339.001.patch, YARN-10339.002.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol

2020-07-07 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153225#comment-17153225
 ] 

Tarun Parimi commented on YARN-10340:
-

[~prabhujoseph], The issue is because the HistoryClientService#initializeWebApp 
instantiates the RPC client connection when creating the WebApp .
{code:java}
ApplicationClientProtocol appClientProtocol =
ClientRMProxy.createRMProxy(conf, ApplicationClientProtocol.class);
{code}

This RPC client proxy instance only uses the mapred UGI captured at the time of 
creation, even for subsequent calls, irrespective of doAs.
I made a code change to verify this by adding the below method in HsWebServices, 
and it works with the correct UGI, fixing the issue.

{code:java}
@Override
protected ContainerReport getContainerReport(
    GetContainerReportRequest request) throws YarnException, IOException {
  return ClientRMProxy.createRMProxy(conf, ApplicationClientProtocol.class)
      .getContainerReport(request).getContainerReport();
}
{code}

This creates a separate RPC client instance every time, though, which is not 
efficient.
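
A possible refinement, purely as a sketch with assumed names, would be to cache one proxy per remote user so that repeated requests do not pay the proxy-creation cost:

{code:java}
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.client.ClientRMProxy;

// Hedged sketch: one proxy per remote user, created lazily. Assumes the caller
// invokes get() inside the remote user's doAs block so the proxy is bound to
// that user's UGI rather than the login user.
class PerUserRMProxyCache {
  private final ConcurrentMap<String, ApplicationClientProtocol> proxies =
      new ConcurrentHashMap<>();
  private final Configuration conf;

  PerUserRMProxyCache(Configuration conf) {
    this.conf = conf;
  }

  ApplicationClientProtocol get(String remoteUserName) {
    return proxies.computeIfAbsent(remoteUserName, user -> {
      try {
        return ClientRMProxy.createRMProxy(conf, ApplicationClientProtocol.class);
      } catch (IOException e) {
        throw new RuntimeException("Failed to create RM proxy for " + user, e);
      }
    });
  }
}
{code}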


> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
> -
>
> Key: YARN-10340
> URL: https://issues.apache.org/jira/browse/YARN-10340
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
>  
> [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs]
> While accessing above link using systest user, the request fails saying 
> mapred user does not have access to the job
>  
> {code:java}
> 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: 
> Could not obtain node HTTP address from provider.
> javax.ws.rs.WebApplicationException: 
> org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have 
> privilege to see this application application_1593997842459_0214
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373)
> at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268)
> at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461)
>  
> {code}
> On analyzing, we found that WebServices#getContainer uses doAs with a UGI created by 
> createRemoteUser(end user) to access RM#ApplicationClientProtocol, which does 
> not work. We need to use createProxyUser to do the same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-07 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152548#comment-17152548
 ] 

Tarun Parimi edited comment on YARN-10339 at 7/7/20, 8:17 AM:
--

Thanks [~prabhujoseph]. When ATSv1 is enabled, delegation tokens are used even 
when auth is simple. I made changes in this patch to add the Timeline Delegation 
Token only when auth is kerberos, and fixed unit test failures and checkstyle.
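
For illustration, a hedged sketch of that gating check (not the committed patch; the helper class is assumed):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: decide whether to fetch and add the timeline delegation token
// based on the timeline server's own auth type, instead of assuming Kerberos
// whenever the cluster itself is secure.
final class TimelineTokenGate {
  private TimelineTokenGate() {
  }

  static boolean shouldAddTimelineDelegationToken(Configuration conf) {
    String authType =
        conf.get("yarn.timeline-service.http-authentication.type", "simple");
    // Only a Kerberos-authenticated timeline server issues delegation tokens.
    return "kerberos".equalsIgnoreCase(authType);
  }
}
{code}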


was (Author: tarunparimi):
Thanks [~prabhujoseph] . When atsv1 is enabled, delegation tokens are used even 
when auth is simple. I made changes in this patch, to add Timeline Delegation 
Token only when auth is simple. And fixed unit test failures and checkstyle.

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10339.001.patch, YARN-10339.002.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-07 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152548#comment-17152548
 ] 

Tarun Parimi commented on YARN-10339:
-

Thanks [~prabhujoseph] . When atsv1 is enabled, delegation tokens are used even 
when auth is simple. I made changes in this patch, to add Timeline Delegation 
Token only when auth is simple. And fixed unit test failures and checkstyle.

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10339.001.patch, YARN-10339.002.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-07 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10339:

Attachment: YARN-10339.002.patch

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10339.001.patch, YARN-10339.002.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol

2020-07-07 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152501#comment-17152501
 ] 

Tarun Parimi commented on YARN-10340:
-

[~prabhujoseph], [~brahmareddy] WebServices#getContainer works properly when 
called by RMWebServices or AHSWebServices. This could be because they use their 
own ClientRMService and ApplicationHistoryClientService respectively. 

But HsWebServices now uses ClientRMService remotely, and so doAs doesn't work 
here as expected.
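
To illustrate the distinction being made here (a sketch only, not the actual 
YARN-10340 patch; the callRm action is a placeholder): a UGI from 
createRemoteUser carries no credentials of its own, so a doAs over a remote 
ApplicationClientProtocol call cannot authenticate, whereas createProxyUser 
layers the end user on top of the service's own login credentials.

{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

public class ProxyUserDoAsSketch {
  // Illustrative: invoke an RM protocol call as "endUser" from a service
  // (e.g. the JHS web layer) that is logged in with its own credentials.
  static <T> T callAsEndUser(String endUser,
      PrivilegedExceptionAction<T> callRm) throws Exception {
    // Not sufficient for a remote RPC: this UGI has no credentials behind it.
    // UserGroupInformation ugi = UserGroupInformation.createRemoteUser(endUser);

    // Proxy the end user over the service's login user instead; the RM must
    // allow this via the hadoop.proxyuser.* settings.
    UserGroupInformation ugi = UserGroupInformation.createProxyUser(
        endUser, UserGroupInformation.getLoginUser());
    return ugi.doAs(callRm);
  }
}
{code}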

> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
> -
>
> Key: YARN-10340
> URL: https://issues.apache.org/jira/browse/YARN-10340
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> HsWebServices getContainerReport uses loginUser instead of remoteUser to 
> access ApplicationClientProtocol
>  
> [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs]
> While accessing the above link as the systest user, the request fails saying 
> the mapred user does not have access to the job.
>  
> {code:java}
> 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: 
> Could not obtain node HTTP address from provider.
> javax.ws.rs.WebApplicationException: 
> org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have 
> privilege to see this application application_1593997842459_0214
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
> at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405)
> at 
> org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373)
> at 
> org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268)
> at 
> org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461)
>  
> {code}
> On analyzing, we found that WebServices#getContainer does a doAs with a UGI 
> created by createRemoteUser(end user) to access RM#ApplicationClientProtocol, 
> which does not work. We need to use createProxyUser instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-06 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10339:

Attachment: YARN-10339.001.patch

> Timeline Client in Nodemanager gets 403 errors when simple auth is used in 
> kerberos environments
> 
>
> Key: YARN-10339
> URL: https://issues.apache.org/jira/browse/YARN-10339
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineclient
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10339.001.patch
>
>
> We get below errors in NodeManager logs whenever we set 
> yarn.timeline-service.http-authentication.type=simple in a cluster which has 
> kerberos enabled. There are use cases where simple auth is used only in 
> timeline server for convenience although kerberos is enabled.
> {code:java}
> 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
> (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline 
> server is not successful, HTTP error code: 403, Server response:
> {"exception":"ForbiddenException","message":"java.lang.Exception: The owner 
> of the posted timeline entities is not 
> set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
> {code}
> This seems to affect the NM timeline publisher which uses 
> TimelineV2ClientImpl. Doing a simple auth directly to timeline service via 
> curl works fine. So this issue is in the authenticator configuration in 
> timeline client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments

2020-07-06 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10339:
---

 Summary: Timeline Client in Nodemanager gets 403 errors when 
simple auth is used in kerberos environments
 Key: YARN-10339
 URL: https://issues.apache.org/jira/browse/YARN-10339
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineclient
Affects Versions: 3.1.0
Reporter: Tarun Parimi
Assignee: Tarun Parimi


We get below errors in NodeManager logs whenever we set 
yarn.timeline-service.http-authentication.type=simple in a cluster which has 
kerberos enabled. There are use cases where simple auth is used only in 
timeline server for convenience although kerberos is enabled.

{code:java}
2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl 
(TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline server 
is not successful, HTTP error code: 403, Server response:

{"exception":"ForbiddenException","message":"java.lang.Exception: The owner of 
the posted timeline entities is not 
set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"}
{code}

This seems to affect the NM timeline publisher which uses TimelineV2ClientImpl. 
Doing a simple auth directly to timeline service via curl works fine. So this 
issue is in the authenticator configuration in timeline client.
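
For context, a sketch of the configuration combination the description refers 
to (illustrative only; the kerberos side is normally set cluster-wide in 
core-site.xml rather than in code):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class MixedAuthConfSketch {
  // Illustrative: a kerberized cluster where only the timeline service is
  // left on simple HTTP authentication, the combination that triggers the
  // 403 described above.
  static Configuration mixedAuthConf() {
    Configuration conf = new Configuration();
    conf.set("hadoop.security.authentication", "kerberos");
    conf.set("yarn.timeline-service.http-authentication.type", "simple");
    return conf;
  }
}
{code}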



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used

2020-05-21 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113342#comment-17113342
 ] 

Tarun Parimi edited comment on YARN-10283 at 5/21/20, 4:31 PM:
---

Thanks [~pbacsko] for the repro test patch. The POC patch changes the behavior 
to include partitions while doing {{reservationsContinueLooking}} in 
RegularContainerAllocator.java. Similar conditions to check for node labels are 
present in several places, such as AbstractCSQueue.java, since 
{{reservationsContinueLooking}} was implemented only for the non node label 
scenario. Ideally we will have to consider fixing YARN-9903 in this scenario.


was (Author: tarunparimi):
Thanks for the repro test patch. The POC patch changes the behavior to include 
partitions while doing {{reservationsContinueLooking}} in 
RegularContainerAllocator.java . Similar conditions to check for nodelabels is 
present in several places such as AbstractCSQueue.java since 
{{reservationsContinueLooking}} was implemented only for non node label 
scenario. Ideally we will have to consider fixing YARN-9903 in this scenario.
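
As a rough illustration of what including partitions in that check could look 
like (a sketch around the condition quoted in the issue description below, not 
the actual POC patch; {{SchedulerNode#getPartition}} is used here on the 
assumption that the node's partition, rather than an empty label set, should 
drive the reservation logic):

{code:java}
// Illustrative fragment only; rmContainer, reservationsContinueLooking and
// node refer to the same variables as the snippet in the issue description.
if (rmContainer == null && reservationsContinueLooking) {
  // "" is the default partition; a labelled node returns its partition name.
  String nodePartition = node.getPartition();
  // ... evaluate the "needs unreserving" limits against nodePartition instead
  // of restricting this path to nodes with no labels ...
}
{code}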

> Capacity Scheduler: starvation occurs if a higher priority queue is full and 
> node labels are used
> -
>
> Key: YARN-10283
> URL: https://issues.apache.org/jira/browse/YARN-10283
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch
>
>
> Recently we've been investigating a scenario where applications submitted to 
> a lower priority queue could not get scheduled because a higher priority 
> queue in the same hierarchy could not satisfy the allocation request. Both 
> queues belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a 
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcore per node)
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priorty = 40) were 
> added to the partition
> * Both queues have a limit of 
> * Using DominantResourceCalculator
> Setup:
> Submit distributed shell application to highprio with switches 
> "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per 
> container.
> Chain of events:
> 1. Queue is filled with containers until it reaches usage  vCores:5>
> 2. A node update event is pushed to CS from a node which is part of the 
> partition
> 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller 
> than the current limit resource 
> 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an 
> allocated container for 
> 5. But we can't commit the resource request because we would have 9 vcores in 
> total, violating the limit.
> The problem is that we always try to assign a container for the same 
> application in each heartbeat from "highprio". Applications in "lowprio" 
> cannot make progress.
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case 
> well. We only reject allocation if this condition is satisfied:
> {noformat}
>  if (rmContainer == null && reservationsContinueLooking
>   && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we enter a different code path and succeed with 
> the allocation if there's room for a container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used

2020-05-21 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17113342#comment-17113342
 ] 

Tarun Parimi commented on YARN-10283:
-

Thanks for the repro test patch. The POC patch changes the behavior to include 
partitions while doing {{reservationsContinueLooking}} in 
RegularContainerAllocator.java. Similar conditions to check for node labels are 
present in several places, such as AbstractCSQueue.java, since 
{{reservationsContinueLooking}} was implemented only for the non node label 
scenario. Ideally we will have to consider fixing YARN-9903 in this scenario.

> Capacity Scheduler: starvation occurs if a higher priority queue is full and 
> node labels are used
> -
>
> Key: YARN-10283
> URL: https://issues.apache.org/jira/browse/YARN-10283
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Peter Bacsko
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch
>
>
> Recently we've been investigating a scenario where applications submitted to 
> a lower priority queue could not get scheduled because a higher priority 
> queue in the same hierarchy could not satisfy the allocation request. Both 
> queues belonged to the same partition.
> If we disabled node labels, the problem disappeared.
> The problem is that {{RegularContainerAllocator}} always allocated a 
> container for the request, even if it should not have.
> *Example:*
> * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcore per node)
> * Partition "shared" was created with 2 nodes
> * "root.lowprio" (priority = 20) and "root.highprio" (priorty = 40) were 
> added to the partition
> * Both queues have a limit of 
> * Using DominantResourceCalculator
> Setup:
> Submit distributed shell application to highprio with switches 
> "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per 
> container.
> Chain of events:
> 1. Queue is filled with containers until it reaches usage  vCores:5>
> 2. A node update event is pushed to CS from a node which is part of the 
> partition
> 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller 
> than the current limit resource 
> 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an 
> allocated container for 
> 5. But we can't commit the resource request because we would have 9 vcores in 
> total, violating the limit.
> The problem is that we always try to assign a container for the same 
> application in each heartbeat from "highprio". Applications in "lowprio" 
> cannot make progress.
> *Problem:*
> {{RegularContainerAllocator.assignContainer()}} does not handle this case 
> well. We only reject allocation if this condition is satisfied:
> {noformat}
>  if (rmContainer == null && reservationsContinueLooking
>   && node.getLabels().isEmpty()) {
> {noformat}
> But if we have node labels, we enter a different code path and succeed with 
> the allocation if there's room for a container.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10240) Prevent Fatal CancelledException in TimelineV2Client when stopping

2020-04-21 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088547#comment-17088547
 ] 

Tarun Parimi commented on YARN-10240:
-

Thanks for the review [~prabhujoseph]

> Prevent Fatal CancelledException in TimelineV2Client when stopping
> --
>
> Key: YARN-10240
> URL: https://issues.apache.org/jira/browse/YARN-10240
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: YARN-10240.001.patch
>
>
> When the timeline client is stopped, it will cancel all sync EntityHolders 
> after waiting for a drain timeout.
> {code:java}
> // if some entities were not drained then we need interrupt
>   // the threads which had put sync EntityHolders to the 
> queue.
>   EntitiesHolder nextEntityInTheQueue = null;
>   while ((nextEntityInTheQueue =
>   timelineEntityQueue.poll()) != null) {
> nextEntityInTheQueue.cancel(true);
>   }
> {code}
> We only handle interrupted exception here.
> {code:java}
> if (sync) {
> // In sync call we need to wait till its published and if any error 
> then
> // throw it back
> try {
>   entitiesHolder.get();
> } catch (ExecutionException e) {
>   throw new YarnException("Failed while publishing entity",
>   e.getCause());
> } catch (InterruptedException e) {
>   Thread.currentThread().interrupt();
>   throw new YarnException("Interrupted while publishing entity", e);
> }
>   }
> {code}
>  But calling nextEntityInTheQueue.cancel(true) will result in 
> entitiesHolder.get() throwing a CancellationException which is not handled. This 
> can result in FATAL error in NM. We need to prevent this.
> {code:java}
> FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in 
> dispatcher thread
> java.util.concurrent.CancellationException
>   at java.util.concurrent.FutureTask.report(FutureTask.java:121)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:348)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-10240) Prevent Fatal CancelledException in TimelineV2Client when stopping

2020-04-20 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi reassigned YARN-10240:
---

Assignee: Tarun Parimi

> Prevent Fatal CancelledException in TimelineV2Client when stopping
> --
>
> Key: YARN-10240
> URL: https://issues.apache.org/jira/browse/YARN-10240
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-10240.001.patch
>
>
> When the timeline client is stopped, it will cancel all sync EntityHolders 
> after waiting for a drain timeout.
> {code:java}
> // if some entities were not drained then we need interrupt
>   // the threads which had put sync EntityHolders to the 
> queue.
>   EntitiesHolder nextEntityInTheQueue = null;
>   while ((nextEntityInTheQueue =
>   timelineEntityQueue.poll()) != null) {
> nextEntityInTheQueue.cancel(true);
>   }
> {code}
> We only handle interrupted exception here.
> {code:java}
> if (sync) {
> // In sync call we need to wait till its published and if any error 
> then
> // throw it back
> try {
>   entitiesHolder.get();
> } catch (ExecutionException e) {
>   throw new YarnException("Failed while publishing entity",
>   e.getCause());
> } catch (InterruptedException e) {
>   Thread.currentThread().interrupt();
>   throw new YarnException("Interrupted while publishing entity", e);
> }
>   }
> {code}
>  But calling nextEntityInTheQueue.cancel(true) will result in 
> entitiesHolder.get() throwing a CancellationException which is not handled. This 
> can result in FATAL error in NM. We need to prevent this.
> {code:java}
> FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in 
> dispatcher thread
> java.util.concurrent.CancellationException
>   at java.util.concurrent.FutureTask.report(FutureTask.java:121)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:348)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10240) Prevent Fatal CancelledException in TimelineV2Client when stopping

2020-04-20 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10240:

Attachment: YARN-10240.001.patch

> Prevent Fatal CancelledException in TimelineV2Client when stopping
> --
>
> Key: YARN-10240
> URL: https://issues.apache.org/jira/browse/YARN-10240
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Reporter: Tarun Parimi
>Priority: Major
> Attachments: YARN-10240.001.patch
>
>
> When the timeline client is stopped, it will cancel all sync EntityHolders 
> after waiting for a drain timeout.
> {code:java}
> // if some entities were not drained then we need interrupt
>   // the threads which had put sync EntityHolders to the 
> queue.
>   EntitiesHolder nextEntityInTheQueue = null;
>   while ((nextEntityInTheQueue =
>   timelineEntityQueue.poll()) != null) {
> nextEntityInTheQueue.cancel(true);
>   }
> {code}
> We only handle interrupted exception here.
> {code:java}
> if (sync) {
> // In sync call we need to wait till its published and if any error 
> then
> // throw it back
> try {
>   entitiesHolder.get();
> } catch (ExecutionException e) {
>   throw new YarnException("Failed while publishing entity",
>   e.getCause());
> } catch (InterruptedException e) {
>   Thread.currentThread().interrupt();
>   throw new YarnException("Interrupted while publishing entity", e);
> }
>   }
> {code}
>  But calling nextEntityInTheQueue.cancel(true) will result in 
> entitiesHolder.get() throwing a CancellationException which is not handled. This 
> can result in FATAL error in NM. We need to prevent this.
> {code:java}
> FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in 
> dispatcher thread
> java.util.concurrent.CancellationException
>   at java.util.concurrent.FutureTask.report(FutureTask.java:121)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:348)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10240) Prevent Fatal CancelledException in TimelineV2Client when stopping

2020-04-20 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10240:
---

 Summary: Prevent Fatal CancelledException in TimelineV2Client when 
stopping
 Key: YARN-10240
 URL: https://issues.apache.org/jira/browse/YARN-10240
 Project: Hadoop YARN
  Issue Type: Bug
  Components: ATSv2
Reporter: Tarun Parimi


When the timeline client is stopped, it will cancel all sync EntityHolders 
after waiting for a drain timeout.

{code:java}
// if some entities were not drained then we need interrupt
  // the threads which had put sync EntityHolders to the queue.
  EntitiesHolder nextEntityInTheQueue = null;
  while ((nextEntityInTheQueue =
  timelineEntityQueue.poll()) != null) {
nextEntityInTheQueue.cancel(true);
  }
{code}

We only handle interrupted exception here.
{code:java}
if (sync) {
// In sync call we need to wait till its published and if any error then
// throw it back
try {
  entitiesHolder.get();
} catch (ExecutionException e) {
  throw new YarnException("Failed while publishing entity",
  e.getCause());
} catch (InterruptedException e) {
  Thread.currentThread().interrupt();
  throw new YarnException("Interrupted while publishing entity", e);
}
  }
{code}

 But calling nextEntityInTheQueue.cancel(true) will result in 
entitiesHolder.get() throwing a CancellationException which is not handled. This 
can result in FATAL error in NM. We need to prevent this.

{code:java}
FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in 
dispatcher thread
java.util.concurrent.CancellationException
at java.util.concurrent.FutureTask.report(FutureTask.java:121)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545)
at 
org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149)
at 
org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:348)
{code}
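
A minimal sketch of the kind of handling implied above, assuming the 
cancellation is surfaced the same way as the interrupted case, i.e. wrapped in 
a YarnException (illustrative, not necessarily the attached patch):

{code:java}
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import org.apache.hadoop.yarn.exceptions.YarnException;

public class SyncPublishSketch {
  // Illustrative: same shape as the sync-publish block quoted above, with an
  // extra catch so cancel(true) during client stop no longer escapes as an
  // unhandled CancellationException and kills the dispatcher thread.
  static void waitForPublish(Future<?> entitiesHolder) throws YarnException {
    try {
      entitiesHolder.get();
    } catch (ExecutionException e) {
      throw new YarnException("Failed while publishing entity", e.getCause());
    } catch (CancellationException e) {
      throw new YarnException("Publishing entity was cancelled", e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new YarnException("Interrupted while publishing entity", e);
    }
  }
}
{code}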







--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9816) EntityGroupFSTimelineStore#scanActiveLogs fails when undesired files are present under /ats/active.

2020-03-18 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9816:
---
Affects Version/s: 2.8.0

> EntityGroupFSTimelineStore#scanActiveLogs fails when undesired files are 
> present under /ats/active.
> ---
>
> Key: YARN-9816
> URL: https://issues.apache.org/jira/browse/YARN-9816
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: timelineserver
>Affects Versions: 2.8.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9816-001.patch
>
>
> EntityGroupFSTimelineStore#scanActiveLogs fails with StackOverflowError.  
> This happens when a file is present under /ats/active.
> {code}
> [hdfs@node2 yarn]$ hadoop fs -ls /ats/active
> Found 1 items
> -rw-r--r--   3 hdfs hadoop  0 2019-09-06 16:34 
> /ats/active/.distcp.tmp.attempt_155759136_39768_m_01_0
> {code}
> Error Message:
> {code:java}
> java.lang.StackOverflowError
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:632)
> at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185)
> at com.sun.proxy.$Proxy15.getListing(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2143)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1076)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1088)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1059)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1038)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1034)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1046)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.list(EntityGroupFSTimelineStore.java:398)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:368)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
> at 
> org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383)
>  {code}
> One of our users tried to distcp the hdfs://ats/active dir. The distcp job 
> has created the 
> temp file 
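
One plausible shape of a guard for this (purely a sketch, not the attached 
YARN-9816 patch; it only assumes the scan iterates the FileStatus entries of 
the active directory):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ActiveDirScanSketch {
  // Illustrative: recurse only into directories under /ats/active, so a stray
  // plain file (such as a distcp temp file) cannot send the scan into
  // unbounded recursion.
  static int scanActiveLogs(FileSystem fs, Path dir) throws IOException {
    int found = 0;
    for (FileStatus entry : fs.listStatus(dir)) {
      if (!entry.isDirectory()) {
        continue; // skip unexpected files instead of recursing on them
      }
      found += scanActiveLogs(fs, entry.getPath());
    }
    return found;
  }
}
{code}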

[jira] [Commented] (YARN-9967) Fix NodeManager failing to start when Hdfs Auxillary Jar is set

2020-03-05 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053104#comment-17053104
 ] 

Tarun Parimi commented on YARN-9967:


Hi [~snemeth], 
You can take it over. 
Thanks.

> Fix NodeManager failing to start when Hdfs Auxillary Jar is set
> ---
>
> Key: YARN-9967
> URL: https://issues.apache.org/jira/browse/YARN-9967
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: auxservices, nodemanager
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Tarun Parimi
>Priority: Major
>
> Loading an auxiliary jar from a Hdfs location on a node manager fails with 
> ClassNotFound Exception
> {code:java}
> 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> classpath: []
> 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: 
> system classes: [java., javax.accessibility., javax.activation., 
> javax.activity., javax.annotation., javax.annotation.processing., 
> javax.crypto., javax.imageio., javax.jws., javax.lang.model., 
> -javax.management.j2ee., javax.management., javax.naming., javax.net., 
> javax.print., javax.rmi., javax.script., -javax.security.auth.message., 
> javax.security.auth., javax.security.cert., javax.security.sasl., 
> javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., 
> -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., 
> org.xml.sax., org.apache.commons.logging., org.apache.log4j., 
> -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, 
> hdfs-default.xml, mapred-default.xml, yarn-default.xml]
> 2019-11-08 03:59:49,257 INFO org.apache.hadoop.service.AbstractService: 
> Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in state INITED
> java.lang.ClassNotFoundException: org.apache.auxtest.AuxServiceFromHDFS
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:270)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:321)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:478)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:936)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1016)
> {code}
> *Repro:*
> {code:java}
> 1. Prepare a custom auxiliary service jar and place it on hdfs
> [hdfs@yarndocker-1 yarn]$ cat TestShuffleHandler2.java 
> package org;
> import org.apache.hadoop.yarn.server.api.AuxiliaryService;
> import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
> import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext;
> import java.nio.ByteBuffer;
> public class TestShuffleHandler2 extends AuxiliaryService {
> public static final String MAPREDUCE_TEST_SHUFFLE_SERVICEID = 
> "test_shuffle2";
> public TestShuffleHandler2() {
>   super("testshuffle2");
> }
> @Override
> public void initializeApplication(ApplicationInitializationContext 
> context) {
> }
> @Override
> public void stopApplication(ApplicationTerminationContext context) {
> }
> @Override
> public synchronized ByteBuffer getMetaData() {
>   return ByteBuffer.allocate(0); 
> }
>   }
>   
> [hdfs@yarndocker-1 yarn]$ javac -d . -cp `hadoop classpath` 
> TestShuffleHandler2.java 
> [hdfs@yarndocker-1 yarn]$ jar cvf auxhdfs.jar org/
> [hdfs@yarndocker-1 mapreduce]$ 

[jira] [Updated] (YARN-10149) container-executor exits with 139 when the permissions of yarn log directory is improper

2020-02-18 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10149:

Description: 
container-executor fails with segmentation fault and exit code 139 when the 
permission of the yarn log directory is not proper.

While running the container-executor manually, we get the below message.

{code:java}
Error checking file stats for /hadoop/yarn/log -1 Permission denied.
{code}

But the exit code is 139 which corresponds to a segmentation fault. This is 
misleading especially since the "Permission denied" is not getting printed in 
the applogs or the NM logs. Only the exit code 139 message is present.

  was:
container-executor fails with segmentation fault and exit code 139 when the 
permission of the yarn log directory is not proper.

While running the container-executor manually, we get the below message.

{code:java}
Error checking file stats for /hadoop/yarn/log Permission denied -1
{code}

But the exit code is 139 which corresponds to a segmentation fault. This is 
misleading especially since the "Permission denied" is not getting printed in 
the applogs or the NM logs. Only the exit code 139 message is present.


> container-executor exits with 139 when the permissions of yarn log directory 
> is improper
> 
>
> Key: YARN-10149
> URL: https://issues.apache.org/jira/browse/YARN-10149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> container-executor fails with segmentation fault and exit code 139 when the 
> permission of the yarn log directory is not proper.
> While running the container-executor manually, we get the below message.
> {code:java}
> Error checking file stats for /hadoop/yarn/log -1 Permission denied.
> {code}
> But the exit code is 139 which corresponds to a segmentation fault. This is 
> misleading especially since the "Permission denied" is not getting printed in 
> the applogs or the NM logs. Only the exit code 139 message is present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-10149) container-executor exits with 139 when the permissions of yarn log directory is improper

2020-02-18 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-10149:

Description: 
container-executor fails with segmentation fault and exit code 139 when the 
permission of the yarn log directory is not proper.

While running the container-executor manually, we get the below message.

{code:java}
Error checking file stats for /hadoop/yarn/log Permission denied -1
{code}

But the exit code is 139 which corresponds to a segmentation fault. This is 
misleading especially since the "Permission denied" is not getting printed in 
the applogs or the NM logs. Only the exit code 139 message is present.

  was:
container-executor fails with segmentation fault and exit code 139 when the 
permission of the yarn log directory was not proper.

While running the container-executor manually, we get the below message.

{code:java}
Error checking file stats for /hadoop/yarn/log Permission denied -1
{code}

But the exit code is 139 which corresponds to a segmentation fault. This is 
misleading especially since the "Permission denied" is not getting printed in 
the applogs or the NM logs.
 


> container-executor exits with 139 when the permissions of yarn log directory 
> is improper
> 
>
> Key: YARN-10149
> URL: https://issues.apache.org/jira/browse/YARN-10149
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> container-executor fails with segmentation fault and exit code 139 when the 
> permission of the yarn log directory is not proper.
> While running the container-executor manually, we get the below message.
> {code:java}
> Error checking file stats for /hadoop/yarn/log Permission denied -1
> {code}
> But the exit code is 139 which corresponds to a segmentation fault. This is 
> misleading especially since the "Permission denied" is not getting printed in 
> the applogs or the NM logs. Only the exit code 139 message is present.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-10149) container-executor exits with 139 when the permissions of yarn log directory is improper

2020-02-18 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-10149:
---

 Summary: container-executor exits with 139 when the permissions of 
yarn log directory is improper
 Key: YARN-10149
 URL: https://issues.apache.org/jira/browse/YARN-10149
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.1.0
Reporter: Tarun Parimi
Assignee: Tarun Parimi


container-executor fails with segmentation fault and exit code 139 when the 
permission of the yarn log directory was not proper.

While running the container-executor manually, we get the below message.

{code:java}
Error checking file stats for /hadoop/yarn/log Permission denied -1
{code}

But the exit code is 139 which corresponds to a segmentation fault. This is 
misleading especially since the "Permission denied" is not getting printed in 
the applogs or the NM logs.
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-21 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979324#comment-16979324
 ] 

Tarun Parimi commented on YARN-9968:


[~snemeth], please review this when you get time. 

> Public Localizer is exiting in NodeManager due to NullPointerException
> --
>
> Key: YARN-9968
> URL: https://issues.apache.org/jira/browse/YARN-9968
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9968.001.patch
>
>
> The Public Localizer is encountering a NullPointerException and exiting.
> {code:java}
> ERROR localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(995)) - Error: Shutting down
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981)
> INFO  localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(997)) - Public cache exiting
> {code}
> The NodeManager still keeps on running. Subsequent localization events for 
> containers keep encountering the below error, resulting in failed 
> Localization of all new containers. 
> {code:java}
> ERROR localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { 
> { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null 
> },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED}
>  for download. Either queue is full or threadpool is shutdown.
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 
> rejected from 
> org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated,
>  pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 
> 382286]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
> at 
> java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899)
> {code}
> When this happens, the NodeManager becomes usable only after a restart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-13 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9968:
---
Attachment: YARN-9968.001.patch

> Public Localizer is exiting in NodeManager due to NullPointerException
> --
>
> Key: YARN-9968
> URL: https://issues.apache.org/jira/browse/YARN-9968
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9968.001.patch
>
>
> The Public Localizer is encountering a NullPointerException and exiting.
> {code:java}
> ERROR localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(995)) - Error: Shutting down
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981)
> INFO  localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(997)) - Public cache exiting
> {code}
> The NodeManager still keeps on running. Subsequent localization events for 
> containers keep encountering the below error, resulting in failed 
> Localization of all new containers. 
> {code:java}
> ERROR localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { 
> { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null 
> },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED}
>  for download. Either queue is full or threadpool is shutdown.
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 
> rejected from 
> org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated,
>  pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 
> 382286]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
> at 
> java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899)
> {code}
> When this happens, the NodeManager becomes usable only after a restart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-13 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973352#comment-16973352
 ] 

Tarun Parimi edited comment on YARN-9968 at 11/13/19 1:56 PM:
--

[~snemeth], I was finally able to reproduce it artificially in my test cluster. 
I added the sleep and subsequent exception below in the FSDownload class to 
simulate HDFS not responding for a minute and then throwing an exception while 
trying to download. When the application which requested the resource gets 
killed during the minute the thread sleeps, I got the null pointer issue and the 
public localizer exited.

{code:java}
  try {
    Thread.sleep(60000);
    throw new ExecutionException("Test", new IOException("Exception"));
  } catch (InterruptedException e) {
    throw new IOException(e);
  }
{code}
From this I understood that the issue occurs when the below sequence of events 
occurs:

1. The public localizer is waiting on the download of a file from hdfs for 
quite some time.
2. The application gets killed/failed while the download is still 
waiting/sleeping. Due to this, the app cleanup is triggered, which removes the 
LocalResourcesTracker for that app.

{code:java}
  private void handleDestroyApplicationResources(Application application) {
String userName = application.getUser();
ApplicationId appId = application.getAppId();
String appIDStr = application.toString();
LocalResourcesTracker appLocalRsrcsTracker =
  appRsrc.remove(appId.toString());
{code}

3. The download finally fails and it throws an exception from HDFS.
4. Since the tracker was removed due to the app kill, we get the NullPointer in 
the below code as the tracker is null. This causes the public localizer to exit 
and not handle future localization requests.
{code:java}
  tracker.handle(new ResourceFailedLocalizationEvent(
  assoc.getResource().getRequest(), diagnostics));
{code}

This issue was introduced by the changes in YARN-8403, where the failed 
localization is notified to the app for logging in the AM.

I think adding a null check to prevent this should be safe, as the AM is 
already killed in this scenario. Will provide an initial patch based on this.

cc [~prabhujoseph]




was (Author: tarunparimi):
[~snemeth], I was finally able reproduce it artificially in my test cluster. I 
added the below the sleep and subsequent exception in FSDownload class to 
simulate the hdfs not responding for a minute and then throwing the exception 
while trying to download. When the application which requested the resource 
gets killed during the minute when the thread sleeps, I got null pointer issue 
and public localizer exited.

{code:java}
  try {
Thread.sleep(6);
 throw new ExecutionException("Test", new IOException("Exception"));
  } catch (InterruptedException e) {
throw new IOException(e);
  }

From this I understood that the issue occurs when the below sequence of events 
occurs:

1. The public localizer is waiting on the download of a file from hdfs for 
quite some time.
2. Application get killed/failed while the download is still waiting/sleeping. 
Due to this the app cleanup is triggered, which removes the 
LocalResourcesTracker for that app.

{code:java}
  private void handleDestroyApplicationResources(Application application) {
String userName = application.getUser();
ApplicationId appId = application.getAppId();
String appIDStr = application.toString();
LocalResourcesTracker appLocalRsrcsTracker =
  appRsrc.remove(appId.toString());
{code}

3. The download finally fails and it throws an exception from HDFS.
4. Since the tracker was removed due to app kill, we get the NullPointer in 
below code as tracker is null . This causes public localizer to exit and not 
handle future localization requests.
{code:java}
  tracker.handle(new ResourceFailedLocalizationEvent(
  assoc.getResource().getRequest(), diagnostics));
{code}

This issue is introduced due to the changes in YARN-8403 , where the failed 
localization is notified to the app for logging in the AM.

I think handling a null check and preventing this should be safe as the AM is 
already killed in this scenario. Will provide an initial patch based on this.

cc [~prabhujoseph]
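
A sketch of the null check proposed here (illustrative; the variable names 
follow the snippet quoted in the comment above, and the log message is made 
up):

{code:java}
// Illustrative fragment: if the application was already cleaned up and its
// tracker removed, log and skip instead of letting the NPE kill the public
// localizer thread.
if (tracker != null) {
  tracker.handle(new ResourceFailedLocalizationEvent(
      assoc.getResource().getRequest(), diagnostics));
} else {
  LOG.warn("Ignoring failed localization for " + assoc.getResource()
      + " because its application is no longer active");
}
{code}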



> Public Localizer is exiting in NodeManager due to NullPointerException
> --
>
> Key: YARN-9968
> URL: https://issues.apache.org/jira/browse/YARN-9968
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> The Public Localizer is encountering a NullPointerException and exiting.
> {code:java}
> 

[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-13 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973352#comment-16973352
 ] 

Tarun Parimi commented on YARN-9968:


[~snemeth], I was finally able reproduce it artificially in my test cluster. I 
added the below the sleep and subsequent exception in FSDownload class to 
simulate the hdfs not responding for a minute and then throwing the exception 
while trying to download. When the application which requested the resource 
gets killed during the minute when the thread sleeps, I got null pointer issue 
and public localizer exited.

{code:java}
  try {
Thread.sleep(6);
 throw new ExecutionException("Test", new IOException("Exception"));
  } catch (InterruptedException e) {
throw new IOException(e);
  }

From this I understood that the issue occurs when the below sequence of events 
occurs:

1. The public localizer is waiting on the download of a file from hdfs for 
quite some time.
2. Application get killed/failed while the download is still waiting/sleeping. 
Due to this the app cleanup is triggered, which removes the 
LocalResourcesTracker for that app.

{code:java}
  private void handleDestroyApplicationResources(Application application) {
String userName = application.getUser();
ApplicationId appId = application.getAppId();
String appIDStr = application.toString();
LocalResourcesTracker appLocalRsrcsTracker =
  appRsrc.remove(appId.toString());
{code}

3. The download finally fails and it throws an exception from HDFS.
4. Since the tracker was removed due to app kill, we get the NullPointer in 
below code as tracker is null . This causes public localizer to exit and not 
handle future localization requests.
{code:java}
  tracker.handle(new ResourceFailedLocalizationEvent(
  assoc.getResource().getRequest(), diagnostics));
{code}

This issue is introduced due to the changes in YARN-8403 , where the failed 
localization is notified to the app for logging in the AM.

I think handling a null check and preventing this should be safe as the AM is 
already killed in this scenario. Will provide an initial patch based on this.

cc [~prabhujoseph]



> Public Localizer is exiting in NodeManager due to NullPointerException
> --
>
> Key: YARN-9968
> URL: https://issues.apache.org/jira/browse/YARN-9968
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> The Public Localizer is encountering a NullPointerException and exiting.
> {code:java}
> ERROR localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(995)) - Error: Shutting down
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981)
> INFO  localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:run(997)) - Public cache exiting
> {code}
> The NodeManager still keeps on running. Subsequent localization events for 
> containers keep encountering the below error, resulting in failed 
> Localization of all new containers. 
> {code:java}
> ERROR localizer.ResourceLocalizationService 
> (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { 
> { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null 
> },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED}
>  for download. Either queue is full or threadpool is shutdown.
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 
> rejected from 
> org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated,
>  pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 
> 382286]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
> at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
> at 
> java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899)
> {code}
> When this happens, the NodeManager becomes usable only after a restart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (YARN-9925) CapacitySchedulerQueueManager allows unsupported Queue hierarchy

2019-11-13 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973285#comment-16973285
 ] 

Tarun Parimi edited comment on YARN-9925 at 11/13/19 12:08 PM:
---

[~vinodkv], it is fine with me. I was searching for documentation specifying that leaf queue names must be unique, and I don't see anything currently in the Apache docs referencing it.

I guess a single line mentioning that all queue names have to be unique under 
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setting_up_queues
 would be helpful. Shall I create a Jira for this doc change?


was (Author: tarunparimi):
[~vinodkv] , it is fine for me. I was searching for the documentation 
specifying the unique leaf queue name. I don't see anything currently in apache 
docs referencing it. 

I guess a single line mentioning all queue names to be unique under 
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setting_up_queues
 would be helpful. Shall I create a jira for this doc change?

> CapacitySchedulerQueueManager allows unsupported Queue hierarchy
> 
>
> Key: YARN-9925
> URL: https://issues.apache.org/jira/browse/YARN-9925
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9925-001.patch, YARN-9925-002.patch, 
> YARN-9925-003.patch
>
>
> CapacitySchedulerQueueManager allows unsupported Queue hierarchy. When 
> creating a queue with same name as an existing parent queue name - it has to 
> fail with below.
> {code:java}
> Caused by: java.io.IOException: A is moved from:root.A to:root.B.A after 
> refresh, which is not allowed.Caused by: java.io.IOException: A is moved 
> from:root.A to:root.B.A after refresh, which is not allowed. at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:335)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:180)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:762)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:473)
>  ... 70 more 
> {code}
> In Some cases, the error is not thrown while creating the queue but thrown at 
> submission of job "Failed to submit application_1571677375269_0002 to YARN : 
> Application application_1571677375269_0002 submitted by user : systest to 
> non-leaf queue : B"
> Below scenarios are allowed, but they should not be:
> {code:java}
> It allows root.A.A1.B when root.B.B1 already exists.
>
> 1. Add root.A
> 2. Add root.A.A1
> 3. Add root.B
> 4. Add root.B.B1
> 5. Allows Add of root.A.A1.B 
> It allows two root queues:
>
> 1. Add root.A
> 2. Add root.B
> 3. Add root.A.A1
> 4. Allows Add of root.A.A1.root
>
> {code}
> Below scenario is handled properly:
> {code:java}
> It does not allow root.B.A when root.A.A1 already exists.
>  
> 1. Add root.A
> 2. Add root.B
> 3. Add root.A.A1
> 4. Does not Allow Add of root.B.A
> {code}
> This error handling has to be consistent in all scenarios.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9925) CapacitySchedulerQueueManager allows unsupported Queue hierarchy

2019-11-13 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973285#comment-16973285
 ] 

Tarun Parimi commented on YARN-9925:


[~vinodkv] , it is fine for me. I was searching for the documentation 
specifying the unique leaf queue name. I don't see anything currently in apache 
docs referencing it. 

I guess a single line mentioning all queue names to be unique under 
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setting_up_queues
 would be helpful. Shall I create a jira for this doc change?

> CapacitySchedulerQueueManager allows unsupported Queue hierarchy
> 
>
> Key: YARN-9925
> URL: https://issues.apache.org/jira/browse/YARN-9925
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.3.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-9925-001.patch, YARN-9925-002.patch, 
> YARN-9925-003.patch
>
>
> CapacitySchedulerQueueManager allows unsupported Queue hierarchy. When 
> creating a queue with same name as an existing parent queue name - it has to 
> fail with below.
> {code:java}
> Caused by: java.io.IOException: A is moved from:root.A to:root.B.A after 
> refresh, which is not allowed.Caused by: java.io.IOException: A is moved 
> from:root.A to:root.B.A after refresh, which is not allowed. at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:335)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:180)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:762)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:473)
>  ... 70 more 
> {code}
> In Some cases, the error is not thrown while creating the queue but thrown at 
> submission of job "Failed to submit application_1571677375269_0002 to YARN : 
> Application application_1571677375269_0002 submitted by user : systest to 
> non-leaf queue : B"
> Below scenarios are allowed, but they should not be:
> {code:java}
> It allows root.A.A1.B when root.B.B1 already exists.
>
> 1. Add root.A
> 2. Add root.A.A1
> 3. Add root.B
> 4. Add root.B.B1
> 5. Allows Add of root.A.A1.B 
> It allows two root queues:
>
> 1. Add root.A
> 2. Add root.B
> 3. Add root.A.A1
> 4. Allows Add of root.A.A1.root
>
> {code}
> Below scenario is handled properly:
> {code:java}
> It does not allow root.B.A when root.A.A1 already exists.
>  
> 1. Add root.A
> 2. Add root.B
> 3. Add root.A.A1
> 4. Does not Allow Add of root.B.A
> {code}
> This error handling has to be consistent in all scenarios.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-12 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972420#comment-16972420
 ] 

Tarun Parimi commented on YARN-9968:


Hi [~snemeth]. Thanks for looking into this.
I have not been able to reproduce the issue so far. It is happening on a heavily loaded production cluster. The cluster is also configured to use DefaultContainerExecutor, so localization is done entirely inside the NM JVM process.

The NullPointerException occurs in the below code where tracker.handle() is called. It looks like tracker is becoming null for some reason. Adding a null check on tracker might be a simple workaround, but understanding how the issue occurs might give us a better way to fix it.
{code:java}
  final String diagnostics = "Failed to download resource " +
      assoc.getResource() + " " + e.getCause();
  tracker.handle(new ResourceFailedLocalizationEvent(
      assoc.getResource().getRequest(), diagnostics));
{code}

There are also multiple HDFS warnings during localization in the log just before this NullPointerException. So I think those HDFS issues during localization are definitely related and are causing the problem in the first place, but I haven't completely figured out how.

{code:java}
WARN  impl.BlockReaderFactory 
(BlockReaderFactory.java:getRemoteBlockReaderFromTcp(764)) - I/O error 
constructing remote block reader.
java.io.IOException: Got error, status=ERROR, status message opReadBlock 
BP-290360126-127.0.0.1-1559634768162:blk_3454574939_2740457478 received 
exception java.io.IOException: No data exists for block 
BP-290360126-127.0.0.1-1559634768162:blk_blk_3454574939_2740457478, for 
OP_READ_BLOCK, self=/127.0.0.1:15810, remote=/127.0.0.1:50010, for file 
/tmp/hadoop-yarn/staging/job-user/.staging/job_1571858983080_36874/job.jar, for 
pool BP-290360126-127.0.0.1-1559634768162 block 3814574939_2740867478
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:110)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.checkSuccess(BlockReaderRemote.java:440)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:408)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379)
at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:641)
at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:572)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:754)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820)
at java.io.DataInputStream.read(DataInputStream.java:149)
at 
org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:100)
at 
org.apache.commons.io.input.TeeInputStream.read(TeeInputStream.java:129)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.PushbackInputStream.read(PushbackInputStream.java:186)
at java.util.zip.ZipInputStream.readFully(ZipInputStream.java:403)
at java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:278)
at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:122)
at java.util.jar.JarInputStream.<init>(JarInputStream.java:83)
at java.util.jar.JarInputStream.<init>(JarInputStream.java:62)
at org.apache.hadoop.util.RunJar.unJar(RunJar.java:114)
at org.apache.hadoop.util.RunJar.unJarAndSave(RunJar.java:167)
at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:354)
at 
org.apache.hadoop.yarn.util.FSDownload.downloadAndUnpack(FSDownload.java:303)
at 
org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:283)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
at 

[jira] [Created] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException

2019-11-12 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-9968:
--

 Summary: Public Localizer is exiting in NodeManager due to 
NullPointerException
 Key: YARN-9968
 URL: https://issues.apache.org/jira/browse/YARN-9968
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.1.0
Reporter: Tarun Parimi
Assignee: Tarun Parimi


The Public Localizer is encountering a NullPointerException and exiting.

{code:java}
ERROR localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(995)) - Error: Shutting down
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981)

INFO  localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:run(997)) - Public cache exiting
{code}

The NodeManager still keeps on running. Subsequent localization events for 
containers keep encountering the below error, resulting in failed Localization 
of all new containers. 

{code:java}
ERROR localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { { 
hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null 
},pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED}
 for download. Either queue is full or threadpool is shutdown.
java.util.concurrent.RejectedExecutionException: Task 
java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 rejected 
from 
org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated, 
pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 382286]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at 
java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899)
{code}

When this happens, the NodeManager becomes usable only after a restart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-24 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16958613#comment-16958613
 ] 

Tarun Parimi commented on YARN-9921:


Thanks for the reviews [~tangzhankun] and [~prabhujoseph#1]

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.3.0, 3.1.4
>
> Attachments: YARN-9921.001.patch, differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid 
> with everything same except numAllocations as expected. But still the below 
> equals check in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9772) CapacitySchedulerQueueManager has incorrect list of queues

2019-10-23 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957776#comment-16957776
 ] 

Tarun Parimi commented on YARN-9772:


Operators having several hundreds of queues might have accidentally configured them this way, since there is currently no documentation which says to do otherwise.

Detailing it in the documentation and printing the complete queue paths which violate the rule will help those few people change their queue configs properly.

> CapacitySchedulerQueueManager has incorrect list of queues
> --
>
> Key: YARN-9772
> URL: https://issues.apache.org/jira/browse/YARN-9772
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>
> CapacitySchedulerQueueManager has incorrect list of queues when there is more 
> than one parent queue (say at middle level) with same name.
> For example,
>  * root
>  ** a
>  *** b
>   c
>  *** d
>   b
>  * e
> {{CapacitySchedulerQueueManager#getQueues}} maintains these list of queues. 
> While parsing "root.a.d.b", it overrides "root.a.b" with new Queue object in 
> the map because of similar name. After parsing all the queues, map count 
> should be 7, but it is 6. Any reference to queue "root.a.b" in code path is 
> nothing but "root.a.d.b" object. Since 
> {{CapacitySchedulerQueueManager#getQueues}} has been used in multiple places, 
> will need to understand the implications in detail. For example, 
> {{CapapcityScheduler#getQueue}} has been used in many places which in turn 
> uses {{CapacitySchedulerQueueManager#getQueues}}. cc [~eepayne], [~sunilg]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9928) ATSv2 can make NM go down with a FATAL error while it is resyncing with RM

2019-10-22 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957101#comment-16957101
 ] 

Tarun Parimi commented on YARN-9928:


The issue occurs because the container returned in the below code snippet is null.

{code:java}
  private void publishContainerCreatedEvent(ContainerEvent event) {
    if (publishNMContainerEvents) {
      ContainerId containerId = event.getContainerID();
      ContainerEntity entity = createContainerEntity(containerId);
      Container container = context.getContainers().get(containerId);
      Resource resource = container.getResource();
{code}

This issue does not usually occur because a null check for the same container is already done earlier in ContainerManagerImpl.

{code:java}
  Map<ContainerId, Container> containers =
      ContainerManagerImpl.this.context.getContainers();
  Container c = containers.get(event.getContainerID());
  if (c != null) {
    c.handle(event);
    if (nmMetricsPublisher != null) {
      nmMetricsPublisher.publishContainerEvent(event);
    }
  }
{code}

But in a heavily loaded production cluster with lots of events in the ContainerManager dispatcher, and when the NM is also resyncing with the RM at the same time in a separate NM dispatcher thread, the resync can suddenly remove all the completed containers.

So an additional null check is needed for the container in these scenarios.
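For illustration, a rough sketch of the kind of guard I have in mind in NMTimelinePublisher#publishContainerCreatedEvent (a sketch only, not the final patch):

{code:java}
  private void publishContainerCreatedEvent(ContainerEvent event) {
    if (publishNMContainerEvents) {
      ContainerId containerId = event.getContainerID();
      Container container = context.getContainers().get(containerId);
      // The container may already have been removed from the NM context while
      // the NM is resyncing with the RM, so bail out instead of hitting a
      // NullPointerException in the dispatcher thread.
      if (container == null) {
        LOG.warn("Skipping container created event for " + containerId
            + " since it no longer exists in the NM context.");
        return;
      }
      ContainerEntity entity = createContainerEntity(containerId);
      Resource resource = container.getResource();
      // ... rest of the method unchanged ...
    }
  }
{code}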




> ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
> --
>
> Key: YARN-9928
> URL: https://issues.apache.org/jira/browse/YARN-9928
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> Encountered the below FATAL error in the NodeManager, which was under heavy 
> load and was also resyncing with the RM at the same time. This caused the NM 
> to go down. 
> {code:java}
> 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher 
> (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216)
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9928) ATSv2 can make NM go down with a FATAL error while it is resyncing with RM

2019-10-22 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9928:
---
Component/s: ATSv2

> ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
> --
>
> Key: YARN-9928
> URL: https://issues.apache.org/jira/browse/YARN-9928
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: ATSv2
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> Encountered the below FATAL error in the NodeManager, which was under heavy 
> load and was also resyncing with the RM at the same time. This caused the NM 
> to go down. 
> {code:java}
> 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher 
> (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216)
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9928) ATSv2 can make NM go down with a FATAL error while it is resyncing with RM

2019-10-22 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9928:
---
Affects Version/s: 3.1.0

> ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
> --
>
> Key: YARN-9928
> URL: https://issues.apache.org/jira/browse/YARN-9928
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
>
> Encountered the below FATAL error in the NodeManager, which was under heavy 
> load and was also resyncing with the RM at the same time. This caused the NM 
> to go down. 
> {code:java}
> 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher 
> (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216)
> at 
> org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9928) ATSv2 can make NM go down with a FATAL error while it is resyncing with RM

2019-10-22 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-9928:
--

 Summary: ATSv2 can make NM go down with a FATAL error while it is 
resyncing with RM
 Key: YARN-9928
 URL: https://issues.apache.org/jira/browse/YARN-9928
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tarun Parimi
Assignee: Tarun Parimi


Encountered the below FATAL error in the NodeManager, which was under heavy load and was also resyncing with the RM at the same time. This caused the NM to go down. 


{code:java}
2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher 
(AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216)
at 
org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
at java.lang.Thread.run(Thread.java:748)
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9773) Add QueueMetrics for Custom Resources

2019-10-21 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955958#comment-16955958
 ] 

Tarun Parimi commented on YARN-9773:


Got a findbugs warning from the changes done in this jira.
https://builds.apache.org/job/PreCommit-YARN-Build/25021/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html

> Add QueueMetrics for Custom Resources
> -
>
> Key: YARN-9773
> URL: https://issues.apache.org/jira/browse/YARN-9773
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
> Fix For: 3.3.0, 3.2.2, 3.1.4
>
> Attachments: YARN-9773.001.patch, YARN-9773.002.patch, 
> YARN-9773.003.patch
>
>
> Although the custom resource metrics are calculated and saved as a 
> QueueMetricsForCustomResources object within the QueueMetrics class, the JMX 
> and Simon QueueMetrics do not report that information for custom resources. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-21 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955955#comment-16955955
 ] 

Tarun Parimi commented on YARN-9921:


The Findbugs warning is due to the changes done in YARN-9773  and is not 
related to the patch. 

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9921.001.patch, differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid 
> with everything same except numAllocations as expected. But still the below 
> equals check in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-21 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955803#comment-16955803
 ] 

Tarun Parimi commented on YARN-9921:


Thanks for the review [~tangzhankun].

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9921.001.patch, differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid 
> with everything same except numAllocations as expected. But still the below 
> equals check in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-20 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955763#comment-16955763
 ] 

Tarun Parimi edited comment on YARN-9921 at 10/21/19 5:55 AM:
--

Submitting a patch which changes the equals method in SchedulingRequestPBImpl to compare the objects instead of the proto. Verified that this fixes the issue in my cluster, where it was reproducing.
Added a test case for updatePendingAsk with a newly constructed SchedulingRequest.

[~sunilg],[~cheersyang],[~eyang],[~Prabhu Joseph] Please check when you get 
time.


was (Author: tarunparimi):
Submitting a patch which changes the equals method in SchedulingRequestPBImpl to compare the objects instead of the proto. Verified that this fixes the issue in my cluster, where it was reproducing.
Added a test case for updatePendingAsk with a newly constructed SchedulingRequest.

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9921.001.patch, differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid 
> with everything same except numAllocations as expected. But still the below 
> equals check in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-20 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9921:
---
Attachment: YARN-9921.001.patch

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9921.001.patch, differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid 
> with everything same except numAllocations as expected. But still the below 
> equals check in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-20 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955755#comment-16955755
 ] 

Tarun Parimi commented on YARN-9921:


On debugging this, I found that protobuf considers the targetExpressions objects unequal.

This is because the order of elements in targetExpressions is expected to be the same, but the order can change, as we can see below. !differenceProtobuf.png!

The order changes because targetExpressions is defined as an unordered Set.

{code:java}
/**
 * Get the target expressions of the constraint.
 *
 * @return the set of target expressions
 */
public Set<TargetExpression> getTargetExpressions() {
  return targetExpressions;
}
{code}

But the corresponding proto field is defined as a repeated field, and I see in 
https://github.com/protocolbuffers/protobuf/issues/2116 that order is strictly 
checked for repeated fields.

{code:java}
  repeated PlacementConstraintTargetProto targetExpressions = 2;
{code}

I don't think it is safe to make any changes to the proto to handle this issue, as it can cause backward compatibility, upgrade, and other problems.

A simple fix is to change the equals method in SchedulingRequestPBImpl so that it does not depend on the equals method of protobuf. I will submit a working patch for this soon.
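For illustration, a rough sketch of the direction for SchedulingRequestPBImpl#equals (a sketch only, not the actual patch; the getters below are the public SchedulingRequest accessors, and it assumes the PlacementConstraint objects compare their target expressions as sets):

{code:java}
  // Sketch only: compare the object-level fields so that the Set-valued
  // targetExpressions inside the placement constraint are not subject to
  // protobuf's order-sensitive equality for repeated fields.
  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof SchedulingRequestPBImpl)) {
      return false;
    }
    SchedulingRequestPBImpl other = (SchedulingRequestPBImpl) obj;
    return getAllocationRequestId() == other.getAllocationRequestId()
        && java.util.Objects.equals(getPriority(), other.getPriority())
        && java.util.Objects.equals(getExecutionType(), other.getExecutionType())
        && java.util.Objects.equals(getAllocationTags(), other.getAllocationTags())
        && java.util.Objects.equals(getResourceSizing(), other.getResourceSizing())
        && java.util.Objects.equals(getPlacementConstraint(),
            other.getPlacementConstraint());
  }
{code}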
 

 

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid 
> with everything same except numAllocations as expected. But still the below 
> equals check in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-20 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9921:
---
Attachment: differenceProtobuf.png

> Issue in PlacementConstraint when YARN Service AM retries allocation on 
> component failure.
> --
>
> Key: YARN-9921
> URL: https://issues.apache.org/jira/browse/YARN-9921
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: differenceProtobuf.png
>
>
> When YARN Service AM tries to relaunch a container on failure, we encounter 
> the below error in PlacementConstraints.
> {code:java}
> ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.YarnException: 
> org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
> Invalid updated SchedulingRequest added to scheduler, we only allows changing 
> numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=0, 
> resources=}, 
> placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
> new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
> executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
> allocationTags=[component], 
> resourceSizing=ResourceSizingPBImpl{numAllocations=1, 
> resources=}, 
> placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
>  if any fields need to be updated, please cancel the old request (by setting 
> numAllocations to 0) and send a SchedulingRequest with different combination 
> of priority/allocationId
> {code}
> But we can see from the message that the SchedulingRequest is indeed valid 
> with everything same except numAllocations as expected. But still the below 
> equals check in SingleConstraintAppPlacementAllocator fails.
> {code:java}
> // Compare two objects
>   if (!schedulingRequest.equals(newSchedulingRequest)) {
> // Rollback #numAllocations
> sizing.setNumAllocations(newNumAllocations);
> throw new SchedulerInvalidResoureRequestException(
> "Invalid updated SchedulingRequest added to scheduler, "
> + " we only allows changing numAllocations for the updated "
> + "SchedulingRequest. Old=" + schedulingRequest.toString()
> + " new=" + newSchedulingRequest.toString()
> + ", if any fields need to be updated, please cancel the "
> + "old request (by setting numAllocations to 0) and send a "
> + "SchedulingRequest with different combination of "
> + "priority/allocationId");
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.

2019-10-20 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-9921:
--

 Summary: Issue in PlacementConstraint when YARN Service AM retries 
allocation on component failure.
 Key: YARN-9921
 URL: https://issues.apache.org/jira/browse/YARN-9921
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.0
Reporter: Tarun Parimi
Assignee: Tarun Parimi


When YARN Service AM tries to relaunch a container on failure, we encounter the 
below error in PlacementConstraints.

{code:java}
ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
org.apache.hadoop.yarn.exceptions.YarnException: 
org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: 
Invalid updated SchedulingRequest added to scheduler, we only allows changing 
numAllocations for the updated SchedulingRequest. 
Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
allocationTags=[component], 
resourceSizing=ResourceSizingPBImpl{numAllocations=0, resources=}, 
placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} 
new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, 
executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, 
allocationTags=[component], 
resourceSizing=ResourceSizingPBImpl{numAllocations=1, resources=}, 
placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]},
 if any fields need to be updated, please cancel the old request (by setting 
numAllocations to 0) and send a SchedulingRequest with different combination of 
priority/allocationId
{code}

But we can see from the message that the SchedulingRequest is indeed valid with 
everything same except numAllocations as expected. But still the below equals 
check in SingleConstraintAppPlacementAllocator fails.

{code:java}
// Compare two objects
  if (!schedulingRequest.equals(newSchedulingRequest)) {
// Rollback #numAllocations
sizing.setNumAllocations(newNumAllocations);
throw new SchedulerInvalidResoureRequestException(
"Invalid updated SchedulingRequest added to scheduler, "
+ " we only allows changing numAllocations for the updated "
+ "SchedulingRequest. Old=" + schedulingRequest.toString()
+ " new=" + newSchedulingRequest.toString()
+ ", if any fields need to be updated, please cancel the "
+ "old request (by setting numAllocations to 0) and send a "
+ "SchedulingRequest with different combination of "
+ "priority/allocationId");
  }
{code}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9907) Make YARN Service AM RPC port configurable

2019-10-16 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9907:
---
Attachment: YARN-9907.001.patch

> Make YARN Service AM RPC port configurable
> --
>
> Key: YARN-9907
> URL: https://issues.apache.org/jira/browse/YARN-9907
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9907.001.patch
>
>
> YARN Service AM uses a random ephemeral port for the ClientAMService RPC. In 
> environments where firewalls block unnecessary ports by default, it is useful 
> to have a configuration that specifies the port range. Similar to what we 
> have for MapReduce {{yarn.app.mapreduce.am.job.client.port-range}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9907) Make YARN Service AM RPC port configurable

2019-10-16 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-9907:
--

 Summary: Make YARN Service AM RPC port configurable
 Key: YARN-9907
 URL: https://issues.apache.org/jira/browse/YARN-9907
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Reporter: Tarun Parimi
Assignee: Tarun Parimi


The YARN Service AM uses a random ephemeral port for the ClientAMService RPC. 
In environments where firewalls block unnecessary ports by default, it is 
useful to have a configuration that specifies the port range, similar to what 
we have for MapReduce with {{yarn.app.mapreduce.am.job.client.port-range}}.
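
As a rough illustration of the idea: the AM would read a port range from 
configuration and bind its client RPC server to a free port inside that range. 
The property name {{yarn.service.am.client.port-range}} below is a hypothetical 
placeholder (mirroring the MapReduce key above), not necessarily the key a 
patch would introduce, and the sketch binds a plain ServerSocket rather than 
going through the YARN RPC layer.

{code:java}
// Hedged sketch: bind a server socket within a configured port range.
import java.io.IOException;
import java.net.ServerSocket;
import org.apache.hadoop.conf.Configuration;

public class PortRangeBindSketch {

  /** Try each port in "low-high" order and return the first that binds. */
  static ServerSocket bindWithinRange(String range) throws IOException {
    String[] parts = range.split("-");
    int low = Integer.parseInt(parts[0].trim());
    int high = Integer.parseInt(parts[1].trim());
    for (int port = low; port <= high; port++) {
      try {
        return new ServerSocket(port);
      } catch (IOException bindFailed) {
        // Port in use or blocked; try the next one in the range.
      }
    }
    throw new IOException("No free port in range " + range);
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Hypothetical key, mirroring yarn.app.mapreduce.am.job.client.port-range.
    String range = conf.get("yarn.service.am.client.port-range", "30000-30050");
    try (ServerSocket socket = bindWithinRange(range)) {
      System.out.println("ClientAMService-style RPC could bind to port "
          + socket.getLocalPort());
    }
  }
}
{code}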



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9903) Support reservations continue looking for Node Labels

2019-10-15 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9903:
---
Description: 
YARN-1769 brought in the reservations-continue-looking feature, which improves 
several resource reservation scenarios. However, it is currently not handled 
when nodes have a label assigned to them. This is useful, and in many cases 
necessary, even for Node Labels, so we should support it for node labels as 
well.

For example, in AbstractCSQueue.java, we have the below TODO.
{code:java}
// TODO, now only consider reservation cases when the node has no label
if (this.reservationsContinueLooking
    && nodePartition.equals(RMNodeLabelsManager.NO_LABEL)
    && Resources.greaterThan(resourceCalculator, clusterResource,
        resourceCouldBeUnreserved, Resources.none())) {
{code}
cc [~sunilg]

  was:
YARN-1769 brought in reservations continue looking feature which improves the 
several resource reservation scenarios. However, it is not handled currently 
when nodes have a label assigned to them. This is useful and in many cases 
necessary even for Node Labels. So we should look to support this for node 
labels also.
{code:java}
// TODO, now only consider reservation cases when the node has no label if 
(this.reservationsContinueLooking && nodePartition.equals( 
RMNodeLabelsManager.NO_LABEL) && Resources.greaterThan( resourceCalculator, 
clusterResource, resourceCouldBeUnreserved, Resources.none())) {
{code}
cc [~sunilg]


> Support reservations continue looking for Node Labels
> -
>
> Key: YARN-9903
> URL: https://issues.apache.org/jira/browse/YARN-9903
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Tarun Parimi
>Priority: Major
>
> YARN-1769 brought in the reservations-continue-looking feature, which improves 
> several resource reservation scenarios. However, it is currently not handled 
> when nodes have a label assigned to them. This is useful, and in many cases 
> necessary, even for Node Labels, so we should support it for node labels as 
> well.
> For example, in AbstractCSQueue.java, we have the below TODO.
> {code:java}
> // TODO, now only consider reservation cases when the node has no label
> if (this.reservationsContinueLooking
>     && nodePartition.equals(RMNodeLabelsManager.NO_LABEL)
>     && Resources.greaterThan(resourceCalculator, clusterResource,
>         resourceCouldBeUnreserved, Resources.none())) {
> {code}
> cc [~sunilg]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9903) Support reservations continue looking for Node Labels

2019-10-15 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-9903:
--

 Summary: Support reservations continue looking for Node Labels
 Key: YARN-9903
 URL: https://issues.apache.org/jira/browse/YARN-9903
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Tarun Parimi


YARN-1769 brought in the reservations-continue-looking feature, which improves 
several resource reservation scenarios. However, it is currently not handled 
when nodes have a label assigned to them. This is useful, and in many cases 
necessary, even for Node Labels, so we should support it for node labels as 
well.
{code:java}
// TODO, now only consider reservation cases when the node has no label
if (this.reservationsContinueLooking
    && nodePartition.equals(RMNodeLabelsManager.NO_LABEL)
    && Resources.greaterThan(resourceCalculator, clusterResource,
        resourceCouldBeUnreserved, Resources.none())) {
{code}
cc [~sunilg]
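
A self-contained sketch of the generalization this asks for: instead of 
short-circuiting on NO_LABEL, the check could apply to whichever partition the 
node belongs to, as long as something reserved in that partition could be 
unreserved. The class, method names, and long-based resources below are 
hypothetical and illustrative only, not the actual patch.

{code:java}
// Illustrative only (hypothetical names): relaxing the
// reservations-continue-looking guard so it also applies to labelled
// partitions rather than only the NO_LABEL partition.
public class ReservationsContinueLookingSketch {

  static final String NO_LABEL = "";

  // Current behaviour: only the default partition may continue looking.
  static boolean currentCheck(boolean continueLooking, String nodePartition,
      long resourceCouldBeUnreserved) {
    return continueLooking
        && NO_LABEL.equals(nodePartition)
        && resourceCouldBeUnreserved > 0;
  }

  // Sketched behaviour: any partition may continue looking, provided there is
  // something reserved in that partition that could be unreserved.
  static boolean labelAwareCheck(boolean continueLooking, String nodePartition,
      long resourceCouldBeUnreservedInPartition) {
    return continueLooking && resourceCouldBeUnreservedInPartition > 0;
  }

  public static void main(String[] args) {
    // A node in partition "label" with 4GB of memory that could be unreserved:
    System.out.println(currentCheck(true, "label", 4096));    // false today
    System.out.println(labelAwareCheck(true, "label", 4096)); // true in sketch
  }
}
{code}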



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs

2019-09-19 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933274#comment-16933274
 ] 

Tarun Parimi commented on YARN-8786:


YARN-9833 could fix this issue

> LinuxContainerExecutor fails sporadically in create_local_dirs
> --
>
> Key: YARN-8786
> URL: https://issues.apache.org/jira/browse/YARN-8786
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Jon Bender
>Priority: Major
>
> We started using CGroups with LinuxContainerExecutor recently, running Apache 
> Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn 
> container will fail with a message like the following:
> {code:java}
> [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: 
> Container container_1530684675517_516620_01_020846 transitioned from 
> SCHEDULED to RUNNING
> [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO 
> monitor.ContainersMonitorImpl: Starting resource-monitoring for 
> container_1530684675517_516620_01_020846
> [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN 
> privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 
> 35. Privileged Execution Operation Stderr:
> [2018-09-02 23:48:02.506159] Could not create container dirsCould not create 
> local files and directories
> [2018-09-02 23:48:02.506220]
> [2018-09-02 23:48:02.506238] Stdout: main : command provided 1
> [2018-09-02 23:48:02.506258] main : run as user is nobody
> [2018-09-02 23:48:02.506282] main : requested yarn user is root
> [2018-09-02 23:48:02.506294] Getting exit code file...
> [2018-09-02 23:48:02.506307] Creating script paths...
> [2018-09-02 23:48:02.506330] Writing pid file...
> [2018-09-02 23:48:02.506366] Writing to tmp file 
> /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp
> [2018-09-02 23:48:02.506389] Writing to cgroup task files...
> [2018-09-02 23:48:02.506402] Creating local dirs...
> [2018-09-02 23:48:02.506414] Getting exit code file...
> [2018-09-02 23:48:02.506435] Creating script paths...
> {code}
> Looking at the container executor source it's traceable to errors here: 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604]
>  And ultimately to 
> [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672]
> The root failure seems to be in the underlying mkdir call, but that exit code 
> / errno is swallowed so we don't have more details. We tend to see this when 
> many containers start at the same time for the same application on a host, 
> and suspect it may be related to some race conditions around those shared 
> directories between containers for the same application.
> For example, this is a typical pattern in the audit logs:
> {code:java}
> [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO 
> nodemanager.NMAuditLogger: USER=root  IP=<> Container Request 
> TARGET=ContainerManageImpl  RESULT=SUCCESS  
> APPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012870
> [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN 
> nodemanager.NMAuditLogger: USER=root  OPERATION=Container Finished - 
> Failed   TARGET=ContainerImplRESULT=FAILURE  DESCRIPTION=Container failed 
> with state: EXITED_WITH_FAILUREAPPID=application_1530684675517_559126  
> CONTAINERID=container_1530684675517_559126_01_012871
> {code}
> Two containers for the same application starting in quick succession followed 
> by the EXITED_WITH_FAILURE step (exit code 35).
> We plan to upgrade to 3.1.x soon, but I don't expect that to fix this; the 
> only major JIRAs that affected the executor since 3.0.0 seem unrelated 
> ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8]
>  and 
> [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56])
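
The suspected failure mode, containers of the same application racing to create 
shared directories, is the classic "mkdir loses the race but the directory now 
exists" pattern. The native container-executor is C code, but the idea 
translates directly; the sketch below is Java and illustrative only, not the 
actual fix, showing why treating "already exists" as success (and surfacing the 
underlying error otherwise) avoids the sporadic failure.

{code:java}
// Illustrative only: race-tolerant directory creation in the spirit of what a
// shared application directory needs. Two "containers" create the same dir;
// losing the mkdir race must not be treated as a failure.
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SharedDirRaceSketch {

  // Naive version: mkdirs() returns false if another thread created the
  // directory first, which looks like "Could not create local files and
  // directories" even though the directory is perfectly usable.
  static boolean naiveCreate(File dir) {
    return dir.mkdirs();
  }

  // Tolerant version: only fail if the directory genuinely does not exist
  // afterwards, and surface the underlying error instead of swallowing it.
  static void tolerantCreate(Path dir) throws IOException {
    try {
      Files.createDirectories(dir); // no error if it already exists
    } catch (IOException e) {
      if (!Files.isDirectory(dir)) {
        throw new IOException("mkdir failed for " + dir + ": " + e.getMessage(), e);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Path appDir = Files.createTempDirectory("app").resolve("shared");
    Runnable worker = () -> {
      try {
        tolerantCreate(appDir); // both "containers" succeed
      } catch (IOException e) {
        e.printStackTrace();
      }
    };
    Thread t1 = new Thread(worker);
    Thread t2 = new Thread(worker);
    t1.start(); t2.start();
    t1.join(); t2.join();
    System.out.println("shared dir exists: " + Files.isDirectory(appDir));
    // The naive check now reports failure purely because the dir exists:
    System.out.println("naive mkdirs() after the fact: " + naiveCreate(appDir.toFile()));
  }
}
{code}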



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files

2019-09-16 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931109#comment-16931109
 ] 

Tarun Parimi commented on YARN-9837:


Thanks for the review [~eyang] .

> YARN Service fails to fetch status for Stopped apps with bigger spec files
> --
>
> Key: YARN-9837
> URL: https://issues.apache.org/jira/browse/YARN-9837
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9837.001.patch
>
>
> Was unable to fetch status for a STOPPED app due to the below error in RM 
> logs.
> {code:java}
> ERROR webapp.ApiServer (ApiServer.java:getService(213)) - Get service failed: 
> {}
> java.io.EOFException: Read of 
> hdfs://my-cluster:8020/user/appuser/.yarn/services/my-service/my-service.json 
> finished prematurely
> at 
> org.apache.hadoop.yarn.service.utils.JsonSerDeser.load(JsonSerDeser.java:188)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.loadService(ServiceApiUtil.java:360)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getAppId(ServiceClient.java:1409)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1235)
> at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$3(ApiServer.java:749)
> {code}
> This seems to happen when the json file my-service.json is larger than 128KB 
> in my cluster.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files

2019-09-16 Thread Tarun Parimi (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tarun Parimi updated YARN-9837:
---
Attachment: YARN-9837.001.patch

> YARN Service fails to fetch status for Stopped apps with bigger spec files
> --
>
> Key: YARN-9837
> URL: https://issues.apache.org/jira/browse/YARN-9837
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Attachments: YARN-9837.001.patch
>
>
> Was unable to fetch status for a STOPPED app due to the below error in RM 
> logs.
> {code:java}
> ERROR webapp.ApiServer (ApiServer.java:getService(213)) - Get service failed: 
> {}
> java.io.EOFException: Read of 
> hdfs://my-cluster:8020/user/appuser/.yarn/services/my-service/my-service.json 
> finished prematurely
> at 
> org.apache.hadoop.yarn.service.utils.JsonSerDeser.load(JsonSerDeser.java:188)
> at 
> org.apache.hadoop.yarn.service.utils.ServiceApiUtil.loadService(ServiceApiUtil.java:360)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getAppId(ServiceClient.java:1409)
> at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1235)
> at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$3(ApiServer.java:749)
> {code}
> This seems to happen when the json file my-service.json is larger than 128KB 
> in my cluster.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files

2019-09-16 Thread Tarun Parimi (Jira)
Tarun Parimi created YARN-9837:
--

 Summary: YARN Service fails to fetch status for Stopped apps with 
bigger spec files
 Key: YARN-9837
 URL: https://issues.apache.org/jira/browse/YARN-9837
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Affects Versions: 3.1.0
Reporter: Tarun Parimi
Assignee: Tarun Parimi


Was unable to fetch status for a STOPPED app due to the below error in RM logs.
{code:java}
ERROR webapp.ApiServer (ApiServer.java:getService(213)) - Get service failed: {}
java.io.EOFException: Read of 
hdfs://my-cluster:8020/user/appuser/.yarn/services/my-service/my-service.json 
finished prematurely
at 
org.apache.hadoop.yarn.service.utils.JsonSerDeser.load(JsonSerDeser.java:188)
at 
org.apache.hadoop.yarn.service.utils.ServiceApiUtil.loadService(ServiceApiUtil.java:360)
at 
org.apache.hadoop.yarn.service.client.ServiceClient.getAppId(ServiceClient.java:1409)
at 
org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1235)
at 
org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$3(ApiServer.java:749)
{code}
This seems to happen when the json file my-service.json is larger than 128KB in 
my cluster.
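
A plausible explanation for the 128KB threshold is a partial read: 
InputStream.read(byte[]) is allowed to return fewer bytes than requested, so a 
loader that issues a single read() and hands the buffer to Jackson sees a 
truncated document for larger specs. The sketch below is a generic illustration 
of that pitfall and the readFully-style remedy, not the actual JsonSerDeser 
code.

{code:java}
// Generic illustration (not the actual JsonSerDeser code): a single read()
// may stop short of the requested length, so parse-after-one-read breaks for
// payloads larger than whatever the stream returns in one call.
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PartialReadSketch {

  // Buggy pattern: assumes one read() fills the whole buffer.
  static byte[] readOnce(InputStream in, int len) throws IOException {
    byte[] buf = new byte[len];
    int n = in.read(buf);          // may be < len; the rest stays zeroed
    System.out.println("read " + n + " of " + len + " bytes");
    return buf;
  }

  // Correct pattern: keep reading until the buffer is actually full.
  static byte[] readFully(InputStream in, int len) throws IOException {
    byte[] buf = new byte[len];
    new DataInputStream(in).readFully(buf);
    return buf;
  }

  public static void main(String[] args) throws IOException {
    byte[] spec = new byte[256 * 1024];          // pretend 256KB service spec
    // Simulate a stream that returns at most 128KB per read() call.
    InputStream chunky = new ByteArrayInputStream(spec) {
      @Override
      public synchronized int read(byte[] b, int off, int len) {
        return super.read(b, off, Math.min(len, 128 * 1024));
      }
    };
    readOnce(chunky, spec.length);                           // stops at 128KB
    readFully(new ByteArrayInputStream(spec), spec.length);  // reads all 256KB
  }
}
{code}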



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9772) CapacitySchedulerQueueManager has incorrect list of queues

2019-09-16 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930521#comment-16930521
 ] 

Tarun Parimi commented on YARN-9772:


bq. Should we extend the duplicates check (as of now, it does only for leaf 
queues) to parent queues as well? 
[~maniraj...@gmail.com], the only problem I see is that there may be existing 
users who already have a queue config containing parent queues with duplicate 
names. They would face an error when they upgrade and be forced to modify 
their current queue config.
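
A small self-contained illustration of the collision being discussed: keying 
the queue map by short name silently drops one of the duplicate parents, while 
keying by full path keeps all seven queues. The strings and map below are 
hypothetical stand-ins, not the CapacityScheduler classes.

{code:java}
// Hypothetical stand-ins (not CapacitySchedulerQueueManager) showing how a
// map keyed by short queue name loses one of the two queues named "b", while
// a map keyed by full path keeps the whole hierarchy.
import java.util.LinkedHashMap;
import java.util.Map;

public class QueueNameCollisionSketch {

  public static void main(String[] args) {
    String[] fullPaths = {
        "root", "root.a", "root.a.b", "root.a.b.c",
        "root.a.d", "root.a.d.b", "root.e"
    };

    Map<String, String> byShortName = new LinkedHashMap<>();
    Map<String, String> byFullPath = new LinkedHashMap<>();
    for (String path : fullPaths) {
      String shortName = path.substring(path.lastIndexOf('.') + 1);
      byShortName.put(shortName, path); // "root.a.d.b" overwrites "root.a.b"
      byFullPath.put(path, path);
    }

    System.out.println("by short name: " + byShortName.size()); // 6, one lost
    System.out.println("by full path : " + byFullPath.size());  // 7, all kept
    System.out.println("'b' now resolves to " + byShortName.get("b"));
  }
}
{code}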

> CapacitySchedulerQueueManager has incorrect list of queues
> --
>
> Key: YARN-9772
> URL: https://issues.apache.org/jira/browse/YARN-9772
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Manikandan R
>Assignee: Manikandan R
>Priority: Major
>
> CapacitySchedulerQueueManager has incorrect list of queues when there is more 
> than one parent queue (say at middle level) with same name.
> For example,
>  * root
>  ** a
>  *** b
>  **** c
>  *** d
>  **** b
>  * e
> {{CapacitySchedulerQueueManager#getQueues}} maintains the list of queues. 
> While parsing "root.a.d.b", it overrides "root.a.b" with new Queue object in 
> the map because of similar name. After parsing all the queues, map count 
> should be 7, but it is 6. Any reference to queue "root.a.b" in code path is 
> nothing but "root.a.d.b" object. Since 
> {{CapacitySchedulerQueueManager#getQueues}} has been used in multiple places, 
> will need to understand the implications in detail. For example, 
> {{CapapcityScheduler#getQueue}} has been used in many places which in turn 
> uses {{CapacitySchedulerQueueManager#getQueues}}. cc [~eepayne], [~sunilg]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9794) RM crashes due to runtime errors in TimelineServiceV2Publisher

2019-09-16 Thread Tarun Parimi (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930503#comment-16930503
 ] 

Tarun Parimi commented on YARN-9794:


Thanks [~abmodi],[~Prabhu Joseph] for the reviews and commit.

> RM crashes due to runtime errors in TimelineServiceV2Publisher
> --
>
> Key: YARN-9794
> URL: https://issues.apache.org/jira/browse/YARN-9794
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tarun Parimi
>Assignee: Tarun Parimi
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9794.001.patch, YARN-9794.002.patch
>
>
> Saw that the RM crashes during startup due to errors while putting an entity 
> in TimelineServiceV2Publisher.
> {code:java}
> 2019-08-28 09:35:45,273 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.RuntimeException: java.lang.IllegalArgumentException: 
> org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException:
>  CodedInputStream encountered an embedded string or message which claimed to 
> have negative size
> .
> at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:200)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:269)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
> at 
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
> at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
> at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
> at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
> at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:321)
> at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:285)
> at 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.flush(TypedBufferedMutator.java:66)
> at 
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.flush(HBaseTimelineWriterImpl.java:566)
> at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.flushBufferedTimelineEntities(TimelineCollector.java:173)
> at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntities(TimelineCollector.java:150)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:459)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:73)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:494)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:483)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: 
> org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException:
>  CodedInputStream encountered an embedded string or message which claimed to 
> have negative size.
> at 
> org.apache.hbase.thirdparty.com.google.protobuf.CodedInputStream.newInstance(CodedInputStream.java:117)
> {code}
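
Since AsyncDispatcher treats an escaped RuntimeException as fatal, one way to 
keep a timeline write failure from killing the RM is to catch and log around 
the publish call. The sketch below is a generic illustration of that guard, 
using hypothetical handler and publisher types, not the committed patch.

{code:java}
// Generic illustration with hypothetical types: an event handler that logs
// timeline-publish failures instead of letting a RuntimeException escape and
// take down the dispatcher thread (and with it the RM).
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedTimelineHandlerSketch {

  private static final Logger LOG =
      Logger.getLogger(GuardedTimelineHandlerSketch.class.getName());

  /** Wrap a publish action so any runtime error is logged, not rethrown. */
  static <E> Consumer<E> guarded(Consumer<E> publish) {
    return event -> {
      try {
        publish.accept(event);
      } catch (RuntimeException e) {
        // e.g. the IllegalArgumentException wrapping the protobuf error above
        LOG.log(Level.SEVERE, "Error publishing timeline entity for " + event, e);
      }
    };
  }

  public static void main(String[] args) {
    Consumer<String> flakyPublisher = event -> {
      throw new IllegalArgumentException("CodedInputStream encountered ...");
    };
    // The "dispatcher" survives because the failure is contained in the handler.
    guarded(flakyPublisher).accept("appattempt_sample_0001_000001");
    System.out.println("dispatcher thread still alive");
  }
}
{code}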



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


