[jira] [Assigned] (YARN-10890) Node Attributes in Distributed mapping misses update to scheduler when node gets decommissioned/recommissioned
[ https://issues.apache.org/jira/browse/YARN-10890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi reassigned YARN-10890:
Assignee: Tarun Parimi

> Node Attributes in Distributed mapping misses update to scheduler when node gets decommissioned/recommissioned
> Key: YARN-10890
> URL: https://issues.apache.org/jira/browse/YARN-10890
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.3.0, 3.2.1
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
>
> The NodeAttributesManagerImpl maintains the node-to-attribute mapping, but it doesn't remove the mapping when a node goes down. This makes sense for centralized mapping, since the attribute mapping is centralized in the RM, so a node going down doesn't affect the mapping.
> In distributed mapping, the node attribute mapping is updated via NM heartbeat to the RM, so these node attributes are only valid as long as the node is heartbeating. But when a node is decommissioned or lost, the node attribute entry still remains in NodeAttributesManagerImpl.
> After the performance improvement done in YARN-8925, we only update distributed node attributes when necessary. However, when a previously decommissioned node is recommissioned, NodeAttributesManagerImpl still has the old mapping entry belonging to the old SchedulerNode instance which was decommissioned.
> This results in ResourceTrackerService#updateNodeAttributesIfNecessary skipping the update, since it compares against the attributes belonging to the old decommissioned node instance.
> {code:java}
> if (!NodeLabelUtil
>     .isNodeAttributesEquals(nodeAttributes, currentNodeAttributes)) {
>   this.rmContext.getNodeAttributesManager()
>       .replaceNodeAttributes(NodeAttribute.PREFIX_DISTRIBUTED,
>           ImmutableMap.of(nodeId.getHost(), nodeAttributes));
> } else if (LOG.isDebugEnabled()) {
>   LOG.debug("Skip updating node attributes since there is no change for "
>       + nodeId + " : " + nodeAttributes);
> }
> {code}
> We should remove the distributed node attributes whenever a node gets deactivated to avoid this issue. These attributes will then get added properly in the scheduler whenever the node becomes active again and registers/heartbeats.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
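The proposed behavior can be sketched in plain Java. This is a hypothetical stand-in, not the actual NodeAttributesManagerImpl API: a host-to-attributes map, a heartbeat-path update that skips when nothing changed (mirroring the YARN-8925 optimization), and the proposed removal on deactivation so a recommissioned node's first heartbeat is never skipped as "no change":

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the RM's distributed node-attribute mapping.
class DistributedAttributeStore {
    private final Map<String, Set<String>> hostToAttributes = new ConcurrentHashMap<>();

    // Heartbeat path: only replace when the attributes actually changed.
    // Returns true when an update was applied, false when it was skipped.
    boolean updateIfNecessary(String host, Set<String> attributes) {
        Set<String> current = hostToAttributes.get(host);
        if (attributes.equals(current)) {
            return false; // no change: skip, as the YARN-8925 optimization does
        }
        hostToAttributes.put(host, attributes);
        return true;
    }

    // Proposed fix: clear the mapping when the node is decommissioned/lost,
    // so the stale entry cannot suppress the update on recommission.
    void onNodeDeactivated(String host) {
        hostToAttributes.remove(host);
    }

    Set<String> get(String host) {
        return hostToAttributes.get(host);
    }
}
```

Without the `onNodeDeactivated` call, the stale entry would equal the recommissioned node's attributes and the update would be skipped, which is exactly the bug described above.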
[jira] [Created] (YARN-10890) Node Attributes in Distributed mapping misses update to scheduler when node gets decommissioned/recommissioned
Tarun Parimi created YARN-10890:
Summary: Node Attributes in Distributed mapping misses update to scheduler when node gets decommissioned/recommissioned
Key: YARN-10890
URL: https://issues.apache.org/jira/browse/YARN-10890
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.2.1, 3.3.0
Reporter: Tarun Parimi
[jira] [Commented] (YARN-9907) Make YARN Service AM RPC port configurable
[ https://issues.apache.org/jira/browse/YARN-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390541#comment-17390541 ] Tarun Parimi commented on YARN-9907:
[~pbacsko], yes, you are right. We can close this as a duplicate now.

> Make YARN Service AM RPC port configurable
> Key: YARN-9907
> URL: https://issues.apache.org/jira/browse/YARN-9907
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn-native-services
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Attachments: YARN-9907.001.patch
>
> The YARN Service AM uses a random ephemeral port for the ClientAMService RPC. In environments where firewalls block unnecessary ports by default, it is useful to have a configuration that specifies the port range, similar to what we have for MapReduce: {{yarn.app.mapreduce.am.job.client.port-range}}.
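The port-range mechanism the issue asks for can be illustrated with a small, self-contained sketch. The `"low-high"` string format matches what `yarn.app.mapreduce.am.job.client.port-range` accepts; the class name and logic here are illustrative, not the actual Hadoop implementation:

```java
import java.io.IOException;
import java.net.ServerSocket;

// Illustrative sketch: parse a "low-high" port-range string and bind the
// first free port in that range, instead of taking a random ephemeral port.
class PortRangeBinder {
    static ServerSocket bindInRange(String range) throws IOException {
        String[] parts = range.split("-");
        int low = Integer.parseInt(parts[0].trim());
        int high = Integer.parseInt(parts[1].trim());
        for (int port = low; port <= high; port++) {
            try {
                return new ServerSocket(port); // bound: this port was free
            } catch (IOException busy) {
                // port already in use; try the next one in the range
            }
        }
        throw new IOException("No free port in range " + range);
    }
}
```

Firewall rules can then be written once for the fixed range rather than for the whole ephemeral port space.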
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380533#comment-17380533 ] Tarun Parimi commented on YARN-10789:
[~snemeth], looks like the build didn't get triggered till now for some reason. Was there an issue in Jenkins? TestZKConfigurationStore#testDisableAuditLogs is passing. The other test failures are unrelated to the patch.

> RM HA startup can fail due to race conditions in ZKConfigurationStore
> Key: YARN-10789
> URL: https://issues.apache.org/jira/browse/YARN-10789
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.0.0
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Fix For: 3.4.0
> Attachments: YARN-10789.001.patch, YARN-10789.002.patch, YARN-10789.branch-3.2.001.patch, YARN-10789.branch-3.3.001.patch
>
> We are observing the below error randomly during hadoop install and initial RM startup when HA is enabled and yarn.scheduler.configuration.store.class=zk is configured. This causes one of the RMs to not start up.
> {code:java}
> 2021-05-26 12:59:18,986 INFO org.apache.hadoop.service.AbstractService: Service RMActiveServices failed in state INITED
> org.apache.hadoop.service.ServiceStateException: java.io.IOException: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /confstore/CONF_STORE
> {code}
> We try to create the znode /confstore/CONF_STORE when we initialize the ZKConfigurationStore. The problem is that the ZKConfigurationStore is initialized when CapacityScheduler does a serviceInit, and serviceInit is done by both the Active and the Standby RM. So we can run into a race condition where both Active and Standby try to create the same znode when both RMs are started at the same time.
> ZKRMStateStore, on the other hand, avoids such race conditions by creating the znodes only after serviceStart. serviceStart only happens for the active RM which won the leader election, unlike serviceInit, which happens irrespective of leader election.
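One common defensive pattern for this kind of race (sketched here as an assumption, not necessarily the committed patch) is to make znode creation idempotent: whichever RM loses the race treats "node already exists" as success rather than failing serviceInit. The class below is a plain-Java stand-in for a ZooKeeper namespace, where `putIfAbsent` models a create that another RM may win:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a ZooKeeper namespace. putIfAbsent models znode
// creation, which in real ZooKeeper fails with NodeExistsException when
// another RM created the path first.
class ConfStore {
    private final Map<String, byte[]> znodes = new ConcurrentHashMap<>();

    // Idempotent create: the RM that loses the race simply reuses the
    // existing node instead of failing startup with NodeExistsException.
    void createIfAbsent(String path, byte[] data) {
        znodes.putIfAbsent(path, data);
    }

    boolean exists(String path) {
        return znodes.containsKey(path);
    }
}
```

With a real ZooKeeper client the same idea is usually expressed by catching `KeeperException.NodeExistsException` around the create call and continuing, or by deferring creation to serviceStart as ZKRMStateStore does.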
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
Attachment: (was: YARN-10789.branch-3.2.001.patch)
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
Attachment: YARN-10789.branch-3.2.001.patch
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369280#comment-17369280 ] Tarun Parimi commented on YARN-10789:
[~snemeth], reattaching the 3.2 patch to trigger the build. Looks like the retrigger didn't happen for some reason.
[jira] [Commented] (YARN-10828) Backport YARN-9789 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17369282#comment-17369282 ] Tarun Parimi commented on YARN-10828:
Thanks [~snemeth] for reviewing this and committing.

> Backport YARN-9789 to branch-3.2
> Key: YARN-10828
> URL: https://issues.apache.org/jira/browse/YARN-10828
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.2.0
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Fix For: 3.2.3
> Attachments: YARN-10828.branch-3.2.001.patch
>
> The YARN-9789 fix is missing in branch-3.2, which causes the unit test TestZKConfigurationStore#testDisableAuditLogs to fail.
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
Attachment: (was: YARN-10789.branch-3.2.001.patch)
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
Attachment: YARN-10789.branch-3.2.001.patch
[jira] [Commented] (YARN-10828) Backport YARN-9789 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368738#comment-17368738 ] Tarun Parimi commented on YARN-10828:
The test failures are not related to this patch.
[jira] [Commented] (YARN-10828) Backport YARN-9789 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367601#comment-17367601 ] Tarun Parimi commented on YARN-10828:
[~snemeth], please review this when you get time. Thanks.
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367600#comment-17367600 ] Tarun Parimi commented on YARN-10789:
[~snemeth], I have created YARN-10828 to backport YARN-9789 to branch-3.2.
[jira] [Assigned] (YARN-10828) Backport YARN-9789 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi reassigned YARN-10828:
Assignee: Tarun Parimi
Submitting a backport patch for branch-3.2. Validated that related unit tests pass.
[jira] [Updated] (YARN-10828) Backport YARN-9789 to branch-3.2
[ https://issues.apache.org/jira/browse/YARN-10828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10828:
Attachment: YARN-10828.branch-3.2.001.patch
[jira] [Created] (YARN-10828) Backport YARN-9789 to branch-3.2
Tarun Parimi created YARN-10828:
Summary: Backport YARN-9789 to branch-3.2
Key: YARN-10828
URL: https://issues.apache.org/jira/browse/YARN-10828
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.2.0
Reporter: Tarun Parimi

The YARN-9789 fix is missing in branch-3.2, which causes the unit test TestZKConfigurationStore#testDisableAuditLogs to fail.
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367553#comment-17367553 ] Tarun Parimi commented on YARN-10789:
[~snemeth], the failing test in TestZKConfigurationStore is testDisableAuditLogs. This unit test was added in YARN-9789, but the YARN-9789 fix is missing in branch-3.2. It looks like only the unit test part of YARN-9789 somehow got backported to branch-3.2, not the fix corresponding to it. To fix this test, we need to backport the YARN-9789 patch to branch-3.2.
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
Attachment: YARN-10789.branch-3.2.001.patch
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363576#comment-17363576 ] Tarun Parimi commented on YARN-10789:
Reattached the patch for branch-3.2, since Jenkins triggered only for the branch-3.3 patch.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
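The race described above can also be sidestepped by making the znode creation idempotent, so the RM that loses the race treats "node exists" as success instead of failing service init. Below is a hypothetical sketch of that create-if-absent pattern; it is not the actual Hadoop fix, and it simulates the ZooKeeper store with an in-memory map rather than a real ZK client.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Two "RMs" race to create the same store root. An idempotent
// create-if-absent lets the loser observe "already exists" and continue.
public class ConfStoreRace {
    // Stand-in for the ZooKeeper namespace (path -> data).
    static final ConcurrentHashMap<String, byte[]> zk = new ConcurrentHashMap<>();

    // Returns true if this caller created the node, false if it already existed.
    static boolean createIfAbsent(String path) {
        return zk.putIfAbsent(path, new byte[0]) == null;
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch start = new CountDownLatch(1);
        Runnable rm = () -> {
            try {
                start.await(); // line both threads up to maximize the race
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            createIfAbsent("/confstore/CONF_STORE"); // never throws on the loser
        };
        Thread active = new Thread(rm);
        Thread standby = new Thread(rm);
        active.start();
        standby.start();
        start.countDown();
        active.join();
        standby.join();
        // Exactly one node exists and neither "RM" failed its init.
        if (zk.size() != 1) {
            throw new AssertionError("expected exactly one znode");
        }
        System.out.println("ok");
    }
}
```

With a real ZooKeeper client the equivalent move is catching `KeeperException.NodeExistsException` around the `create` call and treating it as success.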
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
    Attachment: (was: YARN-10789.branch-3.2.001.patch)
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
    Attachment: YARN-10789.branch-3.3.001.patch
                YARN-10789.branch-3.2.001.patch
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362820#comment-17362820 ] Tarun Parimi commented on YARN-10789:
-------------------------------------
Thanks [~snemeth] for the review and commit. Thanks [~bteke], [~zhuqi] for your reviews. We can backport it to the 3.3/3.2 branches. The trunk patch applies cleanly on 3.3. Will add a patch for 3.2.
[jira] [Commented] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple
[ https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17362752#comment-17362752 ] Tarun Parimi commented on YARN-10816:
-------------------------------------
Thanks [~snemeth] for the review and commit.

> Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-10816
>                 URL: https://issues.apache.org/jira/browse/YARN-10816
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: timelineclient
>    Affects Versions: 3.4.0
>            Reporter: Tarun Parimi
>            Assignee: Tarun Parimi
>            Priority: Major
>             Fix For: 3.4.0
>
>         Attachments: YARN-10816.001.patch, YARN-10816.002.patch
>
>
> YARN-10339 introduced changes to ensure that PseudoAuthenticationHandler is used in TimelineClient when yarn.timeline-service.http-authentication.type=simple.
> PseudoAuthenticationHandler doesn't support delegation token ops such as get, renew, and cancel, since those ops strictly require SPNEGO auth to work. We don't use timeline delegation tokens when simple auth is used.
> Prior to YARN-10339, timeline delegation tokens were unnecessarily used when yarn.timeline-service.http-authentication.type=simple but Hadoop security was enabled. After YARN-10339, the tokens are not used when yarn.timeline-service.http-authentication.type=simple.
> In a rolling-upgrade scenario, a client that doesn't have the YARN-10339 changes can submit an application and request a timeline delegation token even when yarn.timeline-service.http-authentication.type=simple. The RM, on the other hand, can have the YARN-10339 changes, and so renewing the token with PseudoAuthenticationHandler will fail.
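The fix described above amounts to gating delegation-token operations on the configured authentication type. A minimal sketch of that gate is below; the class and method names are invented for illustration (this is not the actual TimelineClient API), but the config key is the one from the issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: only perform timeline delegation token ops
// (get/renew/cancel) when the timeline server uses kerberos auth,
// because those ops require SPNEGO, which "simple" auth lacks.
public class TimelineTokenGate {
    static final String AUTH_TYPE_KEY =
        "yarn.timeline-service.http-authentication.type";

    static boolean shouldDoTokenOps(Map<String, String> conf) {
        // Default to "simple" when unset, so token ops are skipped by default.
        return "kerberos".equals(conf.getOrDefault(AUTH_TYPE_KEY, "simple"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(AUTH_TYPE_KEY, "simple");
        System.out.println(shouldDoTokenOps(conf)); // false: skip token ops
        conf.put(AUTH_TYPE_KEY, "kerberos");
        System.out.println(shouldDoTokenOps(conf)); // true: tokens are usable
    }
}
```

The rolling-upgrade problem in the description is exactly a case where the client side and the RM side disagree on the result of this check.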
[jira] [Commented] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple
[ https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360708#comment-17360708 ] Tarun Parimi commented on YARN-10816:
-------------------------------------
[~snemeth], please review this when you get some time.
[jira] [Updated] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple
[ https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10816:
    Attachment: YARN-10816.002.patch
[jira] [Updated] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple
[ https://issues.apache.org/jira/browse/YARN-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10816:
    Attachment: YARN-10816.001.patch
[jira] [Created] (YARN-10816) Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple
Tarun Parimi created YARN-10816:
-----------------------------------
             Summary: Avoid doing delegation token ops when yarn.timeline-service.http-authentication.type=simple
                 Key: YARN-10816
                 URL: https://issues.apache.org/jira/browse/YARN-10816
             Project: Hadoop YARN
          Issue Type: Bug
          Components: timelineclient
    Affects Versions: 3.4.0
            Reporter: Tarun Parimi
            Assignee: Tarun Parimi
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354360#comment-17354360 ] Tarun Parimi commented on YARN-10789:
-------------------------------------
Thanks [~snemeth]. Please also take a look at this when you get time.
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352401#comment-17352401 ] Tarun Parimi commented on YARN-10789:
-------------------------------------
Thanks [~sunilg]. Added warn log in the latest patch.
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
    Attachment: YARN-10789.002.patch
[jira] [Commented] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352290#comment-17352290 ] Tarun Parimi commented on YARN-10789:
-------------------------------------
Tested this patch only manually, with a stability check with RM HA enabled and yarn.scheduler.configuration.store.class=zk configured. This race condition is tough to reproduce, so writing a reliable unit test to cover the scenario is not possible.
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
    Attachment: YARN-10789.001.patch
[jira] [Updated] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
[ https://issues.apache.org/jira/browse/YARN-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10789:
    Description: (minor grammar fix only: "This cause one of the RM's to not startup" became "This causes one of the RMs to not startup")
[jira] [Created] (YARN-10789) RM HA startup can fail due to race conditions in ZKConfigurationStore
Tarun Parimi created YARN-10789:
-----------------------------------
             Summary: RM HA startup can fail due to race conditions in ZKConfigurationStore
                 Key: YARN-10789
                 URL: https://issues.apache.org/jira/browse/YARN-10789
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Tarun Parimi
            Assignee: Tarun Parimi
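The serviceInit-vs-serviceStart distinction the issue relies on can be sketched minimally. The classes below are invented for illustration (not Hadoop's AbstractService): both RMs run init(), but only the leader-election winner runs start(), so external state created in start() has a single writer and cannot race.

```java
import java.util.ArrayList;
import java.util.List;

public class LifecycleSketch {
    // Stand-in for the shared ZooKeeper namespace.
    static final List<String> createdZnodes = new ArrayList<>();

    static class ConfStoreService {
        void init() {
            // Safe, local work only (e.g. parsing config); no shared
            // external state is created here, so running on both RMs is fine.
        }

        void start() {
            // Runs only on the active RM, so a single writer creates the znode.
            createdZnodes.add("/confstore/CONF_STORE");
        }
    }

    public static void main(String[] args) {
        ConfStoreService active = new ConfStoreService();
        ConfStoreService standby = new ConfStoreService();
        active.init();
        standby.init();  // the standby initializes but never starts
        active.start();  // only the leader-election winner starts
        System.out.println(createdZnodes.size()); // 1: no duplicate create
    }
}
```

This is the same ordering ZKRMStateStore uses, per the description: deferring znode creation to serviceStart removes the window where two RMs can issue the create concurrently.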
[jira] [Commented] (YARN-8564) Add queue level application lifetime monitor in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346849#comment-17346849 ] Tarun Parimi commented on YARN-8564: [~zhuqi], Any reason this jira got resolved? I don't see this patch committed anywhere. And it doesn't seem to be a duplicate. > Add queue level application lifetime monitor in FairScheduler > -- > > Key: YARN-8564 > URL: https://issues.apache.org/jira/browse/YARN-8564 > Project: Hadoop YARN > Issue Type: Improvement > Components: fairscheduler >Affects Versions: 3.1.0 >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-8564.001.patch, test1~3.jpg, test4.jpg > > > I wish to have a queue-level application lifetime monitor in FairScheduler. > In our large YARN cluster, there are sometimes too many small jobs in one > minor queue that may run too long, which may affect our high-priority and > very important queues. It would help to have an application lifetime monitor > at the queue level, and to set a small lifetime in the minor queue.
[jira] [Updated] (YARN-10007) YARN logs contain environment variables, which is a security risk
[ https://issues.apache.org/jira/browse/YARN-10007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10007: Issue Type: New Feature (was: Bug) > YARN logs contain environment variables, which is a security risk > - > > Key: YARN-10007 > URL: https://issues.apache.org/jira/browse/YARN-10007 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: john lilley >Priority: Major > > In most environments it is standard practice to relay "secrets" via > environment variables when spawning a process, because the alternatives > (command-line args or storing in a file) are insecure. However, in a YARN > application, this also appears to be insecure because the environment is > logged. While YARN has the ability to relay delegation tokens in the launch > context, it is unclear how to use this facility for generalized "secrets" > that may not conform to security-token structure. > For example, the RPDM_KEYSTORE_PASSWORDS env var is found in the aggregated > YARN logs: > {{Container: container_e06_1574362398372_0023_01_01 on > node6..com_45454}} > {{LogAggregationType: AGGREGATED}} > {{}} > {{LogType:launch_container.sh}} > {{LogLastModifiedTime:Sat Nov 23 14:58:12 -0700 2019}} > {{LogLength:4043}} > {{LogContents:}} > {{#!/bin/bash}}{{set -o pipefail -e}} > {{[...]export > HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-"/usr/hdp/2.6.5.1175-1/hadoop-yarn"}}} > {{export > RPDM_KEYSTORE_PASSWORDS="eyJnZW5lcmFsIjoiZmtQZllubmVLRVo4c1Z0V0REQ3gxaHJzRnVjdVN5b1NBTE9OUTF1dEZpZ1x1MDAzZCJ9"}}
[jira] [Updated] (YARN-10458) Hive On Tez queries fails upon submission to dynamically created pools
[ https://issues.apache.org/jira/browse/YARN-10458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10458: Description: While using Dynamic Auto-Creation and Management of Leaf Queues, we could see that the queue creation fails because the ACL submit application check couldn't succeed. We tried setting acl_submit_applications to '*' for managed parent queues. For static queues, this worked but failed for dynamic queues. Also tried setting the below property but it didn't help either. yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*. RM error log shows the following: 2020-09-18 01:08:40,579 INFO org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule: Application application_1600399068816_0460 user user1 mapping [default] to [queue1] override false 2020-09-18 01:08:40,579 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from application tag does not have access to queue 'user1'. The placement is done for user 'hive' Checking the code, scheduler#checkAccess() bails out even before checking the ACL permissions for that particular queue because the CSQueue is null. {code:java} public boolean checkAccess(UserGroupInformation callerUGI, QueueACL acl, String queueName) { CSQueue queue = getQueue(queueName); if (queue == null) { if (LOG.isDebugEnabled()) { LOG.debug("ACL not found for queue access-type " + acl + " for queue " + queueName); } return false;*<-- the method returns false here.* } return queue.hasAccess(acl, callerUGI); } {code} As this is an auto-created queue, CSQueue may be null in this case. Maybe scheduler#checkAccess() should handle the case where CSQueue is null and queue mapping is involved: check whether the parent queue exists and is a managed parent, and if so, whether the parent queue has valid ACLs, instead of returning false? 
Thanks was: Recently, one of our customers created dynamic queues based on placement rules in CDP Private Cloud Base 71.2 to run their Hive on Tez queries but the job failed because of not submitting to the appropriate queue. Analyzing the Resource Manager log, we could see that the queue creation fails because ACL submit application check couldn't succeed. We tried setting acl_submit_applications to '*' for managed parent queues. For static queues, this worked but failed for dynamic queues. Also tried setting the below property but it didn't help either. yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*. RM error log shows the following : 2020-09-18 01:08:40,579 INFO org.apache.hadoop.yarn.server.resourcemanager.placement.UserGroupMappingPlacementRule: Application application_1600399068816_0460 user user1 mapping [default] to [queue1] override false 2020-09-18 01:08:40,579 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: User 'user1' from application tag does not have access to queue 'user1'. The placement is done for user 'hive' Checking the code, scheduler#checkAccess() bails out even before checking the ACL permissions for that particular queue because the CSQueue is null. public boolean checkAccess(UserGroupInformation callerUGI, QueueACL acl, String queueName) { CSQueue queue = getQueue(queueName); if (queue == null) { if (LOG.isDebugEnabled()) { LOG.debug("ACL not found for queue access-type " + acl + " for queue " + queueName); } return false;*<-- the method returns false here.* } return queue.hasAccess(acl, callerUGI); } As this is an auto created queue, CSQueue may be null in this case. May be scheduler#checkAccess() should have a logic to differentiate when CSQueue is null and if queue mapping is involved and if so, check if the parent queue exists and is a managed parent and if so, check if the parent queue has valid ACL's instead of returning false ? 
Thanks > Hive On Tez queries fails upon submission to dynamically created pools > -- > > Key: YARN-10458 > URL: https://issues.apache.org/jira/browse/YARN-10458 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Reporter: Anand Srinivasan >Priority: Major > > While using Dynamic Auto-Creation and Management of Leaf Queues, we could see > that the queue creation fails because ACL submit application check couldn't > succeed. > We tried setting acl_submit_applications to '*' for managed parent queues. > For static queues, this worked but failed for dynamic queues. Also tried > setting the below property but it didn't help either. > yarn.scheduler.capacity.root.parent-queue-name.leaf-queue-template.acl_submit_applications=*. > RM error log shows the following : > 20
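The fallback proposed in the description can be illustrated with a hypothetical sketch. The QueueInfo type and map lookups below are illustrative stand-ins, not the real CapacityScheduler/CSQueue API: when the queue lookup returns null for an auto-created queue, consult the managed parent's submit ACL instead of returning false outright.

```java
import java.util.HashMap;
import java.util.Map;

public class AutoQueueAclSketch {
    // Illustrative stand-in for a CSQueue: whether it is a managed parent and
    // whether its acl_submit_applications allows the caller.
    static class QueueInfo {
        final boolean managedParent;
        final boolean aclAllowsSubmit;
        QueueInfo(boolean managedParent, boolean aclAllowsSubmit) {
            this.managedParent = managedParent;
            this.aclAllowsSubmit = aclAllowsSubmit;
        }
    }

    static final Map<String, QueueInfo> queues = new HashMap<>();
    static final Map<String, String> parentOf = new HashMap<>();

    static boolean checkAccess(String queueName) {
        QueueInfo queue = queues.get(queueName);
        if (queue == null) {
            // Queue not created yet: if a managed parent exists, defer to its
            // leaf-queue-template ACL rather than denying outright.
            QueueInfo parent = queues.get(parentOf.get(queueName));
            return parent != null && parent.managedParent && parent.aclAllowsSubmit;
        }
        return queue.aclAllowsSubmit;
    }
}
```

The key design point is that the null-queue branch becomes a parent-ACL check rather than an unconditional deny.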
[jira] [Created] (YARN-10446) Capacity Scheduler page displays incorrect Configured Capacity
Tarun Parimi created YARN-10446: --- Summary: Capacity Scheduler page displays incorrect Configured Capacity Key: YARN-10446 URL: https://issues.apache.org/jira/browse/YARN-10446 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.0 Reporter: Tarun Parimi Attachments: configured-capacity.png The Capacity Scheduler UI always shows Configured Capacity as !configured-capacity.png! The effective capacity value is however calculated correctly. This issue seems to be because we are displaying the configured min resources. This will only be set when we use *Absolute Resource Configuration*. When *Percentage based configuration* is done, this always displays .
[jira] [Resolved] (YARN-10440) resource manager hangs,and i cannot submit any new jobs,but rm and nm processes are normal
[ https://issues.apache.org/jira/browse/YARN-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi resolved YARN-10440. - Resolution: Duplicate Seems to be similar to YARN-8513. The default config change in YARN-8896 fixes it. Try setting {noformat} yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments=100{noformat} Reopen with a jstack dump if the issue reoccurs with the config change. > resource manager hangs,and i cannot submit any new jobs,but rm and nm > processes are normal > -- > > Key: YARN-10440 > URL: https://issues.apache.org/jira/browse/YARN-10440 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.1.1 >Reporter: jufeng li >Priority: Blocker > > RM hangs, and I cannot submit any new jobs, but the RM and NM processes are normal. > I can open x:8088/cluster/apps/RUNNING but not > x:8088/cluster/scheduler. The submitted apps cannot finish and new > apps cannot be submitted; everything hangs except the RM and NM servers. How can > I fix this? Help me, please! 
> > here is the log: > {code:java} > ttempt=appattempt_1600074574138_66297_01 container=null queue=tianqiwang > clusterResource= type=NODE_LOCAL > requestedPartition= > 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation > proposal > 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator > (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - > assignedContainer application attempt=appattempt_1600074574138_66297_01 > container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition= > 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation > proposal > 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator > (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - > assignedContainer application attempt=appattempt_1600074574138_66297_01 > container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition= > 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation > proposal > 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator > (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - > assignedContainer application attempt=appattempt_1600074574138_66297_01 > container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition= > 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation > proposal > 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator > (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - > assignedContainer application attempt=appattempt_1600074574138_66297_01 > container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL 
requestedPartition= > 2020-09-17 00:22:25,679 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation > proposal > 2020-09-17 00:22:25,679 INFO allocator.AbstractContainerAllocator > (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - > assignedContainer application attempt=appattempt_1600074574138_66297_01 > container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition= > 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation > proposal > 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator > (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - > assignedContainer application attempt=appattempt_1600074574138_66297_01 > container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition= > 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation > proposal > 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocator > (AbstractContainerAllocator.java:getCSAssignmentFromAllocateResult(129)) - > assignedContainer application attempt=appattempt_1600074574138_66297_01 > container=null queue=tianqiwang clusterResource= vCores:4800> type=NODE_LOCAL requestedPartition= > 2020-09-17 00:22:25,680 INFO capacity.CapacityScheduler > (CapacityScheduler.java:tryCommit(2906)) - Failed to accept allocation > proposal > 2020-09-17 00:22:25,680 INFO allocator.AbstractContainerAllocat
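For reference, the property suggested in the resolution above would normally be set in capacity-scheduler.xml (the property name comes from the resolution; the file location is the usual one for Capacity Scheduler settings and is stated here as an assumption):

```xml
<!-- Hypothetical capacity-scheduler.xml fragment: cap the number of
     container assignments per node heartbeat, per the resolution above. -->
<property>
  <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
  <value>100</value>
</property>
```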
[jira] [Commented] (YARN-10159) TimelineConnector does not destroy the jersey client
[ https://issues.apache.org/jira/browse/YARN-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190741#comment-17190741 ] Tarun Parimi commented on YARN-10159: - [~prabhujoseph]. This issue exists even for the ATS v1 client in branch-2.8, so I want to backport it to branch-2.8. Attached the branch-2.8 patch. Can you review it when you get time? > TimelineConnector does not destroy the jersey client > > > Key: YARN-10159 > URL: https://issues.apache.org/jira/browse/YARN-10159 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Tanu Ajmera >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10159-001.patch, YARN-10159-002.patch, > YARN-10159-branch-2.8.001.patch > > > TimelineConnector does not destroy the jersey client. This method must be > called when there are not responses pending otherwise undefined behavior will > occur. > http://javadox.com/com.sun.jersey/jersey-client/1.8/com/sun/jersey/api/client/Client.html#destroy()
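The fix pattern implied above — destroying the client once no responses are pending, typically when the connector service stops — can be sketched as follows. The Client class here is a self-contained stand-in, not com.sun.jersey.api.client.Client, and serviceStop is only an analogue of the real TimelineConnector lifecycle method:

```java
public class ConnectorShutdownSketch {
    // Stand-in for the Jersey client, which owns threads/connections that
    // leak unless destroy() is called.
    static class Client {
        private boolean destroyed;
        void destroy() { destroyed = true; }
        boolean isDestroyed() { return destroyed; }
    }

    private final Client client = new Client();

    Client getClient() { return client; }

    // Analogue of TimelineConnector#serviceStop: destroy the client exactly
    // once, after all pending responses have completed.
    void serviceStop() {
        if (!client.isDestroyed()) {
            client.destroy();
        }
    }
}
```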
[jira] [Updated] (YARN-10159) TimelineConnector does not destroy the jersey client
[ https://issues.apache.org/jira/browse/YARN-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10159: Attachment: YARN-10159-branch-2.8.001.patch > TimelineConnector does not destroy the jersey client > > > Key: YARN-10159 > URL: https://issues.apache.org/jira/browse/YARN-10159 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Tanu Ajmera >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10159-001.patch, YARN-10159-002.patch, > YARN-10159-branch-2.8.001.patch > > > TimelineConnector does not destroy the jersey client. This method must be > called when there are not responses pending otherwise undefined behavior will > occur. > http://javadox.com/com.sun.jersey/jersey-client/1.8/com/sun/jersey/api/client/Client.html#destroy()
[jira] [Commented] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications
[ https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171258#comment-17171258 ] Tarun Parimi commented on YARN-10377: - Thanks for the review and commit [~prabhujoseph] > Clicking on queue in Capacity Scheduler legacy ui does not show any > applications > > > Key: YARN-10377 > URL: https://issues.apache.org/jira/browse/YARN-10377 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Fix For: 3.4.0 > > Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot > 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch > > > The issue is in the capacity scheduler > [http://rm-host:port/clustter/scheduler] page > If I click on the root queue, I am able to see the applications. > !Screenshot 2020-07-29 at 12.01.28 PM.png! > But the application disappears when I click on the leaf queue -> default. > This issue is not present in the older 2.7.0 versions and I am able to see > apps normally filtered by the leaf queue when clicking on it. > !Screenshot 2020-07-29 at 12.01.36 PM.png!
[jira] [Commented] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications
[ https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17170087#comment-17170087 ] Tarun Parimi commented on YARN-10377: - Thanks [~prabhujoseph] . I have tested it manually and it works fine. > Clicking on queue in Capacity Scheduler legacy ui does not show any > applications > > > Key: YARN-10377 > URL: https://issues.apache.org/jira/browse/YARN-10377 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot > 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch > > > The issue is in the capacity scheduler > [http://rm-host:port/clustter/scheduler] page > If I click on the root queue, I am able to see the applications. > !Screenshot 2020-07-29 at 12.01.28 PM.png! > But the application disappears when I click on the leaf queue -> default. > This issue is not present in the older 2.7.0 versions and I am able to see > apps normally filtered by the leaf queue when clicking on it. > !Screenshot 2020-07-29 at 12.01.36 PM.png!
[jira] [Updated] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications
[ https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10377: Attachment: YARN-10377.001.patch > Clicking on queue in Capacity Scheduler legacy ui does not show any > applications > > > Key: YARN-10377 > URL: https://issues.apache.org/jira/browse/YARN-10377 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot > 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch > > > The issue is in the capacity scheduler > [http://rm-host:port/clustter/scheduler] page > If I click on the root queue, I am able to see the applications. > !Screenshot 2020-07-29 at 12.01.28 PM.png! > But the application disappears when I click on the leaf queue -> default. > This issue is not present in the older 2.7.0 versions and I am able to see > apps normally filtered by the leaf queue when clicking on it. > !Screenshot 2020-07-29 at 12.01.36 PM.png!
[jira] [Assigned] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications
[ https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi reassigned YARN-10377: --- Assignee: Tarun Parimi > Clicking on queue in Capacity Scheduler legacy ui does not show any > applications > > > Key: YARN-10377 > URL: https://issues.apache.org/jira/browse/YARN-10377 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot > 2020-07-29 at 12.01.36 PM.png > > > The issue is in the capacity scheduler > [http://rm-host:port/clustter/scheduler] page > If I click on the root queue, I am able to see the applications. > !Screenshot 2020-07-29 at 12.01.28 PM.png! > But the application disappears when I click on the leaf queue -> default. > This issue is not present in the older 2.7.0 versions and I am able to see > apps normally filtered by the leaf queue when clicking on it. > !Screenshot 2020-07-29 at 12.01.36 PM.png!
[jira] [Resolved] (YARN-10378) When NM goes down and comes back up, PC allocation tags are not removed for completed containers
[ https://issues.apache.org/jira/browse/YARN-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi resolved YARN-10378. - Resolution: Duplicate Looks like YARN-10034 fixes this issue for the NM going down scenario also. Closing as duplicate. > When NM goes down and comes back up, PC allocation tags are not removed for > completed containers > > > Key: YARN-10378 > URL: https://issues.apache.org/jira/browse/YARN-10378 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.2.0, 3.1.1 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > > We are using placement constraints anti-affinity in an application along with > a node label. The application requests two containers with anti-affinity on the > node label containing only two nodes. > So two containers will be allocated in the two nodes, one on each node > satisfying anti-affinity. > When one NodeManager goes down for some time, the node is marked as lost by > RM and then it will kill all containers in that node. > The AM will now have one pending container request, since the previous > container got killed. > When the NodeManager comes back up after some time, the pending container is not > getting allocated in that node again and the application has to wait forever > for that container. > If the ResourceManager is restarted, this issue disappears and the container > gets allocated on the NodeManager which came back up recently. > This seems to be an issue with the allocation tags not being removed. > The allocation tag is added for the container > container_e68_1595886973474_0005_01_03 . > {code:java} > 2020-07-28 17:02:04,091 DEBUG constraint.AllocationTagsManager > (AllocationTagsManager.java:addContainer(355)) - Added > container=container_e68_1595886973474_0005_01_03 with tags=[hbase]\ > {code} > However, the allocation tag is not removed when the container > container_e68_1595886973474_0005_01_03 is released. 
There is no > equivalent DEBUG message seen for removing tags. This means that the tags are > not getting removed. If the tag is not removed, then scheduler will not > allocate in the same node due to anti-affinity resulting in the issue > observed. > {code:java} > 2020-07-28 17:19:34,353 DEBUG scheduler.AbstractYarnScheduler > (AbstractYarnScheduler.java:updateCompletedContainers(1038)) - Container > FINISHED: container_e68_1595886973474_0005_01_03 > 2020-07-28 17:19:34,353 INFO scheduler.AbstractYarnScheduler > (AbstractYarnScheduler.java:completedContainer(669)) - Container > container_e68_1595886973474_0005_01_03 completed with event FINISHED, but > corresponding RMContainer doesn't exist. > {code} > This seems to be due to changes done in YARN-8511 . Change here was made to > remove the tags only after NM confirms container is released. However, in our > scenario this is not happening. So the tag will never get removed until RM > restart. > Reverting YARN-8511 fixes this particular issue and tags are getting removed. > But this is not a valid solution since the problem that YARN-8511 solves is > also valid. We need to find a solution which does not break YARN-8511 and > also fixes this issue.
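The failure mode described above can be illustrated with a tiny model of the tag bookkeeping (class and method names are illustrative stand-ins, not the real AllocationTagsManager API): anti-affinity placement on a node is only possible once the container's tag is removed, so a missed removal leaves the node blocked until an RM restart rebuilds the state.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AllocationTagModelSketch {
    private final Map<String, Set<String>> tagsByNode = new HashMap<>();

    // Analogous to the "Added container=... with tags=[hbase]" step above.
    void addContainer(String node, String tag) {
        tagsByNode.computeIfAbsent(node, n -> new HashSet<>()).add(tag);
    }

    // If this is never invoked for a killed container (the bug scenario),
    // the tag lingers and the node stays blocked for anti-affinity requests.
    void removeContainer(String node, String tag) {
        Set<String> tags = tagsByNode.get(node);
        if (tags != null) {
            tags.remove(tag);
        }
    }

    // Anti-affinity: the node qualifies only if it does not carry the tag.
    boolean satisfiesAntiAffinity(String node, String tag) {
        return !tagsByNode.getOrDefault(node, Collections.emptySet()).contains(tag);
    }
}
```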
[jira] [Updated] (YARN-10378) When NM goes down and comes back up, PC allocation tags are not removed for completed containers
[ https://issues.apache.org/jira/browse/YARN-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10378: Description: We are using placement constaints anti-affinity in an application along with node label. The application requests two containers with anti affinity on the node label containing only two nodes. So two containers will be allocated in the two nodes, one on each node satisfying anti-affinity. When one nodemanager goes down for some time, the node is marked as lost by RM and then it will kill all containers in that node. The AM will now have one pending container request, since the previous container got killed. When the Nodemanager becomes up after some time, the pending container is not getting allocated in that node again and the application has to wait forever for that container. If the ResourceManager is restarted, this issue disappears and the container gets allocated on the NodeManager which came back up recently. This seems to be an issue with the allocation tags not removed. The allocation tag is added for the container container_e68_1595886973474_0005_01_03 . {code:java} 2020-07-28 17:02:04,091 DEBUG constraint.AllocationTagsManager (AllocationTagsManager.java:addContainer(355)) - Added container=container_e68_1595886973474_0005_01_03 with tags=[hbase]\ {code} However, the allocation tag is not removed when the container container_e68_1595886973474_0005_01_03 is released. There is no equivalent DEBUG message seen for removing tags. This means that the tags are not getting removed. If the tag is not removed, then scheduler will not allocate in the same node due to anti-affinity resulting in the issue observed. 
{code:java} 2020-07-28 17:19:34,353 DEBUG scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:updateCompletedContainers(1038)) - Container FINISHED: container_e68_1595886973474_0005_01_03 2020-07-28 17:19:34,353 INFO scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:completedContainer(669)) - Container container_e68_1595886973474_0005_01_03 completed with event FINISHED, but corresponding RMContainer doesn't exist. {code} This seems to be due to changes done in YARN-8511 . Change here was made to remove the tags only after NM confirms container is released. However, in our scenario this is not happening. So the tag will never get removed until RM restart. Reverting YARN-8511 fixes this particular issue and tags are getting removed. But this is not a valid solution since the problem that YARN-8511 solves is also valid. We need to find a solution which does not break YARN-8511 and also fixes this issue. was: We are using placement constaints anti-affinity in an application along with node label. The application requests two containers with anti affinity on the node label containing only two nodes. So two containers will be allocated in the two nodes, one on each node satisfying anti-affinity. When one nodemanager goes down for some time, the node is marked as lost by RM and then it will kill all containers in that node. The AM will now have one pending container request, since the previous container got killed. When the Nodemanager becomes up after some time, the pending container is not getting allocated in that node again and the application has to wait forever for that container. If the ResourceManager is restarted, this issue disappears and the container gets allocated on the NodeManager which came back up recently. This seems to be an issue with the allocation tags not removed. The allocation tag is added for the container container_e68_1595886973474_0005_01_03 . 
{code:java} 2020-07-28 17:02:04,091 DEBUG constraint.AllocationTagsManager (AllocationTagsManager.java:addContainer(355)) - Added container=container_e68_1595886973474_0005_01_03 with tags=[hbase]\ {code} However, the allocation tag is not removed when the container container_e68_1595886973474_0005_01_03 is released. There is no equivalent DEBUG message seen for removing tags. This means that the tags are not getting removed. If the tag is not removed, then scheduler will not allocate in the same node resulting in the issue observed. {code:java} 2020-07-28 17:19:34,353 DEBUG scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:updateCompletedContainers(1038)) - Container FINISHED: container_e68_1595886973474_0005_01_03 2020-07-28 17:19:34,353 INFO scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:completedContainer(669)) - Container container_e68_1595886973474_0005_01_03 completed with event FINISHED, but corresponding RMContainer doesn't exist. {code} This seems to be due to changes done in YARN-8511 . Change here was made to remove the tags only after NM confirms container is released. However, in our scenario this is not happening. So the tag will never get removed until RM restart
[jira] [Created] (YARN-10378) When NM goes down and comes back up, PC allocation tags are not removed for completed containers
Tarun Parimi created YARN-10378: --- Summary: When NM goes down and comes back up, PC allocation tags are not removed for completed containers Key: YARN-10378 URL: https://issues.apache.org/jira/browse/YARN-10378 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler Affects Versions: 3.1.1, 3.2.0 Reporter: Tarun Parimi Assignee: Tarun Parimi We are using placement constraints anti-affinity in an application along with a node label. The application requests two containers with anti-affinity on the node label containing only two nodes. So two containers will be allocated in the two nodes, one on each node satisfying anti-affinity. When one NodeManager goes down for some time, the node is marked as lost by RM and then it will kill all containers in that node. The AM will now have one pending container request, since the previous container got killed. When the NodeManager comes back up after some time, the pending container is not getting allocated in that node again and the application has to wait forever for that container. If the ResourceManager is restarted, this issue disappears and the container gets allocated on the NodeManager which came back up recently. This seems to be an issue with the allocation tags not being removed. The allocation tag is added for the container container_e68_1595886973474_0005_01_03 . {code:java} 2020-07-28 17:02:04,091 DEBUG constraint.AllocationTagsManager (AllocationTagsManager.java:addContainer(355)) - Added container=container_e68_1595886973474_0005_01_03 with tags=[hbase]\ {code} However, the allocation tag is not removed when the container container_e68_1595886973474_0005_01_03 is released. There is no equivalent DEBUG message seen for removing tags. This means that the tags are not getting removed. If the tag is not removed, then the scheduler will not allocate in the same node, resulting in the issue observed. 
{code:java} 2020-07-28 17:19:34,353 DEBUG scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:updateCompletedContainers(1038)) - Container FINISHED: container_e68_1595886973474_0005_01_03 2020-07-28 17:19:34,353 INFO scheduler.AbstractYarnScheduler (AbstractYarnScheduler.java:completedContainer(669)) - Container container_e68_1595886973474_0005_01_03 completed with event FINISHED, but corresponding RMContainer doesn't exist. {code} This seems to be due to the changes done in YARN-8511. The change there was made to remove the tags only after the NM confirms the container is released. However, in our scenario this is not happening, so the tag will never get removed until an RM restart. Reverting YARN-8511 fixes this particular issue and the tags get removed. But this is not a valid solution, since the problem that YARN-8511 solves is also valid. We need to find a solution which does not break YARN-8511 and also fixes this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
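The leak described above can be sketched with a tiny, self-contained simulation (class and method names below are illustrative, not YARN APIs; the real bookkeeping lives in AllocationTagsManager): once an allocation tag add is never paired with a remove, the anti-affinity check rejects the node indefinitely.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of per-node allocation tag bookkeeping. If a container's
// remove is skipped (as happens when the RMContainer is already gone by the
// time the NM confirmation arrives), the tag leaks and anti-affinity
// placement on that node is blocked until an RM restart.
public class TagLeakSketch {
    private final Map<String, Map<String, Integer>> nodeTags = new HashMap<>();

    public void addContainer(String node, String tag) {
        nodeTags.computeIfAbsent(node, n -> new HashMap<>())
                .merge(tag, 1, Integer::sum);
    }

    public void removeContainer(String node, String tag) {
        Map<String, Integer> tags = nodeTags.get(node);
        if (tags != null) {
            // Drop the tag entirely once its count reaches zero.
            tags.computeIfPresent(tag, (t, c) -> c > 1 ? c - 1 : null);
        }
    }

    // Anti-affinity: the node is usable only if no container there carries
    // the same tag.
    public boolean canPlaceAntiAffinity(String node, String tag) {
        Map<String, Integer> tags = nodeTags.get(node);
        return tags == null || !tags.containsKey(tag);
    }

    public static void main(String[] args) {
        TagLeakSketch mgr = new TagLeakSketch();
        mgr.addContainer("node1", "hbase");
        // Container killed, but the remove path never ran: node stays blocked.
        System.out.println(mgr.canPlaceAntiAffinity("node1", "hbase")); // false
        mgr.removeContainer("node1", "hbase");
        System.out.println(mgr.canPlaceAntiAffinity("node1", "hbase")); // true
    }
}
```

This is also why an RM restart "fixes" it: the in-memory map is rebuilt from live containers only, so the stale entry disappears.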
[jira] [Created] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications
Tarun Parimi created YARN-10377: --- Summary: Clicking on queue in Capacity Scheduler legacy ui does not show any applications Key: YARN-10377 URL: https://issues.apache.org/jira/browse/YARN-10377 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.0 Reporter: Tarun Parimi Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot 2020-07-29 at 12.01.36 PM.png The issue is in the capacity scheduler [http://rm-host:port/cluster/scheduler] page. If I click on the root queue, I am able to see the applications. !Screenshot 2020-07-29 at 12.01.28 PM.png! But the applications disappear when I click on the leaf queue -> default. This issue is not present in the older 2.7.0 versions, where I am able to see apps normally filtered by the leaf queue when clicking on it. !Screenshot 2020-07-29 at 12.01.36 PM.png!
[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments
[ https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159782#comment-17159782 ] Tarun Parimi commented on YARN-10339: - Thanks for the review [~prabhujoseph] > Timeline Client in Nodemanager gets 403 errors when simple auth is used in > kerberos environments > > > Key: YARN-10339 > URL: https://issues.apache.org/jira/browse/YARN-10339 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10339.001.patch, YARN-10339.002.patch > > > We get below errors in NodeManager logs whenever we set > yarn.timeline-service.http-authentication.type=simple in a cluster which has > kerberos enabled. There are use cases where simple auth is used only in > timeline server for convenience although kerberos is enabled. > {code:java} > 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl > (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline > server is not successful, HTTP error code: 403, Server response: > {"exception":"ForbiddenException","message":"java.lang.Exception: The owner > of the posted timeline entities is not > set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} > {code} > This seems to affect the NM timeline publisher which uses > TimelineV2ClientImpl. Doing a simple auth directly to timeline service via > curl works fine. So this issue is in the authenticator configuration in > timeline client.
[jira] [Commented] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol
[ https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153225#comment-17153225 ] Tarun Parimi commented on YARN-10340: - [~prabhujoseph], The issue is because HistoryClientService#initializeWebApp instantiates the RPC client connection when creating the WebApp. {code:java} ApplicationClientProtocol appClientProtocol = ClientRMProxy.createRMProxy(conf, ApplicationClientProtocol.class); {code} This RPC client proxy instance will only use the mapred ugi from the time of creation, even for subsequent calls, irrespective of doAs. I made a code change to check, by adding the below method in HsWebServices, and it works with the correct ugi, fixing the issue. {code:java} @Override protected ContainerReport getContainerReport( GetContainerReportRequest request) throws YarnException, IOException { return ClientRMProxy.createRMProxy(conf, ApplicationClientProtocol.class).getContainerReport(request).getContainerReport(); } {code} This creates a separate RPC client instance every time, though, which is not efficient. > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > - > > Key: YARN-10340 > URL: https://issues.apache.org/jira/browse/YARN-10340 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prabhu Joseph >Assignee: Tarun Parimi >Priority: Major > > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > > [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs] > While accessing above link using systest user, the request fails saying > mapred user does not have access to the job > > {code:java} > 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: > Could not obtain node HTTP address from provider. 
> javax.ws.rs.WebApplicationException: > org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have > privilege to see this application application_1593997842459_0214 > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461) > > {code} > On Analyzing, found WebServices#getContainer uses doAs using UGI created by > createRemoteUser(end user) to access RM#ApplicationClientProtocol which does > not work. Need to use createProxyUser to do the same. 
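The caching problem discussed in this thread can be reproduced in miniature without any Hadoop classes (all names below are illustrative): a proxy object that captures the "current user" at construction keeps acting as that user even inside a later doAs block, while a proxy created inside the block picks up the remote user, at the cost of one instance per request.

```java
import java.util.function.Supplier;

// Hypothetical simulation of the stale-ugi problem: a client proxy created
// once at webapp init captures the login user, so later doAs() calls still
// go out as that user. These are not YARN APIs, just a sketch of the shape.
public class CachedProxySketch {
    static final ThreadLocal<String> CURRENT_USER =
        ThreadLocal.withInitial(() -> "mapred"); // the login user

    static class RpcProxy {
        private final String boundUser = CURRENT_USER.get(); // captured once
        String getContainerReportUser() { return boundUser; }
    }

    // Stand-in for UGI.doAs: run the action with a different current user.
    static <T> T doAs(String user, Supplier<T> action) {
        String prev = CURRENT_USER.get();
        CURRENT_USER.set(user);
        try { return action.get(); } finally { CURRENT_USER.set(prev); }
    }

    public static void main(String[] args) {
        RpcProxy cached = new RpcProxy(); // created at webapp init as "mapred"
        // doAs("systest") has no effect on the cached proxy...
        System.out.println(doAs("systest", cached::getContainerReportUser)); // mapred
        // ...but a proxy created inside doAs picks up the remote user.
        System.out.println(doAs("systest",
            () -> new RpcProxy().getContainerReportUser())); // systest
    }
}
```

This mirrors the trade-off noted in the comment: creating the proxy per request gives the correct user but is less efficient than the cached instance.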
[jira] [Comment Edited] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments
[ https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152548#comment-17152548 ] Tarun Parimi edited comment on YARN-10339 at 7/7/20, 8:17 AM: -- Thanks [~prabhujoseph] . When atsv1 is enabled, delegation tokens are used even when auth is simple. I made changes in this patch, to add Timeline Delegation Token only when auth is kerberos. And fixed unit test failures and checkstyle. was (Author: tarunparimi): Thanks [~prabhujoseph] . When atsv1 is enabled, delegation tokens are used even when auth is simple. I made changes in this patch, to add Timeline Delegation Token only when auth is simple. And fixed unit test failures and checkstyle. > Timeline Client in Nodemanager gets 403 errors when simple auth is used in > kerberos environments > > > Key: YARN-10339 > URL: https://issues.apache.org/jira/browse/YARN-10339 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-10339.001.patch, YARN-10339.002.patch > > > We get below errors in NodeManager logs whenever we set > yarn.timeline-service.http-authentication.type=simple in a cluster which has > kerberos enabled. There are use cases where simple auth is used only in > timeline server for convenience although kerberos is enabled. > {code:java} > 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl > (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline > server is not successful, HTTP error code: 403, Server response: > {"exception":"ForbiddenException","message":"java.lang.Exception: The owner > of the posted timeline entities is not > set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} > {code} > This seems to affect the NM timeline publisher which uses > TimelineV2ClientImpl. Doing a simple auth directly to timeline service via > curl works fine. 
So this issue is in the authenticator configuration in > timeline client.
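As I read the edited comment above, the fix boils down to a conditional: fetch and attach a Timeline delegation token only when the auth type is kerberos, and skip it under simple auth so the server does not reject the request. A trivial sketch of that decision (the enum and method names are made up for illustration, not YARN API):

```java
// Illustrative-only sketch of the token policy described in the patch:
// attach a timeline delegation token only under kerberos auth.
public class TokenPolicySketch {
    enum AuthType { SIMPLE, KERBEROS }

    static boolean shouldAddTimelineDelegationToken(AuthType timelineAuth) {
        // Under simple auth there is no kerberos identity to anchor a
        // delegation token to, so requesting one leads to the 403 above.
        return timelineAuth == AuthType.KERBEROS;
    }

    public static void main(String[] args) {
        System.out.println(
            shouldAddTimelineDelegationToken(AuthType.KERBEROS)); // true
        System.out.println(
            shouldAddTimelineDelegationToken(AuthType.SIMPLE));   // false
    }
}
```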
[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments
[ https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152548#comment-17152548 ] Tarun Parimi commented on YARN-10339: - Thanks [~prabhujoseph] . When atsv1 is enabled, delegation tokens are used even when auth is simple. I made changes in this patch, to add Timeline Delegation Token only when auth is simple. And fixed unit test failures and checkstyle. > Timeline Client in Nodemanager gets 403 errors when simple auth is used in > kerberos environments > > > Key: YARN-10339 > URL: https://issues.apache.org/jira/browse/YARN-10339 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-10339.001.patch, YARN-10339.002.patch > > > We get below errors in NodeManager logs whenever we set > yarn.timeline-service.http-authentication.type=simple in a cluster which has > kerberos enabled. There are use cases where simple auth is used only in > timeline server for convenience although kerberos is enabled. > {code:java} > 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl > (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline > server is not successful, HTTP error code: 403, Server response: > {"exception":"ForbiddenException","message":"java.lang.Exception: The owner > of the posted timeline entities is not > set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} > {code} > This seems to affect the NM timeline publisher which uses > TimelineV2ClientImpl. Doing a simple auth directly to timeline service via > curl works fine. So this issue is in the authenticator configuration in > timeline client.
[jira] [Updated] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments
[ https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10339: Attachment: YARN-10339.002.patch > Timeline Client in Nodemanager gets 403 errors when simple auth is used in > kerberos environments > > > Key: YARN-10339 > URL: https://issues.apache.org/jira/browse/YARN-10339 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-10339.001.patch, YARN-10339.002.patch > > > We get below errors in NodeManager logs whenever we set > yarn.timeline-service.http-authentication.type=simple in a cluster which has > kerberos enabled. There are use cases where simple auth is used only in > timeline server for convenience although kerberos is enabled. > {code:java} > 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl > (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline > server is not successful, HTTP error code: 403, Server response: > {"exception":"ForbiddenException","message":"java.lang.Exception: The owner > of the posted timeline entities is not > set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} > {code} > This seems to affect the NM timeline publisher which uses > TimelineV2ClientImpl. Doing a simple auth directly to timeline service via > curl works fine. So this issue is in the authenticator configuration in > timeline client.
[jira] [Commented] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol
[ https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152501#comment-17152501 ] Tarun Parimi commented on YARN-10340: - [~prabhujoseph],[~brahmareddy] The WebServices#getContainer works properly when called by RMWebServices or AHSWebServices. This could be because they use their own ClientRMService and ApplicationHistoryClientService respectively. But HsWebServices now uses ClientRMService remotely and so doAs doesn't work here as expected. > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > - > > Key: YARN-10340 > URL: https://issues.apache.org/jira/browse/YARN-10340 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prabhu Joseph >Assignee: Tarun Parimi >Priority: Major > > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > > [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs] > While accessing above link using systest user, the request fails saying > mapred user does not have access to the job > > {code:java} > 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: > Could not obtain node HTTP address from provider. 
> javax.ws.rs.WebApplicationException: > org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have > privilege to see this application application_1593997842459_0214 > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461) > > {code} > On Analyzing, found WebServices#getContainer uses doAs using UGI created by > createRemoteUser(end user) to access RM#ApplicationClientProtocol which does > not work. Need to use createProxyUser to do the same. 
[jira] [Updated] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments
[ https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10339: Attachment: YARN-10339.001.patch > Timeline Client in Nodemanager gets 403 errors when simple auth is used in > kerberos environments > > > Key: YARN-10339 > URL: https://issues.apache.org/jira/browse/YARN-10339 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-10339.001.patch > > > We get below errors in NodeManager logs whenever we set > yarn.timeline-service.http-authentication.type=simple in a cluster which has > kerberos enabled. There are use cases where simple auth is used only in > timeline server for convenience although kerberos is enabled. > {code:java} > 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl > (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline > server is not successful, HTTP error code: 403, Server response: > {"exception":"ForbiddenException","message":"java.lang.Exception: The owner > of the posted timeline entities is not > set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} > {code} > This seems to affect the NM timeline publisher which uses > TimelineV2ClientImpl. Doing a simple auth directly to timeline service via > curl works fine. So this issue is in the authenticator configuration in > timeline client.
[jira] [Created] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments
Tarun Parimi created YARN-10339: --- Summary: Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments Key: YARN-10339 URL: https://issues.apache.org/jira/browse/YARN-10339 Project: Hadoop YARN Issue Type: Bug Components: timelineclient Affects Versions: 3.1.0 Reporter: Tarun Parimi Assignee: Tarun Parimi We get below errors in NodeManager logs whenever we set yarn.timeline-service.http-authentication.type=simple in a cluster which has kerberos enabled. There are use cases where simple auth is used only in timeline server for convenience although kerberos is enabled. {code:java} 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline server is not successful, HTTP error code: 403, Server response: {"exception":"ForbiddenException","message":"java.lang.Exception: The owner of the posted timeline entities is not set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} {code} This seems to affect the NM timeline publisher which uses TimelineV2ClientImpl. Doing a simple auth directly to timeline service via curl works fine. So this issue is in the authenticator configuration in timeline client.
[jira] [Comment Edited] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
[ https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113342#comment-17113342 ] Tarun Parimi edited comment on YARN-10283 at 5/21/20, 4:31 PM: --- Thanks [~pbacsko] for the repro test patch. The POC patch changes the behavior to include partitions while doing {{reservationsContinueLooking}} in RegularContainerAllocator.java. Similar conditions to check for node labels are present in several places, such as AbstractCSQueue.java, since {{reservationsContinueLooking}} was implemented only for the non-node-label scenario. Ideally we will have to consider fixing YARN-9903 in this scenario. was (Author: tarunparimi): Thanks for the repro test patch. The POC patch changes the behavior to include partitions while doing {{reservationsContinueLooking}} in RegularContainerAllocator.java. Similar conditions to check for node labels are present in several places, such as AbstractCSQueue.java, since {{reservationsContinueLooking}} was implemented only for the non-node-label scenario. Ideally we will have to consider fixing YARN-9903 in this scenario. > Capacity Scheduler: starvation occurs if a higher priority queue is full and > node labels are used > - > > Key: YARN-10283 > URL: https://issues.apache.org/jira/browse/YARN-10283 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch > > > Recently we've been investigating a scenario where applications submitted to > a lower priority queue could not get scheduled because a higher priority > queue in the same hierarchy could not satisfy the allocation request. Both > queues belonged to the same partition. > If we disabled node labels, the problem disappeared. > The problem is that {{RegularContainerAllocator}} always allocated a > container for the request, even if it should not have. 
> *Example:* > * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcore per node) > * Partition "shared" was created with 2 nodes > * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were > added to the partition > * Both queues have a limit of > * Using DominantResourceCalculator > Setup: > Submit distributed shell application to highprio with switches > "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per > container. > Chain of events: > 1. Queue is filled with containers until it reaches usage vCores:5> > 2. A node update event is pushed to CS from a node which is part of the > partition > 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller > than the current limit resource > 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an > allocated container for > 5. But we can't commit the resource request because we would have 9 vcores in > total, violating the limit. > The problem is that we always try to assign a container for the same > application in each heartbeat from "highprio". Applications in "lowprio" > cannot make progress. > *Problem:* > {{RegularContainerAllocator.assignContainer()}} does not handle this case > well. We only reject allocation if this condition is satisfied: > {noformat} > if (rmContainer == null && reservationsContinueLooking > && node.getLabels().isEmpty()) { > {noformat} > But if we have node labels, we enter a different code path and succeed with > the allocation if there's room for a container.
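The rejection guard quoted in the description can be condensed into a pair of predicates (signatures here are illustrative; the real check sits in RegularContainerAllocator.assignContainer()): the current guard never fires on labeled nodes, which is exactly the starvation path, while a POC-style guard applies regardless of labels.

```java
// Condensed form of the guard discussed above (method names illustrative).
// "Reject" means: skip allocating for this app now so reservations and
// lower-priority queues can make progress.
public class GuardSketch {
    // Current behavior: the check is bypassed whenever the node has labels.
    static boolean rejectsToday(boolean hasRmContainer,
            boolean reservationsContinueLooking, boolean nodeHasLabels) {
        return !hasRmContainer && reservationsContinueLooking && !nodeHasLabels;
    }

    // POC-style behavior: apply the same check per partition, labels or not.
    static boolean rejectsWithPoc(boolean hasRmContainer,
            boolean reservationsContinueLooking) {
        return !hasRmContainer && reservationsContinueLooking;
    }

    public static void main(String[] args) {
        // Labeled node, no reserved container, queue over its limit: the
        // current guard never rejects, so highprio is retried every
        // heartbeat and lowprio starves.
        System.out.println(rejectsToday(false, true, true)); // false
        System.out.println(rejectsWithPoc(false, true));     // true
    }
}
```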
[jira] [Commented] (YARN-10283) Capacity Scheduler: starvation occurs if a higher priority queue is full and node labels are used
[ https://issues.apache.org/jira/browse/YARN-10283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113342#comment-17113342 ] Tarun Parimi commented on YARN-10283: - Thanks for the repro test patch. The POC patch changes the behavior to include partitions while doing {{reservationsContinueLooking}} in RegularContainerAllocator.java. Similar conditions to check for node labels are present in several places, such as AbstractCSQueue.java, since {{reservationsContinueLooking}} was implemented only for the non-node-label scenario. Ideally we will have to consider fixing YARN-9903 in this scenario. > Capacity Scheduler: starvation occurs if a higher priority queue is full and > node labels are used > - > > Key: YARN-10283 > URL: https://issues.apache.org/jira/browse/YARN-10283 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10283-POC01.patch, YARN-10283-ReproTest.patch > > > Recently we've been investigating a scenario where applications submitted to > a lower priority queue could not get scheduled because a higher priority > queue in the same hierarchy could not satisfy the allocation request. Both > queues belonged to the same partition. > If we disabled node labels, the problem disappeared. > The problem is that {{RegularContainerAllocator}} always allocated a > container for the request, even if it should not have. > *Example:* > * Cluster total resources: 3 nodes, 15GB, 24 vcores (5GB / 8 vcore per node) > * Partition "shared" was created with 2 nodes > * "root.lowprio" (priority = 20) and "root.highprio" (priority = 40) were > added to the partition > * Both queues have a limit of > * Using DominantResourceCalculator > Setup: > Submit distributed shell application to highprio with switches > "-num_containers 3 -container_vcores 4". The memory allocation is 512MB per > container. > Chain of events: > 1. 
Queue is filled with containers until it reaches usage vCores:5> > 2. A node update event is pushed to CS from a node which is part of the > partition > 3. {{AbstractCSQueue.canAssignToQueue()}} returns true because it's smaller > than the current limit resource > 4. Then {{LeafQueue.assignContainers()}} runs successfully and gets an > allocated container for > 5. But we can't commit the resource request because we would have 9 vcores in > total, violating the limit. > The problem is that we always try to assign a container for the same > application in each heartbeat from "highprio". Applications in "lowprio" > cannot make progress. > *Problem:* > {{RegularContainerAllocator.assignContainer()}} does not handle this case > well. We only reject allocation if this condition is satisfied: > {noformat} > if (rmContainer == null && reservationsContinueLooking > && node.getLabels().isEmpty()) { > {noformat} > But if we have node labels, we enter a different code path and succeed with > the allocation if there's room for a container.
[jira] [Commented] (YARN-10240) Prevent Fatal CancelledException in TimelineV2Client when stopping
[ https://issues.apache.org/jira/browse/YARN-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088547#comment-17088547 ] Tarun Parimi commented on YARN-10240: - Thanks for the review [~prabhujoseph] > Prevent Fatal CancelledException in TimelineV2Client when stopping > -- > > Key: YARN-10240 > URL: https://issues.apache.org/jira/browse/YARN-10240 > Project: Hadoop YARN > Issue Type: Sub-task > Components: ATSv2 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10240.001.patch > > > When the timeline client is stopped, it will cancel all sync EntityHolders > after waiting for a drain timeout. > {code:java} > // if some entities were not drained then we need interrupt > // the threads which had put sync EntityHolders to the > queue. > EntitiesHolder nextEntityInTheQueue = null; > while ((nextEntityInTheQueue = > timelineEntityQueue.poll()) != null) { > nextEntityInTheQueue.cancel(true); > } > {code} > We only handle interrupted exception here. > {code:java} > if (sync) { > // In sync call we need to wait till its published and if any error > then > // throw it back > try { > entitiesHolder.get(); > } catch (ExecutionException e) { > throw new YarnException("Failed while publishing entity", > e.getCause()); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > throw new YarnException("Interrupted while publishing entity", e); > } > } > {code} > But calling nextEntityInTheQueue.cancel(true) will result in > entitiesHolder.get() throwing a CancelledException which is not handled. This > can result in FATAL error in NM. We need to prevent this. 
> {code:java} > FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in > dispatcher thread > java.util.concurrent.CancellationException > at java.util.concurrent.FutureTask.report(FutureTask.java:121) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545) > at > org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:348) > {code}
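The FATAL path above is easy to reproduce with plain JDK classes: cancelling an unstarted FutureTask makes a later get() throw CancellationException, an unchecked exception that slips past handlers for ExecutionException and InterruptedException. A self-contained demonstration (the extra catch clause mirrors the kind of handling this fix adds):

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

// Shows why the sync publish path needs to handle CancellationException:
// cancel(true) on a queued task makes get() throw it, and it is unchecked,
// so it escapes catches for ExecutionException and InterruptedException.
public class CancelDemo {
    static String run() {
        FutureTask<Void> task = new FutureTask<>(() -> null);
        task.cancel(true); // what the client stop path does to queued holders
        try {
            task.get(); // the sync publish path blocks here
            return "ok";
        } catch (ExecutionException | InterruptedException e) {
            return "handled"; // the only cases handled before the fix
        } catch (CancellationException e) {
            // The missing catch: without it, the exception propagates up to
            // the dispatcher thread as a FATAL error.
            return "cancelled";
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // cancelled
    }
}
```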
[jira] [Assigned] (YARN-10240) Prevent Fatal CancelledException in TimelineV2Client when stopping
[ https://issues.apache.org/jira/browse/YARN-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi reassigned YARN-10240: --- Assignee: Tarun Parimi > Prevent Fatal CancelledException in TimelineV2Client when stopping > -- > > Key: YARN-10240 > URL: https://issues.apache.org/jira/browse/YARN-10240 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-10240.001.patch > > > When the timeline client is stopped, it will cancel all sync EntityHolders > after waiting for a drain timeout. > {code:java} > // if some entities were not drained then we need interrupt > // the threads which had put sync EntityHolders to the > queue. > EntitiesHolder nextEntityInTheQueue = null; > while ((nextEntityInTheQueue = > timelineEntityQueue.poll()) != null) { > nextEntityInTheQueue.cancel(true); > } > {code} > We only handle interrupted exception here. > {code:java} > if (sync) { > // In sync call we need to wait till its published and if any error > then > // throw it back > try { > entitiesHolder.get(); > } catch (ExecutionException e) { > throw new YarnException("Failed while publishing entity", > e.getCause()); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > throw new YarnException("Interrupted while publishing entity", e); > } > } > {code} > But calling nextEntityInTheQueue.cancel(true) will result in > entitiesHolder.get() throwing a CancelledException which is not handled. This > can result in FATAL error in NM. We need to prevent this. 
> {code:java} > FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in > dispatcher thread > java.util.concurrent.CancellationException > at java.util.concurrent.FutureTask.report(FutureTask.java:121) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545) > at > org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:348) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10240) Prevent Fatal CancelledException in TimelineV2Client when stopping
[ https://issues.apache.org/jira/browse/YARN-10240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10240: Attachment: YARN-10240.001.patch > Prevent Fatal CancelledException in TimelineV2Client when stopping > -- > > Key: YARN-10240 > URL: https://issues.apache.org/jira/browse/YARN-10240 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Reporter: Tarun Parimi >Priority: Major > Attachments: YARN-10240.001.patch > > > When the timeline client is stopped, it will cancel all sync EntityHolders > after waiting for a drain timeout. > {code:java} > // if some entities were not drained then we need interrupt > // the threads which had put sync EntityHolders to the > queue. > EntitiesHolder nextEntityInTheQueue = null; > while ((nextEntityInTheQueue = > timelineEntityQueue.poll()) != null) { > nextEntityInTheQueue.cancel(true); > } > {code} > We only handle InterruptedException here. > {code:java} > if (sync) { > // In sync call we need to wait till its published and if any error > then > // throw it back > try { > entitiesHolder.get(); > } catch (ExecutionException e) { > throw new YarnException("Failed while publishing entity", > e.getCause()); > } catch (InterruptedException e) { > Thread.currentThread().interrupt(); > throw new YarnException("Interrupted while publishing entity", e); > } > } > {code} > But calling nextEntityInTheQueue.cancel(true) will result in > entitiesHolder.get() throwing a CancellationException, which is not handled. This > can result in a FATAL error in the NM. We need to prevent this. 
> {code:java} > FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in > dispatcher thread > java.util.concurrent.CancellationException > at java.util.concurrent.FutureTask.report(FutureTask.java:121) > at java.util.concurrent.FutureTask.get(FutureTask.java:192) > at > org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545) > at > org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:348) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10240) Prevent Fatal CancelledException in TimelineV2Client when stopping
Tarun Parimi created YARN-10240: --- Summary: Prevent Fatal CancelledException in TimelineV2Client when stopping Key: YARN-10240 URL: https://issues.apache.org/jira/browse/YARN-10240 Project: Hadoop YARN Issue Type: Bug Components: ATSv2 Reporter: Tarun Parimi When the timeline client is stopped, it will cancel all sync EntityHolders after waiting for a drain timeout. {code:java} // if some entities were not drained then we need interrupt // the threads which had put sync EntityHolders to the queue. EntitiesHolder nextEntityInTheQueue = null; while ((nextEntityInTheQueue = timelineEntityQueue.poll()) != null) { nextEntityInTheQueue.cancel(true); } {code} We only handle InterruptedException here. {code:java} if (sync) { // In sync call we need to wait till its published and if any error then // throw it back try { entitiesHolder.get(); } catch (ExecutionException e) { throw new YarnException("Failed while publishing entity", e.getCause()); } catch (InterruptedException e) { Thread.currentThread().interrupt(); throw new YarnException("Interrupted while publishing entity", e); } } {code} But calling nextEntityInTheQueue.cancel(true) will result in entitiesHolder.get() throwing a CancellationException, which is not handled. This can result in a FATAL error in the NM. We need to prevent this. 
{code:java} FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread java.util.concurrent.CancellationException at java.util.concurrent.FutureTask.report(FutureTask.java:121) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl$TimelineEntityDispatcher.dispatchEntities(TimelineV2ClientImpl.java:545) at org.apache.hadoop.yarn.client.api.impl.TimelineV2ClientImpl.putEntities(TimelineV2ClientImpl.java:149) at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.putEntity(NMTimelinePublisher.java:348) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
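The failure mode above can be reproduced with a plain FutureTask: cancelling a queued task makes a later get() throw CancellationException, which must be caught alongside InterruptedException and ExecutionException. The following is a minimal, self-contained sketch; the helper names (waitForPublish, demo) are hypothetical and this is not the actual TimelineV2ClientImpl code:

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class CancelDemo {
    // Hypothetical stand-in for the sync wait on an EntitiesHolder:
    // the extra catch for CancellationException is the point of the sketch.
    static String waitForPublish(FutureTask<String> holder) {
        try {
            return holder.get();
        } catch (CancellationException e) {
            // Without this catch the exception escapes to the caller and,
            // in the NM, surfaces as the FATAL AsyncDispatcher error above.
            return "cancelled";
        } catch (ExecutionException e) {
            return "failed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    static String demo() {
        FutureTask<String> task = new FutureTask<>(() -> "published");
        task.cancel(true); // what the stop-time drain loop does to queued holders
        return waitForPublish(task);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "cancelled"
    }
}
```

Cancelling a FutureTask that has not yet run always succeeds, so get() deterministically throws CancellationException here rather than returning the computed value.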
[jira] [Updated] (YARN-9816) EntityGroupFSTimelineStore#scanActiveLogs fails when undesired files are present under /ats/active.
[ https://issues.apache.org/jira/browse/YARN-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-9816: --- Affects Version/s: 2.8.0 > EntityGroupFSTimelineStore#scanActiveLogs fails when undesired files are > present under /ats/active. > --- > > Key: YARN-9816 > URL: https://issues.apache.org/jira/browse/YARN-9816 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.8.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9816-001.patch > > > EntityGroupFSTimelineStore#scanActiveLogs fails with StackOverflowError. > This happens when a file is present under /ats/active. > {code} > [hdfs@node2 yarn]$ hadoop fs -ls /ats/active > Found 1 items > -rw-r--r-- 3 hdfs hadoop 0 2019-09-06 16:34 > /ats/active/.distcp.tmp.attempt_155759136_39768_m_01_0 > {code} > Error Message: > {code:java} > java.lang.StackOverflowError > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:632) > at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:291) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:203) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:185) > at com.sun.proxy.$Proxy15.getListing(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2143) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1076) > at > org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1088) > at > 
org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.(DistributedFileSystem.java:1059) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1038) > at > org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1034) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusIterator(DistributedFileSystem.java:1046) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.list(EntityGroupFSTimelineStore.java:398) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:368) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > 
org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > at > org.apache.hadoop.yarn.server.timeline.EntityGroupFSTimelineStore.scanActiveLogs(EntityGroupFSTimelineStore.java:383) > {code} > One of our users tried to distcp the hdfs://ats/active dir. The distcp job > created the > temp file .distcp.tmp.at
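The repeated scanActiveLogs frames above show the scan recursing into an entry that is actually a file. The guard that prevents this can be sketched as follows; this is a hypothetical simplification using java.io.File, not the actual HDFS-based EntityGroupFSTimelineStore code (on HDFS, listing a file returns the file itself, so an unguarded recursion never terminates):

```java
import java.io.File;
import java.nio.file.Files;

public class ScanGuardSketch {
    // Hedged sketch: recurse only into directories, so a stray file such as
    // .distcp.tmp.* under the active dir cannot trigger unbounded recursion.
    static int scanActiveLogs(File dir) {
        int found = 0;
        File[] entries = dir.listFiles();
        if (entries == null) {
            return 0; // not a directory (or unreadable): nothing to scan
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                found += scanActiveLogs(entry); // safe: directories only
            } else {
                found++; // a plain file is counted, never recursed into
            }
        }
        return found;
    }

    // Builds a temp "active" dir containing one stray file, like the repro.
    static int demo() {
        try {
            File active = Files.createTempDirectory("ats-active").toFile();
            new File(active, ".distcp.tmp.attempt").createNewFile();
            return scanActiveLogs(active);
        } catch (java.io.IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 1
    }
}
```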
[jira] [Commented] (YARN-9967) Fix NodeManager failing to start when Hdfs Auxillary Jar is set
[ https://issues.apache.org/jira/browse/YARN-9967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053104#comment-17053104 ] Tarun Parimi commented on YARN-9967: Hi [~snemeth], You can take it over. Thanks. > Fix NodeManager failing to start when Hdfs Auxillary Jar is set > --- > > Key: YARN-9967 > URL: https://issues.apache.org/jira/browse/YARN-9967 > Project: Hadoop YARN > Issue Type: Bug > Components: auxservices, nodemanager >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Tarun Parimi >Priority: Major > > Loading an auxiliary jar from an HDFS location on a NodeManager fails with > a ClassNotFoundException > {code:java} > 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: > classpath: [] > 2019-11-08 03:59:49,256 INFO org.apache.hadoop.util.ApplicationClassLoader: > system classes: [java., javax.accessibility., javax.activation., > javax.activity., javax.annotation., javax.annotation.processing., > javax.crypto., javax.imageio., javax.jws., javax.lang.model., > -javax.management.j2ee., javax.management., javax.naming., javax.net., > javax.print., javax.rmi., javax.script., -javax.security.auth.message., > javax.security.auth., javax.security.cert., javax.security.sasl., > javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., > -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., > org.xml.sax., org.apache.commons.logging., org.apache.log4j., > -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, > hdfs-default.xml, mapred-default.xml, yarn-default.xml] > 2019-11-08 03:59:49,257 INFO org.apache.hadoop.service.AbstractService: > Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in state INITED > java.lang.ClassNotFoundException: org.apache.auxtest.AuxServiceFromHDFS > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:270) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:321) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:478) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:936) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:1016) > {code} > *Repro:* > {code:java} > 1. 
Prepare a custom auxiliary service jar and place it on hdfs > [hdfs@yarndocker-1 yarn]$ cat TestShuffleHandler2.java > package org; > import org.apache.hadoop.yarn.server.api.AuxiliaryService; > import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext; > import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext; > import java.nio.ByteBuffer; > public class TestShuffleHandler2 extends AuxiliaryService { > public static final String MAPREDUCE_TEST_SHUFFLE_SERVICEID = > "test_shuffle2"; > public TestShuffleHandler2() { > super("testshuffle2"); > } > @Override > public void initializeApplication(ApplicationInitializationContext > context) { > } > @Override > public void stopApplication(ApplicationTerminationContext context) { > } > @Override > public synchronized ByteBuffer getMetaData() { > return ByteBuffer.allocate(0); > } > } > > [hdfs@yarndocker-1 yarn]$ javac -d . -cp `hadoop classpath` > TestShuffleHandler2.java > [hdfs@yarndocker-1 yarn]$ jar cvf auxhdfs.jar org/ > [hdfs@yarndocker
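The repro above packages a custom AuxiliaryService jar; the failure itself is simply that the class name cannot be resolved on the effectively empty classpath ("classpath: []" in the NM log). A minimal, self-contained sketch of that failure mode (the helper canLoad is hypothetical; the class name is copied from the log and is equally absent here):

```java
public class AuxClassLoadDemo {
    // Illustrates only the failure mode, not the NM's custom classloader:
    // resolving a class that is not on the classpath throws
    // ClassNotFoundException, which is what AuxServices hits in serviceInit.
    static boolean canLoad(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canLoad("org.apache.auxtest.AuxServiceFromHDFS")); // false
        System.out.println(canLoad("java.lang.String")); // true
    }
}
```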
[jira] [Updated] (YARN-10149) container-executor exits with 139 when the permissions of yarn log directory is improper
[ https://issues.apache.org/jira/browse/YARN-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10149: Description: container-executor fails with segmentation fault and exit code 139 when the permission of the yarn log directory is not proper. While running the container-executor manually, we get the below message. {code:java} Error checking file stats for /hadoop/yarn/log -1 Permission denied. {code} But the exit code is 139 which corresponds to a segmentation fault. This is misleading especially since the "Permission denied" is not getting printed in the applogs or the NM logs. Only the exit code 139 message is present. was: container-executor fails with segmentation fault and exit code 139 when the permission of the yarn log directory is not proper. While running the container-executor manually, we get the below message. {code:java} Error checking file stats for /hadoop/yarn/log Permission denied -1 {code} But the exit code is 139 which corresponds to a segmentation fault. This is misleading especially since the "Permission denied" is not getting printed in the applogs or the NM logs. Only the exit code 139 message is present. > container-executor exits with 139 when the permissions of yarn log directory > is improper > > > Key: YARN-10149 > URL: https://issues.apache.org/jira/browse/YARN-10149 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > > container-executor fails with segmentation fault and exit code 139 when the > permission of the yarn log directory is not proper. > While running the container-executor manually, we get the below message. > {code:java} > Error checking file stats for /hadoop/yarn/log -1 Permission denied. > {code} > But the exit code is 139 which corresponds to a segmentation fault. 
This is > misleading especially since the "Permission denied" is not getting printed in > the applogs or the NM logs. Only the exit code 139 message is present. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10149) container-executor exits with 139 when the permissions of yarn log directory is improper
[ https://issues.apache.org/jira/browse/YARN-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-10149: Description: container-executor fails with segmentation fault and exit code 139 when the permission of the yarn log directory is not proper. While running the container-executor manually, we get the below message. {code:java} Error checking file stats for /hadoop/yarn/log Permission denied -1 {code} But the exit code is 139 which corresponds to a segmentation fault. This is misleading especially since the "Permission denied" is not getting printed in the applogs or the NM logs. Only the exit code 139 message is present. was: container-executor fails with segmentation fault and exit code 139 when the permission of the yarn log directory was not proper. While running the container-executor manually, we get the below message. {code:java} Error checking file stats for /hadoop/yarn/log Permission denied -1 {code} But the exit code is 139 which corresponds to a segmentation fault. This is misleading especially since the "Permission denied" is not getting printed in the applogs or the NM logs. > container-executor exits with 139 when the permissions of yarn log directory > is improper > > > Key: YARN-10149 > URL: https://issues.apache.org/jira/browse/YARN-10149 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > > container-executor fails with segmentation fault and exit code 139 when the > permission of the yarn log directory is not proper. > While running the container-executor manually, we get the below message. > {code:java} > Error checking file stats for /hadoop/yarn/log Permission denied -1 > {code} > But the exit code is 139 which corresponds to a segmentation fault. This is > misleading especially since the "Permission denied" is not getting printed in > the applogs or the NM logs. 
Only the exit code 139 message is present. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10149) container-executor exits with 139 when the permissions of yarn log directory is improper
Tarun Parimi created YARN-10149: --- Summary: container-executor exits with 139 when the permissions of yarn log directory is improper Key: YARN-10149 URL: https://issues.apache.org/jira/browse/YARN-10149 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.1.0 Reporter: Tarun Parimi Assignee: Tarun Parimi container-executor fails with a segmentation fault and exit code 139 when the permissions of the yarn log directory are not proper. While running the container-executor manually, we get the following message. {code:java} Error checking file stats for /hadoop/yarn/log Permission denied -1 {code} But the exit code is 139, which corresponds to a segmentation fault. This is misleading, especially since the "Permission denied" is not printed in the applogs or the NM logs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
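Exit code 139 decodes mechanically via the POSIX shell convention that a process killed by signal N is reported as 128 + N, so 139 means signal 11 (SIGSEGV). A small sketch of that decoding (the helper name is hypothetical):

```java
public class ExitCodeDemo {
    // POSIX shells report signal-killed processes as exit status 128 + N,
    // so container-executor's 139 decodes to signal 11, i.e. SIGSEGV. That
    // is why the NM sees a bare 139 instead of "Permission denied".
    static final int SIGSEGV = 11;

    static int signalFromExitCode(int exitCode) {
        return exitCode > 128 ? exitCode - 128 : 0; // 0: not signal-killed
    }

    public static void main(String[] args) {
        System.out.println(signalFromExitCode(139) == SIGSEGV); // prints true
    }
}
```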
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979324#comment-16979324 ] Tarun Parimi commented on YARN-9968: [~snemeth] , Please review this when you get time. > Public Localizer is exiting in NodeManager due to NullPointerException > -- > > Key: YARN-9968 > URL: https://issues.apache.org/jira/browse/YARN-9968 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-9968.001.patch > > > The Public Localizer is encountering a NullPointerException and exiting. > {code:java} > ERROR localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(995)) - Error: Shutting down > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981) > INFO localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(997)) - Public cache exiting > {code} > The NodeManager still keeps on running. Subsequent localization events for > containers keep encountering the below error, resulting in failed > Localization of all new containers. > {code:java} > ERROR localizer.ResourceLocalizationService > (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { > { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null > },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED} > for download. Either queue is full or threadpool is shutdown. 
> java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 > rejected from > org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated, > pool size = 0, active threads = 0, queued tasks = 0, completed tasks = > 382286] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) > at > java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899) > {code} > When this happens, the NodeManager becomes usable only after a restart. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-9968: --- Attachment: YARN-9968.001.patch > Public Localizer is exiting in NodeManager due to NullPointerException > -- > > Key: YARN-9968 > URL: https://issues.apache.org/jira/browse/YARN-9968 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-9968.001.patch > > > The Public Localizer is encountering a NullPointerException and exiting. > {code:java} > ERROR localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(995)) - Error: Shutting down > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981) > INFO localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(997)) - Public cache exiting > {code} > The NodeManager still keeps on running. Subsequent localization events for > containers keep encountering the below error, resulting in failed > Localization of all new containers. > {code:java} > ERROR localizer.ResourceLocalizationService > (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { > { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null > },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED} > for download. Either queue is full or threadpool is shutdown. 
> java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 > rejected from > org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated, > pool size = 0, active threads = 0, queued tasks = 0, completed tasks = > 382286] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) > at > java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899) > {code} > When this happens, the NodeManager becomes usable only after a restart. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
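The race described in this issue (app cleanup removing the tracker while a public download is still in flight) points at a defensive null check. The following is a hedged sketch of that shape only; all names here (the appRsrc map standing in for the tracker registry, onLocalizationFailed) are hypothetical simplifications, not the actual ResourceLocalizationService code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TrackerGuardSketch {
    // Stand-in for the per-app LocalResourcesTracker registry, which app
    // cleanup (handleDestroyApplicationResources) removes entries from.
    static final Map<String, String> appRsrc = new ConcurrentHashMap<>();

    static String onLocalizationFailed(String appId, String diagnostics) {
        String tracker = appRsrc.get(appId);
        if (tracker == null) {
            // App already cleaned up: skip the notification rather than let
            // the public localizer thread die on a NullPointerException.
            return "skipped";
        }
        return "notified " + tracker + ": " + diagnostics;
    }

    public static void main(String[] args) {
        appRsrc.put("app_1", "tracker_1");
        System.out.println(onLocalizationFailed("app_1", "download failed"));
        System.out.println(onLocalizationFailed("app_2", "download failed")); // skipped
    }
}
```

The design point is that the failure path tolerates a missing tracker instead of assuming the app is still registered.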
[jira] [Comment Edited] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973352#comment-16973352 ] Tarun Parimi edited comment on YARN-9968 at 11/13/19 1:56 PM: -- [~snemeth], I was finally able to reproduce it artificially in my test cluster. I added the sleep and subsequent exception below in the FSDownload class to simulate HDFS not responding for a minute and then throwing an exception while trying to download. When the application which requested the resource gets killed during the minute the thread sleeps, I got the null pointer issue and the public localizer exited. {code:java} try { Thread.sleep(6); throw new ExecutionException("Test", new IOException("Exception")); } catch (InterruptedException e) { throw new IOException(e); } {code} From this I understood that the issue occurs when the following sequence of events occurs: 1. The public localizer is waiting on the download of a file from HDFS for quite some time. 2. The application gets killed/fails while the download is still waiting/sleeping. Due to this, the app cleanup is triggered, which removes the LocalResourcesTracker for that app. {code:java} private void handleDestroyApplicationResources(Application application) { String userName = application.getUser(); ApplicationId appId = application.getAppId(); String appIDStr = application.toString(); LocalResourcesTracker appLocalRsrcsTracker = appRsrc.remove(appId.toString()); {code} 3. The download finally fails and an exception is thrown from HDFS. 4. Since the tracker was removed due to the app kill, we get the NullPointerException in the code below, as tracker is null. This causes the public localizer to exit and not handle future localization requests. {code:java} tracker.handle(new ResourceFailedLocalizationEvent( assoc.getResource().getRequest(), diagnostics)); {code} This issue was introduced by the changes in YARN-8403, where the failed localization is notified to the app for logging in the AM. 
I think handling a null check and preventing this should be safe as the AM is already killed in this scenario. Will provide an initial patch based on this. cc [~prabhujoseph] was (Author: tarunparimi): [~snemeth], I was finally able reproduce it artificially in my test cluster. I added the below the sleep and subsequent exception in FSDownload class to simulate the hdfs not responding for a minute and then throwing the exception while trying to download. When the application which requested the resource gets killed during the minute when the thread sleeps, I got null pointer issue and public localizer exited. {code:java} try { Thread.sleep(6); throw new ExecutionException("Test", new IOException("Exception")); } catch (InterruptedException e) { throw new IOException(e); } >From this I understood that the issue occurs when the below sequence of events >occur, 1. The public localizer is waiting on the download of a file from hdfs for quite some time. 2. Application get killed/failed while the download is still waiting/sleeping. Due to this the app cleanup is triggered, which removes the LocalResourcesTracker for that app. {code:java} private void handleDestroyApplicationResources(Application application) { String userName = application.getUser(); ApplicationId appId = application.getAppId(); String appIDStr = application.toString(); LocalResourcesTracker appLocalRsrcsTracker = appRsrc.remove(appId.toString()); {code} 3. The download finally fails and it throws an exception from HDFS. 4. Since the tracker was removed due to app kill, we get the NullPointer in below code as tracker is null . This causes public localizer to exit and not handle future localization requests. {code:java} tracker.handle(new ResourceFailedLocalizationEvent( assoc.getResource().getRequest(), diagnostics)); {code} This issue is introduced due to the changes in YARN-8403 , where the failed localization is notified to the app for logging in the AM. 
I think handling a null check and preventing this should be safe as the AM is already killed in this scenario. Will provide an initial patch based on this. cc [~prabhujoseph] > Public Localizer is exiting in NodeManager due to NullPointerException > -- > > Key: YARN-9968 > URL: https://issues.apache.org/jira/browse/YARN-9968 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > > The Public Localizer is encountering a NullPointerException and exiting. > {c
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973352#comment-16973352 ] Tarun Parimi commented on YARN-9968: [~snemeth], I was finally able to reproduce it artificially in my test cluster. I added the sleep and subsequent exception below in the FSDownload class to simulate HDFS not responding for a minute and then throwing an exception while trying to download. When the application which requested the resource gets killed during the minute the thread sleeps, I got the null pointer issue and the public localizer exited. {code:java} try { Thread.sleep(6); throw new ExecutionException("Test", new IOException("Exception")); } catch (InterruptedException e) { throw new IOException(e); } {code} From this I understood that the issue occurs when the following sequence of events occurs: 1. The public localizer is waiting on the download of a file from HDFS for quite some time. 2. The application gets killed/fails while the download is still waiting/sleeping. Due to this, the app cleanup is triggered, which removes the LocalResourcesTracker for that app. {code:java} private void handleDestroyApplicationResources(Application application) { String userName = application.getUser(); ApplicationId appId = application.getAppId(); String appIDStr = application.toString(); LocalResourcesTracker appLocalRsrcsTracker = appRsrc.remove(appId.toString()); {code} 3. The download finally fails and an exception is thrown from HDFS. 4. Since the tracker was removed due to the app kill, we get the NullPointerException in the code below, as tracker is null. This causes the public localizer to exit and not handle future localization requests. {code:java} tracker.handle(new ResourceFailedLocalizationEvent( assoc.getResource().getRequest(), diagnostics)); {code} This issue was introduced by the changes in YARN-8403, where the failed localization is notified to the app for logging in the AM. 
I think adding a null check to prevent this should be safe, as the AM is already killed in this scenario. Will provide an initial patch based on this. cc [~prabhujoseph] > Public Localizer is exiting in NodeManager due to NullPointerException > -- > > Key: YARN-9968 > URL: https://issues.apache.org/jira/browse/YARN-9968 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > > The Public Localizer is encountering a NullPointerException and exiting. > {code:java} > ERROR localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(995)) - Error: Shutting down > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981) > INFO localizer.ResourceLocalizationService > (ResourceLocalizationService.java:run(997)) - Public cache exiting > {code} > The NodeManager still keeps running. Subsequent localization events for > containers keep encountering the below error, resulting in failed > localization of all new containers. > {code:java} > ERROR localizer.ResourceLocalizationService > (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { > { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null > },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED} > for download. Either queue is full or threadpool is shutdown. 
> java.util.concurrent.RejectedExecutionException: Task > java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 > rejected from > org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated, > pool size = 0, active threads = 0, queued tasks = 0, completed tasks = > 382286] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) > at > java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899) > {code} > When this happens, the NodeManager becomes usable only after a restart. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To uns
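The null-check approach proposed in the comment above can be sketched with a minimal, self-contained model of the race. This is a hypothetical simplification — the Tracker interface and the LocalizerSketch class are stand-ins for LocalResourcesTracker and the ResourceLocalizationService code, not the actual patch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Models the race: app cleanup removes the per-app tracker while a
// download is still in flight, so the failure path must tolerate a
// missing tracker instead of throwing NullPointerException.
class LocalizerSketch {
    // stand-in for LocalResourcesTracker
    interface Tracker { void handle(String event); }

    private final Map<String, Tracker> appTrackers = new ConcurrentHashMap<>();

    void registerApp(String appId, Tracker tracker) {
        appTrackers.put(appId, tracker);
    }

    // app kill/cleanup path, as in handleDestroyApplicationResources
    void destroyApp(String appId) {
        appTrackers.remove(appId);
    }

    // download-failure path: returns false (and skips notification)
    // when the tracker was already removed by a concurrent app cleanup
    boolean notifyFailure(String appId, String diagnostics) {
        Tracker tracker = appTrackers.get(appId);
        if (tracker == null) {
            // AM is already gone; nothing left to notify
            return false;
        }
        tracker.handle(diagnostics);
        return true;
    }
}
```

With this guard, a download failure arriving after app cleanup is dropped rather than killing the public localizer thread.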
[jira] [Comment Edited] (YARN-9925) CapacitySchedulerQueueManager allows unsupported Queue hierarchy
[ https://issues.apache.org/jira/browse/YARN-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973285#comment-16973285 ] Tarun Parimi edited comment on YARN-9925 at 11/13/19 12:08 PM: --- [~vinodkv] , it is fine for me. I was searching for the documentation specifying the unique leaf queue name. I don't see anything currently in the Apache docs referencing it. I guess a single line mentioning that all queue names have to be unique under https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setting_up_queues would be helpful. Shall I create a jira for this doc change? was (Author: tarunparimi): [~vinodkv] , it is fine for me. I was searching for the documentation specifying the unique leaf queue name. I don't see anything currently in apache docs referencing it. I guess a single line mentioning all queue names to be unique under https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setting_up_queues would be helpful. Shall I create a jira for this doc change? > CapacitySchedulerQueueManager allows unsupported Queue hierarchy > > > Key: YARN-9925 > URL: https://issues.apache.org/jira/browse/YARN-9925 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9925-001.patch, YARN-9925-002.patch, > YARN-9925-003.patch > > > CapacitySchedulerQueueManager allows unsupported Queue hierarchy. When > creating a queue with the same name as an existing parent queue, it has to > fail with the below error. > {code:java} > Caused by: java.io.IOException: A is moved > from:root.A to:root.B.A after refresh, which is not allowed. 
at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:335) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:180) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:762) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:473) > ... 70 more > {code} > In some cases, the error is not thrown while creating the queue but is thrown at > job submission: "Failed to submit application_1571677375269_0002 to YARN : > Application application_1571677375269_0002 submitted by user : systest to > non-leaf queue : B" > The below scenarios are allowed, but they should not be: > {code:java} > It allows root.A.A1.B when root.B.B1 already exists. > > 1. Add root.A > 2. Add root.A.A1 > 3. Add root.B > 4. Add root.B.B1 > 5. Allows Add of root.A.A1.B > It allows two root queues: > > 1. Add root.A > 2. Add root.B > 3. Add root.A.A1 > 4. Allows Add of root.A.A1.root > > {code} > The below scenario is handled properly: > {code:java} > It does not allow root.B.A when root.A.A1 already exists. > > 1. Add root.A > 2. Add root.B > 3. Add root.A.A1 > 4. Does not Allow Add of root.B.A > {code} > This error handling has to be consistent in all scenarios. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
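The uniqueness rule discussed in the comment — every queue's short name should be unique across the hierarchy, so a short name maps to exactly one path — can be sketched as a small validation. The QueueNameCheck class and its method are illustrative stand-ins, not the CapacitySchedulerQueueManager API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative check: reject a new queue whose short name already
// exists elsewhere in the hierarchy, since the scheduler keys queues
// by short name and a duplicate would silently shadow the first one.
class QueueNameCheck {
    private final Map<String, String> shortNameToPath = new HashMap<>();

    // returns true if the queue was added, false if the short name clashes
    boolean addQueue(String fullPath) {
        String shortName = fullPath.substring(fullPath.lastIndexOf('.') + 1);
        String existing = shortNameToPath.get(shortName);
        if (existing != null && !existing.equals(fullPath)) {
            // e.g. root.A.A1.B rejected when root.B already exists
            return false;
        }
        shortNameToPath.put(shortName, fullPath);
        return true;
    }
}
```

With such a check applied consistently at queue creation, both scenarios listed in the description (root.A.A1.B after root.B, and a second "root") would be rejected up front instead of failing later at job submission.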
[jira] [Commented] (YARN-9925) CapacitySchedulerQueueManager allows unsupported Queue hierarchy
[ https://issues.apache.org/jira/browse/YARN-9925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973285#comment-16973285 ] Tarun Parimi commented on YARN-9925: [~vinodkv] , it is fine for me. I was searching for the documentation specifying the unique leaf queue name. I don't see anything currently in the Apache docs referencing it. I guess a single line mentioning that all queue names have to be unique under https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html#Setting_up_queues would be helpful. Shall I create a jira for this doc change? > CapacitySchedulerQueueManager allows unsupported Queue hierarchy > > > Key: YARN-9925 > URL: https://issues.apache.org/jira/browse/YARN-9925 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9925-001.patch, YARN-9925-002.patch, > YARN-9925-003.patch > > > CapacitySchedulerQueueManager allows unsupported Queue hierarchy. When > creating a queue with the same name as an existing parent queue, it has to > fail with the below error. > {code:java} > Caused by: java.io.IOException: A is moved > from:root.A to:root.B.A after refresh, which is not allowed. at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:335) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:180) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:762) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:473) > ... 
70 more > {code} > In some cases, the error is not thrown while creating the queue but is thrown at > job submission: "Failed to submit application_1571677375269_0002 to YARN : > Application application_1571677375269_0002 submitted by user : systest to > non-leaf queue : B" > The below scenarios are allowed, but they should not be: > {code:java} > It allows root.A.A1.B when root.B.B1 already exists. > > 1. Add root.A > 2. Add root.A.A1 > 3. Add root.B > 4. Add root.B.B1 > 5. Allows Add of root.A.A1.B > It allows two root queues: > > 1. Add root.A > 2. Add root.B > 3. Add root.A.A1 > 4. Allows Add of root.A.A1.root > > {code} > The below scenario is handled properly: > {code:java} > It does not allow root.B.A when root.A.A1 already exists. > > 1. Add root.A > 2. Add root.B > 3. Add root.A.A1 > 4. Does not Allow Add of root.B.A > {code} > This error handling has to be consistent in all scenarios. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
[ https://issues.apache.org/jira/browse/YARN-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972420#comment-16972420 ] Tarun Parimi commented on YARN-9968: Hi [~snemeth]. Thanks for looking into this. I have not been able to reproduce the issue so far. It is happening on a heavily loaded prod cluster. The cluster is also configured to use DefaultContainerExecutor, so localization is done entirely inside the NM JVM process. The null pointer occurs in the below code where tracker.handle() is called. It looks like tracker is becoming null for some reason. Doing a null check on tracker might be a simple workaround, but understanding how the issue occurred might give us a better way to fix it. {code:java} final String diagnostics = "Failed to download resource " + assoc.getResource() + " " + e.getCause(); tracker.handle(new ResourceFailedLocalizationEvent( assoc.getResource().getRequest(), diagnostics)); {code} There are also multiple HDFS warnings during localization in the log just before this NullPointerException. So I think those HDFS issues while localizing are definitely related and are causing the issue in the first place, but I haven't completely figured out how. {code:java} WARN impl.BlockReaderFactory (BlockReaderFactory.java:getRemoteBlockReaderFromTcp(764)) - I/O error constructing remote block reader. 
java.io.IOException: Got error, status=ERROR, status message opReadBlock BP-290360126-127.0.0.1-1559634768162:blk_3454574939_2740457478 received exception java.io.IOException: No data exists for block BP-290360126-127.0.0.1-1559634768162:blk_blk_3454574939_2740457478, for OP_READ_BLOCK, self=/127.0.0.1:15810, remote=/127.0.0.1:50010, for file /tmp/hadoop-yarn/staging/job-user/.staging/job_1571858983080_36874/job.jar, for pool BP-290360126-127.0.0.1-1559634768162 block 3814574939_2740867478 at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:110) at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.checkSuccess(BlockReaderRemote.java:440) at org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.newBlockReader(BlockReaderRemote.java:408) at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:853) at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:749) at org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:379) at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:641) at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:572) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:754) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:820) at java.io.DataInputStream.read(DataInputStream.java:149) at org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:100) at org.apache.commons.io.input.TeeInputStream.read(TeeInputStream.java:129) at java.io.FilterInputStream.read(FilterInputStream.java:133) at java.io.PushbackInputStream.read(PushbackInputStream.java:186) at java.util.zip.ZipInputStream.readFully(ZipInputStream.java:403) at 
java.util.zip.ZipInputStream.readLOC(ZipInputStream.java:278) at java.util.zip.ZipInputStream.getNextEntry(ZipInputStream.java:122) at java.util.jar.JarInputStream.(JarInputStream.java:83) at java.util.jar.JarInputStream.(JarInputStream.java:62) at org.apache.hadoop.util.RunJar.unJar(RunJar.java:114) at org.apache.hadoop.util.RunJar.unJarAndSave(RunJar.java:167) at org.apache.hadoop.yarn.util.FSDownload.unpack(FSDownload.java:354) at org.apache.hadoop.yarn.util.FSDownload.downloadAndUnpack(FSDownload.java:303) at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:283) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67) at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414) at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
[jira] [Created] (YARN-9968) Public Localizer is exiting in NodeManager due to NullPointerException
Tarun Parimi created YARN-9968: -- Summary: Public Localizer is exiting in NodeManager due to NullPointerException Key: YARN-9968 URL: https://issues.apache.org/jira/browse/YARN-9968 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 3.1.0 Reporter: Tarun Parimi Assignee: Tarun Parimi The Public Localizer is encountering a NullPointerException and exiting. {code:java} ERROR localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(995)) - Error: Shutting down java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.run(ResourceLocalizationService.java:981) INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(997)) - Public cache exiting {code} The NodeManager still keeps on running. Subsequent localization events for containers keep encountering the below error, resulting in failed Localization of all new containers. {code:java} ERROR localizer.ResourceLocalizationService (ResourceLocalizationService.java:addResource(920)) - Failed to submit rsrc { { hdfs://namespace/raw/user/.staging/job/conf.xml 1572071824603, FILE, null },pending,[(container_e30_1571858463080_48304_01_000134)],12513553420029113,FAILED} for download. Either queue is full or threadpool is shutdown. 
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ExecutorCompletionService$QueueingFuture@55c7fa21 rejected from org.apache.hadoop.util.concurrent.HadoopThreadPoolExecutor@46067edd[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 382286] at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$PublicLocalizer.addResource(ResourceLocalizationService.java:899) {code} When this happens, the NodeManager becomes usable only after a restart. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
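The "Either queue is full or threadpool is shutdown" error quoted above is the standard behavior of a terminated java.util.concurrent pool: once the public localizer's executor has exited, every subsequent submit is rejected. A minimal standalone demonstration (plain JDK, not Hadoop code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Once the localizer's thread pool has shut down (here, after the
// uncaught NPE made the PublicLocalizer exit), every later submit is
// rejected, which is why all subsequent localizations fail until
// the NodeManager is restarted.
class RejectedDemo {
    static boolean submitAfterShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.shutdownNow(); // simulate the pool having exited
        try {
            pool.submit(() -> "download");
            return false; // not reached: submit is rejected
        } catch (RejectedExecutionException e) {
            // matches the RejectedExecutionException in the quoted log
            return true;
        }
    }
}
```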
[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.
[ https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16958613#comment-16958613 ] Tarun Parimi commented on YARN-9921: Thanks for the reviews [~tangzhankun] and [~prabhujoseph#1] > Issue in PlacementConstraint when YARN Service AM retries allocation on > component failure. > -- > > Key: YARN-9921 > URL: https://issues.apache.org/jira/browse/YARN-9921 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Fix For: 3.3.0, 3.1.4 > > Attachments: YARN-9921.001.patch, differenceProtobuf.png > > > When YARN Service AM tries to relaunch a container on failure, we encounter > the below error in PlacementConstraints. > {code:java} > ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat > org.apache.hadoop.yarn.exceptions.YarnException: > org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: > Invalid updated SchedulingRequest added to scheduler, we only allows changing > numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, > executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, > allocationTags=[component], > resourceSizing=ResourceSizingPBImpl{numAllocations=0, > resources=}, > placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} > new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, > executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, > allocationTags=[component], > resourceSizing=ResourceSizingPBImpl{numAllocations=1, > resources=}, > placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]}, > if any fields need to be updated, please cancel the old request (by setting > numAllocations to 0) and send a SchedulingRequest with different combination > of priority/allocationId > {code} > But we can see from the message that the SchedulingRequest is indeed valid > with everything same except numAllocations as expected. But still the below > equals check in SingleConstraintAppPlacementAllocator fails. > {code:java} > // Compare two objects > if (!schedulingRequest.equals(newSchedulingRequest)) { > // Rollback #numAllocations > sizing.setNumAllocations(newNumAllocations); > throw new SchedulerInvalidResoureRequestException( > "Invalid updated SchedulingRequest added to scheduler, " > + " we only allows changing numAllocations for the updated " > + "SchedulingRequest. Old=" + schedulingRequest.toString() > + " new=" + newSchedulingRequest.toString() > + ", if any fields need to be updated, please cancel the " > + "old request (by setting numAllocations to 0) and send a " > + "SchedulingRequest with different combination of " > + "priority/allocationId"); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
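The validation quoted above compares the old and new SchedulingRequest after aligning only numAllocations. A simplified model of that check — the Request class here is a hypothetical stand-in for SchedulingRequestPBImpl, ignoring the protobuf serialization subtlety that caused this particular bug:

```java
import java.util.Objects;

// Simplified model of the scheduler's update check: only
// numAllocations may change between the old and the new request;
// any other field difference fails the update.
class SchedulingRequestSketch {
    static final class Request {
        final int priority;
        final long allocationReqId;
        final String placementConstraint;
        int numAllocations;

        Request(int priority, long allocationReqId,
                String placementConstraint, int numAllocations) {
            this.priority = priority;
            this.allocationReqId = allocationReqId;
            this.placementConstraint = placementConstraint;
            this.numAllocations = numAllocations;
        }

        @Override public boolean equals(Object o) {
            if (!(o instanceof Request)) return false;
            Request r = (Request) o;
            return priority == r.priority
                && allocationReqId == r.allocationReqId
                && numAllocations == r.numAllocations
                && Objects.equals(placementConstraint, r.placementConstraint);
        }

        @Override public int hashCode() {
            return Objects.hash(priority, allocationReqId,
                placementConstraint, numAllocations);
        }
    }

    // true when the update is valid, i.e. only numAllocations differs
    static boolean isValidUpdate(Request oldReq, Request newReq) {
        int saved = oldReq.numAllocations;
        oldReq.numAllocations = newReq.numAllocations; // align the mutable field
        boolean equal = oldReq.equals(newReq);
        oldReq.numAllocations = saved; // restore
        return equal;
    }
}
```

In the bug above, the two requests should have passed this equality check (only numAllocations differed), but the underlying protobuf comparison still reported a difference.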
[jira] [Commented] (YARN-9772) CapacitySchedulerQueueManager has incorrect list of queues
[ https://issues.apache.org/jira/browse/YARN-9772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957776#comment-16957776 ] Tarun Parimi commented on YARN-9772: Operators having several hundreds of queues might have accidentally configured them this way, since there is no current document which says to do otherwise. Detailing it in the documentation and printing the complete queue paths which violate the rule will help those few people change their queue configs properly. > CapacitySchedulerQueueManager has incorrect list of queues > -- > > Key: YARN-9772 > URL: https://issues.apache.org/jira/browse/YARN-9772 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Manikandan R >Assignee: Manikandan R >Priority: Major > > CapacitySchedulerQueueManager has incorrect list of queues when there is more > than one parent queue (say at a middle level) with the same name. > For example, > * root > ** a > *** b > **** c > *** d > **** b > ** e > {{CapacitySchedulerQueueManager#getQueues}} maintains this list of queues. > While parsing "root.a.d.b", it overrides "root.a.b" with a new Queue object in > the map because of the similar name. After parsing all the queues, the map count > should be 7, but it is 6. Any reference to queue "root.a.b" in the code path is > nothing but the "root.a.d.b" object. Since > {{CapacitySchedulerQueueManager#getQueues}} has been used in multiple places, > will need to understand the implications in detail. For example, > {{CapacityScheduler#getQueue}} has been used in many places which in turn > uses {{CapacitySchedulerQueueManager#getQueues}}. cc [~eepayne], [~sunilg] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
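The overwrite described in the issue — a flat map keyed by short queue name silently replacing an earlier entry — can be reproduced with a plain HashMap. This is an illustration of the symptom, not the CapacitySchedulerQueueManager code:

```java
import java.util.HashMap;
import java.util.Map;

// Reproduces the symptom from the issue: parsing root.a.b and then
// root.a.d.b into a map keyed by short name leaves only one "b"
// entry, so lookups for root.a.b silently return the root.a.d.b object.
class DuplicateQueueDemo {
    static Map<String, String> buildQueueMap(String... fullPaths) {
        Map<String, String> queues = new HashMap<>();
        for (String path : fullPaths) {
            String shortName = path.substring(path.lastIndexOf('.') + 1);
            queues.put(shortName, path); // a later duplicate overwrites the earlier entry
        }
        return queues;
    }
}
```

Four distinct queue paths produce only three map entries here, mirroring the "map count should be 7, but it is 6" observation in the description.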
[jira] [Commented] (YARN-9928) ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
[ https://issues.apache.org/jira/browse/YARN-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16957101#comment-16957101 ] Tarun Parimi commented on YARN-9928: The issue occurs because the container returned in the below code snippet is null. {code:java} private void publishContainerCreatedEvent(ContainerEvent event) { if (publishNMContainerEvents) { ContainerId containerId = event.getContainerID(); ContainerEntity entity = createContainerEntity(containerId); Container container = context.getContainers().get(containerId); Resource resource = container.getResource(); {code} This issue does not usually occur because there is a previous null check for the same container done in ContainerManagerImpl. {code:java} Map<ContainerId, Container> containers = ContainerManagerImpl.this.context.getContainers(); Container c = containers.get(event.getContainerID()); if (c != null) { c.handle(event); if (nmMetricsPublisher != null) { nmMetricsPublisher.publishContainerEvent(event); } {code} But in a heavily loaded prod cluster with lots of events in the ContainerManager dispatcher, and when the NM is also resyncing with the RM at the same time in a separate NM dispatcher thread, the resync can suddenly remove all the completed containers between these two lookups. So an additional null check is needed for the container in these scenarios. > ATSv2 can make NM go down with a FATAL error while it is resyncing with RM > -- > > Key: YARN-9928 > URL: https://issues.apache.org/jira/browse/YARN-9928 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > > Encountered the below FATAL error in the NodeManager which was under heavy > load and was also resyncing with RM at the same time. This caused the NM to go > down. 
> {code:java} > 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher > (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
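The additional null check proposed in the comment can be sketched with a minimal model of the race between the dispatcher and the RM resync. TimelinePublisherSketch and its method names are hypothetical stand-ins for NMTimelinePublisher, not the actual fix:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Models the race: the resync path removes completed containers from
// the context between the dispatcher's null check and the publisher's
// own lookup, so the publisher must re-check for null itself.
class TimelinePublisherSketch {
    private final Map<String, String> containers = new ConcurrentHashMap<>();

    void addContainer(String containerId, String resource) {
        containers.put(containerId, resource);
    }

    // resync with the RM can remove completed containers at any time
    void removeContainersOnResync() {
        containers.clear();
    }

    // returns the published resource, or null if the container vanished
    String publishCreatedEvent(String containerId) {
        String resource = containers.get(containerId);
        if (resource == null) {
            // container already removed by resync; skip instead of NPE
            return null;
        }
        return resource;
    }
}
```

The second lookup guards the exact window the comment describes: the dispatcher's earlier null check passed, but the container was gone by the time the publisher ran.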
[jira] [Updated] (YARN-9928) ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
[ https://issues.apache.org/jira/browse/YARN-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-9928: --- Component/s: ATSv2 > ATSv2 can make NM go down with a FATAL error while it is resyncing with RM > -- > > Key: YARN-9928 > URL: https://issues.apache.org/jira/browse/YARN-9928 > Project: Hadoop YARN > Issue Type: Bug > Components: ATSv2 >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > > Encountered the below FATAL error in the NodeManager which was under heavy > load and was also resyncing with RM at the same time. This caused the NM to go > down. > {code:java} > 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher > (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9928) ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
[ https://issues.apache.org/jira/browse/YARN-9928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-9928: --- Affects Version/s: 3.1.0 > ATSv2 can make NM go down with a FATAL error while it is resyncing with RM > -- > > Key: YARN-9928 > URL: https://issues.apache.org/jira/browse/YARN-9928 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > > Encountered the below FATAL error in the NodeManager which was under heavy > load and was also resyncing with RM at the same time. This caused the NM to go > down. > {code:java} > 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher > (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216) > at > org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511) > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197) > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9928) ATSv2 can make NM go down with a FATAL error while it is resyncing with RM
Tarun Parimi created YARN-9928: -- Summary: ATSv2 can make NM go down with a FATAL error while it is resyncing with RM Key: YARN-9928 URL: https://issues.apache.org/jira/browse/YARN-9928 Project: Hadoop YARN Issue Type: Bug Reporter: Tarun Parimi Assignee: Tarun Parimi Encountered the below FATAL error in the NodeManager which was under heavy load and was also resyncing with RM at the same time. This caused the NM to go down. {code:java} 2019-09-18 11:22:44,899 FATAL event.AsyncDispatcher (AsyncDispatcher.java:dispatch(203)) - Error in dispatcher thread java.lang.NullPointerException at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerCreatedEvent(NMTimelinePublisher.java:216) at org.apache.hadoop.yarn.server.nodemanager.timelineservice.NMTimelinePublisher.publishContainerEvent(NMTimelinePublisher.java:383) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1520) at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1511) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126) at java.lang.Thread.run(Thread.java:748) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9773) Add QueueMetrics for Custom Resources
[ https://issues.apache.org/jira/browse/YARN-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955958#comment-16955958 ] Tarun Parimi commented on YARN-9773: Got a findbugs warning from the changes done in this jira. https://builds.apache.org/job/PreCommit-YARN-Build/25021/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html > Add QueueMetrics for Custom Resources > - > > Key: YARN-9773 > URL: https://issues.apache.org/jira/browse/YARN-9773 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Manikandan R >Assignee: Manikandan R >Priority: Major > Fix For: 3.3.0, 3.2.2, 3.1.4 > > Attachments: YARN-9773.001.patch, YARN-9773.002.patch, > YARN-9773.003.patch > > > Although the custom resource metrics are calculated and saved as a > QueueMetricsForCustomResources object within the QueueMetrics class, the JMX > and Simon QueueMetrics do not report that information for custom resources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.
[ https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955955#comment-16955955 ] Tarun Parimi commented on YARN-9921: The Findbugs warning is due to the changes done in YARN-9773 and is not related to the patch. > Issue in PlacementConstraint when YARN Service AM retries allocation on > component failure. > -- > > Key: YARN-9921 > URL: https://issues.apache.org/jira/browse/YARN-9921 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-9921.001.patch, differenceProtobuf.png > > > When YARN Service AM tries to relaunch a container on failure, we encounter > the below error in PlacementConstraints. > {code:java} > ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat > org.apache.hadoop.yarn.exceptions.YarnException: > org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: > Invalid updated SchedulingRequest added to scheduler, we only allows changing > numAllocations for the updated SchedulingRequest. 
> Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, > executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, > allocationTags=[component], > resourceSizing=ResourceSizingPBImpl{numAllocations=0, > resources=}, > placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} > new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, > executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, > allocationTags=[component], > resourceSizing=ResourceSizingPBImpl{numAllocations=1, > resources=}, > placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]}, > if any fields need to be updated, please cancel the old request (by setting > numAllocations to 0) and send a SchedulingRequest with different combination > of priority/allocationId > {code} > But we can see from the message that the SchedulingRequest is indeed valid > with everything same except numAllocations as expected. But still the below > equals check in SingleConstraintAppPlacementAllocator fails. > {code:java} > // Compare two objects > if (!schedulingRequest.equals(newSchedulingRequest)) { > // Rollback #numAllocations > sizing.setNumAllocations(newNumAllocations); > throw new SchedulerInvalidResoureRequestException( > "Invalid updated SchedulingRequest added to scheduler, " > + " we only allows changing numAllocations for the updated " > + "SchedulingRequest. Old=" + schedulingRequest.toString() > + " new=" + newSchedulingRequest.toString() > + ", if any fields need to be updated, please cancel the " > + "old request (by setting numAllocations to 0) and send a " > + "SchedulingRequest with different combination of " > + "priority/allocationId"); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.
[ https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955803#comment-16955803 ] Tarun Parimi commented on YARN-9921: Thanks for the review [~tangzhankun].
[jira] [Comment Edited] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.
[ https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955763#comment-16955763 ] Tarun Parimi edited comment on YARN-9921 at 10/21/19 5:55 AM: Submitting a patch which changes the equals method in SchedulingRequestPBImpl to compare the objects instead of the proto. Verified that this fixes the issue in my cluster where it was reproducible. Added a case to test the updatePendingAsk for a newly constructed SchedulingRequest. [~sunilg], [~cheersyang], [~eyang], [~Prabhu Joseph] Please check when you get time.
[jira] [Updated] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.
[ https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-9921: --- Attachment: YARN-9921.001.patch
[jira] [Commented] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.
[ https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955755#comment-16955755 ] Tarun Parimi commented on YARN-9921: On debugging this, I found that protobuf considers the targetExpressions objects unequal. Protobuf expects the order of elements in targetExpressions to be the same, but the order can change, as we can see below. !differenceProtobuf.png! The order changes because targetExpressions is defined as an unordered Set:
{code:java}
/**
 * Get the target expressions of the constraint.
 *
 * @return the set of target expressions
 */
public Set<TargetExpression> getTargetExpressions() {
  return targetExpressions;
}
{code}
But the proto field is defined as a repeated field, and https://github.com/protocolbuffers/protobuf/issues/2116 confirms that order is strictly checked when comparing repeated fields:
{code:java}
repeated PlacementConstraintTargetProto targetExpressions = 2;
{code}
I don't think it is safe to change the proto to handle this issue, since that can cause backward compatibility/upgrade and other problems. A simpler fix is to change the equals method in SchedulingRequestPBImpl so that it does not depend on the equals method of the protobuf. Will submit a working patch for this soon.
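The mismatch described above can be reproduced with plain JDK collections (a toy model, not the actual Hadoop or protobuf classes): a protobuf repeated field compares like a List, whose equals() is order-sensitive, while the Java API exposes the target expressions as a Set, whose equals() is not.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration: the same two target expressions, serialized in
// different orders, compare unequal at the proto (List) level but equal
// at the object (Set) level.
public class RepeatedFieldEquality {
    public static void main(String[] args) {
        List<String> protoOld = Arrays.asList(
            "notin,node,component", "notin,node,yarn_node_partition");
        List<String> protoNew = Arrays.asList(
            "notin,node,yarn_node_partition", "notin,node,component");

        // Proto-level comparison: List.equals is order-sensitive.
        System.out.println(protoOld.equals(protoNew)); // false

        // Object-level comparison: Set.equals ignores order, matching the
        // Set semantics of getTargetExpressions(). Comparing objects instead
        // of protos is the behavior the proposed patch relies on.
        Set<String> objOld = new HashSet<>(protoOld);
        Set<String> objNew = new HashSet<>(protoNew);
        System.out.println(objOld.equals(objNew)); // true
    }
}
```

This is why two SchedulingRequests that are logically identical can still fail a proto-based equals check.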
[jira] [Updated] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.
[ https://issues.apache.org/jira/browse/YARN-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-9921: --- Attachment: differenceProtobuf.png
[jira] [Created] (YARN-9921) Issue in PlacementConstraint when YARN Service AM retries allocation on component failure.
Tarun Parimi created YARN-9921: -- Summary: Issue in PlacementConstraint when YARN Service AM retries allocation on component failure. Key: YARN-9921 URL: https://issues.apache.org/jira/browse/YARN-9921 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.1.0 Reporter: Tarun Parimi Assignee: Tarun Parimi When YARN Service AM tries to relaunch a container on failure, we encounter the below error in PlacementConstraints. {code:java} ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.yarn.exceptions.SchedulerInvalidResoureRequestException: Invalid updated SchedulingRequest added to scheduler, we only allows changing numAllocations for the updated SchedulingRequest. Old=SchedulingRequestPBImpl{priority=0, allocationReqId=0, executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, allocationTags=[component], resourceSizing=ResourceSizingPBImpl{numAllocations=0, resources=}, placementConstraint=notin,node,llap:notin,node,yarn_node_partition/=[label]} new=SchedulingRequestPBImpl{priority=0, allocationReqId=0, executionType={Execution Type: GUARANTEED, Enforce Execution Type: true}, allocationTags=[component], resourceSizing=ResourceSizingPBImpl{numAllocations=1, resources=}, placementConstraint=notin,node,component:notin,node,yarn_node_partition/=[label]}, if any fields need to be updated, please cancel the old request (by setting numAllocations to 0) and send a SchedulingRequest with different combination of priority/allocationId {code} But we can see from the message that the SchedulingRequest is indeed valid with everything same except numAllocations as expected. But still the below equals check in SingleConstraintAppPlacementAllocator fails. 
{code:java}
// Compare two objects
if (!schedulingRequest.equals(newSchedulingRequest)) {
  // Rollback #numAllocations
  sizing.setNumAllocations(newNumAllocations);
  throw new SchedulerInvalidResoureRequestException(
      "Invalid updated SchedulingRequest added to scheduler, "
          + " we only allows changing numAllocations for the updated "
          + "SchedulingRequest. Old=" + schedulingRequest.toString()
          + " new=" + newSchedulingRequest.toString()
          + ", if any fields need to be updated, please cancel the "
          + "old request (by setting numAllocations to 0) and send a "
          + "SchedulingRequest with different combination of "
          + "priority/allocationId");
}
{code}
[jira] [Updated] (YARN-9907) Make YARN Service AM RPC port configurable
[ https://issues.apache.org/jira/browse/YARN-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-9907: --- Attachment: YARN-9907.001.patch > Make YARN Service AM RPC port configurable > -- > > Key: YARN-9907 > URL: https://issues.apache.org/jira/browse/YARN-9907 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-9907.001.patch > > > YARN Service AM uses a random ephemeral port for the ClientAMService RPC. In > environments where firewalls block unnecessary ports by default, it is useful > to have a configuration that specifies the port range. Similar to what we > have for MapReduce {{yarn.app.mapreduce.am.job.client.port-range}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9907) Make YARN Service AM RPC port configurable
Tarun Parimi created YARN-9907: -- Summary: Make YARN Service AM RPC port configurable Key: YARN-9907 URL: https://issues.apache.org/jira/browse/YARN-9907 Project: Hadoop YARN Issue Type: Bug Components: yarn-native-services Reporter: Tarun Parimi Assignee: Tarun Parimi YARN Service AM uses a random ephemeral port for the ClientAMService RPC. In environments where firewalls block unnecessary ports by default, it is useful to have a configuration that specifies the port range. Similar to what we have for MapReduce {{yarn.app.mapreduce.am.job.client.port-range}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
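A sketch of the idea behind a configurable port range (illustrative only; the class and method names here are not the actual YARN Service AM code, and the only real configuration key referenced is the MapReduce one mentioned above): parse a "min-max" range string in the same format as yarn.app.mapreduce.am.job.client.port-range and bind to the first free port in that range.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical helper showing how a "min-max" port range could be parsed
// and applied, similar to yarn.app.mapreduce.am.job.client.port-range.
public class PortRangeBinder {
    /** Parses "min-max" into a two-element array {min, max}. */
    static int[] parseRange(String range) {
        String[] parts = range.split("-");
        int min = Integer.parseInt(parts[0].trim());
        int max = Integer.parseInt(parts[1].trim());
        if (min > max || min < 0 || max > 65535) {
            throw new IllegalArgumentException("Bad port range: " + range);
        }
        return new int[] {min, max};
    }

    /** Returns a socket bound to the first free port in the range. */
    static ServerSocket bindInRange(String range) throws IOException {
        int[] r = parseRange(range);
        for (int port = r[0]; port <= r[1]; port++) {
            try {
                return new ServerSocket(port);
            } catch (IOException e) {
                // Port in use; try the next one in the range.
            }
        }
        throw new IOException("No free port in range " + range);
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket s = bindInRange("50000-50010")) {
            // The bound port falls inside the configured range.
            System.out.println(s.getLocalPort() >= 50000 && s.getLocalPort() <= 50010);
        }
    }
}
```

With a scheme like this, firewall rules only need to allow the configured range instead of all ephemeral ports.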
[jira] [Updated] (YARN-9903) Support reservations continue looking for Node Labels
[ https://issues.apache.org/jira/browse/YARN-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-9903: --- Description: YARN-1769 brought in the reservations continue looking feature, which improves several resource reservation scenarios. However, it is currently not handled when nodes have a label assigned to them. This is useful, and in many cases necessary, even for Node Labels, so we should look to support it for node labels as well. For example, in AbstractCSQueue.java, we have the below TODO.
{code:java}
// TODO, now only consider reservation cases when the node has no label
if (this.reservationsContinueLooking && nodePartition.equals(
    RMNodeLabelsManager.NO_LABEL) && Resources.greaterThan(
    resourceCalculator, clusterResource, resourceCouldBeUnreserved,
    Resources.none())) {
{code}
cc [~sunilg]
[jira] [Created] (YARN-9903) Support reservations continue looking for Node Labels
Tarun Parimi created YARN-9903: -- Summary: Support reservations continue looking for Node Labels Key: YARN-9903 URL: https://issues.apache.org/jira/browse/YARN-9903 Project: Hadoop YARN Issue Type: Bug Reporter: Tarun Parimi
YARN-1769 brought in the reservations continue looking feature, which improves several resource reservation scenarios. However, it is currently not handled when nodes have a label assigned to them. This is useful, and in many cases necessary, even for Node Labels, so we should look to support it for node labels as well.
{code:java}
// TODO, now only consider reservation cases when the node has no label
if (this.reservationsContinueLooking && nodePartition.equals(
    RMNodeLabelsManager.NO_LABEL) && Resources.greaterThan(
    resourceCalculator, clusterResource, resourceCouldBeUnreserved,
    Resources.none())) {
{code}
cc [~sunilg]
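The guard in the AbstractCSQueue snippet above can be modeled as a pure function to make the restriction visible (a toy model, not the actual Hadoop code, and the relaxed variant is a hypothetical sketch of the requested change, not a committed fix):

```java
// Toy model of the TODO'd condition in AbstractCSQueue: "reservations
// continue looking" is only allowed on the default (empty) partition.
// A hypothetical relaxation would drop the partition restriction so that
// labeled partitions are considered too.
public class ContinueLookingCheck {
    static final String NO_LABEL = "";

    // Mirrors the current condition: bails out on any labeled partition.
    static boolean currentCheck(boolean continueLooking, String partition,
                                long unreservableMb) {
        return continueLooking && NO_LABEL.equals(partition) && unreservableMb > 0;
    }

    // Hypothetical relaxed condition: the partition no longer matters.
    static boolean relaxedCheck(boolean continueLooking, String partition,
                                long unreservableMb) {
        return continueLooking && unreservableMb > 0;
    }

    public static void main(String[] args) {
        // On a labeled node, the current check refuses to continue looking
        // even though resources could be unreserved...
        System.out.println(currentCheck(true, "gpu", 1024)); // false
        // ...while the relaxed check would continue looking.
        System.out.println(relaxedCheck(true, "gpu", 1024)); // true
    }
}
```

A real fix would also need to account for per-partition capacities and preemption, which is presumably why the restriction exists as a TODO.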
[jira] [Commented] (YARN-8786) LinuxContainerExecutor fails sporadically in create_local_dirs
[ https://issues.apache.org/jira/browse/YARN-8786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933274#comment-16933274 ] Tarun Parimi commented on YARN-8786: YARN-9833 could fix this issue > LinuxContainerExecutor fails sporadically in create_local_dirs > -- > > Key: YARN-8786 > URL: https://issues.apache.org/jira/browse/YARN-8786 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Jon Bender >Priority: Major > > We started using CGroups with LinuxContainerExecutor recently, running Apache > Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a yarn > container will fail with a message like the following: > {code:java} > [2018-09-02 23:48:02.458691] 18/09/02 23:48:02 INFO container.ContainerImpl: > Container container_1530684675517_516620_01_020846 transitioned from > SCHEDULED to RUNNING > [2018-09-02 23:48:02.458874] 18/09/02 23:48:02 INFO > monitor.ContainersMonitorImpl: Starting resource-monitoring for > container_1530684675517_516620_01_020846 > [2018-09-02 23:48:02.506114] 18/09/02 23:48:02 WARN > privileged.PrivilegedOperationExecutor: Shell execution returned exit code: > 35. Privileged Execution Operation Stderr: > [2018-09-02 23:48:02.506159] Could not create container dirsCould not create > local files and directories > [2018-09-02 23:48:02.506220] > [2018-09-02 23:48:02.506238] Stdout: main : command provided 1 > [2018-09-02 23:48:02.506258] main : run as user is nobody > [2018-09-02 23:48:02.506282] main : requested yarn user is root > [2018-09-02 23:48:02.506294] Getting exit code file... > [2018-09-02 23:48:02.506307] Creating script paths... > [2018-09-02 23:48:02.506330] Writing pid file... > [2018-09-02 23:48:02.506366] Writing to tmp file > /path/to/hadoop/yarn/local/nmPrivate/application_1530684675517_516620/container_1530684675517_516620_01_020846/container_1530684675517_516620_01_020846.pid.tmp > [2018-09-02 23:48:02.506389] Writing to cgroup task files... 
> [2018-09-02 23:48:02.506402] Creating local dirs... > [2018-09-02 23:48:02.506414] Getting exit code file... > [2018-09-02 23:48:02.506435] Creating script paths... > {code} > Looking at the container executor source it's traceable to errors here: > [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604] > And ultimately to > [https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672] > The root failure seems to be in the underlying mkdir call, but that exit code > / errno is swallowed so we don't have more details. We tend to see this when > many containers start at the same time for the same application on a host, > and suspect it may be related to some race conditions around those shared > directories between containers for the same application. 
> For example, this is a typical pattern in the audit logs: > {code:java} > [2018-09-07 17:16:38.447654] 18/09/07 17:16:38 INFO > nodemanager.NMAuditLogger: USER=root IP=<> Container Request > TARGET=ContainerManageImpl RESULT=SUCCESS > APPID=application_1530684675517_559126 > CONTAINERID=container_1530684675517_559126_01_012871 > [2018-09-07 17:16:38.492298] 18/09/07 17:16:38 INFO > nodemanager.NMAuditLogger: USER=root IP=<> Container Request > TARGET=ContainerManageImpl RESULT=SUCCESS > APPID=application_1530684675517_559126 > CONTAINERID=container_1530684675517_559126_01_012870 > [2018-09-07 17:16:38.614044] 18/09/07 17:16:38 WARN > nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - > Failed TARGET=ContainerImplRESULT=FAILURE DESCRIPTION=Container failed > with state: EXITED_WITH_FAILUREAPPID=application_1530684675517_559126 > CONTAINERID=container_1530684675517_559126_01_012871 > {code} > Two containers for the same application starting in quick succession followed > by the EXITED_WITH_FAILURE step (exit code 35). > We plan to upgrade to 3.1.x soon but I don't expect this to be fixed by this, > the only major JIRAs that affected the executor since 3.0.0 seem unrelated > ([https://github.com/apache/hadoop/commit/bc285da107bb84a3c60c5224369d7398a41db2d8] > and > [https://github.com/apache/hadoop/commit/a82be7754d74f4d16b206427b91e700bb5f44d56]) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
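The suspected race above, where two containers of the same application create the same shared directory concurrently, can be sketched in a few lines (Java for illustration; the real container-executor is C code calling mkdir(2)). The loser of the race sees a failure unless "already exists" is treated as success:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the race: a plain mkdir() reports failure when another
// creator got there first (like mkdir(2) returning -1 with errno EEXIST),
// even though the directory exists and is usable.
public class MkdirRace {
    public static void main(String[] args) throws IOException {
        Path appDir = Paths.get(System.getProperty("java.io.tmpdir"),
            "app_dir_race_demo");
        Files.deleteIfExists(appDir); // start clean

        // First creator wins.
        appDir.toFile().mkdir();
        // Second creator: plain mkdir() fails because the directory exists.
        boolean second = appDir.toFile().mkdir();
        System.out.println(second); // false

        // EEXIST-tolerant pattern: createDirectories() succeeds whether or
        // not the directory already exists, so concurrent creators all pass.
        Files.createDirectories(appDir);
        System.out.println(Files.isDirectory(appDir)); // true

        Files.deleteIfExists(appDir); // clean up
    }
}
```

This matches the report's observation that the failure shows up when many containers of one application start on the same host at once, and that the underlying mkdir errno is swallowed.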
[jira] [Commented] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files
[ https://issues.apache.org/jira/browse/YARN-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931109#comment-16931109 ] Tarun Parimi commented on YARN-9837: Thanks for the review [~eyang].
> YARN Service fails to fetch status for Stopped apps with bigger spec files
> --
>
> Key: YARN-9837
> URL: https://issues.apache.org/jira/browse/YARN-9837
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn-native-services
> Affects Versions: 3.1.0
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Attachments: YARN-9837.001.patch
>
> Was unable to fetch status for a STOPPED app due to the below error in the RM logs:
> {code:java}
> ERROR webapp.ApiServer (ApiServer.java:getService(213)) - Get service failed: {}
> java.io.EOFException: Read of hdfs://my-cluster:8020/user/appuser/.yarn/services/my-service/my-service.json finished prematurely
> at org.apache.hadoop.yarn.service.utils.JsonSerDeser.load(JsonSerDeser.java:188)
> at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.loadService(ServiceApiUtil.java:360)
> at org.apache.hadoop.yarn.service.client.ServiceClient.getAppId(ServiceClient.java:1409)
> at org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1235)
> at org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$3(ApiServer.java:749)
> {code}
> This seems to happen when the JSON file my-service.json is larger than 128KB in my cluster.
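A common cause of this kind of premature-EOF failure is assuming that a single `InputStream.read(byte[])` call fills the whole buffer; a stream may legally return fewer bytes per call, so files beyond an internal chunk size (here apparently ~128KB) need a read loop. The sketch below is illustrative only, under that assumption: `ReadFully` and its methods are hypothetical names, not the actual `JsonSerDeser.load` code.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: why one read() can truncate at a chunk boundary,
// and the loop that avoids it.
public class ReadFully {
    // Buggy pattern: trusts a single read() call to fill the buffer.
    static byte[] readOnce(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int n = in.read(buf); // may stop well short of len
        if (n < len) {
            throw new EOFException("finished prematurely at " + n);
        }
        return buf;
    }

    // Correct pattern: keep reading until the buffer is full or real EOF.
    static byte[] readAll(InputStream in, int len) throws IOException {
        byte[] buf = new byte[len];
        int off = 0;
        while (off < len) {
            int n = in.read(buf, off, len - off);
            if (n < 0) {
                throw new EOFException("unexpected EOF at offset " + off);
            }
            off += n;
        }
        return buf;
    }
}
```

`java.io.DataInputStream.readFully` implements the same loop in the JDK, which is why switching to it (or an equivalent loop) is the usual fix for this pattern.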
[jira] [Updated] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files
[ https://issues.apache.org/jira/browse/YARN-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tarun Parimi updated YARN-9837: --- Attachment: YARN-9837.001.patch
[jira] [Created] (YARN-9837) YARN Service fails to fetch status for Stopped apps with bigger spec files
Tarun Parimi created YARN-9837: -- Summary: YARN Service fails to fetch status for Stopped apps with bigger spec files Key: YARN-9837 URL: https://issues.apache.org/jira/browse/YARN-9837 Project: Hadoop YARN Issue Type: Bug Components: yarn-native-services Affects Versions: 3.1.0 Reporter: Tarun Parimi Assignee: Tarun Parimi
[jira] [Commented] (YARN-9772) CapacitySchedulerQueueManager has incorrect list of queues
[ https://issues.apache.org/jira/browse/YARN-9772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930521#comment-16930521 ] Tarun Parimi commented on YARN-9772: bq. Should we extend the duplicates check (as of now, it does only for leaf queues) to parent queues as well? [~maniraj...@gmail.com], the only problem I see is that there will be existing users who might already have a queue config containing parent queues with duplicate names. They will face errors when they upgrade and be forced to modify their current queue config.
> CapacitySchedulerQueueManager has incorrect list of queues
> --
>
> Key: YARN-9772
> URL: https://issues.apache.org/jira/browse/YARN-9772
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Manikandan R
> Assignee: Manikandan R
> Priority: Major
>
> CapacitySchedulerQueueManager has an incorrect list of queues when there is more than one parent queue (say at a middle level) with the same name.
> For example,
> * root
> ** a
> *** b
> **** c
> *** d
> **** b
> ** e
> {{CapacitySchedulerQueueManager#getQueues}} maintains this list of queues. While parsing "root.a.d.b", it overrides "root.a.b" with a new Queue object in the map because of the same name. After parsing all the queues, the map count should be 7, but it is 6. Any reference to queue "root.a.b" in the code path is actually the "root.a.d.b" object. Since {{CapacitySchedulerQueueManager#getQueues}} has been used in multiple places, we will need to understand the implications in detail. For example, {{CapacityScheduler#getQueue}} has been used in many places, which in turn uses {{CapacitySchedulerQueueManager#getQueues}}. cc [~eepayne], [~sunilg]
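The overwrite described above can be sketched as a map keyed by short queue name instead of full path. This is an illustrative standalone example, not the CapacityScheduler code itself; `QueueMapDemo` and its method are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keying a map by the short queue name silently
// drops one of two queues that share a name under different parents.
public class QueueMapDemo {
    static Map<String, String> byShortName(List<String> fullPaths) {
        Map<String, String> queues = new HashMap<>();
        for (String path : fullPaths) {
            String shortName = path.substring(path.lastIndexOf('.') + 1);
            // "root.a.d.b" overwrites the earlier "root.a.b" entry here.
            queues.put(shortName, path);
        }
        return queues;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
                "root", "root.a", "root.a.b", "root.a.b.c",
                "root.a.d", "root.a.d.b", "root.e");
        Map<String, String> queues = byShortName(paths);
        System.out.println(queues.size());   // 6, not 7: one "b" is lost
        System.out.println(queues.get("b")); // root.a.d.b (last writer wins)
    }
}
```

Keying by the full path ("root.a.b" vs "root.a.d.b") would keep all seven entries, which is why extending the duplicate-name check to parent queues is a compatibility question rather than a correctness one for existing configs.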
[jira] [Commented] (YARN-9794) RM crashes due to runtime errors in TimelineServiceV2Publisher
[ https://issues.apache.org/jira/browse/YARN-9794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930503#comment-16930503 ] Tarun Parimi commented on YARN-9794: Thanks [~abmodi], [~Prabhu Joseph] for the reviews and commit.
> RM crashes due to runtime errors in TimelineServiceV2Publisher
> --
>
> Key: YARN-9794
> URL: https://issues.apache.org/jira/browse/YARN-9794
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9794.001.patch, YARN-9794.002.patch
>
> Saw that the RM crashes during startup due to errors while putting an entity in TimelineServiceV2Publisher:
> {code:java}
> 2019-08-28 09:35:45,273 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> java.lang.RuntimeException: java.lang.IllegalArgumentException: org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException: CodedInputStream encountered an embedded string or message which claimed to have negative size.
> at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:200)
> at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:269)
> at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
> at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
> at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
> at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
> at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
> at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
> at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
> at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:321)
> at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:285)
> at org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.flush(TypedBufferedMutator.java:66)
> at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.flush(HBaseTimelineWriterImpl.java:566)
> at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.flushBufferedTimelineEntities(TimelineCollector.java:173)
> at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntities(TimelineCollector.java:150)
> at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:459)
> at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:73)
> at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:494)
> at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:483)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.IllegalArgumentException: org.apache.hbase.thirdparty.com.google.protobuf.InvalidProtocolBufferException: CodedInputStream encountered an embedded string or message which claimed to have negative size.
> at org.apache.hbase.thirdparty.com.google.protobuf.CodedInputStream.newInstance(CodedInputStream.java:117)
> {code}
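The general mitigation for this class of crash is to stop a RuntimeException thrown while handling one event from escaping to the dispatcher thread. The sketch below illustrates that pattern only; the `Handler` interface and `SafeHandlerDemo` names are hypothetical, not YARN's actual event API, and this is not necessarily what the attached patch does.

```java
// Hypothetical sketch: wrap an event handler so a RuntimeException from
// one bad entity is logged and dropped instead of killing the dispatcher.
public class SafeHandlerDemo {
    interface Handler<T> {
        void handle(T event);
    }

    static <T> Handler<T> guarded(Handler<T> delegate) {
        return event -> {
            try {
                delegate.handle(event);
            } catch (RuntimeException e) {
                // Log and continue; one malformed event should not be fatal.
                System.err.println("Dropped event after error: " + e);
            }
        };
    }

    public static void main(String[] args) {
        Handler<String> failing = ev -> {
            throw new IllegalArgumentException("negative size for " + ev);
        };
        guarded(failing).handle("entity-1"); // survives the bad event
        System.out.println("dispatcher still alive");
    }
}
```

The trade-off is that the failed timeline write is silently lost, so this kind of guard usually pairs with a log line (as above) so operators can still see the underlying HBase/protobuf error.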