[jira] [Created] (YARN-10417) Issues in Script Based Node Attribute Error Handling
Prabhu Joseph created YARN-10417: Summary: Issues in Script Based Node Attribute Error Handling Key: YARN-10417 URL: https://issues.apache.org/jira/browse/YARN-10417 Project: Hadoop YARN Issue Type: Bug Components: nodeattibute Reporter: Prabhu Joseph Assignee: Tanu Ajmera The following issues are seen in script-based node attribute error handling: 1. The expected format printed in the log shows a double colon *::*, but the correct format uses only a single colon. 2. The message *Execution of Node Labels script* is wrong; the failure comes from the node attributes script, not the node labels script. {code} 2020-08-31 09:09:34,649 WARN org.apache.hadoop.yarn.server.nodemanager.nodelabels.NodeDescriptorsScriptRunner: Execution of Node Labels script failed, Caught exception : Malformed output, expecting format NODE_ATTRIBUTE::ATTRIBUTE_NAME,ATTRIBUTE_TYPE,ATTRIBUTE_VALUE; but get HostGroup:STRING:compute {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
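For context, a node attributes script is expected to emit lines that start with the NODE_ATTRIBUTE prefix followed by a comma-separated attribute name, type and value; per issue 1 above, the prefix is followed by a single colon, not the double colon shown in the log message. The sample below is only an illustration with made-up attribute names and values.
{code}
# Hypothetical node attributes script output (attribute names and values are examples)
NODE_ATTRIBUTE:HostGroup,STRING,compute
NODE_ATTRIBUTE:Rack,STRING,rack1
{code}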
[jira] [Updated] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-1806: Fix Version/s: 3.4.0 > webUI update to allow end users to request thread dump > -- > > Key: YARN-1806 > URL: https://issues.apache.org/jira/browse/YARN-1806 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ming Ma >Assignee: Siddharth Ahuja >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-1806.001.patch > > > Both the individual container page and the containers page will support this. After > the end user clicks on the request link, they can follow it to the stdout page > for the thread dump content. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-1806) webUI update to allow end users to request thread dump
[ https://issues.apache.org/jira/browse/YARN-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185001#comment-17185001 ] Prabhu Joseph commented on YARN-1806: - This is very useful for debugging. Thanks [~sahuja] for the patch and [~akhilpb] for the review. Have committed the patch to trunk. > webUI update to allow end users to request thread dump > -- > > Key: YARN-1806 > URL: https://issues.apache.org/jira/browse/YARN-1806 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager >Reporter: Ming Ma >Assignee: Siddharth Ahuja >Priority: Major > Attachments: YARN-1806.001.patch > > > Both the individual container page and the containers page will support this. After > the end user clicks on the request link, they can follow it to the stdout page > for the thread dump content. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183261#comment-17183261 ] Prabhu Joseph commented on YARN-10352: -- Thanks [~bibinchundatt] for the review, can you commit the patch when you get time. Thanks. > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
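A minimal sketch of the staleness check described in the issue above (not the actual patch): a node is schedulable only if its last heartbeat is within a small multiple of the configured heartbeat interval, similar in spirit to CapacityScheduler#shouldSkipNodeSchedule. The 2x threshold and the class and method names are assumptions for illustration.
{code:java}
import org.apache.hadoop.util.Time;

// Sketch only: decide whether a node should be skipped because its last
// heartbeat is too old. The 2 * heartbeatInterval threshold is an assumption.
public class HeartbeatStalenessCheck {
  private final long nmHeartbeatIntervalMs;

  public HeartbeatStalenessCheck(long nmHeartbeatIntervalMs) {
    this.nmHeartbeatIntervalMs = nmHeartbeatIntervalMs;
  }

  /** Returns true if the node has not heartbeated recently and should be skipped. */
  public boolean shouldSkip(long lastHeartbeatMonotonicTime) {
    long elapsed = Time.monotonicNow() - lastHeartbeatMonotonicTime;
    return elapsed > 2 * nmHeartbeatIntervalMs;
  }
}
{code}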
[jira] [Commented] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183147#comment-17183147 ] Prabhu Joseph commented on YARN-10360: -- Thanks [~sunilg], have committed the [^YARN-10360-002.patch] to trunk. > Support Multi Node Placement in SingleConstraintAppPlacementAllocator > - > > Key: YARN-10360 > URL: https://issues.apache.org/jira/browse/YARN-10360 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, multi-node-placement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10360-001.patch, YARN-10360-002.patch > > > Currently, placement constraints are not supported when Multi Node Placement > is enabled. This Jira is to add Support for Multi Node Placement in > SingleConstraintAppPlacementAllocator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17182958#comment-17182958 ] Prabhu Joseph commented on YARN-10360: -- Thanks [~sunilg] for the review. Testcase failures are existing ones and tracked by YARN-9333 and YARN-9587. > Support Multi Node Placement in SingleConstraintAppPlacementAllocator > - > > Key: YARN-10360 > URL: https://issues.apache.org/jira/browse/YARN-10360 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, multi-node-placement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10360-001.patch, YARN-10360-002.patch > > > Currently, placement constraints are not supported when Multi Node Placement > is enabled. This Jira is to add Support for Multi Node Placement in > SingleConstraintAppPlacementAllocator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10389) Option to override RMWebServices with custom WebService class
[ https://issues.apache.org/jira/browse/YARN-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175554#comment-17175554 ] Prabhu Joseph commented on YARN-10389: -- Have committed [^YARN-10389-008.patch] to trunk. > Option to override RMWebServices with custom WebService class > - > > Key: YARN-10389 > URL: https://issues.apache.org/jira/browse/YARN-10389 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Tanu Ajmera >Priority: Major > Attachments: YARN-10389-001.patch, YARN-10389-002.patch, > YARN-10389-003.patch, YARN-10389-004.patch, YARN-10389-005.patch, > YARN-10389-006.patch, YARN-10389-007.patch, YARN-10389-008.patch > > > YARN-8047 provides support to add custom WebServices as part of RMWebApp. > Since each WebService has to have a separate WebService Path, /ws/v1/cluster > root path cannot be used globally. > Another alternative is to provide an option to override the RMWebServices > with custom WebServices implementation which can extend the RMWebService, > this way /ws/v1/cluster path can be used globally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10389) Option to override RMWebServices with custom WebService class
[ https://issues.apache.org/jira/browse/YARN-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175401#comment-17175401 ] Prabhu Joseph commented on YARN-10389: -- Thanks [~tanu.ajmera], the latest patch [^YARN-10389-008.patch] looks good. Will commit after jenkins result. Thanks [~BilwaST] and [~sunilg] for the review. > Option to override RMWebServices with custom WebService class > - > > Key: YARN-10389 > URL: https://issues.apache.org/jira/browse/YARN-10389 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Tanu Ajmera >Priority: Major > Attachments: YARN-10389-001.patch, YARN-10389-002.patch, > YARN-10389-003.patch, YARN-10389-004.patch, YARN-10389-005.patch, > YARN-10389-006.patch, YARN-10389-007.patch, YARN-10389-008.patch > > > YARN-8047 provides support to add custom WebServices as part of RMWebApp. > Since each WebService has to have a separate WebService Path, /ws/v1/cluster > root path cannot be used globally. > Another alternative is to provide an option to override the RMWebServices > with custom WebServices implementation which can extend the RMWebService, > this way /ws/v1/cluster path can be used globally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10389) Option to override RMWebServices with custom WebService class
[ https://issues.apache.org/jira/browse/YARN-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174941#comment-17174941 ] Prabhu Joseph edited comment on YARN-10389 at 8/10/20, 5:32 PM: [~tanu.ajmera] Thanks for the patch. The patch looks good. One minor issue which is not related to this patch. 1. The below null check is not required {code} bindExternalClasses(); if (rm != null) {code} rm gets accessed even before the null check from bindExternalClasses, so no use having the null check. {code} private void bindExternalClasses() { YarnConfiguration yarnConf = new YarnConfiguration(rm.getConfig()); {code} was (Author: prabhu joseph): [~tanu.ajmera] Thanks for the patch. The patch looks good. One minor issue which is not related to this patch. 1. The below null check is not required {code} bindExternalClasses(); if (rm != null) {code} rm gets accessed even before the null check from bindExternalClasses, so no use having the null check. {code} private void bindExternalClasses() { YarnConfiguration yarnConf = new YarnConfiguration(rm.getConfig()); {codE} > Option to override RMWebServices with custom WebService class > - > > Key: YARN-10389 > URL: https://issues.apache.org/jira/browse/YARN-10389 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Tanu Ajmera >Priority: Major > Attachments: YARN-10389-001.patch, YARN-10389-002.patch, > YARN-10389-003.patch, YARN-10389-004.patch, YARN-10389-005.patch, > YARN-10389-006.patch, YARN-10389-007.patch > > > YARN-8047 provides support to add custom WebServices as part of RMWebApp. > Since each WebService has to have a separate WebService Path, /ws/v1/cluster > root path cannot be used globally. > Another alternative is to provide an option to override the RMWebServices > with custom WebServices implementation which can extend the RMWebService, > this way /ws/v1/cluster path can be used globally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10389) Option to override RMWebServices with custom WebService class
[ https://issues.apache.org/jira/browse/YARN-10389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17174941#comment-17174941 ] Prabhu Joseph commented on YARN-10389: -- [~tanu.ajmera] Thanks for the patch. The patch looks good. One minor issue which is not related to this patch. 1. The below null check is not required {code} bindExternalClasses(); if (rm != null) {code} rm gets accessed even before the null check from bindExternalClasses, so no use having the null check. {code} private void bindExternalClasses() { YarnConfiguration yarnConf = new YarnConfiguration(rm.getConfig()); {codE} > Option to override RMWebServices with custom WebService class > - > > Key: YARN-10389 > URL: https://issues.apache.org/jira/browse/YARN-10389 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Tanu Ajmera >Priority: Major > Attachments: YARN-10389-001.patch, YARN-10389-002.patch, > YARN-10389-003.patch, YARN-10389-004.patch, YARN-10389-005.patch, > YARN-10389-006.patch, YARN-10389-007.patch > > > YARN-8047 provides support to add custom WebServices as part of RMWebApp. > Since each WebService has to have a separate WebService Path, /ws/v1/cluster > root path cannot be used globally. > Another alternative is to provide an option to override the RMWebServices > with custom WebServices implementation which can extend the RMWebService, > this way /ws/v1/cluster path can be used globally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
[ https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173691#comment-17173691 ] Prabhu Joseph commented on YARN-10364: -- Have pushed the patch to trunk. > Absolute Resource [memory=0] is considered as Percentage config type > > > Key: YARN-10364 > URL: https://issues.apache.org/jira/browse/YARN-10364 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10364.001.patch, YARN-10364.002.patch, > YARN-10364.003.patch > > > Absolute Resource [memory=0] is considered as Percentage config type. This > causes failure while converting queues from Percentage to Absolute Resources > automatically. > *Repro:* > 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% > 2. While converting above to absolute resource automatically, capacity of > queue A = [memory=], A.B = [memory=0] > This fails with below as A is considered as Absolute Resource whereas B is > considered as Percentage config type. > {code} > 2020-07-23 09:36:40,499 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Failed > to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should > use either percentage based capacityconfiguration or absolute resource > together for label: > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
[ https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173676#comment-17173676 ] Prabhu Joseph commented on YARN-10364: -- Thanks [~sunilg] for the review. Will commit the patch shortly. > Absolute Resource [memory=0] is considered as Percentage config type > > > Key: YARN-10364 > URL: https://issues.apache.org/jira/browse/YARN-10364 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10364.001.patch, YARN-10364.002.patch, > YARN-10364.003.patch > > > Absolute Resource [memory=0] is considered as Percentage config type. This > causes failure while converting queues from Percentage to Absolute Resources > automatically. > *Repro:* > 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% > 2. While converting above to absolute resource automatically, capacity of > queue A = [memory=], A.B = [memory=0] > This fails with below as A is considered as Absolute Resource whereas B is > considered as Percentage config type. > {code} > 2020-07-23 09:36:40,499 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Failed > to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should > use either percentage based capacityconfiguration or absolute resource > together for label: > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10389) Option to override RMWebServices with custom WebService class
Prabhu Joseph created YARN-10389: Summary: Option to override RMWebServices with custom WebService class Key: YARN-10389 URL: https://issues.apache.org/jira/browse/YARN-10389 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.4.0 Reporter: Prabhu Joseph Assignee: Tanu Ajmera YARN-8047 provides support to add custom WebServices as part of RMWebApp. Since each WebService has to have a separate WebService Path, /ws/v1/cluster root path cannot be used globally. Another alternative is to provide an option to override the RMWebServices with custom WebServices implementation which can extend the RMWebService, this way /ws/v1/cluster path can be used globally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
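As a rough sketch of the idea (not the committed change), a deployment could supply a class that extends RMWebServices so the standard /ws/v1/cluster root path keeps working while extra endpoints are added. The class name, the extra endpoint and the assumption that the parent constructor takes a ResourceManager and a Configuration are all illustrative; the RM would have to be told, via the configuration option this JIRA proposes, to bind this class in place of RMWebServices.
{code:java}
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.resourcemanager.ResourceManager;
import org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices;

// Hypothetical subclass: inherits every /ws/v1/cluster endpoint from
// RMWebServices and adds one extra resource method under the same root path.
@Path("/ws/v1/cluster")
public class CustomRMWebServices extends RMWebServices {

  public CustomRMWebServices(ResourceManager rm, Configuration conf) {
    super(rm, conf); // assumed parent constructor signature
  }

  @GET
  @Path("/custom-info")
  @Produces(MediaType.APPLICATION_JSON)
  public String getCustomInfo() {
    return "{\"status\":\"ok\"}"; // placeholder payload
  }
}
{code}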
[jira] [Commented] (YARN-10361) Make custom DAO classes configurable into RMWebApp#JAXBContextResolver
[ https://issues.apache.org/jira/browse/YARN-10361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172109#comment-17172109 ] Prabhu Joseph commented on YARN-10361: -- Have committed the [^YARN-10361.003.patch] to trunk. > Make custom DAO classes configurable into RMWebApp#JAXBContextResolver > -- > > Key: YARN-10361 > URL: https://issues.apache.org/jira/browse/YARN-10361 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10361.001.patch, YARN-10361.002.patch, > YARN-10361.003.patch > > > YARN-8047 provides support to add custom WebServices as part of RMWebApp. But > the custom DAO classes needs to be added into JAXBContextResolver. This Jira > is to configure the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10361) Make custom DAO classes configurable into RMWebApp#JAXBContextResolver
[ https://issues.apache.org/jira/browse/YARN-10361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172096#comment-17172096 ] Prabhu Joseph commented on YARN-10361: -- Thanks [~BilwaST] for the patch. +1, will commit it shortly. > Make custom DAO classes configurable into RMWebApp#JAXBContextResolver > -- > > Key: YARN-10361 > URL: https://issues.apache.org/jira/browse/YARN-10361 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10361.001.patch, YARN-10361.002.patch, > YARN-10361.003.patch > > > YARN-8047 provides support to add custom WebServices as part of RMWebApp. But > the custom DAO classes needs to be added into JAXBContextResolver. This Jira > is to configure the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
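The JAX-RS mechanism underneath is a ContextResolver whose JAXBContext is built over the DAO classes it should serialize. The hedged sketch below is a generic resolver, not the RMWebApp resolver itself; in the proposed change the custom DAO classes would come from a configuration knob rather than being passed in directly.
{code:java}
import java.util.HashSet;
import java.util.Set;

import javax.ws.rs.ext.ContextResolver;
import javax.ws.rs.ext.Provider;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;

// Illustrative only: a resolver whose JAXBContext covers both built-in and
// custom DAO classes.
@Provider
public class CustomJAXBContextResolver implements ContextResolver<JAXBContext> {
  private final JAXBContext context;
  private final Set<Class<?>> types = new HashSet<>();

  public CustomJAXBContextResolver(Class<?>... daoClasses) throws JAXBException {
    for (Class<?> c : daoClasses) {
      types.add(c);
    }
    context = JAXBContext.newInstance(daoClasses);
  }

  @Override
  public JAXBContext getContext(Class<?> type) {
    // Only answer for the DAO classes this resolver was built with.
    return types.contains(type) ? context : null;
  }
}
{code}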
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Attachment: YARN-10352-008.patch > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch, YARN-10352-007.patch, YARN-10352-008.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications
[ https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170989#comment-17170989 ] Prabhu Joseph commented on YARN-10377: -- Thanks [~tarunparimi], have committed the patch to trunk. > Clicking on queue in Capacity Scheduler legacy ui does not show any > applications > > > Key: YARN-10377 > URL: https://issues.apache.org/jira/browse/YARN-10377 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot > 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch > > > The issue is in the capacity scheduler > [http://rm-host:port/clustter/scheduler] page > If I click on the root queue, I am able to see the applications. > !Screenshot 2020-07-29 at 12.01.28 PM.png! > But the application disappears when I click on the leaf queue -> default. > This issue is not present in the older 2.7.0 versions and I am able to see > apps normally filtered by the leaf queue when clicking on it. > !Screenshot 2020-07-29 at 12.01.36 PM.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10381: - Fix Version/s: 3.4.0 > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10381.001.patch, YARN-10381.002.patch, > YARN-10381.003.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170985#comment-17170985 ] Prabhu Joseph commented on YARN-10381: -- Have committed the patch to trunk. > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch, YARN-10381.002.patch, > YARN-10381.003.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170978#comment-17170978 ] Prabhu Joseph commented on YARN-10381: -- Thanks [~sahuja] for the patch and [~BilwaST] for the review. The patch looks good, +1. Will commit it shortly. > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch, YARN-10381.002.patch, > YARN-10381.003.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Attachment: YARN-10352-007.patch > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch, YARN-10352-007.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170676#comment-17170676 ] Prabhu Joseph commented on YARN-10352: -- Thanks [~bibinchundatt] for reviewing. bq. The custom iterator how much improvement we have against the Iterators.filter ? I have used a custom iterator mainly to avoid an unnecessary Null Check required by FindBugs on using Iterators.filter with predicate in [^YARN-10352-002.patch] - [Build Run|https://issues.apache.org/jira/browse/YARN-10352?focusedCommentId=17161295=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17161295] {code} node must be non-null but is marked as nullable At MultiNodeSortingManager.java:is marked as nullable At MultiNodeSortingManager.java:[lines 124-125] {code} I was using a Predicate like the one below {code} private Predicate<N> heartbeatFilter = new Predicate<N>() { +@Override +public boolean apply(final N node) { + long timeElapsedFromLastHeartbeat = + Time.monotonicNow() - node.getLastHeartbeatMonotonicTime(); + return timeElapsedFromLastHeartbeat <= (nmHeartbeatInterval * 2); +} + }; {code} Let me know if this is fine, or if the findbugs issue can be ignored. Will fix the other two comments. Thanks. > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
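For illustration, a filtering iterator along the lines discussed above can skip stale nodes without a Guava Predicate, and therefore without the nullability annotation that triggered the FindBugs warning. This is a hedged sketch, not the committed code; the 2x threshold and the way the last-heartbeat time is obtained are assumptions.
{code:java}
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.ToLongFunction;

import org.apache.hadoop.util.Time;

// Sketch of a custom iterator that skips nodes whose last heartbeat is older
// than twice the NM heartbeat interval.
public class FreshNodeIterator<N> implements Iterator<N> {
  private final Iterator<N> delegate;
  private final long nmHeartbeatIntervalMs;
  private final ToLongFunction<N> lastHeartbeatMonotonicTime;
  private N next;

  public FreshNodeIterator(Iterator<N> delegate, long nmHeartbeatIntervalMs,
      ToLongFunction<N> lastHeartbeatMonotonicTime) {
    this.delegate = delegate;
    this.nmHeartbeatIntervalMs = nmHeartbeatIntervalMs;
    this.lastHeartbeatMonotonicTime = lastHeartbeatMonotonicTime;
    advance();
  }

  // Move to the next node that has heartbeated recently enough.
  private void advance() {
    next = null;
    while (delegate.hasNext()) {
      N candidate = delegate.next();
      long elapsed = Time.monotonicNow()
          - lastHeartbeatMonotonicTime.applyAsLong(candidate);
      if (elapsed <= 2 * nmHeartbeatIntervalMs) {
        next = candidate;
        return;
      }
    }
  }

  @Override
  public boolean hasNext() {
    return next != null;
  }

  @Override
  public N next() {
    if (next == null) {
      throw new NoSuchElementException();
    }
    N current = next;
    advance();
    return current;
  }
}
{code}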
[jira] [Commented] (YARN-10377) Clicking on queue in Capacity Scheduler legacy ui does not show any applications
[ https://issues.apache.org/jira/browse/YARN-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169987#comment-17169987 ] Prabhu Joseph commented on YARN-10377: -- [~tarunparimi] Thanks for the patch. The patch looks good. Will commit it tomorrow if no other comments. > Clicking on queue in Capacity Scheduler legacy ui does not show any > applications > > > Key: YARN-10377 > URL: https://issues.apache.org/jira/browse/YARN-10377 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: Screenshot 2020-07-29 at 12.01.28 PM.png, Screenshot > 2020-07-29 at 12.01.36 PM.png, YARN-10377.001.patch > > > The issue is in the capacity scheduler > [http://rm-host:port/clustter/scheduler] page > If I click on the root queue, I am able to see the applications. > !Screenshot 2020-07-29 at 12.01.28 PM.png! > But the application disappears when I click on the leaf queue -> default. > This issue is not present in the older 2.7.0 versions and I am able to see > apps normally filtered by the leaf queue when clicking on it. > !Screenshot 2020-07-29 at 12.01.36 PM.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
[ https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169986#comment-17169986 ] Prabhu Joseph commented on YARN-10364: -- [~BilwaST] Latest patch looks good to me. [~sunilg] Can you review the latest patch when you get time. Thanks. > Absolute Resource [memory=0] is considered as Percentage config type > > > Key: YARN-10364 > URL: https://issues.apache.org/jira/browse/YARN-10364 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10364.001.patch, YARN-10364.002.patch, > YARN-10364.003.patch > > > Absolute Resource [memory=0] is considered as Percentage config type. This > causes failure while converting queues from Percentage to Absolute Resources > automatically. > *Repro:* > 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% > 2. While converting above to absolute resource automatically, capacity of > queue A = [memory=], A.B = [memory=0] > This fails with below as A is considered as Absolute Resource whereas B is > considered as Percentage config type. > {code} > 2020-07-23 09:36:40,499 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Failed > to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should > use either percentage based capacityconfiguration or absolute resource > together for label: > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
[ https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169867#comment-17169867 ] Prabhu Joseph commented on YARN-10364: -- The change looks good, thanks. > Absolute Resource [memory=0] is considered as Percentage config type > > > Key: YARN-10364 > URL: https://issues.apache.org/jira/browse/YARN-10364 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10364.001.patch, YARN-10364.002.patch, > YARN-10364.003.patch > > > Absolute Resource [memory=0] is considered as Percentage config type. This > causes failure while converting queues from Percentage to Absolute Resources > automatically. > *Repro:* > 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% > 2. While converting above to absolute resource automatically, capacity of > queue A = [memory=], A.B = [memory=0] > This fails with below as A is considered as Absolute Resource whereas B is > considered as Percentage config type. > {code} > 2020-07-23 09:36:40,499 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Failed > to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should > use either percentage based capacityconfiguration or absolute resource > together for label: > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
[ https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169812#comment-17169812 ] Prabhu Joseph commented on YARN-10364: -- Thanks [~BilwaST] for the patch. Have a comment in below section 1. AbstractCSQueue#validateAbsoluteVsPercentageCapacityConfig will always succeed as it compares localType copied from this.capacityConfigType with the same this.capacityConfigType. It has to compare the type (this.capacityConfigType) of previous node label with the new one derived using minResource of next node label. Like below. {code} private void validateAbsoluteVsPercentageCapacityConfig( String queuePath, String label) { CapacityConfigType localType = checkConfigTypeIsAbsoluteResource(queuePath, label) ? CapacityConfigType.ABSOLUTE_RESOURCE : CapacityConfigType.PERCENTAGE; if (!queuePath.equals("root") && !this.capacityConfigType.equals(localType)) { throw new IllegalArgumentException("Queue '" + getQueuePath() + "' should use either percentage based capacity" + " configuration or absolute resource."); } } {code} > Absolute Resource [memory=0] is considered as Percentage config type > > > Key: YARN-10364 > URL: https://issues.apache.org/jira/browse/YARN-10364 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10364.001.patch, YARN-10364.002.patch > > > Absolute Resource [memory=0] is considered as Percentage config type. This > causes failure while converting queues from Percentage to Absolute Resources > automatically. > *Repro:* > 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% > 2. While converting above to absolute resource automatically, capacity of > queue A = [memory=], A.B = [memory=0] > This fails with below as A is considered as Absolute Resource whereas B is > considered as Percentage config type. > {code} > 2020-07-23 09:36:40,499 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Failed > to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should > use either percentage based capacityconfiguration or absolute resource > together for label: > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10381) Send out application attempt state along with other elements in the application attempt object returned from appattempts REST API call
[ https://issues.apache.org/jira/browse/YARN-10381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17169602#comment-17169602 ] Prabhu Joseph commented on YARN-10381: -- Thanks [~sahuja] for the patch. Can you update the doc - [ApplicationAttempts RM REST API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] - Elements of the appAttempt object, JSON and XML response section. > Send out application attempt state along with other elements in the > application attempt object returned from appattempts REST API call > -- > > Key: YARN-10381 > URL: https://issues.apache.org/jira/browse/YARN-10381 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Affects Versions: 3.3.0 >Reporter: Siddharth Ahuja >Assignee: Siddharth Ahuja >Priority: Minor > Attachments: YARN-10381.001.patch, YARN-10381.002.patch > > > The [ApplicationAttempts RM REST > API|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Attempts_API] > : > {code} > http://rm-http-address:port/ws/v1/cluster/apps/{appid}/appattempts > {code} > returns a collection of Application Attempt objects, where each application > attempt object contains elements like id, nodeId, startTime etc. > This JIRA has been raised to send out Application Attempt state as well as > part of the application attempt information from this REST API call. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
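For context, an appAttempt object in the appattempts response carries fields such as id, nodeId and startTime; the sketch below shows what a JSON response could look like once an attempt-state element is added. The element name appAttemptState and all sample values are assumptions for illustration, not the final patch.
{code}
{
  "appAttempts": {
    "appAttempt": [
      {
        "id": 1,
        "startTime": 1596457200000,
        "containerId": "container_1596457200000_0001_01_000001",
        "nodeId": "worker1.example.com:45454",
        "logsLink": "http://worker1.example.com:8042/node/containerlogs/container_1596457200000_0001_01_000001/user1",
        "appAttemptState": "RUNNING"
      }
    ]
  }
}
{code}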
[jira] [Updated] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10380: - Parent: YARN-5139 Issue Type: Sub-task (was: Improvement) > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Wangda Tan >Priority: Critical > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic of async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is it the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In above logic, if we have thousands of node in one > partition, we will repeatly access all nodes of the partition thousands of > times. > I would suggest looking at making entry-point for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) to be > different. > Node-heartbeat and async-scheduling (single node) can be still similar and > share most of the code. > async-scheduling (multi-node): should iterate partition first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10380) Import logic of multi-node allocation in CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-10380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168893#comment-17168893 ] Prabhu Joseph commented on YARN-10380: -- [~wangda] Below are the other issues 1. YARN-10357 - Proactively relocate allocated containers from a stopped node 2. Handling difference in CandidateSet v.s. Multi-node sorter - https://issues.apache.org/jira/browse/YARN-10352?focusedCommentId=17161696=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17161696 3. NM does not unregister to RM when recovery is enabled. This causes RM to unnecessarily allocate on those nodes and later need to relocate (YARN-10357) if nodes has not heartbeated. Relying on heartbeats won't be accurate if there are some network delays, instead NM can unregister with a special flag set. > Import logic of multi-node allocation in CapacityScheduler > -- > > Key: YARN-10380 > URL: https://issues.apache.org/jira/browse/YARN-10380 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wangda Tan >Priority: Critical > > *1) Entry point:* > When we do multi-node allocation, we're using the same logic of async > scheduling: > {code:java} > // Allocate containers of node [start, end) > for (FiCaSchedulerNode node : nodes) { > if (current++ >= start) { > if (shouldSkipNodeSchedule(node, cs, printSkipedNodeLogging)) { > continue; > } > cs.allocateContainersToNode(node.getNodeID(), false); > } > } {code} > Is it the most effective way to do multi-node scheduling? Should we allocate > based on partitions? In above logic, if we have thousands of node in one > partition, we will repeatly access all nodes of the partition thousands of > times. > I would suggest looking at making entry-point for node-heartbeat, > async-scheduling (single node), and async-scheduling (multi-node) to be > different. > Node-heartbeat and async-scheduling (single node) can be still similar and > share most of the code. > async-scheduling (multi-node): should iterate partition first, using pseudo > code like: > {code:java} > for (partition : all partitions) { > allocateContainersOnMultiNodes(getCandidate(partition)) > } {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165524#comment-17165524 ] Prabhu Joseph commented on YARN-10360: -- [~tangzhankun] Can you review this Jira when you get time. This adds Multi Node Iterator inside SingleConstraintAppPlacementAllocator to serve SchedulingRequest when Multi Node Placement Enabled. Thanks. > Support Multi Node Placement in SingleConstraintAppPlacementAllocator > - > > Key: YARN-10360 > URL: https://issues.apache.org/jira/browse/YARN-10360 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, multi-node-placement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10360-001.patch, YARN-10360-002.patch > > > Currently, placement constraints are not supported when Multi Node Placement > is enabled. This Jira is to add Support for Multi Node Placement in > SingleConstraintAppPlacementAllocator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
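For background, this allocator handles SchedulingRequests that carry placement constraints. The hedged example below builds a simple anti-affinity request with the public PlacementConstraints API; the tag name and resource sizing are made up for illustration.
{code:java}
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceSizing;
import org.apache.hadoop.yarn.api.records.SchedulingRequest;
import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints;

// Illustrative anti-affinity request: place containers tagged "app-tag" on
// nodes that do not already host a container with the same tag.
public class SchedulingRequestExample {
  public static SchedulingRequest buildRequest() {
    PlacementConstraint antiAffinity = PlacementConstraints.build(
        PlacementConstraints.targetNotIn(PlacementConstraints.NODE,
            PlacementConstraints.PlacementTargets.allocationTag("app-tag")));
    return SchedulingRequest.newBuilder()
        .allocationRequestId(1L)
        .priority(Priority.newInstance(1))
        .allocationTags(Collections.singleton("app-tag"))
        .placementConstraintExpression(antiAffinity)
        .resourceSizing(
            ResourceSizing.newInstance(2, Resource.newInstance(1024, 1)))
        .build();
  }
}
{code}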
[jira] [Updated] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10360: - Attachment: YARN-10360-002.patch > Support Multi Node Placement in SingleConstraintAppPlacementAllocator > - > > Key: YARN-10360 > URL: https://issues.apache.org/jira/browse/YARN-10360 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, multi-node-placement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10360-001.patch, YARN-10360-002.patch > > > Currently, placement constraints are not supported when Multi Node Placement > is enabled. This Jira is to add Support for Multi Node Placement in > SingleConstraintAppPlacementAllocator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
[ https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165502#comment-17165502 ] Prabhu Joseph commented on YARN-10364: -- [~BilwaST] Yes right. One more place has similar check in AbstractCSQueue#validateAbsoluteVsPercentageCapacityConfig {code} if (!minResource.equals(Resources.none())) { localType = CapacityConfigType.ABSOLUTE_RESOURCE; } {code} Any config which matches RESOURCE_PATTERN like [memory=0] or [vcores=0] has to be considered as CapacityConfigType.ABSOLUTE_RESOURCE. Need thorough testing to make sure that considering configs [memory=0] or [vcores=0] as CapacityConfigType.ABSOLUTE_RESOURCE does not cause any failures for a ParentQueue, LeafQueue, ManagedParentQueue and AutoCreatedLeafQueue. > Absolute Resource [memory=0] is considered as Percentage config type > > > Key: YARN-10364 > URL: https://issues.apache.org/jira/browse/YARN-10364 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Bilwa S T >Priority: Major > > Absolute Resource [memory=0] is considered as Percentage config type. This > causes failure while converting queues from Percentage to Absolute Resources > automatically. > *Repro:* > 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% > 2. While converting above to absolute resource automatically, capacity of > queue A = [memory=], A.B = [memory=0] > This fails with below as A is considered as Absolute Resource whereas B is > considered as Percentage config type. > {code} > 2020-07-23 09:36:40,499 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Failed > to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should > use either percentage based capacityconfiguration or absolute resource > together for label: > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10366) Yarn rmadmin help message shows two labels for one node for --replaceLabelsOnNode
[ https://issues.apache.org/jira/browse/YARN-10366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165478#comment-17165478 ] Prabhu Joseph commented on YARN-10366: -- Have committed to trunk. Will resolve the Jira. > Yarn rmadmin help message shows two labels for one node for > --replaceLabelsOnNode > - > > Key: YARN-10366 > URL: https://issues.apache.org/jira/browse/YARN-10366 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Tanu Ajmera >Assignee: Tanu Ajmera >Priority: Major > Attachments: Screenshot 2020-07-24 at 4.07.10 PM.png, > YARN-10366-001.patch > > > In the help message of “yarn rmadmin”, it looks like one node can be assigned > two labels, which is not consistent with the rule “Each node can have only > one node label”. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10366) Yarn rmadmin help message shows two labels for one node for --replaceLabelsOnNode
[ https://issues.apache.org/jira/browse/YARN-10366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17165465#comment-17165465 ] Prabhu Joseph commented on YARN-10366: -- Thanks [~tanu.ajmera] for the patch. +1. Will commit it shortly. > Yarn rmadmin help message shows two labels for one node for > --replaceLabelsOnNode > - > > Key: YARN-10366 > URL: https://issues.apache.org/jira/browse/YARN-10366 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Tanu Ajmera >Assignee: Tanu Ajmera >Priority: Major > Attachments: Screenshot 2020-07-24 at 4.07.10 PM.png, > YARN-10366-001.patch > > > In the help message of “yarn rmadmin”, it looks like one node can be assigned > two labels, which is not consistent with the rule “Each node can have only > one node label”. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
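For reference, the corrected help text implies a single label per node; a typical invocation looks like this (node and label names are examples):
{code}
yarn rmadmin -replaceLabelsOnNode "node1.example.com=gpu node2.example.com=gpu"
{code}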
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17164177#comment-17164177 ] Prabhu Joseph commented on YARN-10319: -- Thanks [~Tao Yang] and [~adam.antal] for the review. Have committed the [^YARN-10319-006.patch] to trunk. Will resolve the Jira. > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10319: - Fix Version/s: 3.4.0 > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Fix For: 3.4.0 > > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163411#comment-17163411 ] Prabhu Joseph commented on YARN-10319: -- [~adam.antal] Failed testcase is not related, let me know if the latest patch [^YARN-10319-006.patch] is fine. Thanks. > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
Prabhu Joseph created YARN-10364: Summary: Absolute Resource [memory=0] is considered as Percentage config type Key: YARN-10364 URL: https://issues.apache.org/jira/browse/YARN-10364 Project: Hadoop YARN Issue Type: Bug Reporter: Prabhu Joseph Assignee: Prabhu Joseph Absolute Resource [memory=0] is considered as Percentage config type. This causes failure while converting queues from Percentage to Absolute Resources automatically. *Repro:* 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% 2. While converting above to absolute resource automatically, capacity of queue A = [memory=], A.B = [memory=0] This fails with below as A is considered as Absolute Resource whereas B is considered as Percentage config type. {code} 2020-07-23 09:36:40,499 WARN org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: CapacityScheduler configuration validation failed:java.io.IOException: Failed to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should use either percentage based capacityconfiguration or absolute resource together for label: {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
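For context, a minimal sketch of the two capacity configuration styles involved, using the queue names from the repro above (the memory/vcore values are illustrative, not from the original report):
{code}
# Percentage based capacities
yarn.scheduler.capacity.root.A.capacity=100
yarn.scheduler.capacity.root.A.B.capacity=0
yarn.scheduler.capacity.root.A.C.capacity=100

# Absolute resource based capacities - [memory=0] should still be treated
# as an absolute resource rather than falling back to the percentage type
yarn.scheduler.capacity.root.A.capacity=[memory=24576,vcores=24]
yarn.scheduler.capacity.root.A.B.capacity=[memory=0]
yarn.scheduler.capacity.root.A.C.capacity=[memory=24576,vcores=24]
{code}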
[jira] [Updated] (YARN-10364) Absolute Resource [memory=0] is considered as Percentage config type
[ https://issues.apache.org/jira/browse/YARN-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10364: - Affects Version/s: 3.4.0 > Absolute Resource [memory=0] is considered as Percentage config type > > > Key: YARN-10364 > URL: https://issues.apache.org/jira/browse/YARN-10364 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > Absolute Resource [memory=0] is considered as Percentage config type. This > causes failure while converting queues from Percentage to Absolute Resources > automatically. > *Repro:* > 1. Queue A = 100% and child queues Queue A.B = 0%, A.C=100% > 2. While converting above to absolute resource automatically, capacity of > queue A = [memory=], A.B = [memory=0] > This fails with below as A is considered as Absolute Resource whereas B is > considered as Percentage config type. > {code} > 2020-07-23 09:36:40,499 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Failed > to re-init queues : Parent queue 'root.A' and child queue 'root.A.B' should > use either percentage based capacityconfiguration or absolute resource > together for label: > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17163335#comment-17163335 ] Prabhu Joseph commented on YARN-10352: -- [~wangda] The failing testcase is not related to this fix. Let me know if the latest patch [^YARN-10352-006.patch] is fine, I will commit it if there are no comments. Thanks. > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Attachment: YARN-10352-006.patch > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch, > YARN-10352-006.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10361) Make custom DAO classes configurable into RMWebApp#JAXBContextResolver
Prabhu Joseph created YARN-10361: Summary: Make custom DAO classes configurable into RMWebApp#JAXBContextResolver Key: YARN-10361 URL: https://issues.apache.org/jira/browse/YARN-10361 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 3.4.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph YARN-8047 provides support to add custom WebServices as part of RMWebApp. But the custom DAO classes needs to be added into JAXBContextResolver. This Jira is to configure the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
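A rough JAX-RS sketch of the kind of resolver the Jira refers to; the class and DAO names below are illustrative placeholders, not the actual RMWebApp#JAXBContextResolver code, and the hard-coded DAO list is exactly what the proposal would replace with configuration:
{code}
import javax.ws.rs.ext.ContextResolver;
import javax.ws.rs.ext.Provider;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlRootElement;

@Provider
public class CustomJAXBContextResolver implements ContextResolver<JAXBContext> {

  // Example DAO class; in the proposal this list would be read from
  // configuration instead of being compiled in.
  @XmlRootElement(name = "customInfo")
  public static class CustomInfoDao {
    public String value;
  }

  private final Class[] daoClasses = new Class[] {CustomInfoDao.class};
  private final JAXBContext context;

  public CustomJAXBContextResolver() throws Exception {
    context = JAXBContext.newInstance(daoClasses);
  }

  @Override
  public JAXBContext getContext(Class<?> type) {
    // Only answer for DAO classes this resolver knows about.
    for (Class dao : daoClasses) {
      if (dao.equals(type)) {
        return context;
      }
    }
    return null;
  }
}
{code}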
[jira] [Updated] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator
[ https://issues.apache.org/jira/browse/YARN-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10360: - Attachment: YARN-10360-001.patch > Support Multi Node Placement in SingleConstraintAppPlacementAllocator > - > > Key: YARN-10360 > URL: https://issues.apache.org/jira/browse/YARN-10360 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, multi-node-placement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10360-001.patch > > > Currently, placement constraints are not supported when Multi Node Placement > is enabled. This Jira is to add Support for Multi Node Placement in > SingleConstraintAppPlacementAllocator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10360) Support Multi Node Placement in SingleConstraintAppPlacementAllocator
Prabhu Joseph created YARN-10360: Summary: Support Multi Node Placement in SingleConstraintAppPlacementAllocator Key: YARN-10360 URL: https://issues.apache.org/jira/browse/YARN-10360 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, multi-node-placement Affects Versions: 3.4.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph Currently, placement constraints are not supported when Multi Node Placement is enabled. This Jira is to add Support for Multi Node Placement in SingleConstraintAppPlacementAllocator. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10293) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement (YARN-10259)
[ https://issues.apache.org/jira/browse/YARN-10293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10293: - Parent: YARN-5139 Issue Type: Sub-task (was: Bug) > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement (YARN-10259) > > > Key: YARN-10293 > URL: https://issues.apache.org/jira/browse/YARN-10293 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10293-001.patch, YARN-10293-002.patch, > YARN-10293-003-WIP.patch, YARN-10293-004.patch, YARN-10293-005.patch > > > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement. YARN-10259 has fixed two issues > related to it > https://issues.apache.org/jira/browse/YARN-10259?focusedCommentId=17105987=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17105987 > Have found one more bug in the CapacityScheduler.java code which causes the > same issue with slight difference in the repro. > *Repro:* > *Nodes : Available : Used* > Node1 - 8GB, 8vcores - 8GB. 8cores > Node2 - 8GB, 8vcores - 8GB. 8cores > Node3 - 8GB, 8vcores - 8GB. 8cores > Queues -> A and B both 50% capacity, 100% max capacity > MultiNode enabled + Preemption enabled > 1. JobA submitted to A queue and which used full cluster 24GB and 24 vcores > 2. JobB Submitted to B queue with AM size of 1GB > {code} > 2020-05-21 12:12:27,313 INFO > org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=systest > IP=172.27.160.139 OPERATION=Submit Application Request > TARGET=ClientRMService RESULT=SUCCESS APPID=application_1590046667304_0005 > CALLERCONTEXT=CLI QUEUENAME=dummy > {code} > 3. Preemption happens and used capacity is lesser than 1.0f > {code} > 2020-05-21 12:12:48,222 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptMetrics: > Non-AM container preempted, current > appAttemptId=appattempt_1590046667304_0004_01, > containerId=container_e09_1590046667304_0004_01_24, > resource= > {code} > 4. JobB gets a Reserved Container as part of > CapacityScheduler#allocateOrReserveNewContainer > {code} > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: > container_e09_1590046667304_0005_01_01 Container Transitioned from NEW to > RESERVED > 2020-05-21 12:12:48,226 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > {code} > *Why RegularContainerAllocator reserved the container when the used capacity > is <= 1.0f ?* > {code} > The reason is even though the container is preempted - nodemanager has to > stop the container and heartbeat and update the available and unallocated > resources to ResourceManager. > {code} > 5. Now, no new allocation happens and reserved container stays at reserved. > After reservation the used capacity becomes 1.0f, below will be in a loop and > no new allocate or reserve happens. The reserved container cannot be > allocated as reserved node does not have space. node2 has space for 1GB, > 1vcore but CapacityScheduler#allocateOrReserveNewContainers not getting > called causing the Hang. 
> *[INFINITE LOOP] CapacityScheduler#allocateContainersOnMultiNodes -> > CapacityScheduler#allocateFromReservedContainer -> Re-reserve the container > on node* > {code} > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Trying to fulfill reservation for application application_1590046667304_0005 > on node: tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: > assignContainers: partition= #applications=1 > 2020-05-21 12:13:33,242 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp: > Reserved container=container_e09_1590046667304_0005_01_01, on node=host: > tajmera-fullnodes-3.tajmera-fullnodes.root.hwx.site:8041 #containers=8 > available= used= with > resource= > 2020-05-21 12:13:33,243 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Allocation proposal accepted > {code} > CapacityScheduler#allocateOrReserveNewContainers won't be
[jira] [Updated] (YARN-10259) Reserved Containers not allocated from available space of other nodes in CandidateNodeSet in MultiNodePlacement
[ https://issues.apache.org/jira/browse/YARN-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10259: - Parent: YARN-5139 Issue Type: Sub-task (was: Bug) > Reserved Containers not allocated from available space of other nodes in > CandidateNodeSet in MultiNodePlacement > --- > > Key: YARN-10259 > URL: https://issues.apache.org/jira/browse/YARN-10259 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler >Affects Versions: 3.2.0, 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Fix For: 3.4.0, 3.3.1 > > Attachments: YARN-10259-001.patch, YARN-10259-002.patch, > YARN-10259-003.patch > > > Reserved Containers are not allocated from the available space of other nodes > in CandidateNodeSet in MultiNodePlacement. > *Repro:* > 1. MultiNode Placement Enabled. > 2. Two nodes h1 and h2 with 8GB > 3. Submit app1 AM (5GB) which gets placed in h1 and app2 AM (5GB) which gets > placed in h2. > 4. Submit app3 AM which is reserved in h1 > 5. Kill app2 which frees space in h2. > 6. app3 AM never gets ALLOCATED > RM logs shows YARN-8127 fix rejecting the allocation proposal for app3 AM on > h2 as it expects the assignment to be on same node where reservation has > happened. > {code} > 2020-05-05 18:49:37,264 DEBUG [AsyncDispatcher event handler] > scheduler.SchedulerApplicationAttempt > (SchedulerApplicationAttempt.java:commonReserve(573)) - Application attempt > appattempt_1588684773609_0003_01 reserved container > container_1588684773609_0003_01_01 on node host: h1:1234 #containers=1 > available= used=. This attempt > currently has 1 reserved containers at priority 0; currentReservation > > 2020-05-05 18:49:37,264 INFO [AsyncDispatcher event handler] > fica.FiCaSchedulerApp (FiCaSchedulerApp.java:apply(670)) - Reserved > container=container_1588684773609_0003_01_01, on node=host: h1:1234 > #containers=1 available= used= > with resource= >RESERVED=[(Application=appattempt_1588684773609_0003_01; > Node=h1:1234; Resource=)] > > 2020-05-05 18:49:38,283 DEBUG [Time-limited test] > allocator.RegularContainerAllocator > (RegularContainerAllocator.java:assignContainer(514)) - assignContainers: > node=h2 application=application_1588684773609_0003 priority=0 > pendingAsk=,repeat=1> > type=OFF_SWITCH > 2020-05-05 18:49:38,285 DEBUG [Time-limited test] fica.FiCaSchedulerApp > (FiCaSchedulerApp.java:commonCheckContainerAllocation(371)) - Try to allocate > from reserved container container_1588684773609_0003_01_01, but node is > not reserved >ALLOCATED=[(Application=appattempt_1588684773609_0003_01; > Node=h2:1234; Resource=)] > {code} > Attached testcase which reproduces the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10357) Proactively relocate allocated containers from a stopped node
[ https://issues.apache.org/jira/browse/YARN-10357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10357: - Parent: YARN-5139 Issue Type: Sub-task (was: Improvement) > Proactively relocate allocated containers from a stopped node > - > > Key: YARN-10357 > URL: https://issues.apache.org/jira/browse/YARN-10357 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, multi-node-placement >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > In a cloud environment, nodes can be commissioned frequently; if we always > wait for the 10 min timeout, it may not be good. It's better to improve the > logic by preempting containers newly allocated (but not acquired) on a NM which > stopped heartbeating. With this, we can proactively relocate containers to > different nodes before the 10 min timeout. > cc [~leftnoteasy] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Parent: YARN-5139 Issue Type: Sub-task (was: Bug) > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162214#comment-17162214 ] Prabhu Joseph commented on YARN-10319: -- Thanks [~adam.antal] for the review. Have addressed them in patch [^YARN-10319-006.patch] . > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10319: - Attachment: YARN-10319-006.patch > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch, YARN-10319-006.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Attachment: YARN-10352-005.patch > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162199#comment-17162199 ] Prabhu Joseph commented on YARN-10352: -- Thanks [~wangda]. Have fixed checkstyle issues and failed testcase in [^YARN-10352-005.patch] > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch, YARN-10352-005.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161931#comment-17161931 ] Prabhu Joseph commented on YARN-10352: -- Thanks [~wangda] for the review comments. Have addressed them in [^YARN-10352-004.patch]. bq. With this, we can proactively relocate containers to different nodes before the 10 mins timeout. Yes, right. Have reported YARN-10357 to track this. Currently NM does not unregister from RM when Node Recovery is Enabled so that it won't affect the existing running containers. Instead, I think it can send unRegisterNM with a boolean flag set, which RM can use to stop scheduling and to preempt allocated (but not acquired) containers without disturbing running containers on that node. RM will also have the right cluster available resources without considering the stopped nodes. NodeStatusUpdaterImpl#serviceStop {code} if (this.registeredWithRM && !this.isStopped && !isNMUnderSupervisionWithRecoveryEnabled() && !context.getDecommissioned() && !failedToConnect) { unRegisterNM(); } {code} > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10357) Proactively relocate allocated containers from a stopped node
Prabhu Joseph created YARN-10357: Summary: Proactively relocate allocated containers from a stopped node Key: YARN-10357 URL: https://issues.apache.org/jira/browse/YARN-10357 Project: Hadoop YARN Issue Type: Improvement Components: multi-node-placement, capacityscheduler Affects Versions: 3.4.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph In a cloud environment, nodes can be commissioned frequently; if we always wait for the 10 min timeout, it may not be good. It's better to improve the logic by preempting containers newly allocated (but not acquired) on a NM which stopped heartbeating. With this, we can proactively relocate containers to different nodes before the 10 min timeout. cc [~leftnoteasy] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Attachment: YARN-10352-004.patch > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch, YARN-10352-004.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161682#comment-17161682 ] Prabhu Joseph commented on YARN-10352: -- [~wangda] For each node in the list given by {{CapacityScheduler#getNodesHeartbeated}}, only allocation of reserved containers from that node happens. Allocate or Reserve new containers uses the multiple node candidates prepared by {{MultiNodeSorter#reSortClusterNodes}} (below code snippet) which passes the list to the configured {{MultiNodeLookupPolicy}} to perform sorting in background at every configured sorting interval. {{MultiNodeSortingManager}} filters that list while returning to {{RegularContainerAllocator#allocate}} call. {code:java} Map nodesByPartition = new HashMap<>(); List nodes = ((AbstractYarnScheduler) rmContext .getScheduler()).getNodeTracker().getNodesPerPartition(label); if (nodes != null) { nodes.forEach(n -> nodesByPartition.put(n.getNodeID(), n)); multiNodePolicy.addAndRefreshNodesSet( (Collection) nodesByPartition.values(), label); } {code} > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
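To illustrate the kind of heartbeat-staleness filtering being discussed, here is a self-contained sketch with illustrative names; it is not the actual YARN-10352 patch, only the shape of the check:
{code}
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StaleNodeFilter {

  /**
   * Drop candidate nodes whose last heartbeat is older than a small multiple
   * of the configured heartbeat interval, mirroring the idea behind
   * CapacityScheduler#shouldSkipNodeSchedule for async scheduling threads.
   */
  public static List<String> filterStaleNodes(
      Map<String, Long> lastHeartbeatByNode, long nowMs,
      long heartbeatIntervalMs) {
    return lastHeartbeatByNode.entrySet().stream()
        .filter(e -> nowMs - e.getValue() <= 2 * heartbeatIntervalMs)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}
{code}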
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Attachment: YARN-10352-003.patch > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch, > YARN-10352-003.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161163#comment-17161163 ] Prabhu Joseph commented on YARN-10352: -- Thanks [~bibinchundatt] for the inputs. 1. Removed iterating the nodes from {{ClusterNodeTracker}} and moved the filtering logic to {{CapacityScheduler}}. 2. Have added filter logic while returning the {{preferrednodeIterator}}. 3. {{reSortClusterNodes}} need not filter as at the end {{preferrednodeIterator}} does the same. Let me know if this is fine. > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Attachment: YARN-10352-002.patch > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch, YARN-10352-002.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Attachment: YARN-10352-001.patch > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > Attachments: YARN-10352-001.patch > > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) Skip schedule on not heartbeated nodes in Multi Node Placement
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Summary: Skip schedule on not heartbeated nodes in Multi Node Placement (was: MultiNode Placement assigns container on stopped NodeManagers) > Skip schedule on not heartbeated nodes in Multi Node Placement > -- > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments
[ https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159389#comment-17159389 ] Prabhu Joseph commented on YARN-10339: -- Have committed the [^YARN-10339.002.patch] to trunk. Will resolve this Jira. > Timeline Client in Nodemanager gets 403 errors when simple auth is used in > kerberos environments > > > Key: YARN-10339 > URL: https://issues.apache.org/jira/browse/YARN-10339 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-10339.001.patch, YARN-10339.002.patch > > > We get below errors in NodeManager logs whenever we set > yarn.timeline-service.http-authentication.type=simple in a cluster which has > kerberos enabled. There are use cases where simple auth is used only in > timeline server for convenience although kerberos is enabled. > {code:java} > 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl > (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline > server is not successful, HTTP error code: 403, Server response: > {"exception":"ForbiddenException","message":"java.lang.Exception: The owner > of the posted timeline entities is not > set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} > {code} > This seems to affect the NM timeline publisher which uses > TimelineV2ClientImpl. Doing a simple auth directly to timeline service via > curl works fine. So this issue is in the authenticator configuration in > timeline client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments
[ https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159383#comment-17159383 ] Prabhu Joseph commented on YARN-10339: -- Thanks [~tarunparimi] for the patch. +1, will commit it shortly. > Timeline Client in Nodemanager gets 403 errors when simple auth is used in > kerberos environments > > > Key: YARN-10339 > URL: https://issues.apache.org/jira/browse/YARN-10339 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-10339.001.patch, YARN-10339.002.patch > > > We get below errors in NodeManager logs whenever we set > yarn.timeline-service.http-authentication.type=simple in a cluster which has > kerberos enabled. There are use cases where simple auth is used only in > timeline server for convenience although kerberos is enabled. > {code:java} > 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl > (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline > server is not successful, HTTP error code: 403, Server response: > {"exception":"ForbiddenException","message":"java.lang.Exception: The owner > of the posted timeline entities is not > set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} > {code} > This seems to affect the NM timeline publisher which uses > TimelineV2ClientImpl. Doing a simple auth directly to timeline service via > curl works fine. So this issue is in the authenticator configuration in > timeline client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
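As a hedged illustration of the "simple auth via curl" check mentioned in the description (host, port, cluster id, application id and user are placeholders, not values from this Jira):
{code}
# With simple/pseudo authentication the caller is identified through the
# user.name query parameter instead of a Kerberos/SPNEGO negotiation.
curl "http://<timeline-reader-host>:<port>/ws/v2/timeline/clusters/<cluster-id>/apps/<app-id>?user.name=<user>"
{code}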
[jira] [Updated] (YARN-10352) MultiNode Placement assigns container on stopped NodeManagers
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Summary: MultiNode Placement assigns container on stopped NodeManagers (was: MultiNode Placament assigns container on stopped NodeManagers) > MultiNode Placement assigns container on stopped NodeManagers > - > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Labels: capacityscheduler multi-node-placement (was: ) > MultiNode Placament assigns container on stopped NodeManagers > - > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: capacityscheduler, multi-node-placement > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Affects Version/s: 3.3.0 > MultiNode Placament assigns container on stopped NodeManagers > - > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.3.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10352) MultiNode Placament assigns container on stopped NodeManagers
[ https://issues.apache.org/jira/browse/YARN-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10352: - Affects Version/s: 3.4.0 > MultiNode Placament assigns container on stopped NodeManagers > - > > Key: YARN-10352 > URL: https://issues.apache.org/jira/browse/YARN-10352 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > > When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM > Active Nodes will be still having those stopped nodes until NM Liveliness > Monitor Expires after configured timeout > (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, > Multi Node Placement assigns the containers on those nodes. They need to > exclude the nodes which has not heartbeated for configured heartbeat interval > (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to > Asynchronous Capacity Scheduler Threads. > (CapacityScheduler#shouldSkipNodeSchedule) > *Repro:* > 1. Enable Multi Node Placement > (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery > Enabled (yarn.node.recovery.enabled) > 2. Have only one NM running say worker0 > 3. Stop worker0 and start any other NM say worker1 > 4. Submit a sleep job. The containers will timeout as assigned to stopped NM > worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10352) MultiNode Placement assigns container on stopped NodeManagers
Prabhu Joseph created YARN-10352: Summary: MultiNode Placament assigns container on stopped NodeManagers Key: YARN-10352 URL: https://issues.apache.org/jira/browse/YARN-10352 Project: Hadoop YARN Issue Type: Bug Reporter: Prabhu Joseph Assignee: Prabhu Joseph When Node Recovery is Enabled, Stopping a NM won't unregister to RM. So RM Active Nodes will be still having those stopped nodes until NM Liveliness Monitor Expires after configured timeout (yarn.nm.liveness-monitor.expiry-interval-ms = 10 mins). During this 10mins, Multi Node Placement assigns the containers on those nodes. They need to exclude the nodes which has not heartbeated for configured heartbeat interval (yarn.resourcemanager.nodemanagers.heartbeat-interval-ms=1000ms) similar to Asynchronous Capacity Scheduler Threads. (CapacityScheduler#shouldSkipNodeSchedule) *Repro:* 1. Enable Multi Node Placement (yarn.scheduler.capacity.multi-node-placement-enabled) + Node Recovery Enabled (yarn.node.recovery.enabled) 2. Have only one NM running say worker0 3. Stop worker0 and start any other NM say worker1 4. Submit a sleep job. The containers will timeout as assigned to stopped NM worker0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
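Editor's note: the skip check referenced above (CapacityScheduler#shouldSkipNodeSchedule) amounts to comparing each node's last heartbeat time against a few multiples of yarn.resourcemanager.nodemanagers.heartbeat-interval-ms. Below is a minimal, self-contained sketch of such a filter for a multi-node candidate list; the class and method names are illustrative and are not the actual scheduler internals or the committed patch.
{code:java}
import java.util.List;
import java.util.function.ToLongFunction;
import java.util.stream.Collectors;

// Illustrative stale-node filter: skip nodes whose last heartbeat is older
// than N configured heartbeat intervals. Not the actual CapacityScheduler code.
public class StaleNodeFilter {

  private final long heartbeatIntervalMs;   // yarn.resourcemanager.nodemanagers.heartbeat-interval-ms
  private final int toleratedMissedIntervals;

  public StaleNodeFilter(long heartbeatIntervalMs, int toleratedMissedIntervals) {
    this.heartbeatIntervalMs = heartbeatIntervalMs;
    this.toleratedMissedIntervals = toleratedMissedIntervals;
  }

  /** True when the node has not heartbeated recently and should be skipped. */
  public boolean shouldSkipNodeSchedule(long lastHeartbeatMs, long nowMs) {
    return nowMs - lastHeartbeatMs > heartbeatIntervalMs * toleratedMissedIntervals;
  }

  /** Drops stale nodes from a multi-node candidate list before scheduling. */
  public <N> List<N> filterCandidates(List<N> nodes,
      ToLongFunction<N> lastHeartbeatFn, long nowMs) {
    return nodes.stream()
        .filter(n -> !shouldSkipNodeSchedule(lastHeartbeatFn.applyAsLong(n), nowMs))
        .collect(Collectors.toList());
  }
}
{code}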
[jira] [Commented] (YARN-8676) Incorrect progress index in old yarn UI
[ https://issues.apache.org/jira/browse/YARN-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156533#comment-17156533 ] Prabhu Joseph commented on YARN-8676: - [~Cyl] Can you share the screenshot of RM UI before the patch which shows the issue and the screenshot after the fix which shows the issue gets fixed. Thanks. > Incorrect progress index in old yarn UI > --- > > Key: YARN-8676 > URL: https://issues.apache.org/jira/browse/YARN-8676 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yeliang Cang >Assignee: Yeliang Cang >Priority: Critical > Attachments: YARN-8676.001.patch > > > The index of parseHadoopProgress index is wrong in > WebPageUtils#getAppsTableColumnDefs > {code:java} > if (isFairSchedulerPage) { > sb.append("[15]"); > } else if (isResourceManager) { > sb.append("[17]"); > } else { > sb.append("[9]"); > } > {code} > should be > {code:java} > if (isFairSchedulerPage) { > sb.append("[16]"); > } else if (isResourceManager) { > sb.append("[18]"); > } else { > sb.append("[11]"); > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10349) YARN_AM_RM_TOKEN, Localizer token does not have a service name
Prabhu Joseph created YARN-10349: Summary: YARN_AM_RM_TOKEN, Localizer token does not have a service name Key: YARN-10349 URL: https://issues.apache.org/jira/browse/YARN-10349 Project: Hadoop YARN Issue Type: Bug Reporter: Prabhu Joseph Assignee: Prabhu Joseph UGI Credentials#addToken silently overrides a token that has the same service name (HADOOP-17121). This causes tokens such as the YARN_AM_RM_TOKEN and the Localizer token, which have an empty service name, to be overridden. It is safer to give these tokens a service name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
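Editor's note: a minimal sketch of the mitigation described above is to give such tokens a distinct, non-empty service before they are stored, so two tokens with an empty service cannot collide in the Credentials map. The service name and helper below are illustrative assumptions, not the actual fix.
{code:java}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.security.token.Token;

public final class TokenServiceExample {

  // Illustrative service name; the real token kinds would define their own constants.
  private static final Text AM_RM_TOKEN_SERVICE = new Text("yarn.amrm.token");

  /** Adds the token under a non-empty service so it cannot be silently overridden. */
  public static void addWithService(Credentials credentials, Token<?> token) {
    if (token.getService() == null || token.getService().getLength() == 0) {
      token.setService(AM_RM_TOKEN_SERVICE);   // avoid the empty-service collision
    }
    credentials.addToken(token.getService(), token);
  }
}
{code}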
[jira] [Updated] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol
[ https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10340: - Parent: YARN-10025 Issue Type: Sub-task (was: Bug) > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > - > > Key: YARN-10340 > URL: https://issues.apache.org/jira/browse/YARN-10340 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Prabhu Joseph >Assignee: Tarun Parimi >Priority: Major > > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > > [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs] > While accessing above link using systest user, the request fails saying > mapred user does not have access to the job > > {code:java} > 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: > Could not obtain node HTTP address from provider. > javax.ws.rs.WebApplicationException: > org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have > privilege to see this application application_1593997842459_0214 > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461) > > {code} > On Analyzing, found WebServices#getContainer uses doAs using UGI created by > createRemoteUser(end user) to access RM#ApplicationClientProtocol which does > not work. Need to use createProxyUser to do the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
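Editor's note: for context, the proxy-user pattern the description points at looks roughly like the sketch below, where the web service keeps its own login identity but performs the RPC as the remote web user. Method and variable names are illustrative assumptions, not the actual HsWebServices code.
{code:java}
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetContainerReportRequest;
import org.apache.hadoop.yarn.api.protocolrecords.GetContainerReportResponse;

public final class ProxyUserCallExample {

  /** Calls the RM as the remote web user, proxied on top of the service's login user. */
  static GetContainerReportResponse getContainerReportAs(String remoteUser,
      ApplicationClientProtocol appClient, GetContainerReportRequest request)
      throws Exception {
    UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(
        remoteUser, UserGroupInformation.getLoginUser());
    return proxyUgi.doAs(
        (PrivilegedExceptionAction<GetContainerReportResponse>) () ->
            appClient.getContainerReport(request));
  }
}
{code}
In a secure cluster the proxying service generally also needs impersonation rights (hadoop.proxyuser.*) on the RM side, which is part of why the real fix is more involved than this fragment.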
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10345: - Parent: YARN-10025 Issue Type: Sub-task (was: Bug) > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10345: - Affects Version/s: (was: 3.2.0) 3.2.2 3.3.0 > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.3.0, 3.2.2, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10345: - Description: HsWebServices containerlogs does not honor ACLs. User who does not have permission to view a job is allowed to view the job logs for completed jobs from YARN UI2 through HsWebServices. *Repro:* Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred # Run a sample MR job using systest user # Once the job is complete, access the job logs using hue user from YARN UI2. !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! YARN CLI works fine and does not allow hue user to view systest user job logs. {code:java} [hue@pjoseph-cm-2 /]$ [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032 Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) {code} was: HsWebServices containerlogs does not honor ACLs. User who does not have permission to view a job is allowed to view the job logs from YARN UI2 through HsWebServices. *Repro:* Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred # Run a sample MR job using systest user # Once the job is complete, access the job logs using hue user from YARN UI2. !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! YARN CLI works fine and does not allow hue user to view systest user job logs. {code:java} [hue@pjoseph-cm-2 /]$ [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032 Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) {code} > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.2.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs for completed jobs > from YARN UI2 through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10345: - Description: HsWebServices containerlogs does not honor ACLs. User who does not have permission to view a job is allowed to view the job logs from YARN UI2 through HsWebServices. *Repro:* Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred # Run a sample MR job using systest user # Once the job is complete, access the job logs using hue user from YARN UI2. !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! YARN CLI works fine and does not allow hue user to view systest user job logs. {code:java} [hue@pjoseph-cm-2 /]$ [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032 Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) {code} was: HsWebServices containerlogs does not honor ACLs. User who does not have permission to view a job is allowed to view the job logs from YARN UI2 through HsWebServices. *Repro:* Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred 1. Run a sample MR job using systest user 2. Once the job is complete, access the job logs using hue user from YARN UI2. YARN CLI works fine. {code} [hue@pjoseph-cm-2 /]$ [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032 Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) {code} > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.2.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs from YARN UI2 > through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > # Run a sample MR job using systest user > # Once the job is complete, access the job logs using hue user from YARN > UI2. > !Screen Shot 2020-07-08 at 12.54.21 PM.png|height=300! > > YARN CLI works fine and does not allow hue user to view systest user job logs. > {code:java} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 
> 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
Prabhu Joseph created YARN-10345: Summary: HsWebServices containerlogs does not honor ACLs for completed jobs Key: YARN-10345 URL: https://issues.apache.org/jira/browse/YARN-10345 Project: Hadoop YARN Issue Type: Bug Components: yarn Affects Versions: 3.2.0, 3.4.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png HsWebServices containerlogs does not honor ACLs. User who does not have permission to view a job is allowed to view the job logs from YARN UI2 through HsWebServices. *Repro:* Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + HistoryServer runs as mapred 1. Run a sample MR job using systest user 2. Once the job is complete, access the job logs using hue user from YARN UI2. YARN CLI works fine. {code} [hue@pjoseph-cm-2 /]$ [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at rmhostname:8032 Permission denied: user=hue, access=EXECUTE, inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10345) HsWebServices containerlogs does not honor ACLs for completed jobs
[ https://issues.apache.org/jira/browse/YARN-10345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10345: - Attachment: Screen Shot 2020-07-08 at 12.54.21 PM.png > HsWebServices containerlogs does not honor ACLs for completed jobs > -- > > Key: YARN-10345 > URL: https://issues.apache.org/jira/browse/YARN-10345 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.2.0, 3.4.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Critical > Attachments: Screen Shot 2020-07-08 at 12.54.21 PM.png > > > HsWebServices containerlogs does not honor ACLs. User who does not have > permission to view a job is allowed to view the job logs from YARN UI2 > through HsWebServices. > *Repro:* > Secure cluster + yarn.admin.acl=yarn,mapred + Root Queue ACLs set to " " + > HistoryServer runs as mapred > 1. Run a sample MR job using systest user > 2. Once the job is complete, access the job logs using hue user from YARN > UI2. > YARN CLI works fine. > {code} > [hue@pjoseph-cm-2 /]$ > [hue@pjoseph-cm-2 /]$ yarn logs -applicationId application_1594188841761_0002 > WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS. > 20/07/08 07:23:08 INFO client.RMProxy: Connecting to ResourceManager at > rmhostname:8032 > Permission denied: user=hue, access=EXECUTE, > inode="/tmp/logs/systest":systest:hadoop:drwxrwx--- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:496) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
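Editor's note: the missing check described in this issue reduces to serving the aggregated logs only when the caller is the application owner or passes yarn.admin.acl. A hedged sketch of that check follows, with illustrative names and without the servlet wiring; the real fix would also have to consult the application's own view ACLs.
{code:java}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;

public final class CompletedJobLogAccess {

  /** True when the remote web user may read the owner's aggregated logs. */
  public static boolean canViewLogs(String remoteUser, String appOwner,
      String yarnAdminAcl) {
    if (remoteUser == null) {
      return false;                 // unauthenticated callers get nothing
    }
    if (remoteUser.equals(appOwner)) {
      return true;                  // owners can always read their own logs
    }
    // yarn.admin.acl, e.g. "yarn,mapred", in the usual "users groups" ACL syntax
    AccessControlList adminAcl = new AccessControlList(yarnAdminAcl);
    return adminAcl.isUserAllowed(UserGroupInformation.createRemoteUser(remoteUser));
  }
}
{code}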
[jira] [Commented] (YARN-8047) RMWebApp make external class pluggable
[ https://issues.apache.org/jira/browse/YARN-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153309#comment-17153309 ] Prabhu Joseph commented on YARN-8047: - Thanks [~BilwaST] for the patch. Have committed the latest patch [^YARN-8047.006.patch] to trunk. Can you report a separate Jira to handle the testcase. > RMWebApp make external class pluggable > -- > > Key: YARN-8047 > URL: https://issues.apache.org/jira/browse/YARN-8047 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-8047-001.patch, YARN-8047-002.patch, > YARN-8047-003.patch, YARN-8047.004.patch, YARN-8047.005.patch, > YARN-8047.006.patch > > > JIra should make sure we should be able to plugin webservices and web pages > of scheduler in Resourcemanager > * RMWebApp allow to bind external classes > * RMController allow to plugin scheduler classes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol
[ https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153240#comment-17153240 ] Prabhu Joseph commented on YARN-10340: -- Thanks [~tarunparimi] for the analysis. bq. This creates a separate rpc client instance every time though which is not efficient. This won't be a problem as these newly added WebServices (YARN-10028) are used only by Yarn UI2 unless user opens huge number of UI2 pages at a time. And also this is the right way for achieving doAs for RPC calls. > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > - > > Key: YARN-10340 > URL: https://issues.apache.org/jira/browse/YARN-10340 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prabhu Joseph >Assignee: Tarun Parimi >Priority: Major > > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > > [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs] > While accessing above link using systest user, the request fails saying > mapred user does not have access to the job > > {code:java} > 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: > Could not obtain node HTTP address from provider. > javax.ws.rs.WebApplicationException: > org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have > privilege to see this application application_1593997842459_0214 > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461) > > {code} > On Analyzing, found WebServices#getContainer uses doAs using UGI created by > createRemoteUser(end user) to access RM#ApplicationClientProtocol which does > not work. Need to use createProxyUser to do the same. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10337) TestRMHATimelineCollectors fails on hadoop trunk
[ https://issues.apache.org/jira/browse/YARN-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10337: - Parent: YARN-9802 Issue Type: Sub-task (was: Bug) > TestRMHATimelineCollectors fails on hadoop trunk > > > Key: YARN-10337 > URL: https://issues.apache.org/jira/browse/YARN-10337 > Project: Hadoop YARN > Issue Type: Sub-task > Components: test, yarn >Reporter: Ahmed Hussein >Assignee: Bilwa S T >Priority: Major > Fix For: 3.4.0 > > Attachments: YARN-10337.001.patch > > > {{TestRMHATimelineCollectors}} has been failing on trunk. I see it frequently > in the qbt reports and the yetus reprts > {code:bash} > [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 5.95 > s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors > [ERROR] > testRebuildCollectorDataOnFailover(org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors) > Time elapsed: 5.615 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover(TestRMHATimelineCollectors.java:105) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:80) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > [INFO] > [INFO] Results: > [INFO] > [ERROR] Errors: > [ERROR] TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover:105 > NullPointer > [INFO] > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0 > [INFO] > [ERROR] There are test failures. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10337) TestRMHATimelineCollectors fails on hadoop trunk
[ https://issues.apache.org/jira/browse/YARN-10337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152680#comment-17152680 ] Prabhu Joseph commented on YARN-10337: -- Thanks [~BilwaST] for the patch. +1, have committed it to trunk. Will resolve the Jira. > TestRMHATimelineCollectors fails on hadoop trunk > > > Key: YARN-10337 > URL: https://issues.apache.org/jira/browse/YARN-10337 > Project: Hadoop YARN > Issue Type: Bug > Components: test, yarn >Reporter: Ahmed Hussein >Assignee: Bilwa S T >Priority: Major > Attachments: YARN-10337.001.patch > > > {{TestRMHATimelineCollectors}} has been failing on trunk. I see it frequently > in the qbt reports and the yetus reprts > {code:bash} > [INFO] Running > org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 5.95 > s <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors > [ERROR] > testRebuildCollectorDataOnFailover(org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors) > Time elapsed: 5.615 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover(TestRMHATimelineCollectors.java:105) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:80) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > at > 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > [INFO] > [INFO] Results: > [INFO] > [ERROR] Errors: > [ERROR] TestRMHATimelineCollectors.testRebuildCollectorDataOnFailover:105 > NullPointer > [INFO] > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0 > [INFO] > [ERROR] There are test failures. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10339) Timeline Client in Nodemanager gets 403 errors when simple auth is used in kerberos environments
[ https://issues.apache.org/jira/browse/YARN-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152527#comment-17152527 ] Prabhu Joseph commented on YARN-10339: -- [~tarunparimi] Thanks for the patch. The patch looks good. Can you fix the checkstyle issues and failing testcase. > Timeline Client in Nodemanager gets 403 errors when simple auth is used in > kerberos environments > > > Key: YARN-10339 > URL: https://issues.apache.org/jira/browse/YARN-10339 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineclient >Affects Versions: 3.1.0 >Reporter: Tarun Parimi >Assignee: Tarun Parimi >Priority: Major > Attachments: YARN-10339.001.patch > > > We get below errors in NodeManager logs whenever we set > yarn.timeline-service.http-authentication.type=simple in a cluster which has > kerberos enabled. There are use cases where simple auth is used only in > timeline server for convenience although kerberos is enabled. > {code:java} > 2020-05-20 20:06:30,181 ERROR impl.TimelineV2ClientImpl > (TimelineV2ClientImpl.java:putObjects(321)) - Response from the timeline > server is not successful, HTTP error code: 403, Server response: > {"exception":"ForbiddenException","message":"java.lang.Exception: The owner > of the posted timeline entities is not > set","javaClassName":"org.apache.hadoop.yarn.webapp.ForbiddenException"} > {code} > This seems to affect the NM timeline publisher which uses > TimelineV2ClientImpl. Doing a simple auth directly to timeline service via > curl works fine. So this issue is in the authenticator configuration in > timeline client. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
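Editor's note: a simple-auth ("pseudo") HTTP call carries the caller identity via the user.name parameter added by the pseudo authenticator; the 403 above suggests the NM-side client is not conveying that identity when the timeline endpoint uses simple auth while the cluster uses Kerberos. The snippet below only illustrates the mechanism with the hadoop-auth client classes; the host, port, and path are placeholders and this is not the patch itself.
{code:java}
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.security.authentication.client.AuthenticatedURL;
import org.apache.hadoop.security.authentication.client.PseudoAuthenticator;

public class SimpleAuthTimelineCall {
  public static void main(String[] args) throws Exception {
    // Placeholder endpoint; PseudoAuthenticator appends user.name=<caller>,
    // which is how the server learns who owns the posted/read entities.
    URL url = new URL("http://timeline-host:8188/ws/v2/timeline/apps");
    AuthenticatedURL.Token token = new AuthenticatedURL.Token();
    HttpURLConnection conn =
        new AuthenticatedURL(new PseudoAuthenticator()).openConnection(url, token);
    System.out.println("HTTP " + conn.getResponseCode());
  }
}
{code}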
[jira] [Commented] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol
[ https://issues.apache.org/jira/browse/YARN-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152173#comment-17152173 ] Prabhu Joseph commented on YARN-10340: -- [~brahmareddy] This issue happens irrespective of HADOOP-16095 change. Looks this issue is present long ago. *Repro:* Setup: Secure cluster + HistoryServer runs as mapred user + yarn.admin.acl=yarn and ACL for queues are set to " " 1. Run a mapreduce sleep job as userA 2. Access http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs as userA after kinit. 3. The request fails with below error in HistoryServer logs {code} 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: Could not obtain node HTTP address from provider. javax.ws.rs.WebApplicationException: org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have privilege to see this application application_1593997842459_0214 at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516) {code} > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > - > > Key: YARN-10340 > URL: https://issues.apache.org/jira/browse/YARN-10340 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Prabhu Joseph >Assignee: Tarun Parimi >Priority: Major > > HsWebServices getContainerReport uses loginUser instead of remoteUser to > access ApplicationClientProtocol > > [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs] > While accessing above link using systest user, the request fails saying > mapred user does not have access to the job > > {code:java} > 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: > Could not obtain node HTTP address from provider. 
> javax.ws.rs.WebApplicationException: > org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have > privilege to see this application application_1593997842459_0214 > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544) > at > org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405) > at > org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373) > at > org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268) > at > org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461) > > {code} > On Analyzing, found WebServices#getContainer uses doAs using UGI created by > createRemoteUser(end user) to access RM#ApplicationClientProtocol which does > not work. Need to use createProxyUser to do the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10340) HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol
Prabhu Joseph created YARN-10340: Summary: HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol Key: YARN-10340 URL: https://issues.apache.org/jira/browse/YARN-10340 Project: Hadoop YARN Issue Type: Bug Reporter: Prabhu Joseph Assignee: Tarun Parimi HsWebServices getContainerReport uses loginUser instead of remoteUser to access ApplicationClientProtocol [http://:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs|http://pjoseph-secure-1.pjoseph-secure.root.hwx.site:19888/ws/v1/history/containers/container_e03_1594030808801_0002_01_03/logs] While accessing above link using systest user, the request fails saying mapred user does not have access to the job {code:java} 2020-07-06 14:02:59,178 WARN org.apache.hadoop.yarn.server.webapp.LogServlet: Could not obtain node HTTP address from provider. javax.ws.rs.WebApplicationException: org.apache.hadoop.yarn.exceptions.YarnException: User mapred does not have privilege to see this application application_1593997842459_0214 at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:516) at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466) at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:639) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:985) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:913) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882) at org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowThrowable(WebServices.java:544) at org.apache.hadoop.yarn.server.webapp.WebServices.rewrapAndThrowException(WebServices.java:530) at org.apache.hadoop.yarn.server.webapp.WebServices.getContainer(WebServices.java:405) at org.apache.hadoop.yarn.server.webapp.WebServices.getNodeHttpAddress(WebServices.java:373) at org.apache.hadoop.yarn.server.webapp.LogServlet.getContainerLogsInfo(LogServlet.java:268) at org.apache.hadoop.mapreduce.v2.hs.webapp.HsWebServices.getContainerLogs(HsWebServices.java:461) {code} On Analyzing, found WebServices#getContainer uses doAs using UGI created by createRemoteUser(end user) to access RM#ApplicationClientProtocol which does not work. Need to use createProxyUser to do the same. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150173#comment-17150173 ] Prabhu Joseph commented on YARN-10319: -- Thanks [~Tao Yang] and [~adam.antal] for reviewing. Have addressed above comments in the latest patch [^YARN-10319-005.patch] . Document change can be viewed [here|https://tinyurl.com/y7mpok4q] > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
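Editor's note: the "last N activities" idea is essentially a bounded history: keep only the most recent N recorded scheduler activities and expose a snapshot of them on request. A minimal sketch of that bookkeeping is below; it is illustrative and far simpler than the real ActivitiesManager structures.
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative bounded history of the most recent N activity records. */
public class RecentActivities<T> {

  private final int capacity;
  private final Deque<T> buffer = new ArrayDeque<>();

  public RecentActivities(int capacity) {
    this.capacity = capacity;
  }

  public synchronized void record(T activity) {
    if (buffer.size() == capacity) {
      buffer.removeFirst();          // drop the oldest entry once full
    }
    buffer.addLast(activity);
  }

  /** Snapshot, oldest first, e.g. for a REST response or an offline dump. */
  public synchronized List<T> snapshot() {
    return new ArrayList<>(buffer);
  }
}
{code}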
[jira] [Updated] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10319: - Attachment: YARN-10319-005.patch > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch, YARN-10319-005.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path
[ https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149595#comment-17149595 ] Prabhu Joseph edited comment on YARN-10333 at 7/1/20, 5:41 PM: --- [~sunil.gov...@gmail.com] Can you review this Jira when you get time. Thanks. Have verified with below combinations: *fs.defaultFS Log Aggregation Path* hdfs://nm1 s3a://tmp/app-logs hdfs://nm1 abfs://tmp/app-logs hdfs://nm1 hdfs://nm2/tmp/app-logs hdfs://nm1 hdfs://nm1/tmp/app-logs was (Author: prabhu joseph): [~sunil.gov...@gmail.com] Can you review this Jira when you get time. Thanks. Have verified with below combinations: *fs.defaultFS Log Aggregation Path* hdfs://nm1 s3a://tmp/app-logs hdfs://nm1 abfs://tmp/app-logs hdfs://nm1 hdfs://nm2/tmp/app-logs hdfs://nm1 hdfs://nm1/tmp/app-logs > YarnClient obtain Delegation Token for Log Aggregation Path > --- > > Key: YARN-10333 > URL: https://issues.apache.org/jira/browse/YARN-10333 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10333-001.patch, YARN-10333-002.patch, > YARN-10333-003.patch > > > There are use cases where Yarn Log Aggregation Path is configured to a > FileSystem like S3 or ABFS different from what is configured in fs.defaultFS > (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS > and not for log aggregation path. > This Jira is to improve YarnClient by obtaining delegation token for log > aggregation path and add it to the Credential of Container Launch Context > similar to how it does for Timeline Delegation Token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path
[ https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149595#comment-17149595 ] Prabhu Joseph commented on YARN-10333: -- [~sunil.gov...@gmail.com] Can you review this Jira when you get time. Thanks. Have verified with below combinations: *fs.defaultFS Log Aggregation Path* hdfs://nm1 s3a://tmp/app-logs hdfs://nm1 abfs://tmp/app-logs hdfs://nm1 hdfs://nm2/tmp/app-logs hdfs://nm1 hdfs://nm1/tmp/app-logs > YarnClient obtain Delegation Token for Log Aggregation Path > --- > > Key: YARN-10333 > URL: https://issues.apache.org/jira/browse/YARN-10333 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10333-001.patch, YARN-10333-002.patch, > YARN-10333-003.patch > > > There are use cases where Yarn Log Aggregation Path is configured to a > FileSystem like S3 or ABFS different from what is configured in fs.defaultFS > (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS > and not for log aggregation path. > This Jira is to improve YarnClient by obtaining delegation token for log > aggregation path and add it to the Credential of Container Launch Context > similar to how it does for Timeline Delegation Token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path
[ https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10333: - Attachment: YARN-10333-003.patch > YarnClient obtain Delegation Token for Log Aggregation Path > --- > > Key: YARN-10333 > URL: https://issues.apache.org/jira/browse/YARN-10333 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10333-001.patch, YARN-10333-002.patch, > YARN-10333-003.patch > > > There are use cases where Yarn Log Aggregation Path is configured to a > FileSystem like S3 or ABFS different from what is configured in fs.defaultFS > (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS > and not for log aggregation path. > This Jira is to improve YarnClient by obtaining delegation token for log > aggregation path and add it to the Credential of Container Launch Context > similar to how it does for Timeline Delegation Token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path
[ https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10333: - Attachment: YARN-10333-002.patch > YarnClient obtain Delegation Token for Log Aggregation Path > --- > > Key: YARN-10333 > URL: https://issues.apache.org/jira/browse/YARN-10333 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10333-001.patch, YARN-10333-002.patch > > > There are use cases where Yarn Log Aggregation Path is configured to a > FileSystem like S3 or ABFS different from what is configured in fs.defaultFS > (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS > and not for log aggregation path. > This Jira is to improve YarnClient by obtaining delegation token for log > aggregation path and add it to the Credential of Container Launch Context > similar to how it does for Timeline Delegation Token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path
[ https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10333: - Description: There are use cases where Yarn Log Aggregation Path is configured to a FileSystem like S3 or ABFS different from what is configured in fs.defaultFS (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS and not for log aggregation path. This Jira is to improve YarnClient by obtaining delegation token for log aggregation path and add it to the Credential of Container Launch Context similar to how it does for Timeline Delegation Token. was: There are use cases where Yarn Log Aggregation Path is configured to a FileSystem like S3 or ABFS different from what is configured in fs.defaultFS (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS and not for log aggregation path. This Jira is to improve YarnClient by obtaining delegation token for log aggregation path and add it to the Credential of Container Launch Context similar to Timeline Delegation Token. > YarnClient obtain Delegation Token for Log Aggregation Path > --- > > Key: YARN-10333 > URL: https://issues.apache.org/jira/browse/YARN-10333 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10333-001.patch > > > There are use cases where Yarn Log Aggregation Path is configured to a > FileSystem like S3 or ABFS different from what is configured in fs.defaultFS > (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS > and not for log aggregation path. > This Jira is to improve YarnClient by obtaining delegation token for log > aggregation path and add it to the Credential of Container Launch Context > similar to how it does for Timeline Delegation Token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path
[ https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10333: - Attachment: YARN-10333-001.patch > YarnClient obtain Delegation Token for Log Aggregation Path > --- > > Key: YARN-10333 > URL: https://issues.apache.org/jira/browse/YARN-10333 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10333-001.patch > > > There are use cases where Yarn Log Aggregation Path is configured to a > FileSystem like S3 or ABFS different from what is configured in fs.defaultFS > (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS > and not for log aggregation path. > This Jira is to improve YarnClient by obtaining delegation token for log > aggregation path and add it to the Credential of Container Launch Context > similar to Timeline Delegation Token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10333) YarnClient obtain Delegation Token for Log Aggregation Path
[ https://issues.apache.org/jira/browse/YARN-10333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10333: - Summary: YarnClient obtain Delegation Token for Log Aggregation Path (was: YarnClient fetch DT for Log Aggregation Path) > YarnClient obtain Delegation Token for Log Aggregation Path > --- > > Key: YARN-10333 > URL: https://issues.apache.org/jira/browse/YARN-10333 > Project: Hadoop YARN > Issue Type: Improvement > Components: log-aggregation >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-10333-001.patch > > > There are use cases where Yarn Log Aggregation Path is configured to a > FileSystem like S3 or ABFS different from what is configured in fs.defaultFS > (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS > and not for log aggregation path. > This Jira is to improve YarnClient by obtaining delegation token for log > aggregation path and add it to the Credential of Container Launch Context > similar to Timeline Delegation Token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-10333) YarnClient fetch DT for Log Aggregation Path
Prabhu Joseph created YARN-10333: Summary: YarnClient fetch DT for Log Aggregation Path Key: YARN-10333 URL: https://issues.apache.org/jira/browse/YARN-10333 Project: Hadoop YARN Issue Type: Improvement Components: log-aggregation Affects Versions: 3.3.0 Reporter: Prabhu Joseph Assignee: Prabhu Joseph There are use cases where Yarn Log Aggregation Path is configured to a FileSystem like S3 or ABFS different from what is configured in fs.defaultFS (HDFS). Log Aggregation fails as the client has token only for fs.defaultFS and not for log aggregation path. This Jira is to improve YarnClient by obtaining delegation token for log aggregation path and add it to the Credential of Container Launch Context similar to Timeline Delegation Token. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
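For readers of this thread, here is a minimal, hypothetical sketch of the approach the description outlines: obtain a delegation token for the remote log aggregation FileSystem and attach it to the container launch context credentials. This is not the YARN-10333 patch; the helper class and its wiring are assumptions for illustration, and only standard Hadoop/YARN APIs are used.
{code:java}
// Hypothetical sketch (not the YARN-10333 patch): fetch a delegation token
// for the remote log aggregation FileSystem (which may be S3/ABFS rather
// than fs.defaultFS) and place it in the AM ContainerLaunchContext.
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LogAggregationTokenHelper {

  static void addLogAggregationToken(Configuration conf, String renewer,
      ContainerLaunchContext amContainer) throws IOException {
    Credentials credentials = new Credentials();

    // Resolve the configured remote log dir and its FileSystem.
    Path remoteLogDir = new Path(conf.get(
        YarnConfiguration.NM_REMOTE_APP_LOG_DIR,
        YarnConfiguration.DEFAULT_NM_REMOTE_APP_LOG_DIR));
    FileSystem logFs = remoteLogDir.getFileSystem(conf);

    // Obtain delegation tokens for that FileSystem (no-op if not secured).
    logFs.addDelegationTokens(renewer, credentials);

    // Serialize the credentials into the container launch context.
    DataOutputBuffer dob = new DataOutputBuffer();
    credentials.writeTokenStorageToStream(dob);
    amContainer.setTokens(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
  }
}
{code}
A real implementation would merge these tokens with any credentials the client has already collected (for example the fs.defaultFS and Timeline tokens) rather than overwriting them.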
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148552#comment-17148552 ] Prabhu Joseph commented on YARN-10319: -- [~Tao Yang] Can you review the latest patch when you get time. Thanks. > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143544#comment-17143544 ] Prabhu Joseph commented on YARN-10319: -- Thanks [~Tao Yang] for the detailed review. Have addressed all the above comments in [^YARN-10319-004.patch]. bq. The fetching approaches of activities and bulk-activities REST API are different (asynchronous or synchronous), I think we should elaborate this in the document. Have updated the document with the below: {code} The scheduler bulk activities RESTful API can fetch scheduler activities info recorded for multiple scheduling cycles. This may take time to return as it internally waits until it records the specified activitiesCount. {code} > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
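As a usage illustration of the synchronous behaviour described in the comment above, the sketch below queries the bulk activities endpoint once. The path and the activitiesCount parameter follow the API discussed here, but the exact URL, host and port are assumptions and should be checked against the committed documentation.
{code:java}
// Rough sketch: fetch the last N scheduler activities over REST.
// The endpoint path is assumed from the bulk-activities API discussed above;
// adjust host, port and path to your cluster.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BulkActivitiesClient {
  public static void main(String[] args) throws Exception {
    int activitiesCount = 3;  // number of scheduling cycles to record
    URL url = new URL("http://rm-host:8088/ws/v1/cluster/scheduler/bulk-activities"
        + "?activitiesCount=" + activitiesCount);

    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    conn.setRequestProperty("Accept", "application/json");
    // The call blocks until the requested number of activities is recorded,
    // so allow a generous read timeout.
    conn.setReadTimeout(60_000);

    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    } finally {
      conn.disconnect();
    }
  }
}
{code}
Because the call returns only after the requested number of scheduling cycles has been recorded, clients should keep activitiesCount small and the read timeout generous.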
[jira] [Updated] (YARN-10319) Record Last N Scheduler Activities from ActivitiesManager
[ https://issues.apache.org/jira/browse/YARN-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10319: - Attachment: YARN-10319-004.patch > Record Last N Scheduler Activities from ActivitiesManager > - > > Key: YARN-10319 > URL: https://issues.apache.org/jira/browse/YARN-10319 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.3.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Labels: activitiesmanager > Attachments: Screen Shot 2020-06-18 at 1.26.31 PM.png, > YARN-10319-001-WIP.patch, YARN-10319-002.patch, YARN-10319-003.patch, > YARN-10319-004.patch > > > ActivitiesManager records a call flow for a given nodeId or a last call flow. > This is useful when debugging the issue live where the user queries with > right nodeId. But capturing last N scheduler activities during the issue > period can help to debug the issue offline. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8047) RMWebApp make external class pluggable
[ https://issues.apache.org/jira/browse/YARN-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17142799#comment-17142799 ] Prabhu Joseph commented on YARN-8047: - [~BilwaST] Thanks for the patch. Below are some comments. 1. yarn-default.xml a. The description of yarn.http.rmwebapp.external.classes says "Used to specify custom web application pages". This does not look correct; can we change it to "Used to specify custom web services for ResourceManager"? 2. RMWebApp.java a. The below code is not necessary as getClasses won't return null. {code:java} +if (externalClasses == null) { + return; +} {code} 3. RmController.java a. The below needs to be removed: + System.out.println(schedulerName); b. This affects the existing behavior for custom schedulers; we need to show DefaultSchedulerPage.class if hadoop.http.rmwebapp.scheduler.page.class is not configured, and log a warning saying the custom page class was not found if the user has configured a class which does not exist. + renderText("Not Found"); 4. Can you also include a unit test case which shows that the custom web service and custom page work fine? > RMWebApp make external class pluggable > -- > > Key: YARN-8047 > URL: https://issues.apache.org/jira/browse/YARN-8047 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Bibin Chundatt >Assignee: Bilwa S T >Priority: Minor > Attachments: YARN-8047-001.patch, YARN-8047-002.patch, > YARN-8047-003.patch, YARN-8047.004.patch, YARN-8047.005.patch > > > JIra should make sure we should be able to plugin webservices and web pages > of scheduler in Resourcemanager > * RMWebApp allow to bind external classes > * RMController allow to plugin scheduler classes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
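To make comment 3.b concrete, here is an illustrative sketch of the requested fallback (not the actual YARN-8047 patch). The helper class, method name and logging details are invented for the example; the configuration key and DefaultSchedulerPage come from the comment above.
{code:java}
// Illustrative sketch of comment 3.b (not the actual patch): fall back to
// DefaultSchedulerPage when hadoop.http.rmwebapp.scheduler.page.class is not
// set or points at a class that cannot be loaded, logging a warning instead
// of rendering "Not Found".
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.resourcemanager.webapp.DefaultSchedulerPage;
import org.apache.hadoop.yarn.webapp.View;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SchedulerPageResolver {
  private static final Logger LOG =
      LoggerFactory.getLogger(SchedulerPageResolver.class);

  static Class<? extends View> resolveSchedulerPage(Configuration conf) {
    String configured = conf.get("hadoop.http.rmwebapp.scheduler.page.class");
    if (configured == null || configured.isEmpty()) {
      // Nothing configured: keep the existing default page.
      return DefaultSchedulerPage.class;
    }
    try {
      return conf.getClassByName(configured).asSubclass(View.class);
    } catch (ClassNotFoundException | ClassCastException e) {
      LOG.warn("Configured scheduler page class {} could not be loaded,"
          + " falling back to DefaultSchedulerPage", configured, e);
      return DefaultSchedulerPage.class;
    }
  }
}
{code}
RmController could then call render(resolveSchedulerPage(conf)) instead of renderText("Not Found").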
[jira] [Commented] (YARN-10321) Break down TestUserGroupMappingPlacementRule#testMapping into test scenarios
[ https://issues.apache.org/jira/browse/YARN-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141756#comment-17141756 ] Prabhu Joseph commented on YARN-10321: -- Thanks [~snemeth] for the patch and [~shuzirra] for the review. The patch looks good, +1. Have committed it to trunk. > Break down TestUserGroupMappingPlacementRule#testMapping into test scenarios > > > Key: YARN-10321 > URL: https://issues.apache.org/jira/browse/YARN-10321 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-10321.001.patch > > > org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule#testMapping > is very large and hard to read/maintain and moreover, error-prone. > We should break this testcase down into several separate testcases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10321) Break down TestUserGroupMappingPlacementRule#testMapping into test scenarios
[ https://issues.apache.org/jira/browse/YARN-10321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-10321: - Fix Version/s: 3.4.0 > Break down TestUserGroupMappingPlacementRule#testMapping into test scenarios > > > Key: YARN-10321 > URL: https://issues.apache.org/jira/browse/YARN-10321 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Fix For: 3.4.0 > > Attachments: YARN-10321.001.patch > > > org.apache.hadoop.yarn.server.resourcemanager.placement.TestUserGroupMappingPlacementRule#testMapping > is very large and hard to read/maintain and moreover, error-prone. > We should break this testcase down into several separate testcases. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
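As a generic illustration of the breakdown proposed in this JIRA (not the actual YARN-10321 patch), a single large data-driven test method can be split into small, named scenario tests that share one assertion helper; all class, method and queue names below are placeholders.
{code:java}
// Generic illustration of splitting one large testMapping into focused
// scenario tests (placeholder names; not the YARN-10321 patch).
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class TestUserGroupMappingScenarios {

  // Shared helper standing in for the original assertion plumbing:
  // resolve the queue for a user and compare it with the expectation.
  private void verifyQueueMapping(String user, String expectedQueue) {
    String resolvedQueue = resolveQueueForUser(user);  // placeholder resolver
    assertEquals("Queue mapping mismatch for user " + user,
        expectedQueue, resolvedQueue);
  }

  // Placeholder for the real placement-rule invocation.
  private String resolveQueueForUser(String user) {
    return "a".equals(user) ? "queueA" : "default";
  }

  @Test
  public void testSpecificUserMapsToSpecificQueue() {
    verifyQueueMapping("a", "queueA");
  }

  @Test
  public void testUnknownUserFallsBackToDefaultQueue() {
    verifyQueueMapping("someUser", "default");
  }
}
{code}
Each scenario then documents itself through its method name, which is much easier to maintain than one monolithic testMapping.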