[jira] [Created] (YARN-11659) app submission fast fail with node label when node label is disabled
Junfan Zhang created YARN-11659: --- Summary: app submission fast fail with node label when node label is disabled Key: YARN-11659 URL: https://issues.apache.org/jira/browse/YARN-11659 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11659) app with node label submission should fast fail when node label is disabled
[ https://issues.apache.org/jira/browse/YARN-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11659: --- Assignee: Junfan Zhang > app with node label submission should fast fail when node label is disabled > -- > > Key: YARN-11659 > URL: https://issues.apache.org/jira/browse/YARN-11659 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11659) app with node label submission should fast fail when node label is disabled
[ https://issues.apache.org/jira/browse/YARN-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11659: Summary: app with node label submission should fast fail when node label is disabled (was: app submission fast fail with node label when node label is disabled) > app with node label submission should fast fail when node label is disabled > -- > > Key: YARN-11659 > URL: https://issues.apache.org/jira/browse/YARN-11659 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
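The fast-fail idea above can be sketched as a submission-time check. The following Java sketch is illustrative only and is not the actual YARN-11659 patch; the validator class and its placement in the RM submission path are assumptions made for the example.
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical helper: reject a labelled submission up front instead of
// letting the application sit unschedulable after it has been accepted.
public final class NodeLabelSubmissionValidator {

  private NodeLabelSubmissionValidator() {}

  public static void validate(ApplicationSubmissionContext ctx,
                              boolean nodeLabelsEnabled) throws YarnException {
    String label = ctx.getNodeLabelExpression();
    // Fail fast when node labels are disabled but the app still asks for one.
    if (!nodeLabelsEnabled && label != null && !label.trim().isEmpty()) {
      throw new YarnException("Application " + ctx.getApplicationId()
          + " requests node label '" + label
          + "' but node labels are disabled on this cluster");
    }
  }
}
{code}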
[jira] [Created] (YARN-11660) SingleConstraintAppPlacementAllocator performance regression
Junfan Zhang created YARN-11660: --- Summary: SingleConstraintAppPlacementAllocator performance regression Key: YARN-11660 URL: https://issues.apache.org/jira/browse/YARN-11660 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11660) SingleConstraintAppPlacementAllocator performance regression
[ https://issues.apache.org/jira/browse/YARN-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11660: --- Assignee: Junfan Zhang > SingleConstraintAppPlacementAllocator performance regression > > > Key: YARN-11660 > URL: https://issues.apache.org/jira/browse/YARN-11660 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
Junfan Zhang created YARN-11668: --- Summary: Potential concurrent modification exception for node attributes of node manager Key: YARN-11668 URL: https://issues.apache.org/jira/browse/YARN-11668 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
[ https://issues.apache.org/jira/browse/YARN-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11668: Description: The RM crashes when encountering the following stacktrace. > Potential concurrent modification exception for node attributes of node > manager > --- > > Key: YARN-11668 > URL: https://issues.apache.org/jira/browse/YARN-11668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > > The RM crashes when encountering the following stacktrace. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
[ https://issues.apache.org/jira/browse/YARN-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11668: Description: The RM crashes when encountering the stacktrace in the attachment. (was: The RM crashes when encountering the following stacktrace.) > Potential concurrent modification exception for node attributes of node > manager > --- > > Key: YARN-11668 > URL: https://issues.apache.org/jira/browse/YARN-11668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: img_v3_029c_55ac6b50-64aa-4cbe-81a0-5f8d22c623fg.jpg > > > The RM crashes when encountering the stacktrace in the attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
[ https://issues.apache.org/jira/browse/YARN-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11668: Attachment: img_v3_029c_55ac6b50-64aa-4cbe-81a0-5f8d22c623fg.jpg > Potential concurrent modification exception for node attributes of node > manager > --- > > Key: YARN-11668 > URL: https://issues.apache.org/jira/browse/YARN-11668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: img_v3_029c_55ac6b50-64aa-4cbe-81a0-5f8d22c623fg.jpg > > > The RM crashes when encountering the following stacktrace. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
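The attached stacktrace is not reproduced here. As a generic illustration of how this kind of ConcurrentModificationException arises (plain Java, not the actual ResourceManager code), one thread iterates a shared set of node attributes while another thread mutates it; taking a snapshot copy under a lock avoids the crash.
{code:java}
import java.util.HashSet;
import java.util.Set;

class NodeAttributeHolder {
  private final Set<String> attributes = new HashSet<>();

  // Buggy pattern: iterating the shared set directly can throw
  // ConcurrentModificationException if another thread changes it meanwhile.
  void reportBuggy() {
    for (String attr : attributes) {
      System.out.println(attr);
    }
  }

  // One common fix: take a snapshot under a lock, then iterate the copy.
  void reportSafe() {
    Set<String> snapshot;
    synchronized (attributes) {
      snapshot = new HashSet<>(attributes);
    }
    for (String attr : snapshot) {
      System.out.println(attr);
    }
  }

  // All writers must use the same lock as the snapshot above.
  void update(String attr) {
    synchronized (attributes) {
      attributes.add(attr);
    }
  }
}
{code}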
[jira] [Commented] (YARN-10065) Support Placement Constraints for AM container allocations
[ https://issues.apache.org/jira/browse/YARN-10065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861090#comment-17861090 ] Junfan Zhang commented on YARN-10065: - I think I can pick this up; I have implemented this in our internal YARN version. > Support Placement Constraints for AM container allocations > -- > > Key: YARN-10065 > URL: https://issues.apache.org/jira/browse/YARN-10065 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.2.0 >Reporter: Daniel Velasquez >Priority: Major > > Currently ApplicationSubmissionContext API supports specifying a node label > expression for the AM resource request. It would be beneficial to have the > ability to specify Placement Constraints as well for the AM resource request. > We have a requirement to constrain AM containers on certain nodes e.g. AM > containers not on preemptible/spot cloud instances. It looks like node > attributes would fit our use case well. However, we currently don't have the > ability to specify this in the API for AM resource requests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11704) Avoid nested 'AND' placement constraint for non tags in scheduling request
[ https://issues.apache.org/jira/browse/YARN-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11704: --- Assignee: Junfan Zhang > Avoid nested 'AND' placement constraint for non tags in scheduling request > -- > > Key: YARN-11704 > URL: https://issues.apache.org/jira/browse/YARN-11704 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11704) Avoid nested 'AND' placement constraint for non tags in scheduling request
Junfan Zhang created YARN-11704: --- Summary: Avoid nested 'AND' placement constraint for non tags in scheduling request Key: YARN-11704 URL: https://issues.apache.org/jira/browse/YARN-11704 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
Junfan Zhang created YARN-11728: --- Summary: Scheduling hang when multiple nodes placement is enabled Key: YARN-11728 URL: https://issues.apache.org/jira/browse/YARN-11728 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, multi-node-placement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Description: When trying to use multi-node placement to enable a customized multi-node lookup policy, I found this has some problems of > Scheduling hang when multiple nodes placement is enabled > > > Key: YARN-11728 > URL: https://issues.apache.org/jira/browse/YARN-11728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, multi-node-placement >Reporter: Junfan Zhang >Priority: Major > > When trying to use multi-node placement to enable a customized multi-node > lookup policy, I found this has some problems of -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Description:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-1.png!

was:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}

> Scheduling hang when multiple nodes placement is enabled
>
>
> Key: YARN-11728
> URL: https://issues.apache.org/jira/browse/YARN-11728
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, multi-node-placement
> Reporter: Junfan Zhang
> Priority: Major
> Attachments: screenshot-1.png
>
>
> When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
> Let me describe how to reproduce this problem.
> h2. Preconditions
> 1. Using the capacity-scheduler
> 2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
> h2. How to reproduce
> 1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
> {code:xml}
> <property>
>   <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
> </property>
> {code}
> 2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores.
> If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
> !screenshot-1.png!
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Attachment: (was: screenshot-1.png)
> Scheduling hang when multiple nodes placement is enabled
>
>
> Key: YARN-11728
> URL: https://issues.apache.org/jira/browse/YARN-11728
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, multi-node-placement
> Reporter: Junfan Zhang
> Priority: Major
> Attachments: screenshot-2.png
>
>
> When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
> Let me describe how to reproduce this problem.
> h2. Preconditions
> 1. Using the capacity-scheduler
> 2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
> h2. How to reproduce
> 1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
> {code:xml}
> <property>
>   <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
> </property>
> {code}
> 2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores.
> If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
> !screenshot-2.png! !screenshot-1.png!
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Description:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-2.png! !screenshot-1.png!

was:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-1.png!

> Scheduling hang when multiple nodes placement is enabled
>
>
> Key: YARN-11728
> URL: https://issues.apache.org/jira/browse/YARN-11728
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, multi-node-placement
> Reporter: Junfan Zhang
> Priority: Major
> Attachments: screenshot-2.png
>
>
> When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
> Let me describe how to reproduce this problem.
> h2. Preconditions
> 1. Using the capacity-scheduler
> 2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
> h2. How to reproduce
> 1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
> {code:xml}
> <property>
>   <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
> </property>
> {code}
> 2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores.
> If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
> !screenshot-2.png! !screenshot-1.png!
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Description:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler with async scheduling enabled
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-2.png!
At this time, if you submit another app to this cluster, you will see that its AM is not allocated any resources.
h2. Why
After digging into YARN's async scheduling logic, I found something strange about multi-node placement. Simply put, the scheduling hang is caused by a single reserved container.
When multi-node placement is enabled, a container selected by the specified policy is not matched with a single candidate NodeManager but with multiple nodes. The order of these nodes is determined by the customized lookup policy; the default is {{ResourceUsageMultiNodeLookupPolicy}}. The policy is managed by the {{MultiNodeSortingManager}}, which uses it to re-sort all of the cluster's healthy nodes at a 1-second interval.
1. Now suppose that in the first second the node order is (node1, node2), and the 97th container (the 1st container is the AM) is reserved on node1.
2. On the next pass, the async scheduling thread finds this reserved container and tries to re-reserve/re-start it. Unfortunately, no existing container will be released.
3. After 1 second, the sorting policy takes effect and re-sorts the node order to (node2, node1). Intuitively, if node1 is full of containers with no free resources, the reserved container could be picked up by another node (like node2). But this is not allowed in YARN, and so the hang happens.
h2. How to fix this
1. If there are multiple candidate nodes, look through all of them until one has enough resources to start the container, instead of reserving
2. Allow other nodes to pick up the reserved container

was:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler with async scheduling enabled
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-2.png!
At this time, if you submit another app to this cluster, you will see that its AM is not allocated any resources.
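A minimal Java sketch of the behaviour described above follows. It is pseudo-logic for illustration only, not the real CapacityScheduler code; the class, method, and field names are invented for the example.
{code:java}
import java.util.List;

class MultiNodeAllocSketch {

  static class Node {
    final String name;
    int freeVcores;
    Node(String name, int freeVcores) { this.name = name; this.freeVcores = freeVcores; }
  }

  // Hang-prone shape: only the first candidate from the sorted list is tried;
  // when it is full, the container is reserved there and the remaining
  // candidates (which may have free resources) are never considered.
  static String allocateOrReserveBuggy(List<Node> sortedCandidates, int askVcores) {
    Node first = sortedCandidates.get(0);
    if (first.freeVcores >= askVcores) {
      first.freeVcores -= askVcores;
      return "ALLOCATED on " + first.name;
    }
    return "RESERVED on " + first.name;   // other candidates are never tried
  }

  // Direction suggested in the description: try every candidate before
  // falling back to a reservation, so a full node1 does not block node2.
  static String allocateOrReserveFixed(List<Node> sortedCandidates, int askVcores) {
    for (Node n : sortedCandidates) {
      if (n.freeVcores >= askVcores) {
        n.freeVcores -= askVcores;
        return "ALLOCATED on " + n.name;
      }
    }
    return "RESERVED on " + sortedCandidates.get(0).name;
  }
}
{code}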
[jira] [Commented] (YARN-11115) Add configuration to disable AM preemption for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541786#comment-17541786 ] Junfan Zhang commented on YARN-11115: - Sorry for the late reply. Feel free to take it. [~groot] Looking forward to your patch. > Add configuration to disable AM preemption for capacity scheduler > - > > Key: YARN-11115 > URL: https://issues.apache.org/jira/browse/YARN-11115 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Yuan Luo >Assignee: Ashutosh Gupta >Priority: Major > > I think it's necessary to add a configuration to disable AM preemption for > the capacity scheduler, like the fair-scheduler feature: YARN-9537. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11164) PartitionQueueMetrics support more metrics
[ https://issues.apache.org/jira/browse/YARN-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11164: --- Assignee: Junfan Zhang > PartitionQueueMetrics support more metrics > -- > > Key: YARN-11164 > URL: https://issues.apache.org/jira/browse/YARN-11164 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > When node labels are enabled with the capacity scheduler, the partition queue > metrics are missing a lot of metrics compared with {{QueueMetrics}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11164) PartitionQueueMetrics support more metrics
Junfan Zhang created YARN-11164: --- Summary: PartitionQueueMetrics support more metrics Key: YARN-11164 URL: https://issues.apache.org/jira/browse/YARN-11164 Project: Hadoop YARN Issue Type: Improvement Components: metrics Reporter: Junfan Zhang When node labels are enabled with the capacity scheduler, the partition queue metrics are missing a lot of metrics compared with {{QueueMetrics}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11179) Show more detailed info when container token is expired
Junfan Zhang created YARN-11179: --- Summary: Show more detailed info when container token is expired Key: YARN-11179 URL: https://issues.apache.org/jira/browse/YARN-11179 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang There is no appId in the log about failing to start containers when the container token is expired. This makes it hard to solve the error. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11179) Show more detailed info when container token is expired
[ https://issues.apache.org/jira/browse/YARN-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11179: Description: There is no appId in the log about failing to start containers when the container token is expired. This makes it hard to troubleshoot. (was: There is no appId in the log about failing to start containers when the container token is expired. This makes it hard to solve the error.) > Show more detailed info when container token is expired > --- > > Key: YARN-11179 > URL: https://issues.apache.org/jira/browse/YARN-11179 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > There is no appId in the log about failing to start containers when the > container token is expired. This makes it hard to troubleshoot. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11084) Introduce new config to specify AM default node-label when not specified
Junfan Zhang created YARN-11084: --- Summary: Introduce new config to specify AM default node-label when not specified Key: YARN-11084 URL: https://issues.apache.org/jira/browse/YARN-11084 Project: Hadoop YARN Issue Type: New Feature Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11084) Introduce new config to specify AM default node-label when not specified
[ https://issues.apache.org/jira/browse/YARN-11084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11084: Description: h2. What When submitting an application to YARN and the user doesn't specify any node label on the AM request or {{ApplicationSubmissionContext}}, we hope that YARN could provide a default AM node label. h2. Why Our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). To prevent application instability due to elastic NM decommission, we hope that the AM of a job can be allocated to on-premise NMs. > Introduce new config to specify AM default node-label when not specified > > > Key: YARN-11084 > URL: https://issues.apache.org/jira/browse/YARN-11084 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Priority: Major > > h2. What > When submitting an application to YARN and the user doesn't specify any node label > on the AM request or {{ApplicationSubmissionContext}}, we hope that YARN could > provide a default AM node label. > > h2. Why > Our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). To prevent application instability due to > elastic NM decommission, we hope that the AM of a job can be allocated to > on-premise NMs. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
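A rough sketch of the proposed fallback follows. The property name yarn.resourcemanager.am.default-node-label is hypothetical and used only for illustration; the real key and the exact hook point in the RM would be decided in the patch review.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

public final class DefaultAmNodeLabel {

  // Hypothetical property name, for illustration only.
  public static final String AM_DEFAULT_NODE_LABEL =
      "yarn.resourcemanager.am.default-node-label";

  private DefaultAmNodeLabel() {}

  public static void applyDefaultIfMissing(ApplicationSubmissionContext ctx,
                                           Configuration conf) {
    String requested = ctx.getNodeLabelExpression();
    if (requested == null || requested.isEmpty()) {
      String fallback = conf.get(AM_DEFAULT_NODE_LABEL, "");
      if (!fallback.isEmpty()) {
        // e.g. fall back to an "on-premise" partition for the AM container
        ctx.setNodeLabelExpression(fallback);
      }
    }
  }
}
{code}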
[jira] [Created] (YARN-11086) Add space in debug log of ParentQueue
Junfan Zhang created YARN-11086: --- Summary: Add space in debug log of ParentQueue Key: YARN-11086 URL: https://issues.apache.org/jira/browse/YARN-11086 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11087) Introduce the config to control the refresh interval in RMNodeLabelsMappingProvider
Junfan Zhang created YARN-11087: --- Summary: Introduce the config to control the refresh interval in RMNodeLabelsMappingProvider Key: YARN-11087 URL: https://issues.apache.org/jira/browse/YARN-11087 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11087: Summary: Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater (was: Introduce the config to control the refresh interval in RMNodeLabelsMappingProvider) > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11087: Description: h3. Why When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once a newly registered node comes in, its node label won't be attached until the node-label mapping provider is triggered, and the delay depends on the scheduler interval. h3. How to solve this bug > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > h3. Why > When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once > a newly registered node comes in, its node label won't be attached until the > node-label mapping provider is triggered, and the delay depends on the scheduler > interval. > h3. How to solve this bug -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11087: Issue Type: Bug (was: Improvement) > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > h3. Why > When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once > a newly registered node comes in, its node label won't be attached until the > node-label mapping provider is triggered, and the delay depends on the scheduler > interval. > h3. How to solve this bug -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11087: Description: h3. Why When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once a newly registered node comes in, its node label won't be attached until the node-label mapping provider is triggered, and the delay depends on the scheduler interval. h3. How to solve this bug I think there are two options # Introduce a new config to specify the update-node-label schedule interval. If you want newly registered nodes to be refreshed quickly, decrease the interval. # Once a newly registered node comes in, directly trigger the execution of the node-label mapping provider. But if the provider is a time-consuming operation and lots of nodes register with the RM at the same time, some nodes' labels will still be delayed. I prefer the first option and have submitted a PR to solve this. Feel free to discuss if you have any ideas. was: h3. Why When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once a newly registered node comes in, its node label won't be attached until the node-label mapping provider is triggered, and the delay depends on the scheduler interval. h3. How to solve this bug > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > h3. Why > When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once > a newly registered node comes in, its node label won't be attached until the > node-label mapping provider is triggered, and the delay depends on the scheduler > interval. > h3. How to solve this bug > I think there are two options > # Introduce a new config to specify the update-node-label schedule interval. > If you want newly registered nodes to be refreshed quickly, decrease the interval. > # Once a newly registered node comes in, directly trigger the execution of the > node-label mapping provider. But if the provider is a time-consuming operation > and lots of nodes register with the RM at the same time, some nodes' labels will > still be delayed. > I prefer the first option and have submitted a PR to solve this. > Feel free to discuss if you have any ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
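A rough sketch of option 1 follows. The property name and default used here are hypothetical, for illustration only; the sketch simply shows the updater's polling interval being read from configuration instead of being hard-coded.
{code:java}
import org.apache.hadoop.conf.Configuration;

public final class DelegatedNodeLabelsRefresh {

  // Hypothetical key and default, for illustration only.
  public static final String UPDATE_INTERVAL_MS =
      "yarn.resourcemanager.node-labels.delegated.update-interval-ms";
  public static final long DEFAULT_UPDATE_INTERVAL_MS = 30 * 1000L;

  private DelegatedNodeLabelsRefresh() {}

  public static long resolveInterval(Configuration conf) {
    // A smaller value attaches labels to newly registered nodes sooner,
    // at the cost of invoking the mapping provider more often.
    return conf.getLong(UPDATE_INTERVAL_MS, DEFAULT_UPDATE_INTERVAL_MS);
  }
}
{code}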
[jira] [Created] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
Junfan Zhang created YARN-11088: --- Summary: Introduce the config to control the AM allocated to non-exclusive nodes Key: YARN-11088 URL: https://issues.apache.org/jira/browse/YARN-11088 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce the new config to control the was: h4. What When submitting an application to YARN and the user doesn't specify any node label on the AM request or ApplicationSubmissionContext, we hope that YARN could provide a default AM node label. h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. Our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce the new config to control the -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. What When submitting an application to YARN and the user doesn't specify any node label on the AM request or ApplicationSubmissionContext, we hope that YARN could provide a default AM node label. h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. Our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. What > When submitting an application to YARN and the user doesn't specify any node label > on the AM request or ApplicationSubmissionContext, we hope that YARN could provide > a default AM node label. > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > Our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. Feel free to discuss if you have any ideas! was: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce the new config to control the > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > Feel free to discuss if you have any ideas! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. *Feel free to discuss if you have any ideas!* was: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. Feel free to discuss if you have any ideas! > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if you have any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11089) Fix typo in rm audit log
Junfan Zhang created YARN-11089: --- Summary: Fix typo in rm audit log Key: YARN-11089 URL: https://issues.apache.org/jira/browse/YARN-11089 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When all the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. *Feel free to discuss if you have any ideas!* was: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. *Feel free to discuss if you have any ideas!* > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When all the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if you have any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
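A rough sketch of the proposed switch follows. The flag name is hypothetical, for illustration only; in a real patch the decision would live in the scheduler's AM allocation path rather than in a standalone helper like this.
{code:java}
import org.apache.hadoop.conf.Configuration;

public final class AmNonExclusivePolicy {

  // Hypothetical key, for illustration only.
  public static final String AM_ALLOWED_ON_NON_EXCLUSIVE =
      "yarn.scheduler.capacity.am.allow-non-exclusive-allocation";

  private AmNonExclusivePolicy() {}

  public static boolean amAllowedOnNonExclusivePartition(Configuration conf) {
    // Default false keeps today's fail-fast behaviour; opting in lets the AM
    // land on a non-exclusive (sharable) partition when labelled nodes are gone.
    return conf.getBoolean(AM_ALLOWED_ON_NON_EXCLUSIVE, false);
  }
}
{code}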
[jira] [Commented] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506703#comment-17506703 ] Junfan Zhang commented on YARN-11088: - Could you help check this feature? [~quapaw] [~tdomok] > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When all the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if you have any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506703#comment-17506703 ] Junfan Zhang edited comment on YARN-11088 at 3/15/22, 4:56 AM: --- Could you help check this feature? [~quapaw] [~tdomok]. If OK, please assign it to me. was (Author: zuston): Could you help check this feature? [~quapaw] [~tdomok] > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When all the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if you have any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509902#comment-17509902 ] Junfan Zhang commented on YARN-11088: - I will submit PR tomorrow and it has been applied in our internal Yarn. Glad to contribute to the community. [~quapaw] > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > h4. Why > Current the implementation of Yarn about AM allocation on non-exclusive nodes > is directly to fail fast. I know this aims to keep the stability of job, > because the container in non-exclusive nodes will be preempted. > But Yarn cluster in our internal company exists on-premise NodeManagers and > elastic NodeManagers (which is built on K8s). When all the elastic > nodemanagers decommission, we hope that the AM can be scheduled to > non-exclusive nodes. > h4. How to support it > Introduce the new config to control the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if having any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509903#comment-17509903 ] Junfan Zhang edited comment on YARN-11087 at 3/21/22, 1:59 PM: --- What do u think of the second option? [~quapaw] was (Author: zuston): What do u think of the second option? [~snemeth] > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > h3. Why > When configuring nodes to labels mapping by Delegated-Centralized mode, once > the newly registered nodes comes, the node-label of this node wont be > attached until triggering the nodelabel mapping provider, which the delayed > time depends on the scheduler interval. > h3. How to solve this bug > I think there are two options > # Introduce the new config to specify the update-node-label schedule > interval. If u want to quickly refresh the newly registered nodes, user > should decrease the interval. > # Once the newly registered node come, directly trigger the execution of > nodelabel mapping provider. But if the provider is the time-consuming > operation and lots of nodes register to RM at the same time, this will also > make some nodes with node-label delay. > I prefer the first option and submit the PR to solve this. > Feel free to discuss if having any ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509903#comment-17509903 ] Junfan Zhang commented on YARN-11087: - What do u think of the second option? [~snemeth] > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > h3. Why > When configuring nodes to labels mapping by Delegated-Centralized mode, once > the newly registered nodes comes, the node-label of this node wont be > attached until triggering the nodelabel mapping provider, which the delayed > time depends on the scheduler interval. > h3. How to solve this bug > I think there are two options > # Introduce the new config to specify the update-node-label schedule > interval. If u want to quickly refresh the newly registered nodes, user > should decrease the interval. > # Once the newly registered node come, directly trigger the execution of > nodelabel mapping provider. But if the provider is the time-consuming > operation and lots of nodes register to RM at the same time, this will also > make some nodes with node-label delay. > I prefer the first option and submit the PR to solve this. > Feel free to discuss if having any ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
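As a rough illustration of option 1 from the YARN-11087 description, a deployment could shrink the delegated-centralized refresh interval through a dedicated property. The key below is hypothetical, standing in for whatever name the patch finally introduces; the Configuration calls themselves are standard Hadoop API.

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeLabelsRefreshIntervalSketch {
  // Hypothetical key for the RMDelegatedNodeLabelsUpdater refresh interval.
  private static final String UPDATE_INTERVAL_MS =
      "yarn.resourcemanager.delegated-node-labels-updater.update-interval-ms";

  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // A smaller interval lets newly registered nodes pick up their labels sooner,
    // at the cost of invoking the node-label mapping provider more often.
    conf.setLong(UPDATE_INTERVAL_MS, 5000L);
    System.out.println(conf.getLong(UPDATE_INTERVAL_MS, 30000L));
  }
}
{code}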
[jira] [Created] (YARN-11099) Limit the resources usage of non-exclusive allocation
Junfan Zhang created YARN-11099: --- Summary: Limit the resources usage of non-exclusive allocation Key: YARN-11099 URL: https://issues.apache.org/jira/browse/YARN-11099 Project: Hadoop YARN Issue Type: New Feature Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11099: Description: In current non-exclusive allocation, there is no limitation of resource usage. related code link: But in our internal hadoop, we hope the resource usage of non-exclusive allocation can be limited to the {{Effective Max Capacity}} > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. related code link: > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11099: Description: In the current non-exclusive allocation path, there is no limit on resource usage. [related code link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] But in our internal Hadoop deployment, we would like the resource usage of non-exclusive allocation to be capped at the {{Effective Max Capacity}}. was: In current non-exclusive allocation, there is no limitation of resource usage. related code link: But in our internal hadoop, we hope the resource usage of non-exclusive allocation can be limited to the {{Effective Max Capacity}} > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
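To make the YARN-11099 proposal concrete, here is a minimal sketch of the kind of guard being asked for, assuming illustrative inputs rather than the real AbstractCSQueue fields (borrowedUsed, requested and effectiveMax are stand-ins); the Resource and Resources utilities are standard YARN API.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class NonExclusiveLimitSketch {
  // Returns true if granting the request on a non-exclusive partition keeps the
  // queue's borrowed usage within its effective max capacity. All three inputs
  // are illustrative, not real scheduler state.
  static boolean withinEffectiveMax(Resource borrowedUsed, Resource requested,
      Resource effectiveMax) {
    Resource afterAllocation = Resources.add(borrowedUsed, requested);
    return Resources.fitsIn(afterAllocation, effectiveMax);
  }

  public static void main(String[] args) {
    Resource used = Resource.newInstance(6 * 1024, 6);
    Resource ask = Resource.newInstance(2 * 1024, 2);
    Resource max = Resource.newInstance(8 * 1024, 8);
    System.out.println(withinEffectiveMax(used, ask, max)); // true: exactly at the cap
  }
}
{code}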
[jira] [Commented] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511899#comment-17511899 ] Junfan Zhang commented on YARN-11099: - Do u have any ideas on it? [~quapaw] > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11099: --- Assignee: Junfan Zhang > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5464) Server-Side NM Graceful Decommissioning with RM HA
[ https://issues.apache.org/jira/browse/YARN-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511902#comment-17511902 ] Junfan Zhang commented on YARN-5464: Any update on it? [~shuzirra] , [~brahmareddy] ,[~quapaw] This PR meets our internal requirement and hope it can be merged into trunk. > Server-Side NM Graceful Decommissioning with RM HA > -- > > Key: YARN-5464 > URL: https://issues.apache.org/jira/browse/YARN-5464 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, yarn >Reporter: Robert Kanter >Assignee: Gergely Pollák >Priority: Major > Attachments: YARN-5464.001.patch, YARN-5464.002.patch, > YARN-5464.003.patch, YARN-5464.004.patch, YARN-5464.005.patch, > YARN-5464.006.patch, YARN-5464.wip.patch > > > Make sure to remove the note added by YARN-7094 about RM HA failover not > working right. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511899#comment-17511899 ] Junfan Zhang edited comment on YARN-11099 at 3/24/22, 3:04 PM: --- Do u have any ideas on it? [~quapaw]. If OK, i will go ahead. was (Author: zuston): Do u have any ideas on it? [~quapaw] > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511899#comment-17511899 ] Junfan Zhang edited comment on YARN-11099 at 3/24/22, 3:04 PM: --- Do u have any ideas on it? [~quapaw]. If OK, i will go ahead. was (Author: zuston): Do u have any ideas on it? [~quapaw]. If OK, i will go ahead. > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514160#comment-17514160 ] Junfan Zhang commented on YARN-11099: - + [~bteke] > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11106) Fix the test failure due to missing conf of yarn.resourcemanager.node-labels.am.default-node-label-expression
Junfan Zhang created YARN-11106: --- Summary: Fix the test failure due to missing conf of yarn.resourcemanager.node-labels.am.default-node-label-expression Key: YARN-11106 URL: https://issues.apache.org/jira/browse/YARN-11106 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11106) Fix the test failure due to missing conf of yarn.resourcemanager.node-labels.am.default-node-label-expression
[ https://issues.apache.org/jira/browse/YARN-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11106: --- Assignee: Junfan Zhang > Fix the test failure due to missing conf of > yarn.resourcemanager.node-labels.am.default-node-label-expression > - > > Key: YARN-11106 > URL: https://issues.apache.org/jira/browse/YARN-11106 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11101) Fix TestYarnConfigurationFields
[ https://issues.apache.org/jira/browse/YARN-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517854#comment-17517854 ] Junfan Zhang commented on YARN-11101: - Sorry. This has been fixed in [https://github.com/apache/hadoop/pull/4121] [~aajisaka] > Fix TestYarnConfigurationFields > --- > > Key: YARN-11101 > URL: https://issues.apache.org/jira/browse/YARN-11101 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, newbie >Reporter: Akira Ajisaka >Priority: Major > > yarn.resourcemanager.node-labels.am.default-node-label-expression is missing > in yarn-default.xml. > {noformat} > [INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 > s <<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] testCompareConfigurationClassAgainstXml Time elapsed: 0.082 s <<< > FAILURE! > java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration > has 1 variables missing in yarn-default.xml Entries: > yarn.resourcemanager.node-labels.am.default-node-label-expression > expected:<0> but was:<1> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at > org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
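For context on what the failing assertion means in practice, the following sketch loads the same defaults the test compares against: YarnConfiguration picks up the bundled yarn-default.xml plus any yarn-site.xml on the classpath, so a key with no entry anywhere resolves to null (the property name is taken from the report above; everything else is illustrative).

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DefaultXmlCheckSketch {
  public static void main(String[] args) {
    // YarnConfiguration loads yarn-default.xml and yarn-site.xml from the classpath,
    // so a key with no entry in either file resolves to null here.
    YarnConfiguration conf = new YarnConfiguration();
    String key = "yarn.resourcemanager.node-labels.am.default-node-label-expression";
    System.out.println(key + " -> " + conf.get(key));
  }
}
{code}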
[jira] [Created] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
Junfan Zhang created YARN-11111: --- Summary: Recovery failure when node-label configure-type transit from delegated-centralized to centralized Key: YARN-11111 URL: https://issues.apache.org/jira/browse/YARN-11111 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang Assignee: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
[ https://issues.apache.org/jira/browse/YARN-11111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11111: Description: When i > Recovery failure when node-label configure-type transit from > delegated-centralized to centralized > - > > Key: YARN-11111 > URL: https://issues.apache.org/jira/browse/YARN-11111 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > When i -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
[ https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-1: Description: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333) at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) ... 4 more Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61) at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328) ... 
5 more 2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session {code} When i digging into the codebase, found that the node and labels mapping is stored into the nodelabel.mirror file when configured the was:When i > Recovery failure when node-label configure-type transit from > delegated-centralized to centralized > - > > Key: YARN-1 > URL: https://issues.apache.org/jira/browse/YARN-1 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > When i make configure-type from delegated-centralized to centralized in > yarn-site.xml and restart the RM, it failed. > The error stacktrace is as follows > > {code:txt} > 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to
[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
[ https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-1: Description: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333) at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) ... 4 more Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61) at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328) ... 
5 more 2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session {code} When i digging into the codebase, found that the node and labels mapping is stored in the nodelabel.mirror file when configured the type of centralized. However the conf was: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) a
[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
[ https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-1: Description: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333) at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) ... 4 more Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61) at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328) ... 
5 more 2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session {code} When i digging into the codebase, found that the node and labels mapping is stored in the nodelabel.mirror file when configured the type of centralized. So the content of nodelabel.mirror file is as follows 1. the node-label list 2. the node to label mapping (only exist when configured the type of centralized) was: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.
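Since the YARN-11111 fix is still being written up at this point in the thread, the following toy sketch only illustrates the shape of the problem described above: a mirror written under delegated-centralized mode has no node-to-labels section, so recovery must tolerate its absence. It is not the actual patch, and it deliberately avoids the real node-label store classes.

{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Set;

public class MirrorRecoverySketch {
  // Toy guard only, not the actual YARN-11111 fix: if the node-to-labels section
  // is missing from the mirror, replay nothing instead of assuming it exists.
  static Map<String, Set<String>> recoverNodeToLabels(Map<String, Set<String>> sectionFromMirror) {
    return sectionFromMirror == null
        ? Collections.<String, Set<String>>emptyMap()
        : sectionFromMirror;
  }

  public static void main(String[] args) {
    // Simulates recovering a mirror that only contains the node-label list.
    System.out.println(recoverNodeToLabels(null)); // {}
  }
}
{code}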
[jira] [Commented] (YARN-11115) Add configuration to disable AM preemption for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527280#comment-17527280 ] Junfan Zhang commented on YARN-11115: - Sounds good. Do you mind if I take this ticket? [~luoyuan] > Add configuration to disable AM preemption for capacity scheduler > - > > Key: YARN-11115 > URL: https://issues.apache.org/jira/browse/YARN-11115 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Yuan Luo >Priority: Major > > I think it's necessary to add configuration to disable AM preemption for > capacity-scheduler, like fair-scheduler feature: YARN-9537. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11115) Add configuration to disable AM preemption for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11115: --- Assignee: Junfan Zhang > Add configuration to disable AM preemption for capacity scheduler > - > > Key: YARN-11115 > URL: https://issues.apache.org/jira/browse/YARN-11115 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Yuan Luo >Assignee: Junfan Zhang >Priority: Major > > I think it's necessary to add configuration to disable AM preemption for > capacity-scheduler, like fair-scheduler feature: YARN-9537. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
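For readers following YARN-11115, a minimal sketch of how such a switch might look from the configuration side, by analogy with the fair-scheduler feature (YARN-9537) cited above. The property name is hypothetical; the final capacity-scheduler key would be settled in the patch.

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DisableAmPreemptionSketch {
  // Hypothetical key mirroring the fair-scheduler switch from YARN-9537;
  // the capacity-scheduler name would be settled in the YARN-11115 patch.
  private static final String CS_AM_PREEMPTION_ENABLED =
      "yarn.scheduler.capacity.am-preemption.enabled";

  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setBoolean(CS_AM_PREEMPTION_ENABLED, false); // opt AM containers out of preemption
    System.out.println(conf.getBoolean(CS_AM_PREEMPTION_ENABLED, true));
  }
}
{code}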
[jira] [Created] (YARN-9746) Rm should only rewrite the jobConf passed by app when supporting multi-cluster token renew
Junfan Zhang created YARN-9746: -- Summary: Rm should only rewrite the jobConf passed by app when supporting multi-cluster token renew Key: YARN-9746 URL: https://issues.apache.org/jira/browse/YARN-9746 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang This issue links to YARN-5910. For multi-cluster delegation token renewal, the YARN-5910 patch works in most scenarios. But when integrating with Oozie, we encountered some problems. An Oozie job carries multiple delegation tokens, including an HDFS_DELEGATION_TOKEN (an HA token for another cluster) and an MR_DELEGATION_TOKEN (the Oozie MR launcher token). To support renewing the other cluster's token, YARN-5910 was applied and the related config was set. The config is as follows:
{code:xml}
<property>
  <name>mapreduce.job.send-token-conf</name>
  <value>dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$</value>
</property>
<property>
  <name>dfs.nameservices</name>
  <value>hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04</value>
</property>
<property>
  <name>dfs.ha.namenodes.hadoop-clusterB-ns01</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1</name>
  <value>namenode01-clusterB.qiyi.hadoop:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2</name>
  <value>namenode02-clusterB.qiyi.hadoop:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.hadoop-clusterB-ns01</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
{code}
However, the MR_DELEGATION_TOKEN couldn't be renewed because some required config was missing. Although we can set the required configurations through the app, this is not a good idea. So I think the RM should only rewrite part of the jobConf passed by the app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite the jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Attachment: YARN-9746-01.path > Rm should only rewrite the jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.path > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Summary: Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew (was: Rm should only rewrite the jobConf passed by app when supporting multi-cluster token renew) > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.path > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Issue Type: Bug (was: Improvement) > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.path > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Attachment: YARN-9746-01.patch > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.patch > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Attachment: (was: YARN-9746-01.path) > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.patch > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Description: This issue links to YARN-5910. When to support multi-cluster delegation token renew, the path of YARN-5910 works in most scenarios. But when intergrating with Oozie, we encounter some problems. In Oozie having multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew another cluster's token, YARN-5910 was patched and related config was set. The config is as follows {code:xml} mapreduce.job.send-token-conf dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ dfs.nameservices hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 dfs.ha.namenodes.hadoop-clusterB-ns01 nn1,nn2 dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 namenode01-clusterB.hadoop:8020 dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 namenode02-clusterB.hadoop:8020 dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider {code} However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some config. Although we can set the required configurations through the app, this is not a good idea. So i think rm should only rewrite the jobConf passed by app to solve the above situation. was: This issue links to YARN-5910. When to support multi-cluster delegation token renew, the path of YARN-5910 works in most scenarios. But when intergrating with Oozie, we encounter some problems. In Oozie having multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew another cluster's token, YARN-5910 was patched and related config was set. The config is as follows {code:xml} mapreduce.job.send-token-conf dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ dfs.nameservices hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 dfs.ha.namenodes.hadoop-clusterB-ns01 nn1,nn2 dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 namenode01-clusterB.qiyi.hadoop:8020 dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 namenode02-clusterB.qiyi.hadoop:8020 dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider {code} However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some config. Although we can set the required configurations through the app, this is not a good idea. So i think rm should only rewrite the jobConf passed by app to solve the above situation. > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.patch > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. 
In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job
[jira] [Updated] (YARN-9746) RM should merge local config for token renewal
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Summary: RM should merge local config for token renewal (was: Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew) > RM should merge local config for token renewal > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.patch > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
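To illustrate the retitled proposal ("RM should merge local config for token renewal"), here is a minimal sketch of the merge idea, assuming the RM starts from its own local Configuration and overlays only the keys the application shipped via mapreduce.job.send-token-conf. mergeForRenewal is an illustrative helper, not the actual DelegationTokenRenewer code.

{code:java}
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

public class TokenRenewalConfMergeSketch {
  // Illustrative only: begin with the RM's own local configuration (which already
  // knows how to reach the local cluster) and overlay just the keys the application
  // sent along with the job. Not the actual YARN-9746 patch.
  static Configuration mergeForRenewal(Configuration rmLocalConf, Configuration appSentConf) {
    Configuration merged = new Configuration(rmLocalConf);
    for (Map.Entry<String, String> entry : appSentConf) {
      merged.set(entry.getKey(), entry.getValue());
    }
    return merged;
  }

  public static void main(String[] args) {
    Configuration rmLocal = new Configuration();       // picks up local *-site.xml
    Configuration fromApp = new Configuration(false);  // holds only what the app sent
    fromApp.set("dfs.nameservices", "hadoop-clusterB-ns01");
    System.out.println(mergeForRenewal(rmLocal, fromApp).get("dfs.nameservices"));
  }
}
{code}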
[jira] [Updated] (YARN-8382) cgroup file leak in NM
[ https://issues.apache.org/jira/browse/YARN-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-8382: --- Description:
As Jiandan said in YARN-6562, the NM may time out while deleting a container's cgroup files, with logs like below:
org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to delete for 1000ms
We found one situation in which this happens: when *yarn.nodemanager.sleep-delay-before-sigkill.ms* is set bigger than *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*, the cgroup file leak occurs.
One container process tree looks like:
bash(16097)───java(16099)─┬─\{java}(16100)
                          ├─\{java}(16101)
                          ├─\{java}(16102)
When the NM kills a container, it sends kill -15 to the container's process group. The bash process exits when it receives SIGTERM, but the java process may still be doing work (shutdown hooks, etc.) and does not exit until it receives SIGKILL. When the bash process exits, CgroupsLCEResourcesHandler starts trying to delete the cgroup files. So when *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* is reached, the java processes may still be running, cgroup/tasks is still not empty, and the cgroup files leak.
We add a condition that *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must be bigger than *yarn.nodemanager.sleep-delay-before-sigkill.ms* to solve this problem.

was: As Jiandan said in YARN-6562, NM may delete Cgroup container file timeout with logs like below: org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to delete for 1000ms we found one situation is that when we set *yarn.nodemanager.sleep-delay-before-sigkill.ms* bigger than *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*, the cgroup file leak happens *.* One container process tree looks like follow graph: bash(16097)───java(16099)─┬─\{java}(16100) ├─\{java}(16101) {{ ├─\{java}(16102)}} {{when NM kills a container, NM sends kill -15 -pid to kill container process group. Bash process will exit when it received sigterm, but java process may do some job (shutdownHook etc.), and doesn't exit unit receive sigkill. And when bash process exits, CgroupsLCEResourcesHandler begin to try to delete cgroup files. So when *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* arrived, the java processes may still running and cgourp/tasks still not empty and cause a cgroup file leak.}} {{we add a condition that *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must bigger than *yarn.nodemanager.sleep-delay-before-sigkill.ms* to solve this problem.}}

> cgroup file leak in NM
> --
>
> Key: YARN-8382
> URL: https://issues.apache.org/jira/browse/YARN-8382
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Environment: we write a container with a shutdownHook which has a piece of code like "while(true) sleep(100)".
> When *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* < *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file leak happens;
> when *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* > *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file is deleted successfully.
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.4, 2.10.1
>
> Attachments: YARN-8382-branch-2.8.3.001.patch,
> YARN-8382-branch-2.8.3.002.patch, YARN-8382.001.patch, YARN-8382.002.patch
>
>
> As Jiandan said in YARN-6562, the NM may time out while deleting a container's cgroup files, with logs like below:
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to delete for 1000ms
>
> We found one situation in which this happens: when *yarn.nodemanager.sleep-delay-before-sigkill.ms* is set bigger than *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*, the cgroup file leak occurs.
>
> One container process tree looks like:
> bash(16097)───java(16099)─┬─\{java}(16100)
>                           ├─\{java}(16101)
>                           ├─\{java}(16102)
>
> When the NM kills a container, it sends kill -15 to the container's process group. The bash process exits when it receives SIGTERM, but the java process may do some job (shutdownHook etc.) and doesn't exit until it receives SIGKILL.
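A minimal sketch of the kind of configuration check the description proposes; the class and method names here are hypothetical, only the two property keys come from the issue:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.exceptions.YarnRuntimeException;

// Hypothetical helper illustrating the proposed condition: the cgroups
// delete timeout must be bigger than the SIGKILL delay, otherwise the
// cgroup tasks file may still be non-empty when deletion gives up.
public class CgroupTimeoutCheck {
  public static void validate(Configuration conf) {
    // Fallback defaults below are illustrative, not authoritative.
    long deleteTimeoutMs = conf.getLong(
        "yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms",
        1000L);
    long sigkillDelayMs = conf.getLong(
        "yarn.nodemanager.sleep-delay-before-sigkill.ms", 250L);
    if (deleteTimeoutMs <= sigkillDelayMs) {
      throw new YarnRuntimeException(
          "cgroups.delete-timeout-ms (" + deleteTimeoutMs
              + " ms) must be bigger than sleep-delay-before-sigkill.ms ("
              + sigkillDelayMs + " ms), otherwise cgroup files may leak.");
    }
  }
}
{code}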
[jira] [Created] (YARN-11555) Support specifying node attribute for AM
Junfan Zhang created YARN-11555: --- Summary: Support specifying node attribute for AM Key: YARN-11555 URL: https://issues.apache.org/jira/browse/YARN-11555 Project: Hadoop YARN Issue Type: New Feature Components: nodeattibute Reporter: Junfan Zhang Hey community, I want to use node attributes to replace node labels for YARN NMs colocated with k8s. As far as I know, node attributes look more flexible. But I didn't see any support for specifying node attributes for the AM. Am I missing something? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11555) Support specifying node attribute for AM
[ https://issues.apache.org/jira/browse/YARN-11555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757910#comment-17757910 ] Junfan Zhang commented on YARN-11555: - cc [~slfan1989] > Support specifying node attribute for AM > > > Key: YARN-11555 > URL: https://issues.apache.org/jira/browse/YARN-11555 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodeattibute >Reporter: Junfan Zhang >Priority: Major > > Hey community, > I want to use node attributes to replace node labels for YARN NMs colocated > with k8s. As far as I know, node attributes look more flexible. > But I didn't see any support for specifying node attributes for the AM. Am I > missing something? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11555) Support specifying node attribute for AM
[ https://issues.apache.org/jira/browse/YARN-11555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11555: --- Assignee: Junfan Zhang > Support specifying node attribute for AM > > > Key: YARN-11555 > URL: https://issues.apache.org/jira/browse/YARN-11555 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodeattibute >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > Hey community, > I want to use node attributes to replace node labels for YARN NMs colocated > with k8s. As far as I know, node attributes look more flexible. > But I didn't see any support for specifying node attributes for the AM. Am I > missing something? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11559) Can't specify node label in scheduling request in AMRMClient
Junfan Zhang created YARN-11559: --- Summary: Can't specify node label in scheduling request in AMRMClient Key: YARN-11559 URL: https://issues.apache.org/jira/browse/YARN-11559 Project: Hadoop YARN Issue Type: Bug Components: nodeattibute Reporter: Junfan Zhang When trying to use placement constraints with node attributes and node labels, I found that the node label can't be specified in the scheduling request, which means that for each container request the node label takes no effect. BTW, I'm not sure whether {{ApplicationSubmissionContext.setNodeLabelExpression(..)}} takes effect when using {{SchedulingRequest}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
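Following up on YARN-11559 above, a minimal sketch (assuming the placement-constraint API of recent Hadoop 3.x releases) of what a {{SchedulingRequest}} currently lets an AM express, namely a node-attribute constraint; note there is no field for a node label expression on the request itself, which is the gap this issue reports. The attribute name and values below are examples, not prescribed settings.
{code:java}
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.NodeAttributeOpCode;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceSizing;
import org.apache.hadoop.yarn.api.records.SchedulingRequest;
import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints;

public class SchedulingRequestSketch {
  public static SchedulingRequest build() {
    // Node-attribute constraint: place containers only on nodes where
    // nm.yarn.io/lifecycle == reserved (attribute name is just an example).
    PlacementConstraint constraint = PlacementConstraints.build(
        PlacementConstraints.targetNodeAttribute(
            PlacementConstraints.NODE,
            NodeAttributeOpCode.EQ,
            PlacementConstraints.PlacementTargets
                .nodeAttribute("nm.yarn.io/lifecycle", "reserved")));

    // Note: there is no nodeLabelExpression(...) setter here, which is
    // exactly the limitation described in YARN-11559.
    return SchedulingRequest.newBuilder()
        .allocationRequestId(1L)
        .priority(Priority.newInstance(0))
        .allocationTags(Collections.singleton("worker"))
        .placementConstraintExpression(constraint)
        .resourceSizing(ResourceSizing.newInstance(2, Resource.newInstance(1024, 1)))
        .build();
  }
}
{code}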
[jira] [Commented] (YARN-8007) Support specifying placement constraint for task containers in SLS
[ https://issues.apache.org/jira/browse/YARN-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778509#comment-17778509 ] Junfan Zhang commented on YARN-8007: Thanks for proposing this. Can we also involve the {{PlacementConstraintProcessor}} in SLS, not only the AppPlacementAllocator? > Support specifying placement constraint for task containers in SLS > -- > > Key: YARN-8007 > URL: https://issues.apache.org/jira/browse/YARN-8007 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8007.001.patch, YARN-8007.002.patch, > YARN-8007.003.patch > > > YARN-6592 introduces placement constraints. Currently SLS does not support > specifying placement constraints. > To enable better performance testing, we should be able to specify placement > for containers in the SLS configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11597) NPE when getting the static files in SLSWebApp
Junfan Zhang created YARN-11597: --- Summary: NPE when getting the static files in SLSWebApp Key: YARN-11597 URL: https://issues.apache.org/jira/browse/YARN-11597 Project: Hadoop YARN Issue Type: Bug Components: scheduler-load-simulator Affects Versions: 3.3.6 Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11597) NPE when getting the static files in SLSWebApp
[ https://issues.apache.org/jira/browse/YARN-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11597: Attachment: 20231023-171754.jpeg > NPE when getting the static files in SLSWebApp > --- > > Key: YARN-11597 > URL: https://issues.apache.org/jira/browse/YARN-11597 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.3.6 >Reporter: Junfan Zhang >Priority: Major > Attachments: 20231023-171754.jpeg > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11597) NPE when getting the static files in SLSWebApp
[ https://issues.apache.org/jira/browse/YARN-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11597: Description: When using the SLS, the web API at {{http://localhost:10001/simulate}} is broken because static file loading fails with 404. This is caused by the static handler not being initialized. The NPE stacktrace is attached. > NPE when getting the static files in SLSWebApp > --- > > Key: YARN-11597 > URL: https://issues.apache.org/jira/browse/YARN-11597 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.3.6 >Reporter: Junfan Zhang >Priority: Major > Attachments: 20231023-171754.jpeg > > > When using the SLS, the web API at {{http://localhost:10001/simulate}} is > broken because static file loading fails with 404. > This is caused by the static handler not being initialized. The NPE > stacktrace is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
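For context, this is how a Jetty static-file handler is normally wired up; it is a generic sketch of the initialization the NPE suggests is missing, not the actual SLSWebApp code (the resource base is a placeholder, and the port simply mirrors the one in the issue):
{code:java}
import org.eclipse.jetty.server.Handler;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.DefaultHandler;
import org.eclipse.jetty.server.handler.HandlerList;
import org.eclipse.jetty.server.handler.ResourceHandler;

public class StaticHandlerSketch {
  public static void main(String[] args) throws Exception {
    Server server = new Server(10001);

    // Serve the simulator's static js/css; the resource base is a placeholder.
    ResourceHandler staticHandler = new ResourceHandler();
    staticHandler.setResourceBase("html");
    staticHandler.setDirectoriesListed(false);

    // Without registering a handler like this, requests for static files
    // fall through and fail, as described in the issue.
    HandlerList handlers = new HandlerList();
    handlers.setHandlers(new Handler[] { staticHandler, new DefaultHandler() });
    server.setHandler(handlers);

    server.start();
    server.join();
  }
}
{code}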
[jira] [Assigned] (YARN-11597) NPE when getting the static files in SLSWebApp
[ https://issues.apache.org/jira/browse/YARN-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11597: --- Assignee: Junfan Zhang > NPE when getting the static files in SLSWebApp > --- > > Key: YARN-11597 > URL: https://issues.apache.org/jira/browse/YARN-11597 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.3.6 >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > Labels: pull-request-available > Attachments: 20231023-171754.jpeg > > > When using the SLS, the web API at {{http://localhost:10001/simulate}} is > broken because static file loading fails with 404. > This is caused by the static handler not being initialized. The NPE > stacktrace is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10065) Support Placement Constraints for AM container allocations
[ https://issues.apache.org/jira/browse/YARN-10065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778599#comment-17778599 ] Junfan Zhang commented on YARN-10065: - +1 for this feature. > Support Placement Constraints for AM container allocations > -- > > Key: YARN-10065 > URL: https://issues.apache.org/jira/browse/YARN-10065 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.2.0 >Reporter: Daniel Velasquez >Priority: Major > > Currently ApplicationSubmissionContext API supports specifying a node label > expression for the AM resource request. It would be beneficial to have the > ability to specify Placement Constraints as well for the AM resource request. > We have a requirement to constrain AM containers on certain nodes e.g. AM > containers not on preemptible/spot cloud instances. It looks like node > attributes would fit our use case well. However, we currently don't have the > ability to specify this in the API for AM resource requests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8007) Support specifying placement constraint for task containers in SLS
[ https://issues.apache.org/jira/browse/YARN-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778906#comment-17778906 ] Junfan Zhang commented on YARN-8007: If you don't mind, I'd like to pick this up. Feel free to discuss this further. > Support specifying placement constraint for task containers in SLS > -- > > Key: YARN-8007 > URL: https://issues.apache.org/jira/browse/YARN-8007 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8007.001.patch, YARN-8007.002.patch, > YARN-8007.003.patch > > > YARN-6592 introduces placement constraints. Currently SLS does not support > specifying placement constraints. > To enable better performance testing, we should be able to specify placement > for containers in the SLS configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11598) Support unified node label specified in sls-runner.xml
Junfan Zhang created YARN-11598: --- Summary: Support unified node label specified in sls-runner.xml Key: YARN-11598 URL: https://issues.apache.org/jira/browse/YARN-11598 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang In https://issues.apache.org/jira/browse/YARN-8175, node labels are supported via a dedicated node file, which is useful when different labels map to different nodes. But for my requirement of testing node-label scheduling performance where all nodes share the same label, using SYNTH mode, that approach is hard to use. So I want to introduce a unified node label specified in sls-runner.xml, which covers the above requirement. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11599) Incorrect log4j properties file in SLS sample conf
Junfan Zhang created YARN-11599: --- Summary: Incorrect log4j properties file in SLS sample conf Key: YARN-11599 URL: https://issues.apache.org/jira/browse/YARN-11599 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-sls/src/main/sample-conf/log4j.properties log4j.appender.test=org.apache.log4j.ConsoleAppender log4j.appender.test.Target=System.out log4j.appender.test.layout=org.apache.log4j.PatternLayout log4j.appender.test.layout.ConversionPattern=%d\{ABSOLUTE} %5p %c\{1}:%L - %m%n log4j.logger=NONE, test This is invalid for the current log4j version; if it is applied, the test performance will be slow. I think the WARN level is enough and should be required: a log level below WARN will affect performance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11601) Support random queue in SLS SYNTH trace type
Junfan Zhang created YARN-11601: --- Summary: Support random queue in SLS SYNTH trace type Key: YARN-11601 URL: https://issues.apache.org/jira/browse/YARN-11601 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang The queue a job is submitted to affects performance, so it is necessary to support picking a random queue for each job from a specified set of queues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11599) Incorrect log4j properties file in SLS sample conf
[ https://issues.apache.org/jira/browse/YARN-11599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11599: Component/s: scheduler-load-simulator > Incorrect log4j properties file in SLS sample conf > -- > > Key: YARN-11599 > URL: https://issues.apache.org/jira/browse/YARN-11599 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > > https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-sls/src/main/sample-conf/log4j.properties > log4j.appender.test=org.apache.log4j.ConsoleAppender > log4j.appender.test.Target=System.out > log4j.appender.test.layout=org.apache.log4j.PatternLayout > log4j.appender.test.layout.ConversionPattern=%d\{ABSOLUTE} %5p %c\{1}:%L - > %m%n > log4j.logger=NONE, test > > This is invalid for the current log4j version; if it is applied, the test > performance will be slow. > I think the WARN level is enough and should be required: a log level below > WARN will affect performance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11601) Support random queue in SLS SYNTH trace type
[ https://issues.apache.org/jira/browse/YARN-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11601: Component/s: scheduler-load-simulator > Support random queue in SLS SYNTH trace type > > > Key: YARN-11601 > URL: https://issues.apache.org/jira/browse/YARN-11601 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler-load-simulator >Reporter: Junfan Zhang >Priority: Major > > The queue a job is submitted to affects performance, so it is necessary to > support picking a random queue for each job from a specified set of queues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11600) After jetty is upgraded to 9.4.51.v20230217, sls cannot load js/css
[ https://issues.apache.org/jira/browse/YARN-11600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779798#comment-17779798 ] Junfan Zhang commented on YARN-11600: - This has been tracked in https://issues.apache.org/jira/browse/YARN-11597 > After jetty is upgraded to 9.4.51.v20230217, sls cannot load js/css > --- > > Key: YARN-11600 > URL: https://issues.apache.org/jira/browse/YARN-11600 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: yanbin.zhang >Priority: Major > Attachments: image-2023-10-26-09-52-30-975.png > > > !image-2023-10-26-09-52-30-975.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11616) Fast fail when multiple attribute kvs are specified
Junfan Zhang created YARN-11616: --- Summary: Fast fail when multiple attribute kvs are specified Key: YARN-11616 URL: https://issues.apache.org/jira/browse/YARN-11616 Project: Hadoop YARN Issue Type: Bug Components: nodeattibute Reporter: Junfan Zhang The {{NodeConstraintParser}} doesn't throw an exception when multiple attribute key-value pairs are specified. Instead it returns an incorrect placement constraint, which misleads users. For example, {{rm.yarn.io/foo=1,rm.yarn.io/bar=2}} is parsed into {{node,EQ,rm.yarn.io/bar=[1:2]}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
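For comparison, a sketch of how two independent node-attribute requirements can be expressed through the Java placement-constraint API (combined with AND) rather than packed into a single key-value expression; whether a given scheduler path accepts such a composite constraint is a separate question, and the attribute names simply reuse the examples from the issue:
{code:java}
import org.apache.hadoop.yarn.api.records.NodeAttributeOpCode;
import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints;

public class MultiAttributeConstraintSketch {
  public static PlacementConstraint build() {
    // Two separate attribute constraints, combined with AND, instead of a
    // single "rm.yarn.io/foo=1,rm.yarn.io/bar=2" expression that
    // NodeConstraintParser currently mis-parses.
    return PlacementConstraints.build(
        PlacementConstraints.and(
            PlacementConstraints.targetNodeAttribute(
                PlacementConstraints.NODE, NodeAttributeOpCode.EQ,
                PlacementConstraints.PlacementTargets
                    .nodeAttribute("rm.yarn.io/foo", "1")),
            PlacementConstraints.targetNodeAttribute(
                PlacementConstraints.NODE, NodeAttributeOpCode.EQ,
                PlacementConstraints.PlacementTargets
                    .nodeAttribute("rm.yarn.io/bar", "2"))));
  }
}
{code}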
[jira] [Created] (YARN-11617) Noisy log in SingleConstraintAppPlacementAllocator
Junfan Zhang created YARN-11617: --- Summary: Noisy log in SingleConstraintAppPlacementAllocator Key: YARN-11617 URL: https://issues.apache.org/jira/browse/YARN-11617 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Junfan Zhang There are too many noisy log lines from SingleConstraintAppPlacementAllocator, like this: 2023-11-20 15:14:30,493 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.SingleConstraintAppPlacementAllocator: Successfully added SchedulingRequest to app=appattempt_1700464328807_0002_01 placementConstraint=[ node,EQ,nm.yarn.io/lifecycle=[reserved:true]]. nodePartition= -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
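A minimal sketch of the usual remedy for this kind of noise, assuming the fix is simply demoting the per-request message from INFO to DEBUG (the actual patch may differ; the class and method names here are hypothetical):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class QuietAllocatorLogging {
  private static final Logger LOG =
      LoggerFactory.getLogger(QuietAllocatorLogging.class);

  static void onSchedulingRequestAdded(String appAttempt, String constraint,
      String nodePartition) {
    // Demoted from INFO to DEBUG so per-SchedulingRequest messages no longer
    // flood the RM log; parameterized logging avoids building the string
    // when DEBUG is disabled.
    LOG.debug("Successfully added SchedulingRequest to app={} "
        + "placementConstraint=[{}]. nodePartition={}",
        appAttempt, constraint, nodePartition);
  }
}
{code}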