[jira] [Created] (YARN-11659) app submission fast fail with node label when node label is disabled
Junfan Zhang created YARN-11659: --- Summary: app submission fast fail with node label when node label is disabled Key: YARN-11659 URL: https://issues.apache.org/jira/browse/YARN-11659 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11659) app with node label submission should fast fail when node label is disabled
[ https://issues.apache.org/jira/browse/YARN-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11659: --- Assignee: Junfan Zhang > app with node label submission should fast fail when node label is disabled > -- > > Key: YARN-11659 > URL: https://issues.apache.org/jira/browse/YARN-11659 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11659) app with node label submission should fast fail when node label is disabled
[ https://issues.apache.org/jira/browse/YARN-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11659: Summary: app with node label submission should fast fail when node label is disabled (was: app submission fast fail with node label when node label is disabled) > app with node label submission should fast fail when node label is disabled > -- > > Key: YARN-11659 > URL: https://issues.apache.org/jira/browse/YARN-11659 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
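The fast-fail idea above can be sketched as a submission-time check. The following Java sketch is illustrative only and is not the actual YARN-11659 patch; the validator class and its placement in the RM submission path are assumptions made for the example.
{code:java}
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Hypothetical helper: reject a labelled submission up front instead of
// letting the application sit unschedulable after it has been accepted.
public final class NodeLabelSubmissionValidator {

  private NodeLabelSubmissionValidator() {}

  public static void validate(ApplicationSubmissionContext ctx,
                              boolean nodeLabelsEnabled) throws YarnException {
    String label = ctx.getNodeLabelExpression();
    // Fail fast when node labels are disabled but the app still asks for one.
    if (!nodeLabelsEnabled && label != null && !label.trim().isEmpty()) {
      throw new YarnException("Application " + ctx.getApplicationId()
          + " requests node label '" + label
          + "' but node labels are disabled on this cluster");
    }
  }
}
{code}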
[jira] [Created] (YARN-11660) SingleConstraintAppPlacementAllocator performance regression
Junfan Zhang created YARN-11660: --- Summary: SingleConstraintAppPlacementAllocator performance regression Key: YARN-11660 URL: https://issues.apache.org/jira/browse/YARN-11660 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11660) SingleConstraintAppPlacementAllocator performance regression
[ https://issues.apache.org/jira/browse/YARN-11660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11660: --- Assignee: Junfan Zhang > SingleConstraintAppPlacementAllocator performance regression > > > Key: YARN-11660 > URL: https://issues.apache.org/jira/browse/YARN-11660 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
Junfan Zhang created YARN-11668: --- Summary: Potential concurrent modification exception for node attributes of node manager Key: YARN-11668 URL: https://issues.apache.org/jira/browse/YARN-11668 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
[ https://issues.apache.org/jira/browse/YARN-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11668: Description: The RM crashes when encountering the following stacktrace. > Potential concurrent modification exception for node attributes of node > manager > --- > > Key: YARN-11668 > URL: https://issues.apache.org/jira/browse/YARN-11668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > > The RM crashes when encountering the following stacktrace. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
[ https://issues.apache.org/jira/browse/YARN-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11668: Description: The RM crashes when encountering the stacktrace in the attachment. (was: The RM crashes when encountering the following stacktrace.) > Potential concurrent modification exception for node attributes of node > manager > --- > > Key: YARN-11668 > URL: https://issues.apache.org/jira/browse/YARN-11668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: img_v3_029c_55ac6b50-64aa-4cbe-81a0-5f8d22c623fg.jpg > > > The RM crashes when encountering the stacktrace in the attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11668) Potential concurrent modification exception for node attributes of node manager
[ https://issues.apache.org/jira/browse/YARN-11668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11668: Attachment: img_v3_029c_55ac6b50-64aa-4cbe-81a0-5f8d22c623fg.jpg > Potential concurrent modification exception for node attributes of node > manager > --- > > Key: YARN-11668 > URL: https://issues.apache.org/jira/browse/YARN-11668 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: img_v3_029c_55ac6b50-64aa-4cbe-81a0-5f8d22c623fg.jpg > > > The RM crashes when encountering the following stacktrace. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
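The attached stacktrace is not reproduced here. As a generic illustration of how this kind of ConcurrentModificationException arises (plain Java, not the actual ResourceManager code), one thread iterates a shared set of node attributes while another thread mutates it; taking a snapshot copy under a lock avoids the crash.
{code:java}
import java.util.HashSet;
import java.util.Set;

class NodeAttributeHolder {
  private final Set<String> attributes = new HashSet<>();

  // Buggy pattern: iterating the shared set directly can throw
  // ConcurrentModificationException if another thread changes it meanwhile.
  void reportBuggy() {
    for (String attr : attributes) {
      System.out.println(attr);
    }
  }

  // One common fix: take a snapshot under a lock, then iterate the copy.
  void reportSafe() {
    Set<String> snapshot;
    synchronized (attributes) {
      snapshot = new HashSet<>(attributes);
    }
    for (String attr : snapshot) {
      System.out.println(attr);
    }
  }

  // All writers must use the same lock as the snapshot above.
  void update(String attr) {
    synchronized (attributes) {
      attributes.add(attr);
    }
  }
}
{code}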
[jira] [Commented] (YARN-10065) Support Placement Constraints for AM container allocations
[ https://issues.apache.org/jira/browse/YARN-10065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17861090#comment-17861090 ] Junfan Zhang commented on YARN-10065: - I think I can pick this up; I have implemented this in our internal YARN version. > Support Placement Constraints for AM container allocations > -- > > Key: YARN-10065 > URL: https://issues.apache.org/jira/browse/YARN-10065 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.2.0 >Reporter: Daniel Velasquez >Priority: Major > > Currently ApplicationSubmissionContext API supports specifying a node label > expression for the AM resource request. It would be beneficial to have the > ability to specify Placement Constraints as well for the AM resource request. > We have a requirement to constrain AM containers on certain nodes e.g. AM > containers not on preemptible/spot cloud instances. It looks like node > attributes would fit our use case well. However, we currently don't have the > ability to specify this in the API for AM resource requests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11704) Avoid nested 'AND' placement constraint for non tags in scheduling request
[ https://issues.apache.org/jira/browse/YARN-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11704: --- Assignee: Junfan Zhang > Avoid nested 'AND' placement constraint for non tags in scheduling request > -- > > Key: YARN-11704 > URL: https://issues.apache.org/jira/browse/YARN-11704 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacity scheduler >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11704) Avoid nested 'AND' placement constraint for non tags in scheduling request
Junfan Zhang created YARN-11704: --- Summary: Avoid nested 'AND' placement constraint for non tags in scheduling request Key: YARN-11704 URL: https://issues.apache.org/jira/browse/YARN-11704 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
Junfan Zhang created YARN-11728: --- Summary: Scheduling hang when multiple nodes placement is enabled Key: YARN-11728 URL: https://issues.apache.org/jira/browse/YARN-11728 Project: Hadoop YARN Issue Type: Bug Components: capacity scheduler, multi-node-placement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Description: When trying to use multi-node placement to enable a customized multi-node lookup policy, I found this has some problems of > Scheduling hang when multiple nodes placement is enabled > > > Key: YARN-11728 > URL: https://issues.apache.org/jira/browse/YARN-11728 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, multi-node-placement >Reporter: Junfan Zhang >Priority: Major > > When trying to use multi-node placement to enable a customized multi-node > lookup policy, I found this has some problems of -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Description:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-1.png!

was:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}

> Scheduling hang when multiple nodes placement is enabled
>
>
> Key: YARN-11728
> URL: https://issues.apache.org/jira/browse/YARN-11728
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, multi-node-placement
> Reporter: Junfan Zhang
> Priority: Major
> Attachments: screenshot-1.png
>
>
> When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
> Let me describe how to reproduce this problem.
> h2. Preconditions
> 1. Using the capacity-scheduler
> 2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
> h2. How to reproduce
> 1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
> {code:xml}
> <property>
>   <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
> </property>
> {code}
> 2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores.
> If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
> !screenshot-1.png!
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Attachment: (was: screenshot-1.png)
> Scheduling hang when multiple nodes placement is enabled
>
>
> Key: YARN-11728
> URL: https://issues.apache.org/jira/browse/YARN-11728
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, multi-node-placement
> Reporter: Junfan Zhang
> Priority: Major
> Attachments: screenshot-2.png
>
>
> When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
> Let me describe how to reproduce this problem.
> h2. Preconditions
> 1. Using the capacity-scheduler
> 2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
> h2. How to reproduce
> 1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
> {code:xml}
> <property>
>   <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
> </property>
> {code}
> 2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores.
> If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
> !screenshot-2.png! !screenshot-1.png!
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Description:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-2.png! !screenshot-1.png!

was:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-1.png!

> Scheduling hang when multiple nodes placement is enabled
>
>
> Key: YARN-11728
> URL: https://issues.apache.org/jira/browse/YARN-11728
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler, multi-node-placement
> Reporter: Junfan Zhang
> Priority: Major
> Attachments: screenshot-2.png
>
>
> When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
> Let me describe how to reproduce this problem.
> h2. Preconditions
> 1. Using the capacity-scheduler
> 2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
> h2. How to reproduce
> 1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
> {code:xml}
> <property>
>   <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
>   <value>true</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
>   <value>default</value>
> </property>
> <property>
>   <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
> </property>
> {code}
> 2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores.
> If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
> !screenshot-2.png! !screenshot-1.png!
[jira] [Updated] (YARN-11728) Scheduling hang when multiple nodes placement is enabled
[ https://issues.apache.org/jira/browse/YARN-11728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11728: Description:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler with async scheduling enabled
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-2.png!
At this time, if you submit another app to this cluster, you will see that its AM is not allocated any resources.
h2. Why
After digging into YARN's async scheduling logic, I found something strange about multi-node placement. Simply put, the scheduling hang is caused by a single reserved container.
When multi-node placement is enabled, a container selected by the specified policy is not matched with a single candidate NodeManager but with multiple nodes. The order of these nodes is determined by the customized lookup policy; the default is {{ResourceUsageMultiNodeLookupPolicy}}. The policy is managed by the {{MultiNodeSortingManager}}, which uses it to re-sort all of the cluster's healthy nodes at a 1-second interval.
1. Now suppose that in the first second the node order is (node1, node2), and the 97th container (the 1st container is the AM) is reserved on node1.
2. On the next pass, the async scheduling thread finds this reserved container and tries to re-reserve/re-start it. Unfortunately, no existing container will be released.
3. After 1 second, the sorting policy takes effect and re-sorts the node order to (node2, node1). Intuitively, if node1 is full of containers with no free resources, the reserved container could be picked up by another node (like node2). But this is not allowed in YARN, and so the hang happens.
h2. How to fix this
1. If there are multiple candidate nodes, look through all of them until one has enough resources to start the container, instead of reserving
2. Allow other nodes to pick up the reserved container

was:
When trying to use multi-node placement to enable a customized multi-node lookup policy, I found that it has a problem that will hang the scheduling if one container is reserved on one node even though other candidate nodes have enough resources.
Let me describe how to reproduce this problem.
h2. Preconditions
1. Using the capacity-scheduler with async scheduling enabled
2. Starting the Hadoop YARN cluster with at least 2 NodeManagers
h2. How to reproduce
1. Firstly, enable the default node lookup policy of {{ResourceUsageMultiNodeLookupPolicy}} by using the following config options in capacity-scheduler.xml
{code:xml}
<property>
  <name>yarn.scheduler.capacity.multi-node-placement-enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.names</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy</name>
  <value>default</value>
</property>
<property>
  <name>yarn.scheduler.capacity.multi-node-sorting.policy.default.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.ResourceUsageMultiNodeLookupPolicy</value>
</property>
{code}
2. Use Spark to submit an app whose container requests exceed 1 NodeManager's total vcores. If the 2 NodeManagers each have 96 total vcores, and the Spark app requests 100 executor instances with 1 vcore per executor, then the allocation will hang at the 97th container. The RM's log will show lines like this:
!screenshot-2.png!
At this time, if you submit another app to this cluster, you will see that its AM is not allocated any resources.
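A minimal Java sketch of the behaviour described above follows. It is pseudo-logic for illustration only, not the real CapacityScheduler code; the class, method, and field names are invented for the example.
{code:java}
import java.util.List;

class MultiNodeAllocSketch {

  static class Node {
    final String name;
    int freeVcores;
    Node(String name, int freeVcores) { this.name = name; this.freeVcores = freeVcores; }
  }

  // Hang-prone shape: only the first candidate from the sorted list is tried;
  // when it is full, the container is reserved there and the remaining
  // candidates (which may have free resources) are never considered.
  static String allocateOrReserveBuggy(List<Node> sortedCandidates, int askVcores) {
    Node first = sortedCandidates.get(0);
    if (first.freeVcores >= askVcores) {
      first.freeVcores -= askVcores;
      return "ALLOCATED on " + first.name;
    }
    return "RESERVED on " + first.name;   // other candidates are never tried
  }

  // Direction suggested in the description: try every candidate before
  // falling back to a reservation, so a full node1 does not block node2.
  static String allocateOrReserveFixed(List<Node> sortedCandidates, int askVcores) {
    for (Node n : sortedCandidates) {
      if (n.freeVcores >= askVcores) {
        n.freeVcores -= askVcores;
        return "ALLOCATED on " + n.name;
      }
    }
    return "RESERVED on " + sortedCandidates.get(0).name;
  }
}
{code}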
[jira] [Commented] (YARN-11115) Add configuration to disable AM preemption for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541786#comment-17541786 ] Junfan Zhang commented on YARN-11115: - Sorry for the late reply. Feel free to take it. [~groot] Looking forward to your patch. > Add configuration to disable AM preemption for capacity scheduler > - > > Key: YARN-11115 > URL: https://issues.apache.org/jira/browse/YARN-11115 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Yuan Luo >Assignee: Ashutosh Gupta >Priority: Major > > I think it's necessary to add a configuration to disable AM preemption for > the capacity scheduler, like the fair-scheduler feature: YARN-9537. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11164) PartitionQueueMetrics support more metrics
[ https://issues.apache.org/jira/browse/YARN-11164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11164: --- Assignee: Junfan Zhang > PartitionQueueMetrics support more metrics > -- > > Key: YARN-11164 > URL: https://issues.apache.org/jira/browse/YARN-11164 > Project: Hadoop YARN > Issue Type: Improvement > Components: metrics >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > When node labels are enabled with the capacity scheduler, the partition queue > metrics are missing a lot of metrics compared with {{QueueMetrics}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11164) PartitionQueueMetrics support more metrics
Junfan Zhang created YARN-11164: --- Summary: PartitionQueueMetrics support more metrics Key: YARN-11164 URL: https://issues.apache.org/jira/browse/YARN-11164 Project: Hadoop YARN Issue Type: Improvement Components: metrics Reporter: Junfan Zhang When node labels are enabled with the capacity scheduler, the partition queue metrics are missing a lot of metrics compared with {{QueueMetrics}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11179) Show more detailed info when container token is expired
Junfan Zhang created YARN-11179: --- Summary: Show more detailed info when container token is expired Key: YARN-11179 URL: https://issues.apache.org/jira/browse/YARN-11179 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang There is no appId in the log about failing to start containers when the container token is expired. This makes it hard to solve the error. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11179) Show more detailed info when container token is expired
[ https://issues.apache.org/jira/browse/YARN-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11179: Description: There is no appId in the log about failing to start containers when the container token is expired. This makes it hard to troubleshoot. (was: There is no appId in the log about failing to start containers when the container token is expired. This makes it hard to solve the error.) > Show more detailed info when container token is expired > --- > > Key: YARN-11179 > URL: https://issues.apache.org/jira/browse/YARN-11179 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > There is no appId in the log about failing to start containers when the > container token is expired. This makes it hard to troubleshoot. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11084) Introduce new config to specify AM default node-label when not specified
Junfan Zhang created YARN-11084: --- Summary: Introduce new config to specify AM default node-label when not specified Key: YARN-11084 URL: https://issues.apache.org/jira/browse/YARN-11084 Project: Hadoop YARN Issue Type: New Feature Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11084) Introduce new config to specify AM default node-label when not specified
[ https://issues.apache.org/jira/browse/YARN-11084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11084: Description: h2. What When submitting an application to YARN and the user doesn't specify any node label on the AM request or {{ApplicationSubmissionContext}}, we hope that YARN could provide a default AM node label. h2. Why Our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). To prevent application instability due to elastic NM decommission, we hope that the AM of a job can be allocated to on-premise NMs. > Introduce new config to specify AM default node-label when not specified > > > Key: YARN-11084 > URL: https://issues.apache.org/jira/browse/YARN-11084 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Priority: Major > > h2. What > When submitting an application to YARN and the user doesn't specify any node label > on the AM request or {{ApplicationSubmissionContext}}, we hope that YARN could > provide a default AM node label. > > h2. Why > Our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). To prevent application instability due to > elastic NM decommission, we hope that the AM of a job can be allocated to > on-premise NMs. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
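A rough sketch of the proposed fallback follows. The property name yarn.resourcemanager.am.default-node-label is hypothetical and used only for illustration; the real key and the exact hook point in the RM would be decided in the patch review.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

public final class DefaultAmNodeLabel {

  // Hypothetical property name, for illustration only.
  public static final String AM_DEFAULT_NODE_LABEL =
      "yarn.resourcemanager.am.default-node-label";

  private DefaultAmNodeLabel() {}

  public static void applyDefaultIfMissing(ApplicationSubmissionContext ctx,
                                           Configuration conf) {
    String requested = ctx.getNodeLabelExpression();
    if (requested == null || requested.isEmpty()) {
      String fallback = conf.get(AM_DEFAULT_NODE_LABEL, "");
      if (!fallback.isEmpty()) {
        // e.g. fall back to an "on-premise" partition for the AM container
        ctx.setNodeLabelExpression(fallback);
      }
    }
  }
}
{code}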
[jira] [Created] (YARN-11086) Add space in debug log of ParentQueue
Junfan Zhang created YARN-11086: --- Summary: Add space in debug log of ParentQueue Key: YARN-11086 URL: https://issues.apache.org/jira/browse/YARN-11086 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11087) Introduce the config to control the refresh interval in RMNodeLabelsMappingProvider
Junfan Zhang created YARN-11087: --- Summary: Introduce the config to control the refresh interval in RMNodeLabelsMappingProvider Key: YARN-11087 URL: https://issues.apache.org/jira/browse/YARN-11087 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11087: Summary: Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater (was: Introduce the config to control the refresh interval in RMNodeLabelsMappingProvider) > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11087: Description: h3. Why When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once a newly registered node comes in, its node label won't be attached until the node-label mapping provider is triggered, and the delay depends on the scheduler interval. h3. How to solve this bug > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > h3. Why > When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once > a newly registered node comes in, its node label won't be attached until the > node-label mapping provider is triggered, and the delay depends on the scheduler > interval. > h3. How to solve this bug -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11087: Issue Type: Bug (was: Improvement) > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > h3. Why > When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once > a newly registered node comes in, its node label won't be attached until the > node-label mapping provider is triggered, and the delay depends on the scheduler > interval. > h3. How to solve this bug -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11087: Description: h3. Why When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once a newly registered node comes in, its node label won't be attached until the node-label mapping provider is triggered, and the delay depends on the scheduler interval. h3. How to solve this bug I think there are two options # Introduce a new config to specify the update-node-label schedule interval. If you want newly registered nodes to be refreshed quickly, decrease the interval. # Once a newly registered node comes in, directly trigger the execution of the node-label mapping provider. But if the provider is a time-consuming operation and lots of nodes register with the RM at the same time, some nodes' labels will still be delayed. I prefer the first option and have submitted a PR to solve this. Feel free to discuss if you have any ideas. was: h3. Why When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once a newly registered node comes in, its node label won't be attached until the node-label mapping provider is triggered, and the delay depends on the scheduler interval. h3. How to solve this bug > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > h3. Why > When configuring the nodes-to-labels mapping in Delegated-Centralized mode, once > a newly registered node comes in, its node label won't be attached until the > node-label mapping provider is triggered, and the delay depends on the scheduler > interval. > h3. How to solve this bug > I think there are two options > # Introduce a new config to specify the update-node-label schedule interval. > If you want newly registered nodes to be refreshed quickly, decrease the interval. > # Once a newly registered node comes in, directly trigger the execution of the > node-label mapping provider. But if the provider is a time-consuming operation > and lots of nodes register with the RM at the same time, some nodes' labels will > still be delayed. > I prefer the first option and have submitted a PR to solve this. > Feel free to discuss if you have any ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
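A rough sketch of option 1 follows. The property name and default used here are hypothetical, for illustration only; the sketch simply shows the updater's polling interval being read from configuration instead of being hard-coded.
{code:java}
import org.apache.hadoop.conf.Configuration;

public final class DelegatedNodeLabelsRefresh {

  // Hypothetical key and default, for illustration only.
  public static final String UPDATE_INTERVAL_MS =
      "yarn.resourcemanager.node-labels.delegated.update-interval-ms";
  public static final long DEFAULT_UPDATE_INTERVAL_MS = 30 * 1000L;

  private DelegatedNodeLabelsRefresh() {}

  public static long resolveInterval(Configuration conf) {
    // A smaller value attaches labels to newly registered nodes sooner,
    // at the cost of invoking the mapping provider more often.
    return conf.getLong(UPDATE_INTERVAL_MS, DEFAULT_UPDATE_INTERVAL_MS);
  }
}
{code}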
[jira] [Created] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
Junfan Zhang created YARN-11088: --- Summary: Introduce the config to control the AM allocated to non-exclusive nodes Key: YARN-11088 URL: https://issues.apache.org/jira/browse/YARN-11088 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce the new config to control the was: h4. What When submitting an application to YARN and the user doesn't specify any node label on the AM request or ApplicationSubmissionContext, we hope that YARN could provide a default AM node label. h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. Our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce the new config to control the -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. What When submitting an application to YARN and the user doesn't specify any node label on the AM request or ApplicationSubmissionContext, we hope that YARN could provide a default AM node label. h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. Our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. What > When submitting an application to YARN and the user doesn't specify any node label > on the AM request or ApplicationSubmissionContext, we hope that YARN could provide > a default AM node label. > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > Our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. Feel free to discuss if you have any ideas! was: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce the new config to control the > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > Feel free to discuss if you have any ideas! -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. *Feel free to discuss if you have any ideas!* was: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. Feel free to discuss if you have any ideas! > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if you have any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11089) Fix typo in rm audit log
Junfan Zhang created YARN-11089: --- Summary: Fix typo in rm audit log Key: YARN-11089 URL: https://issues.apache.org/jira/browse/YARN-11089 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11088: Description: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When all the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. *Feel free to discuss if you have any ideas!* was: h4. Why Currently, YARN's implementation of AM allocation on non-exclusive nodes is to fail fast directly. I know this aims to keep jobs stable, because containers on non-exclusive nodes can be preempted. But our company's internal YARN cluster contains both on-premise NodeManagers and elastic NodeManagers (built on K8s). When the elastic NodeManagers decommission, we hope that the AM can be scheduled to non-exclusive nodes. h4. How to support it Introduce a new config to control whether the AM can be allocated to non-exclusive nodes. *Feel free to discuss if you have any ideas!* > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When all the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if you have any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
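A rough sketch of the proposed switch follows. The flag name is hypothetical, for illustration only; in a real patch the decision would live in the scheduler's AM allocation path rather than in a standalone helper like this.
{code:java}
import org.apache.hadoop.conf.Configuration;

public final class AmNonExclusivePolicy {

  // Hypothetical key, for illustration only.
  public static final String AM_ALLOWED_ON_NON_EXCLUSIVE =
      "yarn.scheduler.capacity.am.allow-non-exclusive-allocation";

  private AmNonExclusivePolicy() {}

  public static boolean amAllowedOnNonExclusivePartition(Configuration conf) {
    // Default false keeps today's fail-fast behaviour; opting in lets the AM
    // land on a non-exclusive (sharable) partition when labelled nodes are gone.
    return conf.getBoolean(AM_ALLOWED_ON_NON_EXCLUSIVE, false);
  }
}
{code}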
[jira] [Commented] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506703#comment-17506703 ] Junfan Zhang commented on YARN-11088: - Could you help check this feature? [~quapaw] [~tdomok] > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When all the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if you have any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506703#comment-17506703 ] Junfan Zhang edited comment on YARN-11088 at 3/15/22, 4:56 AM: --- Could you help check this feature? [~quapaw] [~tdomok]. If OK, please assign it to me. was (Author: zuston): Could you help check this feature? [~quapaw] [~tdomok] > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > > h4. Why > Currently, YARN's implementation of AM allocation on non-exclusive nodes is to > fail fast directly. I know this aims to keep jobs stable, because containers on > non-exclusive nodes can be preempted. > But our company's internal YARN cluster contains both on-premise NodeManagers and > elastic NodeManagers (built on K8s). When all the elastic NodeManagers decommission, > we hope that the AM can be scheduled to non-exclusive nodes. > h4. How to support it > Introduce a new config to control whether the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if you have any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11088) Introduce the config to control the AM allocated to non-exclusive nodes
[ https://issues.apache.org/jira/browse/YARN-11088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509902#comment-17509902 ] Junfan Zhang commented on YARN-11088: - I will submit PR tomorrow and it has been applied in our internal Yarn. Glad to contribute to the community. [~quapaw] > Introduce the config to control the AM allocated to non-exclusive nodes > --- > > Key: YARN-11088 > URL: https://issues.apache.org/jira/browse/YARN-11088 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > h4. Why > Current the implementation of Yarn about AM allocation on non-exclusive nodes > is directly to fail fast. I know this aims to keep the stability of job, > because the container in non-exclusive nodes will be preempted. > But Yarn cluster in our internal company exists on-premise NodeManagers and > elastic NodeManagers (which is built on K8s). When all the elastic > nodemanagers decommission, we hope that the AM can be scheduled to > non-exclusive nodes. > h4. How to support it > Introduce the new config to control the AM can be allocated to non-exclusive > nodes. > *Feel free to discuss if having any ideas!* -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509903#comment-17509903 ] Junfan Zhang edited comment on YARN-11087 at 3/21/22, 1:59 PM: --- What do u think of the second option? [~quapaw] was (Author: zuston): What do u think of the second option? [~snemeth] > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > h3. Why > When configuring nodes to labels mapping by Delegated-Centralized mode, once > the newly registered nodes comes, the node-label of this node wont be > attached until triggering the nodelabel mapping provider, which the delayed > time depends on the scheduler interval. > h3. How to solve this bug > I think there are two options > # Introduce the new config to specify the update-node-label schedule > interval. If u want to quickly refresh the newly registered nodes, user > should decrease the interval. > # Once the newly registered node come, directly trigger the execution of > nodelabel mapping provider. But if the provider is the time-consuming > operation and lots of nodes register to RM at the same time, this will also > make some nodes with node-label delay. > I prefer the first option and submit the PR to solve this. > Feel free to discuss if having any ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11087) Introduce the config to control the refresh interval in RMDelegatedNodeLabelsUpdater
[ https://issues.apache.org/jira/browse/YARN-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509903#comment-17509903 ] Junfan Zhang commented on YARN-11087: - What do u think of the second option? [~snemeth] > Introduce the config to control the refresh interval in > RMDelegatedNodeLabelsUpdater > > > Key: YARN-11087 > URL: https://issues.apache.org/jira/browse/YARN-11087 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > h3. Why > When configuring nodes to labels mapping by Delegated-Centralized mode, once > the newly registered nodes comes, the node-label of this node wont be > attached until triggering the nodelabel mapping provider, which the delayed > time depends on the scheduler interval. > h3. How to solve this bug > I think there are two options > # Introduce the new config to specify the update-node-label schedule > interval. If u want to quickly refresh the newly registered nodes, user > should decrease the interval. > # Once the newly registered node come, directly trigger the execution of > nodelabel mapping provider. But if the provider is the time-consuming > operation and lots of nodes register to RM at the same time, this will also > make some nodes with node-label delay. > I prefer the first option and submit the PR to solve this. > Feel free to discuss if having any ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
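As a rough illustration of option 1 from the YARN-11087 description, a deployment could shrink the delegated-centralized refresh interval through a dedicated property. The key below is hypothetical, standing in for whatever name the patch finally introduces; the Configuration calls themselves are standard Hadoop API.

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeLabelsRefreshIntervalSketch {
  // Hypothetical key for the RMDelegatedNodeLabelsUpdater refresh interval.
  private static final String UPDATE_INTERVAL_MS =
      "yarn.resourcemanager.delegated-node-labels-updater.update-interval-ms";

  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    // A smaller interval lets newly registered nodes pick up their labels sooner,
    // at the cost of invoking the node-label mapping provider more often.
    conf.setLong(UPDATE_INTERVAL_MS, 5000L);
    System.out.println(conf.getLong(UPDATE_INTERVAL_MS, 30000L));
  }
}
{code}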
[jira] [Created] (YARN-11099) Limit the resources usage of non-exclusive allocation
Junfan Zhang created YARN-11099: --- Summary: Limit the resources usage of non-exclusive allocation Key: YARN-11099 URL: https://issues.apache.org/jira/browse/YARN-11099 Project: Hadoop YARN Issue Type: New Feature Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11099: Description: In current non-exclusive allocation, there is no limitation of resource usage. related code link: But in our internal hadoop, we hope the resource usage of non-exclusive allocation can be limited to the {{Effective Max Capacity}} > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. related code link: > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11099: Description: In the current non-exclusive allocation path, there is no limit on resource usage. [related code link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] But in our internal Hadoop deployment, we would like the resource usage of non-exclusive allocation to be capped at the {{Effective Max Capacity}}. was: In current non-exclusive allocation, there is no limitation of resource usage. related code link: But in our internal hadoop, we hope the resource usage of non-exclusive allocation can be limited to the {{Effective Max Capacity}} > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
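To make the YARN-11099 proposal concrete, here is a minimal sketch of the kind of guard being asked for, assuming illustrative inputs rather than the real AbstractCSQueue fields (borrowedUsed, requested and effectiveMax are stand-ins); the Resource and Resources utilities are standard YARN API.

{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class NonExclusiveLimitSketch {
  // Returns true if granting the request on a non-exclusive partition keeps the
  // queue's borrowed usage within its effective max capacity. All three inputs
  // are illustrative, not real scheduler state.
  static boolean withinEffectiveMax(Resource borrowedUsed, Resource requested,
      Resource effectiveMax) {
    Resource afterAllocation = Resources.add(borrowedUsed, requested);
    return Resources.fitsIn(afterAllocation, effectiveMax);
  }

  public static void main(String[] args) {
    Resource used = Resource.newInstance(6 * 1024, 6);
    Resource ask = Resource.newInstance(2 * 1024, 2);
    Resource max = Resource.newInstance(8 * 1024, 8);
    System.out.println(withinEffectiveMax(used, ask, max)); // true: exactly at the cap
  }
}
{code}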
[jira] [Commented] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511899#comment-17511899 ] Junfan Zhang commented on YARN-11099: - Do u have any ideas on it? [~quapaw] > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11099: --- Assignee: Junfan Zhang > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5464) Server-Side NM Graceful Decommissioning with RM HA
[ https://issues.apache.org/jira/browse/YARN-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511902#comment-17511902 ] Junfan Zhang commented on YARN-5464: Any update on it? [~shuzirra] , [~brahmareddy] ,[~quapaw] This PR meets our internal requirement and hope it can be merged into trunk. > Server-Side NM Graceful Decommissioning with RM HA > -- > > Key: YARN-5464 > URL: https://issues.apache.org/jira/browse/YARN-5464 > Project: Hadoop YARN > Issue Type: Sub-task > Components: graceful, yarn >Reporter: Robert Kanter >Assignee: Gergely Pollák >Priority: Major > Attachments: YARN-5464.001.patch, YARN-5464.002.patch, > YARN-5464.003.patch, YARN-5464.004.patch, YARN-5464.005.patch, > YARN-5464.006.patch, YARN-5464.wip.patch > > > Make sure to remove the note added by YARN-7094 about RM HA failover not > working right. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511899#comment-17511899 ] Junfan Zhang edited comment on YARN-11099 at 3/24/22, 3:04 PM: --- Do u have any ideas on it? [~quapaw]. If OK, i will go ahead. was (Author: zuston): Do u have any ideas on it? [~quapaw] > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511899#comment-17511899 ] Junfan Zhang edited comment on YARN-11099 at 3/24/22, 3:04 PM: --- Do u have any ideas on it? [~quapaw]. If OK, i will go ahead. was (Author: zuston): Do u have any ideas on it? [~quapaw]. If OK, i will go ahead. > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11099) Limit the resources usage of non-exclusive allocation
[ https://issues.apache.org/jira/browse/YARN-11099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514160#comment-17514160 ] Junfan Zhang commented on YARN-11099: - + [~bteke] > Limit the resources usage of non-exclusive allocation > - > > Key: YARN-11099 > URL: https://issues.apache.org/jira/browse/YARN-11099 > Project: Hadoop YARN > Issue Type: New Feature >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > In current non-exclusive allocation, there is no limitation of resource > usage. [related code > link|https://github.com/apache/hadoop/blob/077c6c62d6c1ed89e209449a5f9c5849b05e7dff/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java#L783] > But in our internal hadoop, we hope the resource usage of non-exclusive > allocation can be limited to the {{Effective Max Capacity}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11106) Fix the test failure due to missing conf of yarn.resourcemanager.node-labels.am.default-node-label-expression
Junfan Zhang created YARN-11106: --- Summary: Fix the test failure due to missing conf of yarn.resourcemanager.node-labels.am.default-node-label-expression Key: YARN-11106 URL: https://issues.apache.org/jira/browse/YARN-11106 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11106) Fix the test failure due to missing conf of yarn.resourcemanager.node-labels.am.default-node-label-expression
[ https://issues.apache.org/jira/browse/YARN-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11106: --- Assignee: Junfan Zhang > Fix the test failure due to missing conf of > yarn.resourcemanager.node-labels.am.default-node-label-expression > - > > Key: YARN-11106 > URL: https://issues.apache.org/jira/browse/YARN-11106 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11101) Fix TestYarnConfigurationFields
[ https://issues.apache.org/jira/browse/YARN-11101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517854#comment-17517854 ] Junfan Zhang commented on YARN-11101: - Sorry. This has been fixed in [https://github.com/apache/hadoop/pull/4121] [~aajisaka] > Fix TestYarnConfigurationFields > --- > > Key: YARN-11101 > URL: https://issues.apache.org/jira/browse/YARN-11101 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, newbie >Reporter: Akira Ajisaka >Priority: Major > > yarn.resourcemanager.node-labels.am.default-node-label-expression is missing > in yarn-default.xml. > {noformat} > [INFO] Running org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] Tests run: 4, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.533 > s <<< FAILURE! - in org.apache.hadoop.yarn.conf.TestYarnConfigurationFields > [ERROR] testCompareConfigurationClassAgainstXml Time elapsed: 0.082 s <<< > FAILURE! > java.lang.AssertionError: class org.apache.hadoop.yarn.conf.YarnConfiguration > has 1 variables missing in yarn-default.xml Entries: > yarn.resourcemanager.node-labels.am.default-node-label-expression > expected:<0> but was:<1> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at > org.apache.hadoop.conf.TestConfigurationFieldsBase.testCompareConfigurationClassAgainstXml(TestConfigurationFieldsBase.java:493) > {noformat} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
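For context on what the failing assertion means in practice, the following sketch loads the same defaults the test compares against: YarnConfiguration picks up the bundled yarn-default.xml plus any yarn-site.xml on the classpath, so a key with no entry anywhere resolves to null (the property name is taken from the report above; everything else is illustrative).

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DefaultXmlCheckSketch {
  public static void main(String[] args) {
    // YarnConfiguration loads yarn-default.xml and yarn-site.xml from the classpath,
    // so a key with no entry in either file resolves to null here.
    YarnConfiguration conf = new YarnConfiguration();
    String key = "yarn.resourcemanager.node-labels.am.default-node-label-expression";
    System.out.println(key + " -> " + conf.get(key));
  }
}
{code}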
[jira] [Created] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
Junfan Zhang created YARN-11111: --- Summary: Recovery failure when node-label configure-type transit from delegated-centralized to centralized Key: YARN-11111 URL: https://issues.apache.org/jira/browse/YARN-11111 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang Assignee: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
[ https://issues.apache.org/jira/browse/YARN-11111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11111: Description: When i > Recovery failure when node-label configure-type transit from > delegated-centralized to centralized > - > > Key: YARN-11111 > URL: https://issues.apache.org/jira/browse/YARN-11111 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > When i -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
[ https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-1: Description: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333) at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) ... 4 more Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61) at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328) ... 
5 more 2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session {code} When i digging into the codebase, found that the node and labels mapping is stored into the nodelabel.mirror file when configured the was:When i > Recovery failure when node-label configure-type transit from > delegated-centralized to centralized > - > > Key: YARN-1 > URL: https://issues.apache.org/jira/browse/YARN-1 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > When i make configure-type from delegated-centralized to centralized in > yarn-site.xml and restart the RM, it failed. > The error stacktrace is as follows > > {code:txt} > 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: > Exception handling the winning of election > org.apache.hadoop.ha.ServiceFailedException: RM could not transition to
[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
[ https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-1: Description: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333) at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) ... 4 more Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61) at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328) ... 
5 more 2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session {code} When i digging into the codebase, found that the node and labels mapping is stored in the nodelabel.mirror file when configured the type of centralized. However the conf was: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) a
[jira] [Updated] (YARN-11111) Recovery failure when node-label configure-type transit from delegated-centralized to centralized
[ https://issues.apache.org/jira/browse/YARN-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-1: Description: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:476) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508) Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:333) at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144) ... 4 more Caused by: java.lang.NullPointerException at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.initNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:61) at org.apache.hadoop.yarn.server.api.protocolrecords.impl.pb.ReplaceLabelsOnNodeRequestPBImpl.getNodeToLabels(ReplaceLabelsOnNodeRequestPBImpl.java:138) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:76) at org.apache.hadoop.yarn.nodelabels.store.op.NodeLabelMirrorOp.recover(NodeLabelMirrorOp.java:41) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.loadFromMirror(AbstractFSNodeStore.java:120) at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:149) at org.apache.hadoop.yarn.nodelabels.FileSystemNodeLabelsStore.recover(FileSystemNodeLabelsStore.java:106) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.initNodeLabelStore(CommonNodeLabelsManager.java:252) at org.apache.hadoop.yarn.nodelabels.CommonNodeLabelsManager.serviceStart(CommonNodeLabelsManager.java:266) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:910) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1278) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1319) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1315) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1315) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:328) ... 
5 more 2022-04-13 14:44:14,886 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session {code} When i digging into the codebase, found that the node and labels mapping is stored in the nodelabel.mirror file when configured the type of centralized. So the content of nodelabel.mirror file is as follows 1. the node-label list 2. the node to label mapping (only exist when configured the type of centralized) was: When i make configure-type from delegated-centralized to centralized in yarn-site.xml and restart the RM, it failed. The error stacktrace is as follows {code:txt} 2022-04-13 14:44:14,885 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146) at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:901) at org.apache.hadoop.ha.ActiveStandbyElector.
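Since the YARN-11111 fix is still being written up at this point in the thread, the following toy sketch only illustrates the shape of the problem described above: a mirror written under delegated-centralized mode has no node-to-labels section, so recovery must tolerate its absence. It is not the actual patch, and it deliberately avoids the real node-label store classes.

{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Set;

public class MirrorRecoverySketch {
  // Toy guard only, not the actual YARN-11111 fix: if the node-to-labels section
  // is missing from the mirror, replay nothing instead of assuming it exists.
  static Map<String, Set<String>> recoverNodeToLabels(Map<String, Set<String>> sectionFromMirror) {
    return sectionFromMirror == null
        ? Collections.<String, Set<String>>emptyMap()
        : sectionFromMirror;
  }

  public static void main(String[] args) {
    // Simulates recovering a mirror that only contains the node-label list.
    System.out.println(recoverNodeToLabels(null)); // {}
  }
}
{code}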
[jira] [Commented] (YARN-11115) Add configuration to disable AM preemption for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527280#comment-17527280 ] Junfan Zhang commented on YARN-11115: - Sounds good. Do you mind if I take this ticket? [~luoyuan] > Add configuration to disable AM preemption for capacity scheduler > - > > Key: YARN-11115 > URL: https://issues.apache.org/jira/browse/YARN-11115 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Yuan Luo >Priority: Major > > I think it's necessary to add configuration to disable AM preemption for > capacity-scheduler, like fair-scheduler feature: YARN-9537. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11115) Add configuration to disable AM preemption for capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11115: --- Assignee: Junfan Zhang > Add configuration to disable AM preemption for capacity scheduler > - > > Key: YARN-11115 > URL: https://issues.apache.org/jira/browse/YARN-11115 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Reporter: Yuan Luo >Assignee: Junfan Zhang >Priority: Major > > I think it's necessary to add configuration to disable AM preemption for > capacity-scheduler, like fair-scheduler feature: YARN-9537. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
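For readers following YARN-11115, a minimal sketch of how such a switch might look from the configuration side, by analogy with the fair-scheduler feature (YARN-9537) cited above. The property name is hypothetical; the final capacity-scheduler key would be settled in the patch.

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DisableAmPreemptionSketch {
  // Hypothetical key mirroring the fair-scheduler switch from YARN-9537;
  // the capacity-scheduler name would be settled in the YARN-11115 patch.
  private static final String CS_AM_PREEMPTION_ENABLED =
      "yarn.scheduler.capacity.am-preemption.enabled";

  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();
    conf.setBoolean(CS_AM_PREEMPTION_ENABLED, false); // opt AM containers out of preemption
    System.out.println(conf.getBoolean(CS_AM_PREEMPTION_ENABLED, true));
  }
}
{code}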
[jira] [Created] (YARN-9746) Rm should only rewrite the jobConf passed by app when supporting multi-cluster token renew
Junfan Zhang created YARN-9746: -- Summary: Rm should only rewrite the jobConf passed by app when supporting multi-cluster token renew Key: YARN-9746 URL: https://issues.apache.org/jira/browse/YARN-9746 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang This issue links to YARN-5910. For multi-cluster delegation token renewal, the YARN-5910 patch works in most scenarios. But when integrating with Oozie, we encountered some problems. An Oozie job carries multiple delegation tokens, including an HDFS_DELEGATION_TOKEN (an HA token for another cluster) and an MR_DELEGATION_TOKEN (the Oozie MR launcher token). To support renewing the other cluster's token, YARN-5910 was applied and the related config was set. The config is as follows:
{code:xml}
<property>
  <name>mapreduce.job.send-token-conf</name>
  <value>dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$</value>
</property>
<property>
  <name>dfs.nameservices</name>
  <value>hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04</value>
</property>
<property>
  <name>dfs.ha.namenodes.hadoop-clusterB-ns01</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1</name>
  <value>namenode01-clusterB.qiyi.hadoop:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2</name>
  <value>namenode02-clusterB.qiyi.hadoop:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.hadoop-clusterB-ns01</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
{code}
However, the MR_DELEGATION_TOKEN couldn't be renewed because some required config was missing. Although we can set the required configurations through the app, this is not a good idea. So I think the RM should only rewrite part of the jobConf passed by the app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite the jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Attachment: YARN-9746-01.path > Rm should only rewrite the jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.path > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Summary: Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew (was: Rm should only rewrite the jobConf passed by app when supporting multi-cluster token renew) > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.path > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Issue Type: Bug (was: Improvement) > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.path > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Attachment: YARN-9746-01.patch > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.patch > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Attachment: (was: YARN-9746-01.path) > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.patch > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.qiyi.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.qiyi.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9746) Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Description: This issue links to YARN-5910. When to support multi-cluster delegation token renew, the path of YARN-5910 works in most scenarios. But when intergrating with Oozie, we encounter some problems. In Oozie having multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew another cluster's token, YARN-5910 was patched and related config was set. The config is as follows {code:xml} mapreduce.job.send-token-conf dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ dfs.nameservices hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 dfs.ha.namenodes.hadoop-clusterB-ns01 nn1,nn2 dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 namenode01-clusterB.hadoop:8020 dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 namenode02-clusterB.hadoop:8020 dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider {code} However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some config. Although we can set the required configurations through the app, this is not a good idea. So i think rm should only rewrite the jobConf passed by app to solve the above situation. was: This issue links to YARN-5910. When to support multi-cluster delegation token renew, the path of YARN-5910 works in most scenarios. But when intergrating with Oozie, we encounter some problems. In Oozie having multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew another cluster's token, YARN-5910 was patched and related config was set. The config is as follows {code:xml} mapreduce.job.send-token-conf dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ dfs.nameservices hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 dfs.ha.namenodes.hadoop-clusterB-ns01 nn1,nn2 dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 namenode01-clusterB.qiyi.hadoop:8020 dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 namenode02-clusterB.qiyi.hadoop:8020 dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider {code} However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some config. Although we can set the required configurations through the app, this is not a good idea. So i think rm should only rewrite the jobConf passed by app to solve the above situation. > Rm should only rewrite partial jobConf passed by app when supporting > multi-cluster token renew > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.patch > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. 
In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job
[jira] [Updated] (YARN-9746) RM should merge local config for token renewal
[ https://issues.apache.org/jira/browse/YARN-9746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-9746: --- Summary: RM should merge local config for token renewal (was: Rm should only rewrite partial jobConf passed by app when supporting multi-cluster token renew) > RM should merge local config for token renewal > -- > > Key: YARN-9746 > URL: https://issues.apache.org/jira/browse/YARN-9746 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Junfan Zhang >Priority: Major > Attachments: YARN-9746-01.patch > > > This issue links to YARN-5910. > When to support multi-cluster delegation token renew, the path of YARN-5910 > works in most scenarios. > But when intergrating with Oozie, we encounter some problems. In Oozie having > multi delegation tokens including HDFS_DELEGATION_TOKEN(another cluster HA > token) and MR_DELEGATION_TOKEN(Oozie mr launcher token), to support renew > another cluster's token, YARN-5910 was patched and related config was set. > The config is as follows > {code:xml} > > mapreduce.job.send-token-conf > > dfs.namenode.kerberos.principal|dfs.nameservices|^dfs.namenode.rpc-address.*$|^dfs.ha.namenodes.*$|^dfs.client.failover.proxy.provider.*$ > > > dfs.nameservices > > hadoop-clusterA-ns01,hadoop-clusterA-ns02,hadoop-clusterA-ns03,hadoop-clusterA-ns04,hadoop-clusterB-ns01,hadoop-clusterB-ns02,hadoop-clusterB-ns03,hadoop-clusterB-ns04 > > > dfs.ha.namenodes.hadoop-clusterB-ns01 > nn1,nn2 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn1 > namenode01-clusterB.hadoop:8020 > > > > dfs.namenode.rpc-address.hadoop-clusterB-ns01.nn2 > namenode02-clusterB.hadoop:8020 > > > > dfs.client.failover.proxy.provider.hadoop-clusterB-ns01 > > org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider > > {code} > However, the MR_DELEGATION_TOKEN could‘t be renewed, because of lacking some > config. Although we can set the required configurations through the app, this > is not a good idea. So i think rm should only rewrite the jobConf passed by > app to solve the above situation. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
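To illustrate the retitled proposal ("RM should merge local config for token renewal"), here is a minimal sketch of the merge idea, assuming the RM starts from its own local Configuration and overlays only the keys the application shipped via mapreduce.job.send-token-conf. mergeForRenewal is an illustrative helper, not the actual DelegationTokenRenewer code.

{code:java}
import java.util.Map;

import org.apache.hadoop.conf.Configuration;

public class TokenRenewalConfMergeSketch {
  // Illustrative only: begin with the RM's own local configuration (which already
  // knows how to reach the local cluster) and overlay just the keys the application
  // sent along with the job. Not the actual YARN-9746 patch.
  static Configuration mergeForRenewal(Configuration rmLocalConf, Configuration appSentConf) {
    Configuration merged = new Configuration(rmLocalConf);
    for (Map.Entry<String, String> entry : appSentConf) {
      merged.set(entry.getKey(), entry.getValue());
    }
    return merged;
  }

  public static void main(String[] args) {
    Configuration rmLocal = new Configuration();       // picks up local *-site.xml
    Configuration fromApp = new Configuration(false);  // holds only what the app sent
    fromApp.set("dfs.nameservices", "hadoop-clusterB-ns01");
    System.out.println(mergeForRenewal(rmLocal, fromApp).get("dfs.nameservices"));
  }
}
{code}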
[jira] [Updated] (YARN-8382) cgroup file leak in NM
[ https://issues.apache.org/jira/browse/YARN-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-8382: --- Description:
As Jiandan said in YARN-6562, the NM may time out while deleting a container's cgroup files, with logs like below:
org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to delete for 1000ms
We found one situation in which this happens: when *yarn.nodemanager.sleep-delay-before-sigkill.ms* is set bigger than *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*, the cgroup file leak occurs.
One container process tree looks like:
bash(16097)───java(16099)─┬─\{java}(16100)
                          ├─\{java}(16101)
                          ├─\{java}(16102)
When the NM kills a container, it sends kill -15 to the container's process group. The bash process exits when it receives SIGTERM, but the java process may still be doing work (shutdown hooks, etc.) and does not exit until it receives SIGKILL. When the bash process exits, CgroupsLCEResourcesHandler starts trying to delete the cgroup files. So when *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* is reached, the java processes may still be running, cgroup/tasks is still not empty, and the cgroup files leak.
We add a condition that *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must be bigger than *yarn.nodemanager.sleep-delay-before-sigkill.ms* to solve this problem.

was: As Jiandan said in YARN-6562, NM may delete Cgroup container file timeout with logs like below: org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to delete for 1000ms we found one situation is that when we set *yarn.nodemanager.sleep-delay-before-sigkill.ms* bigger than *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*, the cgroup file leak happens *.* One container process tree looks like follow graph: bash(16097)───java(16099)─┬─\{java}(16100) ├─\{java}(16101) {{ ├─\{java}(16102)}} {{when NM kills a container, NM sends kill -15 -pid to kill container process group. Bash process will exit when it received sigterm, but java process may do some job (shutdownHook etc.), and doesn't exit unit receive sigkill. And when bash process exits, CgroupsLCEResourcesHandler begin to try to delete cgroup files. So when *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* arrived, the java processes may still running and cgourp/tasks still not empty and cause a cgroup file leak.}} {{we add a condition that *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must bigger than *yarn.nodemanager.sleep-delay-before-sigkill.ms* to solve this problem.}}

> cgroup file leak in NM
> --
>
> Key: YARN-8382
> URL: https://issues.apache.org/jira/browse/YARN-8382
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Environment: we write a container with a shutdownHook which has a piece of code like "while(true) sleep(100)".
> When *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* < *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file leak happens;
> when *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* > *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file is deleted successfully.
>Reporter: Hu Ziqian
>Assignee: Hu Ziqian
>Priority: Major
> Fix For: 3.2.0, 3.1.1, 3.0.4, 2.10.1
>
> Attachments: YARN-8382-branch-2.8.3.001.patch,
> YARN-8382-branch-2.8.3.002.patch, YARN-8382.001.patch, YARN-8382.002.patch
>
>
> As Jiandan said in YARN-6562, the NM may time out while deleting a container's cgroup files, with logs like below:
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to delete for 1000ms
>
> We found one situation in which this happens: when *yarn.nodemanager.sleep-delay-before-sigkill.ms* is set bigger than *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*, the cgroup file leak occurs.
>
> One container process tree looks like:
> bash(16097)───java(16099)─┬─\{java}(16100)
>                           ├─\{java}(16101)
>                           ├─\{java}(16102)
>
> When the NM kills a container, it sends kill -15 to the container's process group. The bash process exits when it receives SIGTERM, but the java process may do some job (shutdownHook etc.) and doesn't exit until it receives SIGKILL.
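A minimal sketch of the kind of configuration check the description proposes; the class and method names here are hypothetical, only the two property keys come from the issue:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.exceptions.YarnRuntimeException;

// Hypothetical helper illustrating the proposed condition: the cgroups
// delete timeout must be bigger than the SIGKILL delay, otherwise the
// cgroup tasks file may still be non-empty when deletion gives up.
public class CgroupTimeoutCheck {
  public static void validate(Configuration conf) {
    // Fallback defaults below are illustrative, not authoritative.
    long deleteTimeoutMs = conf.getLong(
        "yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms",
        1000L);
    long sigkillDelayMs = conf.getLong(
        "yarn.nodemanager.sleep-delay-before-sigkill.ms", 250L);
    if (deleteTimeoutMs <= sigkillDelayMs) {
      throw new YarnRuntimeException(
          "cgroups.delete-timeout-ms (" + deleteTimeoutMs
              + " ms) must be bigger than sleep-delay-before-sigkill.ms ("
              + sigkillDelayMs + " ms), otherwise cgroup files may leak.");
    }
  }
}
{code}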
[jira] [Created] (YARN-11555) Support specifying node attribute for AM
Junfan Zhang created YARN-11555: --- Summary: Support specifying node attribute for AM Key: YARN-11555 URL: https://issues.apache.org/jira/browse/YARN-11555 Project: Hadoop YARN Issue Type: New Feature Components: nodeattibute Reporter: Junfan Zhang Hey community, I want to use node attributes to replace node labels for YARN NMs colocated with k8s. As far as I know, node attributes look more flexible. But I didn't see any support for specifying node attributes for the AM. Am I missing something? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11555) Support specifying node attribute for AM
[ https://issues.apache.org/jira/browse/YARN-11555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757910#comment-17757910 ] Junfan Zhang commented on YARN-11555: - cc [~slfan1989] > Support specifying node attribute for AM > > > Key: YARN-11555 > URL: https://issues.apache.org/jira/browse/YARN-11555 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodeattibute >Reporter: Junfan Zhang >Priority: Major > > Hey community, > I want to use node attributes to replace node labels for YARN NMs colocated > with k8s. As far as I know, node attributes look more flexible. > But I didn't see any support for specifying node attributes for the AM. Am I > missing something? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-11555) Support specifying node attribute for AM
[ https://issues.apache.org/jira/browse/YARN-11555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11555: --- Assignee: Junfan Zhang > Support specifying node attribute for AM > > > Key: YARN-11555 > URL: https://issues.apache.org/jira/browse/YARN-11555 > Project: Hadoop YARN > Issue Type: New Feature > Components: nodeattibute >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > > Hey community, > I want to use node attributes to replace node labels for YARN NMs colocated > with k8s. As far as I know, node attributes look more flexible. > But I didn't see any support for specifying node attributes for the AM. Am I > missing something? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11559) Can't specify node label in scheduling request in AMRMClient
Junfan Zhang created YARN-11559: --- Summary: Can't specify node label in scheduling request in AMRMClient Key: YARN-11559 URL: https://issues.apache.org/jira/browse/YARN-11559 Project: Hadoop YARN Issue Type: Bug Components: nodeattibute Reporter: Junfan Zhang When trying to use placement constraints with node attributes and node labels, I found that the node label can't be specified in the scheduling request, which means that for each container request the node label takes no effect. BTW, I'm not sure whether {{ApplicationSubmissionContext.setNodeLabelExpression(..)}} takes effect when using {{SchedulingRequest}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
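Following up on YARN-11559 above, a minimal sketch (assuming the placement-constraint API of recent Hadoop 3.x releases) of what a {{SchedulingRequest}} currently lets an AM express, namely a node-attribute constraint; note there is no field for a node label expression on the request itself, which is the gap this issue reports. The attribute name and values below are examples, not prescribed settings.
{code:java}
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.NodeAttributeOpCode;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceSizing;
import org.apache.hadoop.yarn.api.records.SchedulingRequest;
import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints;

public class SchedulingRequestSketch {
  public static SchedulingRequest build() {
    // Node-attribute constraint: place containers only on nodes where
    // nm.yarn.io/lifecycle == reserved (attribute name is just an example).
    PlacementConstraint constraint = PlacementConstraints.build(
        PlacementConstraints.targetNodeAttribute(
            PlacementConstraints.NODE,
            NodeAttributeOpCode.EQ,
            PlacementConstraints.PlacementTargets
                .nodeAttribute("nm.yarn.io/lifecycle", "reserved")));

    // Note: there is no nodeLabelExpression(...) setter here, which is
    // exactly the limitation described in YARN-11559.
    return SchedulingRequest.newBuilder()
        .allocationRequestId(1L)
        .priority(Priority.newInstance(0))
        .allocationTags(Collections.singleton("worker"))
        .placementConstraintExpression(constraint)
        .resourceSizing(ResourceSizing.newInstance(2, Resource.newInstance(1024, 1)))
        .build();
  }
}
{code}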
[jira] [Commented] (YARN-8007) Support specifying placement constraint for task containers in SLS
[ https://issues.apache.org/jira/browse/YARN-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778509#comment-17778509 ] Junfan Zhang commented on YARN-8007: Thanks for proposing this. Can we also involve the {{PlacementConstraintProcessor}} in SLS, not only the AppPlacementAllocator? > Support specifying placement constraint for task containers in SLS > -- > > Key: YARN-8007 > URL: https://issues.apache.org/jira/browse/YARN-8007 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8007.001.patch, YARN-8007.002.patch, > YARN-8007.003.patch > > > YARN-6592 introduces placement constraints. Currently SLS does not support > specifying placement constraints. > To enable better performance testing, we should be able to specify placement > for containers in the SLS configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11597) NPE when getting the static files in SLSWebApp
Junfan Zhang created YARN-11597: --- Summary: NPE when getting the static files in SLSWebApp Key: YARN-11597 URL: https://issues.apache.org/jira/browse/YARN-11597 Project: Hadoop YARN Issue Type: Bug Components: scheduler-load-simulator Affects Versions: 3.3.6 Reporter: Junfan Zhang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11597) NPE when getting the static files in SLSWebApp
[ https://issues.apache.org/jira/browse/YARN-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11597: Attachment: 20231023-171754.jpeg > NPE when getting the static files in SLSWebApp > --- > > Key: YARN-11597 > URL: https://issues.apache.org/jira/browse/YARN-11597 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.3.6 >Reporter: Junfan Zhang >Priority: Major > Attachments: 20231023-171754.jpeg > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11597) NPE when getting the static files in SLSWebApp
[ https://issues.apache.org/jira/browse/YARN-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11597: Description: When using the SLS, the web API at {{http://localhost:10001/simulate}} is broken because static file loading fails with 404. This is caused by the static handler not being initialized. The NPE stacktrace is attached. > NPE when getting the static files in SLSWebApp > --- > > Key: YARN-11597 > URL: https://issues.apache.org/jira/browse/YARN-11597 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.3.6 >Reporter: Junfan Zhang >Priority: Major > Attachments: 20231023-171754.jpeg > > > When using the SLS, the web API at {{http://localhost:10001/simulate}} is > broken because static file loading fails with 404. > This is caused by the static handler not being initialized. The NPE > stacktrace is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
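For context, this is how a Jetty static-file handler is normally wired up; it is a generic sketch of the initialization the NPE suggests is missing, not the actual SLSWebApp code (the resource base is a placeholder, and the port simply mirrors the one in the issue):
{code:java}
import org.eclipse.jetty.server.Handler;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.DefaultHandler;
import org.eclipse.jetty.server.handler.HandlerList;
import org.eclipse.jetty.server.handler.ResourceHandler;

public class StaticHandlerSketch {
  public static void main(String[] args) throws Exception {
    Server server = new Server(10001);

    // Serve the simulator's static js/css; the resource base is a placeholder.
    ResourceHandler staticHandler = new ResourceHandler();
    staticHandler.setResourceBase("html");
    staticHandler.setDirectoriesListed(false);

    // Without registering a handler like this, requests for static files
    // fall through and fail, as described in the issue.
    HandlerList handlers = new HandlerList();
    handlers.setHandlers(new Handler[] { staticHandler, new DefaultHandler() });
    server.setHandler(handlers);

    server.start();
    server.join();
  }
}
{code}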
[jira] [Assigned] (YARN-11597) NPE when getting the static files in SLSWebApp
[ https://issues.apache.org/jira/browse/YARN-11597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang reassigned YARN-11597: --- Assignee: Junfan Zhang > NPE when getting the static files in SLSWebApp > --- > > Key: YARN-11597 > URL: https://issues.apache.org/jira/browse/YARN-11597 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Affects Versions: 3.3.6 >Reporter: Junfan Zhang >Assignee: Junfan Zhang >Priority: Major > Labels: pull-request-available > Attachments: 20231023-171754.jpeg > > > When using the SLS, the web API at {{http://localhost:10001/simulate}} is > broken because static file loading fails with 404. > This is caused by the static handler not being initialized. The NPE > stacktrace is attached. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10065) Support Placement Constraints for AM container allocations
[ https://issues.apache.org/jira/browse/YARN-10065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778599#comment-17778599 ] Junfan Zhang commented on YARN-10065: - +1 for this feature. > Support Placement Constraints for AM container allocations > -- > > Key: YARN-10065 > URL: https://issues.apache.org/jira/browse/YARN-10065 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn >Affects Versions: 3.2.0 >Reporter: Daniel Velasquez >Priority: Major > > Currently ApplicationSubmissionContext API supports specifying a node label > expression for the AM resource request. It would be beneficial to have the > ability to specify Placement Constraints as well for the AM resource request. > We have a requirement to constrain AM containers on certain nodes e.g. AM > containers not on preemptible/spot cloud instances. It looks like node > attributes would fit our use case well. However, we currently don't have the > ability to specify this in the API for AM resource requests. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8007) Support specifying placement constraint for task containers in SLS
[ https://issues.apache.org/jira/browse/YARN-8007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778906#comment-17778906 ] Junfan Zhang commented on YARN-8007: If you don't mind, I'd like to pick this up. Feel free to discuss this further. > Support specifying placement constraint for task containers in SLS > -- > > Key: YARN-8007 > URL: https://issues.apache.org/jira/browse/YARN-8007 > Project: Hadoop YARN > Issue Type: Sub-task > Components: scheduler-load-simulator >Reporter: Jiandan Yang >Assignee: Jiandan Yang >Priority: Major > Attachments: YARN-8007.001.patch, YARN-8007.002.patch, > YARN-8007.003.patch > > > YARN-6592 introduces placement constraints. Currently SLS does not support > specifying placement constraints. > To enable better performance testing, we should be able to specify placement > for containers in the SLS configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11598) Support unified node label specified in sls-runner.xml
Junfan Zhang created YARN-11598: --- Summary: Support unified node label specified in sls-runner.xml Key: YARN-11598 URL: https://issues.apache.org/jira/browse/YARN-11598 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang In https://issues.apache.org/jira/browse/YARN-8175, node labels are supported via a dedicated node file, which is useful when different labels map to different nodes. But for my requirement of testing node-label scheduling performance where all nodes share the same label, using SYNTH mode, that approach is hard to use. So I want to introduce a unified node label specified in sls-runner.xml, which covers the above requirement. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11599) Incorrect log4j properties file in SLS sample conf
Junfan Zhang created YARN-11599: --- Summary: Incorrect log4j properties file in SLS sample conf Key: YARN-11599 URL: https://issues.apache.org/jira/browse/YARN-11599 Project: Hadoop YARN Issue Type: Bug Reporter: Junfan Zhang https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-sls/src/main/sample-conf/log4j.properties log4j.appender.test=org.apache.log4j.ConsoleAppender log4j.appender.test.Target=System.out log4j.appender.test.layout=org.apache.log4j.PatternLayout log4j.appender.test.layout.ConversionPattern=%d\{ABSOLUTE} %5p %c\{1}:%L - %m%n log4j.logger=NONE, test This is invalid for the current log4j version; if it is applied, the test performance will be slow. I think the WARN level is enough and should be required: a log level below WARN will affect performance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11601) Support random queue in SLS SYNTH trace type
Junfan Zhang created YARN-11601: --- Summary: Support random queue in SLS SYNTH trace type Key: YARN-11601 URL: https://issues.apache.org/jira/browse/YARN-11601 Project: Hadoop YARN Issue Type: Improvement Reporter: Junfan Zhang The queue a job is submitted to affects performance, so it is necessary to support picking a random queue for each job from a specified set of queues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11599) Incorrect log4j properties file in SLS sample conf
[ https://issues.apache.org/jira/browse/YARN-11599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11599: Component/s: scheduler-load-simulator > Incorrect log4j properties file in SLS sample conf > -- > > Key: YARN-11599 > URL: https://issues.apache.org/jira/browse/YARN-11599 > Project: Hadoop YARN > Issue Type: Bug > Components: scheduler-load-simulator >Reporter: Junfan Zhang >Priority: Major > Labels: pull-request-available > > https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-sls/src/main/sample-conf/log4j.properties > log4j.appender.test=org.apache.log4j.ConsoleAppender > log4j.appender.test.Target=System.out > log4j.appender.test.layout=org.apache.log4j.PatternLayout > log4j.appender.test.layout.ConversionPattern=%d\{ABSOLUTE} %5p %c\{1}:%L - > %m%n > log4j.logger=NONE, test > > This is invalid for the current log4j version; if it is applied, the test > performance will be slow. > I think the WARN level is enough and should be required: a log level below > WARN will affect performance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-11601) Support random queue in SLS SYNTH trace type
[ https://issues.apache.org/jira/browse/YARN-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junfan Zhang updated YARN-11601: Component/s: scheduler-load-simulator > Support random queue in SLS SYNTH trace type > > > Key: YARN-11601 > URL: https://issues.apache.org/jira/browse/YARN-11601 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler-load-simulator >Reporter: Junfan Zhang >Priority: Major > > The queue a job is submitted to affects performance, so it is necessary to > support picking a random queue for each job from a specified set of queues. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-11600) After jetty is upgraded to 9.4.51.v20230217, sls cannot load js/css
[ https://issues.apache.org/jira/browse/YARN-11600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779798#comment-17779798 ] Junfan Zhang commented on YARN-11600: - This has been tracked in https://issues.apache.org/jira/browse/YARN-11597 > After jetty is upgraded to 9.4.51.v20230217, sls cannot load js/css > --- > > Key: YARN-11600 > URL: https://issues.apache.org/jira/browse/YARN-11600 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: yanbin.zhang >Priority: Major > Attachments: image-2023-10-26-09-52-30-975.png > > > !image-2023-10-26-09-52-30-975.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-11616) Fast fail when multiple attribute kvs are specified
Junfan Zhang created YARN-11616: --- Summary: Fast fail when multiple attribute kvs are specified Key: YARN-11616 URL: https://issues.apache.org/jira/browse/YARN-11616 Project: Hadoop YARN Issue Type: Bug Components: nodeattibute Reporter: Junfan Zhang The {{NodeConstraintParser}} doesn't throw an exception when multiple attribute key-value pairs are specified. Instead it returns an incorrect placement constraint, which misleads users. For example, {{rm.yarn.io/foo=1,rm.yarn.io/bar=2}} is parsed into {{node,EQ,rm.yarn.io/bar=[1:2]}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
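For comparison, a sketch of how two independent node-attribute requirements can be expressed through the Java placement-constraint API (combined with AND) rather than packed into a single key-value expression; whether a given scheduler path accepts such a composite constraint is a separate question, and the attribute names simply reuse the examples from the issue:
{code:java}
import org.apache.hadoop.yarn.api.records.NodeAttributeOpCode;
import org.apache.hadoop.yarn.api.resource.PlacementConstraint;
import org.apache.hadoop.yarn.api.resource.PlacementConstraints;

public class MultiAttributeConstraintSketch {
  public static PlacementConstraint build() {
    // Two separate attribute constraints, combined with AND, instead of a
    // single "rm.yarn.io/foo=1,rm.yarn.io/bar=2" expression that
    // NodeConstraintParser currently mis-parses.
    return PlacementConstraints.build(
        PlacementConstraints.and(
            PlacementConstraints.targetNodeAttribute(
                PlacementConstraints.NODE, NodeAttributeOpCode.EQ,
                PlacementConstraints.PlacementTargets
                    .nodeAttribute("rm.yarn.io/foo", "1")),
            PlacementConstraints.targetNodeAttribute(
                PlacementConstraints.NODE, NodeAttributeOpCode.EQ,
                PlacementConstraints.PlacementTargets
                    .nodeAttribute("rm.yarn.io/bar", "2"))));
  }
}
{code}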
[jira] [Created] (YARN-11617) Noisy log in SingleConstraintAppPlacementAllocator
Junfan Zhang created YARN-11617: --- Summary: Noisy log in SingleConstraintAppPlacementAllocator Key: YARN-11617 URL: https://issues.apache.org/jira/browse/YARN-11617 Project: Hadoop YARN Issue Type: Improvement Components: capacity scheduler Reporter: Junfan Zhang There are too many noisy log lines from SingleConstraintAppPlacementAllocator, like this: 2023-11-20 15:14:30,493 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.placement.SingleConstraintAppPlacementAllocator: Successfully added SchedulingRequest to app=appattempt_1700464328807_0002_01 placementConstraint=[ node,EQ,nm.yarn.io/lifecycle=[reserved:true]]. nodePartition= -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
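A minimal sketch of the usual remedy for this kind of noise, assuming the fix is simply demoting the per-request message from INFO to DEBUG (the actual patch may differ; the class and method names here are hypothetical):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class QuietAllocatorLogging {
  private static final Logger LOG =
      LoggerFactory.getLogger(QuietAllocatorLogging.class);

  static void onSchedulingRequestAdded(String appAttempt, String constraint,
      String nodePartition) {
    // Demoted from INFO to DEBUG so per-SchedulingRequest messages no longer
    // flood the RM log; parameterized logging avoids building the string
    // when DEBUG is disabled.
    LOG.debug("Successfully added SchedulingRequest to app={} "
        + "placementConstraint=[{}]. nodePartition={}",
        appAttempt, constraint, nodePartition);
  }
}
{code}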