[jira] [Commented] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced

2021-04-14 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321128#comment-17321128
 ] 

Peter Bacsko commented on YARN-10654:
-

[~snemeth] [~shuzirra] do you guys have some time to review this? It's the 
equivalent of what FS does.
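
For readers without the patch at hand, here is a minimal sketch of the idea, assuming the dot is replaced with the "_dot_" token that Fair Scheduler uses for user names. The class and method names are hypothetical and are not taken from YARN-10654-001.patch; the %user variable is used only as an illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not the code from the patch. */
final class PathVariableEscaper {
  // Assumption: same replacement token that Fair Scheduler uses for user names.
  static final String DOT_REPLACEMENT = "_dot_";

  /** Replaces every '.' in a variable value so it cannot split a queue path. */
  static String escape(String value) {
    return value.replace(".", DOT_REPLACEMENT);
  }

  /** Substitutes %name variables in a queue path with their escaped values. */
  static String substitute(String template, Map<String, String> variables) {
    String result = template;
    for (Map.Entry<String, String> e : variables.entrySet()) {
      result = result.replace("%" + e.getKey(), escape(e.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> vars = new LinkedHashMap<>();
    vars.put("user", "first.last");
    // Without escaping, "first.last" would be read as two queue levels.
    System.out.println(substitute("root.users.%user", vars));
  }
}
```

Escaping only the substituted value, not the template, keeps the dots that the rule author wrote as real separators.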

> Dots '.' in CSMappingRule path variables should be replaced
> ---
>
> Key: YARN-10654
> URL: https://issues.apache.org/jira/browse/YARN-10654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10654-001.patch
>
>
> Dots are used as separators, so we should escape them somehow in the 
> variables when substituting them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced

2021-04-14 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320954#comment-17320954
 ] 

Peter Bacsko commented on YARN-10654:
-

Uploaded patch v1 which is probably the simplest approach to the '.' problem.

> Dots '.' in CSMappingRule path variables should be replaced
> ---
>
> Key: YARN-10654
> URL: https://issues.apache.org/jira/browse/YARN-10654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10654-001.patch
>
>
> Dots are used as separators, so we should escape them somehow in the 
> variables when substituting them.






[jira] [Updated] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced

2021-04-14 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10654:

Attachment: YARN-10654-001.patch

> Dots '.' in CSMappingRule path variables should be replaced
> ---
>
> Key: YARN-10654
> URL: https://issues.apache.org/jira/browse/YARN-10654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
> Attachments: YARN-10654-001.patch
>
>
> Dots are used as separators, so we should escape them somehow in the 
> variables when substituting them.






[jira] [Assigned] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced

2021-04-14 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko reassigned YARN-10654:
---

Assignee: Peter Bacsko  (was: Gergely Pollak)

> Dots '.' in CSMappingRule path variables should be replaced
> ---
>
> Key: YARN-10654
> URL: https://issues.apache.org/jira/browse/YARN-10654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Peter Bacsko
>Priority: Major
>
> Dots are used as separators, so we should escape them somehow in the 
> variables when substituting them.






[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317080#comment-17317080
 ] 

Peter Bacsko commented on YARN-10564:
-

+1

Committed to trunk. Thanks [~gandras] for the patch and [~zhuqi] for the review.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.006.patch, YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean, that set weight to 10 of every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.






[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316309#comment-17316309
 ] 

Peter Bacsko commented on YARN-10564:
-

Thanks [~gandras]. I have the following suggestions: please add comments to the
"for" loop that explain this. I don't want to dictate the wording, and it could
take several sentences, but I think it's important. Also, maybe note in a
comment that "supportedWildcardLevel" or MAX_WILDCARD_LEVEL might change in the
future (just like me, people might realize that the range is [0-1] and get
confused).

Also, an overall comment like "collect all template settings based on prefix,
then finally apply the collected settings to the newly created queue" might be
useful. I'd put it somewhere before the "while" loop, but this is just an idea.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean, that set weight to 10 of every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.






[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316277#comment-17316277
 ] 

Peter Bacsko edited comment on YARN-10564 at 4/7/21, 12:16 PM:
---

Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which 
modifies "queuePathParts". First we try to find the templates for the parent 
explicitly, then we step back a wildcard at each iteration. By changing 
"queuePathParts", the prefix changes so eventually we might find a parent which 
contains templates. 

Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected 
values for the original queue.

Is this correct?
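
The lookup described above can be sketched roughly as follows. This is an illustration only: the class name, the map-based stand-in for the configuration, and the rule that a more specific prefix wins over a wildcard one are assumptions, not the patch's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of collecting template settings by prefix. */
final class TemplateLookupSketch {
  static final String WILDCARD_QUEUE = "*";
  static final int MAX_WILDCARD_LEVEL = 1;

  /**
   * Looks up the explicit parent path first, then wildcards one more
   * trailing path part per iteration. Assumption: settings found for a
   * more specific prefix are not overwritten by later wildcard matches.
   */
  static Map<String, String> collect(String parentPath,
      Map<String, Map<String, String>> templatesByPrefix) {
    List<String> queuePathParts =
        new ArrayList<>(Arrays.asList(parentPath.split("\\.")));
    int lastIndex = queuePathParts.size() - 1;
    int supportedWildcardLevel = Math.min(lastIndex, MAX_WILDCARD_LEVEL);
    Map<String, String> collected = new LinkedHashMap<>();

    for (int level = 0; level <= supportedWildcardLevel; ++level) {
      String prefix = String.join(".", queuePathParts);
      Map<String, String> found =
          templatesByPrefix.getOrDefault(prefix, Collections.emptyMap());
      for (Map.Entry<String, String> e : found.entrySet()) {
        collected.putIfAbsent(e.getKey(), e.getValue());
      }
      // Step back: the next lookup uses a prefix with one more wildcard.
      queuePathParts.set(lastIndex - level, WILDCARD_QUEUE);
    }
    return collected;
  }

  public static void main(String[] args) {
    Map<String, Map<String, String>> templates = new LinkedHashMap<>();
    templates.put("root.a", Collections.singletonMap("capacity", "6w"));
    templates.put("root.*", Collections.singletonMap("capacity", "1w"));
    System.out.println(collect("root.a", templates)); // explicit entry used
    System.out.println(collect("root.b", templates)); // falls back to root.*
  }
}
```

The collected map would then be applied to the newly created queue in one final step, which is the role the comment attributes to {{setConfigFromTemplateEntries()}}.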


was (Author: pbacsko):
Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which 
modifies "queuePathParts". First we try to find the templates for the parent 
explicitly, then we step back each wildcard at a time. By changing 
"queuePathParts", the prefix changes so eventually we might find a parent which 
contains templates. 

Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected 
values for the original queue.

Is this correct?

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean, that set weight to 10 of every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.






[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316277#comment-17316277
 ] 

Peter Bacsko commented on YARN-10564:
-

Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which
modifies "queuePathParts". First we try to find the templates for the parent
explicitly, then we step back one wildcard at a time. By changing
"queuePathParts", the prefix changes so eventually we might find a parent which
contains templates.

Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected 
values for the original queue.

Is this correct?

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean, that set weight to 10 of every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.






[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213
 ] 

Peter Bacsko edited comment on YARN-10564 at 4/7/21, 11:51 AM:
---

[~gandras] thanks for the patch.
From a coding POV it looks ok; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape
today).

1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue
{{root.a.newparent.newchild}} gets created. How do the weight settings
propagate to "newparent" and "newchild"? I kept looking at the code, but it's
just not obvious. I can see that "root.a" will have an entry in
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
  queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported"
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts -
1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? I
don't understand what it is meant to represent.
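
Taken in isolation, the loop and the {{Math.min}} expression quoted above can be exercised like this. It is a standalone sketch: the wrapper class and method names are hypothetical, and treating "queueHierarchyParts" as a plain count of path parts is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical wrapper around the two snippets quoted in the comment. */
final class WildcardLoopDemo {
  static final String WILDCARD_QUEUE = "*";
  static final int MAX_WILDCARD_LEVEL = 1;

  /** Runs the quoted loop on a copy of the path parts. */
  static List<String> wildcardTail(List<String> parts, int wildcardLevel) {
    List<String> queuePathParts = new ArrayList<>(parts);
    for (int i = 0; i <= wildcardLevel; ++i) {
      queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
    }
    return queuePathParts;
  }

  /** Assumption: queueHierarchyParts is the number of parts in the path. */
  static int supportedWildcardLevel(int queueHierarchyParts) {
    return Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);
  }

  public static void main(String[] args) {
    List<String> parts = Arrays.asList("root", "a", "newparent");
    System.out.println(String.join(".", wildcardTail(parts, 0))); // root.a.*
    System.out.println(String.join(".", wildcardTail(parts, 1))); // root.*.*
    System.out.println(supportedWildcardLevel(1)); // 0
    System.out.println(supportedWildcardLevel(3)); // 1
  }
}
```

So the loop replaces the last (wildcardLevel + 1) path parts with "*", and the {{Math.min}} keeps that count from exceeding either the path length or the configured maximum.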


was (Author: pbacsko):
[~gandras] thanks for the patch.
From a coding POV it looks ok; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape
today).

1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue
{{root.a.newparent.newchild}} gets created. How do the weight settings
propagate to "newparent" and "newchild"? I kept looking at the code, but it's
just not obvious. I can see that "root.a" will have an entry in
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
  queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported"
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts -
1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1?
Mentally I don't understand what it is meant to represent.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean, that set weight to 10 of every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.






[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213
 ] 

Peter Bacsko edited comment on YARN-10564 at 4/7/21, 10:49 AM:
---

[~gandras] thanks for the patch.
From a coding POV it looks ok; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape
today).

1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue
{{root.a.newparent.newchild}} gets created. How do the weight settings
propagate to "newparent" and "newchild"? I kept looking at the code, but it's
just not obvious. I can see that "root.a" will have an entry in
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
  queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported"
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts -
1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1?
Mentally I don't understand what it is meant to represent.


was (Author: pbacsko):
[~gandras] thanks for the patch.
From a coding POV it looks ok; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape
today).

1. Let's say you set 6w for {{root.a.*}}. Then a dynamic queue
{{root.a.newparent.newchild}} gets created. How do the weight settings
propagate to "newparent" and "newchild"? I kept looking at the code, but it's
just not obvious. I can see that "root.a" will have an entry in
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
  queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported"
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts -
1, MAX_WILDCARD_LEVEL);}} which seems to mean that it is either 0 or 1, because
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1?
Mentally I don't understand what it is meant to represent.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean, that set weight to 10 of every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.






[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations

2021-04-07 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213
 ] 

Peter Bacsko commented on YARN-10564:
-

[~gandras] thanks for the patch.
From a coding POV it looks ok; this is more of a high-level review.

There are some things I just can't figure out (maybe I'm in bad shape
today).

1. Let's say you set 6w for {{root.a.*}}. Then a dynamic queue
{{root.a.newparent.newchild}} gets created. How do the weight settings
propagate to "newparent" and "newchild"? I kept looking at the code, but it's
just not obvious. I can see that "root.a" will have an entry in
{{templateEntries}}, but then what?

2. I can't decipher this part:
{noformat}
for (int i = 0; i <= wildcardLevel; ++i) {
  queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE);
}
{noformat}
What's happening here?

3. There is a variable called "supportedWildcardLevel". What does "supported"
mean in this context? Later on we set it to {{Math.min(queueHierarchyParts -
1, MAX_WILDCARD_LEVEL);}} which seems to mean that it is either 0 or 1, because
{{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1?
Mentally I don't understand what it is meant to represent.

> Support Auto Queue Creation template configurations
> ---
>
> Key: YARN-10564
> URL: https://issues.apache.org/jira/browse/YARN-10564
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Major
> Attachments: YARN-10564.001.patch, YARN-10564.002.patch, 
> YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, 
> YARN-10564.poc.001.patch
>
>
> Similar to how the template configuration works for ManagedParents, we need 
> to support templates for the new auto queue creation logic. Proposition is to 
> allow wildcards in template configs such as:
> {noformat}
> yarn.scheduler.capacity.root.*.*.weight 10{noformat}
> which would mean, that set weight to 10 of every leaf of every parent under 
> root.
> We should possibly take an approach, that could support arbitrary depth of 
> template configuration, because we might need to lift the limitation of auto 
> queue nesting.






[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313241#comment-17313241
 ] 

Peter Bacsko commented on YARN-10726:
-

Ok, I strongly believe that the failing tests are flaky.

[~zhuqi] could you verify it by running them locally a couple of times?

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch, YARN-10726.002.patch
>
>







[jira] [Commented] (YARN-10693) Add document for YARN-10623 auto refresh queue conf in cs.

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313219#comment-17313219
 ] 

Peter Bacsko commented on YARN-10693:
-

I'll review this as soon as I have some spare cycles.

> Add document for YARN-10623 auto refresh queue conf in cs.
> --
>
> Key: YARN-10693
> URL: https://issues.apache.org/jira/browse/YARN-10693
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10693.001.patch, YARN-10693.002.patch, 
> YARN-10693.003.patch
>
>







[jira] [Commented] (YARN-10637) We should support fs to cs support for auto refresh queues when conf changed, after YARN-10623 finished.

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313218#comment-17313218
 ] 

Peter Bacsko commented on YARN-10637:
-

Thanks [~zhuqi] I think it's good then.

[~gandras] do you have any comments?

> We should support fs to cs support for auto refresh queues when conf changed, 
> after YARN-10623 finished.
> 
>
> Key: YARN-10637
> URL: https://issues.apache.org/jira/browse/YARN-10637
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10637.001.patch, YARN-10637.002.patch, 
> YARN-10637.003.patch, YARN-10637.004.patch
>
>
> cc [~pbacsko] [~gandras] [~bteke]
> We should also fill this, when  YARN-10623 finished.






[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313192#comment-17313192
 ] 

Peter Bacsko commented on YARN-10726:
-

Ah, I already committed the change. Let's hope Jenkins comes back green :)

+1

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch, YARN-10726.002.patch
>
>







[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313189#comment-17313189
 ] 

Peter Bacsko commented on YARN-10726:
-

"hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer" - this 
is unrelated I believe. This test case has been failing for a long time.

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch, YARN-10726.002.patch
>
>







[jira] [Commented] (YARN-10637) We should support fs to cs support for auto refresh queues when conf changed, after YARN-10623 finished.

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313184#comment-17313184
 ] 

Peter Bacsko commented on YARN-10637:
-

Thanks [~zhuqi] this makes sense. Is this always enabled in Fair Scheduler? 
Because we should only add this policy if auto-refresh is enabled on the 
FS-side.

> We should support fs to cs support for auto refresh queues when conf changed, 
> after YARN-10623 finished.
> 
>
> Key: YARN-10637
> URL: https://issues.apache.org/jira/browse/YARN-10637
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10637.001.patch, YARN-10637.002.patch, 
> YARN-10637.003.patch, YARN-10637.004.patch
>
>
> cc [~pbacsko] [~gandras] [~bteke]
> We should also fill this, when  YARN-10623 finished.






[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313138#comment-17313138
 ] 

Peter Bacsko commented on YARN-10726:
-

This is from {{AsyncDispatcher}}:

{noformat}
if (qSize != 0 && qSize % 1000 == 0
    && lastEventQueueSizeLogged != qSize) {
  lastEventQueueSizeLogged = qSize;
  LOG.info("Size of event-queue is " + qSize);
}
{noformat}

Update the code with {{lastEventQueueSizeLogged}}.
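
To show the effect of the extra {{lastEventQueueSizeLogged}} check, here is a small standalone sketch. The class name is hypothetical and a list stands in for LOG.info; only the condition itself is copied from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a component that logs its event queue size. */
final class QueueSizeLogThrottle {
  private int lastEventQueueSizeLogged = 0;
  // Stands in for LOG.info so the behaviour can be inspected.
  private final List<String> logged = new ArrayList<>();

  /** Logs only at nonzero multiples of 1000 that differ from the last logged size. */
  void onQueueSize(int qSize) {
    if (qSize != 0 && qSize % 1000 == 0
        && lastEventQueueSizeLogged != qSize) {
      lastEventQueueSizeLogged = qSize;
      logged.add("Size of event-queue is " + qSize);
    }
  }

  List<String> logged() {
    return logged;
  }

  public static void main(String[] args) {
    QueueSizeLogThrottle throttle = new QueueSizeLogThrottle();
    // The size oscillates around 1000 before growing to 2000.
    for (int size : new int[] {999, 1000, 999, 1000, 1001, 2000}) {
      throttle.onQueueSize(size);
    }
    throttle.logged().forEach(System.out::println);
  }
}
```

In this run the second visit to 1000 is suppressed because it equals the last logged size. The check does not suppress a pattern such as 1000, 2000, 1000, where each size differs from the one logged before it.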

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch
>
>







[jira] [Comment Edited] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313123#comment-17313123
 ] 

Peter Bacsko edited comment on YARN-10726 at 4/1/21, 12:01 PM:
---

Thanks [~zhuqi]. I think it's a good idea. My only concern (which might not be
valid) is that if we have too many events, this code can run too frequently.
For example, if you go 998, 998, 999, 1000, 1001, 1002, then it prints at 1000,
then it starts to consume events, the size goes back from 1000 to 990, and then
it prints the size again.

I think we should limit how often we print this message. We shouldn't log it
too often; I'm not sure how we do this in other parts of the code. I'll check
what the best solution is.


was (Author: pbacsko):
Thanks [~zhuqi]. I think it's a good idea. My only "concern" is that if we have
too many events, this code can run too frequently. For example, if you
go 998, 998, 999, 1000, 1001, 1002, then it prints at 1000, then it starts to
consume events, the size goes back from 1000 to 990, and then it prints the
size again.

I think we should limit how often we print this message. We shouldn't log it
too often; I'm not sure how we do this in other parts of the code. I'll check
what the best solution is.

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch
>
>







[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313123#comment-17313123
 ] 

Peter Bacsko commented on YARN-10726:
-

Thanks [~zhuqi]. I think it's a good idea. My only "concern" is that if we have 
too many events, this code can possibly run too frequently. For example, if the size 
goes 998, 998, 999, 1000, 1001, 1002, then it prints at 1000; then it starts to 
consume events, the size goes back from 1000 to 990, and it prints the size again.

I think we should limit how often we print this message. We shouldn't log it too 
often, but I'm not sure how we do this in other parts of the code. I'll check what 
the best solution could be.
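
One possible way to limit the frequency is a time-based throttle: log at most once per interval while the queue size is at or above the threshold. The following is only a minimal sketch under stated assumptions. {{ThrottledQueueSizeLogger}}, its constructor parameters and the caller-supplied timestamp are hypothetical names, not taken from the patch or from Hadoop; the threshold of 1000 mirrors the example above.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper, not part of Hadoop: decides whether a queue-size warning may be logged. */
final class ThrottledQueueSizeLogger {
  private final int threshold;
  private final long minIntervalMs;
  // Timestamp of the last emitted message; Long.MIN_VALUE means "never logged".
  private final AtomicLong lastLogMs = new AtomicLong(Long.MIN_VALUE);

  ThrottledQueueSizeLogger(int threshold, long minIntervalMs) {
    this.threshold = threshold;
    this.minIntervalMs = minIntervalMs;
  }

  /** Returns true if a message should be logged for this queue size at time nowMs. */
  boolean shouldLog(int queueSize, long nowMs) {
    if (queueSize < threshold) {
      return false;
    }
    long last = lastLogMs.get();
    boolean intervalElapsed =
        last == Long.MIN_VALUE || nowMs - last >= minIntervalMs;
    // compareAndSet lets only one thread win when several see the same state.
    return intervalElapsed && lastLogMs.compareAndSet(last, nowMs);
  }
}
```

With a threshold of 1000 and an interval of one minute, a size that oscillates around 1000 produces at most one message per minute instead of one per crossing. In real code the timestamp would come from a monotonic clock.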

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch
>
>







[jira] [Updated] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events

2021-04-01 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10726:

Summary: Log the size of DelegationTokenRenewer event queue in case of too 
many pending events  (was: We should log size of pending 
DelegationTokenRenewerEvent queue, when pending too many events.)

> Log the size of DelegationTokenRenewer event queue in case of too many 
> pending events
> -
>
> Key: YARN-10726
> URL: https://issues.apache.org/jira/browse/YARN-10726
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10726.001.patch
>
>







[jira] [Commented] (YARN-9618) NodesListManager event improvement

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313105#comment-17313105
 ] 

Peter Bacsko commented on YARN-9618:


Thanks for the patch [~zhuqi] and [~gandras] for the review, I committed this 
to trunk.

> NodesListManager event improvement
> --
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, 
> YARN-9618.006.patch, YARN-9618.007.patch
>
>
> The current implementation of NodesListManager events blocks the async dispatcher 
> and can cause an RM crash and slow down event processing.
> # Cluster restart with 1K running apps. Each usable event will create 1K 
> events; overall there could be 5K*1K events for a 5K cluster.
> # Event processing is blocked till new events are added to queue.
> Solution:
> # Add another async event handler similar to the scheduler's.
> # Instead of adding events to the dispatcher, directly call the RMApp event handler.






[jira] [Updated] (YARN-9618) NodesListManager event improvement

2021-04-01 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-9618:
---
Summary: NodesListManager event improvement  (was: NodeListManager event 
improvement)

> NodesListManager event improvement
> --
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, 
> YARN-9618.006.patch, YARN-9618.007.patch
>
>
> The current implementation of NodesListManager events blocks the async dispatcher 
> and can cause an RM crash and slow down event processing.
> # Cluster restart with 1K running apps. Each usable event will create 1K 
> events; overall there could be 5K*1K events for a 5K cluster.
> # Event processing is blocked till new events are added to queue.
> Solution:
> # Add another async event handler similar to the scheduler's.
> # Instead of adding events to the dispatcher, directly call the RMApp event handler.






[jira] [Commented] (YARN-9618) NodeListManager event improvement

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312989#comment-17312989
 ] 

Peter Bacsko commented on YARN-9618:


+1 LGTM

[~gandras] are you OK with the patch?

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, 
> YARN-9618.006.patch, YARN-9618.007.patch
>
>
> The current implementation of NodesListManager events blocks the async dispatcher 
> and can cause an RM crash and slow down event processing.
> # Cluster restart with 1K running apps. Each usable event will create 1K 
> events; overall there could be 5K*1K events for a 5K cluster.
> # Event processing is blocked till new events are added to queue.
> Solution:
> # Add another async event handler similar to the scheduler's.
> # Instead of adding events to the dispatcher, directly call the RMApp event handler.






[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging

2021-04-01 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312945#comment-17312945
 ] 

Peter Bacsko commented on YARN-10720:
-

+1

thanks [~zhuqi] for the patch, committed to trunk.

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server from hanging
> --
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>
> The following is what the proxy server shows: {color:#de350b}too many connections from one 
> client{color}. This caused the proxy server to hang, and the YARN web UI can't jump 
> to the web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> The following is the abnormal AM. The proxy server doesn't know that it is 
> already abnormal, so the connections can't be closed. We should add timeout 
> support to the proxy server to prevent this. One abnormal AM may cause 
> hundreds or even thousands of connections, which is a very heavy load.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After I killed the abnormal AM, the proxy server became healthy. This case 
> has happened many times in our production clusters; our clusters are huge, and 
> abnormal AMs occur regularly.
>  
> I will add timeout support to the web proxy server in this jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  






[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging

2021-04-01 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10720:

Summary: YARN WebAppProxyServlet should support connection timeout to 
prevent proxy server from hanging  (was: YARN WebAppProxyServlet should support 
connection timeout to prevent proxy server hang.)

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server from hanging
> --
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>
> The following is what the proxy server shows: {color:#de350b}too many connections from one 
> client{color}. This caused the proxy server to hang, and the YARN web UI can't jump 
> to the web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> The following is the abnormal AM. The proxy server doesn't know that it is 
> already abnormal, so the connections can't be closed. We should add timeout 
> support to the proxy server to prevent this. One abnormal AM may cause 
> hundreds or even thousands of connections, which is a very heavy load.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After I killed the abnormal AM, the proxy server became healthy. This case 
> has happened many times in our production clusters; our clusters are huge, and 
> abnormal AMs occur regularly.
>  
> I will add timeout support to the web proxy server in this jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  






[jira] [Commented] (YARN-9618) NodeListManager event improvement

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312516#comment-17312516
 ] 

Peter Bacsko commented on YARN-9618:


Small things:

1.
{noformat}
//Is trigger RMAppNodeUpdateEvent
private Boolean isRMAppEvent = false;
//Is trigger NodesListManagerEvent
private Boolean isNodesListEvent = false;
{noformat}
a) No need for comments
b) Use ordinary "boolean" instead of "Boolean" (also, initializing to "false" is not 
necessary; it is "false" by default because it's dictated by the JVM spec).

 

2.
{noformat}
Assert.assertFalse(getIsRMAppEvent());
Assert.assertTrue(getIsNodesListEvent());
{noformat}
Add some assertion message here, like
{noformat}
Assert.assertFalse("Got unexpected RM app event", getIsRMAppEvent());
Assert.assertTrue("Received no NodesListManagerEvent", getIsNodesListEvent());
{noformat}
3. Return values of {{getIsNodesListEvent()}} and {{getIsRMAppEvent()}} should 
be just "boolean".
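
Point 1b can be checked with a few lines of plain Java. This is only an illustration: the class {{EventFlags}} and its members are made-up names, not code from the patch. It shows that a primitive {{boolean}} field starts out as {{false}} without an initializer, while a boxed {{Boolean}} field would start out as {{null}}.

```java
/** Illustration only: field defaults for primitive vs. boxed booleans. */
final class EventFlags {
  // A primitive boolean field is false by default; no initializer is needed.
  private boolean isRMAppEvent;
  // A boxed Boolean field defaults to null, so unboxing it before it is
  // assigned would throw a NullPointerException.
  private Boolean boxedFlag;

  boolean getIsRMAppEvent() {
    return isRMAppEvent;
  }

  Boolean getBoxedFlag() {
    return boxedFlag;
  }

  void markRMAppEvent() {
    isRMAppEvent = true;
  }
}
```

Using the primitive type for both the fields and the getter return values avoids the {{null}} state entirely.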

> NodeListManager event improvement
> -
>
> Key: YARN-9618
> URL: https://issues.apache.org/jira/browse/YARN-9618
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Bibin Chundatt
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-9618.001.patch, YARN-9618.002.patch, 
> YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, 
> YARN-9618.006.patch
>
>
> The current implementation of NodesListManager events blocks the async dispatcher 
> and can cause an RM crash and slow down event processing.
> # Cluster restart with 1K running apps. Each usable event will create 1K 
> events; overall there could be 5K*1K events for a 5K cluster.
> # Event processing is blocked till new events are added to queue.
> Solution:
> # Add another async event handler similar to the scheduler's.
> # Instead of adding events to the dispatcher, directly call the RMApp event handler.






[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312492#comment-17312492
 ] 

Peter Bacsko commented on YARN-10720:
-

{noformat}
  } catch (InterruptedException e) {
    LOG.warn("doGet() interrupted", e);
    resp.setStatus(HttpServletResponse.SC_BAD_REQUEST);
  }
  resp.setStatus(HttpServletResponse.SC_OK);
}
{noformat}

This is not good - you set the response status to {{SC_BAD_REQUEST}} only to 
override it with {{SC_OK}}. You need a "return".
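
A sketch of the early-return shape, written so that it compiles without the servlet API. {{FakeResponse}} and {{SlowServletSketch}} are hypothetical stand-ins, not classes from the patch; the status values 200 and 400 are the ones behind {{SC_OK}} and {{SC_BAD_REQUEST}}. Logging is left out, and restoring the interrupt flag is an extra step that the patch may or may not want.

```java
/** Stand-in for HttpServletResponse so the sketch compiles without the servlet API. */
final class FakeResponse {
  static final int SC_OK = 200;
  static final int SC_BAD_REQUEST = 400;
  private int status;

  void setStatus(int status) {
    this.status = status;
  }

  int getStatus() {
    return status;
  }
}

final class SlowServletSketch {
  /** Sleeps for sleepMs, reporting 400 if interrupted and 200 otherwise. */
  static void doGet(FakeResponse resp, long sleepMs) {
    try {
      Thread.sleep(sleepMs);
    } catch (InterruptedException e) {
      // Optional: keep the interrupt visible to the caller.
      Thread.currentThread().interrupt();
      resp.setStatus(FakeResponse.SC_BAD_REQUEST);
      return; // without this, the SC_OK below would overwrite the error status
    }
    resp.setStatus(FakeResponse.SC_OK);
  }
}
```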

{noformat}
try {
  servlet.init(config);
} catch (ServletException e) {
  LOG.error(e.getMessage());
  fail("Failed to init servlet");
}

try {
  servlet.doGet(request, response);
} catch (ServletException e) {
  LOG.error(e.getMessage());
  fail("ServletException thrown during doGet.");
}
  }
{noformat}

You can remove the try-catch blocks here and just add {{throws ServletException}} to 
the test method. If the exception is thrown for whatever reason, it will be a test 
error (which is desired - checking if the servlet can init is not the purpose of 
the test), not a test failure.

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server hang.
> ---
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, 
> image-2021-03-29-14-04-33-776.png, image-2021-03-29-14-05-32-708.png
>
>
> The following is what the proxy server shows: {color:#de350b}too many connections from one 
> client{color}. This caused the proxy server to hang, and the YARN web UI can't jump 
> to the web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> The following is the abnormal AM. The proxy server doesn't know that it is 
> already abnormal, so the connections can't be closed. We should add timeout 
> support to the proxy server to prevent this. One abnormal AM may cause 
> hundreds or even thousands of connections, which is a very heavy load.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After I killed the abnormal AM, the proxy server became healthy. This case 
> has happened many times in our production clusters; our clusters are huge, and 
> abnormal AMs occur regularly.
>  
> I will add timeout support to the web proxy server in this jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  






[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312253#comment-17312253
 ] 

Peter Bacsko commented on YARN-10720:
-

Thanks [~zhuqi] for the patch.

1. As you said {{ExpectedException.none()}} has been deprecated. Either use the 
new {{assertThrows()}} or {{@Test(expected = SocketTimeoutException.class)}}, I 
think using the second is easier.

2.
{noformat}
conf.setInt(YarnConfiguration.RM_PROXY_CONNECTION_TIMEOUT,
1 * 1000);
{noformat}
Just write "1000" instead of "1 * 1000".

3.
{noformat}
try {
  when(response.getOutputStream()).thenReturn(null);
} catch (IOException e) {
  e.printStackTrace();
}
{noformat}
Unnecessary try-catch block. The method already has a {{throws}} clause.

4.
{noformat}
@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp)
    throws ServletException, IOException {
  try {
    Thread.sleep(10 * 1000);
  } catch (InterruptedException e) {
    e.printStackTrace();
  }
  resp.setStatus(HttpServletResponse.SC_OK);
}
{noformat}
Maybe a minor thing, but if you catch {{InterruptedException}}, don't just 
print the stack trace, log it with {{LOG.warn("doGet() interrupted", e)}}. In 
this case, I'd also return with {{HttpServletResponse.SC_BAD_REQUEST}}.

5.
{{The web proxy connection timeout, default is 60s(60 * 1000ms).}}

This already goes to {{yarn-default.xml}}, so you can omit the part "default is 
60s(60 * 1000ms)" and just write "The web proxy connection timeout".
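
For context on the {{SocketTimeoutException}} expected in point 1, the following plain-JDK sketch shows how a timeout turns a silent peer into an exception instead of a hang. It is not the patch's code: {{ReadTimeoutSketch}} is a hypothetical name, it uses a raw socket rather than the proxy's HTTP client, and what it sets is a read timeout. The listening socket that never answers stands in for an unresponsive AM.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Illustration only: a read timeout turns a silent peer into an exception. */
final class ReadTimeoutSketch {
  /** Returns true if reading from a server that never answers times out. */
  static boolean timesOut(int timeoutMs) {
    InetAddress loopback = InetAddress.getLoopbackAddress();
    // The server socket listens but never accepts or writes anything.
    try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
         Socket client = new Socket(loopback, silentServer.getLocalPort())) {
      client.setSoTimeout(timeoutMs); // without this, read() would block forever
      try {
        client.getInputStream().read();
        return false; // the peer answered or closed the connection
      } catch (SocketTimeoutException e) {
        return true; // the timeout fired
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

A test with {{@Test(expected = SocketTimeoutException.class)}} relies on the same mechanism: the slow servlet keeps the connection open, and the configured timeout is what ends the wait.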

> YARN WebAppProxyServlet should support connection timeout to prevent proxy 
> server hang.
> ---
>
> Key: YARN-10720
> URL: https://issues.apache.org/jira/browse/YARN-10720
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10720.001.patch, YARN-10720.002.patch, 
> YARN-10720.003.patch, image-2021-03-29-14-04-33-776.png, 
> image-2021-03-29-14-05-32-708.png
>
>
> The following is what the proxy server shows: {color:#de350b}too many connections from one 
> client{color}. This caused the proxy server to hang, and the YARN web UI can't jump 
> to the web proxy.
> !image-2021-03-29-14-04-33-776.png|width=632,height=57!
> The following is the abnormal AM. The proxy server doesn't know that it is 
> already abnormal, so the connections can't be closed. We should add timeout 
> support to the proxy server to prevent this. One abnormal AM may cause 
> hundreds or even thousands of connections, which is a very heavy load.
> !image-2021-03-29-14-05-32-708.png|width=669,height=101!
>  
> After I killed the abnormal AM, the proxy server became healthy. This case 
> has happened many times in our production clusters; our clusters are huge, and 
> abnormal AMs occur regularly.
>  
> I will add timeout support to the web proxy server in this jira.
>  
> cc  [~pbacsko] [~ebadger] [~Jim_Brennan]  [~ztang]  [~epayne] [~gandras]  
> [~bteke]
>  






[jira] [Commented] (YARN-10718) Fix CapacityScheduler#initScheduler log error.

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312203#comment-17312203
 ] 

Peter Bacsko commented on YARN-10718:
-

Committed to trunk. Closing.

> Fix CapacityScheduler#initScheduler log error. 
> ---
>
> Key: YARN-10718
> URL: https://issues.apache.org/jira/browse/YARN-10718
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png
>
>
> !image-2021-03-28-00-03-28-244.png|width=972,height=52!
> The Resource toString() method already wraps its output in "<" and ">", so it's 
> wrong to add them again.
>  
>  






[jira] [Updated] (YARN-10718) Fix CapacityScheduler#initScheduler log error.

2021-03-31 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10718:

Labels: resourcemanager  (was: )

> Fix CapacityScheduler#initScheduler log error. 
> ---
>
> Key: YARN-10718
> URL: https://issues.apache.org/jira/browse/YARN-10718
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: resourcemanager
> Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png
>
>
> !image-2021-03-28-00-03-28-244.png|width=972,height=52!
> The Resource toString() method already wraps its output in "<" and ">", so it's 
> wrong to add them again.
>  
>  






[jira] [Updated] (YARN-10718) Fix CapacityScheduler#initScheduler log error.

2021-03-31 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10718:

Labels: capacity-scheduler capacityscheduler  (was: resourcemanager)

> Fix CapacityScheduler#initScheduler log error. 
> ---
>
> Key: YARN-10718
> URL: https://issues.apache.org/jira/browse/YARN-10718
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png
>
>
> !image-2021-03-28-00-03-28-244.png|width=972,height=52!
> The Resource toString() method already wraps its output in "<" and ">", so it's 
> wrong to add them again.
>  
>  






[jira] [Commented] (YARN-10718) Fix CapacityScheduler#initScheduler log error.

2021-03-31 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312195#comment-17312195
 ] 

Peter Bacsko commented on YARN-10718:
-

Thanks [~zhuqi], +1 LGTM.

Will commit this soon.

> Fix CapacityScheduler#initScheduler log error. 
> ---
>
> Key: YARN-10718
> URL: https://issues.apache.org/jira/browse/YARN-10718
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png
>
>
> !image-2021-03-28-00-03-28-244.png|width=972,height=52!
> The Resource toString() method already wraps its output in "<" and ">", so it's 
> wrong to add them again.
>  
>  






[jira] [Commented] (YARN-10674) fs2cs should generate auto-created queue deletion properties

2021-03-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307605#comment-17307605
 ] 

Peter Bacsko commented on YARN-10674:
-

Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk.

> fs2cs should generate auto-created queue deletion properties
> 
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, 
> YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10674) fs2cs should generate auto-created queue deletion properties

2021-03-24 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307602#comment-17307602
 ] 

Peter Bacsko commented on YARN-10674:
-

+1 LGTM. I'm going to commit this soon.

> fs2cs should generate auto-created queue deletion properties
> 
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, 
> YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Updated] (YARN-10674) fs2cs should generate auto-created queue deletion properties

2021-03-24 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10674:

Summary: fs2cs should generate auto-created queue deletion properties  
(was: fs2cs: should support auto created queue deletion.)

> fs2cs should generate auto-created queue deletion properties
> 
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, 
> YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306240#comment-17306240
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] I had a discussion with [~gandras], he will post an update soon.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, 
> YARN-10674.015.patch, YARN-10674.016.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10645) Fix queue state related update for auto created queue.

2021-03-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306203#comment-17306203
 ] 

Peter Bacsko commented on YARN-10645:
-

[~zhuqi] [~gandras] is this patch still needed? Andras' comment suggests 
that this ticket is a duplicate. Is it a dup? 

> Fix queue state related update for auto created queue.
> --
>
> Key: YARN-10645
> URL: https://issues.apache.org/jira/browse/YARN-10645
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10645.001.patch
>
>
> The queue state of an auto-created queue can no longer be updated after the 
> refactoring in YARN-10504.
> We should fix the queue-state-related logic.






[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with gpu resourceType.

2021-03-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306157#comment-17306157
 ] 

Peter Bacsko commented on YARN-10503:
-

The question is this part:

{noformat}
public enum AbsoluteResourceType {
  MEMORY, VCORES, GPUS, FPGAS
}
{noformat}

Do we want to treat GPUs and FPGAs like that? In other parts of the code, we 
have mem/vcore as primary resources, then an array of other resources.  For 
example, constructors from {{org.apache.hadoop.yarn.api.records.Resource}}:

{noformat}
  @Public
  @Stable
  public static Resource newInstance(long memory, int vCores,
      Map<String, Long> others) {
    if (others != null) {
      return new LightWeightResource(memory, vCores,
          ResourceUtils.createResourceTypesArray(others));
    } else {
      return newInstance(memory, vCores);
    }
  }

  @InterfaceAudience.Private
  @InterfaceStability.Unstable
  public static Resource newInstance(Resource resource) {
    Resource ret;
    int numberOfKnownResourceTypes = ResourceUtils
        .getNumberOfKnownResourceTypes();
    if (numberOfKnownResourceTypes > 2) {
      ret = new LightWeightResource(resource.getMemorySize(),
          resource.getVirtualCores(), resource.getResources());
    } else {
      ret = new LightWeightResource(resource.getMemorySize(),
          resource.getVirtualCores());
    }
    return ret;
  }
{noformat}

But with this modification, we sort of promote GPU and FPGA to the level of 
vcore and memory, at least from the perspective of the code, which also makes 
it inconsistent with the existing code.

This is just my opinion though. cc [~epayne] [~ebadger].
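
To illustrate the shape I mean, here is a minimal, self-contained sketch. It is not Hadoop code: {{ResourceShapeSketch}} and {{getValue}} are hypothetical names, and the resource names are used only as examples. Memory and vcores stay fixed fields, while everything else lives in a name-keyed map, so supporting another resource type needs no new enum constant.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not org.apache.hadoop.yarn.api.records.Resource.
class ResourceShapeSketch {
  private final long memory;                 // primary resource
  private final int vCores;                  // primary resource
  private final Map<String, Long> others;    // every other resource type, keyed by name

  ResourceShapeSketch(long memory, int vCores, Map<String, Long> others) {
    this.memory = memory;
    this.vCores = vCores;
    this.others = new HashMap<>();
    if (others != null) {
      this.others.putAll(others);
    }
  }

  // Looks up a value by resource name; unknown names yield 0 (an assumption of this sketch).
  long getValue(String name) {
    switch (name) {
      case "memory":
        return memory;
      case "vcores":
        return vCores;
      default:
        return others.getOrDefault(name, 0L);
    }
  }

  public static void main(String[] args) {
    Map<String, Long> extra = new HashMap<>();
    extra.put("yarn.io/gpu", 2L);
    ResourceShapeSketch r = new ResourceShapeSketch(4096L, 4, extra);
    System.out.println(r.getValue("memory"));
    System.out.println(r.getValue("yarn.io/gpu"));
  }
}
```

With this shape, GPU and FPGA are just two more entries in the map and are not promoted next to memory and vcores.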

> Support queue capacity in terms of absolute resources with gpu resourceType.
> 
>
> Key: YARN-10503
> URL: https://issues.apache.org/jira/browse/YARN-10503
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10503.001.patch, YARN-10503.002.patch, 
> YARN-10503.003.patch
>
>
> Now the absolute resources are memory and cores.
> {code:java}
> /**
>  * Different resource types supported.
>  */
> public enum AbsoluteResourceType {
>   MEMORY, VCORES;
> }{code}
> But in our GPU production clusters, we need to support more resource types.
> This is very important for cluster scaling when there are absolute demands 
> for different resource types.
>  
> This Jira will handle GPU first.






[jira] [Commented] (YARN-10704) The CS effective capacity for absolute mode in UI should support GPU and other custom resources.

2021-03-22 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306154#comment-17306154
 ] 

Peter Bacsko commented on YARN-10704:
-

Thanks [~zhuqi] I have some minor comments:

1.
{noformat}
sb.append("
[...]
{noformat}

> The CS effective capacity for absolute mode in UI should support GPU and 
> other custom resources.
> 
>
> Key: YARN-10704
> URL: https://issues.apache.org/jira/browse/YARN-10704
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10704.001.patch, YARN-10704.002.patch, 
> image-2021-03-19-12-05-28-412.png, image-2021-03-19-12-08-35-273.png
>
>
> Currently there is no information about the effective GPU capacity in the 
> UI for absolute resource mode.
> !image-2021-03-19-12-05-28-412.png|width=873,height=136!
> But we have this information in QueueMetrics:
> !image-2021-03-19-12-08-35-273.png|width=613,height=268!
>  
> This is very important for our GPU users who use absolute mode; at present 
> there is no way to see absolute GPU information in the CS Queue UI.






[jira] [Comment Edited] (YARN-10597) CSMappingPlacementRule should not create new instance of Groups

2021-03-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304971#comment-17304971
 ] 

Peter Bacsko edited comment on YARN-10597 at 3/19/21, 3:35 PM:
---

[~shuzirra] is it really that simple? You told me that there were a bunch of 
unit test failures when you tried to change it months back. Anyway, it's great 
news if the change is tiny.


was (Author: pbacsko):
[~shuzirra] is it really that simple? You told me that there were a bunch of 
unit test failures. Anyway, it's great news if the change is tiny.

> CSMappingPlacementRule should not create new instance of Groups
> ---
>
> Key: YARN-10597
> URL: https://issues.apache.org/jira/browse/YARN-10597
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10597.001.patch
>
>
> As [~ahussein] pointed out in YARN-10425, no new Groups instance should be 
> created.






[jira] [Commented] (YARN-10597) CSMappingPlacementRule should not create new instance of Groups

2021-03-19 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304971#comment-17304971
 ] 

Peter Bacsko commented on YARN-10597:
-

[~shuzirra] is it really that simple? You told me that there were a bunch of 
unit test failures. Anyway, it's great news if the change is tiny.

> CSMappingPlacementRule should not create new instance of Groups
> ---
>
> Key: YARN-10597
> URL: https://issues.apache.org/jira/browse/YARN-10597
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-10597.001.patch
>
>
> As [~ahussein] pointed out in YARN-10425, no new Groups instance should be 
> created.






[jira] [Commented] (YARN-10641) Refactor the max app related update, and fix maxApllications update error when add new queues.

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304117#comment-17304117
 ] 

Peter Bacsko commented on YARN-10641:
-

+1

Thanks for the patch [~zhuqi] and [~gandras] for the review. Committed to trunk.

> Refactor the max app related update, and fix maxApllications update error 
> when add new queues.
> --
>
> Key: YARN-10641
> URL: https://issues.apache.org/jira/browse/YARN-10641
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Critical
> Attachments: YARN-10641.001.patch, YARN-10641.002.patch, 
> YARN-10641.003.patch, YARN-10641.004.patch, YARN-10641.005.patch, 
> YARN-10641.006.patch, image-2021-02-20-15-49-58-677.png, 
> image-2021-02-20-15-53-51-099.png, image-2021-02-20-15-55-44-780.png, 
> image-2021-02-20-16-29-18-519.png, image-2021-02-20-16-31-13-714.png
>
>
> After the update logic was refactored in YARN-10504, the max applications 
> update based on abs/cap is wrong. This should be fixed, 
> because max applications is a key part of limiting applications in CS.
> For example: 
> When adding a dynamic queue, the max apps of the parent queue's other 
> children are not updated correctly:
> !image-2021-02-20-15-53-51-099.png|width=639,height=509!  
> The newly added queue's max apps is updated correctly:
> !image-2021-02-20-15-55-44-780.png|width=542,height=426!






[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304089#comment-17304089
 ] 

Peter Bacsko commented on YARN-10692:
-

Thanks [~zhuqi] for the patch, committed to trunk.

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10692.001.patch, YARN-10692.002.patch, 
> YARN-10692.003.patch
>
>
> Currently there is no node-level GPU utilization metric. This issue will add 
> it, starting with NodeMetrics.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  






[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304078#comment-17304078
 ] 

Peter Bacsko commented on YARN-10692:
-

+1 LGTM.

Committing this soon.

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10692.001.patch, YARN-10692.002.patch, 
> YARN-10692.003.patch
>
>
> Currently there is no node-level GPU utilization metric. This issue will add 
> it, starting with NodeMetrics.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  






[jira] [Commented] (YARN-10685) Fix typos in AbstractCSQueue

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304041#comment-17304041
 ] 

Peter Bacsko commented on YARN-10685:
-

+1 thanks [~zhuqi] for the patch, committed to trunk.

> Fix typos in AbstractCSQueue
> 
>
> Key: YARN-10685
> URL: https://issues.apache.org/jira/browse/YARN-10685
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10685.001.patch, YARN-10685.002.patch, 
> YARN-10685.003.patch
>
>







[jira] [Updated] (YARN-10685) Fix typos in AbstractCSQueue

2021-03-18 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10685:

Summary: Fix typos in AbstractCSQueue  (was: Fixed some Typo  in 
AbstractCSQueue.)

> Fix typos in AbstractCSQueue
> 
>
> Key: YARN-10685
> URL: https://issues.apache.org/jira/browse/YARN-10685
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10685.001.patch, YARN-10685.002.patch, 
> YARN-10685.003.patch
>
>







[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-18 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304027#comment-17304027
 ] 

Peter Bacsko commented on YARN-10674:
-

Thanks [~zhuqi] for the patch. I think we are very close.

I still have some comments:
 1.
{noformat}
  private FSConfigToCSConfigConverterParams.PreemptionMode disablePreemption;
  private FSConfigToCSConfigConverterParams.PreemptionMode preemptionMode;
{noformat}
We don't need two enums. We need only one which covers all states (enabled / 
observeonly / nopolicy).

You can extend {{PreemptionMode}} with a new variable which says whether it's 
enabled or disabled:
{noformat}
  public enum PreemptionMode {
    ENABLE("enable", true),
    NO_POLICY("nopolicy", false),
    OBSERVE_ONLY("observeonly", false);

    private String cliOption;
    private boolean enabled;

    PreemptionMode(String cliOption, boolean enabled) {
      this.cliOption = cliOption;
      this.enabled = enabled;
    }

    public String getCliOption() {
      return cliOption;
    }

    public boolean isEnabled() {
      return enabled;
    }
  }
So you just call {{preemptionMode.isEnabled()}} and don't need two variables 
just to hold the information whether it's enabled or not.

2. {{public static PreemptionMode fromString(String cliOption)}} --> this 
method never returns ENABLED, which is important (also, please change "ENABLE" 
to "ENABLED", note the "D" at the end).

cc [~gandras] please review patch v14.
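
For point 2, something along these lines is what I have in mind. This is only a sketch, not the patch: the wrapper class {{PreemptionModeSketch}} is hypothetical, and rejecting unknown or missing input with an exception is my assumption (the real method may prefer a default value).

```java
// Hypothetical sketch of the single-enum approach; not the committed code.
class PreemptionModeSketch {

  enum PreemptionMode {
    ENABLED("enable", true),
    NO_POLICY("nopolicy", false),
    OBSERVE_ONLY("observeonly", false);

    private final String cliOption;
    private final boolean enabled;

    PreemptionMode(String cliOption, boolean enabled) {
      this.cliOption = cliOption;
      this.enabled = enabled;
    }

    String getCliOption() {
      return cliOption;
    }

    boolean isEnabled() {
      return enabled;
    }

    // Walks over every constant, so ENABLED can be returned as well.
    // Assumption: null or unknown input is rejected rather than mapped to a default.
    static PreemptionMode fromString(String cliOption) {
      if (cliOption != null) {
        String trimmed = cliOption.trim();
        for (PreemptionMode mode : values()) {
          if (mode.cliOption.equals(trimmed)) {
            return mode;
          }
        }
      }
      throw new IllegalArgumentException(
          "Unknown preemption mode: " + cliOption);
    }
  }

  public static void main(String[] args) {
    System.out.println(PreemptionMode.fromString("enable").isEnabled());
    System.out.println(PreemptionMode.fromString("observeonly").isEnabled());
  }
}
```

Looping over {{values()}} means a newly added constant is picked up by {{fromString}} without touching the method.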

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Comment Edited] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303542#comment-17303542
 ] 

Peter Bacsko edited comment on YARN-10692 at 3/17/21, 4:11 PM:
---

Thanks [~zhuqi] in general this looks good.

I just have two nits:
1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, 
the method name looks better this way

2. {{getNodeGPUUtilization()}} you can simplify the addition with streams:
{noformat}
float totalGpuUtilization = 0;
if (gpuList != null && gpuList.size() != 0) {
  totalGpuUtilization = gpuList
      .stream()
      .map(g -> g.getGpuUtilizations().getOverallGpuUtilization())
      .collect(Collectors.summingDouble(Float::floatValue))
      .floatValue() / gpuList.size();
}

return totalGpuUtilization;
{noformat}

Also, you should consider renaming "totalGpuUtilization" to 
"nodeGpuUtilization" so that it matches the method name.
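
A possible further simplification, as a self-contained sketch: let the stream compute the average directly. Here {{NodeGpuUtilizationSketch}} is a hypothetical class, and the method takes a plain list of per-GPU values where the real code has GPU device objects, so treat it as an illustration only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: the input is a list of per-GPU utilization values.
class NodeGpuUtilizationSketch {

  // Returns the mean utilization, or 0 when the list is null or empty.
  static float getNodeGpuUtilization(List<Float> perGpuUtilization) {
    if (perGpuUtilization == null) {
      return 0f;
    }
    return (float) perGpuUtilization.stream()
        .mapToDouble(Float::doubleValue)
        .average()      // OptionalDouble, empty for an empty list
        .orElse(0.0);
  }

  public static void main(String[] args) {
    System.out.println(getNodeGpuUtilization(Arrays.asList(20f, 40f, 60f)));
  }
}
```

{{average()}} handles the empty list through {{orElse}}, so the explicit size check and the manual division are no longer needed.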


was (Author: pbacsko):
Thanks [~zhuqi] in general this looks good.

I just have two nits:
1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, 
the method name looks better this way

2. {{getNodeGPUUtilization()}} you can simplify the addition with streams:
{noformat}
float totalGpuUtilization = 0;
if (gpuList != null && gpuList.size() != 0) {
  totalGpuUtilization = gpuList
      .stream()
      .map(g -> g.getGpuUtilizations().getOverallGpuUtilization())
      .collect(Collectors.summingDouble(Float::floatValue))
      .floatValue() / gpuList.size();
}

return totalGpuUtilization;
{noformat}

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10692.001.patch, YARN-10692.002.patch
>
>
> Currently there is no node-level GPU utilization metric. This issue will add 
> it, starting with NodeMetrics.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  






[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303542#comment-17303542
 ] 

Peter Bacsko commented on YARN-10692:
-

Thanks [~zhuqi] in general this looks good.

I just have two nits:
1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, 
the method name looks better this way

2. {{getNodeGPUUtilization()}} you can simplify the addition with streams:
{noformat}
float totalGpuUtilization = 0;
if (gpuList != null && gpuList.size() != 0) {
  totalGpuUtilization = gpuList
      .stream()
      .map(g -> g.getGpuUtilizations().getOverallGpuUtilization())
      .collect(Collectors.summingDouble(Float::floatValue))
      .floatValue() / gpuList.size();
}

return totalGpuUtilization;
{noformat}

> Add Node GPU Utilization and apply to NodeMetrics.
> --
>
> Key: YARN-10692
> URL: https://issues.apache.org/jira/browse/YARN-10692
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10692.001.patch, YARN-10692.002.patch
>
>
> Currently there is no node-level GPU utilization metric. This issue will add 
> it, starting with NodeMetrics.
> cc [~pbacsko]  [~Jim_Brennan]  [~ebadger]  [~gandras]  






[jira] [Updated] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2021-03-17 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10497:

Labels: capacity-scheduler capacityscheduler  (was: )

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Fix For: 3.4.0
>
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, 
> YARN-10497.006.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List<String> siblingQueues = getSiblingQueues(queueToRemove,
>     proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> }
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>   private List<String> getSiblingQueues(String queuePath, Configuration conf) {
>     String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
>     String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
>         parentQueue + CapacitySchedulerConfiguration.DOT +
>         CapacitySchedulerConfiguration.QUEUES;
>     return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's the capacity-scheduler.xml I got:
> {code:java}
> <property>
>   <name>yarn.scheduler.capacity.root.queues</name>
>   <value>default, q1, q2</value>
> </property>
> {code}
> Notice the spaces between default, q1, q2.
> So conf.getStringCollection returns (the entries are not trimmed):
> {code:java}
> default
> q1
> ...
> {code}
> This causes a mismatch when we try to delete the queue.
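
The mismatch described above can be reproduced with plain Java. This sketch does not use Hadoop's {{Configuration}}; {{QueueListTrimSketch}}, {{splitRaw}} and {{splitTrimmed}} are hypothetical helpers that only mimic untrimmed versus trimmed splitting of a comma-separated value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helpers; they only imitate untrimmed vs. trimmed splitting.
class QueueListTrimSketch {

  // Splits on commas only, keeping surrounding whitespace in each entry.
  static List<String> splitRaw(String value) {
    return new ArrayList<>(Arrays.asList(value.split(",")));
  }

  // Splits on commas, trims each entry and drops empty ones.
  static List<String> splitTrimmed(String value) {
    List<String> result = new ArrayList<>();
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        result.add(trimmed);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    String configured = "default, q1, q2";
    System.out.println(splitRaw(configured).contains("q1"));      // false
    System.out.println(splitTrimmed(configured).contains("q1"));  // true
  }
}
```

Hadoop's {{Configuration}} also has {{getTrimmedStringCollection}}, which may be the more direct way to get trimmed entries; whether the patch uses it is not stated in this thread.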






[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303365#comment-17303365
 ] 

Peter Bacsko commented on YARN-10497:
-

+1

Thanks [~wangda] / [~zhuqi] for the patch and [~gandras], [~shuzirra]  for the 
review. Committed to trunk.

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, 
> YARN-10497.006.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List<String> siblingQueues = getSiblingQueues(queueToRemove,
>     proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> }
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>   private List<String> getSiblingQueues(String queuePath, Configuration conf) {
>     String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
>     String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
>         parentQueue + CapacitySchedulerConfiguration.DOT +
>         CapacitySchedulerConfiguration.QUEUES;
>     return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's the capacity-scheduler.xml I got:
> {code:java}
> <property>
>   <name>yarn.scheduler.capacity.root.queues</name>
>   <value>default, q1, q2</value>
> </property>
> {code}
> Notice the spaces between default, q1, q2.
> So conf.getStringCollection returns (the entries are not trimmed):
> {code:java}
> default
> q1
> ...
> {code}
> This causes a mismatch when we try to delete the queue.






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303342#comment-17303342
 ] 

Peter Bacsko commented on YARN-10674:
-

[~gandras] good suggestions, thanks! [~zhuqi] please apply the suggested 
modifications. 

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303245#comment-17303245
 ] 

Peter Bacsko commented on YARN-10497:
-

I think it's good. Let's wait for Jenkins and I'll commit it.

> Fix an issue in CapacityScheduler which fails to delete queues
> --
>
> Key: YARN-10497
> URL: https://issues.apache.org/jira/browse/YARN-10497
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Major
> Attachments: YARN-10497.001.patch, YARN-10497.002.patch, 
> YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, 
> YARN-10497.006.patch
>
>
> We saw an exception when using queue mutation APIs:
> {code:java}
> 2020-11-13 16:47:46,327 WARN 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: 
> CapacityScheduler configuration validation failed:java.io.IOException: Queue 
> root.am2cmQueueSecond not found
> {code}
> Which comes from this code:
> {code:java}
> List<String> siblingQueues = getSiblingQueues(queueToRemove,
>     proposedConf);
> if (!siblingQueues.contains(queueName)) {
>   throw new IOException("Queue " + queueToRemove + " not found");
> }
> {code}
> (Inside MutableCSConfigurationProvider)
> If you look at the method:
> {code:java}
>   private List<String> getSiblingQueues(String queuePath, Configuration conf) {
>     String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.'));
>     String childQueuesKey = CapacitySchedulerConfiguration.PREFIX +
>         parentQueue + CapacitySchedulerConfiguration.DOT +
>         CapacitySchedulerConfiguration.QUEUES;
>     return new ArrayList<>(conf.getStringCollection(childQueuesKey));
>   }
> {code}
> And here's the capacity-scheduler.xml I got:
> {code:java}
> <property>
>   <name>yarn.scheduler.capacity.root.queues</name>
>   <value>default, q1, q2</value>
> </property>
> {code}
> Notice the spaces between default, q1, q2.
> So conf.getStringCollection returns (the entries are not trimmed):
> {code:java}
> default
> q1
> ...
> {code}
> This causes a mismatch when we try to delete the queue.






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-17 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303222#comment-17303222
 ] 

Peter Bacsko commented on YARN-10674:
-

[~gandras] do you have further comments? I think the patch is in good shape now.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, 
> YARN-10674.012.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Comment Edited] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302878#comment-17302878
 ] 

Peter Bacsko edited comment on YARN-10370 at 3/16/21, 8:36 PM:
---

[~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There 
are some open tasks left.

I think it's safe to say that this feature is ready and the remaining tasks can 
be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we 
might need to keep this open for a long time.

IMO we should move the open / patch available tasks under a new umbrella and 
resolve this, marked with a proper Fix version.

Opinions?


was (Author: pbacsko):
[~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There 
are some open tasks left.

I think it's safe to say that the umbrella is done and the remaining tasks can 
be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we 
might need to keep this open for a long time.

IMO we should move the open / patch available tasks under a new umbrella and 
resolve this, marked with a proper Fix version.

Opinions?

> [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue 
> Mapping rules
> ---
>
> Key: YARN-10370
> URL: https://issues.apache.org/jira/browse/YARN-10370
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Attachments: MappingRuleEnhancements.pdf, Possible extensions of 
> mapping rule format in Capacity Scheduler.pdf
>
>
> To continue closing the feature gaps between Fair Scheduler and Capacity 
> Scheduler and help users migrate between the schedulers more easily, we need 
> to add some of the Fair Scheduler placement rules to the Capacity Scheduler's 
> queue mapping functionality.
> With [~snemeth] and [~pbacsko] we've created the following design docs about 
> the proposed changes.






[jira] [Commented] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302878#comment-17302878
 ] 

Peter Bacsko commented on YARN-10370:
-

[~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There 
are some open tasks left.

I think it's safe to say that the umbrella is done and the remaining tasks can 
be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we 
might need to keep this open for a long time.

IMO we should move the open / patch available tasks under a new umbrella and 
resolve this, marked with a proper Fix version.

Opinions?

> [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue 
> Mapping rules
> ---
>
> Key: YARN-10370
> URL: https://issues.apache.org/jira/browse/YARN-10370
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: yarn
>Reporter: Gergely Pollak
>Assignee: Gergely Pollak
>Priority: Major
>  Labels: capacity-scheduler, capacityscheduler
> Attachments: MappingRuleEnhancements.pdf, Possible extensions of 
> mapping rule format in Capacity Scheduler.pdf
>
>
> To continue closing the feature gaps between Fair Scheduler and Capacity 
> Scheduler and help users migrate between the schedulers more easily, we need 
> to add some of the Fair Scheduler placement rules to the Capacity Scheduler's 
> queue mapping functionality.
> With [~snemeth] and [~pbacsko] we've created the following design docs about 
> the proposed changes.






[jira] [Commented] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302599#comment-17302599
 ] 

Peter Bacsko commented on YARN-10686:
-

+1

Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk.

> Fix 
> TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
> -
>
> Key: YARN-10686
> URL: https://issues.apache.org/jira/browse/YARN-10686
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10686.001.patch
>
>







[jira] [Updated] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode

2021-03-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10686:

Summary: Fix 
TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
  (was: Fix testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode user 
error.)

> Fix 
> TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
> -
>
> Key: YARN-10686
> URL: https://issues.apache.org/jira/browse/YARN-10686
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10686.001.patch
>
>







[jira] [Commented] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302567#comment-17302567
 ] 

Peter Bacsko commented on YARN-10682:
-

+1

Thanks for the patch [~zhuqi] and [~gandras] for the review, committed to trunk.

> The scheduler monitor policies conf should trim values separated by comma
> -
>
> Key: YARN-10682
> URL: https://issues.apache.org/jira/browse/YARN-10682
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10682.001.patch
>
>
> When I configured the scheduler monitor policies with spaces after the 
> commas, the RM reported an error at startup.
> The conf should trim the values separated by ",". For example, 
> "a,b,c" is supported now, but "a,   b,  c" is not; this jira adds the 
> trimming.
> 
> It happened when I tested with multiple policies:
> 
> <property>
>   <name>yarn.resourcemanager.scheduler.monitor.policies</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueueConfigurationAutoRefreshPolicy,
>     org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy</value>
> </property>
> 
>  
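
The trimming requested in the description above can be sketched in plain Java. This is only an illustration, not the code from YARN-10682.001.patch: the class name PolicyListParser and the method parseTrimmed are hypothetical, and the sketch works on a raw String rather than on a Hadoop Configuration object.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration; not part of the YARN code base. */
final class PolicyListParser {
  private PolicyListParser() {
  }

  /** Splits on commas, trims every entry and drops entries that end up empty. */
  static List<String> parseTrimmed(String rawValue) {
    if (rawValue == null) {
      return Collections.emptyList();
    }
    return Arrays.stream(rawValue.split(","))
        .map(String::trim)
        .filter(entry -> !entry.isEmpty())
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Both spellings from the description yield the same three entries.
    System.out.println(parseTrimmed("a,b,c"));
    System.out.println(parseTrimmed("a,   b,  c"));
  }
}
```

Because String.trim() also removes line breaks, a value that is wrapped across several lines inside the XML element is handled the same way as one with plain spaces.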






[jira] [Updated] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma

2021-03-16 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10682:

Summary: The scheduler monitor policies conf should trim values separated 
by comma  (was: The scheduler monitor policies conf should support trim between 
",".)

> The scheduler monitor policies conf should trim values separated by comma
> -
>
> Key: YARN-10682
> URL: https://issues.apache.org/jira/browse/YARN-10682
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10682.001.patch
>
>
> When I configured the scheduler monitor policies with spaces after the 
> commas, the RM reported an error at startup.
> The conf should trim the values separated by ",". For example, 
> "a,b,c" is supported now, but "a,   b,  c" is not; this jira adds the 
> trimming.
> 
> It happened when I tested with multiple policies:
> 
> <property>
>   <name>yarn.resourcemanager.scheduler.monitor.policies</name>
>   <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueueConfigurationAutoRefreshPolicy,
>     org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy</value>
> </property>
> 
>  






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-16 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302548#comment-17302548
 ] 

Peter Bacsko commented on YARN-10674:
-

Thanks [~zhuqi], this definitely looks better. We're close to the final 
version.

Some comments:
 1.
{noformat}
Disable the preemption with nopolicy or observeonly mode, " +
"default mode is nopolicy with no arg." +
"When use nopolicy arg, it means to remove " +
"ProportionalCapacityPreemptionPolicy for CS preemption, " +
"When use observeonly arg, " +
"it means to set " +

"yarn.resourcemanager.monitor.capacity.preemption.observe_only " +
"to true"
{noformat}
I'd slightly modify this text:
{noformat}
Disable the preemption with \"nopolicy\" or \"observeonly\" mode.
Default is \"nopolicy\".
\"nopolicy\" removes ProportionalCapacityPreemptionPolicy from
the list of monitor policies.
\"observeonly\" sets
\"yarn.resourcemanager.monitor.capacity.preemption.observe_only\"
to true.
{noformat}
2. This definition:
 {{private String disablePreemptionMode;}}

This should be a simple enum like:
{noformat}
public enum DisablePreemptionMode {
  OBSERVE_ONLY {
    @Override
    String getCliOption() {
      return "observeonly";
    }
  },
  NO_POLICY {
    @Override
    String getCliOption() {
      return "nopolicy";
    }
  };

  abstract String getCliOption();
}
{noformat}
So you can also use them here:
{noformat}
  private static void checkDisablePreemption(CliOption cliOption,
      String disablePreemptionMode) {
    if (disablePreemptionMode == null ||
        disablePreemptionMode.trim().isEmpty()) {
      // The default mode is nopolicy.
      return;
    }

    try {
      // Note: valueOf() matches the constant name (e.g. NO_POLICY), so the
      // CLI string ("nopolicy") has to be mapped to the constant first.
      DisablePreemptionMode.valueOf(disablePreemptionMode);
    } catch (IllegalArgumentException e) {
      throw new PreconditionException(
          String.format("Specified disable-preemption option %s is illegal, " +
              "use \"nopolicy\" or \"observeonly\"", disablePreemptionMode));
    }
  }
{noformat}
"disablePreemptionMode" should be an enum everywhere.

3.
{noformat}
  public void convertSiteProperties(Configuration conf,
      Configuration yarnSiteConfig, boolean drfUsed,
      boolean enableAsyncScheduler, boolean userPercentage,
      boolean disablePreemption, String disablePreemptionMode) {
{noformat}
Here "disablePreemptionMode" should also be an enum; make sure that it 
always has a value. If it always has a value, this part becomes much simpler:
{noformat}
  if (disablePreemption &&
      disablePreemptionMode == DisablePreemptionMode.NO_POLICY) {
    yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, "");
  }
}
{noformat}
4.
 {{AutoCreatedQueueDeletionPolicy.class.getCanonicalName()}}

This expression is used very often in the tests. Instead, use a final String:
{noformat}
private static final String DELETION_POLICY_CLASS =
   AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
{noformat}
So the readability becomes much better.
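
One way to turn the CLI value into the enum suggested in point 2 is sketched below. It is an illustration under assumptions, not the code of any attached patch: the fromCliOption method is hypothetical, the constants carry their CLI option in a field instead of overriding an abstract method, and IllegalArgumentException stands in for the PreconditionException used in fs2cs. The lookup goes through the CLI option because Enum.valueOf() only accepts constant names such as NO_POLICY, not "nopolicy".

```java
import java.util.Locale;

/** Sketch of the suggested enum; fromCliOption() is a hypothetical addition. */
enum DisablePreemptionMode {
  OBSERVE_ONLY("observeonly"),
  NO_POLICY("nopolicy");

  private final String cliOption;

  DisablePreemptionMode(String cliOption) {
    this.cliOption = cliOption;
  }

  String getCliOption() {
    return cliOption;
  }

  /** Maps a CLI value to a mode; null or blank input falls back to NO_POLICY. */
  static DisablePreemptionMode fromCliOption(String value) {
    if (value == null || value.trim().isEmpty()) {
      return NO_POLICY;
    }
    String normalized = value.trim().toLowerCase(Locale.ROOT);
    for (DisablePreemptionMode mode : values()) {
      if (mode.cliOption.equals(normalized)) {
        return mode;
      }
    }
    throw new IllegalArgumentException(String.format(
        "Specified disable-preemption option %s is illegal, "
            + "use \"nopolicy\" or \"observeonly\"", value));
  }

  public static void main(String[] args) {
    System.out.println(fromCliOption("observeonly"));
    System.out.println(fromCliOption(null));
  }
}
```

With a lookup like this the mode always has a value after argument parsing, so later code can compare enum constants instead of strings.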

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, 
> YARN-10674.009.patch, YARN-10674.010.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-12 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300319#comment-17300319
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] I didn't have too much time to deeply review the patch, but your 
change ignores the "observeonly" setting. So, if I use "\-\-disablepreemption 
observeonly", nothing happens. Could you insert this to 
{{FSConfigToCSConfigConverter}}? I believe that is the best place for it.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-12 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300319#comment-17300319
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/12/21, 1:34 PM:
---

[~zhuqi] I didn't have too much time to deeply review the patch, but your 
change ignores the "observeonly" setting. So, if I use "\-\-disablepreemption 
observeonly", nothing happens. Could you insert this to 
{{FSConfigToCSConfigConverter}}? I believe that is the best place for it.


was (Author: pbacsko):
[~zhuqi] I didn't have too much time to deeply review the patch, but your 
change ignore the "observeonly" setting. So, if I use "\-\-disablepreemption 
observeonly", nothing happens. Could you insert this to 
{{FSConfigToCSConfigConverter}}? I believe that is the best place for it.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch, YARN-10674.007.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/11/21, 3:25 PM:
---

Ok, I did some research, I think we have 3 options to completely disable 
preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?
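
A rough sketch of how the two proposed values could map to configuration changes follows. It is only an illustration: plain Maps stand in for yarn-site.xml and capacity-scheduler.xml, and the class DisablePreemptionSketch and its apply method are hypothetical names. The property names and the policy class name are the ones quoted in this thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical illustration; the Maps stand in for the two XML files. */
final class DisablePreemptionSketch {
  static final String MONITOR_POLICIES =
      "yarn.resourcemanager.scheduler.monitor.policies";
  static final String OBSERVE_ONLY =
      "yarn.resourcemanager.monitor.capacity.preemption.observe_only";
  static final String PREEMPTION_POLICY =
      "org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity."
          + "ProportionalCapacityPreemptionPolicy";

  private DisablePreemptionSketch() {
  }

  static void apply(String mode, Map<String, String> yarnSite,
      Map<String, String> capacityScheduler) {
    if ("observeonly".equals(mode)) {
      // Option 3: keep the policy, but only observe.
      capacityScheduler.put(OBSERVE_ONLY, "true");
    } else if ("nopolicy".equals(mode)) {
      // Option 2: drop the preemption policy from the list of policies.
      String current = yarnSite.getOrDefault(MONITOR_POLICIES, "");
      String remaining = Arrays.stream(current.split(","))
          .map(String::trim)
          .filter(p -> !p.isEmpty() && !p.equals(PREEMPTION_POLICY))
          .collect(Collectors.joining(","));
      yarnSite.put(MONITOR_POLICIES, remaining);
    } else {
      throw new IllegalArgumentException("Unknown mode: " + mode);
    }
  }

  public static void main(String[] args) {
    Map<String, String> yarnSite = new HashMap<>();
    Map<String, String> capacityScheduler = new HashMap<>();
    yarnSite.put(MONITOR_POLICIES, PREEMPTION_POLICY);
    apply("nopolicy", yarnSite, capacityScheduler);
    System.out.println(yarnSite);
  }
}
```

Filtering the list instead of overwriting it with an empty string keeps any other monitor policy that is already configured.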


was (Author: pbacsko):
Ok, I did some research, I think we 3 options to completely disable preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/11/21, 2:43 PM:
---

Ok, I did some research, I think we have 3 options to completely disable 
preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?


was (Author: pbacsko):
Ok, I did some research, I think we 3 options to completely disable preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT in 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602
 ] 

Peter Bacsko commented on YARN-10674:
-

Ok, I did some research, I think we have 3 options to completely disable 
preemption:

1) Set disable_preemption to "root", which will propagate down to other queues.
2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies.
3) Enable "observe_only" property.

I think #1 is not really good, because it relies on a side-effect (propagation 
of a setting). The intention is not clear.

#2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be 
in {{FSYarnSiteConverter}}.
#3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT in 
{{yarn-site.xml}}, I just verified it. So this should be placed somewhere else.

So we can do:
1) Vote for what's best
2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with 
values like "nopolicy" or "observeonly" and we pick a default value, eg. 
"nopolicy". So we can do something like:
{noformat}
yarn fs2cs --disable-preemption observeonly --yarnsiteconfig 
/path/to/yarn-site.xml 
{noformat}

[~gandras] [~zhuqi] what do you think?

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299466#comment-17299466
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] yes that's right.

This is the default setting for policies:

{noformat}
  <property>
    <description>The list of SchedulingEditPolicy classes that interact with
    the scheduler. A particular module may be incompatible with the
    scheduler, other policies, or a configuration of either.</description>
    <name>yarn.resourcemanager.scheduler.monitor.policies</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
  </property>
{noformat}

This is from {{yarn-default.xml}}. So when we don't use preemption, we should 
remove this policy.

But we actually have to think a little bit, because how we disable preemption 
affects our downstream Hadoop codebase. So let's wait until we figure out what 
the best solution to turn off preemption is.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299456#comment-17299456
 ] 

Peter Bacsko commented on YARN-10674:
-

[~gandras] h - that's true. I just overcomplicated the whole thing (not 
that preemption in general is easy to begin with).

Yes, we don't need it if we don't have the policy.

[~zhuqi] please wait with the new patch. What Andras said is correct, but there 
might be other changes that I'll recommend.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-11 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299427#comment-17299427
 ] 

Peter Bacsko commented on YARN-10674:
-

I'll do a deeper review today.

[~gandras] you say: "Is setting observe only necessary here? This is an 
extremely subtle property.".

I'm not sure how subtle it is, but it is mentioned in the upstream 
documentation:
|{{yarn.resourcemanager.monitor.capacity.preemption.observe_only}}|If true, run 
the policy but do not affect the cluster with preemption and kill events. 
Default value is false|

However, if someone thinks that disabling preemption for "root" is a better 
solution, I'm not against that. We might need other folks to chime in and share 
their thoughts.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, 
> YARN-10674.006.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Commented] (YARN-10685) Fixed some Typo in AbstractCSQueue.

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298874#comment-17298874
 ] 

Peter Bacsko commented on YARN-10685:
-

Sure, I'll check it out.

> Fixed some Typo  in AbstractCSQueue.
> 
>
> Key: YARN-10685
> URL: https://issues.apache.org/jira/browse/YARN-10685
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10685.001.patch
>
>







[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298861#comment-17298861
 ] 

Peter Bacsko commented on YARN-10571:
-

[~gandras] thanks for the patch.

I just have one question: the class {{CapacitySchedulerAutoQueueHandler}} was 
renamed to {{CapacitySchedulerQueueHandler}}. But the latter is telling me that 
this is a class which handles all kinds of queues, not just auto-created queues. 
Wouldn't it make sense to keep the original name? Even the instance is called 
{{autoQueueHandler}}.

Also, there's a Javadoc and a checkstyle problem.

> Refactor dynamic queue handling logic
> -
>
> Key: YARN-10571
> URL: https://issues.apache.org/jira/browse/YARN-10571
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Andras Gyori
>Assignee: Andras Gyori
>Priority: Minor
> Attachments: YARN-10571.001.patch
>
>
> As per YARN-10506 we have introduced another mode for auto queue creation 
> and a new class, which handles it. We should move the old, managed queue 
> related logic to CSAutoQueueHandler as well, and do additional cleanup 
> regarding queue management.






[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298842#comment-17298842
 ] 

Peter Bacsko commented on YARN-10674:
-

Ok, here is what I found:

1. {{RM_SCHEDULER_ENABLE_MONITORS}} --> ok, this can be set to "true" in all 
cases.

2. If FS preemption is disabled --> there is a property which is better than 
configuring the "root" queue. If FS preemption is disabled 
({{yarn.scheduler.fair.preemption}} = {{false}}),
then we should generate 
{{yarn.resourcemanager.monitor.capacity.preemption.observe_only}} = {{true}}. 
This means that we have the monitor thread running but we don't do any 
preemption. So we don't need to set "root.disable_preemption".

3. As I mentioned, the {{Configuration}} object is empty. The problem is, in 
order to use the preemption, we need to set the preemption policy, which is 
missing right now. So, if FS preemption is enabled, this line must be added:


{noformat}
    if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION,
        FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {
      yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
          ProportionalCapacityPreemptionPolicy.class.getCanonicalName());
...
{noformat}

So, the modified code should look like this:

{noformat}
yarnSiteConfig.setBoolean(
    YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);

if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION,
    FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {

  yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
      ProportionalCapacityPreemptionPolicy.class.getCanonicalName());
  ...
} else {
  // no preemption
  yarnSiteConfig.setBoolean(
      CapacitySchedulerConfiguration.PREEMPTION_OBSERVE_ONLY, true);
}

// new code comes here
if (!usePercentages) {
  String policies =
      yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES);
  if (policies == null) {
    ...
{noformat}
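
The branching above can be sketched without any Hadoop dependency. This is only an illustration under stated assumptions: a plain {{Map}} stands in for both {{Configuration}} objects, the class and method names ({{SitePreemptionSketch}}, {{convert}}) are hypothetical, the monitor property keys are written as the literals that the {{YarnConfiguration}} constants are assumed to resolve to, the policy value is a placeholder for the canonical class name, and FS preemption is assumed to default to {{false}}.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SitePreemptionSketch {
  static final String FS_PREEMPTION = "yarn.scheduler.fair.preemption";
  // Assumed literal values of the YarnConfiguration constants.
  static final String ENABLE_MONITORS =
      "yarn.resourcemanager.scheduler.monitor.enable";
  static final String MONITOR_POLICIES =
      "yarn.resourcemanager.scheduler.monitor.policies";
  static final String OBSERVE_ONLY =
      "yarn.resourcemanager.monitor.capacity.preemption.observe_only";
  // Placeholder for ProportionalCapacityPreemptionPolicy.class.getCanonicalName().
  static final String PREEMPTION_POLICY = "ProportionalCapacityPreemptionPolicy";

  /** Builds the preemption-related output properties from the FS input. */
  static Map<String, String> convert(Map<String, String> fsSite) {
    Map<String, String> out = new LinkedHashMap<>();
    // The monitor thread is enabled in all cases.
    out.put(ENABLE_MONITORS, "true");

    boolean fsPreemption =
        Boolean.parseBoolean(fsSite.getOrDefault(FS_PREEMPTION, "false"));
    if (fsPreemption) {
      // Preemption is wanted, so the preemption policy has to be set.
      out.put(MONITOR_POLICIES, PREEMPTION_POLICY);
    } else {
      // The monitors keep running, but nothing is preempted.
      out.put(OBSERVE_ONLY, "true");
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, String> enabled = new LinkedHashMap<>();
    enabled.put(FS_PREEMPTION, "true");
    System.out.println(convert(enabled));
    System.out.println(convert(new LinkedHashMap<>()));
  }
}
```

The weight-mode part ("new code comes here") is left out on purpose, because its body is elided in the snippet above.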


Please modify the test cases accordingly and fix the checkstyle issues as well.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch, 
> YARN-10674.003.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298776#comment-17298776
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/10/21, 12:03 PM:


[~zhuqi] thanks for the patch. I found a new property which is probably good 
for us if preemption is completely disabled on the FS side. I have to check if 
it is really acceptable.


was (Author: pbacsko):
[~zhuqi] thanks for the patch. I found a new property which is probably good 
for us if preemption is completely disabled on the FS side. I have to check if 
it is good for us.




[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-10 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298776#comment-17298776
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] thanks for the patch. I found a new property which is probably good 
for us if preemption is completely disabled on the FS side. I have to check if 
it is good for us.




[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298156#comment-17298156
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] this is very interesting. If we set RM Monitors to enabled, it means 
that system-wide preemption is always enabled, too:

AbstractCSQueue:
{noformat}
  private boolean isQueueHierarchyPreemptionDisabled(CSQueue q,
      CapacitySchedulerConfiguration configuration) {
    boolean systemWidePreemption =
        csContext.getConfiguration()
            .getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS,
                YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS);
    CSQueue parentQ = q.getParent();

    // If the system-wide preemption switch is turned off, all of the queues in
    // the qPath hierarchy have preemption disabled, so return true.
    if (!systemWidePreemption) return true;
{noformat}
However, you already added a policy in YARN-10623, so it looks like this 
property always has to be enabled in weight mode. But what if we convert an FS 
configuration which has preemption completely disabled?

I think the best thing we can do right now is that we disable preemption for 
"root", which will propagate to all other parent queues.

So I suggest the following approach:
 1. In percentage conversion mode, do not enable RM monitors by default, 
because it's not needed.
 2. In weight mode (which is the default now), we have to enable it. But if 
"yarn.scheduler.fair.preemption" is false, then 
"yarn.scheduler.capacity.root.disable_preemption" must be set to true, but only 
for "root". This can be done in {{FSQueueConverter}}.
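
A minimal sketch of the two steps above, assuming nothing beyond what they state. The class and method names are hypothetical, a single {{Map}} stands in for both output files (yarn-site.xml and capacity-scheduler.xml), and the monitor key is the literal that {{RM_SCHEDULER_ENABLE_MONITORS}} is assumed to resolve to. The existing handling for the case where FS preemption is enabled is not modelled here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RootPreemptionSketch {
  // Assumed literal value of YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS.
  static final String ENABLE_MONITORS =
      "yarn.resourcemanager.scheduler.monitor.enable";
  static final String ROOT_DISABLE_PREEMPTION =
      "yarn.scheduler.capacity.root.disable_preemption";

  /** Returns the properties that the two suggested steps would generate. */
  static Map<String, String> convert(boolean usePercentages,
      boolean fsPreemption) {
    Map<String, String> out = new LinkedHashMap<>();
    if (usePercentages) {
      // Step 1: percentage mode, the monitors are not enabled by default.
      return out;
    }
    // Step 2: weight mode always needs the monitors.
    out.put(ENABLE_MONITORS, "true");
    if (!fsPreemption) {
      // Preemption was off in FS, so switch it off for "root" only.
      out.put(ROOT_DISABLE_PREEMPTION, "true");
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(convert(true, false));
    System.out.println(convert(false, false));
    System.out.println(convert(false, true));
  }
}
```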

cc [~bteke] [~gandras] [~snemeth], not sure if this is a good approach, but I 
can't see anything better.

> fs2cs: should support auto created queue deletion.
> --
>
> Key: YARN-10674
> URL: https://issues.apache.org/jira/browse/YARN-10674
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: fs2cs
> Attachments: YARN-10674.001.patch, YARN-10674.002.patch
>
>
> In FS the auto deletion check interval is 10s.
> {code:java}
> @Override
> public void onCheck() {
>   queueMgr.removeEmptyDynamicQueues();
>   queueMgr.removePendingIncompatibleQueues();
> }
> while (running) {
>   try {
> synchronized (this) {
>   reloadListener.onCheck();
> }
> ...
> Thread.sleep(reloadIntervalMs);
> }
> /** Time to wait between checks of the allocation file */
> public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code}






[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298112#comment-17298112
 ] 

Peter Bacsko edited comment on YARN-10674 at 3/9/21, 3:23 PM:
--

[~zhuqi] I have the following comments:

1. This change seems to always enable "RM monitors":
{noformat}
// This should be always true to trigger dynamic queue auto deletion
// when expired.
yarnSiteConfig.setBoolean(
YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
{noformat}
But I don't think this is necessary. We need to enable it in two cases: 
preemption is enabled OR we're in weight mode. We don't have auto queue 
deletion in percentage mode (fs2cs can still convert to percentages with a 
command line switch).
So I suggest that you pass an extra boolean "usePercentages".

Invocation from {{FSConfigToCSConfigConverter}}:
{noformat}
siteConverter.convertSiteProperties(inputYarnSiteConfig,
    convertedYarnSiteConfig, drfUsed,
    conversionOptions.isEnableAsyncScheduler(),
    usePercentages);  // <-- last argument is new
{noformat}
Then in the site converter:
{noformat}
if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION,
    FairSchedulerConfiguration.DEFAULT_PREEMPTION)) {
  yarnSiteConfig.setBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);
  preemptionEnabled = true;
  ...
}

if (!usePercentages) {
  // setting it again is OK
  yarnSiteConfig.setBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);

  String policies =
      yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES);
  if (policies == null) {
    policies = AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
  } else {
    policies += "," + AutoCreatedQueueDeletionPolicy.class.getCanonicalName();
  }

  yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
      policies);

  // Set the expired for deletion interval to 10s, consistent with fs.
  yarnSiteConfig.setInt(CapacitySchedulerConfiguration.
      AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10);
}
{noformat}
If I think about it, {{yarnSiteConfig}} is the output config. So this cannot 
happen:
{noformat}
} else {
  policies += "," + AutoCreatedQueueDeletionPolicy.
  class.getCanonicalName();
}
{noformat}
This {{Configuration}} object is created with no entries. The {{else}} branch 
will never be taken.

So it can be simplified to:
{noformat}
if (!usePercentages) {
  yarnSiteConfig.setBoolean(
      YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true);

  String policy = AutoCreatedQueueDeletionPolicy.class.getCanonicalName();

  yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES,
      policy);

  // Set the expired for deletion interval to 10s, consistent with fs.
  yarnSiteConfig.setInt(CapacitySchedulerConfiguration.
      AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10);
}
{noformat}
2. This also means two separate test cases:
 * When usePercentages = false, then {{RM_SCHEDULER_ENABLE_MONITORS}} and 
{{RM_SCHEDULER_MONITOR_POLICIES}} should be set (with preemption = false)
 * When usePercentages = true, then {{RM_SCHEDULER_ENABLE_MONITORS}} and 
{{RM_SCHEDULER_MONITOR_POLICIES}} should NOT be set (with preemption = false)

I recommend the following naming:
 {{testRmMonitorsAndPoliciesSetWhenUsingWeights()}} - first scenario
 {{testRmMonitorsAndPoliciesSetWhenUsingPercentages()}} - second scenario
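
The two scenarios could look roughly like this. It is a dependency-free sketch, not the real test class: {{MonitorPolicySketch}}, {{convert}} and {{check}} are hypothetical, a {{Map}} replaces the output {{Configuration}}, the keys are the literals the {{YarnConfiguration}} constants are assumed to resolve to, the policy value is a placeholder for the canonical class name, and FS preemption is taken to be {{false}} in both cases. The real tests would presumably use JUnit assertions against the site converter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MonitorPolicySketch {
  // Assumed literal values of the YarnConfiguration constants.
  static final String ENABLE_MONITORS =
      "yarn.resourcemanager.scheduler.monitor.enable";
  static final String MONITOR_POLICIES =
      "yarn.resourcemanager.scheduler.monitor.policies";
  // Placeholder for AutoCreatedQueueDeletionPolicy.class.getCanonicalName().
  static final String DELETION_POLICY = "AutoCreatedQueueDeletionPolicy";

  /** Simplified site conversion with FS preemption disabled. */
  static Map<String, String> convert(boolean usePercentages) {
    Map<String, String> out = new LinkedHashMap<>();
    if (!usePercentages) {
      out.put(ENABLE_MONITORS, "true");
      out.put(MONITOR_POLICIES, DELETION_POLICY);
    }
    return out;
  }

  /** First scenario: weight mode sets both properties. */
  static void testRmMonitorsAndPoliciesSetWhenUsingWeights() {
    Map<String, String> out = convert(false);
    check("true".equals(out.get(ENABLE_MONITORS)), "monitors not enabled");
    check(DELETION_POLICY.equals(out.get(MONITOR_POLICIES)), "policy not set");
  }

  /** Second scenario: percentage mode sets neither property. */
  static void testRmMonitorsAndPoliciesSetWhenUsingPercentages() {
    Map<String, String> out = convert(true);
    check(!out.containsKey(ENABLE_MONITORS), "monitors must not be set");
    check(!out.containsKey(MONITOR_POLICIES), "policies must not be set");
  }

  static void check(boolean condition, String message) {
    if (!condition) {
      throw new AssertionError(message);
    }
  }

  public static void main(String[] args) {
    testRmMonitorsAndPoliciesSetWhenUsingWeights();
    testRmMonitorsAndPoliciesSetWhenUsingPercentages();
    System.out.println("both scenarios passed");
  }
}
```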





[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298092#comment-17298092
 ] 

Peter Bacsko commented on YARN-10674:
-

Ok thanks, I'll review this one soon.




[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298074#comment-17298074
 ] 

Peter Bacsko commented on YARN-10674:
-

[~zhuqi] am I right when I think that this patch depends on YARN-10682? Because 
this change generates a config entry with "," and it's not supported now.




[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298069#comment-17298069
 ] 

Peter Bacsko commented on YARN-9615:


+1

I had to commit twice because there are actually two authors.

Thanks for the patch [~jhung] / [~zhuqi] and [~bibinchundatt] for the review.

Committed to trunk.

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, 
> YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, 
> YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.011.patch, 
> YARN-9615.011.patch, YARN-9615.poc.patch, image-2021-03-04-10-35-10-626.png, 
> image-2021-03-04-10-36-12-441.png, screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.






[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298068#comment-17298068
 ] 

Peter Bacsko commented on YARN-9615:


Thanks [~zhuqi] patch v11 looks good, committing it soon.




[jira] [Commented] (YARN-10679) Better logging of uncaught exceptions throughout SLS

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298064#comment-17298064
 ] 

Peter Bacsko commented on YARN-10679:
-

+1 thanks [~snemeth] for the patch and [~shuzirra] for the review.

Committed to trunk.

> Better logging of uncaught exceptions throughout SLS
> 
>
> Key: YARN-10679
> URL: https://issues.apache.org/jira/browse/YARN-10679
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10679.001.patch
>
>
> In our internal environment, there was a test failure while running SLS tests 
> with Jenkins.
> It's difficult to align the uncaught exceptions (in this case an NPE) and the 
> log itself as the exception is logged with {{e.printStackTrace()}}.
> This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", 
> exception)}}.
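
To illustrate the replacement described in the issue: the sketch below routes a caught exception through a logger instead of {{e.printStackTrace()}}, so the stack trace ends up in the same stream, with the same timestamps and ordering, as the other log lines. It is an assumption-laden stand-in: {{java.util.logging}} is used only to keep the example self-contained (SLS itself logs through {{LOG.error("msg", exception)}}), and {{UncaughtLoggingSketch}} and {{runTask}} are hypothetical names.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class UncaughtLoggingSketch {
  private static final Logger LOG =
      Logger.getLogger(UncaughtLoggingSketch.class.getName());

  /** Runs a task and logs any runtime exception together with its cause. */
  static void runTask(Runnable task) {
    try {
      task.run();
    } catch (RuntimeException e) {
      // Before: e.printStackTrace(); which writes to stderr, outside the log.
      // After: the exception is attached to a regular log record.
      LOG.log(Level.SEVERE, "Exception while running task", e);
    }
  }

  public static void main(String[] args) {
    runTask(() -> {
      throw new IllegalStateException("simulated failure");
    });
  }
}
```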






[jira] [Commented] (YARN-10679) Better logging of uncaught exceptions throughout SLS

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298062#comment-17298062
 ] 

Peter Bacsko commented on YARN-10679:
-

Ok, this time the failed test is different, most likely a flaky one. Let's 
investigate it later.




[jira] [Commented] (YARN-10679) Better logging of uncaught exceptions throughout SLS

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298057#comment-17298057
 ] 

Peter Bacsko commented on YARN-10679:
-

Re-triggered build to see what's going on with TestSLSRunner.




[jira] [Commented] (YARN-10681) Fix assertion failure message in BaseSLSRunnerTest

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298041#comment-17298041
 ] 

Peter Bacsko commented on YARN-10681:
-

+1 thanks [~snemeth] and [~shuzirra] for the patch and review, committed to 
trunk.

> Fix assertion failure message in BaseSLSRunnerTest
> --
>
> Key: YARN-10681
> URL: https://issues.apache.org/jira/browse/YARN-10681
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Trivial
> Attachments: YARN-10681.001.patch
>
>
> There is this failure message: 
> https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/BaseSLSRunnerTest.java#L129-L130
> The word "catched" should be replaced with "caught".






[jira] [Commented] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298031#comment-17298031
 ] 

Peter Bacsko commented on YARN-10677:
-

+1 LGTM.

Thanks [~snemeth] for the patch and [~zhuqi] for the review. Committed to 
trunk. (Jenkins is running but I don't expect any issues).

> Logger of SLSFairScheduler is provided with the wrong class
> ---
>
> Key: YARN-10677
> URL: https://issues.apache.org/jira/browse/YARN-10677
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10677.001.patch, YARN-10677.002.patch, 
> YARN-10677.003.patch, YARN-10677.004.patch
>
>
> In SLSFairScheduler, the Logger definition looks like: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69
> We need to fix this.






[jira] [Commented] (YARN-10678) Try blocks without catch blocks in SLS scheduler classes can swallow other exceptions

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298010#comment-17298010
 ] 

Peter Bacsko commented on YARN-10678:
-

+1 thanks [~snemeth] for the patch and [~shuzirra] for the review.

Committed to trunk.

> Try blocks without catch blocks in SLS scheduler classes can swallow other 
> exceptions
> -
>
> Key: YARN-10678
> URL: https://issues.apache.org/jira/browse/YARN-10678
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10678-unchecked-exception-from-FS-allocate.diff, 
> YARN-10678-unchecked-exception-from-FS-allocate_fixed.diff, 
> YARN-10678.001.patch, 
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_modified.log,
>  
> org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_original.log
>
>
> In SLSFairScheduler, we have this try-finally block (without catch block) in 
> the allocate method: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L109-L123
> Similarly, in SLSCapacityScheduler: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java#L116-L131
> In the finally block, the updateQueueWithAllocateRequest is invoked: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L118
> In our internal environment, there was a situation when an NPE was logged 
> from this method: 
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.updateQueueWithAllocateRequest(SLSFairScheduler.java:262)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.allocate(SLSFairScheduler.java:118)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:288)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:352)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:349)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.sendContainerRequest(MRAMSimulator.java:348)
>   at 
> org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212)
>   at 
> org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:94)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> This can happen if the following events occur:
> 1. A runtime exception is thrown in FairScheduler or CapacityScheduler's 
> allocate method 
> 2. In this case, the local variable called 'allocation' remains null: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L110
> 3. In updateQueueWithAllocateRequest, this null object will be dereferenced 
> here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L262
> 4. Then, we have an NPE here: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L117-L122
> In this case, we lose the original exception thrown from 
> FairScheduler#allocate.
> To fix this, a catch block should be introduced and the exception 
> should be logged.
> The whole thing applies to SLSCapacityScheduler as well.
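
A minimal sketch of the fix described above, under stated assumptions: the class AllocateSketch, the String result type and the queueUpdates counter are hypothetical stand-ins, not the real SLSFairScheduler code. It only illustrates catching and logging the original runtime exception, and skipping the follow-up bookkeeping when no allocation was produced.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the SLS scheduler wrappers; all names are illustrative.
class AllocateSketch {
  private static final Logger LOG =
      Logger.getLogger(AllocateSketch.class.getName());

  // Counts how often the follow-up bookkeeping ran
  // (a stand-in for updateQueueWithAllocateRequest).
  static int queueUpdates = 0;

  static String allocate(Supplier<String> schedulerAllocate) {
    String allocation = null;
    try {
      allocation = schedulerAllocate.get();
      return allocation;
    } catch (RuntimeException e) {
      // Log the original failure so a later NPE cannot hide it, then rethrow.
      LOG.log(Level.SEVERE, "Exception while calling allocate", e);
      throw e;
    } finally {
      // Only touch the result if the delegate actually produced one.
      if (allocation != null) {
        queueUpdates++;
      }
    }
  }
}
```

With this shape, a failing delegate surfaces its own exception to the caller and the log, and the bookkeeping step never dereferences a null result.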





[jira] [Commented] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298007#comment-17298007
 ] 

Peter Bacsko commented on YARN-10677:
-

[~snemeth] please fix the whitespace and checkstyle, thanks. 

> Logger of SLSFairScheduler is provided with the wrong class
> ---
>
> Key: YARN-10677
> URL: https://issues.apache.org/jira/browse/YARN-10677
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10677.001.patch, YARN-10677.002.patch, 
> YARN-10677.003.patch
>
>
> In SLSFairScheduler, the Logger is created with the wrong class. The 
> definition looks like: 
> https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69
> We need to fix this.






[jira] [Commented] (YARN-10675) Consolidate YARN-10672 and YARN-10447

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297998#comment-17297998
 ] 

Peter Bacsko commented on YARN-10675:
-

+1 LGTM.

Thanks [~snemeth] for the patch, committed to trunk.

> Consolidate YARN-10672 and YARN-10447
> -
>
> Key: YARN-10675
> URL: https://issues.apache.org/jira/browse/YARN-10675
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Attachments: YARN-10675.001.patch
>
>
> Let's consolidate the solution applied for YARN-10672 and apply it to the 
> code changes introduced with YARN-10447.
> Quoting [~pbacsko]: 
> {quote}
> The solution is much more straightforward than mine in YARN-10447. Actually 
> we might consider applying this to TestLeafQueue and undoing my changes, 
> because mine are more complicated (I had no patience to go deeper into 
> Mockito's internal behavior, I just thought: well, disable that thread and 
> that's enough).
> {quote}






[jira] [Commented] (YARN-10676) Improve code quality in TestTimelineAuthenticationFilterForV1

2021-03-09 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297996#comment-17297996
 ] 

Peter Bacsko commented on YARN-10676:
-

+1 thanks [~snemeth] for the patch and [~bteke] / [~zhuqi] / [~shuzirra] for 
the review.

Committed to trunk.

> Improve code quality in TestTimelineAuthenticationFilterForV1
> -
>
> Key: YARN-10676
> URL: https://issues.apache.org/jira/browse/YARN-10676
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-10676.001.patch
>
>
> - In testcase "testDelegationTokenOperations", the exception message is 
> checked but in case it does not match the assertion, the exception is not 
> printed. This happens 3 times.
> - Assertion messages can be added
> - Fields called "httpSpnegoKeytabFile" and "httpSpnegoPrincipal" can be 
> static final.
> - There's a typo in comment "avaiable" (happens 2 times)
> - There are some Assert.fail() calls, without messages.






[jira] [Comment Edited] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297642#comment-17297642
 ] 

Peter Bacsko edited comment on YARN-10178 at 3/8/21, 6:54 PM:
--

[~zhuqi] this is a tricky patch, I have to understand what's going on. We might 
ask [~wangda] again to look at it, because I'm not that familiar with the code 
that has been modified.

Having said that, I have some recommendations:
1. {{private final static Random RANDOM = new Random(System.currentTimeMillis());}}
Is there a reason why this is static? {{RANDOM}} is only used in the test.
Another problem: if the test fails, we don't see the random seed that was used 
for initialization, so the failure is not reproducible.
I suggest rewriting the test like:
{noformat}
long seed = System.nanoTime();  // I think nanoTime is better

try {
  .. test code ..
} catch (AssertionFailedError e) {
   LOG.error("Test failed, seed = {}", seed);
   LOG.error(e);
   throw e;
}
{noformat}

So at least we can check the logs for the seed number. Or maybe rethrow the 
exception with a modified message, that's also a solution, or wrap it in a 
different exception with a new message which contains the seed. The point is, 
it should be visible.

2. This sanity test only works if JVM is started with "-ea":
{noformat}
// sanity check
assert queueNames != null && priorities != null && utilizations != null
    && queueNames.length > 0 && queueNames.length == priorities.length
    && priorities.length == utilizations.length;
{noformat}
I think this should be converted to a normal JUnit assertion or just removed.
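
A minimal sketch of the "wrap it in a different exception with a new message which contains the seed" variant from point 1. The class name SeededTestSketch and the helper runWithSeed are hypothetical, and java.lang.AssertionError is assumed as the failure type (what JUnit 4's Assert throws) rather than AssertionFailedError.

```java
import java.util.Random;
import java.util.function.Consumer;

// Hypothetical helper; not part of the patch under review.
class SeededTestSketch {
  // Runs the test body with a Random built from the given seed. If the body
  // fails with an AssertionError, the failure is rethrown with the seed in
  // its message so the run can be reproduced.
  static void runWithSeed(long seed, Consumer<Random> testBody) {
    Random random = new Random(seed);
    try {
      testBody.accept(random);
    } catch (AssertionError e) {
      throw new AssertionError("Test failed, seed = " + seed, e);
    }
  }
}
```

A test would call it as runWithSeed(System.nanoTime(), random -> { ... }); to reproduce a failure, the logged seed is passed in place of System.nanoTime().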


was (Author: pbacsko):
[~zhuqi] this is a tricky patch, I have to understand what's going on. We might 
ask [~wangda] again to look at it, because I'm not that familiar with the code 
that has been modified.

Having said that, I have some recommendations:
1. {{private final static Random RANDOM = new Random(System.currentTimeMillis());}}
Is there a reason why this is static? {{RANDOM}} is only used in the test.
Another problem: if the test fails, we don't see the random seed that was used 
for initialization, so the failure is not reproducible.
I suggest rewriting the test like:
{noformat}
long seed = System.nanoTime();  // I think nanoTime is better

try {
  .. test code ..
} catch (AssertionFailedError e) {
   LOG.error("Test failed, seed = {}", seed, e);
   throw e;
}
{noformat}

So at least we can check the logs for the seed number. Or maybe rethrow the 
exception with a modified message, that's also a solution, or wrap it in a 
different exception with a new message which contains the seed. The point is, 
it should be visible.

2. This sanity test only works if JVM is started with "-ea":
{noformat}
// sanity check
assert queueNames != null && priorities != null && utilizations != null
    && queueNames.length > 0 && queueNames.length == priorities.length
    && priorities.length == utilizations.length;
{noformat}
I think this should be converted to a normal JUnit assertion or just removed.

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> 

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297642#comment-17297642
 ] 

Peter Bacsko commented on YARN-10178:
-

[~zhuqi] this is a tricky patch, I have to understand what's going on. We might 
ask [~wangda] again to look at it, because I'm not that familiar with the code 
that has been modified.

Having said that, I have some recommendations:
1. {{private final static Random RANDOM = new Random(System.currentTimeMillis());}}
Is there a reason why this is static? {{RANDOM}} is only used in the test.
Another problem: if the test fails, we don't see the random seed that was used 
for initialization, so the failure is not reproducible.
I suggest rewriting the test like:
{noformat}
long seed = System.nanoTime();  // I think nanoTime is better

try {
  .. test code ..
} catch (AssertionFailedError e) {
   LOG.error("Test failed, seed = {}", seed, e);
   throw e;
}
{noformat}

So at least we can check the logs for the seed number. Or maybe rethrow the 
exception with a modified message, that's also a solution, or wrap it in a 
different exception with a new message which contains the seed. The point is, 
it should be visible.

2. This sanity test only works if JVM is started with "-ea":
{noformat}
// sanity check
assert queueNames != null && priorities != null && utilizations != null
    && queueNames.length > 0 && queueNames.length == priorities.length
    && priorities.length == utilizations.length;
{noformat}
I think this should be converted to a normal JUnit assertion or just removed.
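
A minimal sketch of point 2, assuming the element types of the three arrays (String, int, float), which the snippet above does not show. The class SanityCheckSketch is hypothetical. It replaces the {{assert}} statement with an explicit check that runs whether or not the JVM was started with "-ea"; in a JUnit test, the same conditions could go into Assert.assertTrue with a message.

```java
// Hypothetical helper; array element types are assumptions.
class SanityCheckSketch {
  // Throws IllegalArgumentException when the inputs are null, empty, or of
  // different lengths. Unlike the assert keyword, this always executes.
  static void checkInputs(String[] queueNames, int[] priorities,
      float[] utilizations) {
    if (queueNames == null || priorities == null || utilizations == null) {
      throw new IllegalArgumentException("Input arrays must not be null");
    }
    if (queueNames.length == 0
        || queueNames.length != priorities.length
        || priorities.length != utilizations.length) {
      throw new IllegalArgumentException(
          "Input arrays must be non-empty and of equal length, got "
              + queueNames.length + ", " + priorities.length + ", "
              + utilizations.length);
    }
  }
}
```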

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>at 
> java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> In Java 8, Arrays.sort uses the TimSort algorithm by default, and TimSort 
> has a few requirements:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y and y > z --> x > z
> 3. x.compareTo(y) == 0 --> sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> if not Arrays paramters not satify this 

[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297504#comment-17297504
 ] 

Peter Bacsko commented on YARN-9615:


Let's wait for the Jenkins results of patch v10.

> Add dispatcher metrics to RM
> 
>
> Key: YARN-9615
> URL: https://issues.apache.org/jira/browse/YARN-9615
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Jonathan Hung
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-9615.001.patch, YARN-9615.002.patch, 
> YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, 
> YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, 
> YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.poc.patch, 
> image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, 
> screenshot-1.png
>
>
> It'd be good to have counts/processing times for each event type in RM async 
> dispatcher and scheduler async dispatcher.






[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10672:

Fix Version/s: 3.2.3
   3.3.1

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch, 
> YARN-10672.branch-3.2.001.patch, YARN-10672.branch-3.3.001.patch
>
>
> All testcases in TestReservations are flaky
> When a particular test in TestReservations is run 100 times, it never 
> passes every time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this return any value other than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> 

[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297358#comment-17297358
 ] 

Peter Bacsko commented on YARN-10672:
-

+1 overall. Committed changes to branch-3.2 too.

Thanks [~snemeth] for the contribution.

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch, 
> YARN-10672.branch-3.2.001.patch, YARN-10672.branch-3.3.001.patch
>
>
> All testcases in TestReservations are flaky
> When a particular test in TestReservations is run 100 times, it never 
> passes every time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this return any value other than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> 

[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297345#comment-17297345
 ] 

Peter Bacsko commented on YARN-10672:
-

Ok, test failures seem to be totally unrelated. The change only concerns 
"TestReservations" and modifies the order of stubbing.

> All testcases in TestReservations are flaky
> ---
>
> Key: YARN-10672
> URL: https://issues.apache.org/jira/browse/YARN-10672
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 
> 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 
> 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, 
> YARN-10672-debuglogs.patch, YARN-10672.001.patch, 
> YARN-10672.branch-3.2.001.patch, YARN-10672.branch-3.3.001.patch
>
>
> All testcases in TestReservations are flaky
> When a particular test in TestReservations is run 100 times, it never 
> passes every time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it 
> produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs fails.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> 
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
> a.assignContainers(clusterResource, node_0,
> new ResourceLimits(clusterResource),
> SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no 
> registered nodes so the AM can't be allocated and test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator 
> (RegularContainerAllocator.java:canAssign(312)) - **Can't assign 
> container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler 
> (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real 
> getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration,
>  boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances should this return any value other than 3.
>  However, as mentioned above, sometimes the real method of 
> 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub 
> spies - 
>- with doReturn|Throw() family of methods. More in javadocs for 
> Mockito.spy() method.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
>   at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> 

[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Bacsko updated YARN-10642:

Fix Version/s: 3.2.3

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0, 3.3.1, 3.2.3
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>
> In our cluster, the ResourceManager got stuck twice within twenty days, and 
> the YARN client could not submit applications. I captured jstack output the 
> second time and found the reason.
> Analyzing the jstack output, I found many threads stuck because they could 
> not acquire LinkedBlockingQueue.putLock. (Note: the analysis is omitted due 
> to limited space.)
> The reason is that one thread holds the putLock all the time: 
> printEventQueueDetails calls forEachRemaining, which then holds putLock and 
> takeLock, so the AsyncDispatcher gets stuck.
> {code}
> Thread 6526 (IPC Server handler 454 on default port 8030):
>   State: RUNNABLE
>   Blocked count: 29988
>   Waited count: 2035029
>   Stack:
> java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926)
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270)
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295)
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408)
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215)
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432)
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99)
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040)
> org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958)
> java.security.AccessController.doPrivileged(Native Method)
> {code}
> I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in
> LinkedBlockingQueue.LBQSpliterator may get stuck when forEachRemaining and take
> are called in different threads.
> YARN-8995 introduced the printEventQueueDetails method; its
> "eventQueue.stream().collect" call invokes forEachRemaining.
> Why does this happen? "put.png" shows how put("a") works and "take.png" shows
> how take() works. Special note: a removed Node points to itself to help GC!
> The key code is in forEachRemaining, which LBQSpliterator uses to visit every
> Node. After reading the item value from a Node, it releases the lock, and
> take() may be called at that moment.
> The variable 'p' in forEachRemaining may then point to a Node that points to
> itself, and forEachRemaining loops forever. See "deadloop.png".
> A simple unit test, MockForDeadLoop.java, reproduces the problem by making
> forEachRemaining run more slowly than take.
> Debugging MockForDeadLoop.java, I saw a Node pointing to itself. See
> "debugfornode.png".
> Environment:
>   OS: CentOS Linux release 7.5.1804 (Core) 
>   JDK: jdk1.8.0_281
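
The hang described in the report comes from traversing the live queue through its spliterator. One way to avoid that traversal, sketched here on the assumption that a point-in-time copy of the queue is good enough for logging, is to snapshot the queue first and stream over the copy. `new ArrayList<>(queue)` copies via `toArray()`, which LinkedBlockingQueue performs while holding both of its internal locks, so the copy never follows a self-linked Node. The class name `QueueSnapshotSketch`, the method `countByKey`, and the `keyOf` parameter are hypothetical names for illustration, and this is not necessarily what the attached patches do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not taken from the YARN-10642 patches.
final class QueueSnapshotSketch {

  private QueueSnapshotSketch() {
  }

  /**
   * Counts queued elements per key without streaming over the live queue.
   */
  static <E> Map<String, Long> countByKey(
      BlockingQueue<E> queue, Function<? super E, String> keyOf) {
    // ArrayList's copy constructor calls queue.toArray(); for
    // LinkedBlockingQueue that copy is made while both queue locks are held.
    List<E> snapshot = new ArrayList<>(queue);

    // The stream runs over the private copy, so concurrent take() calls
    // on the queue cannot affect this traversal.
    return snapshot.stream()
        .collect(Collectors.groupingBy(keyOf, Collectors.counting()));
  }
}
```

A caller would pass the event queue plus a function that maps each event to the label it should be counted under. The trade-off is that the copy costs time proportional to the queue length and briefly blocks producers and consumers while it is taken.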




[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297343#comment-17297343
 ] 

Peter Bacsko commented on YARN-10642:
-

Ok, pushed to branch-3.2 as well.

Thanks for the patch [~zhengchenyu] and [~bteke] / [~zhuqi] for the review. 

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0, 3.3.1
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>

[jira] [Comment Edited] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995

2021-03-08 Thread Peter Bacsko (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297343#comment-17297343
 ] 

Peter Bacsko edited comment on YARN-10642 at 3/8/21, 1:24 PM:
--

+1
Ok, pushed to branch-3.2 as well.

Thanks for the patch [~zhengchenyu] and [~bteke] / [~zhuqi] for the review. 


was (Author: pbacsko):
Ok, pushed to branch-3.2 as well.

Thanks for the patch [~zhengchenyu] and [~bteke] / [~zhuqi] for the review. 

> Race condition: AsyncDispatcher can get stuck by the changes introduced in 
> YARN-8995
> 
>
> Key: YARN-10642
> URL: https://issues.apache.org/jira/browse/YARN-10642
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 3.2.1
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Critical
> Fix For: 3.4.0, 3.3.1
>
> Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, 
> YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, 
> YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, 
> YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, 
> put.png, take.png
>
>
