[jira] [Commented] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced
[ https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321128#comment-17321128 ] Peter Bacsko commented on YARN-10654: - [~snemeth] [~shuzirra] do you guys have some time to review this? It's the equivalent of what FS does. > Dots '.' in CSMappingRule path variables should be replaced > --- > > Key: YARN-10654 > URL: https://issues.apache.org/jira/browse/YARN-10654 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10654-001.patch > > > Dots are used as separators, so we should escape them somehow in the > variables when substituting them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
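The FS-equivalent behavior the comment refers to (replacing dots in substituted variable values, since '.' is the queue-path separator) could be sketched roughly as follows. This is a minimal illustration, not the actual YARN-10654-001.patch: the class/method names and the "_dot_" replacement token are assumptions borrowed from Fair Scheduler's historical handling of dotted user names.

```java
// Hypothetical sketch: escape '.' in values substituted into a queue path,
// so a user name like "john.doe" cannot introduce extra path levels.
public class VariableSanitizer {

    // Assumption: reuse the "_dot_" token Fair Scheduler placement rules used.
    private static final String DOT_REPLACEMENT = "_dot_";

    public static String sanitize(String value) {
        if (value == null) {
            return null;
        }
        // Plain character replacement, no regex needed.
        return value.replace(".", DOT_REPLACEMENT);
    }

    public static void main(String[] args) {
        // "john.doe" substituted into root.users.%user stays one path element
        System.out.println(sanitize("john.doe")); // prints john_dot_doe
    }
}
```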
[jira] [Commented] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced
[ https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320954#comment-17320954 ] Peter Bacsko commented on YARN-10654: - Uploaded patch v1 which is probably the simplest approach to the '.' problem. > Dots '.' in CSMappingRule path variables should be replaced > --- > > Key: YARN-10654 > URL: https://issues.apache.org/jira/browse/YARN-10654 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10654-001.patch > > > Dots are used as separators, so we should escape them somehow in the > variables when substituting them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced
[ https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10654: Attachment: YARN-10654-001.patch > Dots '.' in CSMappingRule path variables should be replaced > --- > > Key: YARN-10654 > URL: https://issues.apache.org/jira/browse/YARN-10654 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-10654-001.patch > > > Dots are used as separators, so we should escape them somehow in the > variables when substituting them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-10654) Dots '.' in CSMappingRule path variables should be replaced
[ https://issues.apache.org/jira/browse/YARN-10654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko reassigned YARN-10654: --- Assignee: Peter Bacsko (was: Gergely Pollak) > Dots '.' in CSMappingRule path variables should be replaced > --- > > Key: YARN-10654 > URL: https://issues.apache.org/jira/browse/YARN-10654 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Peter Bacsko >Priority: Major > > Dots are used as separators, so we should escape them somehow in the > variables when substituting them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317080#comment-17317080 ] Peter Bacsko commented on YARN-10564: - +1 Committed to trunk. Thanks [~gandras] for the patch and [~zhuqi] for the review. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.006.patch, YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316309#comment-17316309 ] Peter Bacsko commented on YARN-10564: - Thanks [~gandras] I have the following suggestions: please add comments to the "for" loop which explain this. I don't want to dictate the wording; it can be multiple sentences. I think it's important. Maybe also note that "supportedWildcardLevel" or MAX_WILDCARD_LEVEL might change in the future (just like me, people might realize that the range is [0-1] and it might confuse them). Also, an overall comment like "collect all template settings based on prefix, then finally apply the collected settings to the newly created queue" might be useful. I'd put it somewhere before the "while" loop, but this is just an idea. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting.
[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316277#comment-17316277 ] Peter Bacsko edited comment on YARN-10564 at 4/7/21, 12:16 PM: --- Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which modifies "queuePathParts". First we try to find the templates for the parent explicitly, then we step back a wildcard at each iteration. By changing "queuePathParts", the prefix changes so eventually we might find a parent which contains templates. Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected values for the original queue. Is this correct? was (Author: pbacsko): Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which modifies "queuePathParts". First we try to find the templates for the parent explicitly, then we step back each wildcard at a time. By changing "queuePathParts", the prefix changes so eventually we might find a parent which contains templates. Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected values for the original queue. Is this correct? > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. 
> We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316277#comment-17316277 ] Peter Bacsko commented on YARN-10564: - Thanks [~gandras], I think I get it. I guess the trick is the "for" loop which modifies "queuePathParts". First we try to find the templates for the parent explicitly, then we step back each wildcard at a time. By changing "queuePathParts", the prefix changes so eventually we might find a parent which contains templates. Finally, we call {{setConfigFromTemplateEntries()}} where we set the collected values for the original queue. Is this correct? > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213 ] Peter Bacsko edited comment on YARN-10564 at 4/7/21, 11:51 AM: --- [~gandras] thanks for the patch. From a coding POV it looks ok; this is more of a high-level review. There are some things I just can't figure out (maybe I'm in bad shape today). 1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} gets created. How do the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't decipher this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3. There is a variable called "supportedWildcardLevel". What does "supported" mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? I don't understand what it is meant to represent. was (Author: pbacsko): [~gandras] thanks for the patch. From coding POV it looks ok, this is more like a high level review. There's are some things I just can't figure out (maybe I'm in a bad shape today). 1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} get created. How does the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't deciper this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3.
There is a variable called "supportedWildcardLevel". What is "supported" means in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? Mentally I don't understand what it is meant to represent. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213 ] Peter Bacsko edited comment on YARN-10564 at 4/7/21, 10:49 AM: --- [~gandras] thanks for the patch. From a coding POV it looks ok; this is more of a high-level review. There are some things I just can't figure out (maybe I'm in bad shape today). 1. Let's say you set the capacity 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} gets created. How do the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't decipher this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3. There is a variable called "supportedWildcardLevel". What does "supported" mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}}. It seems to me that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? I don't really understand what it is meant to represent. was (Author: pbacsko): [~gandras] thanks for the patch. From coding POV it looks ok, this is more like a high level review. There's are some things I just can't figure out (maybe I'm in a bad shape today). 1. Let's say you set 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} get created. How does the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't deciper this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3.
There is a variable called "supportedWildcardLevel". What is "supported" means in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}} which seems to be that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? Mentally I don't understand what it is meant to represent. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic. Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10564) Support Auto Queue Creation template configurations
[ https://issues.apache.org/jira/browse/YARN-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316213#comment-17316213 ] Peter Bacsko commented on YARN-10564: - [~gandras] thanks for the patch. From a coding POV it looks ok; this is more of a high-level review. There are some things I just can't figure out (maybe I'm in bad shape today). 1. Let's say you set 6w for {{root.a.*}}. Then a dynamic queue {{root.a.newparent.newchild}} gets created. How do the weight settings propagate to "newparent" and "newchild"? I kept looking at the code, but it's just not obvious. I can see that "root.a" will have an entry in {{templateEntries}}, but then what? 2. I can't decipher this part: {noformat} for (int i = 0; i <= wildcardLevel; ++i) { queuePathParts.set(queuePathParts.size() - 1 - i, WILDCARD_QUEUE); } {noformat} What's happening here? 3. There is a variable called "supportedWildcardLevel". What does "supported" mean in this context? Later on we set it to {{Math.min(queueHierarchyParts - 1, MAX_WILDCARD_LEVEL);}} which suggests that it is either 0 or 1, because {{MAX_WILDCARD_LEVEL}} is 1. I assume most of the time it's going to be 1? I don't really understand what it is meant to represent. > Support Auto Queue Creation template configurations > --- > > Key: YARN-10564 > URL: https://issues.apache.org/jira/browse/YARN-10564 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Major > Attachments: YARN-10564.001.patch, YARN-10564.002.patch, > YARN-10564.003.patch, YARN-10564.004.patch, YARN-10564.005.patch, > YARN-10564.poc.001.patch > > > Similar to how the template configuration works for ManagedParents, we need > to support templates for the new auto queue creation logic.
Proposition is to > allow wildcards in template configs such as: > {noformat} > yarn.scheduler.capacity.root.*.*.weight 10{noformat} > which would mean, that set weight to 10 of every leaf of every parent under > root. > We should possibly take an approach, that could support arbitrary depth of > template configuration, because we might need to lift the limitation of auto > queue nesting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
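For what it's worth, the loop quoted in the question above can be read as "replace the last wildcardLevel + 1 components of the queue path with '*'", turning a concrete path into a wildcard template prefix. A standalone sketch (variable names are borrowed from the quoted snippet; the class and helper method are hypothetical, not the actual patch):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WildcardPrefixDemo {

    private static final String WILDCARD_QUEUE = "*";

    // Replace the trailing (wildcardLevel + 1) path parts with '*',
    // mirroring the for-loop quoted in the review comment.
    public static String toWildcardPrefix(List<String> queuePathParts, int wildcardLevel) {
        List<String> parts = new ArrayList<>(queuePathParts);
        for (int i = 0; i <= wildcardLevel; ++i) {
            parts.set(parts.size() - 1 - i, WILDCARD_QUEUE);
        }
        return String.join(".", parts);
    }

    public static void main(String[] args) {
        List<String> path = Arrays.asList("root", "a", "newparent");
        // wildcardLevel 0: only the leaf becomes a wildcard
        System.out.println(toWildcardPrefix(path, 0)); // prints root.a.*
        // wildcardLevel 1: leaf and its parent become wildcards
        System.out.println(toWildcardPrefix(path, 1)); // prints root.*.*
    }
}
```

So by mutating the trailing parts one level at a time, the lookup prefix walks from the most specific template ("root.a.*") toward the most generic one ("root.*.*"), which matches the "step back a wildcard at each iteration" reading in the later comments.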
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313241#comment-17313241 ] Peter Bacsko commented on YARN-10726: - Ok, I strongly believe that the failing tests are flaky. [~zhuqi] could you verify it by running them locally a couple of times? > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch, YARN-10726.002.patch > >
[jira] [Commented] (YARN-10693) Add document for YARN-10623 auto refresh queue conf in cs.
[ https://issues.apache.org/jira/browse/YARN-10693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313219#comment-17313219 ] Peter Bacsko commented on YARN-10693: - I'll review this as soon as I have some spare cycles. > Add document for YARN-10623 auto refresh queue conf in cs. > -- > > Key: YARN-10693 > URL: https://issues.apache.org/jira/browse/YARN-10693 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10693.001.patch, YARN-10693.002.patch, > YARN-10693.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10637) We should support fs to cs support for auto refresh queues when conf changed, after YARN-10623 finished.
[ https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313218#comment-17313218 ] Peter Bacsko commented on YARN-10637: - Thanks [~zhuqi] I think it's good then. [~gandras] do you have any comments? > We should support fs to cs support for auto refresh queues when conf changed, > after YARN-10623 finished. > > > Key: YARN-10637 > URL: https://issues.apache.org/jira/browse/YARN-10637 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10637.001.patch, YARN-10637.002.patch, > YARN-10637.003.patch, YARN-10637.004.patch > > > cc [~pbacsko] [~gandras] [~bteke] > We should also fill this, when YARN-10623 finished. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313192#comment-17313192 ] Peter Bacsko commented on YARN-10726: - Ah, I already committed the change. Let's hope Jenkins comes back green :) +1 > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch, YARN-10726.002.patch > >
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313189#comment-17313189 ] Peter Bacsko commented on YARN-10726: - "hadoop.yarn.server.resourcemanager.security.TestDelegationTokenRenewer" - this is unrelated I believe. This test case has been failing for a long time. > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch, YARN-10726.002.patch > >
[jira] [Commented] (YARN-10637) We should support fs to cs support for auto refresh queues when conf changed, after YARN-10623 finished.
[ https://issues.apache.org/jira/browse/YARN-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313184#comment-17313184 ] Peter Bacsko commented on YARN-10637: - Thanks [~zhuqi] this makes sense. Is this always enabled in Fair Scheduler? Because we should only add this policy if auto-refresh is enabled on the FS-side. > We should support fs to cs support for auto refresh queues when conf changed, > after YARN-10623 finished. > > > Key: YARN-10637 > URL: https://issues.apache.org/jira/browse/YARN-10637 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10637.001.patch, YARN-10637.002.patch, > YARN-10637.003.patch, YARN-10637.004.patch > > > cc [~pbacsko] [~gandras] [~bteke] > We should also fill this, when YARN-10623 finished. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313138#comment-17313138 ] Peter Bacsko commented on YARN-10726: - This is from {{AsyncDispatcher}}: {noformat} if (qSize != 0 && qSize % 1000 == 0 && lastEventQueueSizeLogged != qSize) { lastEventQueueSizeLogged = qSize; LOG.info("Size of event-queue is " + qSize); } {noformat} Please update the code to use the same {{lastEventQueueSizeLogged}} approach. > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch > >
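The same guard could be applied to the DelegationTokenRenewer queue. A minimal self-contained sketch of the AsyncDispatcher-style throttle, extracted into a testable shape (the class and method names are hypothetical, not from the actual patch):

```java
// Hypothetical sketch of the AsyncDispatcher-style guard: report the queue
// size only at multiples of 1000, and never twice in a row for the same size.
public class QueueSizeLogThrottle {

    private int lastEventQueueSizeLogged = 0;

    // Returns true when the size should be logged, mirroring the
    // AsyncDispatcher condition quoted in the comment above.
    public boolean shouldLog(int qSize) {
        if (qSize != 0 && qSize % 1000 == 0 && lastEventQueueSizeLogged != qSize) {
            lastEventQueueSizeLogged = qSize;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        QueueSizeLogThrottle t = new QueueSizeLogThrottle();
        System.out.println(t.shouldLog(999));  // false: not a multiple of 1000
        System.out.println(t.shouldLog(1000)); // true: first time at 1000
        System.out.println(t.shouldLog(1000)); // false: already logged at 1000
        System.out.println(t.shouldLog(2000)); // true: new threshold reached
    }
}
```

Note that remembering the last logged size also addresses the oscillation concern raised earlier in the thread: if the queue bounces between 990 and 1000, the second arrival at 1000 does not log again.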
[jira] [Comment Edited] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313123#comment-17313123 ] Peter Bacsko edited comment on YARN-10726 at 4/1/21, 12:01 PM: --- Thanks [~zhuqi]. I think it's a good idea. My only concern (which might not be valid) is that when there are too many events, this code can run too frequently. For example, if the size goes 998, 999, 1000, 1001, 1002, then it prints at 1000, then it starts to consume events, the size goes back from 1000 to 990, and when it reaches 1000 again it prints the size again. I think we should limit how often we print this message; we shouldn't log it too often. I'm not sure how we do this in other parts of the code. I'll check what the best solution could be. was (Author: pbacsko): Thanks [~zhuqi]. I think it's a good idea. My only "concern" is that we have too many events, this code can possibly run too frequently. For example, if you go 998, 998, 999, 1000, 1001, 1002, then it prints at 1000, then it starts to consume events, size goes back from 1000 to 990, then it prints the size again. I think we should limit how often we print this message. We should log it too often, I'm not sure how we do this in other parts of the code. I'll check what can be the best solution. > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch > >
[jira] [Commented] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313123#comment-17313123 ] Peter Bacsko commented on YARN-10726: - Thanks [~zhuqi]. I think it's a good idea. My only "concern" is that when there are too many events, this code can possibly run too frequently. For example, if the size goes 998, 999, 1000, 1001, 1002, then it prints at 1000, then it starts to consume events, the size goes back from 1000 to 990, and when it reaches 1000 again it prints the size again. I think we should limit how often we print this message; we shouldn't log it too often. I'm not sure how we do this in other parts of the code. I'll check what the best solution could be. > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch > >
[jira] [Updated] (YARN-10726) Log the size of DelegationTokenRenewer event queue in case of too many pending events
[ https://issues.apache.org/jira/browse/YARN-10726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10726: Summary: Log the size of DelegationTokenRenewer event queue in case of too many pending events (was: We should log size of pending DelegationTokenRenewerEvent queue, when pending too many events.) > Log the size of DelegationTokenRenewer event queue in case of too many > pending events > - > > Key: YARN-10726 > URL: https://issues.apache.org/jira/browse/YARN-10726 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10726.001.patch > >
[jira] [Commented] (YARN-9618) NodesListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313105#comment-17313105 ] Peter Bacsko commented on YARN-9618: Thanks for the patch [~zhuqi] and [~gandras] for the review, I committed this to trunk. > NodesListManager event improvement > -- > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Fix For: 3.4.0 > > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, > YARN-9618.006.patch, YARN-9618.007.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9618) NodesListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9618: --- Summary: NodesListManager event improvement (was: NodeListManager event improvement) > NodesListManager event improvement > -- > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, > YARN-9618.006.patch, YARN-9618.007.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312989#comment-17312989 ] Peter Bacsko commented on YARN-9618: +1 LGTM [~gandras] are you OK with the patch? > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, > YARN-9618.006.patch, YARN-9618.007.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312945#comment-17312945 ] Peter Bacsko commented on YARN-10720: - +1 thanks [~zhuqi] for the patch, committed to trunk. > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server from hanging > -- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10720: Summary: YARN WebAppProxyServlet should support connection timeout to prevent proxy server from hanging (was: YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.) > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server from hanging > -- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > YARN-10720.006.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9618) NodeListManager event improvement
[ https://issues.apache.org/jira/browse/YARN-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312516#comment-17312516 ] Peter Bacsko commented on YARN-9618: Small things: 1. {noformat} //Is trigger RMAppNodeUpdateEvent private Boolean isRMAppEvent = false; //Is trigger NodesListManagerEvent private Boolean isNodesListEvent = false; {noformat} a) No need for comments b) use ordinary "boolean" instead of "Boolean" (also, init to "false" is not necessary, it is "false" by default because it's dictated by the JVM spec). 2. {noformat} Assert.assertFalse(getIsRMAppEvent()); Assert.assertTrue(getIsNodesListEvent()); {noformat} Add some assertion message here, like {noformat} Assert.assertFalse("Got unexpected RM app event", getIsRMAppEvent()); Assert.assertTrue("Received no NodesListManagerEvent", getIsNodesListEvent()); {noformat} 3. Return values of {{getIsNodesListEvent()}} and {{getIsRMAppEvent()}} should be just "boolean". > NodeListManager event improvement > - > > Key: YARN-9618 > URL: https://issues.apache.org/jira/browse/YARN-9618 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Bibin Chundatt >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-9618.001.patch, YARN-9618.002.patch, > YARN-9618.003.patch, YARN-9618.004.patch, YARN-9618.005.patch, > YARN-9618.006.patch > > > Current implementation nodelistmanager event blocks async dispacher and can > cause RM crash and slowing down event processing. > # Cluster restart with 1K running apps . Each usable event will create 1K > events over all events could be 5k*1k events for 5K cluster > # Event processing is blocked till new events are added to queue. > Solution : > # Add another async Event handler similar to scheduler. > # Instead of adding events to dispatcher directly call RMApp event handler. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
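The three suggestions in the review above can be sketched together; the class name and the `onNodesListEvent()` trigger are hypothetical stand-ins, and JUnit's `Assert` calls are emulated with plain checks so the snippet compiles on its own:

```java
// Sketch of the review suggestions applied together: primitive boolean fields
// (JVM-initialized to false, no boxing), plain boolean getters, and
// descriptive assertion messages. Names are illustrative, not from the patch.
public class EventFlagsDemo {
    // Primitive booleans: no comments and no explicit "= false" needed —
    // the JVM spec guarantees the default value.
    private boolean isRMAppEvent;
    private boolean isNodesListEvent;

    boolean getIsRMAppEvent() { return isRMAppEvent; }
    boolean getIsNodesListEvent() { return isNodesListEvent; }

    void onNodesListEvent() { isNodesListEvent = true; }

    public static void main(String[] args) {
        EventFlagsDemo demo = new EventFlagsDemo();
        demo.onNodesListEvent();
        // With JUnit these would be Assert.assertFalse("...", ...) etc.;
        // the message makes a failure self-explanatory in the test report.
        if (demo.getIsRMAppEvent()) {
            throw new AssertionError("Got unexpected RM app event");
        }
        if (!demo.getIsNodesListEvent()) {
            throw new AssertionError("Received no NodesListManagerEvent");
        }
        System.out.println("flags behave as expected");
    }
}
```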
[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312492#comment-17312492 ] Peter Bacsko commented on YARN-10720: - {noformat} } catch (InterruptedException e) { LOG.warn("doGet() interrupted", e); resp.setStatus(HttpServletResponse.SC_BAD_REQUEST); } resp.setStatus(HttpServletResponse.SC_OK); } {noformat} This is not good - you set the response status to {{SC_BAD_REQUEST}} only to override it with {{SC_OK}}. You need a "return". {noformat} try { servlet.init(config); } catch (ServletException e) { LOG.error(e.getMessage()); fail("Failed to init servlet"); } try { servlet.doGet(request, response); } catch (ServletException e) { LOG.error(e.getMessage()); fail("ServletException thrown during doGet."); } } {noformat} You can remove try-catch here and just add {{throws ServletException}}. If that happens for whatever reason, it will be a test error (which is desired - checking if the servlet can init is not the purpose of the test), not a test failure. > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server hang. > --- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, YARN-10720.004.patch, YARN-10720.005.patch, > image-2021-03-29-14-04-33-776.png, image-2021-03-29-14-05-32-708.png > > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. 
And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
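The control-flow bug called out in the review above — `SC_BAD_REQUEST` being silently overwritten by `SC_OK` — can be shown with a minimal stand-in for the servlet response; `FakeResponse` and the `interrupted` flag are hypothetical simplifications of the real Servlet API:

```java
// Minimal stand-in for HttpServletResponse, just enough to show the fix.
class FakeResponse {
    static final int SC_BAD_REQUEST = 400;
    static final int SC_OK = 200;
    int status;
    void setStatus(int s) { status = s; }
}

public class DoGetSketch {
    static void doGet(FakeResponse resp, boolean interrupted) {
        if (interrupted) {
            // In the real servlet: LOG.warn("doGet() interrupted", e);
            resp.setStatus(FakeResponse.SC_BAD_REQUEST);
            return; // without this return, SC_OK below overwrites the error
        }
        resp.setStatus(FakeResponse.SC_OK);
    }

    public static void main(String[] args) {
        FakeResponse resp = new FakeResponse();
        doGet(resp, true);
        System.out.println(resp.status); // 400 — the error status survives
    }
}
```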
[jira] [Commented] (YARN-10720) YARN WebAppProxyServlet should support connection timeout to prevent proxy server hang.
[ https://issues.apache.org/jira/browse/YARN-10720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312253#comment-17312253 ] Peter Bacsko commented on YARN-10720: - Thanks [~zhuqi] for the patch. 1. As you said {{ExpectedException.none()}} has been deprecated. Either use the new {{assertThrows()}} or {{@Test(expected = SocketTimeoutException.class)}}, I think using the second is easier. 2. {noformat} conf.setInt(YarnConfiguration.RM_PROXY_CONNECTION_TIMEOUT, 1 * 1000); {noformat} Just write "1000" instead of "1 * 1000". 3. {noformat} try { when(response.getOutputStream()).thenReturn(null); } catch (IOException e) { e.printStackTrace(); } {noformat} Unnecessary try-catch block. The method already has a {{throws}} clause. 4. {noformat} @Override protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException { try { Thread.sleep(10 * 1000); } catch (InterruptedException e) { e.printStackTrace(); } resp.setStatus(HttpServletResponse.SC_OK); } {noformat} Maybe a minor thing, but if you catch {{InterruptedException}}, don't just print the stack trace, log it with {{LOG.warn("doGet() interrupted", e)}}. In this case, I'd also return with {{HttpServletResponse.SC_BAD_REQUEST}}. 5. {{The web proxy connection timeout, default is 60s(60 * 1000ms).}} This already goes to {{yarn-default.xml}}, so you can omit the part "default is 60s(60 * 1000ms)" and just write "The web proxy connection timeout". > YARN WebAppProxyServlet should support connection timeout to prevent proxy > server hang. 
> --- > > Key: YARN-10720 > URL: https://issues.apache.org/jira/browse/YARN-10720 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10720.001.patch, YARN-10720.002.patch, > YARN-10720.003.patch, image-2021-03-29-14-04-33-776.png, > image-2021-03-29-14-05-32-708.png > > > Following is proxy server show, {color:#de350b}too many connections from one > client{color}, this caused the proxy server hang, and the yarn web can't jump > to web proxy. > !image-2021-03-29-14-04-33-776.png|width=632,height=57! > Following is the AM which is abnormal, but proxy server don't know it is > abnormal already, so the connections can't be closed, we should add time out > support in proxy server to prevent this. And one abnormal AM may cause > hundreds even thousands of connections, it is very heavy. > !image-2021-03-29-14-05-32-708.png|width=669,height=101! > > After i kill the abnormal AM, the proxy server become healthy. This case > happened many times in our production clusters, our clusters are huge, and > the abnormal AM will be existed in a regular case. > > I will add timeout supported in web proxy server in this jira. > > cc [~pbacsko] [~ebadger] [~Jim_Brennan] [~ztang] [~epayne] [~gandras] > [~bteke] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
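The expected-exception pattern from point 1 of the review above can be sketched without JUnit on the classpath; `throwsTimeout` is a hypothetical stand-in for JUnit's `assertThrows`, and with JUnit 4 the same intent is expressed as `@Test(expected = SocketTimeoutException.class)`:

```java
import java.net.SocketTimeoutException;

public class ExpectTimeoutDemo {
    interface ThrowingRunnable { void run() throws Exception; }

    // Runs the action and reports whether it threw the expected timeout.
    // Stand-in for assertThrows(SocketTimeoutException.class, action).
    static boolean throwsTimeout(ThrowingRunnable action) {
        try {
            action.run();
            return false; // nothing thrown: the test should fail
        } catch (SocketTimeoutException expected) {
            return true;  // the expected exception type
        } catch (Exception other) {
            return false; // wrong exception type: also a failure
        }
    }

    public static void main(String[] args) {
        boolean ok = throwsTimeout(() -> {
            throw new SocketTimeoutException("read timed out");
        });
        System.out.println(ok); // true: the expected timeout was thrown
    }
}
```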
[jira] [Commented] (YARN-10718) Fix CapacityScheduler#initScheduler log error.
[ https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312203#comment-17312203 ] Peter Bacsko commented on YARN-10718: - Committed to trunk. Closing. > Fix CapacityScheduler#initScheduler log error. > --- > > Key: YARN-10718 > URL: https://issues.apache.org/jira/browse/YARN-10718 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png > > > !image-2021-03-28-00-03-28-244.png|width=972,height=52! > The Resource toString() method already with "<" and ">" string, it's wrong > to add it again. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10718) Fix CapacityScheduler#initScheduler log error.
[ https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10718: Labels: resourcemanager (was: ) > Fix CapacityScheduler#initScheduler log error. > --- > > Key: YARN-10718 > URL: https://issues.apache.org/jira/browse/YARN-10718 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: resourcemanager > Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png > > > !image-2021-03-28-00-03-28-244.png|width=972,height=52! > The Resource toString() method already with "<" and ">" string, it's wrong > to add it again. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10718) Fix CapacityScheduler#initScheduler log error.
[ https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10718: Labels: capacity-scheduler capacityscheduler (was: resourcemanager) > Fix CapacityScheduler#initScheduler log error. > --- > > Key: YARN-10718 > URL: https://issues.apache.org/jira/browse/YARN-10718 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png > > > !image-2021-03-28-00-03-28-244.png|width=972,height=52! > The Resource toString() method already with "<" and ">" string, it's wrong > to add it again. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10718) Fix CapacityScheduler#initScheduler log error.
[ https://issues.apache.org/jira/browse/YARN-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312195#comment-17312195 ] Peter Bacsko commented on YARN-10718: - Thanks [~zhuqi], +1 LGTM. Will commit this soon. > Fix CapacityScheduler#initScheduler log error. > --- > > Key: YARN-10718 > URL: https://issues.apache.org/jira/browse/YARN-10718 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10718.001.patch, image-2021-03-28-00-03-28-244.png > > > !image-2021-03-28-00-03-28-244.png|width=972,height=52! > The Resource toString() method already with "<" and ">" string, it's wrong > to add it again. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs should generate auto-created queue deletion properties
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307605#comment-17307605 ] Peter Bacsko commented on YARN-10674: - Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk. > fs2cs should generate auto-created queue deletion properties > > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, > YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs should generate auto-created queue deletion properties
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307602#comment-17307602 ] Peter Bacsko commented on YARN-10674: - +1 LGTM. I'm going to commit this soon. > fs2cs should generate auto-created queue deletion properties > > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, > YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10674) fs2cs should generate auto-created queue deletion properties
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10674: Summary: fs2cs should generate auto-created queue deletion properties (was: fs2cs: should support auto created queue deletion.) > fs2cs should generate auto-created queue deletion properties > > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, > YARN-10674.015.patch, YARN-10674.016.patch, YARN-10674.017.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306240#comment-17306240 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] I had a discussion with [~gandras], he will post an update soon. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch, > YARN-10674.015.patch, YARN-10674.016.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10645) Fix queue state related update for auto created queue.
[ https://issues.apache.org/jira/browse/YARN-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306203#comment-17306203 ] Peter Bacsko commented on YARN-10645: - [~zhuqi] [~gandras] is this patch still needed? Looking at Andras' comment, it is telling me that this ticket is a duplicate. Is it a dup? > Fix queue state related update for auto created queue. > -- > > Key: YARN-10645 > URL: https://issues.apache.org/jira/browse/YARN-10645 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10645.001.patch > > > Now the queue state in auto created queue can't be updated after refactor in > YARN-10504. > We should support fix the queue state related logic. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10503) Support queue capacity in terms of absolute resources with gpu resourceType.
[ https://issues.apache.org/jira/browse/YARN-10503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306157#comment-17306157 ] Peter Bacsko commented on YARN-10503: - The question is this part: {noformat} public enum AbsoluteResourceType { MEMORY, VCORES, GPUS, FPGAS } {noformat} Do we want to treat GPUs and FPGAs like that? In other parts of the code, we have mem/vcore as primary resources, then an array of other resources. For example, constructors from {{org.apache.hadoop.yarn.api.records.Resource}}: {noformat} @Public @Stable public static Resource newInstance(long memory, int vCores, Map others) { if (others != null) { return new LightWeightResource(memory, vCores, ResourceUtils.createResourceTypesArray(others)); } else { return newInstance(memory, vCores); } } @InterfaceAudience.Private @InterfaceStability.Unstable public static Resource newInstance(Resource resource) { Resource ret; int numberOfKnownResourceTypes = ResourceUtils .getNumberOfKnownResourceTypes(); if (numberOfKnownResourceTypes > 2) { ret = new LightWeightResource(resource.getMemorySize(), resource.getVirtualCores(), resource.getResources()); } else { ret = new LightWeightResource(resource.getMemorySize(), resource.getVirtualCores()); } return ret; } {noformat} But with this modification, we sort of promote GPU and FPGA to the level of vcore and memory, at least from the perspective of the code and it also becomes inconsistent with the existing code. This is just my opinion though. cc [~epayne] [~ebadger]. > Support queue capacity in terms of absolute resources with gpu resourceType. > > > Key: YARN-10503 > URL: https://issues.apache.org/jira/browse/YARN-10503 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10503.001.patch, YARN-10503.002.patch, > YARN-10503.003.patch > > > Now the absolute resources are memory and cores. > {code:java} > /** > * Different resource types supported. 
> */ > public enum AbsoluteResourceType { > MEMORY, VCORES; > }{code} > But in our GPU production clusters, we need to support more resourceTypes. > It's very import for cluster scaling when with different resourceType > absolute demands. > > This Jira will handle GPU first. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
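The alternative the comment above argues for — keeping memory and vcores as first-class fields while GPU/FPGA travel in a generic resource map, mirroring `Resource.newInstance(long, int, Map)` — might be sketched like this; the class and field names are illustrative, not YARN's actual types:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: mem/vcores are primary, everything else (GPU, FPGA)
// stays a generic named resource, so no enum promotion is needed.
public class ResourceSketch {
    final long memoryMb;
    final int vcores;
    final Map<String, Long> others; // e.g. "yarn.io/gpu" -> 4

    ResourceSketch(long memoryMb, int vcores, Map<String, Long> others) {
        this.memoryMb = memoryMb;
        this.vcores = vcores;
        this.others = others;
    }

    public static void main(String[] args) {
        Map<String, Long> extra = new LinkedHashMap<>();
        extra.put("yarn.io/gpu", 4L); // GPU handled as a generic resource type
        ResourceSketch r = new ResourceSketch(8192, 4, extra);
        System.out.println(r.others.get("yarn.io/gpu")); // 4
    }
}
```

Under this shape, absolute capacities for GPU would be parsed from the generic map rather than from a hard-coded `GPUS` enum constant, keeping the code consistent with the `LightWeightResource` constructors quoted above.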
[jira] [Commented] (YARN-10704) The CS effective capacity for absolute mode in UI should support GPU and other custom resources.
[ https://issues.apache.org/jira/browse/YARN-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17306154#comment-17306154 ] Peter Bacsko commented on YARN-10704: - Thanks [~zhuqi] I have some minor comments: 1. {noformat} sb.append(" The CS effective capacity for absolute mode in UI should support GPU and > other custom resources. > > > Key: YARN-10704 > URL: https://issues.apache.org/jira/browse/YARN-10704 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacity scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10704.001.patch, YARN-10704.002.patch, > image-2021-03-19-12-05-28-412.png, image-2021-03-19-12-08-35-273.png > > > Actually there are no information about the effective capacity about GPU in > UI for absolute resource mode. > !image-2021-03-19-12-05-28-412.png|width=873,height=136! > But we have this information in QueueMetrics: > !image-2021-03-19-12-08-35-273.png|width=613,height=268! > > It's very important for our GPU users to use in absolute mode, there still > have nothing to know GPU absolute information in CS Queue UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10597) CSMappingPlacementRule should not create new instance of Groups
[ https://issues.apache.org/jira/browse/YARN-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304971#comment-17304971 ] Peter Bacsko edited comment on YARN-10597 at 3/19/21, 3:35 PM: --- [~shuzirra] is it really that simple? You told me that there were bunch of unit test failures when you tried to change it months back. Anyway it's great news if the change is tiny. was (Author: pbacsko): [~shuzirra] is it really that simple? You told me that there were bunch of unit test failures. Anyway it's great news if the change is tiny. > CSMappingPlacementRule should not create new instance of Groups > --- > > Key: YARN-10597 > URL: https://issues.apache.org/jira/browse/YARN-10597 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-10597.001.patch > > > As [~ahussein] pointed out in YARN-10425, no new Groups instance should be > created. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10597) CSMappingPlacementRule should not create new instance of Groups
[ https://issues.apache.org/jira/browse/YARN-10597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304971#comment-17304971 ] Peter Bacsko commented on YARN-10597: - [~shuzirra] is it really that simple? You told me that there were a bunch of unit test failures. Anyway, it's great news if the change is tiny. > CSMappingPlacementRule should not create new instance of Groups > --- > > Key: YARN-10597 > URL: https://issues.apache.org/jira/browse/YARN-10597 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-10597.001.patch > > > As [~ahussein] pointed out in YARN-10425, no new Groups instance should be > created.
[jira] [Commented] (YARN-10641) Refactor the max app related update, and fix maxApllications update error when add new queues.
[ https://issues.apache.org/jira/browse/YARN-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304117#comment-17304117 ] Peter Bacsko commented on YARN-10641: - +1 Thanks for the patch [~zhuqi] and [~gandras] for the review. Committed to trunk. > Refactor the max app related update, and fix maxApllications update error > when add new queues. > -- > > Key: YARN-10641 > URL: https://issues.apache.org/jira/browse/YARN-10641 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Critical > Attachments: YARN-10641.001.patch, YARN-10641.002.patch, > YARN-10641.003.patch, YARN-10641.004.patch, YARN-10641.005.patch, > YARN-10641.006.patch, image-2021-02-20-15-49-58-677.png, > image-2021-02-20-15-53-51-099.png, image-2021-02-20-15-55-44-780.png, > image-2021-02-20-16-29-18-519.png, image-2021-02-20-16-31-13-714.png > > > When refactor the update logic in YARN-10504 . > The update max applications based abs/cap is wrong, this should be fixed, > because the max applications is key part to limit applications in CS. > For example: > When adding a dynamic queue, the other children's max app of parent queue are > not updated correctly: > !image-2021-02-20-15-53-51-099.png|width=639,height=509! > The new added queue's max app will updated correctly: > !image-2021-02-20-15-55-44-780.png|width=542,height=426! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304089#comment-17304089 ] Peter Bacsko commented on YARN-10692: - Thanks [~zhuqi] for the patch, committed to trunk. > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch, YARN-10692.002.patch, > YARN-10692.003.patch > > > Now there are no node level GPU Utilization, this issue will add it, and add > it to NodeMetrics first. > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304078#comment-17304078 ] Peter Bacsko commented on YARN-10692: - +1 LGTM. Committing this soon. > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch, YARN-10692.002.patch, > YARN-10692.003.patch > > > Now there are no node level GPU Utilization, this issue will add it, and add > it to NodeMetrics first. > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10685) Fix typos in AbstractCSQueue
[ https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304041#comment-17304041 ] Peter Bacsko commented on YARN-10685: - +1 thanks [~zhuqi] for the patch, committed to trunk. > Fix typos in AbstractCSQueue > > > Key: YARN-10685 > URL: https://issues.apache.org/jira/browse/YARN-10685 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10685.001.patch, YARN-10685.002.patch, > YARN-10685.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10685) Fix typos in AbstractCSQueue
[ https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10685: Summary: Fix typos in AbstractCSQueue (was: Fixed some Typo in AbstractCSQueue.) > Fix typos in AbstractCSQueue > > > Key: YARN-10685 > URL: https://issues.apache.org/jira/browse/YARN-10685 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10685.001.patch, YARN-10685.002.patch, > YARN-10685.003.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304027#comment-17304027 ] Peter Bacsko commented on YARN-10674: - Thanks [~zhuqi] for the patch. I think we are very close. I still have some comments: 1. {noformat} private FSConfigToCSConfigConverterParams. PreemptionMode disablePreemption; private FSConfigToCSConfigConverterParams. PreemptionMode preemptionMode; {noformat} We don't need two enums. We need only one which covers all states (enabled / observeonly / nopolicy). You can extend {{PreemptionMode}} with a new variable which says whether it's enabled or disabled: {noformat} public enum PreemptionMode { ENABLE("enable", true), NO_POLICY("nopolicy", false), OBSERVE_ONLY("observeonly", false); private String cliOption; private boolean enabled; PreemptionMode(String cliOption, boolean enabled) { this.cliOption = cliOption; this.enabled = enabled; } public String getCliOption() { return cliOption; } public boolean isEnabled() { return enabled; } {noformat} So you just call {{preemptionMode.isEnabled()}} and don't need two variables just to hold the information whether it's enabled or not. 2. {{public static PreemptionMode fromString(String cliOption)}} --> this method never returns ENABLED, which is important (also, pls change "ENABLE" to "ENABLED", note the "D" at the end). cc [~gandras] please review patch v14. > fs2cs: should support auto created queue deletion. 
> -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch, YARN-10674.013.patch, YARN-10674.014.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
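The review above proposes collapsing the two fields into a single {{PreemptionMode}} enum whose {{fromString}} can also return the enabled state. A minimal, self-contained sketch of that idea — the {{fromString}} name, the default-to-ENABLED behavior, and the exception type are assumptions for illustration, not the committed fs2cs code:

```java
// Sketch of a single PreemptionMode enum carrying both the CLI option
// string and the enabled flag, so no second variable is needed.
enum PreemptionMode {
    ENABLED("enable", true),
    NO_POLICY("nopolicy", false),
    OBSERVE_ONLY("observeonly", false);

    private final String cliOption;
    private final boolean enabled;

    PreemptionMode(String cliOption, boolean enabled) {
        this.cliOption = cliOption;
        this.enabled = enabled;
    }

    String getCliOption() { return cliOption; }
    boolean isEnabled() { return enabled; }

    // Maps a CLI argument back to a mode. When no disable option was
    // given at all, preemption stays enabled (an assumed default).
    static PreemptionMode fromString(String cliOption) {
        if (cliOption == null || cliOption.trim().isEmpty()) {
            return ENABLED;
        }
        for (PreemptionMode mode : values()) {
            if (mode.cliOption.equals(cliOption.trim())) {
                return mode;
            }
        }
        throw new IllegalArgumentException(
            "Unknown preemption mode: " + cliOption);
    }
}
```

With this shape, callers only ever ask {{preemptionMode.isEnabled()}}, which is the simplification the comment asks for.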
[jira] [Comment Edited] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303542#comment-17303542 ] Peter Bacsko edited comment on YARN-10692 at 3/17/21, 4:11 PM: --- Thanks [~zhuqi] in general this looks good. I just have two nits: 1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, the method name looks better this way 2. {{getNodeGPUUtilization()}} you can simplify the addition with streams: {noformat} float totalGpuUtilization = 0; if (gpuList != null && gpuList.size() != 0) { totalGpuUtilization = gpuList .stream() .map(g -> g.getGpuUtilizations().getOverallGpuUtilization()) .collect(Collectors.summingDouble(Float::floatValue)) .floatValue() / gpuList.size(); } return totalGpuUtilization; {noformat} Also, you should consider renaming "totalGpuUtilization" to "nodeGpuUtilization" so that it matches the method name. was (Author: pbacsko): Thanks [~zhuqi] in general this looks good. I just have two nits: 1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, the method name looks better this way 2. {{getNodeGPUUtilization()}} you can simplify the addition with streams: {noformat} float totalGpuUtilization = 0; if (gpuList != null && gpuList.size() != 0) { totalGpuUtilization = gpuList .stream() .map(g -> g.getGpuUtilizations().getOverallGpuUtilization()) .collect(Collectors.summingDouble(Float::floatValue)) .floatValue() / gpuList.size(); } return totalGpuUtilization; {noformat} > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch, YARN-10692.002.patch > > > Now there are no node level GPU Utilization, this issue will add it, and add > it to NodeMetrics first. 
> cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras]
[jira] [Commented] (YARN-10692) Add Node GPU Utilization and apply to NodeMetrics.
[ https://issues.apache.org/jira/browse/YARN-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303542#comment-17303542 ] Peter Bacsko commented on YARN-10692: - Thanks [~zhuqi] in general this looks good. I just have two nits: 1. {{getNodeGPUUtilization()}} --> rename this to {{getNodeGpuUtilization()}}, the method name looks better this way 2. {{getNodeGPUUtilization()}} you can simplify the addition with streams: {noformat} float totalGpuUtilization = 0; if (gpuList != null && gpuList.size() != 0) { totalGpuUtilization = gpuList .stream() .map(g -> g.getGpuUtilizations().getOverallGpuUtilization()) .collect(Collectors.summingDouble(Float::floatValue)) .floatValue() / gpuList.size(); } return totalGpuUtilization; {noformat} > Add Node GPU Utilization and apply to NodeMetrics. > -- > > Key: YARN-10692 > URL: https://issues.apache.org/jira/browse/YARN-10692 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10692.001.patch, YARN-10692.002.patch > > > Now there are no node level GPU Utilization, this issue will add it, and add > it to NodeMetrics first. > cc [~pbacsko] [~Jim_Brennan] [~ebadger] [~gandras] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
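The stream-based averaging suggested above can be sketched in isolation. {{GpuInfo}} below is a stand-in for YARN's per-device class (which exposes utilization through a nested {{getGpuUtilizations()}} object), so the class and method names here are illustrative assumptions:

```java
import java.util.Arrays;
import java.util.List;

// Stand-in for the real per-GPU device class.
class GpuInfo {
    private final float overallUtilization;
    GpuInfo(float overallUtilization) { this.overallUtilization = overallUtilization; }
    float getOverallGpuUtilization() { return overallUtilization; }
}

class NodeGpuUtil {
    // Averages per-device utilization into one node-level figure,
    // returning 0 when there are no GPUs (matching the suggested guard).
    static float getNodeGpuUtilization(List<GpuInfo> gpuList) {
        float nodeGpuUtilization = 0;
        if (gpuList != null && !gpuList.isEmpty()) {
            nodeGpuUtilization = (float) (gpuList.stream()
                .mapToDouble(GpuInfo::getOverallGpuUtilization)
                .sum() / gpuList.size());
        }
        return nodeGpuUtilization;
    }
}
```

Note that {{mapToDouble(...).sum()}} is a slightly terser alternative to the {{Collectors.summingDouble}} pipeline quoted in the comment; both compute the same average.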
[jira] [Updated] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues
[ https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10497: Labels: capacity-scheduler capacityscheduler (was: ) > Fix an issue in CapacityScheduler which fails to delete queues > -- > > Key: YARN-10497 > URL: https://issues.apache.org/jira/browse/YARN-10497 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Fix For: 3.4.0 > > Attachments: YARN-10497.001.patch, YARN-10497.002.patch, > YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, > YARN-10497.006.patch > > > We saw an exception when using queue mutation APIs: > {code:java} > 2020-11-13 16:47:46,327 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Queue > root.am2cmQueueSecond not found > {code} > Which comes from this code: > {code:java} > List siblingQueues = getSiblingQueues(queueToRemove, > proposedConf); > if (!siblingQueues.contains(queueName)) { > throw new IOException("Queue " + queueToRemove + " not found"); > } > {code} > (Inside MutableCSConfigurationProvider) > If you look at the method: > {code:java} > > private List getSiblingQueues(String queuePath, Configuration conf) > { > String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.')); > String childQueuesKey = CapacitySchedulerConfiguration.PREFIX + > parentQueue + CapacitySchedulerConfiguration.DOT + > CapacitySchedulerConfiguration.QUEUES; > return new ArrayList<>(conf.getStringCollection(childQueuesKey)); > } > {code} > And here's capacity-scheduler.xml I got > {code:java} > yarn.scheduler.capacity.root.queuesdefault, q1, > q2 > {code} > You can notice there're spaces between default, q1, a2 > So conf.getStringCollection returns: > {code:java} > default > q1 > ... 
> {code} > Which causes a match failure when we try to delete the queue.
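The root cause — untrimmed values read from a comma-separated property — can be reproduced with plain string splitting, standing in here for Hadoop's {{Configuration.getStringCollection}} (an analogy for illustration, not the actual Hadoop code path):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SiblingCheck {
    // Splitting "default, q1, q2" on bare commas keeps the leading
    // spaces, so the list holds " q1" and contains("q1") returns false —
    // the failure mode described in the issue.
    static List<String> untrimmed(String value) {
        return Arrays.asList(value.split(","));
    }

    // Trimming each element restores the expected match behavior.
    static List<String> trimmed(String value) {
        return Arrays.stream(value.split(","))
            .map(String::trim)
            .collect(Collectors.toList());
    }
}
```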
[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues
[ https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303365#comment-17303365 ] Peter Bacsko commented on YARN-10497: - +1 Thanks [~wangda] / [~zhuqi] for the patch and [~gandras], [~shuzirra] for the review. Committed to trunk. > Fix an issue in CapacityScheduler which fails to delete queues > -- > > Key: YARN-10497 > URL: https://issues.apache.org/jira/browse/YARN-10497 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-10497.001.patch, YARN-10497.002.patch, > YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, > YARN-10497.006.patch > > > We saw an exception when using queue mutation APIs: > {code:java} > 2020-11-13 16:47:46,327 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Queue > root.am2cmQueueSecond not found > {code} > Which comes from this code: > {code:java} > List siblingQueues = getSiblingQueues(queueToRemove, > proposedConf); > if (!siblingQueues.contains(queueName)) { > throw new IOException("Queue " + queueToRemove + " not found"); > } > {code} > (Inside MutableCSConfigurationProvider) > If you look at the method: > {code:java} > > private List getSiblingQueues(String queuePath, Configuration conf) > { > String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.')); > String childQueuesKey = CapacitySchedulerConfiguration.PREFIX + > parentQueue + CapacitySchedulerConfiguration.DOT + > CapacitySchedulerConfiguration.QUEUES; > return new ArrayList<>(conf.getStringCollection(childQueuesKey)); > } > {code} > And here's capacity-scheduler.xml I got > {code:java} > yarn.scheduler.capacity.root.queuesdefault, q1, > q2 > {code} > You can notice there're spaces between default, q1, a2 > So conf.getStringCollection returns: > {code:java} > default > q1 > ... 
> {code} > Which causes a match failure when we try to delete the queue.
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303342#comment-17303342 ] Peter Bacsko commented on YARN-10674: - [~gandras] good suggestions, thanks! [~zhuqi] please apply the suggested modifications. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10497) Fix an issue in CapacityScheduler which fails to delete queues
[ https://issues.apache.org/jira/browse/YARN-10497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303245#comment-17303245 ] Peter Bacsko commented on YARN-10497: - I think it's good. Let's wait for Jenkins and I'll commit it. > Fix an issue in CapacityScheduler which fails to delete queues > -- > > Key: YARN-10497 > URL: https://issues.apache.org/jira/browse/YARN-10497 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Wangda Tan >Assignee: Wangda Tan >Priority: Major > Attachments: YARN-10497.001.patch, YARN-10497.002.patch, > YARN-10497.003.patch, YARN-10497.004.patch, YARN-10497.005.patch, > YARN-10497.006.patch > > > We saw an exception when using queue mutation APIs: > {code:java} > 2020-11-13 16:47:46,327 WARN > org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebServices: > CapacityScheduler configuration validation failed:java.io.IOException: Queue > root.am2cmQueueSecond not found > {code} > Which comes from this code: > {code:java} > List siblingQueues = getSiblingQueues(queueToRemove, > proposedConf); > if (!siblingQueues.contains(queueName)) { > throw new IOException("Queue " + queueToRemove + " not found"); > } > {code} > (Inside MutableCSConfigurationProvider) > If you look at the method: > {code:java} > > private List getSiblingQueues(String queuePath, Configuration conf) > { > String parentQueue = queuePath.substring(0, queuePath.lastIndexOf('.')); > String childQueuesKey = CapacitySchedulerConfiguration.PREFIX + > parentQueue + CapacitySchedulerConfiguration.DOT + > CapacitySchedulerConfiguration.QUEUES; > return new ArrayList<>(conf.getStringCollection(childQueuesKey)); > } > {code} > And here's capacity-scheduler.xml I got > {code:java} > yarn.scheduler.capacity.root.queuesdefault, q1, > q2 > {code} > You can notice there're spaces between default, q1, a2 > So conf.getStringCollection returns: > {code:java} > default > q1 > ... > {code} > Which causes match issue when we try to delete the queue. 
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303222#comment-17303222 ] Peter Bacsko commented on YARN-10674: - [~gandras] do you have further comments? I think the patch is in good shape now. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch, YARN-10674.011.patch, > YARN-10674.012.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules
[ https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302878#comment-17302878 ] Peter Bacsko edited comment on YARN-10370 at 3/16/21, 8:36 PM: --- [~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There are some open tasks left. I think it's safe to say that this feature is ready and the remaining tasks can be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we might need to keep this open for a long time. IMO we should move the open / patch available tasks under a new umbrella and resolve this, marked with a proper Fix version. Opinions? was (Author: pbacsko): [~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There are some open tasks left. I think it's safe to say that the umbrella is done and the remaining tasks can be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we might need to keep this open for a long time. IMO we should move the open / patch available tasks under a new umbrella and resolve this, marked with a proper Fix version. Opinions? > [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue > Mapping rules > --- > > Key: YARN-10370 > URL: https://issues.apache.org/jira/browse/YARN-10370 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: MappingRuleEnhancements.pdf, Possible extensions of > mapping rule format in Capacity Scheduler.pdf > > > To continue closing the feature gaps between Fair Scheduler and Capacity > Scheduler to help users migrate between the scheduler more easy, we need to > add some of the Fair Scheduler placement rules to the capacity scheduler's > queue mapping functionality. > With [~snemeth] and [~pbacsko] we've created the following design docs about > the proposed changes. 
[jira] [Commented] (YARN-10370) [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue Mapping rules
[ https://issues.apache.org/jira/browse/YARN-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302878#comment-17302878 ] Peter Bacsko commented on YARN-10370: - [~shuzirra] [~snemeth] the vast majority of tasks in this JIRA are done. There are some open tasks left. I think it's safe to say that the umbrella is done and the remaining tasks can be completed either as standalone tasks or under a "Part II" JIRA. Otherwise we might need to keep this open for a long time. IMO we should move the open / patch available tasks under a new umbrella and resolve this, marked with a proper Fix version. Opinions? > [Umbrella] Reduce the feature gap between FS Placement Rules and CS Queue > Mapping rules > --- > > Key: YARN-10370 > URL: https://issues.apache.org/jira/browse/YARN-10370 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Gergely Pollak >Assignee: Gergely Pollak >Priority: Major > Labels: capacity-scheduler, capacityscheduler > Attachments: MappingRuleEnhancements.pdf, Possible extensions of > mapping rule format in Capacity Scheduler.pdf > > > To continue closing the feature gaps between Fair Scheduler and Capacity > Scheduler to help users migrate between the scheduler more easy, we need to > add some of the Fair Scheduler placement rules to the capacity scheduler's > queue mapping functionality. > With [~snemeth] and [~pbacsko] we've created the following design docs about > the proposed changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
[ https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302599#comment-17302599 ] Peter Bacsko commented on YARN-10686: - +1 Thanks [~zhuqi] for the patch and [~gandras] for the review. Committed to trunk. > Fix > TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode > - > > Key: YARN-10686 > URL: https://issues.apache.org/jira/browse/YARN-10686 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10686.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10686) Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode
[ https://issues.apache.org/jira/browse/YARN-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10686: Summary: Fix TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode (was: Fix testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode user error.) > Fix > TestCapacitySchedulerAutoQueueCreation#testAutoQueueCreationFailsForEmptyPathWithAQCAndWeightMode > - > > Key: YARN-10686 > URL: https://issues.apache.org/jira/browse/YARN-10686 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10686.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma
[ https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302567#comment-17302567 ] Peter Bacsko commented on YARN-10682: - +1 Thanks for the patch [~zhuqi] and [~gandras] for the review, committed to trunk. > The scheduler monitor policies conf should trim values separated by comma > - > > Key: YARN-10682 > URL: https://issues.apache.org/jira/browse/YARN-10682 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10682.001.patch > > > When i configured scheduler monitor policies with space, the RM will start > with error. > The conf should support trim between "," , such as : > "a,b,c" is supported now, but "a, b, c" is not supported now, just add > trim in this jira. > > When tested multi policy, it happened. > > yarn.resourcemanager.scheduler.monitor.policies > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueueConfigurationAutoRefreshPolicy, > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-10682) The scheduler monitor policies conf should trim values separated by comma
[ https://issues.apache.org/jira/browse/YARN-10682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10682: Summary: The scheduler monitor policies conf should trim values separated by comma (was: The scheduler monitor policies conf should support trim between ",".) > The scheduler monitor policies conf should trim values separated by comma > - > > Key: YARN-10682 > URL: https://issues.apache.org/jira/browse/YARN-10682 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10682.001.patch > > > When i configured scheduler monitor policies with space, the RM will start > with error. > The conf should support trim between "," , such as : > "a,b,c" is supported now, but "a, b, c" is not supported now, just add > trim in this jira. > > When tested multi policy, it happened. > > yarn.resourcemanager.scheduler.monitor.policies > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.QueueConfigurationAutoRefreshPolicy, > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
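What "trim values separated by comma" amounts to can be shown with a tiny parser; the {{parsePolicies}} helper is hypothetical and only illustrates the intended behavior (Hadoop itself has {{getTrimmedStrings}}-style utilities for exactly this):

```java
class PolicyListParser {
    // Splits a comma-separated class list while discarding whitespace
    // around each entry, so "a, b, c" and "a,b,c" resolve identically.
    static String[] parsePolicies(String value) {
        if (value == null || value.trim().isEmpty()) {
            return new String[0];
        }
        return value.trim().split("\\s*,\\s*");
    }
}
```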
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17302548#comment-17302548 ] Peter Bacsko commented on YARN-10674: - Thanks [~zhuqi] this is definitely looks better. We're close to the final version. Some comments: 1. {noformat} Disable the preemption with nopolicy or observeonly mode, " + "default mode is nopolicy with no arg." + "When use nopolicy arg, it means to remove " + "ProportionalCapacityPreemptionPolicy for CS preemption, " + "When use observeonly arg, " + "it means to set " + "yarn.resourcemanager.monitor.capacity.preemption.observe_only " + "to true" {noformat} I'd to slightly modify this text: {noformat} Disable the preemption with \"nopolicy\" or \"observeonly\" mode. Default is \"nopolicy\". \"nopolicy\" removes ProportionalCapacityPreemptionPolicy from the list of monitor policies. \"observeronly\" sets \"yarn.resourcemanager.monitor.capacity.preemption.observe_only\" to true. {noformat} 2. This definition: {{private String disablePreemptionMode;}} This should be a simple enum like: {noformat} public enum DisablePreemptionMode { OBSERVE_ONLY { @Override String getCliOption() { return "observeonly"; } }, NO_POLICY { @Override String getCliOption() { return "nopolicy"; } }; abstract String getCliOption(); } {noformat} So you can also use them here: {noformat} private static void checkDisablePreemption(CliOption cliOption, String disablePreemptionMode) { if (disablePreemptionMode == null || disablePreemptionMode.trim().isEmpty()) { // The default mode is nopolicy. return; } try { DisablePreemptionMode.valueOf(disablePreemptionMode); } catch (IllegalArgumentException e) { throw new PreconditionException( String.format("Specified disable-preemption option %s is illegal, " + " use \"nopolicy\" or \"observeonly\"")); } {noformat} "disablePreemptionMode" should be an enum everywhere. 3. 
{noformat} public void convertSiteProperties(Configuration conf, Configuration yarnSiteConfig, boolean drfUsed, boolean enableAsyncScheduler, boolean userPercentage, boolean disablePreemption, String disablePreemptionMode) { {noformat} Here "disablePreemptionMode" should also be an enum, and make sure that it always has a value. If it always has a value, this part becomes much simpler: {noformat} if (disablePreemption && disablePreemptionMode == DisablePreemptionMode.NO_POLICY) { yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, ""); } {noformat} 4. {{AutoCreatedQueueDeletionPolicy.class.getCanonicalName()}} This string is referenced very often in the tests. Instead, use a final String: {noformat} private static final String DELETION_POLICY_CLASS = AutoCreatedQueueDeletionPolicy.class.getCanonicalName(); {noformat} This makes the tests much more readable. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch, YARN-10674.008.patch, > YARN-10674.009.patch, YARN-10674.010.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... 
> Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
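The enum suggested in point 2 above can be wired to the CLI end to end. Below is a minimal, self-contained sketch of that idea; the class name, the constructor-based variant of the per-constant override, and the use of IllegalArgumentException in place of fs2cs's PreconditionException are all my own stand-ins, not the actual patch code:

```java
// Sketch of the DisablePreemptionMode idea from the review comment above.
// IllegalArgumentException stands in for fs2cs's PreconditionException so
// the example compiles on its own.
public class DisablePreemptionModeSketch {

    public enum DisablePreemptionMode {
        OBSERVE_ONLY("observeonly"),
        NO_POLICY("nopolicy");

        private final String cliOption;

        DisablePreemptionMode(String cliOption) {
            this.cliOption = cliOption;
        }

        String getCliOption() {
            return cliOption;
        }
    }

    // Maps a raw CLI value to the enum; a missing value means the default
    // mode ("nopolicy"). Enum.valueOf() is deliberately avoided because the
    // CLI values ("observeonly") differ from the constant names.
    public static DisablePreemptionMode fromCliOption(String value) {
        if (value == null || value.trim().isEmpty()) {
            return DisablePreemptionMode.NO_POLICY;
        }
        for (DisablePreemptionMode mode : DisablePreemptionMode.values()) {
            if (mode.getCliOption().equals(value.trim())) {
                return mode;
            }
        }
        throw new IllegalArgumentException(String.format(
            "Specified disable-preemption option %s is illegal, "
            + "use \"nopolicy\" or \"observeonly\"", value));
    }

    public static void main(String[] args) {
        assert fromCliOption("observeonly") == DisablePreemptionMode.OBSERVE_ONLY;
        assert fromCliOption(null) == DisablePreemptionMode.NO_POLICY;
        System.out.println("ok");
    }
}
```

With this shape, validation and parsing collapse into one method, so the separate checkDisablePreemption() step becomes unnecessary.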
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300319#comment-17300319 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] I didn't have too much time to deeply review the patch, but your change ignore the "observeonly" setting. So, if I use "\-\-disablepreemption observeonly", nothing happens. Could you insert this to {{FSConfigToCSConfigConverter}}? I believe that is the best place for it. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17300319#comment-17300319 ] Peter Bacsko edited comment on YARN-10674 at 3/12/21, 1:34 PM: --- [~zhuqi] I didn't have too much time to deeply review the patch, but your change ignores the "observeonly" setting. So, if I use "\-\-disablepreemption observeonly", nothing happens. Could you insert this to {{FSConfigToCSConfigConverter}}? I believe that is the best place for it. was (Author: pbacsko): [~zhuqi] I didn't have too much time to deeply review the patch, but your change ignore the "observeonly" setting. So, if I use "\-\-disablepreemption observeonly", nothing happens. Could you insert this to {{FSConfigToCSConfigConverter}}? I believe that is the best place for it. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch, YARN-10674.007.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602 ] Peter Bacsko edited comment on YARN-10674 at 3/11/21, 3:25 PM: --- Ok, I did some research, I think we have 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? was (Author: pbacsko): Ok, I did some research, I think we 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. 
So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602 ] Peter Bacsko edited comment on YARN-10674 at 3/11/21, 2:43 PM: --- Ok, I did some research, I think we 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT to {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? was (Author: pbacsko): Ok, I did some research, I think we 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT in {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. 
So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299602#comment-17299602 ] Peter Bacsko commented on YARN-10674: - Ok, I did some research, I think we 3 options to completely disable preemption: 1) Set disable_preemption to "root", which will propagate down to other queues. 2) Remove "ProportionalCapacityPreemptionPolicy" from the list of policies. 3) Enable "observe_only" property. I think #1 is not really good, because it relies on a side-effect (propagation of a setting). The intention is not clear. #2 is perfectly acceptable and this goes to {{yarn-site.xml}} so it should be in {{FSYarnSiteConverter}}. #3 is also OK, but that goes to {{capacity-scheduler.xml}} and NOT in {{yarn-site.xml}}, I just verified it. So this should be placed somewhere else. So we can do: 1) Vote for what's best 2) Introduce a command line switch like "-dp" "\-\-disable-preemption" with values like "nopolicy" or "observeonly" and we pick a default value, eg. "nopolicy". So we can do something like: {noformat} yarn fs2cs --disable-preemption observeonly --yarnsiteconfig /path/to/yarn-site.xml {noformat} [~gandras] [~zhuqi] what do you think? > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... 
> Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
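For reference, options #2 and #3 from the comment above map to concrete properties; a sketch of what the converted configuration could contain (option #2 goes to yarn-site.xml, option #3 to capacity-scheduler.xml, as verified in the thread):

```xml
<!-- Option #2, yarn-site.xml: drop ProportionalCapacityPreemptionPolicy
     from the monitor policy list (an empty value means no preemption
     policy runs at all). -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value></value>
</property>

<!-- Option #3, capacity-scheduler.xml: keep the policy running, but only
     observing; it never issues preemption or kill events. -->
<property>
  <name>yarn.resourcemanager.monitor.capacity.preemption.observe_only</name>
  <value>true</value>
</property>
```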
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299466#comment-17299466 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] yes that's right. This is the default setting for policies: {noformat} <property> <description>The list of SchedulingEditPolicy classes that interact with the scheduler. A particular module may be incompatible with the scheduler, other policies, or a configuration of either.</description> <name>yarn.resourcemanager.scheduler.monitor.policies</name> <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value> </property> {noformat} This is from {{yarn-default.xml}}. So when we don't use preemption, we should remove this policy. But we actually have to think a little bit, because the way we disable preemption affects our downstream Hadoop codebase. So let's wait until we figure out what is the best solution to turn off preemption. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299456#comment-17299456 ] Peter Bacsko commented on YARN-10674: - [~gandras] h - that's true. I just overcomplicated the whole thing (not that preemption in general is easy to begin with). Yes, we don't need it if we don't have the policy. [~zhuqi] please wait with the new patch. What Andras said is correct, but there might be other changes that I'll recommend. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17299427#comment-17299427 ] Peter Bacsko commented on YARN-10674: - I'll do a deeper review today. [~gandras] you say: "Is setting observe only necessary here? This is an extremely subtle property.". I'm not sure how subtle it is, but it is mentioned in the upstream documentation: |{{yarn.resourcemanager.monitor.capacity.preemption.observe_only}}|If true, run the policy but do not affect the cluster with preemption and kill events. Default value is false| However, if someone thinks that disabling preemption for "root" is a better solution, I'm not against that. We might need other folks to chime in and share their thoughts. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch, YARN-10674.004.patch, YARN-10674.005.patch, > YARN-10674.006.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10685) Fixed some Typo in AbstractCSQueue.
[ https://issues.apache.org/jira/browse/YARN-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298874#comment-17298874 ] Peter Bacsko commented on YARN-10685: - Sure, I'll check it out. > Fixed some Typo in AbstractCSQueue. > > > Key: YARN-10685 > URL: https://issues.apache.org/jira/browse/YARN-10685 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10685.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10571) Refactor dynamic queue handling logic
[ https://issues.apache.org/jira/browse/YARN-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298861#comment-17298861 ] Peter Bacsko commented on YARN-10571: - [~gandras] thanks for the patch. I just have one question: the class {{CapacitySchedulerAutoQueueHandler}} was renamed to {{CapacitySchedulerQueueHandler}}. But the latter tells me that this is a class which handles all kinds of queues, not just auto-created queues. Wouldn't it make sense to keep the original name? Even the instance is called {{autoQueueHandler}}. Also, there's a Javadoc and a checkstyle problem. > Refactor dynamic queue handling logic > - > > Key: YARN-10571 > URL: https://issues.apache.org/jira/browse/YARN-10571 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Andras Gyori >Assignee: Andras Gyori >Priority: Minor > Attachments: YARN-10571.001.patch > > > As per YARN-10506 we have introduced another mode for auto queue creation > and a new class, which handles it. We should move the old, managed queue > related logic to CSAutoQueueHandler as well, and do additional cleanup > regarding queue management. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298842#comment-17298842 ] Peter Bacsko commented on YARN-10674: - Ok, here is what I found: 1. {{RM_SCHEDULER_ENABLE_MONITORS}} --> ok, this can be set to "true" in all cases. 2. If FS preemption is disabled --> there is a property which is better than configuring the "root" queue. If FS preemption is disabled ({{yarn.scheduler.fair.preemption}} = {{false}}), then we should generate {{yarn.resourcemanager.monitor.capacity.preemption.observe_only}} = {{true}}. This means that we have the monitor thread running but we don't do any preemption. So we don't need to set "root.disable_preemption". 3. As I mentioned, the {{Configuration}} object is empty. The problem is, in order to use preemption, we need to set the preemption policy, which is missing right now. So, if FS preemption is enabled, this line must be added: {noformat} if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION, FairSchedulerConfiguration.DEFAULT_PREEMPTION)) { yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, ProportionalCapacityPreemptionPolicy.class.getCanonicalName()); ... {noformat} So, the modified code should look like this: {noformat} yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION, FairSchedulerConfiguration.DEFAULT_PREEMPTION)) { yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, ProportionalCapacityPreemptionPolicy.class.getCanonicalName()); ... } else { // no preemption yarnSiteConfig.setBoolean(CapacitySchedulerConfiguration.PREEMPTION_OBSERVE_ONLY, true); } // new code comes here if (!userPercentage) { String policies = yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES); if (policies == null) { ... {noformat} Please modify the test cases accordingly and fix the checkstyle issues as well. 
> fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
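The control flow suggested in the comment above can be illustrated without the Hadoop classes. In the sketch below, a plain java.util.Map stands in for Hadoop's Configuration, and the class and method names are placeholders rather than the real converter API; the property names and the ProportionalCapacityPreemptionPolicy / AutoCreatedQueueDeletionPolicy class names are the ones discussed in this thread:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the convertSiteProperties() branching described above:
// always enable monitors, pick the policy list based on FS preemption,
// and append the deletion policy when converting to weight mode.
public class SitePreemptionSketch {

    static final String ENABLE_MONITORS =
        "yarn.resourcemanager.scheduler.monitor.enable";
    static final String MONITOR_POLICIES =
        "yarn.resourcemanager.scheduler.monitor.policies";
    static final String OBSERVE_ONLY =
        "yarn.resourcemanager.monitor.capacity.preemption.observe_only";
    static final String PROPORTIONAL_POLICY = "org.apache.hadoop.yarn.server."
        + "resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy";
    static final String DELETION_POLICY = "org.apache.hadoop.yarn.server."
        + "resourcemanager.scheduler.capacity.AutoCreatedQueueDeletionPolicy";

    public static Map<String, String> convert(boolean fsPreemptionEnabled,
                                              boolean usePercentages) {
        Map<String, String> yarnSite = new HashMap<>();
        // Monitors can always be on; what differs is the policy list.
        yarnSite.put(ENABLE_MONITORS, "true");
        if (fsPreemptionEnabled) {
            yarnSite.put(MONITOR_POLICIES, PROPORTIONAL_POLICY);
        } else {
            // No FS preemption: keep the monitor but make it observe-only.
            yarnSite.put(OBSERVE_ONLY, "true");
        }
        if (!usePercentages) {
            // Weight mode also needs the auto-created-queue deletion policy.
            String policies = yarnSite.get(MONITOR_POLICIES);
            yarnSite.put(MONITOR_POLICIES, policies == null
                ? DELETION_POLICY : policies + "," + DELETION_POLICY);
        }
        return yarnSite;
    }
}
```

For example, convert(false, false) yields observe_only=true plus the deletion policy alone, while convert(true, false) yields both policies comma-separated, mirroring the two cases debated in the thread.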
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298776#comment-17298776 ] Peter Bacsko edited comment on YARN-10674 at 3/10/21, 12:03 PM: [~zhuqi] thanks for the patch. I found a new property which is probably good for us if preemption is completely disabled on the FS side. I have to check if it is really acceptable. was (Author: pbacsko): [~zhuqi] thanks for the patch. I found a new property which is probably good for us if preemption is completely disabled on the FS side. I have to check if it is good for us. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298776#comment-17298776 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] thanks for the patch. I found a new property which is probably good for us if preemption is completely disabled on the FS side. I have to check if it is good for us. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch, > YARN-10674.003.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298156#comment-17298156 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] this is very interesting. If we set RM Monitors to enabled, it means that system-wide preemption is always enabled, too: AbstractCSQueue: {noformat} private boolean isQueueHierarchyPreemptionDisabled(CSQueue q, CapacitySchedulerConfiguration configuration) { boolean systemWidePreemption = csContext.getConfiguration() .getBoolean(YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, YarnConfiguration.DEFAULT_RM_SCHEDULER_ENABLE_MONITORS); CSQueue parentQ = q.getParent(); // If the system-wide preemption switch is turned off, all of the queues in // the qPath hierarchy have preemption disabled, so return true. if (!systemWidePreemption) return true; {noformat} However, you already added a policy in YARN-10623, so looks like this property always has to be enabled in weight mode. But what if we convert an FS configuration which disabled preemption completely? I think the best thing we can do right now is that we disable preemption for "root", which will propagate to all other parent queues. So I suggest the following approach: 1. In percentage conversion mode, do not enable RM monitors by default, because it's not needed. 2. In weight mode (which is the default now), we have to enable it. But if "yarn.scheduler.fair.preemption" is false, then "yarn.scheduler.capacity.root.disable_preemption" must be set to true, but only for "root". This can be done in {{FSQueueConverter}}. cc [~bteke] [~gandras] [~snemeth], not sure if this is a good approach, but I can't see anything better. > fs2cs: should support auto created queue deletion. 
> -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298112#comment-17298112 ] Peter Bacsko edited comment on YARN-10674 at 3/9/21, 3:23 PM: -- [~zhuqi] I have the following comments: 1. This change seems to always enable "RM monitors": {noformat} // This should be always true to trigger dynamic queue auto deletion // when expired. yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); {noformat} But I don't think this is necessary. We need to enable it in two cases: preemption is enabled OR we're in weight mode. We don't have auto-queue delete in percentage mode (fs2cs can still convert to percentages with a command line switch). So I suggest that you pass an extra boolean "usePercentages". Invocation from {{FSConfigToCSConfigConverter}}: {noformat} siteConverter.convertSiteProperties(inputYarnSiteConfig, convertedYarnSiteConfig, drfUsed, conversionOptions.isEnableAsyncScheduler(), usePercentages); <-- last argument is new {noformat} Then in the site converter: {noformat} if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION, FairSchedulerConfiguration.DEFAULT_PREEMPTION)) { yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); preemptionEnabled = true; ... } if (!usePercentages) { yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); // setting it again is OK String policies = yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES); if (policies == null) { policies = AutoCreatedQueueDeletionPolicy. class.getCanonicalName(); } else { policies += "," + AutoCreatedQueueDeletionPolicy. class.getCanonicalName(); } yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, policies); // Set the expired for deletion interval to 10s, consistent with fs. yarnSiteConfig.setInt(CapacitySchedulerConfiguration. 
AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10); } {noformat} If I think about it, {{yarnSiteConfig}} is the output config. So this cannot happen: {noformat} } else { policies += "," + AutoCreatedQueueDeletionPolicy. class.getCanonicalName(); } {noformat} This {{Configuration}} object is created with no entries. The {{else}} branch will never be taken. So it can be simplified to: {noformat} if (!usePercentages) { yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); String policy = AutoCreatedQueueDeletionPolicy. class.getCanonicalName(); yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, policy); // Set the expired for deletion interval to 10s, consistent with fs. yarnSiteConfig.setInt(CapacitySchedulerConfiguration. AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10); } {noformat} 2. This also means two separate test cases: * When usePercentages = false, then {{RM_SCHEDULER_ENABLE_MONITORS}} and {{RM_SCHEDULER_MONITOR_POLICIES}} should be set (with preemption = false) * When usePercentages = true, then {{RM_SCHEDULER_ENABLE_MONITORS}} and {{RM_SCHEDULER_MONITOR_POLICIES}} should NOT be set (with preemption = false) I recommend the following naming: {{testRmMonitorsAndPoliciesSetWhenUsingWeights()}} - first scenario {{testRmMonitorsAndPoliciesSetWhenUsingPercentages()}} - second scenario
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298112#comment-17298112 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] I have the following comments: 1. This change seems to always enable "RM monitors": {noformat} // This should be always true to trigger dynamic queue auto deletion // when expired. yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); {noformat} But I don't think this is necessary. We need to enable it in two cases: preemption is enabled OR we're in weight mode. We don't have auto-queue delete in percentage mode (fs2cs can still convert to percentages with a command line switch). So I suggest that you pass an extra boolean "usePercentages". Invocation from {{FSConfigToCSConfigConverter}}: {noformat} siteConverter.convertSiteProperties(inputYarnSiteConfig, convertedYarnSiteConfig, drfUsed, conversionOptions.isEnableAsyncScheduler(), usePercentages); <-- last argument is new {noformat} Then in the site converter: {noformat} if (conf.getBoolean(FairSchedulerConfiguration.PREEMPTION, FairSchedulerConfiguration.DEFAULT_PREEMPTION)) { yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); preemptionEnabled = true; ... } if (!usePercentages) { yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); // setting it again is OK String policies = yarnSiteConfig.get(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES); if (policies == null) { policies = AutoCreatedQueueDeletionPolicy. class.getCanonicalName(); } else { policies += "," + AutoCreatedQueueDeletionPolicy. class.getCanonicalName(); } yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, policies); // Set the expired for deletion interval to 10s, consistent with fs. yarnSiteConfig.setInt(CapacitySchedulerConfiguration. 
AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10); } {noformat} If I think about it, {{yarnSiteConfig}} is the output config. So this cannot happen: {noformat} } else { policies += "," + AutoCreatedQueueDeletionPolicy. class.getCanonicalName(); } {noformat} This {{Configuration}} object is created with no entries. The {{else}} branch will never be taken. So it can be simplified to: {noformat} if (!usePercentages) { yarnSiteConfig.setBoolean( YarnConfiguration.RM_SCHEDULER_ENABLE_MONITORS, true); String policy = AutoCreatedQueueDeletionPolicy. class.getCanonicalName(); yarnSiteConfig.set(YarnConfiguration.RM_SCHEDULER_MONITOR_POLICIES, policy); // Set the expired for deletion interval to 10s, consistent with fs. yarnSiteConfig.setInt(CapacitySchedulerConfiguration. AUTO_CREATE_CHILD_QUEUE_EXPIRED_TIME, 10); } {noformat} 2. This also means two separate test cases: * When usePercentages = false, then {{RM_SCHEDULER_ENABLE_MONITORS}} and {{RM_SCHEDULER_MONITOR_POLICIES}} should be set (with preemption = false) * When usePercentages = true, then {{RM_SCHEDULER_ENABLE_MONITORS}} and {{RM_SCHEDULER_MONITOR_POLICIES}} should NOT be set (with preemption = false) I recommend the following naming: {{testRmMonitorsAndPoliciesSetWhenUsingWeights()}} - first scenario {{testRmMonitorsAndPoliciesSetWhenUsingPercentages()}} - second scenario > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... 
> Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298092#comment-17298092 ] Peter Bacsko commented on YARN-10674: - Ok thanks, I'll review this one soon. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10674) fs2cs: should support auto created queue deletion.
[ https://issues.apache.org/jira/browse/YARN-10674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298074#comment-17298074 ] Peter Bacsko commented on YARN-10674: - [~zhuqi] am I right in thinking that this patch depends on YARN-10682? Because this change generates a config entry with "," and that's not supported yet. > fs2cs: should support auto created queue deletion. > -- > > Key: YARN-10674 > URL: https://issues.apache.org/jira/browse/YARN-10674 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Qi Zhu >Assignee: Qi Zhu >Priority: Major > Labels: fs2cs > Attachments: YARN-10674.001.patch, YARN-10674.002.patch > > > In FS the auto deletion check interval is 10s. > {code:java} > @Override > public void onCheck() { > queueMgr.removeEmptyDynamicQueues(); > queueMgr.removePendingIncompatibleQueues(); > } > while (running) { > try { > synchronized (this) { > reloadListener.onCheck(); > } > ... > Thread.sleep(reloadIntervalMs); > } > /** Time to wait between checks of the allocation file */ > public static final long ALLOC_RELOAD_INTERVAL_MS = 10 * 1000;{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298069#comment-17298069 ] Peter Bacsko commented on YARN-9615: +1 I had to commit twice because there are actually two authors. Thanks for the patch [~jhung] / [~zhuqi] and [~bibinchundatt] for the review. Committed to trunk. > Add dispatcher metrics to RM > > > Key: YARN-9615 > URL: https://issues.apache.org/jira/browse/YARN-9615 > Project: Hadoop YARN > Issue Type: Task >Reporter: Jonathan Hung >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-9615.001.patch, YARN-9615.002.patch, > YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, > YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, > YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.011.patch, > YARN-9615.011.patch, YARN-9615.poc.patch, image-2021-03-04-10-35-10-626.png, > image-2021-03-04-10-36-12-441.png, screenshot-1.png > > > It'd be good to have counts/processing times for each event type in RM async > dispatcher and scheduler async dispatcher. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298068#comment-17298068 ] Peter Bacsko commented on YARN-9615: Thanks [~zhuqi] patch v11 looks good, committing it soon. > Add dispatcher metrics to RM > > > Key: YARN-9615 > URL: https://issues.apache.org/jira/browse/YARN-9615 > Project: Hadoop YARN > Issue Type: Task >Reporter: Jonathan Hung >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-9615.001.patch, YARN-9615.002.patch, > YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, > YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, > YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.011.patch, > YARN-9615.011.patch, YARN-9615.poc.patch, image-2021-03-04-10-35-10-626.png, > image-2021-03-04-10-36-12-441.png, screenshot-1.png > > > It'd be good to have counts/processing times for each event type in RM async > dispatcher and scheduler async dispatcher. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10679) Better logging of uncaught exceptions throughout SLS
[ https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298064#comment-17298064 ] Peter Bacsko commented on YARN-10679: - +1 thanks [~snemeth] for the patch and [~shuzirra] for the review. Committed to trunk. > Better logging of uncaught exceptions throughout SLS > > > Key: YARN-10679 > URL: https://issues.apache.org/jira/browse/YARN-10679 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10679.001.patch > > > In our internal environment, there was a test failure while running SLS tests > with Jenkins. > It's difficult to align the uncaught exceptions (in this case an NPE) and the > log itself as the exception is logged with {{e.printStackTrace()}}. > This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", > exception)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10679) Better logging of uncaught exceptions throughout SLS
[ https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298062#comment-17298062 ] Peter Bacsko commented on YARN-10679: - Ok, this time the failed test is different, most likely a flaky one. Let's investigate it later. > Better logging of uncaught exceptions throughout SLS > > > Key: YARN-10679 > URL: https://issues.apache.org/jira/browse/YARN-10679 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10679.001.patch > > > In our internal environment, there was a test failure while running SLS tests > with Jenkins. > It's difficult to align the uncaught exceptions (in this case an NPE) and the > log itself as the exception is logged with {{e.printStackTrace()}}. > This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", > exception)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10679) Better logging of uncaught exceptions throughout SLS
[ https://issues.apache.org/jira/browse/YARN-10679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298057#comment-17298057 ] Peter Bacsko commented on YARN-10679: - Re-triggered build to see what's going on with TestSLSRunner. > Better logging of uncaught exceptions throughout SLS > > > Key: YARN-10679 > URL: https://issues.apache.org/jira/browse/YARN-10679 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10679.001.patch > > > In our internal environment, there was a test failure while running SLS tests > with Jenkins. > It's difficult to align the uncaught exceptions (in this case an NPE) and the > log itself as the exception is logged with {{e.printStackTrace()}}. > This jira is to replace printStackTrace calls in SLS with {{LOG.error("msg", > exception)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10681) Fix assertion failure message in BaseSLSRunnerTest
[ https://issues.apache.org/jira/browse/YARN-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298041#comment-17298041 ] Peter Bacsko commented on YARN-10681: - +1 thanks [~snemeth] and [~shuzirra] for the patch and review, committed to trunk. > Fix assertion failure message in BaseSLSRunnerTest > -- > > Key: YARN-10681 > URL: https://issues.apache.org/jira/browse/YARN-10681 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Trivial > Attachments: YARN-10681.001.patch > > > There is this failure message: > https://github.com/apache/hadoop/blob/a89ca56a1b0eb949f56e7c6c5c25fdf87914a02f/hadoop-tools/hadoop-sls/src/test/java/org/apache/hadoop/yarn/sls/BaseSLSRunnerTest.java#L129-L130 > The word "catched" should be replaced with "caught". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class
[ https://issues.apache.org/jira/browse/YARN-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298031#comment-17298031 ] Peter Bacsko commented on YARN-10677: - +1 LGTM. Thanks [~snemeth] for the patch and [~zhuqi] for the review. Committed to trunk. (Jenkins is running but I don't expect any issues). > Logger of SLSFairScheduler is provided with the wrong class > --- > > Key: YARN-10677 > URL: https://issues.apache.org/jira/browse/YARN-10677 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10677.001.patch, YARN-10677.002.patch, > YARN-10677.003.patch, YARN-10677.004.patch > > > In SLSFairScheduler, the Logger definition looks like: > https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69 > We need to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10678) Try blocks without catch blocks in SLS scheduler classes can swallow other exceptions
[ https://issues.apache.org/jira/browse/YARN-10678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298010#comment-17298010 ] Peter Bacsko commented on YARN-10678: - +1 thanks [~snemeth] for the patch and [~shuzirra] for the review. Committed to trunk. > Try blocks without catch blocks in SLS scheduler classes can swallow other > exceptions > - > > Key: YARN-10678 > URL: https://issues.apache.org/jira/browse/YARN-10678 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10678-unchecked-exception-from-FS-allocate.diff, > YARN-10678-unchecked-exception-from-FS-allocate_fixed.diff, > YARN-10678.001.patch, > org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_modified.log, > > org.apache.hadoop.yarn.sls.TestReservationSystemInvariants__testSimulatorRunning_original.log > > > In SLSFairScheduler, we have this try-finally block (without catch block) in > the allocate method: > https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L109-L123 > Similarly, in SLSCapacityScheduler: > https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSCapacityScheduler.java#L116-L131 > In the finally block, the updateQueueWithAllocateRequest is invoked: > https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L118 > In our internal environment, there was a situation when an NPE was logged > from this method: > {code} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.updateQueueWithAllocateRequest(SLSFairScheduler.java:262) > at > 
org.apache.hadoop.yarn.sls.scheduler.SLSFairScheduler.allocate(SLSFairScheduler.java:118) > at > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:288) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > at > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > at > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:436) > at > org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:352) > at > org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator$1.run(MRAMSimulator.java:349) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898) > at > org.apache.hadoop.yarn.sls.appmaster.MRAMSimulator.sendContainerRequest(MRAMSimulator.java:348) > at > org.apache.hadoop.yarn.sls.appmaster.AMSimulator.middleStep(AMSimulator.java:212) > at > org.apache.hadoop.yarn.sls.scheduler.TaskRunner$Task.run(TaskRunner.java:94) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > This can happen if the following events occur: > 1. A runtime exception is thrown in FairScheduler or CapacityScheduler's > allocate method > 2. In this case, the local variable called 'allocation' remains null: > https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L110 > 3. 
In updateQueueWithAllocateRequest, this null object will be dereferenced > here: > https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L262 > 4. Then, we have an NPE here: > https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L117-L122 > In this case, we lost the original exception thrown from > FairScheduler#allocate. > In order to fix this, a catch-block should be introduced and the exception > needs to be logged. > The whole thing applies to SLSCapacityScheduler as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
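The masking behaviour described above can be reproduced in a few lines. This is a hedged, self-contained sketch, not the SLS code itself ({{allocate}} and the exception types are only stand-ins): when the scheduler call throws, the {{finally}} block dereferences the still-null {{allocation}} and the resulting NPE replaces the original exception; adding a catch block that logs and rethrows preserves it.

```java
public class FinallyMaskingDemo {
    // Broken shape: mirrors the try-finally in SLSFairScheduler#allocate.
    static String allocate(boolean schedulerThrows) {
        Object allocation = null;
        try {
            if (schedulerThrows) {
                // Stand-in for FairScheduler#allocate failing at runtime.
                throw new IllegalStateException("original scheduler failure");
            }
            allocation = new Object();
            return "ok";
        } finally {
            // Stand-in for updateQueueWithAllocateRequest(allocation):
            // dereferences 'allocation', so when the try block threw,
            // this NullPointerException replaces the original exception.
            allocation.toString();
        }
    }

    // Fixed shape: log and rethrow the original, guard the finally block.
    static String allocateFixed(boolean schedulerThrows) {
        Object allocation = null;
        try {
            if (schedulerThrows) {
                throw new IllegalStateException("original scheduler failure");
            }
            allocation = new Object();
            return "ok";
        } catch (RuntimeException e) {
            // Stand-in for LOG.error("allocate failed", e).
            System.err.println("allocate failed: " + e);
            throw e;
        } finally {
            if (allocation != null) {
                allocation.toString();
            }
        }
    }
}
```

Calling {{allocate(true)}} surfaces a {{NullPointerException}} with no trace of the scheduler failure, while {{allocateFixed(true)}} logs and propagates the original {{IllegalStateException}} — exactly the improvement the ticket asks for.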
[jira] [Commented] (YARN-10677) Logger of SLSFairScheduler is provided with the wrong class
[ https://issues.apache.org/jira/browse/YARN-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298007#comment-17298007 ] Peter Bacsko commented on YARN-10677: - [~snemeth] please fix the whitespace and checkstyle, thanks. > Logger of SLSFairScheduler is provided with the wrong class > --- > > Key: YARN-10677 > URL: https://issues.apache.org/jira/browse/YARN-10677 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10677.001.patch, YARN-10677.002.patch, > YARN-10677.003.patch > > > In SLSFairScheduler, the Logger definition looks like: > https://github.com/apache/hadoop/blob/9cb51bf106802c78b1400fba9f1d1c7e772dd5e7/hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/SLSFairScheduler.java#L69 > We need to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10675) Consolidate YARN-10672 and YARN-10447
[ https://issues.apache.org/jira/browse/YARN-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297998#comment-17297998 ] Peter Bacsko commented on YARN-10675: - +1 LGTM. Thanks [~snemeth] for the patch, committed to trunk. > Consolidate YARN-10672 and YARN-10447 > - > > Key: YARN-10675 > URL: https://issues.apache.org/jira/browse/YARN-10675 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-10675.001.patch > > > Let's consolidate the solution applied for YARN-10672 and apply it to the > code changes introduced with YARN-10447. > Quoting [~pbacsko]: > {quote} > The solution is much straightforward than mine in YARN-10447. Actually we > might consider applying this to TestLeafQueue with undoing my changes, > because that's more complicated (I had no patience to go deeper with Mockito > internal behavior, I just thought well, disable that thread and that's > enough). > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-10676) Improve code quality in TestTimelineAuthenticationFilterForV1
[ https://issues.apache.org/jira/browse/YARN-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297996#comment-17297996 ] Peter Bacsko commented on YARN-10676: - +1 thanks [~snemeth] for the patch and [~bteke] / [~zhuqi] / [~shuzirra] for the review. Committed to trunk. > Improve code quality in TestTimelineAuthenticationFilterForV1 > - > > Key: YARN-10676 > URL: https://issues.apache.org/jira/browse/YARN-10676 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-10676.001.patch > > > - In testcase "testDelegationTokenOperations", the exception message is > checked but in case it does not match the assertion, the exception is not > printed. This happens 3 times. > - Assertion messages can be added > - Fields called "httpSpnegoKeytabFile" and "httpSpnegoPrincipal" can be > static final. > - There's a typo in comment "avaiable" (happens 2 times) > - There are some Assert.fail() calls, without messages. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297642#comment-17297642 ] Peter Bacsko edited comment on YARN-10178 at 3/8/21, 6:54 PM: -- [~zhuqi] this is a tricky patch, I have to understand what's going on. We might ask [~wangda] again to look at it, because I'm not that familiar with the code that has been modified. Having said that, I have some recommendations: 1. {{private final static Random RANDOM = new Random(System.currentTimeMillis());}} Is there a reason why this is static? {{RANDOM}} is only used in the test. Another problem is that, let's assume that it fails. But the problem is that we don't see the random seed that was used for initialization, so this test is not reproducible. I suggest rewriting the test like: {noformat} long seed = System.nanoTime(); // I think nanoTime is better try { .. test code .. } catch (AssertionFailedError e) { LOG.error("Test failed, seed = {}", seed); LOG.error(e); throw e; } {noformat} So at least we can check the logs for the seed number. Or maybe rethrow the exception with a modified message, that's also a solution, or wrap it in a different exception with a new message which contains the seed. The point is, it should be visible. 2. This sanity test only works if JVM is started with "-ea": {noformat} // sanity check assert queueNames != null && priorities != null && utilizations != null && queueNames.length > 0 && queueNames.length == priorities.length && priorities.length == utilizations.length; {noformat} I think this should be converted to normal JUnit assertion or just remove it.
> Global Scheduler async thread crash caused by 'Comparison method violates its > general contract' > --- > > Key: YARN-10178 > URL: https://issues.apache.org/jira/browse/YARN-10178 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.2.1 >Reporter: tuyu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10178.001.patch, YARN-10178.002.patch, > YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch > > > Global Scheduler Async Thread crash stack > {code:java} > ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received > RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, > Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: > Comparison method violates its general contract! 
>at > java.util.TimSort.mergeHi(TimSort.java:899) > at java.util.TimSort.mergeAt(TimSort.java:516) > at java.util.TimSort.mergeForceCollapse(TimSort.java:457) > at java.util.TimSort.sort(TimSort.java:254) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1462) > at java.util.Collections.sort(Collections.java:177) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221) > at >
[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297642#comment-17297642 ] Peter Bacsko commented on YARN-10178: - [~zhuqi] this is a tricky patch, I have to understand what's going on. We might ask [~wangda] again to look at it, because I'm not that familiar with the code that has been modified. Having said that, I have some recommendations: 1. {{private final static Random RANDOM = new Random(System.currentTimeMillis());}} Is there a reason why this is static? {{RANDOM}} is only used in the test. Another problem: suppose the test fails. We don't see the random seed that was used for initialization, so the failure is not reproducible. I suggest rewriting the test like: {noformat} long seed = System.nanoTime(); // I think nanoTime is better try { .. test code .. } catch (AssertionFailedError e) { LOG.error("Test failed, seed = {}", seed, e); throw e; } {noformat} So at least we can check the logs for the seed number. Or we could rethrow the exception with a modified message, or wrap it in a different exception whose message contains the seed. The point is, the seed should be visible. 2. This sanity check only runs if the JVM is started with "-ea": {noformat} // sanity check assert queueNames != null && priorities != null && utilizations != null && queueNames.length > 0 && queueNames.length == priorities.length && priorities.length == utilizations.length; {noformat} I think this should be converted to a normal JUnit assertion, or simply removed. 
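The seed-logging pattern recommended in point 1 can be sketched as follows. This is a minimal, hypothetical shape, not the actual patch: the method name and test body are illustrative, and real test code would use SLF4J's {{LOG.error}} rather than {{System.err}}.

```java
import java.util.Random;

public class SeedReportingTestSketch {
    // The seed is chosen per-run but kept in a local so that a failing
    // run can be replayed with the exact same Random sequence.
    static int runScenario(long seed) {
        try {
            Random random = new Random(seed);
            // ... the test body would drive queue names, priorities and
            // utilizations from 'random' here ...
            return random.nextInt(100);
        } catch (AssertionError e) {
            // Surface the seed so the failure is reproducible.
            System.err.println("Test failed, seed = " + seed);
            throw e;
        }
    }
}
```

The key property is determinism: replaying with the logged seed reproduces the exact inputs that made the test fail.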
> Global Scheduler async thread crash caused by 'Comparison method violates its > general contract' > --- > > Key: YARN-10178 > URL: https://issues.apache.org/jira/browse/YARN-10178 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler >Affects Versions: 3.2.1 >Reporter: tuyu >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-10178.001.patch, YARN-10178.002.patch, > YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch > > > Global Scheduler Async Thread crash stack > {code:java} > ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received > RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, > Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: > Comparison method violates its general contract! >at > java.util.TimSort.mergeHi(TimSort.java:899) > at java.util.TimSort.mergeAt(TimSort.java:516) > at java.util.TimSort.mergeForceCollapse(TimSort.java:457) > at java.util.TimSort.sort(TimSort.java:254) > at java.util.Arrays.sort(Arrays.java:1512) > at java.util.ArrayList.sort(ArrayList.java:1462) > at java.util.Collections.sort(Collections.java:177) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616) > {code} > Java 8's Arrays.sort uses the TimSort algorithm by default, and TimSort requires the comparator to satisfy: > {code:java} > 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x)) > 2. x.compareTo(y) > 0 && y.compareTo(z) > 0 implies x.compareTo(z) > 0 > 3. x.compareTo(y) == 0 implies sgn(x.compareTo(z)) == sgn(y.compareTo(z)) > {code} > If the elements passed to Arrays.sort do not satisfy these requirements, TimSort can throw this exception.
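A comparator that reads fields mutated concurrently by other threads (as queue utilization is here) can violate all three rules mid-sort, which is exactly when TimSort throws "Comparison method violates its general contract". One common remedy, sketched below with hypothetical names (this is an illustration of the technique, not the actual YARN patch), is to snapshot the mutable sort keys before sorting so they stay frozen for the whole sort:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: freeze mutable sort keys before sorting so the
// comparator stays self-consistent even if the source fields change
// concurrently during the sort.
public class SnapshotSortDemo {

    static class Queue {
        volatile float utilization; // mutated by scheduler threads in the real code
        Queue(float u) { utilization = u; }
    }

    // Pair of a queue and its utilization captured at snapshot time.
    static class Snap {
        final Queue queue;
        final float utilization;
        Snap(Queue q) { this.queue = q; this.utilization = q.utilization; }
    }

    static List<Queue> sortByUtilization(List<Queue> queues) {
        List<Snap> snaps = new ArrayList<>();
        for (Queue q : queues) {
            snaps.add(new Snap(q)); // each key is read exactly once, before the sort
        }
        snaps.sort(Comparator.comparingDouble(s -> s.utilization));
        List<Queue> out = new ArrayList<>();
        for (Snap s : snaps) {
            out.add(s.queue);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Queue> qs = new ArrayList<>();
        qs.add(new Queue(0.9f));
        qs.add(new Queue(0.1f));
        qs.add(new Queue(0.5f));
        List<Queue> sorted = sortByUtilization(qs);
        System.out.println(sorted.get(0).utilization); // lowest utilization first
    }
}
```

Because the comparator only ever sees the immutable snapshot values, the sign-symmetry and transitivity requirements hold for the entire duration of the sort.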
[jira] [Commented] (YARN-9615) Add dispatcher metrics to RM
[ https://issues.apache.org/jira/browse/YARN-9615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297504#comment-17297504 ] Peter Bacsko commented on YARN-9615: Let's wait for the Jenkins results of patch v10. > Add dispatcher metrics to RM > > > Key: YARN-9615 > URL: https://issues.apache.org/jira/browse/YARN-9615 > Project: Hadoop YARN > Issue Type: Task >Reporter: Jonathan Hung >Assignee: Qi Zhu >Priority: Major > Attachments: YARN-9615.001.patch, YARN-9615.002.patch, > YARN-9615.003.patch, YARN-9615.004.patch, YARN-9615.005.patch, > YARN-9615.006.patch, YARN-9615.007.patch, YARN-9615.008.patch, > YARN-9615.009.patch, YARN-9615.010.patch, YARN-9615.poc.patch, > image-2021-03-04-10-35-10-626.png, image-2021-03-04-10-36-12-441.png, > screenshot-1.png > > > It'd be good to have counts/processing times for each event type in RM async > dispatcher and scheduler async dispatcher. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
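In its simplest form, "counts/processing times for each event type" could look like the sketch below. The class and method names are invented for this example and are not the API of the YARN-9615 patch:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch of per-event-type dispatcher metrics: a count and a
// cumulative processing time per event type, safe for concurrent dispatcher
// threads thanks to ConcurrentHashMap + LongAdder.
public class EventMetrics {

    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();

    // Called by the dispatcher after handling one event.
    public void record(String eventType, long elapsedNanos) {
        counts.computeIfAbsent(eventType, k -> new LongAdder()).increment();
        totalNanos.computeIfAbsent(eventType, k -> new LongAdder()).add(elapsedNanos);
    }

    public long count(String eventType) {
        LongAdder c = counts.get(eventType);
        return c == null ? 0L : c.sum();
    }

    public long avgNanos(String eventType) {
        long n = count(eventType);
        LongAdder t = totalNanos.get(eventType);
        return (n == 0 || t == null) ? 0L : t.sum() / n;
    }

    public static void main(String[] args) {
        EventMetrics m = new EventMetrics();
        m.record("NODE_UPDATE", 1_000_000L);
        m.record("NODE_UPDATE", 3_000_000L);
        System.out.println(m.count("NODE_UPDATE") + " events, avg "
                + m.avgNanos("NODE_UPDATE") + " ns"); // 2 events, avg 2000000 ns
    }
}
```

LongAdder is chosen over AtomicLong because the dispatcher path is write-heavy: many threads increment, and the metric is only read occasionally.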
[jira] [Updated] (YARN-10672) All testcases in TestReservations are flaky
[ https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10672: Fix Version/s: 3.2.3 3.3.1 > All testcases in TestReservations are flaky > --- > > Key: YARN-10672 > URL: https://issues.apache.org/jira/browse/YARN-10672 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, YARN-10672-debuglogs.patch, YARN-10672.001.patch, YARN-10672.branch-3.2.001.patch, YARN-10672.branch-3.3.001.patch > > > All testcases in TestReservations are flaky. > Running any particular test in TestReservations 100 times never yields 100 passes. > For example, let's run testReservationNoContinueLook 100 times. For me, it produced 39 failed and 61 passed results. > Sometimes just 1 out of 100 runs fails. > Screenshot is attached. > Stacktrace: > {code:java} > java.lang.AssertionError: > Expected :2048 > Actual :0 > > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:633) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642) > {code} > The test fails here: > {code:java} > // Start testing... 
> // Only AM > TestUtils.applyResourceCommitRequest(clusterResource, > a.assignContainers(clusterResource, node_0, > new ResourceLimits(clusterResource), > SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps); > assertEquals(2 * GB, a.getUsedResources().getMemorySize()); > {code} > With some debugging (patch attached), I realized that sometimes there are no registered nodes, so the AM can't be allocated and the test fails: > {code:java} > 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator > (RegularContainerAllocator.java:canAssign(312)) - **Can't assign > container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f > {code} > In these cases, this is also printed from > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes: > {code:java} > 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler > (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real > getNumClusterNodes > {code} > h2. Let's break this down: > 1. The mocking happens in > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration, > boolean): > {code:java} > cs.setRMContext(spyRMContext); > cs.init(csConf); > cs.start(); > when(cs.getNumClusterNodes()).thenReturn(3); > {code} > Under no circumstances should this return any value other than 3. > However, as mentioned above, sometimes the real method of > 'getNumClusterNodes' is called on CapacityScheduler. > 2. Sometimes, this gets printed to the console: > {code:java} > org.mockito.exceptions.misusing.WrongTypeOfReturnValue: > Integer cannot be returned by isMultiNodePlacementEnabled() > isMultiNodePlacementEnabled() should return boolean > *** > If you're unsure why you're getting above error read on. > Due to the nature of the syntax above problem might occur because: > 1. 
This exception *might* occur in wrongly written multi-threaded tests. >Please refer to Mockito FAQ on limitations of concurrency testing. > 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub > spies - >- with doReturn|Throw() family of methods. More in javadocs for > Mockito.spy() method. > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566) > at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at >
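Mockito's own hint (point 2 in the message above) is the relevant one here: in when(spy.foo()).thenReturn(...), the argument spy.foo() is an ordinary Java expression, so the real method executes before the stubbing is registered, whereas doReturn(...).when(spy).foo() installs the answer first and the call is intercepted. A hand-rolled illustration of that evaluation order, with no Mockito involved (all names invented for the example):

```java
// Hand-rolled illustration (no Mockito) of why when(spy.foo()) stubbing can
// hit the real method: the argument expression runs eagerly, before any
// stubbing machinery sees it. A doReturn-style API installs the answer first.
public class SpyStubbingDemo {

    static class Scheduler {
        static int realCalls = 0;
        int getNumClusterNodes() {
            realCalls++;          // the "real" method, which the test must avoid
            return 0;
        }
    }

    static class SpyScheduler extends Scheduler {
        Integer stubbed;          // answer installed by doReturn-style stubbing
        @Override
        int getNumClusterNodes() {
            return stubbed != null ? stubbed : super.getNumClusterNodes();
        }
    }

    public static void main(String[] args) {
        SpyScheduler spy = new SpyScheduler();

        // when(spy.getNumClusterNodes())-style: the call happens right here,
        // so the real method runs before any answer could be installed.
        spy.getNumClusterNodes();
        System.out.println("real calls after when-style: " + Scheduler.realCalls); // 1

        // doReturn(3).when(spy)-style: install the answer first, then call.
        spy.stubbed = 3;
        int answer = spy.getNumClusterNodes();
        System.out.println("stubbed answer: " + answer
                + ", real calls: " + Scheduler.realCalls); // 3, still 1
    }
}
```

This is why Mockito's javadoc recommends the doReturn family for spies: the eager call on an only-partially-configured spy is exactly what produced the flaky "Called real getNumClusterNodes" behaviour described above.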
[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky
[ https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297358#comment-17297358 ] Peter Bacsko commented on YARN-10672: - +1 overall. Committed changes to branch-3.2 too. Thanks [~snemeth] for the contribution.
[jira] [Commented] (YARN-10672) All testcases in TestReservations are flaky
[ https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297345#comment-17297345 ] Peter Bacsko commented on YARN-10672: - Ok, test failures seem to be totally unrelated. The change only concerns "TestReservations" and modifies the order of stubbing.
[jira] [Updated] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-10642: Fix Version/s: 3.2.3 > Race condition: AsyncDispatcher can get stuck by the changes introduced in > YARN-8995 > > > Key: YARN-10642 > URL: https://issues.apache.org/jira/browse/YARN-10642 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 3.2.1 >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Critical > Fix For: 3.4.0, 3.3.1, 3.2.3 > > Attachments: MockForDeadLoop.java, YARN-10642-branch-3.2.001.patch, > YARN-10642-branch-3.2.002.patch, YARN-10642-branch-3.3.001.patch, > YARN-10642.001.patch, YARN-10642.002.patch, YARN-10642.003.patch, > YARN-10642.004.patch, YARN-10642.005.patch, deadloop.png, debugfornode.png, > put.png, take.png > > > In our cluster, the ResourceManager got stuck twice within twenty days, and the YARN client couldn't submit applications. I captured jstack output the second time and found the reason. > Analyzing all the jstacks, I found many threads stuck because they couldn't acquire LinkedBlockingQueue.putLock. (Note: for reasons of space, I omit the analysis.) > The reason is that one thread holds the putLock the whole time: printEventQueueDetails calls forEachRemaining, which holds both the putLock and the takeLock, so the AsyncDispatcher gets stuck. 
> {code} > Thread 6526 (IPC Server handler 454 on default port 8030): > State: RUNNABLE > Blocked count: 29988 > Waited count: 2035029 > Stack: > > java.util.concurrent.LinkedBlockingQueue$LBQSpliterator.forEachRemaining(LinkedBlockingQueue.java:926) > java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) > > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.printEventQueueDetails(AsyncDispatcher.java:270) > > org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:295) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.handleProgress(DefaultAMSProcessor.java:408) > > org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:215) > > org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75) > > org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92) > > org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:432) > > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60) > > org.apache.hadoop.yarn.proto.ApplicationMasterProtocol$ApplicationMasterProtocolService$2.callBlockingMethod(ApplicationMasterProtocol.java:99) > > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528) > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070) > org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1040) > 
org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:958) > java.security.AccessController.doPrivileged(Native Method) > {code} > I analyzed LinkedBlockingQueue's source code and found that forEachRemaining in LinkedBlockingQueue.LBQSpliterator may get stuck when forEachRemaining and take are called from different threads. > YARN-8995 introduced the printEventQueueDetails method; "eventQueue.stream().collect" ends up calling the forEachRemaining method. > Why? "put.png" shows how put("a") works and "take.png" shows how take() works. Special node: a removed node points to itself to help GC! > The key code is in forEachRemaining: LBQSpliterator uses forEachRemaining to visit every node, but after reading an item's value from a node it releases the lock, at which point take() may run. > The variable 'p' in forEachRemaining may then point to a node that points to itself, and forEachRemaining enters a dead loop. You can see it in "deadloop.png". > A simple unit test (MockForDeadLoop.java) reproduces the problem by making forEachRemaining run more slowly than take(). > Debugging MockForDeadLoop.java shows a node pointing to itself; see "debugfornode.png". > Environment: > OS: CentOS Linux release 7.5.1804 (Core) > JDK: jdk1.8.0_281
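The hazard is specific to iterating the live queue node by node: the spliterator releases and reacquires the lock between batches, so it can land on a node that a concurrent take() has already unlinked (and self-linked for GC). A defensive alternative for diagnostics like printEventQueueDetails, sketched here as an illustration rather than the actual YARN-10642 fix, is to copy the queue with toArray(), which holds both locks for one bulk copy, and then group over the immutable snapshot:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Collectors;

// Illustrative sketch: inspect queue contents from an immutable snapshot
// instead of streaming the live LinkedBlockingQueue node by node.
public class SafeQueueStats {

    // toArray() takes both queue locks once, copies, and releases them, so
    // the grouping below never walks live (possibly self-linked) nodes.
    static Map<String, Long> countByType(LinkedBlockingQueue<Object> queue) {
        return Arrays.stream(queue.toArray())
                .collect(Collectors.groupingBy(
                        o -> o.getClass().getSimpleName(),
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Object> q = new LinkedBlockingQueue<>();
        q.add("a");
        q.add("b");
        q.add(42);
        System.out.println(countByType(q).get("String")); // 2
    }
}
```

The snapshot costs one array copy per call, which is acceptable for an occasional diagnostic dump and avoids holding the putLock across the entire traversal.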
[jira] [Commented] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297343#comment-17297343 ] Peter Bacsko commented on YARN-10642: - Ok, pushed to branch-3.2 as well. Thanks for the patch [~zhengchenyu] and [~bteke] / [~zhuqi] for the review.
[jira] [Comment Edited] (YARN-10642) Race condition: AsyncDispatcher can get stuck by the changes introduced in YARN-8995
[ https://issues.apache.org/jira/browse/YARN-10642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297343#comment-17297343 ] Peter Bacsko edited comment on YARN-10642 at 3/8/21, 1:24 PM: -- +1 Ok, pushed to branch-3.2 as well. Thanks for the patch [~zhengchenyu] and [~bteke] / [~zhuqi] for the review. was (Author: pbacsko): Ok, pushed to branch-3.2 as well. Thanks for the patch [~zhengchenyu] and [~bteke] / [~zhuqi] for the review.