[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416342#comment-17416342
 ] 

Hadoop QA commented on YARN-10935:
--

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 8m 39s | | Docker mode activated. |
|| || || || Prechecks || ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | | 0m 0s | test4tests | The patch appears to include 1 new or modified test files. |
|| || || || branch-3.2 Compile Tests || ||
| +1 | mvninstall | 26m 7s | | branch-3.2 passed |
| +1 | compile | 0m 44s | | branch-3.2 passed |
| +1 | checkstyle | 0m 38s | | branch-3.2 passed |
| +1 | mvnsite | 0m 48s | | branch-3.2 passed |
| +1 | shadedclient | 13m 32s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 38s | | branch-3.2 passed |
| 0 | spotbugs | 15m 47s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 38s | | branch-3.2 passed |
|| || || || Patch Compile Tests || ||
| +1 | mvninstall | 0m 46s | | the patch passed |
| +1 | compile | 0m 37s | | the patch passed |
| +1 | javac | 0m 37s | | the patch passed |
| +1 | checkstyle | 0m 32s | | the patch passed |
| +1 | mvnsite | 0m 41s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 13m 18s | | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 30s | | the patch passed |
| +1 | spotbugs | 1m 42s | | the patch passed |
|| || || || Other Tests || ||
| -1 | unit | 74m 46s | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/1209/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-server-resourcemanager in the patch failed. |
| +1 | asflicense | 0m 31s | | The patch does not generate ASF License warnings. |
| | | 146m 17s | | |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: 

[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416314#comment-17416314
 ] 

Eric Payne commented on YARN-10935:
---

OK, attached branch-3.2 patch.
Thanks [~ebadger].

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch, 
> YARN-10935.branch-3.2.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit then ends up much higher than the *{{Max Application Master 
> Resources}}* limit.
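
For reference, a minimal capacity-scheduler.xml sketch of the reproduction setup quoted above (queue names and percentages are illustrative assumptions, not taken from the issue):
{code:xml}
<!-- Hypothetical reproduction config; queue names/values are illustrative. -->
<property>
  <!-- DRF enabled -->
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>parent</value>
</property>
<property>
  <!-- parent capacity and maximum capacity are the same, so the parent can become full -->
  <name>yarn.scheduler.capacity.root.parent.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.queues</name>
  <value>a,b</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.a.capacity</name>
  <value>50</value>
</property>
<property>
  <!-- each sub-queue may grow to 100% of the parent -->
  <name>yarn.scheduler.capacity.root.parent.a.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.b.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.parent.b.maximum-capacity</name>
  <value>100</value>
</property>
{code}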






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: YARN-10935.branch-3.2.003.patch

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch, 
> YARN-10935.branch-3.2.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit then ends up much higher than the *{{Max Application Master 
> Resources}}* limit.






[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416308#comment-17416308
 ] 

Hadoop QA commented on YARN-10935:
--

| (/) *+1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 9m 35s | | Docker mode activated. |
|| || || || Prechecks || ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | | 0m 0s | test4tests | The patch appears to include 1 new or modified test files. |
|| || || || branch-2.10 Compile Tests || ||
| +1 | mvninstall | 14m 38s | | branch-2.10 passed |
| +1 | compile | 0m 49s | | branch-2.10 passed with JDK Azul Systems, Inc.-1.7.0_262-b10 |
| +1 | compile | 0m 41s | | branch-2.10 passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~16.04.1-b10 |
| +1 | checkstyle | 0m 31s | | branch-2.10 passed |
| +1 | mvnsite | 0m 48s | | branch-2.10 passed |
| +1 | javadoc | 0m 37s | | branch-2.10 passed with JDK Azul Systems, Inc.-1.7.0_262-b10 |
| +1 | javadoc | 0m 28s | | branch-2.10 passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~16.04.1-b10 |
| 0 | spotbugs | 4m 0s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 27s | | branch-2.10 passed |
|| || || || Patch Compile Tests || ||
| +1 | mvninstall | 0m 40s | | the patch passed |
| +1 | compile | 0m 44s | | the patch passed with JDK Azul Systems, Inc.-1.7.0_262-b10 |
| +1 | javac | 0m 44s | | the patch passed |
| +1 | compile | 0m 37s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~16.04.1-b10 |
| +1 | javac | 0m 37s | | the patch passed |
| +1 | checkstyle | 0m 25s | | the patch passed |
| +1 | mvnsite | 0m 41s | | the patch passed |
| +1 | whitespace | 0m 1s | | The patch has no whitespace issues. |
| +1 | javadoc | 0m 29s | | the patch passed with JDK Azul Systems, Inc.-1.7.0_262-b10 |
| +1 | javadoc | 0m 25s | | the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~16.04.1-b10 |
| +1 | spotbugs | 1m 34s | | the patch passed |
|| || || || Other Tests || ||
| +1 | unit | 64m 17s | 

[jira] [Commented] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Payne (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416276#comment-17416276
 ] 

Eric Payne commented on YARN-10935:
---

Attaching the branch-2.10 patch. Will look into the branch-3.2 patch.

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit then ends up much higher than the *{{Max Application Master 
> Resources}}* limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Payne (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Payne updated YARN-10935:
--
Attachment: YARN-10935.branch-2.10.003.patch

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch, YARN-10935.branch-2.10.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit then ends up much higher than the *{{Max Application Master 
> Resources}}* limit.






[jira] [Updated] (YARN-10935) AM Total Queue Limit goes below per-user AM Limit if parent is full.

2021-09-16 Thread Eric Badger (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Badger updated YARN-10935:
---
Fix Version/s: 3.3.2
   3.4.0

[~epayne], it looks like the patch applies cleanly back to branch-3.3, so I committed it to trunk 
(3.4) and branch-3.3. But I'll need patches for branch-3.2 and earlier if you'd 
like it backported further.

> AM Total Queue Limit goes below per-user AM Limit if parent is full.
> 
>
> Key: YARN-10935
> URL: https://issues.apache.org/jira/browse/YARN-10935
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacity scheduler, capacityscheduler
>Reporter: Eric Payne
>Assignee: Eric Payne
>Priority: Major
> Fix For: 3.4.0, 3.3.2
>
> Attachments: Screen Shot 2021-09-07 at 12.49.52 PM.png, Screen Shot 
> 2021-09-07 at 12.55.37 PM.png, YARN-10935.001.patch, YARN-10935.002.patch, 
> YARN-10935.003.patch
>
>
> This happens when DRF is enabled and all of one resource is consumed but the 
> second resource still has plenty available.
> This is reproducible by setting up a parent queue where the capacity and max 
> capacity are the same, with 2 or more sub-queues whose max capacity is 100%.
> In one of the sub-queues, start a long-running app that consumes all 
> resources in the parent queue's hierarchy. This app will consume all of the 
> memory but not very many vcores (for example).
> In a second queue, submit an app. The *{{Max Application Master Resources Per 
> User}}* limit then ends up much higher than the *{{Max Application Master 
> Resources}}* limit.






[jira] [Created] (YARN-10958) Use correct configuration for Group service init in CSMappingPlacementRule

2021-09-16 Thread Peter Bacsko (Jira)
Peter Bacsko created YARN-10958:
---

 Summary: Use correct configuration for Group service init in 
CSMappingPlacementRule
 Key: YARN-10958
 URL: https://issues.apache.org/jira/browse/YARN-10958
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Peter Bacsko


There is a potential problem in {{CSMappingPlacementRule.java}}:
{noformat}
if (groups == null) {
  groups = Groups.getUserToGroupsMappingService(conf);
}
{noformat}
The problem is that we're supposed to pass {{scheduler.getConf()}}. The "conf" 
object is the configuration for the capacity scheduler, which does not include the 
property that selects the group service provider. Therefore, the current code 
only works by chance, because the Group mapping service is already initialized at 
this point. See the original fix in YARN-10053.
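
A minimal sketch of the proposed change, assuming {{CSMappingPlacementRule}} keeps the scheduler in a field named {{scheduler}} (the field name is an assumption):
{code:java}
if (groups == null) {
  // Use the service configuration (scheduler.getConf()), which carries
  // "hadoop.security.group.mapping"; the capacity-scheduler-only config does not.
  groups = Groups.getUserToGroupsMappingService(scheduler.getConf());
}
{code}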

We also need a unit test to verify it.

Idea (a rough sketch of the first steps follows after the list below):
 # Create a Configuration object in which the property 
"hadoop.security.group.mapping" refers to a test implementation.
 # Add a new method to {{Groups}} which nulls out the singleton instance, e.g. 
{{Groups.reset()}}.
 # Create a mock CapacityScheduler where {{getConf()}} and 
{{getConfiguration()}} contain different settings for 
"hadoop.security.group.mapping". Since {{getConf()}} is the service config, 
this should return the config object created in step #1.
 # Create an instance of {{CSMappingPlacementRule}} with a single primary group 
rule.
 # Run the placement evaluation.
 # Expected: the returned queue matches what is supposed to come from the test 
group mapping service ("testuser" --> "testqueue").
 # Modify "hadoop.security.group.mapping" in the config object created in step 
#1.
 # Call {{Groups.refresh()}} which changes the group mapping ("testuser" --> 
"testqueue2"). This requires that the test group mapping service implement 
{{GroupMappingServiceProvider.cacheGroupsRefresh()}}.
 # Create a new instance of {{CSMappingPlacementRule}}.
 # Run the placement evaluation again.
 # Expected: with the same user, the target queue has changed.

This looks convoluted, but these steps make sure that:
 # {{CSMappingPlacementRule}} will force the initialization of groups.
 # We select the correct configuration for group service init.
 # We don't create a new {{Groups}} instance if the singleton is initialized, 
so we cover the original problem described in YARN-10597.
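
A rough sketch of the setup for the first three steps above ({{TestGroupMappingServiceProvider}} and {{Groups.reset()}} are hypothetical/proposed, not existing APIs; the mocking assumes Mockito):
{code:java}
// Step 1: configuration whose group mapping points at a test implementation
// (TestGroupMappingServiceProvider is a hypothetical test class).
Configuration conf = new Configuration();
conf.setClass("hadoop.security.group.mapping",
    TestGroupMappingServiceProvider.class, GroupMappingServiceProvider.class);

// Step 2: proposed new method that nulls out the Groups singleton.
Groups.reset();

// Step 3: mock CapacityScheduler where getConf() (the service config) and
// getConfiguration() (the CS config) carry different group mapping settings.
CapacityScheduler scheduler = mock(CapacityScheduler.class);
when(scheduler.getConf()).thenReturn(conf);
when(scheduler.getConfiguration()).thenReturn(new CapacitySchedulerConfiguration());
{code}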






[jira] [Assigned] (YARN-10936) Fix typo in LogAggregationFileController

2021-09-16 Thread Jira


 [ 
https://issues.apache.org/jira/browse/YARN-10936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tibor Kovács reassigned YARN-10936:
---

Assignee: Tibor Kovács  (was: Tamas Domok)

> Fix typo in LogAggregationFileController
> 
>
> Key: YARN-10936
> URL: https://issues.apache.org/jira/browse/YARN-10936
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.1
>Reporter: Tamas Domok
>Assignee: Tibor Kovács
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> -LOG.warn("Failed to check if FileSystem suppports permissions on "
> +LOG.warn("Failed to check if FileSystem supports permissions on "{code}






[jira] [Updated] (YARN-10936) Fix typo in LogAggregationFileController

2021-09-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated YARN-10936:
--
Labels: pull-request-available  (was: )

> Fix typo in LogAggregationFileController
> 
>
> Key: YARN-10936
> URL: https://issues.apache.org/jira/browse/YARN-10936
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.1
>Reporter: Tamas Domok
>Assignee: Tamas Domok
>Priority: Trivial
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> -LOG.warn("Failed to check if FileSystem suppports permissions on "
> +LOG.warn("Failed to check if FileSystem supports permissions on "{code}






[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

2021-09-16 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10955:

Description: 
RM is the most complex component in YARN, with many basic or core services 
including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
state managers, etc., and some of them depend on other basic components such as 
ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to hunt for suspicious 
traces across many related metrics and a tremendous volume of logs, hoping to 
locate the root cause of the problem. For example, some applications keep staying 
in the NEW_SAVING state, which can be caused by loss of ZooKeeper connections or a 
jam in the event dispatcher, and the useful traces are buried among many metrics 
and logs. It's not easy to figure out what happened, even for experts, let alone 
common users.

So I propose to add a common health check mechanism to improve troubleshooting 
for RM. My general thought is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
workState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and 
keyMetrics(Map); a minimal illustrative sketch follows after this list.

 * make some key services implement the HealthReporter interface and generate a 
health report by evaluating their internal state.
 * add a HealthCheckerService which can manage and monitor all reportable 
services, support checking and fetching health reports periodically and 
on demand (e.g. triggered by a REST API), and publish metrics and logs as well.
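
For illustration, a minimal sketch of the HealthReport shape described above (field names and value types, e.g. the Map value type, are assumptions, not final):
{code:java}
import java.util.Map;

public class HealthReport {
  public enum WorkState { NORMAL, IDLE, BUSY }

  private final boolean healthy;              // isHealthy
  private final WorkState workState;          // NORMAL / IDLE / BUSY
  private final long updateTime;              // when this report was generated
  private final String diagnostics;           // human-readable details
  private final Map<String, Long> keyMetrics; // key metrics of the reporting service

  public HealthReport(boolean healthy, WorkState workState, long updateTime,
      String diagnostics, Map<String, Long> keyMetrics) {
    this.healthy = healthy;
    this.workState = workState;
    this.updateTime = updateTime;
    this.diagnostics = diagnostics;
    this.keyMetrics = keyMetrics;
  }

  public boolean isHealthy() { return healthy; }
  public WorkState getWorkState() { return workState; }
  public long getUpdateTime() { return updateTime; }
  public String getDiagnostics() { return diagnostics; }
  public Map<String, Long> getKeyMetrics() { return keyMetrics; }
}
{code}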

  was:
RM is the most complex component in YARN, with many basic or core services 
including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
state managers, etc., and some of them depend on other basic components such as 
ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to hunt for suspicious 
traces across many related metrics and a tremendous volume of logs, hoping to 
locate the root cause of the problem. For example, some applications keep staying 
in the NEW_SAVING state, which can be caused by loss of ZooKeeper connections or a 
jam in the event dispatcher, and the useful traces are buried among many metrics 
and logs. It's not easy to figure out what happened, even for experts, let alone 
common users.

So I propose to add a common health check mechanism to improve troubleshooting 
for RM. My general thought is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
processState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and 
keyMetrics(Map).

 * make some key services implement the HealthReporter interface and generate a 
health report by evaluating their internal state.
 * add a HealthCheckerService which can manage and monitor all reportable 
services, support checking and fetching health reports periodically and 
on demand (e.g. triggered by a REST API), and publish metrics and logs as well.


> Add health check mechanism to improve troubleshooting skills for RM
> ---
>
> Key: YARN-10955
> URL: https://issues.apache.org/jira/browse/YARN-10955
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> RM is the most complex component in YARN, with many basic or core services 
> including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
> state managers, etc., and some of them depend on other basic components such as 
> ZooKeeper and HDFS. 
> Currently, when we encounter an unclear issue, we may have to hunt for 
> suspicious traces across many related metrics and a tremendous volume of logs, 
> hoping to locate the root cause of the problem. For example, some applications 
> keep staying in the NEW_SAVING state, which can be caused by loss of ZooKeeper 
> connections or a jam in the event dispatcher, and the useful traces are buried 
> among many metrics and logs. It's not easy to figure out what happened, even for 
> experts, let alone common users.
> So I propose to add a common health check mechanism to improve 
> troubleshooting for RM. My general thought is that we can:
>  * add a HealthReporter interface as follows:
> {code:java}
> public interface HealthReporter {
>   HealthReport getHealthReport();
> }
> {code}
> HealthReport can have some generic fields like isHealthy(boolean), 
> workState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and 
> keyMetrics(Map).
>  * make some key services implement the HealthReporter interface and generate a 
> health report by evaluating their internal state.
>  * add 

[jira] [Updated] (YARN-10955) Add health check mechanism to improve troubleshooting skills for RM

2021-09-16 Thread Tao Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-10955:

Description: 
RM is the most complex component in YARN, with many basic or core services 
including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
state managers, etc., and some of them depend on other basic components such as 
ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to hunt for suspicious 
traces across many related metrics and a tremendous volume of logs, hoping to 
locate the root cause of the problem. For example, some applications keep staying 
in the NEW_SAVING state, which can be caused by loss of ZooKeeper connections or a 
jam in the event dispatcher, and the useful traces are buried among many metrics 
and logs. It's not easy to figure out what happened, even for experts, let alone 
common users.

So I propose to add a common health check mechanism to improve troubleshooting 
for RM. My general thought is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
processState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) and 
keyMetrics(Map).

 * make some key services implement the HealthReporter interface and generate a 
health report by evaluating their internal state.
 * add a HealthCheckerService which can manage and monitor all reportable 
services, support checking and fetching health reports periodically and 
on demand (e.g. triggered by a REST API), and publish metrics and logs as well.

  was:
RM is the most complex component in YARN, with many basic or core services 
including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
state managers, etc., and some of them depend on other basic components such as 
ZooKeeper and HDFS.

Currently, when we encounter an unclear issue, we may have to hunt for suspicious 
traces across many related metrics and a tremendous volume of logs, hoping to 
locate the root cause of the problem. For example, some applications keep staying 
in the NEW_SAVING state, which can be caused by loss of ZooKeeper connections or a 
jam in the event dispatcher, and the useful traces are buried among many metrics 
and logs. It's not easy to figure out what happened, even for experts, let alone 
common users.

So I propose to add a common health check mechanism to improve troubleshooting 
for RM. My general thought is that we can:
 * add a HealthReporter interface as follows:
{code:java}
public interface HealthReporter {
  HealthReport getHealthReport();
}
{code}
HealthReport can have some generic fields like isHealthy(boolean), 
updateTime(long), diagnostics(string) and keyMetrics(Map).

 * make some key services implement the HealthReporter interface and generate a 
health report by evaluating their internal state.
 * add a HealthCheckerService which can manage and monitor all reportable 
services, support checking and fetching health reports periodically and 
on demand (e.g. triggered by a REST API), and publish metrics and logs as well.


> Add health check mechanism to improve troubleshooting skills for RM
> ---
>
> Key: YARN-10955
> URL: https://issues.apache.org/jira/browse/YARN-10955
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
>
> RM is the most complex component in YARN, with many basic or core services 
> including RPC servers, event dispatchers, the HTTP server, the core scheduler, 
> state managers, etc., and some of them depend on other basic components such as 
> ZooKeeper and HDFS. 
> Currently, when we encounter an unclear issue, we may have to hunt for 
> suspicious traces across many related metrics and a tremendous volume of logs, 
> hoping to locate the root cause of the problem. For example, some applications 
> keep staying in the NEW_SAVING state, which can be caused by loss of ZooKeeper 
> connections or a jam in the event dispatcher, and the useful traces are buried 
> among many metrics and logs. It's not easy to figure out what happened, even for 
> experts, let alone common users.
> So I propose to add a common health check mechanism to improve 
> troubleshooting for RM. My general thought is that we can:
>  * add a HealthReporter interface as follows:
> {code:java}
> public interface HealthReporter {
>   HealthReport getHealthReport();
> }
> {code}
> HealthReport can have some generic fields like isHealthy(boolean), 
> processState(enum: NORMAL/IDLE/BUSY), updateTime(long), diagnostics(string) 
> and keyMetrics(Map).
>  * make some key services implement the HealthReporter interface and generate a 
> health report by evaluating their internal state.
>  * add HealthCheckerService which can manage