[jira] [Comment Edited] (YARN-9413) Queue resource leak after app fail for CapacityScheduler

2019-04-07 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812123#comment-16812123
 ] 

Weiwei Yang edited comment on YARN-9413 at 4/8/19 5:46 AM:
---

Thanks for confirming that. +1. Just committed to branch-3.0. Now this is fixed 
on all 3.x versions. Thanks [~Tao Yang] for the contribution.


was (Author: cheersyang):
Thanks for confirming that. +1. Committing now.

> Queue resource leak after app fail for CapacityScheduler
> 
>
> Key: YARN-9413
> URL: https://issues.apache.org/jira/browse/YARN-9413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9413.001.patch, YARN-9413.002.patch, 
> YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, 
> image-2019-03-29-10-47-47-953.png
>
>
> To reproduce this problem:
>  # Submit an app that is configured to keep containers across app attempts 
> and that should fail after the AM finishes the first time (am-max-attempts=1).
>  # The app is started with 2 containers running on the NM1 node.
>  # Fail the AM of the application with the PREEMPTED exit status, which should 
> not count towards the max-attempt retry, but the app will fail immediately.
>  # The used resource of this queue leaks after the app fails.
> The root cause is the inconsistency in handling app attempt failure between 
> RMAppAttemptImpl$BaseFinalTransition#transition and 
> RMAppImpl$AttemptFailedTransition#transition:
>  # After the app fails, an RMAppFailedAttemptEvent is sent in 
> RMAppAttemptImpl$BaseFinalTransition#transition. If the exit status of the AM 
> container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it does 
> not count towards the max-attempt retry, so an AppAttemptRemovedSchedulerEvent 
> is sent with keepContainersAcrossAppAttempts=true and an RMAppFailedAttemptEvent 
> with transferStateFromPreviousAttempt=true.
>  # RMAppImpl$AttemptFailedTransition#transition handles the 
> RMAppFailedAttemptEvent and fails the app because its max app attempts is 1.
>  # CapacityScheduler handles the AppAttemptRemovedSchedulerEvent in 
> CapacityScheduler#doneApplicationAttempt; it skips killing and completing the 
> containers belonging to this app, so the queue resource leak happens.
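
To make the inconsistency described above concrete, here is a minimal, self-contained 
Java sketch. It is a toy model with made-up class and field names, not the actual 
RM or scheduler code: the attempt-level decision keeps the containers, the app-level 
decision fails the app anyway, and the scheduler-side release is skipped.

{code}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in types; not the real YARN classes.
public class QueueLeakSketch {

  static class ToyQueue {
    long usedMB = 0;
    void allocate(long mb) { usedMB += mb; }
    void release(long mb) { usedMB -= mb; }
  }

  public static void main(String[] args) {
    ToyQueue queue = new ToyQueue();
    List<Long> runningContainersMB = new ArrayList<>();

    // App starts with an AM container plus one task container.
    for (long mb : new long[] {1024, 2048}) {
      runningContainersMB.add(mb);
      queue.allocate(mb);
    }

    // Attempt-level decision (cf. RMAppAttemptImpl$BaseFinalTransition):
    // a PREEMPTED AM exit does not count towards max attempts, so the
    // attempt asks the scheduler to keep its containers.
    boolean countsTowardsMaxAttempts = false;      // PREEMPTED
    boolean keepContainersAcrossAttempts = !countsTowardsMaxAttempts;

    // App-level decision (cf. RMAppImpl$AttemptFailedTransition):
    // with am-max-attempts=1 the app fails right away, so no new attempt
    // will ever take over the kept containers.
    int maxAppAttempts = 1;
    boolean appFails = maxAppAttempts == 1;

    // Scheduler side (cf. CapacityScheduler#doneApplicationAttempt):
    // containers are only released when they are NOT being kept.
    if (!keepContainersAcrossAttempts) {
      for (long mb : runningContainersMB) {
        queue.release(mb);
      }
      runningContainersMB.clear();
    }

    // The app is gone, but the queue still "uses" 3072 MB -> the leak.
    System.out.println("app failed=" + appFails
        + ", queue usedMB=" + queue.usedMB);
  }
}
{code}

Running it leaves usedMB at 3072 even though the app is gone, mirroring the leak 
reported in this issue.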






[jira] [Updated] (YARN-9413) Queue resource leak after app fail for CapacityScheduler

2019-04-07 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang updated YARN-9413:
--
Fix Version/s: 3.0.4

> Queue resource leak after app fail for CapacityScheduler
> 
>
> Key: YARN-9413
> URL: https://issues.apache.org/jira/browse/YARN-9413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9413.001.patch, YARN-9413.002.patch, 
> YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, 
> image-2019-03-29-10-47-47-953.png
>
>
> To reproduce this problem:
>  # Submit an app that is configured to keep containers across app attempts 
> and that should fail after the AM finishes the first time (am-max-attempts=1).
>  # The app is started with 2 containers running on the NM1 node.
>  # Fail the AM of the application with the PREEMPTED exit status, which should 
> not count towards the max-attempt retry, but the app will fail immediately.
>  # The used resource of this queue leaks after the app fails.
> The root cause is the inconsistency in handling app attempt failure between 
> RMAppAttemptImpl$BaseFinalTransition#transition and 
> RMAppImpl$AttemptFailedTransition#transition:
>  # After the app fails, an RMAppFailedAttemptEvent is sent in 
> RMAppAttemptImpl$BaseFinalTransition#transition. If the exit status of the AM 
> container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it does 
> not count towards the max-attempt retry, so an AppAttemptRemovedSchedulerEvent 
> is sent with keepContainersAcrossAppAttempts=true and an RMAppFailedAttemptEvent 
> with transferStateFromPreviousAttempt=true.
>  # RMAppImpl$AttemptFailedTransition#transition handles the 
> RMAppFailedAttemptEvent and fails the app because its max app attempts is 1.
>  # CapacityScheduler handles the AppAttemptRemovedSchedulerEvent in 
> CapacityScheduler#doneApplicationAttempt; it skips killing and completing the 
> containers belonging to this app, so the queue resource leak happens.






[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler

2019-04-07 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812123#comment-16812123
 ] 

Weiwei Yang commented on YARN-9413:
---

Thanks for confirming that. +1. Committing now.

> Queue resource leak after app fail for CapacityScheduler
> 
>
> Key: YARN-9413
> URL: https://issues.apache.org/jira/browse/YARN-9413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9413.001.patch, YARN-9413.002.patch, 
> YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, 
> image-2019-03-29-10-47-47-953.png
>
>
> To reproduce this problem:
>  # Submit an app that is configured to keep containers across app attempts 
> and that should fail after the AM finishes the first time (am-max-attempts=1).
>  # The app is started with 2 containers running on the NM1 node.
>  # Fail the AM of the application with the PREEMPTED exit status, which should 
> not count towards the max-attempt retry, but the app will fail immediately.
>  # The used resource of this queue leaks after the app fails.
> The root cause is the inconsistency in handling app attempt failure between 
> RMAppAttemptImpl$BaseFinalTransition#transition and 
> RMAppImpl$AttemptFailedTransition#transition:
>  # After the app fails, an RMAppFailedAttemptEvent is sent in 
> RMAppAttemptImpl$BaseFinalTransition#transition. If the exit status of the AM 
> container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it does 
> not count towards the max-attempt retry, so an AppAttemptRemovedSchedulerEvent 
> is sent with keepContainersAcrossAppAttempts=true and an RMAppFailedAttemptEvent 
> with transferStateFromPreviousAttempt=true.
>  # RMAppImpl$AttemptFailedTransition#transition handles the 
> RMAppFailedAttemptEvent and fails the app because its max app attempts is 1.
>  # CapacityScheduler handles the AppAttemptRemovedSchedulerEvent in 
> CapacityScheduler#doneApplicationAttempt; it skips killing and completing the 
> containers belonging to this app, so the queue resource leak happens.






[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-04-07 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812122#comment-16812122
 ] 

Hudson commented on YARN-9313:
--

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16360 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/16360/])
YARN-9313. Support asynchronized scheduling mode and multi-node lookup (wwei: 
rev fc05b0e70e9bb556d6bdc00fa8735e18a6f90bc9)
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesLogger.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesSchedulerActivities.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesSchedulerActivitiesWithMultiNodesEnabled.java
* (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java
* (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/TestActivitiesManager.java


> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9313.001.patch, YARN-9313.002.patch, 
> YARN-9313.003.patch, YARN-9313.004.patch, YARN-9313.005.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]






[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-04-07 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812118#comment-16812118
 ] 

Weiwei Yang commented on YARN-9313:
---

+1, committing now. Thanks [~Tao Yang] .

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch, YARN-9313.002.patch, 
> YARN-9313.003.patch, YARN-9313.004.patch, YARN-9313.005.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]






[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler

2019-04-07 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812110#comment-16812110
 ] 

Tao Yang commented on YARN-9413:


The checkstyle issue seems to be the same as above, and the UT failures are not 
related to this patch (I can reproduce them on branch-3.0 without this patch).

> Queue resource leak after app fail for CapacityScheduler
> 
>
> Key: YARN-9413
> URL: https://issues.apache.org/jira/browse/YARN-9413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9413.001.patch, YARN-9413.002.patch, 
> YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, 
> image-2019-03-29-10-47-47-953.png
>
>
> To reproduce this problem:
>  # Submit an app that is configured to keep containers across app attempts 
> and that should fail after the AM finishes the first time (am-max-attempts=1).
>  # The app is started with 2 containers running on the NM1 node.
>  # Fail the AM of the application with the PREEMPTED exit status, which should 
> not count towards the max-attempt retry, but the app will fail immediately.
>  # The used resource of this queue leaks after the app fails.
> The root cause is the inconsistency in handling app attempt failure between 
> RMAppAttemptImpl$BaseFinalTransition#transition and 
> RMAppImpl$AttemptFailedTransition#transition:
>  # After the app fails, an RMAppFailedAttemptEvent is sent in 
> RMAppAttemptImpl$BaseFinalTransition#transition. If the exit status of the AM 
> container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it does 
> not count towards the max-attempt retry, so an AppAttemptRemovedSchedulerEvent 
> is sent with keepContainersAcrossAppAttempts=true and an RMAppFailedAttemptEvent 
> with transferStateFromPreviousAttempt=true.
>  # RMAppImpl$AttemptFailedTransition#transition handles the 
> RMAppFailedAttemptEvent and fails the app because its max app attempts is 1.
>  # CapacityScheduler handles the AppAttemptRemovedSchedulerEvent in 
> CapacityScheduler#doneApplicationAttempt; it skips killing and completing the 
> containers belonging to this app, so the queue resource leak happens.






[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-04-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812101#comment-16812101
 ] 

Hadoop QA commented on YARN-9313:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 14s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
17s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 in trunk has 2 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 39s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 77m 
15s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}128m 31s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9313 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12965152/YARN-9313.005.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux ecd425ade2d0 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0d47d28 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| findbugs | 
https://builds.apache.org/job/PreCommit-YARN-Build/23909/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23909/testReport/ |
| Max. process+thread count | 904 (vs. ulimit of 1) |
| modules | C: 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 U: 

[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler

2019-04-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812086#comment-16812086
 ] 

Hadoop QA commented on YARN-9413:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-3.0 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
32s{color} | {color:green} branch-3.0 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
36s{color} | {color:green} branch-3.0 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
32s{color} | {color:green} branch-3.0 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} branch-3.0 passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 23s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
12s{color} | {color:green} branch-3.0 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} branch-3.0 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 32s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 1 new + 158 unchanged - 4 fixed = 159 total (was 162) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 38s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 30s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}120m 30s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
 |
|   | 
hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService 
|
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:e402791 |
| JIRA Issue | YARN-9413 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12965109/YARN-9413.branch-3.0.001.patch
 |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 21557175738a 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | branch-3.0 / f824f4d |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 

[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-04-07 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812056#comment-16812056
 ] 

Tao Yang commented on YARN-9313:


Attached the v5 patch to fix the remaining checkstyle errors; the UT and findbugs 
failures seem unrelated to this patch.

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch, YARN-9313.002.patch, 
> YARN-9313.003.patch, YARN-9313.004.patch, YARN-9313.005.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]






[jira] [Commented] (YARN-9455) SchedulerInvalidResoureRequestException has a typo in its class (and file) name

2019-04-07 Thread Anh (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812054#comment-16812054
 ] 

Anh commented on YARN-9455:
---

Hi [~snemeth], can I take this item?

> SchedulerInvalidResoureRequestException has a typo in its class (and file) 
> name
> ---
>
> Key: YARN-9455
> URL: https://issues.apache.org/jira/browse/YARN-9455
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Priority: Major
>  Labels: newbie
>
> The class name should be: SchedulerInvalidResourceRequestException






[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-04-07 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9313:
---
Attachment: YARN-9313.005.patch

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch, YARN-9313.002.patch, 
> YARN-9313.003.patch, YARN-9313.004.patch, YARN-9313.005.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]






[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler

2019-04-07 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812019#comment-16812019
 ] 

Weiwei Yang commented on YARN-9413:
---

Thanks [~Tao Yang], reopening to trigger the Jenkins job on branch-3.0.

> Queue resource leak after app fail for CapacityScheduler
> 
>
> Key: YARN-9413
> URL: https://issues.apache.org/jira/browse/YARN-9413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9413.001.patch, YARN-9413.002.patch, 
> YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, 
> image-2019-03-29-10-47-47-953.png
>
>
> To reproduce this problem:
>  # Submit an app that is configured to keep containers across app attempts 
> and that should fail after the AM finishes the first time (am-max-attempts=1).
>  # The app is started with 2 containers running on the NM1 node.
>  # Fail the AM of the application with the PREEMPTED exit status, which should 
> not count towards the max-attempt retry, but the app will fail immediately.
>  # The used resource of this queue leaks after the app fails.
> The root cause is the inconsistency in handling app attempt failure between 
> RMAppAttemptImpl$BaseFinalTransition#transition and 
> RMAppImpl$AttemptFailedTransition#transition:
>  # After the app fails, an RMAppFailedAttemptEvent is sent in 
> RMAppAttemptImpl$BaseFinalTransition#transition. If the exit status of the AM 
> container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it does 
> not count towards the max-attempt retry, so an AppAttemptRemovedSchedulerEvent 
> is sent with keepContainersAcrossAppAttempts=true and an RMAppFailedAttemptEvent 
> with transferStateFromPreviousAttempt=true.
>  # RMAppImpl$AttemptFailedTransition#transition handles the 
> RMAppFailedAttemptEvent and fails the app because its max app attempts is 1.
>  # CapacityScheduler handles the AppAttemptRemovedSchedulerEvent in 
> CapacityScheduler#doneApplicationAttempt; it skips killing and completing the 
> containers belonging to this app, so the queue resource leak happens.






[jira] [Reopened] (YARN-9413) Queue resource leak after app fail for CapacityScheduler

2019-04-07 Thread Weiwei Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiwei Yang reopened YARN-9413:
---

> Queue resource leak after app fail for CapacityScheduler
> 
>
> Key: YARN-9413
> URL: https://issues.apache.org/jira/browse/YARN-9413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9413.001.patch, YARN-9413.002.patch, 
> YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, 
> image-2019-03-29-10-47-47-953.png
>
>
> To reproduce this problem:
>  # Submit an app that is configured to keep containers across app attempts 
> and that should fail after the AM finishes the first time (am-max-attempts=1).
>  # The app is started with 2 containers running on the NM1 node.
>  # Fail the AM of the application with the PREEMPTED exit status, which should 
> not count towards the max-attempt retry, but the app will fail immediately.
>  # The used resource of this queue leaks after the app fails.
> The root cause is the inconsistency in handling app attempt failure between 
> RMAppAttemptImpl$BaseFinalTransition#transition and 
> RMAppImpl$AttemptFailedTransition#transition:
>  # After the app fails, an RMAppFailedAttemptEvent is sent in 
> RMAppAttemptImpl$BaseFinalTransition#transition. If the exit status of the AM 
> container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it does 
> not count towards the max-attempt retry, so an AppAttemptRemovedSchedulerEvent 
> is sent with keepContainersAcrossAppAttempts=true and an RMAppFailedAttemptEvent 
> with transferStateFromPreviousAttempt=true.
>  # RMAppImpl$AttemptFailedTransition#transition handles the 
> RMAppFailedAttemptEvent and fails the app because its max app attempts is 1.
>  # CapacityScheduler handles the AppAttemptRemovedSchedulerEvent in 
> CapacityScheduler#doneApplicationAttempt; it skips killing and completing the 
> containers belonging to this app, so the queue resource leak happens.






[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-04-07 Thread Weiwei Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812018#comment-16812018
 ] 

Weiwei Yang commented on YARN-9313:
---

Thanks [~Tao Yang] for the update, the patch looks good. Can you please fix the 
remaining 2 checkstyle issues?

Looks like we are hitting some flaky UTs again; they should not be related to this 
patch.

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch, YARN-9313.002.patch, 
> YARN-9313.003.patch, YARN-9313.004.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]






[jira] [Commented] (YARN-9445) yarn.admin.acl is futile

2019-04-07 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811958#comment-16811958
 ] 

Eric Yang commented on YARN-9445:
-

[~sunilg] [~bibinchundatt] Security should be designed to be permissive from the 
admin's point of view instead of mutually exclusive.  Security may appear mutually 
exclusive (allowed or disallowed) from the user's point of view.  However, a proper 
security design should be permissive from the admin's point of view.  The admin 
must have the ability to perform the same operation if the user is not available to 
carry it out.

{quote}a) yarn.admin.acls=yarn. and for e, 
.queueA.acl_submit_applications=john. Now user "john" can submit app to 
queueA. "yarn" user should not be able to submit.{quote}

I do not believe disallowing the system admin from submitting jobs improves security 
in the above statement.  It only creates inconvenience for impersonation, in that the 
YARN service user credential cannot submit a job on behalf of the user.  The admin can 
always run "sudo" to submit the job for the user.  Hence, this artificially designed 
mutually exclusive constraint is a no-op security feature.  Some improvement in this 
area would make the system easier to operate and avoid the paradox that prevents the 
admin from fixing a user's problem.
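
As a small illustration of why the two checks are independent, the sketch below uses 
Hadoop's generic AccessControlList helper with a made-up queue-level ACL. This is not 
the scheduler's actual ACL code path, just a toy comparison of the two lists being 
discussed here.

{code}
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;

// Illustrative only: the per-queue ACL below is a stand-in for what the
// scheduler reads from its queue configuration, not real YARN code.
public class AdminAclSketch {
  public static void main(String[] args) {
    // yarn.admin.acl = "*"  -> everyone is a YARN admin.
    AccessControlList adminAcl = new AccessControlList("*");

    // Hypothetical queueA submit ACL that only lists user "john".
    AccessControlList queueSubmitAcl = new AccessControlList("john");

    UserGroupInformation yarnUser =
        UserGroupInformation.createRemoteUser("yarn");

    // The two checks are evaluated independently: being an admin does not
    // imply passing the queue-level submit check, which is the behaviour
    // being debated in this thread.
    System.out.println("yarn is admin:   " + adminAcl.isUserAllowed(yarnUser));
    System.out.println("yarn may submit: " + queueSubmitAcl.isUserAllowed(yarnUser));
  }
}
{code}

With yarn.admin.acl set to "*" the first check always passes, while the second 
depends only on the queue configuration.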

> yarn.admin.acl is futile
> 
>
> Key: YARN-9445
> URL: https://issues.apache.org/jira/browse/YARN-9445
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.3.0
>Reporter: Peter Simon
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-9445.001.patch
>
>
> * Define a queue with restrictive administerApps settings (e.g. yarn)
>  * Set yarn.admin.acl to "*".
>  * Try to submit an application with user yarn, it is denied.
> This way my expected behaviour would be that while everyone is admin, I can 
> submit to whatever pool.
>  






[jira] [Comment Edited] (YARN-9445) yarn.admin.acl is futile

2019-04-07 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811950#comment-16811950
 ] 

Szilard Nemeth edited comment on YARN-9445 at 4/7/19 7:41 PM:
--

[~sunilg], [~bibinchundatt]: 

I'm confused. Reading the 3.2.0 docs 
([https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists]
 for FS/ACLs) says: 

"Queue Access Control Lists (ACLs) allow administrators to control who may take 
actions on particular queues. They are configured with the aclSubmitApps and 
aclAdministerApps properties, which can be set per queue. Currently the only 
supported administrative action is killing an application. An administrator may 
also submit applications to it." 

In this sense, aclAdministerApps not only gives permissions to execute admin 
operations but also gives submission permissions to queues. 

For me, not giving an administrator rights to everything seems controversial, 
so the documentation is more logical. All in all, if we go with the direction 
that admins don't get submission rights then we should also make sure the 
documentation is in line with the decision. 

I do agree with [~eyang] about restricting the default admin ACL to something 
else than '*' but this requires a follow-up jira, I think.


was (Author: snemeth):
[~sunilg], [~bibinchundatt]: 

I'm confused. Reading the 3.2.0 docs 
([https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists]
 for FS/ACLs) says: 

"Queue Access Control Lists (ACLs) allow administrators to control who may take 
actions on particular queues. They are configured with the aclSubmitApps and 
aclAdministerApps properties, which can be set per queue. Currently the only 
supported administrative action is killing an application. An administrator may 
also submit applications to it." 

In this sense, aclAdministerApps not only gives permissions to execute admin 
operations but also gives submission permissions to queues. 

For me, not giving an administrator rights to everything seems controversial, 
so the documentation is more logical. All in all, if we go with the direction 
that admins son't get submiasion rights then we should alao make sure the 
documentation is in line with the decision. 

I do agree with [~eyang] about restricting the default admin ACL to aomething 
else than '*' but this requires a follow-up jira, I think.

> yarn.admin.acl is futile
> 
>
> Key: YARN-9445
> URL: https://issues.apache.org/jira/browse/YARN-9445
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.3.0
>Reporter: Peter Simon
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-9445.001.patch
>
>
> * Define a queue with restrictive administerApps settings (e.g. yarn)
>  * Set yarn.admin.acl to "*".
>  * Try to submit an application with user yarn, it is denied.
> This way my expected behaviour would be that while everyone is admin, I can 
> submit to whatever pool.
>  






[jira] [Comment Edited] (YARN-9445) yarn.admin.acl is futile

2019-04-07 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811950#comment-16811950
 ] 

Szilard Nemeth edited comment on YARN-9445 at 4/7/19 7:39 PM:
--

[~sunilg], [~bibinchundatt]: 

I'm confused. Reading the 3.2.0 docs 
([https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists]
 for FS/ACLs) says: 

"Queue Access Control Lists (ACLs) allow administrators to control who may take 
actions on particular queues. They are configured with the aclSubmitApps and 
aclAdministerApps properties, which can be set per queue. Currently the only 
supported administrative action is killing an application. An administrator may 
also submit applications to it." 

In this sense, aclAdministerApps not only gives permissions to execute admin 
operations but also gives submission permissions to queues. 

For me, not giving an administrator rights to everything seems controversial, 
so the documentation is more logical. All in all, if we go with the direction 
that admins don't get submission rights then we should also make sure the 
documentation is in line with the decision. 

I do agree with [~eyang] about restricting the default admin ACL to something 
else than '*' but this requires a follow-up jira, I think.


was (Author: snemeth):
[~sunilg], [~bibinchundatt]: 

I'm confused. Reading the 3.2.0 docs 
(https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists
 for FS/ACLs) says: 

"Queue Access Control Lists (ACLs) allow administrators to control who may take 
actions on particular queues. They are configured with the aclSubmitApps and 
aclAdministerApps properties, which can be set per queue. Currently the only 
supported administrative action is killing an application. An administrator may 
also submit applications to it." 

In this sense, aclAdministerApps not only gives permissions to execute admin 
operations but also gives submiasion permissions to queues. 

For me, not giving an administrator rights to everything seems controversial, 
so the documentation is more logical. All in all, if we go with the direction 
that admins son't get submiasion rights then we should alao make sure the 
documentation is in line with the decision. 

I do agree with [~eyang] about restricting the default admin ACL to aomething 
else than '*' but this requires a follow-up jira, I think.

> yarn.admin.acl is futile
> 
>
> Key: YARN-9445
> URL: https://issues.apache.org/jira/browse/YARN-9445
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.3.0
>Reporter: Peter Simon
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-9445.001.patch
>
>
> * Define a queue with restrictive administerApps settings (e.g. yarn)
>  * Set yarn.admin.acl to "*".
>  * Try to submit an application with user yarn, it is denied.
> This way my expected behaviour would be that while everyone is admin, I can 
> submit to whatever pool.
>  






[jira] [Commented] (YARN-9445) yarn.admin.acl is futile

2019-04-07 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811950#comment-16811950
 ] 

Szilard Nemeth commented on YARN-9445:
--

[~sunilg], [~bibinchundatt]: 

I'm confused. Reading the 3.2.0 docs 
(https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists
 for FS/ACLs) says: 

"Queue Access Control Lists (ACLs) allow administrators to control who may take 
actions on particular queues. They are configured with the aclSubmitApps and 
aclAdministerApps properties, which can be set per queue. Currently the only 
supported administrative action is killing an application. An administrator may 
also submit applications to it." 

In this sense, aclAdministerApps not only gives permissions to execute admin 
operations but also gives submission permissions to queues. 

For me, not giving an administrator rights to everything seems controversial, 
so the documentation is more logical. All in all, if we go with the direction 
that admins don't get submission rights then we should also make sure the 
documentation is in line with the decision. 

I do agree with [~eyang] about restricting the default admin ACL to something 
else than '*' but this requires a follow-up jira, I think.

> yarn.admin.acl is futile
> 
>
> Key: YARN-9445
> URL: https://issues.apache.org/jira/browse/YARN-9445
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.3.0
>Reporter: Peter Simon
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-9445.001.patch
>
>
> * Define a queue with restrictive administerApps settings (e.g. yarn)
>  * Set yarn.admin.acl to "*".
>  * Try to submit an application with user yarn, it is denied.
> This way my expected behaviour would be that while everyone is admin, I can 
> submit to whatever pool.
>  






[jira] [Commented] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2019-04-07 Thread Prabhu Joseph (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811866#comment-16811866
 ] 

Prabhu Joseph commented on YARN-6929:
-

[~eyang] I have changed the app log dir structure to the format below. 

{code}
{aggregation_log_root} / {user} / bucket_{suffix} / {cluster_timestamp} / 
{bucket1} / {bucket2} / {appId}

where aggregation_log_root is yarn.nodemanager.remote-app-log-dir
  suffix is yarn.nodemanager.remote-app-log-dir-suffix (logs) 
  cluster_timestamp is application_timestamp
  bucket1 is application#getId % 1
  bucket2 is application_timestamp % 1
  
{code}
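
For illustration only, here is a rough Java sketch of how such a bucketed path could 
be derived. The helper name is hypothetical and the bucket divisor is an assumption, 
since the exact modulus is not readable in the quoted layout above; this is not the 
patch's actual code.

{code}
import org.apache.hadoop.fs.Path;

// Hypothetical helper, for illustration only; not the patch's actual code.
public class BucketedLogDirSketch {

  // Assumed bucket size; the real divisor is not readable from the
  // quoted layout above.
  private static final int BUCKET_SIZE = 10000;

  static Path appLogDir(Path remoteRootLogDir, String user, String suffix,
      long clusterTimestamp, int appId) {
    // Two bucket levels keep any single directory well under the
    // HDFS max-directory-items limit.
    String bucket1 = String.format("%04d", appId % BUCKET_SIZE);
    String bucket2 = String.format("%04d", clusterTimestamp % BUCKET_SIZE);
    return new Path(remoteRootLogDir,
        user + "/bucket_" + suffix + "/" + clusterTimestamp
            + "/" + bucket1 + "/" + bucket2
            + "/application_" + clusterTimestamp + "_"
            + String.format("%04d", appId));
  }

  public static void main(String[] args) {
    // Reproduces the sample listing further below for
    // application_1554476304275_0007:
    // /app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275/application_1554476304275_0007
    System.out.println(appLogDir(new Path("/app-logs"), "ambari-qa", "logs",
        1554476304275L, 7));
  }
}
{code}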

*The patch makes the changes below:*

1. {{LogAggregationFileController}} changed to create the new app log dir structure.
2. {{AggregatedLogDeletionService}} changed to remove older bucket / app dirs 
as per retention.
3. {{LogAggregationFileControllerFactory}} and 
{{LogAggregationIndexedFileController}} changed to include both the old and new app 
log dir structures.
4. New config {{yarn.nodemanager.remote-app-log-dir-include-older}} (default 
true) introduced to also include older app log dirs while accessing the YARN 
logs. This can be set to false later if the user does not want / have the older 
log dir structure. 

*Functional Testing Done:*
{code}
1. Check that new application logs get written into the correct app log dir 
structure.
2. Yarn Logs CLI.
3. Accessing logs from the RM UI / HistoryServer UI works fine while the job is 
running / complete.
4. Accessing older logs.
{code}


*App Log Dir Structure for sample job:*

{code}
[hdfs@yarn-ats-2 yarn]$ hadoop fs -ls /app-logs/ambari-qa/
Found 2 items
drwxrwx---   - ambari-qa hadoop  0 2019-04-07 12:26 
/app-logs/ambari-qa/bucket_logs
drwxrwx---   - ambari-qa hadoop  0 2019-04-05 15:01 
/app-logs/ambari-qa/logs
[hdfs@yarn-ats-2 yarn]$ 
[hdfs@yarn-ats-2 yarn]$ hadoop fs -ls /app-logs/ambari-qa/bucket_logs
Found 1 items
drwxrwx---   - ambari-qa hadoop  0 2019-04-07 12:30 
/app-logs/ambari-qa/bucket_logs/1554476304275
[hdfs@yarn-ats-2 yarn]$ hadoop fs -ls 
/app-logs/ambari-qa/bucket_logs/1554476304275
Found 4 items
drwxrwx---   - ambari-qa hadoop  0 2019-04-07 12:26 
/app-logs/ambari-qa/bucket_logs/1554476304275/0004
drwxrwx---   - ambari-qa hadoop  0 2019-04-07 12:29 
/app-logs/ambari-qa/bucket_logs/1554476304275/0005
drwxrwx---   - ambari-qa hadoop  0 2019-04-07 12:29 
/app-logs/ambari-qa/bucket_logs/1554476304275/0006
drwxrwx---   - ambari-qa hadoop  0 2019-04-07 12:30 
/app-logs/ambari-qa/bucket_logs/1554476304275/0007
[hdfs@yarn-ats-2 yarn]$ 
[hdfs@yarn-ats-2 yarn]$ hadoop fs -ls 
/app-logs/ambari-qa/bucket_logs/1554476304275/0007
Found 1 items
drwxrwx---   - ambari-qa hadoop  0 2019-04-07 12:30 
/app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275
[hdfs@yarn-ats-2 yarn]$ hadoop fs -ls 
/app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275
Found 1 items
drwxrwx---   - ambari-qa hadoop  0 2019-04-07 12:31 
/app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275/application_1554476304275_0007
[hdfs@yarn-ats-2 yarn]$ 
[hdfs@yarn-ats-2 yarn]$ hadoop fs -ls 
/app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275/application_1554476304275_0007
Found 2 items
-rw-r-   3 ambari-qa hadoop  94103 2019-04-07 12:31 
/app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275/application_1554476304275_0007/yarn-ats-2_45454
-rw-r-   3 ambari-qa hadoop  80434 2019-04-07 12:31 
/app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275/application_1554476304275_0007/yarn-ats-3_45454

{code}

*App Log Dir Structure after deletion:*

{code}
[hdfs@yarn-ats-2 yarn]$ hadoop fs -ls /app-logs/ambari-qa/bucket_logs
[hdfs@yarn-ats-2 yarn]$ 
{code}


> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-6929-007.patch, YARN-6929.1.patch, 
> YARN-6929.2.patch, YARN-6929.2.patch, YARN-6929.3.patch, YARN-6929.4.patch, 
> YARN-6929.5.patch, YARN-6929.6.patch, YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. The maximum subdirectory limit by default is 1048576 (HDFS-6102). 
> With a retention (yarn.log-aggregation.retain-seconds) of 7 days, there is a 
> higher chance that LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> {remote-app-log-dir}/{user}/logs/{applicationId}. This can be 
> improved by adding the date as a subdirectory, like 
> {remote-app-log-dir}/{user}/logs/{date}/{applicationId} 
> {code}
> WARN 
> 

[jira] [Updated] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable

2019-04-07 Thread Prabhu Joseph (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prabhu Joseph updated YARN-6929:

Attachment: YARN-6929-007.patch

> yarn.nodemanager.remote-app-log-dir structure is not scalable
> -
>
> Key: YARN-6929
> URL: https://issues.apache.org/jira/browse/YARN-6929
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: YARN-6929-007.patch, YARN-6929.1.patch, 
> YARN-6929.2.patch, YARN-6929.2.patch, YARN-6929.3.patch, YARN-6929.4.patch, 
> YARN-6929.5.patch, YARN-6929.6.patch, YARN-6929.patch
>
>
> The current directory structure for yarn.nodemanager.remote-app-log-dir is 
> not scalable. The maximum subdirectory limit by default is 1048576 (HDFS-6102). 
> With a retention (yarn.log-aggregation.retain-seconds) of 7 days, there is a 
> higher chance that LogAggregationService fails to create a new directory with 
> FSLimitException$MaxDirectoryItemsExceededException.
> The current structure is 
> {remote-app-log-dir}/{user}/logs/{applicationId}. This can be 
> improved by adding the date as a subdirectory, like 
> {remote-app-log-dir}/{user}/logs/{date}/{applicationId} 
> {code}
> WARN 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService:
>  Application failed to init aggregation 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813)
>  
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600)
>  
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>  
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>  
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) 
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) 
> at java.security.AccessController.doPrivileged(Native Method) 
> at javax.security.auth.Subject.doAs(Subject.java:415) 
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>  
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443)
>  
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>  
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) 
> at java.lang.Thread.run(Thread.java:745) 
> Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException):
>  The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 
> items=1048576 
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021)
>  
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072)
>  
> at 
> 

[jira] [Assigned] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run

2019-04-07 Thread Wanqiang Ji (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wanqiang Ji reassigned YARN-9453:
-

Assignee: Wanqiang Ji

> Clean up code long if-else chain in ApplicationCLI#run
> --
>
> Key: YARN-9453
> URL: https://issues.apache.org/jira/browse/YARN-9453
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Wanqiang Ji
>Priority: Major
>  Labels: newbie
>
> org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long, 
> contains a long if-else chain, and has many, many conditions. 
> As a start, the bodies of the conditions could be extracted into methods, and a 
> cleaner solution could be introduced to parse the argument values.
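
One possible direction, shown as an illustrative sketch with hypothetical handler 
names rather than the real ApplicationCLI code, is to map each sub-command to a 
small handler method instead of growing the if-else chain:

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch with hypothetical handlers; not the real ApplicationCLI.
public class CommandDispatchSketch {

  private final Map<String, Function<String[], Integer>> handlers = new HashMap<>();

  public CommandDispatchSketch() {
    // Each former if-else branch becomes one small method.
    handlers.put("-list", this::listApplications);
    handlers.put("-kill", this::killApplication);
    handlers.put("-status", this::printApplicationStatus);
  }

  public int run(String[] args) {
    if (args.length == 0 || !handlers.containsKey(args[0])) {
      System.err.println("Unknown or missing sub-command");
      return -1;
    }
    return handlers.get(args[0]).apply(args);
  }

  private int listApplications(String[] args) { /* ... */ return 0; }
  private int killApplication(String[] args) { /* ... */ return 0; }
  private int printApplicationStatus(String[] args) { /* ... */ return 0; }

  public static void main(String[] args) {
    System.exit(new CommandDispatchSketch().run(args));
  }
}
{code}

Each former branch body then becomes an independently testable method, and argument 
parsing can be handled per command.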






[jira] [Commented] (YARN-7721) TestContinuousScheduling fails sporadically with NPE

2019-04-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811816#comment-16811816
 ] 

Hadoop QA commented on YARN-7721:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
17s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 41s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
18s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 in trunk has 2 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
39s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
31s{color} | {color:green} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 0 new + 2 unchanged - 1 fixed = 2 total (was 3) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 55s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 77m 
23s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch 
passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}128m 11s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-7721 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12945331/YARN-7721.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 77effa19d583 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ec143cb |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| findbugs | 
https://builds.apache.org/job/PreCommit-YARN-Build/23907/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-YARN-Build/23907/testReport/ |
| Max. process+thread count | 891 (vs. ulimit of 

[jira] [Commented] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run

2019-04-07 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811800#comment-16811800
 ] 

Szilard Nemeth commented on YARN-9453:
--

[~jiwq]: Sure, please take it!

> Clean up code long if-else chain in ApplicationCLI#run
> --
>
> Key: YARN-9453
> URL: https://issues.apache.org/jira/browse/YARN-9453
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Priority: Major
>  Labels: newbie
>
> org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long and 
> contains a long if-else chain with many conditions. 
> As a start, the bodies of the conditions could be extracted to methods, and a 
> cleaner solution could be introduced to parse the argument values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run

2019-04-07 Thread Wanqiang Ji (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811794#comment-16811794
 ] 

Wanqiang Ji commented on YARN-9453:
---

Hi [~snemeth], I can work on this if you don't mind.

> Clean up code long if-else chain in ApplicationCLI#run
> --
>
> Key: YARN-9453
> URL: https://issues.apache.org/jira/browse/YARN-9453
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Priority: Major
>  Labels: newbie
>
> org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long and 
> contains a long if-else chain with many conditions. 
> As a start, the bodies of the conditions could be extracted to methods, and a 
> cleaner solution could be introduced to parse the argument values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9457) Integrate custom resource metrics better for FairScheduler

2019-04-07 Thread Szilard Nemeth (JIRA)
Szilard Nemeth created YARN-9457:


 Summary: Integrate custom resource metrics better for FairScheduler
 Key: YARN-9457
 URL: https://issues.apache.org/jira/browse/YARN-9457
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Szilard Nemeth


YARN-8842 added 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetricsForCustomResources.
This class stores all metrics data for custom resource types, and QueueMetrics 
holds a field with an object of this class.

Similarly, YARN-9322 added FSQueueMetricsForCustomResources and added an object 
of this class to FSQueueMetrics.


This jira is about investigating how to integrate 
QueueMetricsForCustomResources into QueueMetrics and 
FSQueueMetricsForCustomResources into FSQueueMetrics. 
The tricky part is that the Metric annotation 
(org.apache.hadoop.metrics2.annotation.Metric) is used to expose values on JMX.

We need to implement a mechanism where the QueueMetrics / FSQueueMetrics classes 
themselves contain a field holding the custom resource values as a map with 
resource names as keys and longs as values.
This way, we would not need the new classes (QueueMetricsForCustomResources and 
FSQueueMetricsForCustomResources), and the code could be much cleaner and more 
consistent.

The hardest part is probably finding a way to expose metrics values from a map. 
We obviously can't use the Metric annotation, so another mechanism is required 
to expose the values on JMX.
From a quick search, I haven't found any existing way to do this in the code.
[~wilfreds]: Are you aware of any way to expose values like this?
Most probably, we need to check how the Metric annotation is processed, 
understand the whole flow and find out what the underlying mechanism of the 
metrics propagation to the JMX interface is.
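One possible direction, as a hedged sketch only (it assumes hadoop-common's 
metrics2 library and that dynamically registered gauges are acceptable; the 
class, method and metric names below are illustrative): mirror the custom 
resource map into MutableGaugeLong instances created on a MetricsRegistry, 
which already publishes its metrics through the metrics system and JMX without 
the @Metric annotation.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableGaugeLong;

// Sketch of one possible approach (not a decided design): keep a map of
// dynamically registered gauges keyed by custom resource name, so the values
// are exposed like the annotated fields without a dedicated helper class.
public class CustomResourceGauges {

  private final MetricsRegistry registry;
  private final Map<String, MutableGaugeLong> gauges = new ConcurrentHashMap<>();

  public CustomResourceGauges(MetricsRegistry registry) {
    this.registry = registry;
  }

  /** Set the current value for one custom resource, e.g. "gpu". */
  public void set(String resourceName, long value) {
    gauges.computeIfAbsent(resourceName,
        name -> registry.newGauge("AllocatedCustomResource_" + name,
            "Allocated amount of custom resource " + name, 0L))
        .set(value);
  }

  /** Update all custom resources from a name -> value map in one call. */
  public void setAll(Map<String, Long> values) {
    values.forEach(this::set);
  }
}
{code}

In QueueMetrics / FSQueueMetrics this would presumably hang off the registry 
those classes already use for their annotated metrics, rather than a new one.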




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9456) Class ResourceMappings uses a List of Serializables instead of more specific types

2019-04-07 Thread Szilard Nemeth (JIRA)
Szilard Nemeth created YARN-9456:


 Summary: Class ResourceMappings uses a List of Serializables 
instead of more specific types
 Key: YARN-9456
 URL: https://issues.apache.org/jira/browse/YARN-9456
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Szilard Nemeth


A List of Serializable is used everywhere across ResourceMappings. 
This class should receive a Class object and cast the list to a more specific 
type where possible.
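A hypothetical sketch of that proposed API (the class below is standalone and 
its names are not the real ResourceMappings code): the caller passes the 
expected element type and gets back a typed list instead of a raw list of 
Serializable.

{code:java}
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: a typed accessor over an internal List<Serializable>.
public class TypedResourceMappings {

  private final List<Serializable> assignedResources = new ArrayList<>();

  public void addAll(List<? extends Serializable> resources) {
    assignedResources.addAll(resources);
  }

  /**
   * Return the assigned resources as a list of the requested type, failing
   * fast with a clear error if an element has a different runtime type.
   */
  public <T extends Serializable> List<T> getAssignedResources(Class<T> type) {
    List<T> result = new ArrayList<>(assignedResources.size());
    for (Serializable resource : assignedResources) {
      if (!type.isInstance(resource)) {
        throw new IllegalStateException("Expected " + type.getName()
            + " but found " + resource.getClass().getName());
      }
      result.add(type.cast(resource));
    }
    return result;
  }

  public static void main(String[] args) {
    TypedResourceMappings mappings = new TypedResourceMappings();
    mappings.addAll(Arrays.asList("gpu-0", "gpu-1"));
    List<String> devices = mappings.getAssignedResources(String.class);
    System.out.println(devices);
  }
}
{code}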



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9050) [Umbrella] Usability improvements for scheduler activities

2019-04-07 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811744#comment-16811744
 ] 

Tao Yang commented on YARN-9050:


Hi, [~adam.antal].
As far as I know, activities are currently only used by CS, but logically it's a 
common module and can be used by any type of scheduler. Some improvements, like 
3), involve some basic modifications and can be called by the fair scheduler to 
get details such as insufficient-resource diagnostics. It's wonderful to hear 
that these improvements could be used by FS, and I would be glad to discuss 
further details.

> [Umbrella] Usability improvements for scheduler activities
> --
>
> Key: YARN-9050
> URL: https://issues.apache.org/jira/browse/YARN-9050
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: capacityscheduler
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: image-2018-11-23-16-46-38-138.png
>
>
> We have made some usability improvements for scheduler activities, based on 
> YARN 3.1, in our cluster as follows:
>  1. Not available for multi-thread asynchronous scheduling. App and node 
> activities may be confused when multiple scheduling threads record activities 
> of different allocation processes in the same variables, like appsAllocation 
> and recordingNodesAllocation in ActivitiesManager. I think these variables 
> should be thread-local to keep activities clear among multiple threads (see 
> the sketch after this list).
>  2. Incomplete activities for the multi-node lookup mechanism, since 
> ActivitiesLogger will skip recording through \{{if (node == null || 
> activitiesManager == null) }} when node is null, which indicates that this 
> allocation is for multiple nodes. We need to support recording activities for 
> the multi-node lookup mechanism.
>  3. Current app activities can not meet the requirements of diagnostics. For 
> example, we can know that a node doesn't match a request but it is hard to 
> know why, especially when using placement constraints, where it's difficult 
> to make a detailed diagnosis manually. So I propose to improve the diagnoses 
> of activities: add a diagnosis for the placement constraints check, update 
> the insufficient resource diagnosis with detailed info (like 'insufficient 
> resource names:[memory-mb]') and so on.
>  4. Add more useful fields for app activities. In some scenarios we need to 
> distinguish different requests but can't locate them based on the app 
> activities info alone; there are some other fields that can help to filter 
> what we want, such as allocation tags. We have added containerPriority, 
> allocationRequestId and allocationTags fields in AppAllocation.
>  5. Filter app activities by key fields. Sometimes the results of app 
> activities are massive and it's hard to find what we want. We have supported 
> filtering by allocation-tags to meet requirements from some apps; moreover, 
> we can take container-priority and allocation-request-id as candidates if 
> necessary.
>  6. Aggregate app activities by diagnoses. For a single allocation process, 
> activities can still be massive in a large cluster. We frequently want to 
> know why a request can't be allocated in the cluster, and it's hard to check 
> every node manually in a large cluster, so aggregation of app activities by 
> diagnoses is necessary to solve this trouble. We have added a groupingType 
> parameter to the app-activities REST API for this, which supports grouping by 
> diagnostics.
> I think we can have a discussion about these points; useful improvements that 
> are accepted will be added to the patch. Thanks.
> The running design doc is attached 
> [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5].
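A minimal sketch of the thread-local idea from point 1 (names are illustrative, 
not the actual ActivitiesManager fields):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: each scheduling thread records into its own buffer via
// ThreadLocal, so concurrent allocation threads cannot interleave their
// activity records in a shared variable.
public class PerThreadActivityBuffer {

  // Every scheduling thread gets its own list of in-progress activity lines.
  private final ThreadLocal<List<String>> appsAllocation =
      ThreadLocal.withInitial(ArrayList::new);

  public void record(String activity) {
    appsAllocation.get().add(activity);
  }

  /** Called when the current thread finishes one allocation attempt. */
  public List<String> finishAndDrain() {
    List<String> recorded = new ArrayList<>(appsAllocation.get());
    appsAllocation.get().clear();
    return recorded;
  }

  public static void main(String[] args) {
    PerThreadActivityBuffer buffer = new PerThreadActivityBuffer();
    buffer.record("node nm1: insufficient memory-mb");
    System.out.println(buffer.finishAndDrain());
  }
}
{code}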



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-04-07 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811758#comment-16811758
 ] 

Hadoop QA commented on YARN-9313:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 28s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
11s{color} | {color:red} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 in trunk has 2 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
28s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 31s{color} | {color:orange} 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
 The patch generated 2 new + 121 unchanged - 0 fixed = 123 total (was 121) 
{color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
11m 11s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 76m  4s{color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}124m 57s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestResourceTrackerService |
|   | 
hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f |
| JIRA Issue | YARN-9313 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12965108/YARN-9313.004.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 78993416744d 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 
17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ec143cb |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_191 |
| findbugs | v3.1.0-RC1 |
| findbugs | 

[jira] [Created] (YARN-9455) SchedulerInvalidResoureRequestException has a typo in its class (and file) name

2019-04-07 Thread Szilard Nemeth (JIRA)
Szilard Nemeth created YARN-9455:


 Summary: SchedulerInvalidResoureRequestException has a typo in its 
class (and file) name
 Key: YARN-9455
 URL: https://issues.apache.org/jira/browse/YARN-9455
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Szilard Nemeth


The class name should be: SchedulerInvalidResourceRequestException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9455) SchedulerInvalidResoureRequestException has a typo in its class (and file) name

2019-04-07 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-9455:
-
Labels: newbie  (was: )

> SchedulerInvalidResoureRequestException has a typo in its class (and file) 
> name
> ---
>
> Key: YARN-9455
> URL: https://issues.apache.org/jira/browse/YARN-9455
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Priority: Major
>  Labels: newbie
>
> The class name should be: SchedulerInvalidResourceRequestException



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9454) Add detailed log about list applications command

2019-04-07 Thread Szilard Nemeth (JIRA)
Szilard Nemeth created YARN-9454:


 Summary: Add detailed log about list applications command
 Key: YARN-9454
 URL: https://issues.apache.org/jira/browse/YARN-9454
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Szilard Nemeth


When a user lists YARN applications with the YARN application CLI, we have one 
audit log here 
(https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L924).
However, more extensive logging could be added.

This is the call chain when such a list command is executed (from bottom to 
top):


{code:java}
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService#getApplications
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl#getApplications(java.util.Set,
 java.util.EnumSet, 
java.util.Set)
ApplicationCLI.listApplications(Set, EnumSet, 
Set)  (org.apache.hadoop.yarn.client.cli)
ApplicationCLI.run(String[])  (org.apache.hadoop.yarn.client.cli)
{code}


org.apache.hadoop.yarn.server.resourcemanager.ClientRMService#getApplications 
is the place that fits best for adding a more detailed log message about the 
request or the response (or both).
In my opinion, a trace (or debug) level log at the end of this method, logging 
the whole response, would be great, so any potential issue with the code can be 
investigated more easily. 
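A hedged sketch of what that could look like (standalone, assuming SLF4J as the 
logging facade; the class and field names are hypothetical, not the actual 
ClientRMService code):

{code:java}
import java.util.Arrays;
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch: log the full response only at debug level and only
// when that level is enabled, so the common case pays nothing for it.
public class GetApplicationsLoggingSketch {

  private static final Logger LOG =
      LoggerFactory.getLogger(GetApplicationsLoggingSketch.class);

  public List<String> getApplications(String user) {
    List<String> reports = Arrays.asList("application_1_0001", "application_1_0002");
    if (LOG.isDebugEnabled()) {
      // Guard the string building: the full report list can be large.
      LOG.debug("getApplications called by user " + user
          + ", returning " + reports.size() + " reports: " + reports);
    }
    return reports;
  }

  public static void main(String[] args) {
    System.out.println(new GetApplicationsLoggingSketch().getApplications("yarn"));
  }
}
{code}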



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-04-07 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9313:
---
Attachment: YARN-9313.004.patch

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch, YARN-9313.002.patch, 
> YARN-9313.003.patch, YARN-9313.004.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run

2019-04-07 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-9453:
-
Description: 
org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long and 
contains a long if-else chain with many conditions. 
As a start, the bodies of the conditions could be extracted to methods, and a 
cleaner solution could be introduced to parse the argument values.

> Clean up code long if-else chain in ApplicationCLI#run
> --
>
> Key: YARN-9453
> URL: https://issues.apache.org/jira/browse/YARN-9453
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Priority: Major
>
> org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long and 
> contains a long if-else chain with many conditions. 
> As a start, the bodies of the conditions could be extracted to methods, and a 
> cleaner solution could be introduced to parse the argument values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run

2019-04-07 Thread Szilard Nemeth (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Szilard Nemeth updated YARN-9453:
-
Labels: newbie  (was: )

> Clean up code long if-else chain in ApplicationCLI#run
> --
>
> Key: YARN-9453
> URL: https://issues.apache.org/jira/browse/YARN-9453
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Priority: Major
>  Labels: newbie
>
> org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long and 
> contains a long if-else chain with many conditions. 
> As a start, the bodies of the conditions could be extracted to methods, and a 
> cleaner solution could be introduced to parse the argument values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities

2019-04-07 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811725#comment-16811725
 ] 

Tao Yang commented on YARN-9313:


Attached v4 patch. Thanks [~cheersyang] for your advice.

> Support asynchronized scheduling mode and multi-node lookup mechanism for 
> scheduler activities
> --
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Attachments: YARN-9313.001.patch, YARN-9313.002.patch, 
> YARN-9313.003.patch, YARN-9313.004.patch
>
>
> [Design 
> doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run

2019-04-07 Thread Szilard Nemeth (JIRA)
Szilard Nemeth created YARN-9453:


 Summary: Clean up code long if-else chain in ApplicationCLI#run
 Key: YARN-9453
 URL: https://issues.apache.org/jira/browse/YARN-9453
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Szilard Nemeth






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7721) TestContinuousScheduling fails sporadically with NPE

2019-04-07 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811770#comment-16811770
 ] 

Szilard Nemeth commented on YARN-7721:
--

Hi [~wilfreds]!
Thanks for this patch!
+1 (non-binding)

> TestContinuousScheduling fails sporadically with NPE
> 
>
> Key: YARN-7721
> URL: https://issues.apache.org/jira/browse/YARN-7721
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: fairscheduler
>Affects Versions: 3.1.0
>Reporter: Jason Lowe
>Assignee: Wilfred Spiegelenburg
>Priority: Major
> Attachments: YARN-7721.001.patch
>
>
> TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime is 
> failing sporadically with an NPE in precommit builds, and I can usually 
> reproduce it locally after a few tries:
> {noformat}
> [ERROR] 
> testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
>   Time elapsed: 0.085 s  <<< ERROR!
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:383)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> [...]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9123) Clean up and split testcases in TestNMWebServices for GPU support

2019-04-07 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811769#comment-16811769
 ] 

Szilard Nemeth commented on YARN-9123:
--

Hi [~jojochuang]!
The only remaining item in the checkstyle log is this: 

{code:java}
TestNMWebServices.java:196:  public long a = NM_RESOURCE_VALUE;:19: 
Variable 'a' must be private and have accessor methods. [VisibilityModifier]

{code}
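If it helps, here is a hedged sketch of the usual fix for that 
VisibilityModifier warning (the surrounding class below is hypothetical; only 
the field and constant names follow the checkstyle output above):

{code:java}
// Hypothetical sketch: make the field private and expose it via a getter.
public class NMResourceInfoStub {

  private static final long NM_RESOURCE_VALUE = 1000L;

  private long a = NM_RESOURCE_VALUE;

  public long getA() {
    return a;
  }
}
{code}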

May I deal with this, or is this patch ready for commit?

Thanks!

> Clean up and split testcases in TestNMWebServices for GPU support
> -
>
> Key: YARN-9123
> URL: https://issues.apache.org/jira/browse/YARN-9123
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Szilard Nemeth
>Assignee: Szilard Nemeth
>Priority: Minor
> Attachments: YARN-9123.001.patch, YARN-9123.002.patch, 
> YARN-9123.003.patch, YARN-9123.004.patch, YARN-9123.005.patch, 
> YARN-9123.006.patch
>
>
> The following testcases can be cleaned up a bit: 
> TestNMWebServices#testGetNMResourceInfo - Can be split up to 3 different cases
> TestNMWebServices#testGetYarnGpuResourceInfo



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9437) RMNodeImpls occupy too much memory and causes RM GC to take a long time

2019-04-07 Thread qiuliang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811738#comment-16811738
 ] 

qiuliang commented on YARN-9437:


As I understand it, there are two cases that may cause the completedContainers 
in RMNodeImpl to not be released.
1. When RMAppAttemptImpl receives a CONTAINER_FINISHED event (not for the 
amContainer), it adds this container to justFinishedContainers. When processing 
the AM heartbeat, RMAppAttemptImpl first sends the containers in 
finishedContainersSentToAM to the NM, and RMNodeImpl also removes these 
containers from completedContainers. Then it transfers the containers in 
justFinishedContainers to finishedContainersSentToAM and waits for the next AM 
heartbeat to send those containers to the NM. If RMAppAttemptImpl accepts the 
AM unregistration event while justFinishedContainers is not empty, the 
containers in justFinishedContainers may never get the chance to be transferred 
to finishedContainersSentToAM, so these containers are not sent to the NM, and 
RMNodeImpl does not release them.
2. When RMAppAttemptImpl is in a final state and receives a CONTAINER_FINISHED 
event, it just adds this container to justFinishedContainers and does not send 
it to the NM.
For the first case, my idea is that when RMAppAttemptImpl handles the 
amContainer-finished event, the containers in justFinishedContainers are 
transferred to finishedContainersSentToAM and sent to the NM along with the 
amContainer (a rough sketch of this idea follows below). I am not sure if there 
is any other impact. For the second case, when RMAppAttemptImpl is in a final 
state and receives the CONTAINER_FINISHED event, these containers would be sent 
directly to the NM, but I am worried that this would generate many events.
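A heavily hedged, standalone sketch of the idea for case 1 (field and method 
names mirror the ones discussed above, but this is NOT the real 
RMAppAttemptImpl; ContainerStatus and NodeId are replaced by plain strings to 
keep it self-contained):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: when the AM container finishes, drain whatever is
// still in justFinishedContainers into finishedContainersSentToAM so those
// statuses are pushed to the NMs together with the AM container status,
// instead of being left behind.
public class FinishedContainerDrainSketch {

  private final Map<String, List<String>> justFinishedContainers =
      new ConcurrentHashMap<>();
  private final Map<String, List<String>> finishedContainersSentToAM =
      new ConcurrentHashMap<>();

  /** Record a completed container status, keyed by the node it ran on. */
  public void containerFinished(String nodeId, String containerStatus) {
    justFinishedContainers
        .computeIfAbsent(nodeId, k -> new ArrayList<>())
        .add(containerStatus);
  }

  /** Proposed extra step while handling the AM container finished event. */
  public void onAmContainerFinished() {
    for (Map.Entry<String, List<String>> entry : justFinishedContainers.entrySet()) {
      finishedContainersSentToAM
          .computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
          .addAll(entry.getValue());
      // In the real fix this is where the statuses would be acknowledged to
      // the NM so RMNodeImpl can drop them from completedContainers.
      System.out.println("ack " + entry.getValue() + " to NM " + entry.getKey());
    }
    justFinishedContainers.clear();
  }

  public static void main(String[] args) {
    FinishedContainerDrainSketch sketch = new FinishedContainerDrainSketch();
    sketch.containerFinished("nm1:8041", "container_0001_01_000002 COMPLETE");
    sketch.containerFinished("nm1:8041", "container_0001_01_000003 COMPLETE");
    sketch.onAmContainerFinished();
  }
}
{code}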

> RMNodeImpls occupy too much memory and causes RM GC to take a long time
> ---
>
> Key: YARN-9437
> URL: https://issues.apache.org/jira/browse/YARN-9437
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.1
>Reporter: qiuliang
>Priority: Minor
> Attachments: 1.png, 2.png, 3.png
>
>
> We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of 
> the RM memory is occupied by RMNodeImpl objects. Analysis of the RM memory 
> found that each RMNodeImpl is approximately 14 MB. The reason is that there 
> are 130,000+ completedContainers in each RMNodeImpl that have not been 
> released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler

2019-04-07 Thread Tao Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811729#comment-16811729
 ] 

Tao Yang commented on YARN-9413:


Thanks [~cheersyang], [~snemeth] for the review and commit.
{quote}
could you please take a look if this issue happens in branch-3.0 too? If it 
does, please help to provide a patch for branch-3.0. 
{quote}
Yes, it does. I have attached a patch for branch-3.0 and just added a test for 
the capacity scheduler in this branch, since TestAMRestart doesn't extend 
ParameterizedSchedulerTestBase to test both the capacity and fair schedulers, 
and this issue won't happen for the fair scheduler.

> Queue resource leak after app fail for CapacityScheduler
> 
>
> Key: YARN-9413
> URL: https://issues.apache.org/jira/browse/YARN-9413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9413.001.patch, YARN-9413.002.patch, 
> YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, 
> image-2019-03-29-10-47-47-953.png
>
>
> To reproduce this problem:
>  # Submit an app which is configured to keep containers across app attempts 
> and should fail after the AM finishes for the first time (am-max-attempts=1).
>  # The app is started with 2 containers running on the NM1 node.
>  # Fail the AM of the application with the PREEMPTED exit status, which should 
> not count towards the max attempt retry, but the app will fail immediately.
>  # The used resource of this queue leaks after the app fails.
> The root cause is the inconsistency in handling app attempt failure between 
> RMAppAttemptImpl$BaseFinalTransition#transition and 
> RMAppImpl$AttemptFailedTransition#transition:
>  # After the app fails, RMAppFailedAttemptEvent will be sent in 
> RMAppAttemptImpl$BaseFinalTransition#transition; if the exit status of the AM 
> container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it 
> will not count towards the max attempt retry, so it will send 
> AppAttemptRemovedSchedulerEvent with keepContainersAcrossAppAttempts=true and 
> RMAppFailedAttemptEvent with transferStateFromPreviousAttempt=true.
>  # RMAppImpl$AttemptFailedTransition#transition handles 
> RMAppFailedAttemptEvent and will fail the app if its max app attempts is 1.
>  # CapacityScheduler handles AppAttemptRemovedSchedulerEvent in 
> CapacityScheduler#doneApplicationAttempt; it will skip killing and calling the 
> completion process for containers belonging to this app, so a queue resource 
> leak happens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9445) yarn.admin.acl is futile

2019-04-07 Thread Szilard Nemeth (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811766#comment-16811766
 ] 

Szilard Nemeth commented on YARN-9445:
--

I will let [~shuzirra] answer the concerns here; in the meantime, let's involve 
[~wilfreds] as well!

> yarn.admin.acl is futile
> 
>
> Key: YARN-9445
> URL: https://issues.apache.org/jira/browse/YARN-9445
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: security
>Affects Versions: 3.3.0
>Reporter: Peter Simon
>Assignee: Gergely Pollak
>Priority: Major
> Attachments: YARN-9445.001.patch
>
>
> * Define a queue with restrictive administerApps settings (e.g. yarn)
>  * Set yarn.admin.acl to "*".
> * Try to submit an application as user yarn; it is denied.
> This way, my expected behaviour would be that since everyone is an admin, I 
> can submit to whatever pool.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9413) Queue resource leak after app fail for CapacityScheduler

2019-04-07 Thread Tao Yang (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Yang updated YARN-9413:
---
Attachment: YARN-9413.branch-3.0.001.patch

> Queue resource leak after app fail for CapacityScheduler
> 
>
> Key: YARN-9413
> URL: https://issues.apache.org/jira/browse/YARN-9413
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacityscheduler
>Affects Versions: 3.1.2
>Reporter: Tao Yang
>Assignee: Tao Yang
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9413.001.patch, YARN-9413.002.patch, 
> YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, 
> image-2019-03-29-10-47-47-953.png
>
>
> To reproduce this problem:
>  # Submit an app which is configured to keep containers across app attempts 
> and should fail after the AM finishes for the first time (am-max-attempts=1).
>  # The app is started with 2 containers running on the NM1 node.
>  # Fail the AM of the application with the PREEMPTED exit status, which should 
> not count towards the max attempt retry, but the app will fail immediately.
>  # The used resource of this queue leaks after the app fails.
> The root cause is the inconsistency in handling app attempt failure between 
> RMAppAttemptImpl$BaseFinalTransition#transition and 
> RMAppImpl$AttemptFailedTransition#transition:
>  # After the app fails, RMAppFailedAttemptEvent will be sent in 
> RMAppAttemptImpl$BaseFinalTransition#transition; if the exit status of the AM 
> container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it 
> will not count towards the max attempt retry, so it will send 
> AppAttemptRemovedSchedulerEvent with keepContainersAcrossAppAttempts=true and 
> RMAppFailedAttemptEvent with transferStateFromPreviousAttempt=true.
>  # RMAppImpl$AttemptFailedTransition#transition handles 
> RMAppFailedAttemptEvent and will fail the app if its max app attempts is 1.
>  # CapacityScheduler handles AppAttemptRemovedSchedulerEvent in 
> CapacityScheduler#doneApplicationAttempt; it will skip killing and calling the 
> completion process for containers belonging to this app, so a queue resource 
> leak happens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org