[jira] [Comment Edited] (YARN-9413) Queue resource leak after app fail for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812123#comment-16812123 ]

Weiwei Yang edited comment on YARN-9413 at 4/8/19 5:46 AM:
---
Thanks for confirming that. +1. Just committed to branch-3.0. Now this is fixed on all 3.x versions. Thanks [~Tao Yang] for the contribution.

was (Author: cheersyang): Thanks for confirming that. +1. Committing now.

> Queue resource leak after app fail for CapacityScheduler
>
> Key: YARN-9413
> URL: https://issues.apache.org/jira/browse/YARN-9413
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacityscheduler
> Affects Versions: 3.1.2
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
> Attachments: YARN-9413.001.patch, YARN-9413.002.patch, YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, image-2019-03-29-10-47-47-953.png
>
> To reproduce this problem:
> # Submit an app that is configured to keep containers across app attempts and that should fail after its AM finishes the first time (am-max-attempts=1).
> # The app starts with 2 containers running on node NM1.
> # Fail the AM of the application with the PREEMPTED exit status, which should not count towards the max attempt retry; the app then fails immediately.
> # The used resource of the queue leaks after the app fails.
>
> The root cause is an inconsistency in how app attempt failure is handled between RMAppAttemptImpl$BaseFinalTransition#transition and RMAppImpl$AttemptFailedTransition#transition:
> # After the app fails, an RMAppFailedAttemptEvent is sent in RMAppAttemptImpl$BaseFinalTransition#transition. If the exit status of the AM container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it does not count towards the max attempt retry, so an AppAttemptRemovedSchedulerEvent is sent with keepContainersAcrossAppAttempts=true and an RMAppFailedAttemptEvent with transferStateFromPreviousAttempt=true.
> # RMAppImpl$AttemptFailedTransition#transition handles the RMAppFailedAttemptEvent and fails the app if its max app attempts is 1.
> # CapacityScheduler handles the AppAttemptRemovedSchedulerEvent in CapacityScheduler#doneApplicationAttempt; it skips killing and completing the containers belonging to this app, so the queue resource leak happens.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
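The inconsistency in the three steps above can be illustrated with a small, hypothetical model (plain Java, not Hadoop code; the class, method, and field names are invented for illustration): the leak happens because the keep-containers flag alone decides whether live containers are released, without checking whether a next attempt actually exists to adopt them.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the accounting described above; the names
// mirror CapacityScheduler concepts, but this is NOT Hadoop code.
class QueueLeakDemo {

    // Stands in for a queue's used-resource accounting (memory in MB).
    static final class Queue {
        int usedMB;
        Queue(int usedMB) { this.usedMB = usedMB; }
    }

    // Pre-fix behavior: the attempt-removed handler skips releasing live
    // containers whenever keepContainersAcrossAppAttempts is set, even if the
    // app has already failed and no next attempt will ever adopt them.
    static int doneAttemptBuggy(Queue q, List<Integer> liveContainersMB,
                                boolean keepContainers) {
        if (!keepContainers) {
            for (int mb : liveContainersMB) {
                q.usedMB -= mb;
            }
        }
        return q.usedMB;
    }

    // Fixed behavior: containers are kept only when a next attempt exists to
    // take them over; otherwise they are released so the queue's used
    // resource is decremented.
    static int doneAttemptFixed(Queue q, List<Integer> liveContainersMB,
                                boolean keepContainers, boolean hasNextAttempt) {
        if (!(keepContainers && hasNextAttempt)) {
            for (int mb : liveContainersMB) {
                q.usedMB -= mb;
            }
        }
        return q.usedMB;
    }

    public static void main(String[] args) {
        // 2 x 1024 MB containers on NM1; AM preempted; am-max-attempts=1,
        // so the app fails outright and no next attempt adopts the containers.
        List<Integer> live = Arrays.asList(1024, 1024);
        System.out.println("buggy: usedMB = "
            + doneAttemptBuggy(new Queue(2048), live, true));        // leaks: 2048
        System.out.println("fixed: usedMB = "
            + doneAttemptFixed(new Queue(2048), live, true, false)); // 0
    }
}
```

Under this model, the buggy path leaves the queue's used resource at 2048 MB forever, while the fixed path drops it back to 0 once the app fails without a successor attempt.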
[jira] [Updated] (YARN-9413) Queue resource leak after app fail for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang updated YARN-9413:
--
Fix Version/s: 3.0.4
[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812123#comment-16812123 ]

Weiwei Yang commented on YARN-9413:
---
Thanks for confirming that. +1. Committing now.
[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812122#comment-16812122 ]

Hudson commented on YARN-9313:
--
FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16360 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16360/])
YARN-9313. Support asynchronized scheduling mode and multi-node lookup (wwei: rev fc05b0e70e9bb556d6bdc00fa8735e18a6f90bc9)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesLogger.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesSchedulerActivities.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesSchedulerActivitiesWithMultiNodesEnabled.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java
* (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/TestActivitiesManager.java

> Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
>
> Key: YARN-9313
> URL: https://issues.apache.org/jira/browse/YARN-9313
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Tao Yang
> Assignee: Tao Yang
> Priority: Major
> Fix For: 3.3.0
> Attachments: YARN-9313.001.patch, YARN-9313.002.patch, YARN-9313.003.patch, YARN-9313.004.patch, YARN-9313.005.patch
>
> [Design doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]
[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812118#comment-16812118 ]

Weiwei Yang commented on YARN-9313:
---
+1, committing now. Thanks [~Tao Yang].
[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812110#comment-16812110 ]

Tao Yang commented on YARN-9413:
---
The checkstyle issue seems to be the same as above, and the UT failures are not related to this patch (I can reproduce them on branch-3.0 without this patch).
[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812101#comment-16812101 ] Hadoop QA commented on YARN-9313: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 14s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 17s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 2 extant Findbugs warnings. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 39s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 77m 15s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}128m 31s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9313 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12965152/YARN-9313.005.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux ecd425ade2d0 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 0d47d28 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | findbugs | https://builds.apache.org/job/PreCommit-YARN-Build/23909/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23909/testReport/ | | Max. process+thread count | 904 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U:
[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812086#comment-16812086 ] Hadoop QA commented on YARN-9413: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 10m 34s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} branch-3.0 Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 32s{color} | {color:green} branch-3.0 passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 36s{color} | {color:green} branch-3.0 passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 32s{color} | {color:green} branch-3.0 passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 42s{color} | {color:green} branch-3.0 passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 23s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 12s{color} | {color:green} branch-3.0 passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} branch-3.0 passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 32s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 1 new + 158 unchanged - 4 fixed = 159 total (was 162) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 38s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 30s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}120m 30s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation | | | hadoop.yarn.server.resourcemanager.TestOpportunisticContainerAllocatorAMService | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:e402791 | | JIRA Issue | YARN-9413 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12965109/YARN-9413.branch-3.0.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 21557175738a 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | branch-3.0 / f824f4d | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | checkstyle |
[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812056#comment-16812056 ]

Tao Yang commented on YARN-9313:
---
Attached v5 patch to fix the remaining checkstyle errors; the UT and findbugs failures seem not related to this patch.
[jira] [Commented] (YARN-9455) SchedulerInvalidResoureRequestException has a typo in its class (and file) name
[ https://issues.apache.org/jira/browse/YARN-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812054#comment-16812054 ]

Anh commented on YARN-9455:
---
Hi [~snemeth], can I take this item?

> SchedulerInvalidResoureRequestException has a typo in its class (and file) name
>
> Key: YARN-9455
> URL: https://issues.apache.org/jira/browse/YARN-9455
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Szilard Nemeth
> Priority: Major
> Labels: newbie
>
> The class name should be: SchedulerInvalidResourceRequestException
[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Yang updated YARN-9313:
---
Attachment: YARN-9313.005.patch
[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812019#comment-16812019 ]

Weiwei Yang commented on YARN-9413:
---
Thanks [~Tao Yang], reopened to trigger the jenkins job on branch-3.0.
[jira] [Reopened] (YARN-9413) Queue resource leak after app fail for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang reopened YARN-9413:
---
[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812018#comment-16812018 ]

Weiwei Yang commented on YARN-9313:
---
Thanks [~Tao Yang] for the update, the patch looks good. Can you please fix the remaining 2 checkstyle issues? Looks like we are hitting some flaky UTs again; they should not be related to this patch.
[jira] [Commented] (YARN-9445) yarn.admin.acl is futile
[ https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811958#comment-16811958 ] Eric Yang commented on YARN-9445: - [~sunilg] [~bibinchundatt] Security should be designed to be permissive from the admin's point of view instead of mutually exclusive. Security may appear mutually exclusive (allowed or disallowed) from the user's point of view. However, a proper security design should be permissive from the admin's point of view. Admins must have the ability to perform the same operation if the user is not available to carry it out. {quote}a) yarn.admin.acls=yarn. and for e, .queueA.acl_submit_applications=john. Now user "john" can submit app to queueA. "yarn" user should not be able to submit.{quote} I do not believe that disallowing the system admin from submitting jobs, as in the statement above, improves security. It only creates inconvenience for impersonation: the YARN service user credential cannot submit a job on behalf of the user. The admin can always run "sudo" to submit the job for the user. Hence, this artificially designed mutually exclusive constraint is a no-op security feature. Some improvement in this area would make the system easier to operate and avoid the paradox of preventing admins from fixing users' problems. > yarn.admin.acl is futile > > > Key: YARN-9445 > URL: https://issues.apache.org/jira/browse/YARN-9445 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 3.3.0 >Reporter: Peter Simon >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-9445.001.patch > > > * Define a queue with restrictive administerApps settings (e.g. yarn) > * Set yarn.admin.acl to "*". > * Try to submit an application with user yarn, it is denied. > This way my expected behaviour would be that while everyone is admin, I can > submit to whatever pool. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9445) yarn.admin.acl is futile
[ https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811950#comment-16811950 ] Szilard Nemeth edited comment on YARN-9445 at 4/7/19 7:41 PM: -- [~sunilg], [~bibinchundatt]: I'm confused. Reading the 3.2.0 docs ([https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists] for FS/ACLs) says: "Queue Access Control Lists (ACLs) allow administrators to control who may take actions on particular queues. They are configured with the aclSubmitApps and aclAdministerApps properties, which can be set per queue. Currently the only supported administrative action is killing an application. An administrator may also submit applications to it." In this sense, aclAdministerApps not only gives permissions to execute admin operations but also gives submission permissions to queues. For me, not giving an administrator rights to everything seems controversial, so the documentation is more logical. All in all, if we go with the direction that admins don't get submission rights then we should also make sure the documentation is in line with the decision. I do agree with [~eyang] about restricting the default admin ACL to something else than '*' but this requires a follow-up jira, I think. was (Author: snemeth): [~sunilg], [~bibinchundatt]: I'm confused. Reading the 3.2.0 docs ([https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists] for FS/ACLs) says: "Queue Access Control Lists (ACLs) allow administrators to control who may take actions on particular queues. They are configured with the aclSubmitApps and aclAdministerApps properties, which can be set per queue. Currently the only supported administrative action is killing an application. An administrator may also submit applications to it." 
In this sense, aclAdministerApps not only gives permissions to execute admin operations but also gives submission permissions to queues. For me, not giving an administrator rights to everything seems controversial, so the documentation is more logical. All in all, if we go with the direction that admins don't get submission rights then we should also make sure the documentation is in line with the decision. I do agree with [~eyang] about restricting the default admin ACL to something else than '*' but this requires a follow-up jira, I think. > yarn.admin.acl is futile > > > Key: YARN-9445 > URL: https://issues.apache.org/jira/browse/YARN-9445 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 3.3.0 >Reporter: Peter Simon >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-9445.001.patch > > > * Define a queue with restrictive administerApps settings (e.g. yarn) > * Set yarn.admin.acl to "*". > * Try to submit an application with user yarn, it is denied. > This way my expected behaviour would be that while everyone is admin, I can > submit to whatever pool. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9445) yarn.admin.acl is futile
[ https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811950#comment-16811950 ] Szilard Nemeth edited comment on YARN-9445 at 4/7/19 7:39 PM: -- [~sunilg], [~bibinchundatt]: I'm confused. Reading the 3.2.0 docs ([https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists] for FS/ACLs) says: "Queue Access Control Lists (ACLs) allow administrators to control who may take actions on particular queues. They are configured with the aclSubmitApps and aclAdministerApps properties, which can be set per queue. Currently the only supported administrative action is killing an application. An administrator may also submit applications to it." In this sense, aclAdministerApps not only gives permissions to execute admin operations but also gives submission permissions to queues. For me, not giving an administrator rights to everything seems controversial, so the documentation is more logical. All in all, if we go with the direction that admins don't get submission rights then we should also make sure the documentation is in line with the decision. I do agree with [~eyang] about restricting the default admin ACL to something else than '*' but this requires a follow-up jira, I think. was (Author: snemeth): [~sunilg], [~bibinchundatt]: I'm confused. Reading the 3.2.0 docs (https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists for FS/ACLs) says: "Queue Access Control Lists (ACLs) allow administrators to control who may take actions on particular queues. They are configured with the aclSubmitApps and aclAdministerApps properties, which can be set per queue. Currently the only supported administrative action is killing an application. An administrator may also submit applications to it." 
In this sense, aclAdministerApps not only gives permissions to execute admin operations but also gives submission permissions to queues. For me, not giving an administrator rights to everything seems controversial, so the documentation is more logical. All in all, if we go with the direction that admins don't get submission rights then we should also make sure the documentation is in line with the decision. I do agree with [~eyang] about restricting the default admin ACL to something else than '*' but this requires a follow-up jira, I think. > yarn.admin.acl is futile > > > Key: YARN-9445 > URL: https://issues.apache.org/jira/browse/YARN-9445 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 3.3.0 >Reporter: Peter Simon >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-9445.001.patch > > > * Define a queue with restrictive administerApps settings (e.g. yarn) > * Set yarn.admin.acl to "*". > * Try to submit an application with user yarn, it is denied. > This way my expected behaviour would be that while everyone is admin, I can > submit to whatever pool. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9445) yarn.admin.acl is futile
[ https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811950#comment-16811950 ] Szilard Nemeth commented on YARN-9445: -- [~sunilg], [~bibinchundatt]: I'm confused. Reading the 3.2.0 docs (https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Queue_Access_Control_Lists for FS/ACLs) says: "Queue Access Control Lists (ACLs) allow administrators to control who may take actions on particular queues. They are configured with the aclSubmitApps and aclAdministerApps properties, which can be set per queue. Currently the only supported administrative action is killing an application. An administrator may also submit applications to it." In this sense, aclAdministerApps not only gives permissions to execute admin operations but also gives submission permissions to queues. For me, not giving an administrator rights to everything seems controversial, so the documentation is more logical. All in all, if we go with the direction that admins don't get submission rights then we should also make sure the documentation is in line with the decision. I do agree with [~eyang] about restricting the default admin ACL to something else than '*' but this requires a follow-up jira, I think. > yarn.admin.acl is futile > > > Key: YARN-9445 > URL: https://issues.apache.org/jira/browse/YARN-9445 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 3.3.0 >Reporter: Peter Simon >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-9445.001.patch > > > * Define a queue with restrictive administerApps settings (e.g. yarn) > * Set yarn.admin.acl to "*". > * Try to submit an application with user yarn, it is denied. > This way my expected behaviour would be that while everyone is admin, I can > submit to whatever pool. 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable
[ https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811866#comment-16811866 ] Prabhu Joseph commented on YARN-6929: - [~eyang] I have changed the app log dir structure to the format below. {code} {aggregation_log_root} / {user} / bucket_{suffix} / {cluster_timestamp} / {bucket1} / {bucket2} / {appId} where aggregation_log_root is yarn.nodemanager.remote-app-log-dir suffix is yarn.nodemanager.remote-app-log-dir-suffix (logs) cluster_timestamp is application_timestamp bucket1 is application#getId % 10000 bucket2 is application_timestamp % 10000 {code} *The patch changes below:* 1. {{LogAggregationFileController}} changed to create the new app log dir structure 2. {{AggregatedLogDeletionService}} changed to remove older bucket / app dirs as per retention. 3. {{LogAggregationFileControllerFactory}} and {{LogAggregationIndexedFileController}} changed to include both the old and new app log dir structures. 4. New config {{yarn.nodemanager.remote-app-log-dir-include-older}} (default true) introduced to also include older app log dirs while accessing the yarn logs. This can be configured to false later if the user does not want / have the older log dir structure. *Functional Testing Done:* {code} 1. Check if new application logs get written into the correct app log dir structure. 2. Yarn Logs Cli 3. Accessing logs from RM UI / HistoryServer UI works fine while the job is running / complete. 4. Accessing older logs. 
{code} *App Log Dir Structure for sample job:* {code} [hdfs@yarn-ats-2 yarn]$ hadoop fs -ls /app-logs/ambari-qa/ Found 2 items drwxrwx--- - ambari-qa hadoop 0 2019-04-07 12:26 /app-logs/ambari-qa/bucket_logs drwxrwx--- - ambari-qa hadoop 0 2019-04-05 15:01 /app-logs/ambari-qa/logs [hdfs@yarn-ats-2 yarn]$ [hdfs@yarn-ats-2 yarn]$ hadoop fs -ls /app-logs/ambari-qa/bucket_logs Found 1 items drwxrwx--- - ambari-qa hadoop 0 2019-04-07 12:30 /app-logs/ambari-qa/bucket_logs/1554476304275 [hdfs@yarn-ats-2 yarn]$ hadoop fs -ls /app-logs/ambari-qa/bucket_logs/1554476304275 Found 4 items drwxrwx--- - ambari-qa hadoop 0 2019-04-07 12:26 /app-logs/ambari-qa/bucket_logs/1554476304275/0004 drwxrwx--- - ambari-qa hadoop 0 2019-04-07 12:29 /app-logs/ambari-qa/bucket_logs/1554476304275/0005 drwxrwx--- - ambari-qa hadoop 0 2019-04-07 12:29 /app-logs/ambari-qa/bucket_logs/1554476304275/0006 drwxrwx--- - ambari-qa hadoop 0 2019-04-07 12:30 /app-logs/ambari-qa/bucket_logs/1554476304275/0007 [hdfs@yarn-ats-2 yarn]$ [hdfs@yarn-ats-2 yarn]$ hadoop fs -ls /app-logs/ambari-qa/bucket_logs/1554476304275/0007 Found 1 items drwxrwx--- - ambari-qa hadoop 0 2019-04-07 12:30 /app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275 [hdfs@yarn-ats-2 yarn]$ hadoop fs -ls /app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275 Found 1 items drwxrwx--- - ambari-qa hadoop 0 2019-04-07 12:31 /app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275/application_1554476304275_0007 [hdfs@yarn-ats-2 yarn]$ [hdfs@yarn-ats-2 yarn]$ hadoop fs -ls /app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275/application_1554476304275_0007 Found 2 items -rw-r- 3 ambari-qa hadoop 94103 2019-04-07 12:31 /app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275/application_1554476304275_0007/yarn-ats-2_45454 -rw-r- 3 ambari-qa hadoop 80434 2019-04-07 12:31 /app-logs/ambari-qa/bucket_logs/1554476304275/0007/4275/application_1554476304275_0007/yarn-ats-3_45454 {code} *App Log Dir Structure after deletion:* {code} [hdfs@yarn-ats-2 
yarn]$ hadoop fs -ls /app-logs/ambari-qa/bucket_logs [hdfs@yarn-ats-2 yarn]$ {code} > yarn.nodemanager.remote-app-log-dir structure is not scalable > - > > Key: YARN-6929 > URL: https://issues.apache.org/jira/browse/YARN-6929 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 2.7.3 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-6929-007.patch, YARN-6929.1.patch, > YARN-6929.2.patch, YARN-6929.2.patch, YARN-6929.3.patch, YARN-6929.4.patch, > YARN-6929.5.patch, YARN-6929.6.patch, YARN-6929.patch > > > The current directory structure for yarn.nodemanager.remote-app-log-dir is > not scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). > With retention yarn.log-aggregation.retain-seconds of 7days, there are more > chances LogAggregationService fails to create a new directory with > FSLimitException$MaxDirectoryItemsExceededException. > The current structure is > //logs/. This can be > improved with adding date as a subdirectory like > //logs// > {code} > WARN >
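The bucketed layout can be sketched as a small path builder (plain Java, illustrative only — not the patch's actual code). The moduli and zero-padding widths are inferred from the sample listing above (app id 7 maps to {{0007}}, timestamp 1554476304275 maps to {{4275}}), i.e. assumed to be {{% 10000}} padded to 4 digits:

```java
public class BucketedLogDirSketch {
    // Build the per-app aggregated log dir under the new layout:
    // {root}/{user}/bucket_{suffix}/{clusterTimestamp}/{bucket1}/{bucket2}/{appId}
    static String appLogDir(String root, String user, String suffix,
                            long clusterTimestamp, int appId) {
        int bucket1 = appId % 10000;              // assumed modulus, from sample
        long bucket2 = clusterTimestamp % 10000;  // assumed modulus, from sample
        return String.format("%s/%s/bucket_%s/%d/%04d/%04d/application_%d_%04d",
                root, user, suffix, clusterTimestamp, bucket1, bucket2,
                clusterTimestamp, appId);
    }

    public static void main(String[] args) {
        // Reproduces the sample path from the functional-testing listing above.
        System.out.println(appLogDir("/app-logs", "ambari-qa", "logs",
                1554476304275L, 7));
    }
}
```

Because both bucket levels are bounded by the modulus, no single directory can accumulate an unbounded number of children, which is what avoids the MaxDirectoryItemsExceededException described in the issue.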
[jira] [Updated] (YARN-6929) yarn.nodemanager.remote-app-log-dir structure is not scalable
[ https://issues.apache.org/jira/browse/YARN-6929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-6929: Attachment: YARN-6929-007.patch > yarn.nodemanager.remote-app-log-dir structure is not scalable > - > > Key: YARN-6929 > URL: https://issues.apache.org/jira/browse/YARN-6929 > Project: Hadoop YARN > Issue Type: Bug > Components: log-aggregation >Affects Versions: 2.7.3 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-6929-007.patch, YARN-6929.1.patch, > YARN-6929.2.patch, YARN-6929.2.patch, YARN-6929.3.patch, YARN-6929.4.patch, > YARN-6929.5.patch, YARN-6929.6.patch, YARN-6929.patch > > > The current directory structure for yarn.nodemanager.remote-app-log-dir is > not scalable. Maximum Subdirectory limit by default is 1048576 (HDFS-6102). > With retention yarn.log-aggregation.retain-seconds of 7days, there are more > chances LogAggregationService fails to create a new directory with > FSLimitException$MaxDirectoryItemsExceededException. > The current structure is > //logs/. 
This can be > improved with adding date as a subdirectory like > //logs// > {code} > WARN > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: > Application failed to init aggregation > org.apache.hadoop.yarn.exceptions.YarnRuntimeException: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): > The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 > items=1048576 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021) > > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072) > > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:1841) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsRecursively(FSNamesystem.java:4351) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:4262) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:4221) > > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:4194) > > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:813) > > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:600) > > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035) > at java.security.AccessController.doPrivileged(Native Method) > at 
javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:308) > > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:366) > > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:320) > > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:443) > > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:67) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > > at > org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:745) > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.FSLimitException$MaxDirectoryItemsExceededException): > The directory item limit of /app-logs/yarn/logs is exceeded: limit=1048576 > items=1048576 > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.verifyMaxDirItems(FSDirectory.java:2021) > > at > org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:2072) > > at >
[jira] [Assigned] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run
[ https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wanqiang Ji reassigned YARN-9453: - Assignee: Wanqiang Ji > Clean up code long if-else chain in ApplicationCLI#run > -- > > Key: YARN-9453 > URL: https://issues.apache.org/jira/browse/YARN-9453 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Wanqiang Ji >Priority: Major > Labels: newbie > > org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long, > contains a long if-else chain and many many conditions. > As a start, the bodies of the conditions could be extracted to methods and a > more clean solution could be introduced to parse the argument values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7721) TestContinuousScheduling fails sporadically with NPE
[ https://issues.apache.org/jira/browse/YARN-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811816#comment-16811816 ] Hadoop QA commented on YARN-7721: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 41s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 18s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 2 extant Findbugs warnings. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 31s{color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 0 new + 2 unchanged - 1 fixed = 2 total (was 3) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 55s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 77m 23s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}128m 11s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-7721 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12945331/YARN-7721.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 77effa19d583 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ec143cb | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | findbugs | https://builds.apache.org/job/PreCommit-YARN-Build/23907/artifact/out/branch-findbugs-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager-warnings.html | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23907/testReport/ | | Max. process+thread count | 891 (vs. ulimit of
[jira] [Commented] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run
[ https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811800#comment-16811800 ] Szilard Nemeth commented on YARN-9453: -- [~jiwq]: Sure, please take it! > Clean up code long if-else chain in ApplicationCLI#run > -- > > Key: YARN-9453 > URL: https://issues.apache.org/jira/browse/YARN-9453 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Priority: Major > Labels: newbie > > org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long, > contains a long if-else chain and many many conditions. > As a start, the bodies of the conditions could be extracted to methods and a > more clean solution could be introduced to parse the argument values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run
[ https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811794#comment-16811794 ] Wanqiang Ji commented on YARN-9453: --- Hi [~snemeth], I can work for this if you not mind. > Clean up code long if-else chain in ApplicationCLI#run > -- > > Key: YARN-9453 > URL: https://issues.apache.org/jira/browse/YARN-9453 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Priority: Major > Labels: newbie > > org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long, > contains a long if-else chain and many many conditions. > As a start, the bodies of the conditions could be extracted to methods and a > more clean solution could be introduced to parse the argument values. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9457) Integrate custom resource metrics better for FairScheduler
Szilard Nemeth created YARN-9457: Summary: Integrate custom resource metrics better for FairScheduler Key: YARN-9457 URL: https://issues.apache.org/jira/browse/YARN-9457 Project: Hadoop YARN Issue Type: Improvement Reporter: Szilard Nemeth YARN-8842 added org.apache.hadoop.yarn.server.resourcemanager.scheduler.QueueMetricsForCustomResources. This class stores all metrics data for custom resource types. A field is there in QueueMetrics to hold an object of this class. Similarly, YARN-9322 added FSQueueMetricsForCustomResources and added an object of this class to FSQueueMetrics. This jira is about to investigate how it is possible to integrate QueueMetricsForCustomResources into QueueMetrics and FSQueueMetricsForCustomResources into FSQueueMetrics. The trick is that the Metrics annotation (org.apache.hadoop.metrics2.annotation.Metric) is used to expose values on JMX. We need to implement a mechanism where QueueMetrics / FSQueueMetrics classes do contain a field of the custom resource values which is a map of resource names as keys, and longs as values. This way, we don't need the new classes (QueueMetricsForCustomResources and FSQueueMetricsForCustomResources), the code could be much cleaner and consistent. The hardest part possibly is to find a way to expose metrics values from a map. We obviously can't use the Metrics annotation so a mechanism is required to expose the values on JMX. For a quick search, I haven't found any way like this in the code [~wilfreds]: Are you aware of any way to expose values like this? Most probably, we need to check how the Metrics annotation is processed, understand the whole flow and check what is the underlying mechanism of the metrics propagation to the JMX interface. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
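One direction for the map-based field the jira proposes, sketched in plain Java without the Hadoop metrics2 dependency (all names here are illustrative, not an actual QueueMetrics API): keep per-resource values in a map and emit one gauge per entry at snapshot time, instead of one annotated @Metric field per resource type. In metrics2 terms the snapshot loop would live in an implementation of MetricsSource#getMetrics, calling the collector once per map entry, so new resource types appear on JMX without new annotated fields.

```java
import java.util.Map;
import java.util.TreeMap;

public class CustomResourceGauges {
    // resource name -> current value, e.g. "yarn.io/gpu" -> 3
    private final Map<String, Long> values = new TreeMap<>();

    public void set(String resourceName, long value) {
        values.put(resourceName, value);
    }

    // Stand-in for the per-entry gauge emission: in metrics2 this loop would
    // add one gauge to the collector for each map entry; here it just returns
    // a defensive copy of the current snapshot.
    public Map<String, Long> snapshot() {
        return new TreeMap<>(values);
    }

    public static void main(String[] args) {
        CustomResourceGauges g = new CustomResourceGauges();
        g.set("yarn.io/gpu", 3L);
        g.set("yarn.io/fpga", 1L);
        System.out.println(g.snapshot());
    }
}
```

The open question raised in the jira remains the real one: whether the metrics2 annotation processor allows mixing such dynamically emitted gauges with annotated fields in the same record.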
[jira] [Created] (YARN-9456) Class ResourceMappings uses a List of Serializables instead of more specific types
Szilard Nemeth created YARN-9456: Summary: Class ResourceMappings uses a List of Serializables instead of more specific types Key: YARN-9456 URL: https://issues.apache.org/jira/browse/YARN-9456 Project: Hadoop YARN Issue Type: Improvement Reporter: Szilard Nemeth List<Serializable> is used everywhere across ResourceMappings. This class should receive a Class and cast the list if possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
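The "receive a Class and cast the list" idea could look like the following sketch (illustrative names only, not the actual ResourceMappings API): a typed accessor narrows the stored List of Serializables to the element type the caller expects, failing fast with a ClassCastException on mismatch instead of at some later use site.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TypedMappings {
    // Narrow a List<Serializable> to a specific element type so callers
    // get a List<T> instead of casting each element themselves.
    public static <T extends Serializable> List<T> getAs(
            List<Serializable> raw, Class<T> type) {
        return raw.stream().map(type::cast).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Serializable> raw = Arrays.<Serializable>asList("gpu-0", "gpu-1");
        List<String> devices = getAs(raw, String.class);
        System.out.println(devices);
    }
}
```

Class#cast does the per-element check at retrieval time, which is the usual way to keep such a heterogeneous store type-safe without unchecked casts.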
[jira] [Commented] (YARN-9050) [Umbrella] Usability improvements for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811744#comment-16811744 ] Tao Yang commented on YARN-9050: Hi, [~adam.antal]. As far as I know, currently activities is only used by CS but logically it's a common module and can be used by any types of schedulers. Some improvements like 3) will do some basic modifications and can be called by fair scheduler to get details such as insufficient resource diagnosis. It's wonderful to hear that these improvements can be used by FS, and I would like to discuss further details. > [Umbrella] Usability improvements for scheduler activities > -- > > Key: YARN-9050 > URL: https://issues.apache.org/jira/browse/YARN-9050 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: image-2018-11-23-16-46-38-138.png > > > We have did some usability improvements for scheduler activities based on > YARN3.1 in our cluster as follows: > 1. Not available for multi-thread asynchronous scheduling. App and node > activities maybe confused when multiple scheduling threads record activities > of different allocation processes in the same variables like appsAllocation > and recordingNodesAllocation in ActivitiesManager. I think these variables > should be thread-local to make activities clear among multiple threads. > 2. Incomplete activities for multi-node lookup mechanism, since > ActivitiesLogger will skip recording through \{{if (node == null || > activitiesManager == null) }} when node is null which represents this > allocation is for multi-nodes. We need support recording activities for > multi-node lookup mechanism. > 3. Current app activities can not meet requirements of diagnostics, for > example, we can know that node doesn't match request but hard to know why, > especially when using placement constraints, it's difficult to make a > detailed diagnosis manually. 
So I propose to improve the diagnostics of > activities: add a diagnosis for the placement-constraints check, update the > insufficient-resource diagnosis with detailed info (like 'insufficient > resource names:[memory-mb]'), and so on. > 4. Add more useful fields for app activities. In some scenarios we need to > distinguish different requests but can't locate requests based on the app > activities info; some other fields can help to filter what we want, > such as allocation tags. We have added containerPriority, allocationRequestId > and allocationTags fields in AppAllocation. > 5. Filter app activities by key fields. Sometimes the results of app > activities are massive and it's hard to find what we want. We have supported filtering > by allocation-tags to meet requirements from some apps; moreover, we can > take container-priority and allocation-request-id as candidates if necessary. > 6. Aggregate app activities by diagnoses. For a single allocation process, > activities can still be massive in a large cluster. We frequently want to > know why a request can't be allocated in the cluster, and it's hard to check every node > manually at that scale, so aggregating app activities by > diagnoses is necessary to address this. We have added a groupingType > parameter to the app-activities REST API for this, which supports grouping by > diagnostics. > I think we can have a discussion about these points; useful improvements that > are accepted will be added to the patch. Thanks. > Running design doc is attached > [here|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.2jnaobmmfne5]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
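Item 1 above (making the recording variables thread-local) can be sketched in plain Java. This is a hypothetical stand-in, not the actual ActivitiesManager API; the class and method names are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of per-thread activity buffers: each scheduling thread records
// into its own list, so concurrent allocation passes can no longer
// interleave records in a shared appsAllocation-style variable.
class ThreadLocalActivities {
    private static final ThreadLocal<List<String>> APPS_ALLOCATION =
        ThreadLocal.withInitial(ArrayList::new);

    // Record one activity entry for the calling thread only.
    static void record(String activity) {
        APPS_ALLOCATION.get().add(activity);
    }

    // Drain the current thread's buffer, e.g. at the end of one allocation pass.
    static List<String> drain() {
        List<String> out = new ArrayList<>(APPS_ALLOCATION.get());
        APPS_ALLOCATION.get().clear();
        return out;
    }
}
```

Because each thread sees only its own buffer, no locking is needed on the hot scheduling path.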
[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811758#comment-16811758 ] Hadoop QA commented on YARN-9313: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 28s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 11s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 2 extant Findbugs warnings. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 31s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 2 new + 121 unchanged - 0 fixed = 123 total (was 121) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 11s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 76m 4s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}124m 57s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.TestResourceTrackerService | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9313 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12965108/YARN-9313.004.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 78993416744d 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ec143cb | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | findbugs |
[jira] [Created] (YARN-9455) SchedulerInvalidResoureRequestException has a typo in its class (and file) name
Szilard Nemeth created YARN-9455: Summary: SchedulerInvalidResoureRequestException has a typo in its class (and file) name Key: YARN-9455 URL: https://issues.apache.org/jira/browse/YARN-9455 Project: Hadoop YARN Issue Type: Improvement Reporter: Szilard Nemeth The class name should be: SchedulerInvalidResourceRequestException
[jira] [Updated] (YARN-9455) SchedulerInvalidResoureRequestException has a typo in its class (and file) name
[ https://issues.apache.org/jira/browse/YARN-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-9455: - Labels: newbie (was: ) > SchedulerInvalidResoureRequestException has a typo in its class (and file) > name > --- > > Key: YARN-9455 > URL: https://issues.apache.org/jira/browse/YARN-9455 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Priority: Major > Labels: newbie > > The class name should be: SchedulerInvalidResourceRequestException
[jira] [Created] (YARN-9454) Add detailed log about list applications command
Szilard Nemeth created YARN-9454: Summary: Add detailed log about list applications command Key: YARN-9454 URL: https://issues.apache.org/jira/browse/YARN-9454 Project: Hadoop YARN Issue Type: Improvement Reporter: Szilard Nemeth When a user lists YARN applications with the RM admin CLI, we have one audit log here (https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ClientRMService.java#L924). However, more extensive logging could be added. This is the call chain when such a list command gets executed (from bottom to top): {code:java} org.apache.hadoop.yarn.server.resourcemanager.ClientRMService#getApplications org.apache.hadoop.yarn.client.api.impl.YarnClientImpl#getApplications(java.util.Set, java.util.EnumSet, java.util.Set) ApplicationCLI.listApplications(Set, EnumSet, Set) (org.apache.hadoop.yarn.client.cli) ApplicationCLI.run(String[]) (org.apache.hadoop.yarn.client.cli) {code} org.apache.hadoop.yarn.server.resourcemanager.ClientRMService#getApplications: This is the place that fits perfectly for adding a more detailed log message about the request or the response (or both). In my opinion, a trace (or debug) level log would be great at the end of this method, logging the whole response, so any potential issues with the code can be diagnosed more easily.
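A minimal sketch of the proposed debug-level summary. This uses java.util.logging purely for illustration (ClientRMService uses a different logging facade), and the class, method, and field names below are assumptions, not the real API:

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

class ListAppsLogging {
    private static final Logger LOG = Logger.getLogger("ClientRMService");

    // Build the summary in a separate method so it can be unit-tested.
    static String summarize(String user, List<String> appIds) {
        return "getApplications: user=" + user
            + " returned=" + appIds.size() + " ids=" + appIds;
    }

    // Guarded call at the end of a getApplications-style method: the
    // summary string is only built when the debug level is enabled.
    static void logResponse(String user, List<String> appIds) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(summarize(user, appIds));
        }
    }
}
```

The guard matters here because the response can contain thousands of applications, and serializing it unconditionally would be costly.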
[jira] [Updated] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9313: --- Attachment: YARN-9313.004.patch > Support asynchronized scheduling mode and multi-node lookup mechanism for > scheduler activities > -- > > Key: YARN-9313 > URL: https://issues.apache.org/jira/browse/YARN-9313 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9313.001.patch, YARN-9313.002.patch, > YARN-9313.003.patch, YARN-9313.004.patch > > > [Design > doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]
[jira] [Updated] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run
[ https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-9453: - Description: org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long, contains a long if-else chain and many conditions. As a start, the bodies of the conditions could be extracted to methods and a cleaner solution could be introduced to parse the argument values. > Clean up code long if-else chain in ApplicationCLI#run > -- > > Key: YARN-9453 > URL: https://issues.apache.org/jira/browse/YARN-9453 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Priority: Major > > org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long, > contains a long if-else chain and many conditions. > As a start, the bodies of the conditions could be extracted to methods and a > cleaner solution could be introduced to parse the argument values.
[jira] [Updated] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run
[ https://issues.apache.org/jira/browse/YARN-9453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-9453: - Labels: newbie (was: ) > Clean up code long if-else chain in ApplicationCLI#run > -- > > Key: YARN-9453 > URL: https://issues.apache.org/jira/browse/YARN-9453 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Priority: Major > Labels: newbie > > org.apache.hadoop.yarn.client.cli.ApplicationCLI#run is 630 lines long, > contains a long if-else chain and many conditions. > As a start, the bodies of the conditions could be extracted to methods and a > cleaner solution could be introduced to parse the argument values.
[jira] [Commented] (YARN-9313) Support asynchronized scheduling mode and multi-node lookup mechanism for scheduler activities
[ https://issues.apache.org/jira/browse/YARN-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811725#comment-16811725 ] Tao Yang commented on YARN-9313: Attached v4 patch. Thanks [~cheersyang] for your advice. > Support asynchronized scheduling mode and multi-node lookup mechanism for > scheduler activities > -- > > Key: YARN-9313 > URL: https://issues.apache.org/jira/browse/YARN-9313 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9313.001.patch, YARN-9313.002.patch, > YARN-9313.003.patch, YARN-9313.004.patch > > > [Design > doc|https://docs.google.com/document/d/1pwf-n3BCLW76bGrmNPM4T6pQ3vC4dVMcN2Ud1hq1t2M/edit#heading=h.d2ru7sigsi7j]
[jira] [Created] (YARN-9453) Clean up code long if-else chain in ApplicationCLI#run
Szilard Nemeth created YARN-9453: Summary: Clean up code long if-else chain in ApplicationCLI#run Key: YARN-9453 URL: https://issues.apache.org/jira/browse/YARN-9453 Project: Hadoop YARN Issue Type: Improvement Reporter: Szilard Nemeth
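The refactoring proposed for YARN-9453 (extract each branch body into a named method) can be taken one step further with a command-to-handler map. This is a hedged sketch only; `AppCliDispatcher` and its handler names are illustrative, not the real ApplicationCLI structure:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class AppCliDispatcher {
    // Maps a CLI flag to the method that handles it; each body of the
    // former if-else chain becomes one named, testable method.
    private final Map<String, Function<String[], Integer>> handlers = new HashMap<>();

    AppCliDispatcher() {
        handlers.put("-list", this::listApplications);
        handlers.put("-kill", this::killApplication);
    }

    // Replaces the long if-else chain: look up the handler and delegate.
    int run(String[] args) {
        if (args.length == 0 || !handlers.containsKey(args[0])) {
            return -1; // unknown command; the real CLI would print usage here
        }
        return handlers.get(args[0]).apply(args);
    }

    private int listApplications(String[] args) { return 0; }
    private int killApplication(String[] args) { return 0; }
}
```

Adding a new subcommand then only requires one handler method and one `handlers.put` line, instead of growing the chain.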
[jira] [Commented] (YARN-7721) TestContinuousScheduling fails sporadically with NPE
[ https://issues.apache.org/jira/browse/YARN-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811770#comment-16811770 ] Szilard Nemeth commented on YARN-7721: -- Hi [~wilfreds]! Thanks for this patch! +1 (non-binding) > TestContinuousScheduling fails sporadically with NPE > > > Key: YARN-7721 > URL: https://issues.apache.org/jira/browse/YARN-7721 > Project: Hadoop YARN > Issue Type: Bug > Components: fairscheduler >Affects Versions: 3.1.0 >Reporter: Jason Lowe >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-7721.001.patch > > > TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime is > failing sporadically with an NPE in precommit builds, and I can usually > reproduce it locally after a few tries: > {noformat} > [ERROR] > testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling) > Time elapsed: 0.085 s <<< ERROR! > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:383) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) > [...] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9123) Clean up and split testcases in TestNMWebServices for GPU support
[ https://issues.apache.org/jira/browse/YARN-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811769#comment-16811769 ] Szilard Nemeth commented on YARN-9123: -- Hi [~jojochuang]! The only remaining issue in the checkstyle logs is this: {code:java} TestNMWebServices.java:196: public long a = NM_RESOURCE_VALUE;:19: Variable 'a' must be private and have accessor methods. [VisibilityModifier] {code} May I deal with this, or is this patch ready for commit? Thanks! > Clean up and split testcases in TestNMWebServices for GPU support > - > > Key: YARN-9123 > URL: https://issues.apache.org/jira/browse/YARN-9123 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-9123.001.patch, YARN-9123.002.patch, > YARN-9123.003.patch, YARN-9123.004.patch, YARN-9123.005.patch, > YARN-9123.006.patch > > > The following testcases can be cleaned up a bit: > TestNMWebServices#testGetNMResourceInfo - Can be split into 3 different cases > TestNMWebServices#testGetYarnGpuResourceInfo
[jira] [Commented] (YARN-9437) RMNodeImpls occupy too much memory and causes RM GC to take a long time
[ https://issues.apache.org/jira/browse/YARN-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811738#comment-16811738 ] qiuliang commented on YARN-9437: As I understand it, there are two cases that may cause the completedContainers in RMNodeImpl to not be released. 1. When RMAppAttemptImpl receives the CONTAINER_FINISHED (not amContainer) event, it will add this container to justFinishedContainers. When processing the AM heartbeat, RMAppAttemptImpl first sends the containers in finishedContainersSentToAM to the NM, and RMNodeImpl also removes these containers from completedContainers. It then transfers the containers in justFinishedContainers to finishedContainersSentToAM and waits for the next AM heartbeat to send them to the NM. If RMAppAttemptImpl receives the AM unregistration event while justFinishedContainers is not empty, the containers in justFinishedContainers may never be transferred to finishedContainersSentToAM, so they are not sent to the NM and RMNodeImpl does not release them. 2. When RMAppAttemptImpl is in a final state and receives the CONTAINER_FINISHED event, it just adds the container to justFinishedContainers and never sends it to the NM. For the first case, my idea is that when RMAppAttemptImpl handles the amContainer finished event, the containers in justFinishedContainers are transferred to finishedContainersSentToAM and sent to the NM along with the amContainer. I am not sure if there is any other impact. For the second case, when RMAppAttemptImpl is in a final state and receives the CONTAINER_FINISHED event, these containers would be sent directly to the NM, but I am worried that this will generate many events. 
> RMNodeImpls occupy too much memory and causes RM GC to take a long time > --- > > Key: YARN-9437 > URL: https://issues.apache.org/jira/browse/YARN-9437 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.9.1 >Reporter: qiuliang >Priority: Minor > Attachments: 1.png, 2.png, 3.png > > > We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of > RM memory is occupied by RMNodeImpl. Analysis of the RM memory found that each > RMNodeImpl takes approximately 14 MB. The reason is that there are 130,000+ > entries in completedContainers in each RMNodeImpl that have not been released.
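The first fix qiuliang proposes above (flush justFinishedContainers when the AM container itself finishes) can be modeled with a toy class. `AttemptContainerBookkeeping` is a simplified stand-in for RMAppAttemptImpl's bookkeeping, not the real class, which tracks ContainerStatus objects and many more states:

```java
import java.util.ArrayList;
import java.util.List;

class AttemptContainerBookkeeping {
    // Containers finished but not yet queued for acknowledgement to the NM.
    final List<String> justFinishedContainers = new ArrayList<>();
    // Containers queued to be acked to the NM on the next heartbeat cycle.
    final List<String> finishedContainersSentToAM = new ArrayList<>();

    void onContainerFinished(String containerId, boolean isAmContainer) {
        justFinishedContainers.add(containerId);
        if (isAmContainer) {
            // Proposed fix: don't wait for an AM heartbeat that will never
            // come; move everything pending so it can be acked to the NM
            // and RMNodeImpl can drop it from completedContainers.
            finishedContainersSentToAM.addAll(justFinishedContainers);
            justFinishedContainers.clear();
        }
    }
}
```

In this model, nothing can be stranded in justFinishedContainers once the AM container finishes, which is exactly the leak described for case 1.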
[jira] [Commented] (YARN-9413) Queue resource leak after app fail for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811729#comment-16811729 ] Tao Yang commented on YARN-9413: Thanks [~cheersyang], [~snemeth] for the review and commit. {quote} could you please take a look if this issue happens in branch-3.0 too? If it does, please help to provide a patch for branch-3.0. {quote} Yes, it does. I have attached a patch for branch-3.0 and added a test only for the capacity scheduler in this branch, since TestAMRestart doesn't extend ParameterizedSchedulerTestBase to test both the capacity and fair schedulers, and this issue won't happen for the fair scheduler. > Queue resource leak after app fail for CapacityScheduler > > > Key: YARN-9413 > URL: https://issues.apache.org/jira/browse/YARN-9413 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.1.2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9413.001.patch, YARN-9413.002.patch, > YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, > image-2019-03-29-10-47-47-953.png > > > To reproduce this problem: > # Submit an app which is configured to keep containers across app attempts > and should fail after AM finished at first time (am-max-attempts=1). > # App is started with 2 containers running on NM1 node. > # Fail the AM of the application with PREEMPTED exit status which should not > count towards max attempt retry but app will fail immediately. > # Used resource of this queue leaks after app fail. 
> The root cause is the inconsistency of handling app attempt failure between > RMAppAttemptImpl$BaseFinalTransition#transition and > RMAppImpl$AttemptFailedTransition#transition: > # After the app fails, RMAppFailedAttemptEvent will be sent in > RMAppAttemptImpl$BaseFinalTransition#transition; if the exit status of the AM > container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it > will not count towards max attempt retry, so it will send > AppAttemptRemovedSchedulerEvent with keepContainersAcrossAppAttempts=true and > RMAppFailedAttemptEvent with transferStateFromPreviousAttempt=true. > # RMAppImpl$AttemptFailedTransition#transition handles > RMAppFailedAttemptEvent and will fail the app if its max app attempts is 1. > # CapacityScheduler handles AppAttemptRemovedSchedulerEvent in > CapacityScheduler#doneApplicationAttempt; it will skip killing and completing > the containers belonging to this app, so the queue resource > leak happens.
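The leak in step 3 can be illustrated with a toy queue-accounting model. `QueueUsage` and its flags are illustrative stand-ins, not the actual CapacityScheduler code; the sketch only shows why skipping container completion when the whole app fails leaves used resources stranded:

```java
// Toy model of queue resource accounting across an app-attempt removal.
class QueueUsage {
    int usedMb = 0;

    void allocate(int mb) { usedMb += mb; }

    // Corrected behavior in this model: containers transfer to the next
    // attempt only when one will actually exist. If the whole app is
    // failing, the live containers must always be released, otherwise
    // usedMb never returns to zero (the leak described above).
    void doneApplicationAttempt(int[] liveContainerMb, boolean keepContainers,
                                boolean appIsFailing) {
        if (keepContainers && !appIsFailing) {
            return; // containers survive into the next attempt
        }
        for (int mb : liveContainerMb) {
            usedMb -= mb;
        }
    }
}
```

The buggy path corresponds to returning early whenever keepContainers is true, regardless of whether the app itself is failing.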
[jira] [Commented] (YARN-9445) yarn.admin.acl is futile
[ https://issues.apache.org/jira/browse/YARN-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16811766#comment-16811766 ] Szilard Nemeth commented on YARN-9445: -- I will let [~shuzirra] answer the concerns here; in the meantime, let's involve [~wilfreds] as well! > yarn.admin.acl is futile > > > Key: YARN-9445 > URL: https://issues.apache.org/jira/browse/YARN-9445 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Affects Versions: 3.3.0 >Reporter: Peter Simon >Assignee: Gergely Pollak >Priority: Major > Attachments: YARN-9445.001.patch > > > * Define a queue with restrictive administerApps settings (e.g. yarn) > * Set yarn.admin.acl to "*". > * Try to submit an application with user yarn, it is denied. > My expectation would be that since everyone is an admin, I can > submit to any pool. >
[jira] [Updated] (YARN-9413) Queue resource leak after app fail for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9413: --- Attachment: YARN-9413.branch-3.0.001.patch > Queue resource leak after app fail for CapacityScheduler > > > Key: YARN-9413 > URL: https://issues.apache.org/jira/browse/YARN-9413 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.1.2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0, 3.2.1, 3.1.3 > > Attachments: YARN-9413.001.patch, YARN-9413.002.patch, > YARN-9413.003.patch, YARN-9413.branch-3.0.001.patch, > image-2019-03-29-10-47-47-953.png > > > To reproduce this problem: > # Submit an app which is configured to keep containers across app attempts > and should fail after AM finished at first time (am-max-attempts=1). > # App is started with 2 containers running on NM1 node. > # Fail the AM of the application with PREEMPTED exit status which should not > count towards max attempt retry but app will fail immediately. > # Used resource of this queue leaks after app fail. > The root cause is the inconsistency of handling app attempt failure between > RMAppAttemptImpl$BaseFinalTransition#transition and > RMAppImpl$AttemptFailedTransition#transition: > # After app fail, RMAppFailedAttemptEvent will be sent in > RMAppAttemptImpl$BaseFinalTransition#transition, if exit status of AM > container is PREEMPTED/ABORTED/DISKS_FAILED/KILLED_BY_RESOURCEMANAGER, it > will not count towards max attempt retry, so that it will send > AppAttemptRemovedSchedulerEvent with keepContainersAcrossAppAttempts=true and > RMAppFailedAttemptEvent with transferStateFromPreviousAttempt=true. > # RMAppImpl$AttemptFailedTransition#transition handle > RMAppFailedAttemptEvent and will fail the app if its max app attempts is 1. 
> # CapacityScheduler handles AppAttemptRemovedSchedulerEvent in > CapacityScheduler#doneApplicationAttempt; it will skip killing and completing > the containers belonging to this app, so the queue resource > leak happens.