[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16555962#comment-16555962 ] Hudson commented on YARN-4606: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #14641 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/14641/]) YARN-4606. CapacityScheduler: applications could get starved because (ericp: rev 9485c9aee6e9bb935c3e6ae4da81d70b621781de) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/UsersManager.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestContainerAllocation.java > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.007.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16553038#comment-16553038 ] Eric Payne commented on YARN-4606: -- +1 Thanks for all of your work on this JIRA, [~maniraj...@gmail.com]. I will be committing this later today or tomorrow. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.007.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552698#comment-16552698 ] Manikandan R commented on YARN-4606: Unit test failure is not related to this patch > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.007.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551179#comment-16551179 ] genericqa commented on YARN-4606: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 29s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 0s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 11s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 77m 35s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}133m 23s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.metrics.TestSystemMetricsPublisher | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:ba1ab08 | | JIRA Issue | YARN-4606 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12932447/YARN-4606.007.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux c122c2c6bdac 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 5c19ee3 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/21320/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21320/testReport/ | | Max. process+thread count | 921 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U:
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551016#comment-16551016 ] Manikandan R commented on YARN-4606: [~eepayne] Attaching .007 patch containing #2, #3 & #4. Please check. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.007.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16549772#comment-16549772 ] Eric Payne commented on YARN-4606: -- bq. Will take care of #2, #3 & #4. [~maniraj...@gmail.com]. I think that once you make these changes the unit tests will be fine as they are. I tried to create a unit test that would starve the queue without this patch, but the only way I could come up with was to start a MiniYarnCluster. I think that extra overhead will not give us enough extra code coverage to justify the complexity. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16548093#comment-16548093 ] Manikandan R commented on YARN-4606: Thanks [~eepayne] for your comments. {quote} I have a general concern that these tests are not testing the fix to the starvation problem outlined in the description of this JIRA. I'm trying to determine if there is a clean way to unit test that use case. {quote} Ok. Since Active app starvation happens because of less resource allocation based on incorrect active users count, in addition to checking active users count, Can we check allocated resources for each user? Is it good enough? Earlier, resource allocation (amount of memory, vcores) should be lesser (half of the allocation with this patch based on the example given in jira description). Whereas, with this patch, it should be higher. (or) With this patch, app should complete faster than before because of proper resource allocation as expected. Can we simulate this in test cases and check the app completion time? Will take care of #2, #3 & #4. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16547161#comment-16547161 ] Eric Payne commented on YARN-4606: -- Thank you, [~maniraj...@gmail.com], for the latest patch. The code changes look good. However, I have a couple of points with the tests. - I have a general concern that these tests are not testing the fix to the starvation problem outlined in the description of this JIRA. I'm trying to determine if there is a clean way to unit test that use case. - {TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}}: I am concerned about new tests that take longer than necessary because the unit tests keep taking longer and longer to run. I think that the following things can be done to reduce this test time (in my build environment) from 1min 17sec to 24 sec. -- In the following code, the sleep(5000) outside of the for loop is not necessary. -- In the following code, the sleep(5000) inside of the for loop could be cut down to sleep(500). {code:title=TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps} Thread.sleep(5000); //Triggering this event so that user limit computation can //happen again for (int i = 0; i < 10; i++) { cs.handle(new NodeUpdateSchedulerEvent(rmNode1)); Thread.sleep(5000); } {code} - {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps1}}: I don't think this test is necessary. It takes more than 1:20 to run in my build environment, and as far as I can tell, it is verifying that the active users count is not ever updated after a move if node heartbeats are not received. However, in a running YARN installation, node heartbeats are received every second (by default). Unless I'm missing something, this isn't a use case that one would encounter in a running Hadoop system. - {{TestCapacityScheduler#setupQueueConfigurationForActiveUsersChecks}}: The parameters to {{conf.setUserLimitFactor(...)}} don't need to be 100.0f. User limit factor can be thought of as the multiplier for the amount of a queue that one user can consume. So, if the user limit factor is 1.0f, one user can use the capacity of the queue. If it is 2.0f, one user can use twice the capacity of the queue, and so forth. Since these queues have a capacity of 50%, I would set this to 2.0f. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16539156#comment-16539156 ] genericqa commented on YARN-4606: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 22s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 58s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 10s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 18s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 71m 42s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}131m 6s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-4606 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12931034/YARN-4606.006.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 0c5bd396 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d503f65 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21204/testReport/ | | Max. process+thread count | 912 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/21204/console | | Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org | This message was automatically generated. > CapacityScheduler: applications could get
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538949#comment-16538949 ] Manikandan R commented on YARN-4606: Fixed whitespace related issues. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, > YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, > YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538156#comment-16538156 ] genericqa commented on YARN-4606: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 39s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 26m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 35s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 41s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 23 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch 4 line(s) with tabs. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 16s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 73m 35s{color} | {color:green} hadoop-yarn-server-resourcemanager in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}133m 40s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-4606 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12930945/YARN-4606.005.patch | | Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux c6a11600 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 9bd5bef | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_171 | | findbugs | v3.1.0-RC1 | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/21198/artifact/out/whitespace-eol.txt | | whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/21198/artifact/out/whitespace-tabs.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/21198/testReport/ | | Max. process+thread count | 926 (vs. ulimit of 1) | | modules |
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538043#comment-16538043 ] Manikandan R commented on YARN-4606: [~eepayne] Thanks for the explanation. I've attached .005 patch containing changes and its related test cases for your review. Test cases covers the above discussed use cases as well, can be removed before committing the changes. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537497#comment-16537497 ] Eric Payne commented on YARN-4606: -- {quote} So, after move app operation and if there is no events (which can trigger user limit computation) for brief amount of time, am seeing incorrect numActiveUsersWithOnlyPendingApps count. Is this acceptable? or Should we trigger user limit computation after move operation like how we are doing it in other places? {quote} [~maniraj...@gmail.com], moving the apps to the other queue does trigger a recalculation of user limits since it is adding a new app to the queue and potentially a new user. Also, since it fixes itself after the nodemanager heartbeat of a few seconds (default is 1), I think that is fine. I noticed that there is another bug related to moving apps to a different queue that could affect the above use case. See YARN-8421 > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526328#comment-16526328 ] Manikandan R commented on YARN-4606: [~eepayne] Thank you for great explanation. I am able to understand the flow better now. I revisited "move apps" problem which i raised earlier based on new patch and don't think it requires any changes as variables required to calculate numActiveUsersWithOnlyPendingApps are already being set through submitApplication, finishApplication etc calls. However, I am seeing an minor update issue as described below: Lets say, We want to move all apps from queue, A1 to queue, B1. A1 has 4 apps (Only 2 were accommodated because of max am limit constraint. So, remaining 2 not yet activated). All these 4 apps are triggered by different users from u1 to u4. For example app1 by u1 and so on. Only for app 1 & app2, there is an allocate request in pipeline. At this point, {{numActiveUsers}} is 4 and {{numActiveUsersWithOnlyPendingApps}} is 2 in Queue, A1. Now move has been triggered. Since there were running containers for both app 1 and app 2, app3 and app4 has been activated before app 1 and app 2 in Queue, B1 as both these apps were busy in detaching and attaching containers. After the move operation and thread sleep of 5s, pulled these counts expecting u1 and u2 as ActiveUsersWithOnlyPendingApps, but couldn't able to see it. {{numActiveUsers}} is 2 as u3 and u4 had become active users and {{numActiveUsersWithOnlyPendingApps}} is 0 in Queue B1. Then, introduced an NodeUpdate event after the move operation just to force the user limit computation to see the impact on these counts. Now, can able to ActiveUsersWithOnlyPendingApps as 2 and ActiveUsers as 0 (as both u3 and u4 had become non active users by this time as there are no pending allocate request). So, after move app operation and if there is no events (which can trigger user limit computation) for brief amount of time, am seeing incorrect {{numActiveUsersWithOnlyPendingApps}} count. Is this acceptable? or Should we trigger user limit computation after move operation like how we are doing it in other places? Please share your thoughts and correct my understanding if you see a gap > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522811#comment-16522811 ] Eric Payne commented on YARN-4606: -- {quote}At the same time, this patch is less "strict" in terms of updates (specifically on when? ) compared to approaches discussed in our earlier patches. {quote} The value for number of active apps per user used to be calculated every time through the scheduler loop, which was a performance problem. In order to avoid this heavy calculation, YARN-5889 created the {{UsersManager}}. Instead of doing the calculation every time through the loop, YARN-5889 only recalculates these values when events occurs that could affect this count like new application, app completes, new container request, completed container, etc. In the latest POC patch, {{activeUsersWithOnlyPendingApps}} is part of this flow, so it will always be updated whenever anything happens that could affect this value. {quote}Also, based on our earlier discussions, We need to depend on activeUsers.get() only in certain context and sum of activeUsers.get() and activeUsersWithOnlyPendingApps.get() in some other places. But POC patch always depends on later value. I didn't understand this part. {quote} I think you are referencing this comment from above: {quote}My understanding is that user limit would use activeUsers and things like max AM limit per user, we'd use activeUsers + activeUsersOfPendingApps {quote} {{LeafQueue#activateApplications}} is the only thing that calls {{UsersManager#getNumActiveUsers}}, which it uses to calculate the user-specific AM limit, so it's the one that needs both activeusers + {{activeUsersWithOnlyPendingApps}}. {{UsersManager#computeUserLimit}} uses only activeUsers to calculate the headroom and user limit, which is what we decided in the comment above. Is that your understanding of these comments? > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522599#comment-16522599 ] Manikandan R commented on YARN-4606: [~eepayne] Thanks for the patch. At a high level, POC is very simple from implementation perspective and changes would be minimal with this approach. At the same time, this patch is less "strict" in terms of updates (specifically on when? ) compared to approaches discussed in our earlier patches. For example, In earlier approach, numActiveUsersWithOnlyPendingApps would be incremented as soon as app gets activated and gets decremented as soon as AM container gets allocated. In addition, all of these things happens immediately and only after the dependent steps gets completed for sure. Whereas, new POC patch depends on the values (pendingApplications, activeApplications etc of User object), conditions before the actual work (for example, assuming AM container would be allocated successfully based on checks in LeafQueue#activateApplications) and updates numActiveUsersWithOnlyPendingApps as part of regular computeUserLimits flow. All these things is creating a slight discomfort and lead to some of the questions like What is the time frame that we are seeing between accepting the app and updating numActiveUsersWithOnlyPendingApps? Is this time frame acceptable? Aren't we running little slower in doing updates? Is there any chance by which AM container has been failed to allocate? Lets say, If AM container allocation goes through successfully, Would be there any delay in allocating AM containers? During this delayed duration, we are considering the user as active user rather than treating the user as "activeUsersWithOnlyPendingApps". Is this acceptable? I am interested in understanding your thoughts behind this tradeoff. Also, based on our earlier discussions, We need to depend on {{activeUsers.get()}} only in certain context and sum of {{activeUsers.get()}} and {{activeUsersWithOnlyPendingApps.get()}} in some other places. But POC patch always depends on later value. I didn't understand this part. On the other hand, We can avoid {{AppAMAttemptsFailedSchedulerEvent}} related changes completely with this new patch as anyway {{User.finishApplication()}} would be called for sure even when max AM attempts has been reached. Please share your thoughts. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520822#comment-16520822 ] Eric Payne commented on YARN-4606: -- [~maniraj...@gmail.com], we can fix the queue application starvation problem by making most of the changes in the scheduler-specific users managers. For {{CapacityScheduler}}, all the changes can be done in the {{UsersManager}} class. For the other schedulers (FIfo, Fair, etc.), I think there needs to be some amount of changes in the scheduler infrastructure classes to support retrieving iformation such as number of pending and active apps per user, amount of queue's AM limit resources, amount of a user's used AM resources, etc. But I think that most of the changes can be done in {{ActiveUsersManager}} for other schedulers as well. I am attaching a POC patch that only modifies {{UsersManager}}. The {{UsersManager}} already keeps track of all users in the queue. Each user object keeps the number of active apps and the number of pending apps. here is the sequence of events plus proposed change: - When an application is submitted, the user object's pending apps count is incremented - If limits are not exceeded, {{LeafQueue}} activates the app -- {{Leafqueue#activateApplications}} already checks whether or not activation of an application will go over the queue's AM limit. -- If activating the application will not go over the queue's AM limit, {{Leafqueue#activateApplications}} will increment the user object's active app count and decrement the pending app count. -- However, if activating the application will go over the queue's AM limit, the user's pending app count remains the same. - The change made in {{YARN-4606.POC.3.patch}} is that {{UsersManager#activateApplication}} will check whether or not the user object has any active apps. If not, it will not continue (thus not putting the user in the {{activeUsers}} list). I have not yet analyzed the problem you pointed out above regarding moving apps to different queues. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519562#comment-16519562 ] Eric Payne commented on YARN-4606: -- [~maniraj...@gmail.com], I am fine with using this JIRA to fix the {{CapacityScheduler}} and then using follow-on JIRAs to fix the other schedulers. However, I'm not comfortable putting {{CapacityScheduler}}-specific code in {{AppSchedulingInfo}}. I'm hoping that most of this code can be pushed down into the {{ActiveUsersManager}} (for {{FairScheduler}}) and {{UsersManager}} (for {{CapacityScheduler}}) code. I am investigating this now and should know if this is possible by early next week. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517812#comment-16517812 ] Manikandan R commented on YARN-4606: Ok, [~eepayne]. I thought of doing it once we settled down few more open issues and after thorough check on junits. anyways, not a problem. Can you share your thoughts on move app flow? > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517635#comment-16517635 ] genericqa commented on YARN-4606: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 35s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 28s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 15s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager in trunk has 1 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 43s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 42s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s{color} | {color:red} The patch has 51 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 39s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 31s{color} | {color:red} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 70m 59s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}132m 40s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | FindBugs | module:hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager | | | Switch statement found in org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(SchedulerEvent) where one case falls through to the next case At CapacityScheduler.java:where one case falls through to the next case At CapacityScheduler.java:[lines 1832-1838] | | Failed junit tests | hadoop.yarn.server.resourcemanager.TestRMHA | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation | | | hadoop.yarn.server.resourcemanager.scheduler.TestAppSchedulingInfo | | | hadoop.yarn.server.resourcemanager.reservation.TestCapacityOverTimePolicy | | | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd | | JIRA Issue | YARN-4606 | | JIRA Patch URL |
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517557#comment-16517557 ] Eric Payne commented on YARN-4606: -- I put this Jira into PATCH AVAILABLE mode so that it would kick the pre-commit build. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16513874#comment-16513874 ] Manikandan R commented on YARN-4606: Thanks [~eepayne] for your reviews. I was trying to address "move app" flow also in addition to your review comments, but stuck with it and took more time than expected. Sorry for the delay. I stuck with a case, admin trying to move an app (waiting for am container) from Queue A to Queue B. As part of this, control reaches {{AppScheduling#move}} through {{CapacityScheduler#moveApplication}}. As a first step, we will need to handle activeUsersWithPendingApps count for both queues. For example, After submitting the app to queue inside {{CapacityScheduler#moveApplication}}, we will need to do something like {quote} //Handle activeUsersWithOnlyPendingApps count appropriately if (app.isPending()) \{ this.getQueue(sourceQueueName).getAbstractUsersManager(). decrNumActiveUsersWithOnlyPendingApps(user); this.getQueue(destQueueName).getAbstractUsersManager(). incrNumActiveUsersWithOnlyPendingApps(user); } {quote} Then, Inside, {{AppScheduling#move}}, we will need to follow the logic similar to changes in {{AppScheduling#updatePendingResources}} to call {{UsersManager#activateApplications}}. Call to {{AppScheduling#updatePendingResources}} happens as part of Allocate flow every now and then. There is no such periodic calls for Move App. At some point, waitingForAMContainer become false for a given app and call to {{UsersManager#activateApplications}} happens and user got activated in normal app flow. We will need to handle the same even in Move App flow. I was thinking of waiting for some duration (possibly based on average am container allocation time? ) so that chance of getting container for am likely to happen. I am not sure. Attached patch contains this change as well. Please advise. Now, coming back to review comments: 1. Yes, it is scheduler specific. [~leftnoteasy] and [~sunilg] Please share your views. 2. For the first cut, I was thinking of fixing this JIRA for CS from end to end. Once fix has been ensured for CS, can apply similar changes to FS as well either with this jira or a different jira. If we are going to address FS related changes in different jira, is it ok to carry the risk you mentioned earlier? Please advise. Either, I can take help from folks who are familiar with FS flow or can hand over to them. Which ever is fine with us. 3. Addressed. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497232#comment-16497232 ] Eric Payne commented on YARN-4606: -- Thanks [~maniraj...@gmail.com] for the updated patch. Here are my comments so far: - I am concerned that this implementation adds code that is specific to {{CapacityScheduler}} inside of {{AppSchedulingInfo}}. I feel that this sets a precedent that makes it hard to maintain a clean separation between abstract and specific scheduler code. Also, this only fixes the problem for the {{CapacityScheduler}}. The previous fix in patch 001 was relying on metrics and I realize that is risky, but it was a more generic fix. I would be interested to hear thoughts from [~sunilg] and [~leftnoteasy]. - Only the {{CapacityScheduler}} has been changed to handle the new {{AppAMAttemptsFailedSchedulerEvent}}. Should the other schedulers handle that as well? If they don't handle it, don't we risk them getting unhandled event exceptions? - In all places where new {{LOG.debug(...)}} statementes are added, please also enclose them with {{if (LOG.isDebugEnabled())}}. This is for the sake of performance, so that the strings are not built, passed to {{LOG.debug()}}, and then thrown away if log debugging is not enabled. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496539#comment-16496539 ] Manikandan R commented on YARN-4606: Attached .003 patch for review. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.003.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496118#comment-16496118 ] Manikandan R commented on YARN-4606: [~eepayne] Thanks for reaching out proactively. Sorry for the delay, was completely offline little more than a week for personal work. I resumed activities on this yesterday and facing some issues in extracting am limit using SchedulerApplicationAttempt#getAppAttemptResourceUsage. I am in touch with [~sunilg] on this particular issue. Will upload patch as early as possible. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495656#comment-16495656 ] Eric Payne commented on YARN-4606: -- [~maniraj...@gmail.com], do you have a status on updating this patch? Do you need any help from the community? > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480628#comment-16480628 ] Manikandan R commented on YARN-4606: {quote}If we are planning to move calling decrNumActiveUsersOfPendingApps from AppSchedulingInfo#updatePendingResources to SchedulerApplicationAttempt, then do we still need to am usage check against max am limit? I don't think so.{quote} I was wrong here. We need this check. Missed related changes in patch, will incorporate and update patch shortly. Please ignore .002.patch. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.002.patch, > YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479016#comment-16479016 ] Manikandan R commented on YARN-4606: Attaching .002 patch for review. {quote}Does this patch handles the case that one user has multiple pending apps? (Since it doesn't store user to apps information).{quote} Started handling this case. {quote}Should we call this inside {{SchedulerApplicationAttempt#pullNewlyUpdatedContainers}}? I think we should remove active user from pending apps once AM container get allocated{quote} Yes, inside {{SchedulerApplicationAttempt#pullNewlyAllocatedContainers}} and that too after updating containers with tokens as {{SchedulerApplicationAttempt#pullNewlyUpdatedContainers}} does takes care of INCREASE, DECREASE, PROMOTE, DEMOTE cases etc not the regular cases. {quote}Instead of using metrics, it might be better to use SchedulerApplicationAttempt#getAppAttemptResourceUsage instead.{quote} Not required, I guess as explained in previous comment. {quote}I am doing an in-depth review, but I would like to address a few things first regarding method names and comments. I feel that it is important to be accurate in these areas in order to eliminate confusion for those maintaining this code.{quote} Taken care of all related comments. In addition to above changes, We have taken care of app being in ACCEPTED state with all AM attempts has been failed because of some reasons. We would like to decrement the count even in this case and handles this case via signalling scheduler using new event type. Also, I am assuming app MOVE from one queue to another doesn't require changes as it happen only when app is running? Thanks [~sunilg] for providing suggestions in some of the above steps. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468872#comment-16468872 ] Manikandan R commented on YARN-4606: {quote}1) Does this patch handles the case that one user has multiple pending apps? (Since it doesn't store user to apps information).{quote} Patch doesn't do anything about this case. As and when user submits an app, CS keeps increasing activeUsersOfPendingApps count as part of accepting the application irrespective of whether app has been submitted by same or different user. {quote}Should we call this inside SchedulerApplicationAttempt#pullNewlyUpdatedContainers? I think we should remove active user from pending apps once AM container get allocated{quote} While trying to understand this through a real testing, encountered a situation where in {{SchedulerApplicationAttempt#pullNewlyUpdatedContainers}} returns empty {{updatedContainers}} always. I was just thinking whether can we call {{abstractUsersManager.decrNumActiveUsersOfPendingApps()}} inside {{SchedulerApplicationAttempt#pullNewlyAllocatedContainers}} something like {code} if(! this.isWaitingForAMContainer() && ! hasActiveUsersOfPendingAppsDecremented.get()) { this.queue.getAbstractUsersManager().decrNumActiveUsersOfPendingApps(); hasActiveUsersOfPendingAppsDecremented.set(true); } {code} If we are planning to move calling {{decrNumActiveUsersOfPendingApps}} from {{AppSchedulingInfo#updatePendingResources}} to {{SchedulerApplicationAttempt}}, then do we still need to am usage check against max am limit? I don't think so. We faced the issue of accepting second app when we were calling {{decrNumActiveUsersOfPendingApps}} inside {{abstractUsersManager.activateApplication()}} and that too from {{AppSchedulingInfo#updatePendingResources}}. I dont think it is required anymore? {quote}Does hasActiveUsersOfPendingAppsDecremented need to be atomic? What is the benefit?{quote} Not required, I guess. Was trying to be too defensive :) Will address names and comments related review points once we conclude the flow. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466541#comment-16466541 ] Eric Payne commented on YARN-4606: -- {code:title=AppSchedulingInfo#updatePendingResources} if(! hasActiveUsersOfPendingAppsDecremented.get()) { abstractUsersManager.decrNumActiveUsersOfPendingApps(); hasActiveUsersOfPendingAppsDecremented.set(true); } {code} Does {{hasActiveUsersOfPendingAppsDecremented}} need to be atomic? What is the benefit? > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466136#comment-16466136 ] Eric Payne commented on YARN-4606: -- Thanks [~maniraj...@gmail.com] for your consistent and continuing efforts to fix this problem. I am doing an in-depth review, but I would like to address a few things first regarding method names and comments. I feel that it is important to be accurate in these areas in order to eliminate confusion for those maintaining this code. - All occurrences of "atleast" should be "at least" - Comment for {{AbstractUsersManager#getNumActiveUsers}}: {code:title=AbstractUsersManager#getNumActiveUsers} - * Get number of active users i.e. users with applications which have pending - * resource requests. + * Get number of active users i.e. users with atleast 1 active applications {code} For this comment, I would say "Get number of active users i.e. users with at least 1 running application and and applications requesting resources" - I would prefer it if the name of {{ActiveUsersOfPendingApps}} was changed everywhere to {{ActiveUsersWithOnlyPendingApps}}. This is kind of a nit, but I do feel that the rename would be more descriptive. - {{AbstractUsersManager#incrNumActiveUsersOfPendingApps}}, {{decrNumActiveUsersOfPendingApps}}, and {{getNumActiveUsersOfPendingApps}} Change description to "number of users with only pending apps" - {{UsersManager#activateApplication}} and {{deactivateApplication}} Change "Active users which has atleast 1 pending apps:" to "Active users which have at least 1 pending app:" > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463324#comment-16463324 ] Wangda Tan commented on YARN-4606: -- Thanks [~maniraj...@gmail.com], Some questions: 1) Does this patch handles the case that one user has multiple pending apps? (Since it doesn't store user to apps information). 2) {code} abstractUsersManager.decrNumActiveUsersOfPendingApps(); {code} Should we call this inside {{SchedulerApplicationAttempt#pullNewlyUpdatedContainers}}? I think we should remove active user from pending apps once AM container get allocated. 3) {code} Resources.lessThan(rc, cr, metrics.getUsedAMResources(), metrics.getMaxAMResources()) {code} Instead of using metrics, it might be better to use {{SchedulerApplicationAttempt#getAppAttemptResourceUsage}} instead. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459882#comment-16459882 ] Manikandan R commented on YARN-4606: Attaching .001 patch containing test case for CS and changes required to fix the problem (activeUsersOfPendingApps being-ve issue) mentioned in my previous comment. Also, tested the patch in my pseudo setup for CS. I am planning to run all junits after the review. {quote}AppSchedulingInfo is supports to cache status for pending resource, it might be better to avoid invoking SchedulerAppAttempt's method from AppSchedulingInfo.{quote} [~leftnoteasy] As of now, {{AppSchedulingInfo}} calls only {{SchedulerAppAttempt#isWaitingForAMContainer}} to take some decision, for which I don't see any cache. Can you please explain? > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.001.patch, YARN-4606.1.poc.patch, > YARN-4606.POC.2.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453297#comment-16453297 ] Wangda Tan commented on YARN-4606: -- Thanks [~eepayne] / [~maniraj...@gmail.com], Here's my understanding of the proposed approach: 1) When we compute {{max-am-resource-per-user}}, we uses #active-users + #pending-users. 2) When we compute {{max-user-limit}}, we use #active-users only. To me this is correct and (seems) same as what I proposed previously: {code} We should only consider a user is "active" if any of its application is active. And CS will use the "#active-user-which-has-at-least-one-active-app" to compute user-limit. Computation of max-am-resource-per-user needs to be updated as well. We should get a #users-which-has-pending-apps to compute max-am-resource-per-user. {code} I haven't checked very much details of the patch since [~maniraj...@gmail.com] is working on update the tests, etc. Just one suggestion is: AppSchedulingInfo is supports to cache status for pending resource, it might be better to avoid invoking SchedulerAppAttempt's method from AppSchedulingInfo. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Manikandan R >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449383#comment-16449383 ] Manikandan R commented on YARN-4606: Thanks [~eepayne] for the patch explaining your suggestions. I took the delta changes and started working on the Junits and testing in my pseudo cluster. I am encountering a situation where in activeUsersOfPendingApps is -ve at times (probably because of couple of calls to UsersManager#activeApplication() as part of the flow). May be, we need to take out decrement activeUsersOfPendingApps code from UsersManager#activeApplication() and can be called separately from clients. Will need to think through in detail and provide the patch. In the meantime, [~leftnoteasy] can also share his thoughts. Thanks. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449075#comment-16449075 ] Wangda Tan commented on YARN-4606: -- Thanks [~eepayne] / [~maniraj...@gmail.com] for working on the fix. I just unassigned myself, please feel free to assign to you if you plan to do that. I'm going to check the patch / approach in the next two days. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16443176#comment-16443176 ] Eric Payne commented on YARN-4606: -- [~maniraj...@gmail.com], I am attaching {{YARN-4606.POC.2.patch}} that demonstrates what I am suggesting. I have teste4d it in my pseudo cluster and it will only activate a new AM if the used queue AM resources are less than the queue max AM resources, but it also fixes the problem of starvation. I have only tested this on Capacity Scheduler, not Fair Scheduler. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, > YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432474#comment-16432474 ] Manikandan R commented on YARN-4606: [~leftnoteasy] [~sunilg] Can you please share your views? > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425736#comment-16425736 ] Eric Payne commented on YARN-4606: -- bq. could you briefly summary what is the current issue and solution being discussed? [~leftnoteasy], the latest patch ({{YARN-4606.POC.patch}}) changed the behavior of the capacity scheduler so that it would never give a container to the second app for its AM as long as the first app consumed the entire queue and had pending requests, even when the AM used is lower than AM max. I described it in more detail [above|https://issues.apache.org/jira/browse/YARN-4606?focusedCommentId=16391802=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16391802]. [I suggested that one solution|https://issues.apache.org/jira/browse/YARN-4606?focusedCommentId=16396094=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16396094] would be to modify the code as follows as long as there is a way to do it in an abstract way: {code:title=AppSchedulingInfo#updatePendingResources} if( Not Waiting For AM Container || (Queue Used AM Resources < Queue Max AM Resources) { abstractUsersManager.activateApplication(user, applicationId); } {code} I suggested a way to do that, but it seems a little cumbersome. So then I started wondering if there was a way to leverage the {{Schedulable Apps}} and {{Non-Schedulable Apps}} user info in the {{AppSchedulingInfo#updatePendingResources}} code. I looked more closely, however, and it is too early within {{AppSchedulingInfo#updatePendingResources}} to tell whether or not a new app is destined to be schedulable. So, I think the best suggestion I have is the pseudo-code I posted above. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425249#comment-16425249 ] Manikandan R commented on YARN-4606: {quote}I suspect it may be the first case. Please check to make sure your queue configuration is set to allow multiple running users in the queue. {quote} Yes, you are right. I can able to run App2 by User 2 after making the config changes. On the other hand, With the patch, container is not getting allocated to App2's AM while App1 is running. {quote}The concern is that even though they may be in both FSQueueMetrics, and CSQueueMetrics, they are not accessible at the abstract QueueMetrics layer because they have different accessors. It should be possible to add a new, abstract accessor in QueueMetrics that is implemented in FS/CS QueueMetrics. {quote} Yes, something similar can be done in Abstract class level to achieve this if required. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423456#comment-16423456 ] Wangda Tan commented on YARN-4606: -- [~eepayne], thanks for pinging me about this, I'm a bit lost contexts of above question, could you briefly summary what is the current issue and solution being discussed? > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423059#comment-16423059 ] Eric Payne commented on YARN-4606: -- {code:title=AppSchedulingInfo#updatePendingResources} if(! this.schedulerApplicationAttempt.isWaitingForAMContainer()) { abstractUsersManager.activateApplication(user, applicationId); } {code} Sorry for backtracking, but after thinking about this some more, I think the question that needs to be asked by {{AppSchedulingInfo#updatePendingResources}} is not "Is this application attempt asking for an AM?", but rather I think the question is "Does this user have schedulable apps (including this attempt)?" I'd like [~sunilg]'s and [~leftnoteasy]'s input on this design suggestion. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422631#comment-16422631 ] Eric Payne commented on YARN-4606: -- {quote}AM container is being allocated to App2 only after App1 completion when cluster is running full. {quote} [~maniraj...@gmail.com], In the current implementation (that is, without the patch), there are a couple of things that can affect whether or not App2 will be given the freed container: - In the running queue, if {{Configured Minimum User Limit Percent}} is set to 100%, only one user can run in the queue at a time. If this is so, then the Capacity Scheduler will only assign new containers to App1 (owned by User1). However, if {{Configured Minimum User Limit Percent}} is 50% or less, the Capacity Scheduler will assign new containers to App2 (owned by User2) until they both have 50% of the queue or one stops asking for new resources. - In the running queue, if {{Used Application Master Resources}} equals {{Max Application Master Resources}}, the Capacity Scheduler will not assign an AM to App2. - The same thing happens if {{Num Schedulable Applications}} is equal to {{Max Applications}}, but that's probably not the case here. I suspect it may be the first case. Please check to make sure your queue configuration is set to allow multiple running users in the queue. {quote}{quote}However, I'm not sure of the best way to get the values for a queue's Used AM Resources and Max AM Resources from this context. Those may be capacity scheduler-specific values. {quote} Yes. But I do see some equivalents available in FSQueueMetrics. {quote} The concern is that even though they may be in both FSQueueMetrics, and CSQueueMetrics, they are not accessible at the abstract {{QueueMetrics}} layer because they have different accessors. It should be possible to add a new, abstract accessor in {{QueueMetrics}} that is implemented in FS/CS QueueMetrics. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16418782#comment-16418782 ] Manikandan R commented on YARN-4606: [~eepayne] Thanks for your detailed explanation. Sorry for the delay. {quote}In this scenario, User2 wants to start App2 but User1 is consuming all resources in the queue with App1. When App1 releases a resource, however, it is not given to App2. The resource is given back to App1, which brings its Pending value down to 19. This is incorrect behavior since Queue1 has room for 2 AMs.{quote} I was trying to understand this behaviour in current code (without my patch) and come to know that AM container is being allocated to App2 only after App1 completion when cluster is running full. In my single node pseudo setup, total cluster resources is 8192M, 8 vcores, only 1 queue (default) with 100% allocation and max am resources is 2048MB, 2 vcores as max am resource percent is 0.2. I submitted an app (say App1) through DS with num_containers as 20. While App1 is running and its pending containers is around 15, submitted second app (say App2) with num_containers as 10. I can see AM container for App2 is being allocated only after App1 completion, which is not in line with your earlier comments. Am I missing anything here? {quote}However, I'm not sure of the best way to get the values for a queue's Used AM Resources and Max AM Resources from this context. Those may be capacity scheduler-specific values. {quote} Yes. But I do see some equivalents available in {{FSQueueMetrics}}. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396094#comment-16396094 ] Eric Payne commented on YARN-4606: -- bq. resources are not assigned to the second app when they should be I'm unsure about the appropriate way to fix this. My original thinking was that we could do something similar to the following: {code:title=AppSchedulingInfo#updatePendingResources} if( Not Waiting For AM Container || (Queue Used AM Resources < Queue Max AM Resources) { abstractUsersManager.activateApplication(user, applicationId); } {code} However, I'm not sure of the best way to get the values for a queue's {{Used AM Resources}} and {{Max AM Resources}} from this context. Those may be capacity scheduler-specific values. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391802#comment-16391802 ] Eric Payne commented on YARN-4606: -- [~maniraj...@gmail.com], thank you for the patch. The overall approach looks fine, but I have a couple of concerns. - The behavior of assigning resources to schedulable applications has changed. With this patch, in the following use case, resources are not assigned to the second app when they should be. I have not analyzed the behavior closely enough to debug the issue, but I wish to document the behavior: -- Queue1 total resources: 40 -- Queue1 Max Application Master Resources: 2 -- Container sizes are all 1 resource |*User Name*|*Applicatiton ID*|*Used AM resources*|*Total Used Resources*|*Pending Resources*| |User1|App1|1|39|20| |User2|App2|0|0|1 (waiting for AM)| -- In this scenario, User2 wants to start App2 but User1 is consuming all resources in the queue with App1. When App1 releases a resource, however, it is not given to App2. The resource is given back to App1, which brings its Pending value down to 19. This is incorrect behavior since Queue1 has room for 2 AMs. - I think the {{TestRMHA}} unit test needs to be modified to adjust to this patch: {code:java} TestRMHA TestRMHA.testFailoverAndTransitions:219->verifyClusterMetrics:754 Incorrect value for metric activeApplications expected:<1> but was:<0> TestRMHA.testFailoverClearsRMContext:550->verifyClusterMetrics:754 Incorrect value for metric activeApplications expected:<1> but was:<0> {code} - A couple of minor things: -- IIUC, the value stored in {{activeUsersOfPendingApps}} represents the number of suers that do not have any active applications. Is that correct? If so, I think it would be more clear if it were called {{usersWithOnlyPendingApps}}. -- In {{AbstractUsersManager}} and {{ActiveUsersManager}}, *atleast* should be "at least*. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16384645#comment-16384645 ] Manikandan R commented on YARN-4606: [~eepayne] [~sunilg] Thanks for your inputs. Sorry for the delay. Attached POC patch to confirm it is in line with our discussions. Please review the approach. Will need to make it as robust patch by adding tests etc and also have to cover FS, FIFO as well after the feedback. Approach: 1. Introduce activeUsersOfPendingApps in users manager and increment this count as and when apps are accepted. 2. After activating the application, increment activeUsers and decrement activeUsersOfPendingApps in {{UsersManager#activateApplication}} from {{AppSchedulingInfo#updatePendingResources}} only when app is no more waiting for AM container. 3. To calculate max AM limit per user, use activeUsers + activeUsersOfPendingApps. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347329#comment-16347329 ] Eric Payne commented on YARN-4606: -- My understanding is that user limit would use {{activeUsers}} and things like max AM limit per user, we'd use {{activeUsers}} + {{activeUsersOfPendingApps}} > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347236#comment-16347236 ] Manikandan R commented on YARN-4606: Thanks [~eepayne] [~sunilg] for your comments. {quote}And when we need to know all active users in cluster (for user-limit computation etc) we might need to use activeUsers+activeUsersOfPendingApps.\{quote} Shouldn't we use activeUsersOfPendingApps only for user limit calculation? Otherwise, based on the example given in the description, Won't we end up in same situation (2+2 == 4 )? Please correct my understanding. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347220#comment-16347220 ] Sunil G commented on YARN-4606: --- Thanks [~eepayne]. You are correct. - {{activeUsers}}: users that have at least one active app - {{activeUsersOfPendingApps}}: users that have only pending apps This is the correct definition. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16347209#comment-16347209 ] Eric Payne commented on YARN-4606: -- Thanks everyone for the thoughtful analysis. I am still analyzing in more depth, but I have a couple of thoughts: {quote}this is a (known) potential issue of fair ordering policy. {quote} This can happen for fifo ordering policy as well. {quote}have {{activeUsersOfPendingApps}} along with {{activeUsers}}. Hence in case of scheduling we can depend only on {{activeUse}} {quote} We need to be careful with these counts because a user can have both active and pending apps. I think the definitions should be: - {{activeUsers}}: users that have at least one active app - {{activeUsersOfPendingApps}}: users that have only pending apps. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16346808#comment-16346808 ] Sunil G commented on YARN-4606: --- [~maniraj...@gmail.com] You can look into {{SchedulerApplicationAttempt.isWaitingForAMContainer()}} to know an app is pending for its AM container. Hence new method {{activeApplication()}} in {{AppSchedulingInfo is not needed.}} Ideally we have {{ActiveUsersManager}} which has all the active users in that cluster (including apps which are pending). I thin we can have {{activeUsersOfPendingApps along with}} {{activeUsers}} . Hence in case of scheduling we can depend only on activeUsers. And when we need to know all active users in cluster (for user-limit computation etc) we might need to use activeUsers+activeUsersOfPendingApps. cc/ [~leftnoteasy] [~jlowe] [~eepayne] Could you please help to check this and share your thoughts. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16345536#comment-16345536 ] Manikandan R commented on YARN-4606: [~leftnoteasy] Based on my understanding, as of now {{ActiveUsersManager#getNumActiveUsers}} has been used only in {{LeafQueue#getUserAMResourceLimitPerPartition}} to compute the userAMLimit. Given this, ensuring active users count increments only after app activation is sufficient to fix this JIRA? Something like, 1. Introduce a new method {{activeApplication()}} in {{AppSchedulingInfo}} to set new private boolean variable "isActivated" to true. 2. Call #1 from {{LeafQueue#activateApplications}} after {{application.updateAMContainerDiagnostics(AMState.ACTIVATED, null);}} similar to changes in earlier POC patch. 3. Ensure {{abstractUsersManager.activateApplication(user, applicationId);}} in {{AppSchedulingInfo#updatePendingResources}} executes only when isActivate is true. Please share your views. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16286883#comment-16286883 ] Wangda Tan commented on YARN-4606: -- [~jlowe], as we discussed offline, this is a (known) potential issue of fair ordering policy. Please let me know if you have any thoughts on this. > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108267#comment-15108267 ] Naganarasimha G R commented on YARN-4606: - Hi [~wangda], Thanks for sharing the proposal, Just to clarify whether my understanding is correct : * While calculating user limit for assigning containers to a app we take the #activeusers = #users who have apps in activated stage and has outstanding requests. * And while calculating userAMlimit for activating a app we take #activeusers = #unique users who are having apps in the pending order policy? > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108290#comment-15108290 ] Wangda Tan commented on YARN-4606: -- Hi [~Naganarasimha], bq. While calculating user limit for assigning containers to a app we take the #activeusers = #users who have apps in activated stage and has outstanding requests. Yes bq. And while calculating userAMlimit for activating a app we take #activeusers = #unique users who are having apps in the pending order policy? Yes > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > Attachments: YARN-4606.1.poc.patch > > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108196#comment-15108196 ] Wangda Tan commented on YARN-4606: -- Proposed solution: We should only consider a user is "active" if any of its application is active. And CS will use the "#active-user-which-has-at-least-one-active-app" to compute user-limit. Computation of max-am-resource-per-user needs to be updated as well. We should get a #users-which-has-pending-apps to compute max-am-resource-per-user. This looks like a major behavior change to existing scheduler logic. Thoughts? [~vinodkv]/[~jlowe]/[~jianhe]. I'm not sure if FairScheduler needs similar changes as well, if a user in FSLeafQueue doesn't have any runnable apps, should we increase #active-users of QueueMetrics? > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan >Priority: Critical > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108183#comment-15108183 ] Wangda Tan commented on YARN-4606: -- Updated description of the JIRA, originally it is found by [~karams] while doing fairness ordering policy tests, pasting original test cases here just for reference: {code} Encountered while studying behaviour fairness with UserLimitPercent and UserLimitFactor during following test: Ran GridMix with Queue settings: Capacity=10, MaxCap=80, UserLimit=25 UserLimitFactor=32, FairOrderingPolicy only. Encountered a application starving situation where 33 application (190 apps completed out of 761 apps, queue can 345 containers) are running with total of 45 containers running, and that 12 extra only one app(the app was having around 18000 tasks) , all other apps were having AM running only no other containers were given any apps. After that app finished, there were 32 AMs that kept running without any containers for task being launched GridMix was run with following settings: gridmix.client.pending.queue.depth=10, gridmix.job-submission.policy=REPLAY, gridmix.client.submit.threads=5, gridmix.submit.multiplier=0.0001, gridmix.job.type=SLEEPJOB, mapreduce.framework.name=yarn, mapreduce.job.queuename=hive1, mapred.job.queue.name=hive1, gridmix.sleep.max-map-time=5000, gridmix.sleep.max-reduce-time=5000, gridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver With Users file containing 4 users for RoundRobinUserResolver {code} > CapacityScheduler: applications could get starved because computation of > #activeUsers considers pending apps > - > > Key: YARN-4606 > URL: https://issues.apache.org/jira/browse/YARN-4606 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, capacityscheduler >Affects Versions: 2.8.0, 2.7.1 >Reporter: Karam Singh >Assignee: Wangda Tan > > Currently, if all applications belong to same user in LeafQueue are pending > (caused by max-am-percent, etc.), ActiveUsersManager still considers the user > is an active user. This could lead to starvation of active applications, for > example: > - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to > user3)/app4(belongs to user4) are pending > - ActiveUsersManager returns #active-users=4 > - However, there're only two users (user1/user2) are able to allocate new > resources. So computed user-limit-resource could be lower than expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)