[ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547161#comment-16547161 ]
Eric Payne commented on YARN-4606:
----------------------------------

Thank you, [~maniraj...@gmail.com], for the latest patch. The code changes look good. However, I have a couple of points about the tests.

- I have a general concern that these tests do not test the fix for the starvation problem outlined in the description of this JIRA. I'm trying to determine if there is a clean way to unit test that use case.
- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}}: I am concerned about new tests that take longer than necessary, because the unit tests keep taking longer and longer to run. I think the following things can be done to reduce this test's run time (in my build environment) from 1 min 17 sec to 24 sec.
-- In the following code, the sleep(5000) outside of the for loop is not necessary.
-- In the following code, the sleep(5000) inside of the for loop could be cut down to sleep(500).
{code:title=TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}
    Thread.sleep(5000);
    //Triggering this event so that user limit computation can
    //happen again
    for (int i = 0; i < 10; i++) {
      cs.handle(new NodeUpdateSchedulerEvent(rmNode1));
      Thread.sleep(5000);
    }
{code}
- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps1}}: I don't think this test is necessary. It takes more than 1 min 20 sec to run in my build environment, and as far as I can tell, it verifies that the active users count is never updated after a move if node heartbeats are not received. However, in a running YARN installation, node heartbeats are received every second (by default). Unless I'm missing something, this isn't a use case that one would encounter in a running Hadoop system.
- {{TestCapacityScheduler#setupQueueConfigurationForActiveUsersChecks}}: The parameters to {{conf.setUserLimitFactor(...)}} don't need to be 100.0f. User limit factor can be thought of as the multiplier for the amount of a queue that one user can consume.
So, if the user limit factor is 1.0f, one user can use the capacity of the queue. If it is 2.0f, one user can use twice the capacity of the queue, and so forth. Since these queues have a capacity of 50%, I would set this to 2.0f.

> CapacityScheduler: applications could get starved because computation of
> #activeUsers considers pending apps
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>            Assignee: Manikandan R
>            Priority: Critical
>         Attachments: YARN-4606.001.patch, YARN-4606.002.patch,
> YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch,
> YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch,
> YARN-4606.POC.3.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belong to same user in LeafQueue are pending
> (caused by max-am-percent, etc.), ActiveUsersManager still considers the user
> is an active user. This could lead to starvation of active applications, for
> example:
> - App1(belongs to user1)/app2(belongs to user2) are active, app3(belongs to
> user3)/app4(belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, there're only two users (user1/user2) are able to allocate new
> resources. So computed user-limit-resource could be lower than expected.
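
To make the arithmetic behind the user limit factor remark and the quoted starvation example concrete, here is a minimal sketch. It is a deliberately simplified model and not the CapacityScheduler implementation: the class name, both method names, the 200 GB cluster size, and the 10% minimum-user-limit value are assumptions chosen only for illustration.

```java
public class UserLimitSketch {

    /**
     * Hypothetical simplification: each active user gets an equal share of
     * the queue, but never less than the minimum-user-limit-percent share.
     */
    static long activeUserShareMb(long queueCapacityMb, int activeUsers,
                                  int minUserLimitPercent) {
        long equalShare = queueCapacityMb / Math.max(activeUsers, 1);
        long minShare = queueCapacityMb * minUserLimitPercent / 100;
        return Math.max(equalShare, minShare);
    }

    /**
     * Hypothetical simplification: the user limit factor is a multiplier on
     * the queue capacity that bounds what a single user may consume.
     */
    static long userCapMb(long queueCapacityMb, float userLimitFactor) {
        return (long) (queueCapacityMb * userLimitFactor);
    }

    public static void main(String[] args) {
        long clusterMb = 204_800;        // assumed 200 GB cluster
        long queueMb = clusterMb / 2;    // queue configured at 50% capacity

        // Pending-only users counted as active (#active-users=4):
        System.out.println(activeUserShareMb(queueMb, 4, 10)); // 25600
        // Only users who can actually allocate are counted (2):
        System.out.println(activeUserShareMb(queueMb, 2, 10)); // 51200

        // With a 50% queue, a factor of 2.0f already lets one user
        // reach the whole cluster, so 100.0f adds nothing.
        System.out.println(userCapMb(queueMb, 2.0f));          // 204800
    }
}
```

In this model, counting user3 and user4 as active halves the share computed for user1 and user2, which is the "lower than expected" user-limit-resource from the description. The last line shows why 2.0f is enough for a queue with 50% capacity.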