[jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

Eric Payne (JIRA) Tue, 17 Jul 2018 15:22:22 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547161#comment-16547161
 ]


Eric Payne commented on YARN-4606:
----------------------------------

Thank you, [~maniraj...@gmail.com], for the latest patch.

The code changes look good. However, I have a few comments about the tests.

- I have a general concern that these tests are not testing the fix to the 
starvation problem outlined in the description of this JIRA. I'm trying to 
determine if there is a clean way to unit test that use case.
- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}}: I am 
concerned about new tests that take longer than necessary, because the unit 
tests keep taking longer and longer to run. I think the following changes 
would reduce this test's run time (in my build environment) from 1 min 17 sec 
to 24 sec.
-- In the following code, the sleep(5000) outside of the for loop is not 
necessary.
-- In the following code, the sleep(5000) inside of the for loop could be cut 
down to sleep(500).
{code:title=TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}
    Thread.sleep(5000);

    //Triggering this event so that user limit computation can
    //happen again
    for (int i = 0; i < 10; i++) {
      cs.handle(new NodeUpdateSchedulerEvent(rmNode1));
      Thread.sleep(5000);
    }
{code}

- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps1}}: I 
don't think this test is necessary. It takes more than 1 min 20 sec to run in 
my build environment and, as far as I can tell, it verifies that the active 
users count is never updated after a move if node heartbeats are not received. 
However, in a running YARN installation, node heartbeats are received every 
second (by default). Unless I'm missing something, this isn't a use case that 
one would encounter in a running Hadoop system.
- {{TestCapacityScheduler#setupQueueConfigurationForActiveUsersChecks}}: The 
parameters to {{conf.setUserLimitFactor(...)}} don't need to be 100.0f. User 
limit factor can be thought of as the multiplier for the amount of a queue that 
one user can consume. So, if the user limit factor is 1.0f, one user can use 
the capacity of the queue. If it is 2.0f, one user can use twice the capacity 
of the queue, and so forth. Since these queues have a capacity of 50%, I would 
set this to 2.0f.
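
To illustrate the arithmetic behind the user limit factor, here is a rough 
sketch. It is not CapacityScheduler code: the class and method names are 
hypothetical, and it assumes the only cap is 100% of the cluster, whereas the 
real computation also takes the queue's maximum capacity, 
minimum-user-limit-percent, and other settings into account.

```java
// Hypothetical illustration only; not part of the CapacityScheduler API.
final class UserLimitFactorSketch {

    /**
     * Rough upper bound on the share of the cluster one user could consume:
     * queue capacity multiplied by the user limit factor, capped at 100%.
     * The 100% cap is an assumption made to keep the example simple.
     */
    static float maxUserShareOfCluster(float queueCapacityPercent,
                                       float userLimitFactor) {
        return Math.min(100.0f, queueCapacityPercent * userLimitFactor);
    }

    public static void main(String[] args) {
        // Queue capacity of 50%, as in the test's queue setup.
        System.out.println(maxUserShareOfCluster(50.0f, 1.0f));   // 50.0
        System.out.println(maxUserShareOfCluster(50.0f, 2.0f));   // 100.0
        System.out.println(maxUserShareOfCluster(50.0f, 100.0f)); // 100.0
    }
}
```

Under these assumptions, 2.0f already lets one user reach the whole cluster 
from a 50% queue, so 100.0f gives no additional headroom.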


> CapacityScheduler: applications could get starved because computation of 
> #activeUsers considers pending apps 
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>            Assignee: Manikandan R
>            Priority: Critical
>         Attachments: YARN-4606.001.patch, YARN-4606.002.patch, 
> YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, 
> YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, 
> YARN-4606.POC.3.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue are 
> pending (caused by max-am-percent, etc.), ActiveUsersManager still considers 
> the user to be an active user. This could lead to starvation of active 
> applications, for example:
> - App1 (belongs to user1) and app2 (belongs to user2) are active; app3 
> (belongs to user3) and app4 (belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources, 
> so the computed user-limit-resource could be lower than expected.
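
The effect described in the issue summary above can be sketched numerically. 
This is a deliberately simplified model, not the scheduler's actual formula: 
it assumes the per-user limit is just the queue's resources divided evenly 
among the users counted as active, and the class name, method name, and 
100 GB queue size are all made up for illustration.

```java
// Hypothetical illustration only; the real user-limit computation in the
// CapacityScheduler involves more inputs than an even split.
final class ActiveUsersSketch {

    /** Simplified per-user limit: queue resources split evenly among active users. */
    static int userLimitMb(int queueResourceMb, int activeUsers) {
        return queueResourceMb / Math.max(1, activeUsers);
    }

    public static void main(String[] args) {
        int queueResourceMb = 102400; // assumed 100 GB queue

        // Counting users whose apps are all pending (user3, user4) as active:
        int limitWithPending = userLimitMb(queueResourceMb, 4);

        // Counting only users that can actually allocate (user1, user2):
        int limitActiveOnly = userLimitMb(queueResourceMb, 2);

        System.out.println(limitWithPending); // 25600
        System.out.println(limitActiveOnly);  // 51200
    }
}
```

In this model, with four users counted, user1 and user2 are each held to a 
quarter of the queue, so half of it sits unused while their applications 
wait.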



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
