[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388285#comment-14388285 ]

zhihai xu commented on YARN-3415:
---------------------------------

[~sandyr], that is a very good idea to move the call to setAMResource that's 
currently in FairScheduler next to the call to getQueue().addAMResourceUsage().
The new patch YARN-3415.001.patch addresses this and also addresses your first 
two comments.
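
For reference, a rough sketch of what keeping those two updates together could 
look like (this is not the actual patch; the isAmRunning()/setAmRunning() guard 
and the exact placement are my assumptions):
{code}
    // Sketch only: when the AM container is allocated, update the attempt's
    // AM resource and the queue's AM resource usage in the same place,
    // rather than calling setAMResource separately from FairScheduler.
    if (!isAmRunning() && !getUnmanagedAM()) {
      setAMResource(container.getResource());
      getQueue().addAMResourceUsage(container.getResource());
      setAmRunning(true);
    }
{code}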

[~ragarwal], thanks for the review.
First, I want to clarify that the AM resource usage won't change when the AM 
container completes; it only changes when the application attempt is removed 
from the scheduler, which calls FSLeafQueue#removeApp.
So the "Check that AM resource usage becomes 0" assertion is currently done 
after all application attempts are removed:
{code}
    assertEquals("Queue1's AM resource usage should be 0",
        0, queue1.getAmResourceUsage().getMemory());
{code}
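
Roughly, the removal that precedes that assertion looks like this in the test 
(a sketch following the existing helpers; the exact attempt ids are omitted):
{code}
    // Remove each application attempt; FSLeafQueue#removeApp is what subtracts
    // the AM resource usage for an attempt whose AM is running.
    AppAttemptRemovedSchedulerEvent appRemovedEvent =
        new AppAttemptRemovedSchedulerEvent(attId1, RMAppAttemptState.FINISHED, false);
    scheduler.handle(appRemovedEvent);
    // ... repeat for the remaining attempts before checking getAmResourceUsage()
{code}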

bq. Add a non-AM container to app5. Handle the nodeUpdate event - check that 
the number of live containers is 2.
The old code already had this test for app1; that test passes even without the 
patch:
{code}
    // Still can run non-AM container
    createSchedulingRequestExistingApplication(1024, 1, attId1);
    scheduler.update();
    scheduler.handle(updateEvent);
    assertEquals("Application1 should have two running containers",
        2, app1.getLiveContainers().size());
{code}

I think the issue you describe happens when the non-AM container allocation is 
delayed until after the AM container has finished, which leaves 0 live 
containers.
My test simulates completing the AM container before the non-AM container is 
allocated; the old code will then increase the AM resource usage when the 
non-AM container is allocated, so without the patch the test will fail.
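
In other words, the new test does roughly the following (a sketch reusing the 
helper names from the existing test; attId and the saved amResourceBefore value 
are placeholders, and the step that completes the AM container is omitted):
{code}
    // The AM container has already completed, so the app has 0 live containers.
    // Now allocate a non-AM container for the same attempt.
    createSchedulingRequestExistingApplication(1024, 1, attId);
    scheduler.update();
    scheduler.handle(updateEvent);
    // Without the patch, the old check based on the live-container count treats
    // this container as an AM container and adds its memory to the queue's AM
    // resource usage, so this assertion fails.
    assertEquals("AM resource usage should not grow for a non-AM container",
        amResourceBefore, queue1.getAmResourceUsage().getMemory());
{code}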

> Non-AM containers can be counted towards amResourceUsage of a fairscheduler 
> queue
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-3415
>                 URL: https://issues.apache.org/jira/browse/YARN-3415
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: Rohit Agarwal
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3415.000.patch, YARN-3415.001.patch
>
>
> We encountered this problem while running a spark cluster. The 
> amResourceUsage for a queue became artificially high and then the cluster got 
> deadlocked because the maxAMShare constraint kicked in and no new AM got 
> admitted to the cluster.
> I have described the problem in detail here: 
> https://github.com/apache/spark/pull/5233#issuecomment-87160289
> In summary - the condition for adding the container's memory towards 
> amResourceUsage is fragile. It depends on the number of live containers 
> belonging to the app. We saw that the spark AM went down without explicitly 
> releasing its requested containers, and then one of those containers' memory 
> was counted towards amResource.
> cc - [~sandyr]


