zhihai xu commented on YARN-3415:

I uploaded a patch, YARN-3415.000.patch, for review.
The patch fixes two bugs and makes four minor code optimizations.
bugs fixed:
1. Check whether the AM is running before calling addAMResourceUsage.
We should call addAMResourceUsage only when the AM is not yet running.
Without this fix, the test will fail because the queue's AmResourceUsage is
changed by a non-AM container.
2. Don't check non-AM containers against the queue's MaxAMShare limit.
Without this fix, the test will fail because a non-AM container allocation is
rejected due to the MaxAMShare limit.
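
A minimal sketch of the two fixes above, using simplified stand-in classes
rather than the real FSAppAttempt/FSLeafQueue code. The class names
(QueueSketch, AppAttemptSketch), the tryAllocate method and the megabyte-based
limit are assumptions made for illustration; only the names
addAMResourceUsage, canRunAppAM, isAmRunning and getUnmanagedAM come from the
patch description.

```java
// Simplified stand-in for a fair scheduler leaf queue (hypothetical class).
final class QueueSketch {
    private final long maxAmShareMb;   // assumed absolute cap on AM memory
    private long amResourceUsageMb;    // memory currently charged to AMs

    QueueSketch(long maxAmShareMb) {
        this.maxAmShareMb = maxAmShareMb;
    }

    // True if an AM of the given size still fits under the AM share limit.
    boolean canRunAppAM(long amResourceMb) {
        return amResourceUsageMb + amResourceMb <= maxAmShareMb;
    }

    void addAMResourceUsage(long amResourceMb) {
        amResourceUsageMb += amResourceMb;
    }

    long getAmResourceUsageMb() {
        return amResourceUsageMb;
    }
}

// Simplified stand-in for an application attempt (hypothetical class).
final class AppAttemptSketch {
    private final QueueSketch queue;
    private final boolean unmanagedAM;
    private boolean amRunning;         // set once the AM container is allocated

    AppAttemptSketch(QueueSketch queue, boolean unmanagedAM) {
        this.queue = queue;
        this.unmanagedAM = unmanagedAM;
    }

    boolean isAmRunning() {
        return amRunning;
    }

    // Returns true if the container was allocated.
    boolean tryAllocate(long containerMb) {
        // A container is the AM container only if the AM is managed
        // and has not been started yet.
        boolean isAmContainer = !unmanagedAM && !amRunning;
        if (isAmContainer) {
            // Fix 2: only the AM container is checked against the AM share.
            if (!queue.canRunAppAM(containerMb)) {
                return false;
            }
            // Fix 1: only the AM container is charged to the AM usage.
            queue.addAMResourceUsage(containerMb);
            amRunning = true;
        }
        return true;
    }
}
```

In this sketch the decision depends on an explicit amRunning flag instead of
the number of live containers, so a later non-AM container neither changes
the queue's AM usage nor gets rejected by the AM share check.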

code optimizations:
1. Remove the redundant check of getLiveContainers().size() when calling
addAMResourceUsage in FSAppAttempt.
2. Remove the redundant check of getLiveContainers().size() when checking the
queue's MaxAMShare (canRunAppAM) in FSAppAttempt.
3. Remove the redundant check of app.getAMResource() in FSLeafQueue#removeApp.
I didn't check app.getUnmanagedAM() here; instead I added a comment: AmRunning
is set to true only when getUnmanagedAM() is false.
But checking app.getUnmanagedAM() as well is also fine with me.
4. Check application.isAmRunning() instead of
application.getLiveContainers().isEmpty() in FairScheduler#allocate.
This is because application.getLiveContainers() consumes much more CPU than
application.isAmRunning(), and FairScheduler#allocate is a function that is
called very frequently.
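
To illustrate the cost argument in item 4, here is a small sketch. It assumes
that getLiveContainers() builds a fresh copy of the container collection on
every call, while isAmRunning() only reads a flag. LiveContainerSketch, its
fields and the copies counter are hypothetical and exist only for this
example; they are not the Hadoop implementation.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an application attempt; not the Hadoop class.
final class LiveContainerSketch {
    private final Map<String, Long> liveContainers = new HashMap<>();
    private boolean amRunning;
    private int copies;  // counts defensive copies, for illustration only

    synchronized void addContainer(String id, long memoryMb) {
        liveContainers.put(id, memoryMb);
    }

    synchronized void setAmRunning(boolean running) {
        amRunning = running;
    }

    // Assumed behavior: each call allocates and fills a new list, O(n).
    synchronized Collection<Long> getLiveContainers() {
        copies++;
        return new ArrayList<>(liveContainers.values());
    }

    // Reads a single flag, O(1), no allocation.
    synchronized boolean isAmRunning() {
        return amRunning;
    }

    synchronized int getCopies() {
        return copies;
    }
}
```

Under that assumption, a frequently called path such as
FairScheduler#allocate pays for one collection copy per call with the
getLiveContainers().isEmpty() check and for none with the flag check.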

> Non-AM containers can be counted towards amResourceUsage of a fairscheduler 
> queue
> ---------------------------------------------------------------------------------
>                 Key: YARN-3415
>                 URL: https://issues.apache.org/jira/browse/YARN-3415
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: Rohit Agarwal
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3415.000.patch
> We encountered this problem while running a Spark cluster. The 
> amResourceUsage for a queue became artificially high and then the cluster got 
> deadlocked because the maxAMShare constraint kicked in and no new AM got 
> admitted to the cluster.
> I have described the problem in detail here: 
> https://github.com/apache/spark/pull/5233#issuecomment-87160289
> In summary: the condition for adding the container's memory towards 
> amResourceUsage is fragile. It depends on the number of live containers 
> belonging to the app. We saw that the Spark AM went down without explicitly 
> releasing its requested containers, and then the memory of one of those 
> containers was counted towards amResource.
> cc - [~sandyr]

This message was sent by Atlassian JIRA
