[ 
https://issues.apache.org/jira/browse/YARN-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571593#comment-14571593
 ] 

Karthik Kambatla commented on YARN-3655:
----------------------------------------

bq. IMHO, it is not good to add an if (isValidReservation) check in 
FSAppAttempt#reserve, because all the conditions checked in isValidReservation 
are already checked before we call FSAppAttempt#reserve; it would be duplicate 
code that affects performance.
Is it possible to avoid the checks before the call and do all the checks inside 
the call? The reasoning behind this is to keep all reservation-related code in 
as few places as possible. If this is not possible, we can leave it as the 
patch has it now.

bq. While adding this check in FSAppAttempt#assignContainer(node) might work in 
practice, it somehow feels out of place. 
Instead of adding the check to assignContainer(node), can we add it to 
assignContainer(node, request, nodeType, reserved)?
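
To make the first suggestion concrete, here is a minimal sketch of a reserve call that carries its own guard, so callers do not repeat the checks. It is a toy model, not the patch: the class GuardedReserveSketch, the boolean parameters, and the two conditions inside isValidReservation are assumptions made purely for illustration and do not mirror the real FSAppAttempt.

```java
/** Toy illustration only; names and conditions do not mirror the real FSAppAttempt. */
class GuardedReserveSketch {
    private boolean reserved = false;

    /** Hypothetical stand-in for the conditions grouped under isValidReservation. */
    private static boolean isValidReservation(boolean amWithinShare, boolean nodeFree) {
        return amWithinShare && nodeFree;
    }

    /**
     * Callers invoke reserve() without pre-checking; the guard lives inside the call.
     * Returns true only if the reservation was actually made.
     */
    boolean reserve(boolean amWithinShare, boolean nodeFree) {
        if (!isValidReservation(amWithinShare, nodeFree)) {
            return false; // rejected; state is unchanged
        }
        reserved = true;
        return true;
    }

    boolean isReserved() {
        return reserved;
    }

    public static void main(String[] args) {
        GuardedReserveSketch attempt = new GuardedReserveSketch();
        System.out.println(attempt.reserve(false, true)); // maxAMShare would be exceeded
        System.out.println(attempt.reserve(true, true));  // all conditions hold
    }
}
```

With this shape the validity conditions are evaluated once per call and live in one place; whether the real call sites can drop their own checks depends on what else they do with the results.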

> FairScheduler: potential livelock due to maxAMShare limitation and container 
> reservation 
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-3655
>                 URL: https://issues.apache.org/jira/browse/YARN-3655
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-3655.000.patch, YARN-3655.001.patch, 
> YARN-3655.002.patch, YARN-3655.003.patch
>
>
> FairScheduler: potential livelock due to maxAMShare limitation and container 
> reservation.
> If a node is reserved by an application, no other application has any 
> chance to assign a new container on this node until the application that 
> reserved the node either assigns a new container on it or releases the 
> reserved container.
> The problem is that if an application calls assignReservedContainer and 
> fails to get a new container due to the maxAMShare limitation, it blocks all 
> other applications from using the nodes it has reserved. If all other running 
> applications can't release their AM containers because they are blocked by 
> these reserved containers, a livelock can occur.
> The following is the code at FSAppAttempt#assignContainer which can cause 
> this potential livelock.
> {code}
>     // Check the AM resource usage for the leaf queue
>     if (!isAmRunning() && !getUnmanagedAM()) {
>       List<ResourceRequest> ask = appSchedulingInfo.getAllResourceRequests();
>       if (ask.isEmpty() || !getQueue().canRunAppAM(
>           ask.get(0).getCapability())) {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Skipping allocation because maxAMShare limit would " +
>               "be exceeded");
>         }
>         return Resources.none();
>       }
>     }
> {code}
> To fix this issue, we can unreserve the node if we can't allocate the AM 
> container on the node due to the maxAMShare limitation and the node is 
> reserved by the application.
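
The proposed fix can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: ReservationSketch, its map of node to reserving application, and the amFitsInShare flag (standing in for the queue's maxAMShare check) are hypothetical and are not FairScheduler APIs.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of node reservations; not the FairScheduler API. */
class ReservationSketch {
    /** nodeId -> appId currently holding the reservation (hypothetical bookkeeping). */
    private final Map<String, String> reservedBy = new HashMap<>();

    void reserve(String node, String app) {
        reservedBy.put(node, app);
    }

    boolean isReserved(String node) {
        return reservedBy.containsKey(node);
    }

    /**
     * Attempts to place app's AM container on node.
     * amFitsInShare stands in for the queue's maxAMShare check.
     * Returns true if a container was assigned.
     */
    boolean assignAmContainer(String node, String app, boolean amFitsInShare) {
        String holder = reservedBy.get(node);
        if (holder != null && !holder.equals(app)) {
            return false; // node is reserved by another application
        }
        if (!amFitsInShare) {
            // The proposed fix: drop our own reservation instead of holding the node.
            if (app.equals(holder)) {
                reservedBy.remove(node);
            }
            return false;
        }
        reservedBy.remove(node); // the assignment fulfils any reservation we held
        return true;
    }

    public static void main(String[] args) {
        ReservationSketch s = new ReservationSketch();
        s.reserve("node1", "appA");
        // appA is blocked by maxAMShare, so it gives up its reservation.
        System.out.println(s.assignAmContainer("node1", "appA", false));
        System.out.println(s.isReserved("node1"));
        // appB can now use the node.
        System.out.println(s.assignAmContainer("node1", "appB", true));
    }
}
```

Without the unreserve step in the amFitsInShare branch, node1 would stay reserved by appA and appB's attempt would be rejected on every pass, which is the blocking behaviour described above.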



