[ 
https://issues.apache.org/jira/browse/YARN-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541153#comment-16541153
 ] 

Wangda Tan commented on YARN-8511:
----------------------------------

[~cheersyang], 

Thanks for reporting and working on this issue. It is a valid issue, and we have 
seen it in other places as well.

For example, with exclusive-use resource types like GPU, we could allocate a new 
container to a node before the previous container has completed. Memory has the 
same issue.

I'm not sure your patch works, since {{SchedulerNode#releaseContainer}} 
could be invoked in scenarios such as an AM releasing a container through an 
allocate call, or an app attempt finishing. The scheduler could still place a new 
container on a node before the old one is terminated by the NM.

Instead, I think we should have a hook to handle such events inside 
{{AbstractYarnScheduler#nodeUpdate}}.
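
To make the idea concrete, here is a minimal sketch of deferring the resource release until the node heartbeat confirms completion. It is a toy model, not the actual YARN code: {{ToyNode}}, {{requestRelease}} and the GPU counter are hypothetical names made up for illustration, and only the {{nodeUpdate}} name is borrowed from the scheduler method mentioned above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only: none of these types are the real YARN classes. */
public class DeferredReleaseSketch {

    /** Hypothetical stand-in for a scheduler node that tracks exclusive GPUs. */
    static class ToyNode {
        private int freeGpus;
        // containerId -> GPUs held; an entry stays until the NM confirms completion
        private final Map<String, Integer> held = new HashMap<>();
        // containers the AM asked to release but the NM has not confirmed yet
        private final Set<String> pendingRelease = new HashSet<>();

        ToyNode(int totalGpus) {
            this.freeGpus = totalGpus;
        }

        /** Returns false when the node does not have enough free GPUs. */
        boolean allocate(String containerId, int gpus) {
            if (gpus > freeGpus) {
                return false;
            }
            freeGpus -= gpus;
            held.put(containerId, gpus);
            return true;
        }

        /** AM-side release: only mark the container, do not free resources yet. */
        void requestRelease(String containerId) {
            if (held.containsKey(containerId)) {
                pendingRelease.add(containerId);
            }
        }

        /** Heartbeat hook: free resources only for containers the NM reports as completed. */
        void nodeUpdate(List<String> completedContainerIds) {
            for (String id : completedContainerIds) {
                Integer gpus = held.remove(id);
                if (gpus != null) {
                    freeGpus += gpus;
                }
                pendingRelease.remove(id);
            }
        }

        int getFreeGpus() {
            return freeGpus;
        }

        boolean isPendingRelease(String containerId) {
            return pendingRelease.contains(containerId);
        }
    }

    public static void main(String[] args) {
        ToyNode node = new ToyNode(1);
        System.out.println(node.allocate("c1", 1));  // true
        node.requestRelease("c1");                   // AM released, NM not done yet
        System.out.println(node.allocate("c2", 1));  // false: c1 still holds the GPU
        node.nodeUpdate(Collections.singletonList("c1"));
        System.out.println(node.allocate("c2", 1));  // true
    }
}
```

In this sketch the AM-side release only marks the container, so a second container cannot land on the node until the heartbeat reports the first one as completed.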

However, we still have two issues: 
1) If we deduct resources only after the container actually finishes, it is 
possible that the scheduler application attempt has already finished. In that 
case, the scheduler is not able to deduct the resources. (The scheduler relies on 
SchedulerApplicationAttempt to locate the RMContainer.) I'm not sure whether this 
impacts allocation tags.
2) It is also possible that the NM spends too much time terminating containers. 
In our docker-in-docker setup, we observed that the OS takes several minutes to 
terminate a container. The NM could also report a container as DONE before it is 
actually terminated (another bug here). YARN-8508 is caused by this issue.
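
The first issue can be illustrated with a small toy model. This is only a sketch under assumptions: {{ToyScheduler}}, {{ToyAttempt}} and their methods are hypothetical and do not mirror the real scheduler classes. It shows how a completion event that arrives after the attempt is gone cannot be mapped back to a container, so the accounted resources are never deducted.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: none of these types are the real YARN classes. */
public class AttemptLookupSketch {

    /** Hypothetical stand-in for an application attempt that owns its containers. */
    static class ToyAttempt {
        // containerId -> memory in MB held by that container
        final Map<String, Integer> liveContainers = new HashMap<>();
    }

    /** Hypothetical scheduler that can only find containers through their attempt. */
    static class ToyScheduler {
        private final Map<String, ToyAttempt> attempts = new HashMap<>();
        private int usedMemoryMb = 0;

        void allocate(String attemptId, String containerId, int memoryMb) {
            attempts.computeIfAbsent(attemptId, k -> new ToyAttempt())
                    .liveContainers.put(containerId, memoryMb);
            usedMemoryMb += memoryMb;
        }

        /** The attempt record is dropped as soon as the attempt finishes. */
        void attemptFinished(String attemptId) {
            attempts.remove(attemptId);
        }

        /** Returns false when the container cannot be located, so nothing is deducted. */
        boolean containerCompleted(String attemptId, String containerId) {
            ToyAttempt attempt = attempts.get(attemptId);
            if (attempt == null) {
                return false;
            }
            Integer memoryMb = attempt.liveContainers.remove(containerId);
            if (memoryMb == null) {
                return false;
            }
            usedMemoryMb -= memoryMb;
            return true;
        }

        int getUsedMemoryMb() {
            return usedMemoryMb;
        }
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.allocate("attempt_1", "c1", 1024);
        scheduler.attemptFinished("attempt_1");
        // The NM reports completion only after the attempt record is gone.
        System.out.println(scheduler.containerCompleted("attempt_1", "c1")); // false
        System.out.println(scheduler.getUsedMemoryMb());                     // 1024
    }
}
```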
 
 

> When AM releases a container, RM removes allocation tags before it is 
> released by NM
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-8511
>                 URL: https://issues.apache.org/jira/browse/YARN-8511
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.1.0
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Major
>         Attachments: YARN-8511.001.patch, YARN-8511.002.patch
>
>
> Users leverage placement constraints (PC) with allocation tags to avoid port 
> conflicts between apps, but we found that they sometimes still get port 
> conflicts. This issue is similar to YARN-4148. RM removes allocation tags 
> immediately once AM#allocate asks to release a container, but the container on 
> the NM takes some time before it is actually killed and releases the port. We 
> should let RM remove allocation tags AFTER NM confirms the containers are released. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
