[ https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139241#comment-16139241 ]
Wangda Tan commented on YARN-7086:
----------------------------------
The only potential issue I can see is this: prior to this change, the AM can assume
that containers are released by the RM once allocate() returns. In the new world, the AM has
to check the completed container list in the AllocateResponse to make sure the containers
have been released. It may not be a big issue, though, since I don't think we guarantee
this in the API description.
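To make the AM-side consequence concrete, here is a minimal sketch, assuming releases are processed asynchronously. It treats a container as released only after the RM reports it in the completed-container list (in the real API that list comes from AllocateResponse#getCompletedContainersStatuses()), not merely because allocate() returned. The `AllocateResponse` and `ReleaseTracker` classes below are hypothetical Python stand-ins, not YARN code.

```python
from dataclasses import dataclass, field


@dataclass
class AllocateResponse:
    """Hypothetical stand-in for YARN's AllocateResponse: only the IDs of
    containers reported in the completed-container list are modeled."""
    completed_container_ids: list[str] = field(default_factory=list)


class ReleaseTracker:
    """Hypothetical AM-side bookkeeping: a container counts as released only
    once the RM reports it as completed, not when allocate() returns."""

    def __init__(self) -> None:
        self.pending: set[str] = set()   # release requested, not yet confirmed
        self.released: set[str] = set()  # confirmed via the completed list

    def request_release(self, container_id: str) -> None:
        # Sent to the RM with the next allocate() call; not yet released.
        self.pending.add(container_id)

    def on_allocate_response(self, response: AllocateResponse) -> None:
        # Confirm only those pending releases the RM reported as completed.
        # Containers that completed for other reasons are ignored here.
        for cid in response.completed_container_ids:
            if cid in self.pending:
                self.pending.discard(cid)
                self.released.add(cid)

    def is_released(self, container_id: str) -> bool:
        return container_id in self.released
```

With asynchronous release, a response can come back before the release has been applied. In that case "c1" simply stays in `pending` until a later response lists it as completed.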
Beyond that, I like Jason's idea as well. One data point to share: when I was doing
async scheduling tests in YARN-5139, I found that the resource commit phase (acquiring the
write lock, then checking and updating scheduler internal state such as resource usages,
etc.) takes less than 6% of the time; most of the time is consumed by
{{CapacityScheduler#allocateContainersToNode}}. I suspect container release
takes a similar share of the time (around 6%).
> Release all containers asynchronously
> -------------------------------------
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Arun Suresh
> Assignee: Arun Suresh
>
> We have noticed two situations in production that can cause deadlocks and
> bring the scheduling of new containers to a halt, especially for
> applications that have a lot of live containers:
> # When these applications release their containers in bulk.
> # When these applications terminate abruptly due to some failure, and the
> scheduler releases all of their live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is generally a good idea (cc
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd]).
> BTW, in YARN-6251, we already added an asyncReleaseContainer() to
> AbstractYarnScheduler and a corresponding scheduler event, which is currently
> used specifically for the container-update code paths (where the scheduler
> releases the temporary containers it creates for the update).
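The release pattern proposed above can be sketched as a toy producer/consumer model. This is a hypothetical illustration, not the YARN-6251 asyncReleaseContainer() implementation. Callers, such as a bulk release or an application-removal path, only enqueue release events and return immediately. A single background thread applies each release while holding the scheduler lock only for one container at a time, so other scheduler work can interleave with a large release. `AsyncReleasingScheduler` and its methods are hypothetical names.

```python
import queue
import threading


class AsyncReleasingScheduler:
    """Hypothetical toy scheduler: releases are queued as events and applied
    by one background thread instead of in a loop on the caller's thread."""

    def __init__(self, live_containers):
        self._lock = threading.Lock()  # stands in for the scheduler write lock
        self._live = set(live_containers)
        self._events = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def async_release(self, container_ids):
        # Non-blocking for the caller: enqueue one event per container.
        for cid in container_ids:
            self._events.put(cid)

    def _run(self):
        while True:
            cid = self._events.get()
            try:
                # Take the lock per container, not across the whole batch.
                with self._lock:
                    self._live.discard(cid)
            finally:
                self._events.task_done()

    def wait_until_drained(self):
        # Helper for demos/tests: block until every queued release is applied.
        self._events.join()

    def live_containers(self):
        with self._lock:
            return set(self._live)
```

For example, after `async_release([f"c{i}" for i in range(50)])` on a scheduler holding c0..c99 and a call to `wait_until_drained()`, only c50..c99 remain live. The trade-off mirrors the concern raised in the comment: until the worker drains the queue, the RM's state (and what the AM sees) may still include containers whose release has been requested.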
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)