[ https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616408#comment-16616408 ]
Manikandan R commented on YARN-7086:
------------------------------------
[~jlowe] Thanks for the very detailed suggestion.
{quote}I'm worried that we're delving into the classic pitfall of optimizing
without profiling data or hands-on experience to prove the optimizations make
sense.{quote}
Sorry about this. My understanding from the earlier discussion was that the
LeafQueue lock would certainly be a potential source of performance
degradation, and that acquiring the lock only once was mandatory when releasing
a batch of containers. Hence I went with this trade-off (traversing the
container list multiple times). Now that we want to decide the next step based
on a stress test (which is a good basis for decision making), I will take a
look at TestCapacitySchedulerPerf and run the tests. As you suggested, the
numbers should help us define the next steps clearly.
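To make the trade-off concrete, here is a minimal sketch of the two locking patterns being compared. It is not the code in the attached patches and does not use the real LeafQueue API: the class BatchReleaseSketch, its fields, and its methods are hypothetical names chosen for illustration, and the counter exists only to show how often the lock is taken.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a queue that tracks live containers; not the real LeafQueue. */
final class BatchReleaseSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Map<String, Integer> liveContainers = new HashMap<>(); // id -> memory in MB
    private int usedMemoryMb = 0;
    private int releaseLockAcquisitions = 0; // instrumentation for this sketch only

    void allocate(String id, int memoryMb) {
        lock.lock();
        try {
            liveContainers.put(id, memoryMb);
            usedMemoryMb += memoryMb;
        } finally {
            lock.unlock();
        }
    }

    /** Pattern 1: take the lock once per container. */
    void releaseOneByOne(List<String> ids) {
        for (String id : ids) {
            lock.lock();
            try {
                releaseLockAcquisitions++;
                releaseWhileLocked(id);
            } finally {
                lock.unlock();
            }
        }
    }

    /** Pattern 2: take the lock once for the whole batch. */
    void releaseBatch(List<String> ids) {
        lock.lock();
        try {
            releaseLockAcquisitions++;
            for (String id : ids) {
                releaseWhileLocked(id);
            }
        } finally {
            lock.unlock();
        }
    }

    // Caller must hold the lock. Unknown ids are ignored.
    private void releaseWhileLocked(String id) {
        Integer memoryMb = liveContainers.remove(id);
        if (memoryMb != null) {
            usedMemoryMb -= memoryMb;
        }
    }

    int getReleaseLockAcquisitions() { return releaseLockAcquisitions; }
    int getUsedMemoryMb() { return usedMemoryMb; }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("c1", "c2", "c3");
        BatchReleaseSketch perContainer = new BatchReleaseSketch();
        BatchReleaseSketch batched = new BatchReleaseSketch();
        for (String id : ids) {
            perContainer.allocate(id, 1024);
            batched.allocate(id, 1024);
        }
        perContainer.releaseOneByOne(ids);
        batched.releaseBatch(ids);
        System.out.println("per-container lock acquisitions: "
            + perContainer.getReleaseLockAcquisitions());
        System.out.println("batched lock acquisitions: "
            + batched.getReleaseLockAcquisitions());
    }
}
```

Both patterns leave the queue in the same state; they differ only in how often the lock is acquired and how long each hold lasts. Whether the single longer hold is actually cheaper in the scheduler is the question the stress test is meant to answer.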
> Release all containers asynchronously
> -------------------------------------
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Arun Suresh
> Assignee: Manikandan R
> Priority: Major
> Attachments: YARN-7086.001.patch, YARN-7086.002.patch
>
>
> We have noticed in production two situations that can cause deadlocks and
> cause scheduling of new containers to come to a halt, especially with regard
> to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the
> scheduler releases all its live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea in general (cc
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the
> AbstractYarnScheduler and a corresponding scheduler event, which is currently
> used specifically for the container-update code paths (where the scheduler
> releases temp containers which it creates for the update).
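
The asynchronous release described in the issue can be illustrated with a small sketch. It assumes a single dispatcher thread that applies release events in the order they were queued; it is not the production patch and not the asyncReleaseContainer() implementation from YARN-6251. The class AsyncReleaseSketch and its methods are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: callers enqueue releases, one dispatcher thread applies them. */
final class AsyncReleaseSketch {
    private final Set<String> liveContainers = ConcurrentHashMap.newKeySet();
    // A single-threaded executor plays the role of the scheduler event dispatcher.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

    void addContainer(String id) {
        liveContainers.add(id);
    }

    /** Returns immediately; the release itself runs later on the dispatcher thread. */
    void enqueueRelease(String id) {
        dispatcher.execute(() -> liveContainers.remove(id));
    }

    /** Stops accepting new events and waits for the queued ones to finish. */
    boolean drain(long timeoutMs) {
        dispatcher.shutdown();
        try {
            return dispatcher.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    int liveCount() {
        return liveContainers.size();
    }

    public static void main(String[] args) {
        AsyncReleaseSketch scheduler = new AsyncReleaseSketch();
        for (int i = 0; i < 1000; i++) {
            scheduler.addContainer("container_" + i);
        }
        // The caller only pays the cost of queuing 1000 events here.
        for (int i = 0; i < 1000; i++) {
            scheduler.enqueueRelease("container_" + i);
        }
        boolean finished = scheduler.drain(5000);
        System.out.println("drained=" + finished + " live=" + scheduler.liveCount());
    }
}
```

The point of the pattern is that the thread requesting the release never blocks on scheduler state; the cost is that releases become visible only after the dispatcher has processed the corresponding events.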
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)