[ 
https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610823#comment-16610823
 ] 

Jason Lowe commented on YARN-7086:
----------------------------------

I'm worried that we're falling into the classic pitfall of optimizing without
profiling data or hands-on experience to show that the optimizations make sense.
As I mentioned above, the big bad lock that slowed container release down in
the past is now gone, so I don't know whether container release is still a big
problem in trunk. I think we need hard data on where the bottlenecks are in the
updated trunk code with respect to container release, plus data showing that
this new setup is worth it, especially since we're trading multiple traversals
of the container list against obtaining the LeafQueue lock. The profiling
tests should cover scenarios where a single container is released as well as
scenarios where thousands of containers are released in a single AM heartbeat.

We could try running SLS or develop some targeted unit tests to stress this 
code path.  See TestCapacitySchedulerPerf for an example of a unit test that is 
built to stress a particular aspect of the scheduler for performance testing.
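
As a rough illustration of the two scenarios above (one container released
versus thousands released in one heartbeat), a standalone timing harness might
look like the sketch below. It does not touch the real scheduler: the class
ReleaseStressSketch, its queueLock field, and the releaseOneByOne() and ids()
helpers are all hypothetical names, and the ReentrantLock is only a stand-in
for a queue-level lock. A real test would have to drive the actual release
path in trunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: none of these names come from the YARN code base.
class ReleaseStressSketch {
    // Stand-in for a queue-level lock (assumption, not the real LeafQueue lock).
    private final ReentrantLock queueLock = new ReentrantLock();
    private long liveContainers;

    ReleaseStressSketch(long liveContainers) {
        this.liveContainers = liveContainers;
    }

    // Releases every container in the batch, taking the lock once per container,
    // and returns the elapsed wall-clock time in nanoseconds.
    long releaseOneByOne(List<Integer> containerIds) {
        long start = System.nanoTime();
        for (int i = 0; i < containerIds.size(); i++) {
            queueLock.lock();
            try {
                liveContainers--;
            } finally {
                queueLock.unlock();
            }
        }
        return System.nanoTime() - start;
    }

    long getLiveContainers() {
        return liveContainers;
    }

    // Builds a batch of n placeholder container ids.
    static List<Integer> ids(int n) {
        List<Integer> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // The two scenarios from the comment: a single release and a bulk release.
        for (int batch : new int[] {1, 10_000}) {
            ReleaseStressSketch sketch = new ReleaseStressSketch(batch);
            long nanos = sketch.releaseOneByOne(ids(batch));
            System.out.println(batch + " release(s): " + nanos + " ns, remaining="
                + sketch.getLiveContainers());
        }
    }
}
```

The timings from a toy like this say nothing about trunk; the point is only
the shape of the measurement, with the same code path timed at both batch
sizes so the numbers can be compared before and after a patch.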

> Release all containers asynchronously
> -------------------------------------
>
>                 Key: YARN-7086
>                 URL: https://issues.apache.org/jira/browse/YARN-7086
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Arun Suresh
>            Assignee: Manikandan R
>            Priority: Major
>         Attachments: YARN-7086.001.patch, YARN-7086.002.patch
>
>
> We have noticed in production two situations that can cause deadlocks and 
> cause scheduling of new containers to come to a halt, especially with regard 
> to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the 
> scheduler releases all its live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make 
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea generally
> (cc [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the
> AbstractYarnScheduler and a corresponding scheduler event, which is currently
> used specifically for the container-update code paths (where the scheduler
> releases temp containers which it creates for the update).



