[
https://issues.apache.org/jira/browse/YARN-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060459#comment-15060459
]
Carlo Curino commented on YARN-4198:
------------------------------------
[~xinxianyin] the way we got to this was by running a "busy" workload with lots
of reservation-related pressure on the CS, staring at a profiler, and
progressively working out which locks could be weakened and which data
structures could be changed to improve the performance of the scheduler.
I think this is looking at the same set of problems you are tracking in
YARN-3091, but with a particular focus on the needs of the reservation system. I
expect the changes in this patch (we will post an initial version soon) to be
generally useful, and possibly to overlap in part with some of the YARN-3091
sub-JIRAs.
The improvements we observed were very substantial (we went from thrashing on
locks in a 256-node cluster at 50-60 concurrent reservations to chugging along
nicely on a 2700-node cluster at over 1000 concurrent reservations). Note that
all that testing was done for this patch combined with the rest of the YARN-4193
work, therefore I suggest that:
# We do a round of tests of this patch in isolation to make sure the
changes are good independently of the rest of what we did in YARN-4193.
# We post a version of the patch.
# You review it and help us figure out: 1) whether it is
good/safe/agreeable, and 2) how it relates to some of the other ongoing
efforts (it might resolve some of the sub-JIRAs or provide partial work towards
them).
[~kshukla], [~wangda], [~jianhe], [~jlowe] if you guys have time to look at
this as well, it would be great. As I mentioned to some of you already, this is
a very delicate portion of the scheduler, and we need lots of eyes (ideally
both staring at the patch and testing independently on a cluster) to convince
ourselves that what is proposed is safe, correct, and worthwhile.
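For anyone following along, here is a purely illustrative sketch of what "weakening" a lock can mean. It is not code from the patch: the class QueueUsage and its field are hypothetical, and the actual changes may look quite different. The idea is to replace a coarse exclusive lock (e.g. a synchronized method) with a read/write lock, so that frequent readers no longer block each other and only updates take the exclusive path.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical class for illustration only; not part of the YARN-4198 patch.
class QueueUsage {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long usedMemoryMb = 0;

    // Readers share the read lock, so concurrent lookups do not block each other.
    long getUsedMemoryMb() {
        lock.readLock().lock();
        try {
            return usedMemoryMb;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writers hold the exclusive write lock only for the duration of the update.
    void allocate(long memoryMb) {
        lock.writeLock().lock();
        try {
            usedMemoryMb += memoryMb;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Whether a change of this kind is safe depends on every caller's assumptions about atomicity across multiple reads, which is exactly why the review and independent testing requested above matter.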
> CapacityScheduler locking / synchronization improvements
> --------------------------------------------------------
>
> Key: YARN-4198
> URL: https://issues.apache.org/jira/browse/YARN-4198
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Carlo Curino
> Assignee: Alexey Tumanov
>
> In the context of YARN-4193 (which stresses the RM/CS performance) we found
> several performance problems in the locking/synchronization of the
> CapacityScheduler, as well as inconsistencies that do not normally surface
> (incorrect locking order of queues protected by CS locks, etc.). This JIRA
> proposes several refactorings that improve this.