[ 
https://issues.apache.org/jira/browse/YARN-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060459#comment-15060459
 ] 

Carlo Curino commented on YARN-4198:
------------------------------------

[~xinxianyin] the way we got to this was by running a "busy" workload with lots 
of reservation-related pressure on the CS, staring at a profiler, and 
progressively working out which locks could be weakened and which data 
structures could be changed to improve the performance of the scheduler. 

I think this is looking at the same set of problems you are tracking in 
YARN-3091, but with a particular focus on the needs of the reservation system. I 
expect the changes in this patch (we will post an initial version soon) to be 
generally useful, and possibly to overlap partially with some of the YARN-3091 
sub-JIRAs. 

The improvements we observed were very substantial (we went from thrashing on 
locks in a 256-node cluster at 50-60 concurrent reservations to chugging along 
nicely on a 2700-node cluster at over 1000 concurrent reservations). Note that 
all that testing was done for this patch combined with the rest of the YARN-4193 
work, therefore I suggest that:
 # We do a round of tests of this patch in isolation to make sure the 
changes are good independently of the rest of what we did in YARN-4193.
 # We post a version of the patch. 
 # You review it and help us figure out: 1) whether it is 
good/safe/agreeable, 2) how it relates to some of the other efforts that are 
ongoing (it might resolve some of the sub-JIRAs or provide partial work towards 
them). 

[~kshukla], [~wangda], [~jianhe], [~jlowe] if you guys have time to look at 
this as well, it would be great. As I mentioned to some of you already, this is 
a very delicate portion of the scheduler, and we need lots of eyes (ideally 
both staring at the patch and testing independently on a cluster) to convince 
ourselves that what is proposed is safe/correct and worthwhile. 
 

> CapacityScheduler locking / synchronization improvements
> --------------------------------------------------------
>
>                 Key: YARN-4198
>                 URL: https://issues.apache.org/jira/browse/YARN-4198
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Carlo Curino
>            Assignee: Alexey Tumanov
>
> In the context of YARN-4193 (which stresses the RM/CS performance) we found 
> several performance problems in the locking/synchronization of the 
> CapacityScheduler, as well as inconsistencies that do not normally surface 
> (incorrect locking order of queues protected by CS locks, etc.). This JIRA 
> proposes several refactorings that improve this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
