[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077242#comment-14077242 ]
Subramaniam Venkatraman Krishnan commented on YARN-1707:
--------------------------------------------------------

[~wangda] Thanks for the very detailed comments. I agree that understanding the context is essential, and I am glad to help with that. Overall your understanding is spot on; please find answers to your questions below:

1) Yes, it is possible to have multiple PlanQueues (e.g., if two organizations want to dynamically allocate their own resources without sharing them with each other). This also makes it possible to "try" reservations on a small scale and ramp up slowly at each org's pace.

2) The extra confs are needed to automate the initialization of key parameters of the dynamic ReservationQueues (without requiring a full specification of each of them).

3) Correct

4) Correct

5) First: the Plan guarantees that the sum of reservations never exceeds the available resources (replanning if needed to maintain this invariant when failures occur). On the other hand, as with the normal scheduler, we can leverage "overcapacity" to keep cluster utilization high. More precisely, depending on the configuration (or dynamically, on whether reservations have gang semantics or not), we can allow the resources allocated to a PlanQueue and a ReservationQueue to exceed their guaranteed capacity (i.e., set the dynamic max-capacity above the guaranteed one). In this case preemption might kick in if other apps with more rights to the resources have pending asks. Part of the changes in YARN-1957 were driven by this.

6) To limit the scope of the changes, we agreed to have a follow-up JIRA to address HA. Our intuition is that it is sufficient to persist the Plan alone. During recovery, the _Plan Follower_ will resync the Plan with the scheduler by creating the dynamic queues for the currently active reservations. We will be happy to have your input when we work on the HA JIRA.

[~curino] will answer your questions specific to this JIRA.

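To make the invariant in point 5 concrete, here is a minimal Java sketch of an admission check that accepts a reservation only if the sum of reservations stays within capacity at every time step. It is an illustration only, not the YARN-1051 code: the names `PlanSketch`, `Reservation`, `tryAdd` and `committedAt`, the integer resource units, and the discrete time steps are all assumptions, and replanning and overcapacity are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the Plan invariant; not the actual YARN implementation. */
final class PlanSketch {

  /** A reservation of `amount` resource units over the half-open interval [start, end). */
  static final class Reservation {
    final long start;
    final long end;
    final int amount;

    Reservation(long start, long end, int amount) {
      if (end <= start || amount < 0) {
        throw new IllegalArgumentException("invalid reservation");
      }
      this.start = start;
      this.end = end;
      this.amount = amount;
    }

    boolean isActiveAt(long t) {
      return start <= t && t < end;
    }
  }

  private final int capacity; // total resource units available to the plan
  private final List<Reservation> accepted = new ArrayList<>();

  PlanSketch(int capacity) {
    this.capacity = capacity;
  }

  /** Sum of all accepted reservations that are active at time step t. */
  int committedAt(long t) {
    int sum = 0;
    for (Reservation r : accepted) {
      if (r.isActiveAt(t)) {
        sum += r.amount;
      }
    }
    return sum;
  }

  /**
   * Accepts the candidate only if, at every time step it covers, the sum of
   * reservations (including the candidate) stays within capacity.
   */
  boolean tryAdd(Reservation candidate) {
    for (long t = candidate.start; t < candidate.end; t++) {
      if (committedAt(t) + candidate.amount > capacity) {
        return false; // accepting would break the invariant at step t
      }
    }
    accepted.add(candidate);
    return true;
  }
}
```

With a capacity of 100 units, a reservation of 60 over [0, 10) is accepted, a second one of 50 over [5, 15) is rejected because steps 5 through 9 would need 110 units, and one of 50 over [10, 20) is accepted because the two do not overlap.
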

> Making the CapacityScheduler more dynamic
> -----------------------------------------
>
>                 Key: YARN-1707
>                 URL: https://issues.apache.org/jira/browse/YARN-1707
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>              Labels: capacity-scheduler
>         Attachments: YARN-1707.patch
>
>
> The CapacityScheduler is rather static at the moment, and refreshqueue
> provides a rather heavy-handed way to reconfigure it. Moving towards
> long-running services (tracked in YARN-896) and to enable more advanced
> admission control and resource parcelling we need to make the
> CapacityScheduler more dynamic. This is instrumental to the umbrella jira
> YARN-1051.
> Concretely this requires the following changes:
> * create queues dynamically
> * destroy queues dynamically
> * dynamically change queue parameters (e.g., capacity)
> * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100%
> instead of == 100%
> We limit this to LeafQueues.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
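
The relaxed refreshqueue validation from the quoted issue description can be sketched as follows. This is a hedged illustration rather than the actual CapacityScheduler code: the class `ChildCapacityCheck`, its method names, the use of percentages (0 to 100) as the capacity unit, and the rounding tolerance `EPSILON` are all assumptions.

```java
import java.util.List;

/** Hypothetical sketch contrasting the strict and relaxed child-capacity checks. */
final class ChildCapacityCheck {

  // Assumed tolerance for floating-point rounding in configured percentages.
  private static final float EPSILON = 1e-4f;

  private ChildCapacityCheck() {
  }

  /** Strict rule: child capacities must add up to exactly 100%. */
  static boolean sumsToExactly100(List<Float> childCapacities) {
    return Math.abs(total(childCapacities) - 100f) <= EPSILON;
  }

  /** Relaxed rule: child capacities may leave headroom, but must not exceed 100%. */
  static boolean sumsToAtMost100(List<Float> childCapacities) {
    return total(childCapacities) <= 100f + EPSILON;
  }

  private static float total(List<Float> childCapacities) {
    float sum = 0f;
    for (float capacity : childCapacities) {
      if (capacity < 0f) {
        throw new IllegalArgumentException("negative capacity: " + capacity);
      }
      sum += capacity;
    }
    return sum;
  }
}
```

Under the relaxed rule, children at 60% and 30% pass validation, and the unassigned 10% remains available for queues created later. The strict rule would reject that configuration, while both rules reject 70% plus 40%.
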