[ https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120729#comment-14120729 ]
Subramaniam Krishnan commented on YARN-1712:
--------------------------------------------

Thanks [~leftnoteasy] for taking a look at the patch. Since [~curino] is traveling, I'll try to answer your questions.

Your understanding is very close: your steps 1-7 are correct. A little context is missing, which I will try to explain.

bq. Question: Why not do 2) after 4)? Is it better to do shrink after excluded expired reservations?

Shrinking might be required because reservations are absolute, while queues express relative (% of cluster) capacity. We need to shrink first because shrinking might result in additional expired reservations. The expired reservations are those that exist in the scheduler but are not currently active in the Plan (post shrinking, if required). I should add that shrinking is a rare exception case, needed only when we lose large chunks of cluster capacity.

bq. 6) Sort all reservations, from less to more satisfied, and set their new entitlement.
bq. Question: Is it possible totalAssignedCapacity > 1? Could you please explain how to avoid it happen?

We sort all reservations based on what was promised to each of them at this moment in time. That amount can vary because we support skylines for reservations, i.e. resource requirements that vary over time. This is required to handle DAGs, as in the case of Tez, Oozie, Hive or Pig queries, since the nodes of the DAG will have different resource needs. This is explained in detail in the tech report we uploaded as part of YARN-1051.

The totalAssignedCapacity will never exceed 1 because:
1) We always release all excess capacity before starting to allocate fresh capacity.
2) The reservations themselves are validated before being added to the Plan (YARN-1709 & YARN-1711) to ensure that they never exceed the total capacity of the Plan.
As mentioned above, shrinking will handle large transient cluster failures.
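To make the ordering concrete, here is a minimal Python sketch of one synchronization pass. It is not code from the patch. The function name `follow_plan`, its arguments, and the shrink rule (dropping the most recently accepted reservations until the Plan fits) are all assumptions chosen for illustration. It shows why releasing excess capacity before granting new capacity keeps the running total at or below 1.

```python
def follow_plan(plan_now, current, cluster_capacity):
    """Illustrative sketch only; all names here are hypothetical.

    plan_now: list of (name, promised_absolute) in acceptance order.
    current:  dict of name -> fraction of the cluster entitled right now
              (assumed to sum to at most 1).
    Returns (new entitlements, peak total fraction seen during the pass).
    """
    # 1) Shrink: reservations are absolute, so after a capacity loss the
    #    Plan may no longer fit. ASSUMPTION: drop the newest ones first.
    active = list(plan_now)
    while active and sum(p for _, p in active) > cluster_capacity:
        active.pop()

    # Convert absolute promises into relative (% of cluster) targets.
    targets = {name: p / cluster_capacity for name, p in active}

    # 2) Expired: anything in the scheduler that is no longer active in
    #    the (possibly shrunk) Plan is removed.
    entitlements = {n: f for n, f in current.items() if n in targets}
    for name in targets:
        entitlements.setdefault(name, 0.0)

    # 3) Release excess capacity first, then grant fresh capacity,
    #    smallest reservation first.
    releasing = [n for n in targets if entitlements[n] > targets[n]]
    growing = sorted(
        (n for n in targets if entitlements[n] <= targets[n]),
        key=lambda n: targets[n],
    )

    peak = sum(entitlements.values())
    for name in releasing + growing:
        entitlements[name] = targets[name]
        peak = max(peak, sum(entitlements.values()))
    return entitlements, peak


if __name__ == "__main__":
    plan = [("a", 50), ("b", 30), ("c", 40)]  # 120 promised, 100 available
    running = {"a": 0.2, "b": 0.6, "c": 0.1, "old": 0.1}
    print(follow_plan(plan, running, cluster_capacity=100))
```

In this example, growing "a" to 0.5 before shrinking "b" from 0.6 would briefly put the total at 1.1. Releasing first avoids that.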
{quote}
One comment is, Current compare and sort reservation is comparing (allocatedResource - guaranteedResource), one feeling at top of my mind is, this may make larger queue can get resource easier than small queue. Is it possible an app can get more resource than other by lying to RM that it needs more resource when fierce competition on resource?
{quote}

To prevent exactly this, we do our allocations starting from the smallest reservation queue and moving to the largest. We also enforce sharing policies (YARN-1711) to prevent a single user/app from reserving the entire cluster's resources or causing starvation by hoarding resources.

Hope this clarifies the logic. Feel free to follow up if you have any further questions.

> Admission Control: plan follower
> --------------------------------
>
> Key: YARN-1712
> URL: https://issues.apache.org/jira/browse/YARN-1712
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacityscheduler, resourcemanager
> Reporter: Carlo Curino
> Assignee: Carlo Curino
> Labels: reservations, scheduler
> Attachments: YARN-1712.1.patch, YARN-1712.patch
>
>
> This JIRA tracks a thread that continuously propagates the current state of
> an inventory subsystem to the scheduler. As the inventory subsystem stores the
> "plan" of how the resources should be subdivided, the work we propose in this
> JIRA realizes such a plan by dynamically instructing the CapacityScheduler to
> add/remove/resize queues to follow the plan.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)