Wangda Tan commented on YARN-1707:

Thanks for uploading the patch [~curino], [~subru]. It is a great addition to 
the current CapacityScheduler. I took a look at your patch.

*First, I have a couple of questions about the background, especially 
{{PlanQueue}}/{{ReservationQueue}} in this patch. Understanding the 
background is important for me to get the whole picture of this patch. What I 
understand so far is:*
# {{PlanQueue}} can have a normal {{ParentQueue}} as its parent, but all 
children of a {{PlanQueue}} must be {{ReservationQueue}}s. Is it possible for 
multiple {{PlanQueue}}s to exist in the cluster?
# {{PlanQueue}} is initially set up in the configuration, the same as 
{{ParentQueue}}; it has absolute capacity, etc. But unlike 
{{ParentQueue}}, it also has user-limit/user-limit-factor, etc.
# {{ReservationQueue}} is dynamically initialized by {{PlanFollower}}: when a 
new reservationId is acquired, it creates a new {{ReservationQueue}} accordingly.
# {{PlanFollower}} can dynamically adjust the size of {{ReservationQueue}}s 
so that resource reservations can be satisfied.
# Is it possible that the sum of reserved resources exceeds the limit of 
{{PlanQueue}}/{{ReservationQueue}} and preemption is triggered?
# How do we deal with RM restart? The RM may restart during a resource 
reservation, so we may need to consider how to persist such queues.

Hope you could share your ideas about them.

*For the requirements of this ticket (copied from JIRA):*
# create queues dynamically
# destroy queues dynamically
# dynamically change queue parameters (e.g., capacity)
# modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% 
instead of ==100%
# move app across queues

I found that #1-#3 are used only by {{PlanQueue}} and {{Reservation}}. IMHO, it 
would be better to add them to CapacityScheduler without coupling them to 
ReservationSystem, but I cannot think of other solid scenarios that could 
leverage them. I hope to get feedback from the community before we couple them 
with ReservationSystem. And as mentioned by [~acmurthy], can we merge adding a 
queue into the existing add-new-queue mechanism?
#4 should only be valid in {{PlanQueue}}. If we change this behavior in 
{{ParentQueue}}, a careless admin could misconfigure the capacities of queues 
under a parent queue; if their capacities do not sum to 1, some resources may 
not be usable by applications.
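
To make the difference between the two validation rules in #4 concrete, here is a minimal Java sketch. It is not the actual CapacityScheduler validation code: the class and method names ({{CapacityCheck}}, {{sumsToFull}}, {{sumsToAtMostFull}}, {{unassigned}}) and the {{EPSILON}} tolerance are hypothetical, and capacities are modeled as fractions of the parent (1.0 == 100%).

```java
import java.util.List;

/** Hypothetical sketch; not the CapacityScheduler's real validation code. */
final class CapacityCheck {
    // Tolerance for floating-point rounding; the value is an assumption.
    static final float EPSILON = 1e-5f;

    /** Strict rule: child capacities must sum to exactly 1 (100%). */
    static boolean sumsToFull(List<Float> childCapacities) {
        return Math.abs(total(childCapacities) - 1.0f) <= EPSILON;
    }

    /** Relaxed rule: child capacities may sum to anything up to 1 (100%). */
    static boolean sumsToAtMostFull(List<Float> childCapacities) {
        return total(childCapacities) <= 1.0f + EPSILON;
    }

    /** Fraction of the parent's capacity not assigned to any child. */
    static float unassigned(List<Float> childCapacities) {
        return Math.max(0.0f, 1.0f - total(childCapacities));
    }

    private static float total(List<Float> childCapacities) {
        float sum = 0.0f;
        for (float c : childCapacities) {
            sum += c;
        }
        return sum;
    }
}
```

With children at 0.3 and 0.4, the relaxed rule accepts the configuration while the strict rule rejects it, and roughly 0.3 of the parent's capacity is assigned to no child. That unassigned share is the misconfiguration risk described above for a plain {{ParentQueue}}.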

*Some other comments (mainly about moving apps, because we may need to consider 
the scope of creating/destroying queues first):*
1) I think we need to consider how moving apps across queues works with 
YARN-1368. We can change the queue of containers from queueA to queueB, but with 
YARN-1368, during RM restart, a container will report that it is in queueA (we 
don't sync the change to the NM when doing the moveApp operation). I hope 
[~jianhe] could share some thoughts about this as well.
2) Moving an application in CapacityScheduler needs to call finishApplication in 
the source queue and submitApplication in the target queue to keep QueueMetrics 
correct. submitApplication will check the ACL of the target queue as well.
3) Should we respect MaxApplicationsPerUser in the target queue when trying to 
move an app? IMHO, we can stop moving the app if MaxApplicationsPerUser is reached in target
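
The bookkeeping suggested in 2) and 3) could look roughly like the following Java sketch. Everything in it is a simplified, hypothetical model: {{SketchQueue}}, {{MoveAppSketch}}, the {{allowedUsers}} set standing in for a submit ACL, and the per-user counter standing in for QueueMetrics are all assumptions, not the real LeafQueue API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified queue model; not the real LeafQueue API. */
final class SketchQueue {
    final String name;
    final int maxApplicationsPerUser;
    // Stand-in for a submit ACL: users allowed to submit to this queue.
    final Set<String> allowedUsers = new HashSet<>();
    // Stand-in for per-user application metrics.
    private final Map<String, Integer> appsPerUser = new HashMap<>();

    SketchQueue(String name, int maxApplicationsPerUser) {
        this.name = name;
        this.maxApplicationsPerUser = maxApplicationsPerUser;
    }

    /** Admits one application for the user, enforcing ACL and per-user limit. */
    void submitApplication(String user) {
        if (!allowedUsers.contains(user)) {
            throw new SecurityException(user + " cannot submit to " + name);
        }
        int current = appsPerUser.getOrDefault(user, 0);
        if (current >= maxApplicationsPerUser) {
            throw new IllegalStateException(
                "MaxApplicationsPerUser reached in " + name);
        }
        appsPerUser.put(user, current + 1);
    }

    /** Removes one application of the user from this queue's accounting. */
    void finishApplication(String user) {
        appsPerUser.computeIfPresent(user, (u, n) -> n > 1 ? n - 1 : null);
    }

    int appCount(String user) {
        return appsPerUser.getOrDefault(user, 0);
    }
}

final class MoveAppSketch {
    /** Moves one application's accounting from source to target. */
    static void moveApp(String user, SketchQueue source, SketchQueue target) {
        // Admit into the target first: if the ACL or limit check throws,
        // the source queue's accounting is left untouched.
        target.submitApplication(user);
        source.finishApplication(user);
    }
}
```

Submitting to the target before finishing in the source is a choice made for this sketch, so that a rejected move leaves the application counted in its original queue; the patch may order these calls differently.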


> Making the CapacityScheduler more dynamic
> -----------------------------------------
>                 Key: YARN-1707
>                 URL: https://issues.apache.org/jira/browse/YARN-1707
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>              Labels: capacity-scheduler
>         Attachments: YARN-1707.patch
> The CapacityScheduler is rather static at the moment, and refreshqueue 
> provides a rather heavy-handed way to reconfigure it. Moving towards 
> long-running services (tracked in YARN-896) and to enable more advanced 
> admission control and resource parcelling we need to make the 
> CapacityScheduler more dynamic. This is instrumental to the umbrella jira 
> YARN-1051.
> Concretely, this requires the following changes:
> * create queues dynamically
> * destroy queues dynamically
> * dynamically change queue parameters (e.g., capacity) 
> * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100% 
> instead of ==100%
> We limit this to LeafQueues. 

This message was sent by Atlassian JIRA
