[
https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076042#comment-14076042
]
Wangda Tan commented on YARN-1707:
----------------------------------
Thanks for uploading the patch, [~curino] and [~subru]. These are great
additions to the current CapacityScheduler. I took a look at your patch.
*First, I have a couple of questions about its background, especially
{{PlanQueue}}/{{ReservationQueue}}. I think understanding the background is
important for me to get the whole picture of this patch. What I understand so
far is:*
# {{PlanQueue}} can have a normal {{ParentQueue}} as its parent, but all
children of a {{PlanQueue}} must be {{ReservationQueue}}s. Is it possible for
multiple {{PlanQueue}}s to exist in the cluster?
# {{PlanQueue}} is initially set up in configuration, the same as
{{ParentQueue}}; it has an absolute capacity, etc. But unlike
{{ParentQueue}}, it also has user-limit/user-limit-factor, etc.
# {{ReservationQueue}} is dynamically initialized by PlanFollower: when a new
reservationId is acquired, it creates a new {{ReservationQueue}} accordingly.
# {{PlanFollower}} can dynamically adjust the queue size of
{{ReservationQueue}}s so that resource reservations can be satisfied.
# Is it possible that the sum of reserved resources exceeds the limit of
{{PlanQueue}}/{{ReservationQueue}} and preemption is triggered?
# How do we deal with RM restart? The RM may restart during a resource
reservation, so we may need to consider how to persist such queues.
I hope you can share your thoughts on these.
*Regarding the requirements of this ticket (copied from JIRA):*
# create queues dynamically
# destroy queues dynamically
# dynamically change queue parameters (e.g., capacity)
# modify refreshqueue validation to enforce sum(child.getCapacity())<= 100%
instead of ==100%
# move app across queues
I found that #1-#3 are used exclusively by {{PlanQueue}}, {{Reservation}}. IMHO,
it would be better to add them to CapacityScheduler without coupling them with
ReservationSystem, but I cannot think of other solid scenarios that can
leverage them. I hope to get feedback from the community before we couple them
with ReservationSystem. And as mentioned by [~acmurthy], can we merge adding a
queue into the existing mechanism for adding new queues?
#4 should only be valid in {{PlanQueue}}. If we change this behavior in
{{ParentQueue}}, a careless admin may misconfigure the capacities of queues
under a parent queue; if the sum of their capacities does not equal 1, some
resources may not be usable by applications.
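To make the point about #4 concrete, here is a minimal sketch of how the two validation rules could sit side by side. It is not taken from the patch: {{ChildCapacityCheck}}, {{isValid}}, and the {{allowUnderAllocation}} flag are hypothetical names, and capacities are assumed to be fractions of the parent (1.0 means 100%).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of CapacityScheduler. */
final class ChildCapacityCheck {
    // Tolerance for floating-point rounding when summing capacities.
    private static final float EPSILON = 1e-5f;

    /**
     * @param childCapacities capacities of the direct children, as fractions
     *                        of the parent (1.0f == 100%)
     * @param allowUnderAllocation assumed to be true only for a PlanQueue-like
     *                        parent (sum <= 100%); false keeps the strict
     *                        ParentQueue rule (sum == 100%)
     */
    static boolean isValid(List<Float> childCapacities, boolean allowUnderAllocation) {
        float sum = 0f;
        for (float capacity : childCapacities) {
            if (capacity < 0f) {
                return false; // a negative capacity is never valid
            }
            sum += capacity;
        }
        if (sum > 1f + EPSILON) {
            return false; // over-allocation is rejected under both rules
        }
        // Under-allocation is accepted only when explicitly allowed.
        return allowUnderAllocation || Math.abs(sum - 1f) <= EPSILON;
    }
}
```

With this shape, children summing to 80% pass only when the relaxed rule is requested, so an ordinary {{ParentQueue}} would still reject a misconfiguration that leaves 20% unusable.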
*Some other comments (mainly about moving apps, because we may need to consider
the scope of creating/destroying queues first):*
1) I think we need to consider how moving apps across queues works with
YARN-1368. We can change the queue of containers from queueA to queueB, but
with YARN-1368, during RM restart, a container will report that it is in queueA
(we don't sync the change to the NM when doing the moveApp operation). I hope
[~jianhe] could share some thoughts about this as well.
2) Moving an application in CapacityScheduler needs to call finishApplication
in the source queue and submitApplication in the target queue to keep
QueueMetrics correct. submitApplication will check the ACL of the target queue
as well.
3) Should we respect MaxApplicationsPerUser in the target queue when trying to
move an app? IMHO, we can stop moving the app if MaxApplicationsPerUser has
been reached in the target queue.
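As a rough illustration of points 2) and 3), the move could look something like the sketch below. {{SketchQueue}} and {{MoveAppSketch}} are hypothetical stand-ins, not the real LeafQueue or scheduler classes; only the per-user application count is modelled, and the ACL check that submitApplication would perform is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a leaf queue; not the real CapacityScheduler API. */
final class SketchQueue {
    final String name;
    final int maxApplicationsPerUser;
    // Per-user application counts, standing in for the queue's metrics.
    private final Map<String, Integer> appsPerUser = new HashMap<>();

    SketchQueue(String name, int maxApplicationsPerUser) {
        this.name = name;
        this.maxApplicationsPerUser = maxApplicationsPerUser;
    }

    int getNumApplications(String user) {
        return appsPerUser.getOrDefault(user, 0);
    }

    boolean canAccept(String user) {
        return getNumApplications(user) < maxApplicationsPerUser;
    }

    void submitApplication(String user) {
        appsPerUser.merge(user, 1, Integer::sum);
    }

    void finishApplication(String user) {
        // Decrement the count and drop the entry once it reaches zero.
        appsPerUser.computeIfPresent(user, (u, n) -> n > 1 ? n - 1 : null);
    }
}

/** Hypothetical move operation following points 2) and 3) above. */
final class MoveAppSketch {
    static boolean moveApplication(String user, SketchQueue source, SketchQueue target) {
        // Point 3): refuse the move if the target is already at its per-user limit.
        if (!target.canAccept(user)) {
            return false; // the app stays in the source queue
        }
        // Point 2): finish in the source, then submit in the target,
        // so the counts of both queues stay consistent.
        source.finishApplication(user);
        target.submitApplication(user);
        return true;
    }
}
```

Checking the limit before touching either queue means a rejected move leaves both queues' counts unchanged.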
Thanks,
Wangda
> Making the CapacityScheduler more dynamic
> -----------------------------------------
>
> Key: YARN-1707
> URL: https://issues.apache.org/jira/browse/YARN-1707
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: capacityscheduler
> Reporter: Carlo Curino
> Assignee: Carlo Curino
> Labels: capacity-scheduler
> Attachments: YARN-1707.patch
>
>
> The CapacityScheduler is rather static at the moment, and refreshqueue
> provides a rather heavy-handed way to reconfigure it. Moving towards
> long-running services (tracked in YARN-896) and to enable more advanced
> admission control and resource parcelling we need to make the
> CapacityScheduler more dynamic. This is instrumental to the umbrella jira
> YARN-1051.
> Concretely, this requires the following changes:
> * create queues dynamically
> * destroy queues dynamically
> * dynamically change queue parameters (e.g., capacity)
> * modify refreshqueue validation to enforce sum(child.getCapacity())<= 100%
> instead of ==100%
> We limit this to LeafQueues.