Re: [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support
Hi,

On 12/02/18 08:47, Tejun Heo wrote:
> Hello,
>
> On Mon, Feb 12, 2018 at 02:40:29PM +0100, Juri Lelli wrote:
> > - implementation _is not_ hierarchical: only single/plain DEADLINE entities
> >   can be handled, and they get scheduled at root rq level
>
> This usually is a deal breaker and often indicates that the cgroup
> filesystem is not the right interface for the feature. Can you please
> elaborate the interface with some details?

The interface is the same as what we have today for groups of RT tasks,
and the same rules apply. The difference is that with RT,
cpu.rt_runtime_us and cpu.rt_period_us control RT-throttling behaviour
(fraction of CPU time and granularity), while for DEADLINE the same
interface would be used only at admission-control time (while servicing
a sched_setattr(), attaching tasks to a group, or changing a group's
parameters), since DEADLINE tasks already have their own throttling
mechanism.

Intended usage should be very similar. For example, a sysadmin who wants
to reserve and guarantee CPU bandwidth for a group of tasks would create
a group, configure its rt_runtime_us and rt_period_us, and put DEADLINE
tasks inside it (e.g. a video/audio pipeline).

Related to what I was saying in the cover letter (i.e., non-root access
to DEADLINE scheduling) might be a different situation, where the
sysadmin wants to grant a user a certain percentage of CPU time (by
creating a group and putting the user's session inside it) and also
ensure the user doesn't exceed what was granted. The user would then be
free to spawn DEADLINE tasks to service her/his needs, up to the maximum
bandwidth cap set by the sysadmin.

Does this make sense and provide a bit more information?

Thanks a lot for looking at this!

Best,

- Juri
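[ For readers unfamiliar with the knobs being discussed, the flow above
could look roughly like this. This is a sketch, not from the patch: it
assumes a cgroup v1 hierarchy with the cpu controller mounted at
/sys/fs/cgroup/cpu; the group name "media", the PID variable, and the
chosen values are all illustrative, and the commands need root. ]

```shell
# Create a group and give it 25% CPU bandwidth (25ms every 100ms).
# With this patch the same two knobs would also cap DEADLINE admission.
mkdir /sys/fs/cgroup/cpu/media
echo 100000 > /sys/fs/cgroup/cpu/media/cpu.rt_period_us
echo 25000  > /sys/fs/cgroup/cpu/media/cpu.rt_runtime_us

# Move the pipeline task into the group; it can then switch itself to
# SCHED_DEADLINE via sched_setattr() as long as the summed bandwidth of
# the group's tasks stays under the cap.
echo "$PIPELINE_PID" > /sys/fs/cgroup/cpu/media/tasks

# The two knobs the patch adds (per the changelog):
cat /sys/fs/cgroup/cpu/media/cpu.dl_bw        # per-CPU cap (runtime/period)
cat /sys/fs/cgroup/cpu/media/cpu.dl_total_bw  # bandwidth currently allocated
```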
Re: [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support
Hello,

On Mon, Feb 12, 2018 at 02:40:29PM +0100, Juri Lelli wrote:
> - implementation _is not_ hierarchical: only single/plain DEADLINE entities
>   can be handled, and they get scheduled at root rq level

This usually is a deal breaker and often indicates that the cgroup
filesystem is not the right interface for the feature. Can you please
elaborate the interface with some details?

Thanks.

--
tejun
[RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support
One of the missing features of DEADLINE (w.r.t. RT) is some way of
controlling CPU bandwidth allocation for task groups. Such a feature
would be especially useful for letting normal users use DEADLINE, as the
sysadmin (with root privileges) could reserve a fraction of the total
available bandwidth for users and let them allocate what they need
inside such space.

This patch implements cgroup support for DEADLINE, with the following
design choices:

 - implementation _is not_ hierarchical: only single/plain DEADLINE
   entities can be handled, and they get scheduled at root rq level

 - DEADLINE_GROUP_SCHED requires RT_GROUP_SCHED (because of the points
   below)

 - DEADLINE and RT share bandwidth; therefore, DEADLINE tasks will eat
   RT bandwidth, as they do today at root level; support for
   RT_RUNTIME_SHARE is however missing: an RT task might be able to
   exceed its group bandwidth constraint if such feature is enabled
   (more thinking required)

 - and therefore cpu.rt_runtime_us and cpu.rt_period_us still control a
   group's bandwidth; however, two additional knobs are added

   # cpu.dl_bw       : maximum bandwidth available for the group on
                       each CPU (rt_runtime_us/rt_period_us)
   # cpu.dl_total_bw : current total (across CPUs) amount of bandwidth
                       allocated by the group (sum of tasks' bandwidth)

 - parent/children/siblings rules are the same as for RT

Signed-off-by: Juri Lelli
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Tejun Heo
Cc: Luca Abeni
Cc: linux-kernel@vger.kernel.org
---
 init/Kconfig             |  12
 kernel/sched/autogroup.c |   7 +++
 kernel/sched/core.c      |  54 +++-
 kernel/sched/deadline.c  | 159 ++-
 kernel/sched/rt.c        |  52 ++--
 kernel/sched/sched.h     |  20 +-
 6 files changed, 292 insertions(+), 12 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index e37f4b2a6445..c6ddda90d51f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -751,6 +751,18 @@ config RT_GROUP_SCHED
 	  realtime bandwidth for them.
 	  See Documentation/scheduler/sched-rt-group.txt for more information.

+config DEADLINE_GROUP_SCHED
+	bool "Group scheduling for SCHED_DEADLINE"
+	depends on CGROUP_SCHED
+	select RT_GROUP_SCHED
+	default n
+	help
+	  This feature lets you explicitly specify, in terms of runtime
+	  and period, the bandwidth of each task control group. This means
+	  tasks (and other groups) can be added to it only up to such
+	  "bandwidth cap", which might be useful for avoiding or
+	  controlling oversubscription.
+
 endif #CGROUP_SCHED

 config CGROUP_PIDS
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index a43df5193538..7cba2e132ac7 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -90,6 +90,13 @@ static inline struct autogroup *autogroup_create(void)
 	free_rt_sched_group(tg);
 	tg->rt_se = root_task_group.rt_se;
 	tg->rt_rq = root_task_group.rt_rq;
+#endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	/*
+	 * Similarly to the above, but for DEADLINE tasks.
+	 */
+	free_dl_sched_group(tg);
+	tg->dl_rq = root_task_group.dl_rq;
 #endif
 	tg->autogroup = ag;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 772a6b3239eb..8bb3e74b9486 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4225,7 +4225,8 @@ static int __sched_setscheduler(struct task_struct *p,
 #endif
 #ifdef CONFIG_SMP
 	if (dl_bandwidth_enabled() && dl_policy(policy) &&
-	    !(attr->sched_flags & SCHED_FLAG_SUGOV)) {
+	    !(attr->sched_flags & SCHED_FLAG_SUGOV) &&
+	    !task_group_is_autogroup(task_group(p))) {
 		cpumask_t *span = rq->rd->span;

 		/*
@@ -5900,6 +5901,9 @@ void __init sched_init(void)
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
 	alloc_size += 2 * nr_cpu_ids * sizeof(void **);
+#endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+	alloc_size += nr_cpu_ids * sizeof(void **);
 #endif
 	if (alloc_size) {
 		ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);
@@ -5920,6 +5924,11 @@ void __init sched_init(void)
 		ptr += nr_cpu_ids * sizeof(void **);

 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+		root_task_group.dl_rq = (struct dl_rq **)ptr;
+		ptr += nr_cpu_ids * sizeof(void **);
+
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
 	}
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	for_each_possible_cpu(i) {
@@ -5941,6 +5950,11 @@ void __init sched_init(void)
 	init_rt_bandwidth(&root_task_group.rt_bandwidth,
 			global_rt_period(), global_rt_runtime());
 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+