Re: [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support

2018-02-12 Thread Juri Lelli
Hi,

On 12/02/18 08:47, Tejun Heo wrote:
> Hello,
> 
> On Mon, Feb 12, 2018 at 02:40:29PM +0100, Juri Lelli wrote:
> >  - implementation _is not_ hierarchical: only single/plain DEADLINE entities
> >can be handled, and they get scheduled at root rq level
> 
> This usually is a deal breaker and often indicates that the cgroup
> filesystem is not the right interface for the feature.  Can you please
> elaborate the interface with some details?

The interface is the same one we have today for groups of RT tasks, and
the same rules apply. The difference is that, for RT, a group's
cpu.rt_runtime_us and cpu.rt_period_us control RT-Throttling behaviour
(fraction of CPU time and granularity), while for DEADLINE the same
interface would be used only at admission control time (while servicing
a sched_setattr(), attaching tasks to a group, or changing a group's
parameters), since DEADLINE tasks already have their own throttling
mechanism.
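To make the admission-control idea concrete, here is a rough sketch in Python. This is purely illustrative, not kernel code; all the names (`group_cap`, `can_admit`, the dict fields) are made up for the example. The only assumption taken from the thread is that a task's bandwidth is its runtime/period ratio and that the group cap is rt_runtime_us/rt_period_us:

```python
# Illustrative sketch of per-group DEADLINE admission control.
# All names here are hypothetical; the real check lives in the kernel.

def group_cap(rt_runtime_us: int, rt_period_us: int) -> float:
    """Per-CPU bandwidth cap of the group, as a fraction of CPU time."""
    return rt_runtime_us / rt_period_us

def can_admit(group_tasks, new_task, rt_runtime_us, rt_period_us) -> bool:
    """Admit new_task only if total allocated bandwidth stays under the cap.

    Each task's bandwidth is sched_runtime / sched_period, as requested
    via sched_setattr().
    """
    cap = group_cap(rt_runtime_us, rt_period_us)
    total = sum(t["runtime"] / t["period"] for t in group_tasks)
    return total + new_task["runtime"] / new_task["period"] <= cap

# A group capped at 50% of each CPU (runtime 500000us over period 1000000us):
tasks = [{"runtime": 10_000, "period": 100_000}]   # 10% already allocated
new = {"runtime": 30_000, "period": 100_000}       # asks for 30% more
print(can_admit(tasks, new, 500_000, 1_000_000))   # 10% + 30% <= 50% -> True
```

The point is only that, unlike RT throttling, the group knobs act once, at admission time; a task that passes this check is then policed by DEADLINE's own per-task runtime enforcement.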

Intended usage should be very similar. For example, a sysadmin who wants
to reserve and guarantee CPU bandwidth for a group of tasks would create
a group, configure its rt_runtime_us and rt_period_us, and put DEADLINE
tasks inside it (e.g. a video/audio pipeline). A different situation,
related to what I was saying in the cover letter (i.e., non-root access
to DEADLINE scheduling), is one where the sysadmin wants to grant a user
a certain percentage of CPU time (by creating a group and putting the
user session inside it) and also ensure the user doesn't exceed what was
granted. The user would then be free to spawn DEADLINE tasks to service
her/his needs, up to the maximum bandwidth cap set by the sysadmin.
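The per-user scenario can be sketched the same way (again hypothetical Python, not kernel code; `try_spawn` and the error behaviour noted in the comment are assumptions for illustration): the admin-set cap bounds whatever the user spawns, with no further admin involvement:

```python
# Hypothetical sketch: a user spawns DEADLINE tasks inside a group whose
# bandwidth cap was fixed by the sysadmin; requests beyond the cap fail.

def try_spawn(allocated: float, request: float, cap: float):
    """Return the new allocated bandwidth, or None if admission fails."""
    if allocated + request > cap:
        return None        # sched_setattr() would presumably fail here
    return allocated + request

cap = 0.20                     # admin granted this user 20% of each CPU
allocated = 0.0
for req in (0.05, 0.10, 0.10): # user asks for 5%, 10%, then 10% more
    res = try_spawn(allocated, req, cap)
    print("admitted" if res is not None else "rejected")
    if res is not None:
        allocated = res
# -> admitted, admitted, rejected (third request would exceed the 20% cap)
```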

Does this make any sense and provide a bit more information?

Thanks a lot for looking at this!

Best,

- Juri



Re: [RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support

2018-02-12 Thread Tejun Heo
Hello,

On Mon, Feb 12, 2018 at 02:40:29PM +0100, Juri Lelli wrote:
>  - implementation _is not_ hierarchical: only single/plain DEADLINE entities
>can be handled, and they get scheduled at root rq level

This usually is a deal breaker and often indicates that the cgroup
filesystem is not the right interface for the feature.  Can you please
elaborate the interface with some details?

Thanks.

-- 
tejun



[RFC PATCH 2/3] sched/deadline: add task groups bandwidth management support

2018-02-12 Thread Juri Lelli
One of the missing features of DEADLINE (w.r.t. RT) is a way of controlling
CPU bandwidth allocation for task groups. Such a feature would be especially
useful for letting normal users use DEADLINE, as the sysadmin (with root
privileges) could reserve a fraction of the total available bandwidth for
users and let them allocate what is needed inside such space.

This patch implements cgroup support for DEADLINE, with the following design
choices:

 - implementation _is not_ hierarchical: only single/plain DEADLINE entities
   can be handled, and they get scheduled at root rq level

 - DEADLINE_GROUP_SCHED requires RT_GROUP_SCHED (because of the points below)

 - DEADLINE and RT share bandwidth; therefore, DEADLINE tasks will eat RT
   bandwidth, as they do today at root level; support for RT_RUNTIME_SHARE is
   however missing: an RT task might be able to exceed its group bandwidth
   constraint if such feature is enabled (more thinking required)

 - and therefore cpu.rt_runtime_us and cpu.rt_period_us still control a
   group's bandwidth; however, two additional knobs are added:

 # cpu.dl_bw : maximum bandwidth available for the group on each CPU
   (rt_runtime_us/rt_period_us)
 # cpu.dl_total_bw : current total (across CPUs) amount of bandwidth
   allocated by the group (sum of tasks' bandwidths)

 - parent/children/siblings rules are the same as for RT
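The semantics of the two new knobs can be illustrated with a small Python sketch. This is not the patch's code; the only kernel detail borrowed is the fixed-point `to_ratio()` representation (a 20-bit shift, as in mainline's BW_SHIFT), which is assumed here:

```python
BW_SHIFT = 20  # fixed-point shift, as used by the kernel's to_ratio()

def to_ratio(period_us: int, runtime_us: int) -> int:
    """Bandwidth as a fixed-point fraction: (runtime << BW_SHIFT) / period."""
    return (runtime_us << BW_SHIFT) // period_us

# cpu.dl_bw: per-CPU cap of the group, derived from the RT knobs.
dl_bw = to_ratio(1_000_000, 500_000)            # 50% -> 0.5 << 20 = 524288

# cpu.dl_total_bw: sum of admitted tasks' bandwidths across CPUs.
tasks = [(100_000, 10_000), (100_000, 25_000)]  # (period, runtime) in us
dl_total_bw = sum(to_ratio(p, r) for p, r in tasks)

print(dl_bw, dl_total_bw)  # 524288 367001
```

So dl_bw is a static cap mirroring rt_runtime_us/rt_period_us, while dl_total_bw is a read-only view of how much of that cap the group's tasks currently consume.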

Signed-off-by: Juri Lelli 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: Luca Abeni 
Cc: linux-kernel@vger.kernel.org
---
 init/Kconfig |  12 
 kernel/sched/autogroup.c |   7 +++
 kernel/sched/core.c  |  54 +++-
 kernel/sched/deadline.c  | 159 ++-
 kernel/sched/rt.c|  52 ++--
 kernel/sched/sched.h |  20 +-
 6 files changed, 292 insertions(+), 12 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index e37f4b2a6445..c6ddda90d51f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -751,6 +751,18 @@ config RT_GROUP_SCHED
  realtime bandwidth for them.
  See Documentation/scheduler/sched-rt-group.txt for more information.
 
+config DEADLINE_GROUP_SCHED
+   bool "Group scheduling for SCHED_DEADLINE"
+   depends on CGROUP_SCHED
+   select RT_GROUP_SCHED
+   default n
+   help
+ This feature lets you explicitly specify, in terms of runtime
+ and period, the bandwidth of each task control group. This means
+ tasks (and other groups) can be added to it only up to such
+ "bandwidth cap", which might be useful for avoiding or
+ controlling oversubscription.
+
 endif #CGROUP_SCHED
 
 config CGROUP_PIDS
diff --git a/kernel/sched/autogroup.c b/kernel/sched/autogroup.c
index a43df5193538..7cba2e132ac7 100644
--- a/kernel/sched/autogroup.c
+++ b/kernel/sched/autogroup.c
@@ -90,6 +90,13 @@ static inline struct autogroup *autogroup_create(void)
free_rt_sched_group(tg);
tg->rt_se = root_task_group.rt_se;
tg->rt_rq = root_task_group.rt_rq;
+#endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+   /*
+* Similarly to what above we do for DEADLINE tasks.
+*/
+   free_dl_sched_group(tg);
+   tg->dl_rq = root_task_group.dl_rq;
 #endif
tg->autogroup = ag;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 772a6b3239eb..8bb3e74b9486 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4225,7 +4225,8 @@ static int __sched_setscheduler(struct task_struct *p,
 #endif
 #ifdef CONFIG_SMP
if (dl_bandwidth_enabled() && dl_policy(policy) &&
-   !(attr->sched_flags & SCHED_FLAG_SUGOV)) {
+   !(attr->sched_flags & SCHED_FLAG_SUGOV) &&
+   !task_group_is_autogroup(task_group(p))) {
cpumask_t *span = rq->rd->span;
 
/*
@@ -5900,6 +5901,9 @@ void __init sched_init(void)
 #endif
 #ifdef CONFIG_RT_GROUP_SCHED
alloc_size += 2 * nr_cpu_ids * sizeof(void **);
+#endif
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+   alloc_size += nr_cpu_ids * sizeof(void **);
 #endif
if (alloc_size) {
ptr = (unsigned long)kzalloc(alloc_size, GFP_NOWAIT);
@@ -5920,6 +5924,11 @@ void __init sched_init(void)
ptr += nr_cpu_ids * sizeof(void **);
 
 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+   root_task_group.dl_rq = (struct dl_rq **)ptr;
+   ptr += nr_cpu_ids * sizeof(void **);
+
+#endif /* CONFIG_DEADLINE_GROUP_SCHED */
}
 #ifdef CONFIG_CPUMASK_OFFSTACK
for_each_possible_cpu(i) {
@@ -5941,6 +5950,11 @@ void __init sched_init(void)
	init_rt_bandwidth(&root_task_group.rt_bandwidth,
			global_rt_period(), global_rt_runtime());
 

 #endif /* CONFIG_RT_GROUP_SCHED */
+#ifdef CONFIG_DEADLINE_GROUP_SCHED
+