Re: [PATCH v5 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-11-06 Thread Patrick Bellasi


Hi Yun,
thanks for continuing to improve this.

I'm replying here, but still considering all the other reviewers' comments.

Best,
Patrick

On Tue, Nov 03, 2020 at 03:37:56 +0100, Yun Hsiang  
wrote...

> If the user wants to stop controlling uclamp and let the task inherit
> the value from the group, we need a method to reset.
>
> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via
> sched_setattr syscall.
>
> The policy is
> _CLAMP_RESET   => reset both min and max
> _CLAMP_RESET | _CLAMP_MIN  => reset min value
> _CLAMP_RESET | _CLAMP_MAX  => reset max value
> _CLAMP_RESET | _CLAMP_MIN | _CLAMP_MAX => reset both min and max

This documentation should be added to the uapi header and, most
importantly, in:
  include/uapi/linux/sched/types.h
where the documentation for struct sched_attr is provided.
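
While at it, a small usage example could also help. Here is just a
user-space sketch of the policy table above (not part of the patch;
it assumes the 0x80 flag value introduced here and the sched_attr
layout from the uapi headers, and the uclamp_reset_min() helper name is
only for the example; sched_setattr() has no glibc wrapper, hence the
raw syscall):

---8<---
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/sched.h>        /* SCHED_FLAG_* */
#include <linux/sched/types.h>  /* struct sched_attr */

#ifndef SCHED_FLAG_UTIL_CLAMP_RESET
#define SCHED_FLAG_UTIL_CLAMP_RESET	0x80	/* added by this patch */
#endif

/* Reset the task-specific util_min clamp, keep util_max as it is */
static int uclamp_reset_min(pid_t pid)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		/* Keep policy and params, touch only the clamps */
		.sched_flags	= SCHED_FLAG_KEEP_ALL |
				  SCHED_FLAG_UTIL_CLAMP_RESET |
				  SCHED_FLAG_UTIL_CLAMP_MIN,
	};

	/*
	 * Other combinations, as per the policy above:
	 *   _RESET               : reset both min and max
	 *   _RESET | _MAX         : reset only the max value
	 *   _RESET | _MIN | _MAX  : reset both min and max
	 */
	return syscall(SYS_sched_setattr, pid, &attr, 0);
}
---8<---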


> Signed-off-by: Yun Hsiang 
> Reported-by: kernel test robot 
> ---
>  include/uapi/linux/sched.h |  7 +++--
>  kernel/sched/core.c| 59 --
>  2 files changed, 49 insertions(+), 17 deletions(-)
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 3bac0a8ceab2..6c823ddb1a1e 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -132,17 +132,20 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS   0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX0x40
> +#define SCHED_FLAG_UTIL_CLAMP_RESET  0x80
>  
>  #define SCHED_FLAG_KEEP_ALL  (SCHED_FLAG_KEEP_POLICY | \
>SCHED_FLAG_KEEP_PARAMS)
>  

(Related to the following discussion point)
What about adding in a comment here to call out that the following
definitions are "internal only"?
Moreover, we could probably wrap the following two defines within an
#ifdef __KERNEL__/#endif block


Something like:

+ /*
+ * The following definitions are internal only, do not use them to set
+ * sched_{set,get}attr() params. Use instead a valid combination of the
+ * flags defined above.
+ */
+ #ifdef __KERNEL__

>  #define SCHED_FLAG_UTIL_CLAMP(SCHED_FLAG_UTIL_CLAMP_MIN | \
> -  SCHED_FLAG_UTIL_CLAMP_MAX)
> +  SCHED_FLAG_UTIL_CLAMP_MAX | \
> +  SCHED_FLAG_UTIL_CLAMP_RESET)

We need the _RESET flag only here (not below), since this is a UCLAMP
feature and it's worth/useful to have a single "all uclamp flags"
definition...

>  #define SCHED_FLAG_ALL   (SCHED_FLAG_RESET_ON_FORK   | \
>SCHED_FLAG_RECLAIM | \
>SCHED_FLAG_DL_OVERRUN  | \
>SCHED_FLAG_KEEP_ALL| \
> -  SCHED_FLAG_UTIL_CLAMP)
> +  SCHED_FLAG_UTIL_CLAMP  | \
> +  SCHED_FLAG_UTIL_CLAMP_RESET)

... i.e., you can drop the chunk above.

+ #endif /* __KERNEL__ */

Regarding Qais's comment, I had the same thought as Dietmar: I doubt there
are apps using _FLAG_ALL from userspace. For DL tasks, since they
cannot fork, it makes no sense to specify, for example
_RESET_ON_FORK|_RECLAIM. For CFS/RT tasks, where UCLAMP is supported, it
makes no sense to specify DL flags.

It's true, however, that having this definition here when it's supposed to
be used only internally can be kind of "confusing", but it's also useful
to keep it aligned with the flags defined above.
The ifdef wrapping proposed above should make this even more explicit.
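
For clarity, a consolidated (illustrative only) view of how that uapi
header section could look with the wrapping applied:

---8<---
#define SCHED_FLAG_KEEP_PARAMS		0x10
#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
#define SCHED_FLAG_UTIL_CLAMP_RESET	0x80

#define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
				 SCHED_FLAG_KEEP_PARAMS)

/*
 * The following definitions are internal only, do not use them to set
 * sched_{set,get}attr() params. Use instead a valid combination of the
 * flags defined above.
 */
#ifdef __KERNEL__

#define SCHED_FLAG_UTIL_CLAMP	(SCHED_FLAG_UTIL_CLAMP_MIN | \
				 SCHED_FLAG_UTIL_CLAMP_MAX | \
				 SCHED_FLAG_UTIL_CLAMP_RESET)

#define SCHED_FLAG_ALL		(SCHED_FLAG_RESET_ON_FORK	| \
				 SCHED_FLAG_RECLAIM		| \
				 SCHED_FLAG_DL_OVERRUN		| \
				 SCHED_FLAG_KEEP_ALL		| \
				 SCHED_FLAG_UTIL_CLAMP)

#endif /* __KERNEL__ */
---8<---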

Perhaps we can also better call this out with an additional note
right after:

  
https://elixir.bootlin.com/linux/v5.9.6/source/include/uapi/linux/sched/types.h#L43

In that file, I believe the "Task Utilization Attributes" section can
also be improved by adding a description of the _UCLAMP flags semantics.


>  #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 8160ab5263f8..6ae463b64834 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1004,7 +1004,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum 
> uclamp_id clamp_id,
>   return uclamp_idle_value(rq, clamp_id, clamp_value);
>  }
>  
> -static void __uclamp_update_util_min_rt_default(struct task_struct *p)
> +static inline void __uclamp_update_util_min_rt_default(struct task_struct *p)
>  {

Again, IMO, this is _not_ an unrelated change at all. Actually, I still
would like to do one step more and inline this function in the _only
place_ where it's used. Qais's arguments for not doing that were [1]:

  Updating the default rt value is done from different contexts. Hence
  it is important to document the rules under which this update must
  happen and ensure the update happens through a common path.

I don't see why these arguments are not satisfied when we inline, e.g.

---8<---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 

Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-28 Thread Patrick Bellasi


Hi Dietmar, Yun,
I hope I'm not too late before v4 posting ;)

I think the overall approach is sound, I just added in a couple of
cleanups and a possible fix (user_defined reset).

Best,
Patrick


On Tue, Oct 27, 2020 at 16:58:13 +0100, Yun Hsiang  
wrote...

> Hi Dietmar,
> On Mon, Oct 26, 2020 at 08:00:48PM +0100, Dietmar Eggemann wrote:
>> On 26/10/2020 16:45, Yun Hsiang wrote:

[...]

>> I thought about something like this. Only lightly tested. 
>> 
>> ---8<---
>> 
>> From: Dietmar Eggemann 
>> Date: Mon, 26 Oct 2020 13:52:23 +0100
>> Subject: [PATCH] SCHED_FLAG_UTIL_CLAMP_RESET
>> 
>> Signed-off-by: Dietmar Eggemann 
>> ---
>>  include/uapi/linux/sched.h |  4 +++-
>>  kernel/sched/core.c| 31 +++
>>  2 files changed, 30 insertions(+), 5 deletions(-)
>> 
>> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
>> index 3bac0a8ceab2..0dd890822751 100644
>> --- a/include/uapi/linux/sched.h
>> +++ b/include/uapi/linux/sched.h
>> @@ -132,12 +132,14 @@ struct clone_args {
>>  #define SCHED_FLAG_KEEP_PARAMS  0x10
>>  #define SCHED_FLAG_UTIL_CLAMP_MIN   0x20
>>  #define SCHED_FLAG_UTIL_CLAMP_MAX   0x40
>> +#define SCHED_FLAG_UTIL_CLAMP_RESET 0x80
>>  
>>  #define SCHED_FLAG_KEEP_ALL (SCHED_FLAG_KEEP_POLICY | \
>>   SCHED_FLAG_KEEP_PARAMS)
>>  
>>  #define SCHED_FLAG_UTIL_CLAMP   (SCHED_FLAG_UTIL_CLAMP_MIN | \
>> - SCHED_FLAG_UTIL_CLAMP_MAX)
>> + SCHED_FLAG_UTIL_CLAMP_MAX | \
>> + SCHED_FLAG_UTIL_CLAMP_RESET)
>>  
>>  #define SCHED_FLAG_ALL  (SCHED_FLAG_RESET_ON_FORK   | \
>>   SCHED_FLAG_RECLAIM | \
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 3dc415f58bd7..717b1cf5cf1f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1438,6 +1438,23 @@ static int uclamp_validate(struct task_struct *p,
>>  return 0;
>>  }
>>  
>> +static bool uclamp_reset(enum uclamp_id clamp_id, unsigned long flags)
>> +{

Maybe we can add in some comments?

/* No _UCLAMP_RESET flag set: do not reset */
>> +if (!(flags & SCHED_FLAG_UTIL_CLAMP_RESET))
>> +return false;
>> +

/* Only _UCLAMP_RESET flag set: reset both clamps */
>> +if (!(flags & (SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX)))
>> +return true;
>> +
/* Both _UCLAMP_RESET and _UCLAMP_MIN flags are set: reset only min */
>> +if ((flags & SCHED_FLAG_UTIL_CLAMP_MIN) && clamp_id == UCLAMP_MIN)
>> +return true;
>> +

/* Both _UCLAMP_RESET and _UCLAMP_MAX flags are set: reset only max */
>> +if ((flags & SCHED_FLAG_UTIL_CLAMP_MAX) && clamp_id == UCLAMP_MAX)
>> +return true;

Since the evaluation ordering is important, should we not _always_ use a
READ_ONCE() for all the flags accesses above, to ensure the ordering is
preserved?

>> +
>> +return false;
>> +}
>> +
>>  static void __setscheduler_uclamp(struct task_struct *p,
>>const struct sched_attr *attr)
>>  {
>> @@ -1449,24 +1466,30 @@ static void __setscheduler_uclamp(struct task_struct 
>> *p,
>>   */

Perhaps we should update the comment above this loop with something
like:

/*
 * Reset to default clamps on forced _UCLAMP_RESET (always) and
 * for tasks without a task-specific value (on scheduling class change).
 */
>>  for_each_clamp_id(clamp_id) {
>>  struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
>> +unsigned int value;
>>  
>>  /* Keep using defined clamps across class changes */
>> -if (uc_se->user_defined)
>> +if (!uclamp_reset(clamp_id, attr->sched_flags) &&
>> +uc_se->user_defined) {
>>  continue;
>> +}

I think we miss resetting the user_defined flag here.

What about replacing the above chunk with:

if (uclamp_reset(clamp_id, attr->sched_flags))
uc_se->user_defined = false;
if (uc_se->user_defined)
continue;

?
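
For reference, with this change and the value selection from Dietmar's
diff, the whole loop would look something like (untested sketch):

---8<---
	for_each_clamp_id(clamp_id) {
		struct uclamp_se *uc_se = &p->uclamp_req[clamp_id];
		unsigned int value;

		/* A forced _UCLAMP_RESET also clears the user request */
		if (uclamp_reset(clamp_id, attr->sched_flags))
			uc_se->user_defined = false;

		/* Keep using defined clamps across class changes */
		if (uc_se->user_defined)
			continue;

		/*
		 * RT by default have a 100% boost value that could be
		 * modified at runtime.
		 */
		if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
			value = sysctl_sched_uclamp_util_min_rt_default;
		else
			value = uclamp_none(clamp_id);

		uclamp_se_set(uc_se, value, false);
	}
---8<---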


>>  
>>  /*
>>   * RT by default have a 100% boost value that could be modified
>>   * at runtime.
>>   */
>>  if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
>> -__uclamp_update_util_min_rt_default(p);
>> +value = sysctl_sched_uclamp_util_min_rt_default;

By removing this usage of __uclamp_update_util_min_rt_default(p),
the only other usage remaining is the call from:
   uclamp_update_util_min_rt_default().

What about an additional cleanup by in-lining the only surviving usage?


>>  else
>> -uclamp_se_set(uc_se, uclamp_none(clamp_id), false);
>> +value = uclamp_none(clamp_id);
>>  
>> +uclamp_se_set(uc_se, value, false);

Re: [PATCH v3 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-28 Thread Patrick Bellasi


On Wed, Oct 28, 2020 at 12:39:43 +0100, Qais Yousef  
wrote...

> On 10/28/20 11:11, Patrick Bellasi wrote:
>> >>  
>> >>   /*
>> >>* RT by default have a 100% boost value that could be modified
>> >>* at runtime.
>> >>*/
>> >>   if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
>> >> - __uclamp_update_util_min_rt_default(p);
>> >> + value = sysctl_sched_uclamp_util_min_rt_default;
>> 
>> By removing this usage of __uclamp_update_util_min_rt_default(p),
>> the only other usage remaining is the call from:
>>    uclamp_update_util_min_rt_default().
>> 
>> What about an additional cleanup by in-lining the only surviving usage?
>
> This is not a cleanup IMO. There is special rule about updating that are
> encoded and documented in this helper function. Namely:
>
>   * p->pi_lock must be held.
>   * p->uclamp_req[].user_defined must be false.

Both these conditions are satisfied in the above call site:
 - user_defined is tested just 4 lines above
 - pi_lock is taken by the caller, i.e. __sched_setscheduler()
Thus, there is no need to test them twice.

Moreover, the same granted pi_lock you check in
__uclamp_update_util_min_rt_default() is not checked at all in the rest
of __setscheduler_uclamp().

Thus, perhaps we should have just avoided adding
__uclamp_update_util_min_rt_default() from the beginning and:
 - have all its logic in the _only_ place where it's required
 - added the lockdep_assert_held() in __setscheduler_uclamp()

That's why I consider this a very good cleanup opportunity.
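
To make the proposal concrete, a rough sketch of what I mean (just
illustrative, the loop body being the one from Dietmar's diff with the
RT default value selection inlined):

---8<---
static void __setscheduler_uclamp(struct task_struct *p,
				  const struct sched_attr *attr)
{
	enum uclamp_id clamp_id;

	/*
	 * Covers all the task-specific clamp updates below, including
	 * the RT default min refresh, which requires p->pi_lock.
	 */
	lockdep_assert_held(&p->pi_lock);

	for_each_clamp_id(clamp_id) {
		/* loop body as per Dietmar's diff, RT default inlined */
	}

	/* ... */
}
---8<---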

> I don't see open coding helps but rather makes the code harder to read and
> prone to introduce bugs if anything gets reshuffled in the future.

It's not open coding IMHO, it's just adding the code that's required.



Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-14 Thread Patrick Bellasi


On Tue, Oct 13, 2020 at 22:25:48 +0200, Dietmar Eggemann 
 wrote...

Hi Dietmar,

> Hi Yun,
>
> On 12/10/2020 18:31, Yun Hsiang wrote:
>> If the user wants to stop controlling uclamp and let the task inherit
>> the value from the group, we need a method to reset.
>> 
>> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via
>> sched_setattr syscall.
>
> before we decide on how to implement the 'uclamp user_defined reset'
> feature, could we come back to your use case in
> https://lkml.kernel.org/r/20201002053812.GA176142@ubuntu ?
>
> Lets just consider uclamp min for now. We have:
>
> (1) system-wide:
>
> # cat /proc/sys/kernel/sched_util_clamp_min
>
> 1024
>
> (2) tg (hierarchy) with top-app's cpu.uclamp.min to ~200 (20% of 1024):
>
> # cat /sys/fs/cgroup/cpu/top-app/cpu.uclamp.min
> 20
>
> (3) and 2 cfs tasks A and B in top-app:
>
> # cat /sys/fs/cgroup/cpu/top-app/tasks
>
> pid_A
> pid_B
>
> Then you set A and B's uclamp min to 100. A and B are now user_defined.
> A and B's effective uclamp min value is 100.
>
> Since the task uclamp min values (3) are less than (1) and (2), their
> uclamp min value is not affected by (1) or (2).
>
> If A doesn't want to control itself anymore, it can set its uclamp min
> to e.g. 300. Now A's effective uclamp min value is ~200, i.e. controlled
> by (2), the one of B stays 100.
>
> So the policy is:
>
> (a) If the user_defined task wants to control it's uclamp, use task
> uclamp value less than the tg (hierarchy) (and the system-wide)
> value.
>
> (b) If the user_defined task doesn't want to control it's uclamp
> anymore, use a uclamp value greater than or equal the tg (hierarchy)
> (and the system-wide) value.
>
> So where exactly is the use case which would require a 'uclamp
> user_defined reset' functionality?

I'm not sure what specific use-case Yun is after, but I have at least
one in mind.

Let's say a task does not need any boost at all, independently of
the cgroup it's configured to run in. We can go on and set its
task-specific value to util_min=0.

In this case, when the task is running alone on a CPU, it will always
get the minimum OPP, independently of its utilization.

Now, after a while (e.g. some special event happens) we want to relax
this constraint and allow the task to run:
  1. at whatever OPP is required by its utilization
  2. with any additional boost possibly enforced by its cgroup

Right now we have only quite cumbersome or hackish solutions:
  a) go check the current cgroup util_min value and set for the task
  something higher than that
  b) set task::util_min=1024 thus asking for the maximum possible boost

Solution a) is more code for userspace and it's also racy. Solution b)
is misleading since the task does not really want to run at 1024.
It's also potentially overkill in case the task gets moved to the root
group, which is normally unbounded, and thus the task would always be
executed at the max OPP for no specific reason.

A simple _UCLAMP_RESET flag will allow user-space to easily switch a
task back to the default behavior (follow utilization or recommended
boosts), which is what a task usually gets when it does not opt in to
uclamp.

Looking forward to see if Yun has an even more specific use-case.



Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-13 Thread Patrick Bellasi


On Tue, Oct 13, 2020 at 15:32:46 +0200, Qais Yousef  
wrote...

> On 10/13/20 13:46, Patrick Bellasi wrote:
>> > So IMO you just need a single SCHED_FLAG_UTIL_CLAMP_RESET that if set in 
>> > the
>> > attr, you just execute that loop in __setscheduler_uclamp() + reset
>> > uc_se->user_defined.
>> >
>> > It should be invalid to pass the SCHED_FLAG_UTIL_CLAMP_RESET with
>> > SCHED_FLAG_UTIL_CLAMP_MIN/MAX. Both have contradictory meaning IMO.
>> > If user passes both we should return an EINVAL error.
>> 
>> Passing in  _CLAMP_RESET|_CLAMP_MIN will mean reset the min value while
>> keeping the max at whatever it is. I think there could be cases where
>> this support could come in handy.
>
> I am not convinced personally. I'm anxious about what this fine grained 
> control
> means and how it should be used. I think less is more in this case and we can
> always relax the restriction (appropriately) later if it's *really* required.
>
> Particularly the fact that this user_defined is per uclamp_se and that it
> affects the cgroup behavior is implementation details this API shouldn't rely
> on.

The user_defined flag is an implementation detail: true, but from the
beginning uclamp has _always_ allowed a task to set only one of its
clamp values.

That's why we have UTIL_CLAMP_{MIN,MAX} as separate flags and all the
logic in place to set only one of the two.


> A generic RESET my uclamp settings makes more sense for me as a long term
> API to maintain.
>
> Actually maybe we should even go for a more explicit
> SCHED_FLAG_UTIL_CLAMP_INHERIT_CGROUP flag instead. If we decide to abandon the
> support for this feature in the future, at least we can make it return an 
> error
> without affecting other functionality because of the implicit nature of
> SCHED_FLAG_UTIL_CLAMP_RESET means inherit cgroup value too.

That's not true, and what you want to expose is an even worse
implementation detail.

A task without a task specific clamp _always_ inherits the system
defaults. Resetting a task specific clamp already makes sense also
_without_ cgroups. It means: just do whatever the system allows you to
do.

Only if you are running with CGroups enabled, and the task happens to
_not_ be in the root group, does the "CGroups inheritance" happen.
But that's exactly an internal detail a task should not care about.


> That being said, I am not strongly against the fine grained approach if that's
> what Yun wants now or what you both prefer.

It's not a fine grained approach, it's just adding a reset mechanism for
what uclamp already allows to do: setting min and max clamps
independently.

Regarding use cases, I also believe we have many more use cases of tasks
interested in setting/resetting just one clamp than tasks interested in
"fine grain" controlling both clamps at the same time.


> I just think the name of the flag needs to change to be more explicit
> too then.

I don't agree on that and, again, I see many more fine-grained details
and much more internals exposure in what you propose compared to a
single generic _RESET flag.

> It'd be good to hear what others think.

I agree on that ;)



Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-13 Thread Patrick Bellasi


On Tue, Oct 13, 2020 at 12:29:51 +0200, Qais Yousef  
wrote...

> On 10/13/20 10:21, Patrick Bellasi wrote:
>> 

[...]

>> > +#define SCHED_FLAG_UTIL_CLAMP_RESET (SCHED_FLAG_UTIL_CLAMP_RESET_MIN | \
>> > +  SCHED_FLAG_UTIL_CLAMP_RESET_MAX)
>> > +
>> >  #define SCHED_FLAG_ALL(SCHED_FLAG_RESET_ON_FORK   | \
>> > SCHED_FLAG_RECLAIM | \
>> > SCHED_FLAG_DL_OVERRUN  | \
>> > SCHED_FLAG_KEEP_ALL| \
>> > -   SCHED_FLAG_UTIL_CLAMP)
>> > +   SCHED_FLAG_UTIL_CLAMP  | \
>> > +   SCHED_FLAG_UTIL_CLAMP_RESET)
>> 
>> 
>> ... and use it in conjunction with the existing _CLAMP_{MIN,MAX} to know
>> which clamp should be reset?
>
> I think the RESET should restore *both* MIN and MAX and reset the user_defined
> flag. Since the latter is the main purpose of this interface, I don't think 
> you
> can reset the user_defined flag without resetting both MIN and MAX to
> uclamp_none[UCLAMP_MIN/MAX].

We can certainly set one clamp and not the other, and indeed the
user_defined flag is per-clamp_id, thus we can reset one clamp while
still keeping the other one user-defined.


>> >  #endif /* _UAPI_LINUX_SCHED_H */
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 9a2fbf98fd6f..ed4cb412dde7 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -1207,15 +1207,22 @@ static void __setscheduler_uclamp(struct 
>> > task_struct *p,
>> >uclamp_se_set(uc_se, clamp_value, false);
>> >}
>> >  
>> > -  if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>> > +  if (likely(!(attr->sched_flags &
>> > +  (SCHED_FLAG_UTIL_CLAMP | SCHED_FLAG_UTIL_CLAMP_RESET))))
>> >return;
>> 
>> This check will not be changed, while we will have to add a bypass in
>> uclamp_validate().
>> 
>> >  
>> > -  if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>> > +  if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET_MIN) {
>> > +  uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
>> > +0, false);
>> > +  } else if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
>> >  uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
>> >  attr->sched_util_min, true);
>> >}
>> >
>> 
>> These checks also will have to be updated to check _RESET and
>> _{MIN,MAX} combinations.
>> 
>> Bonus point would be to be possible to pass in just the _RESET flag if
>> we want to reset both clamps. IOW, passing in _RESET only should be
>> consumed as if we passed in _RESET|_MIN|_MAX.
>> 
>> Caveat, RT tasks have got a special 'reset value' for _MIN.
>> We should check and ensure __uclamp_update_util_min_rt_default() is
>> properly called for those tasks, which likely will require some
>> additional refactoring :/
>
> Hmm I am probably missing something. But if the SCHED_FLAG_UTIL_CLAMP_RESET is
> set, just reset uc_se->user_defined in the loop in __setscheduler_uclamp().
> This should take care of doing the reset properly then. Including for
> RT tasks.

Yes and no. Yes because in principle we can just reset the flag for a
clamp_id without updating the request values, as it is done by the
snippets above, and the internals should work.

However, we will end up reporting the old values when reading from
user-space. We should better check all those reporting code paths or...
just update the requested values as Yun is proposing above.

I like better Yun approach so that we keep internal data structures
aligned with features.

> So IMO you just need a single SCHED_FLAG_UTIL_CLAMP_RESET that if set in the
> attr, you just execute that loop in __setscheduler_uclamp() + reset
> uc_se->user_defined.
>
> It should be invalid to pass the SCHED_FLAG_UTIL_CLAMP_RESET with
> SCHED_FLAG_UTIL_CLAMP_MIN/MAX. Both have contradictory meaning IMO.
> If user passes both we should return an EINVAL error.

Passing in  _CLAMP_RESET|_CLAMP_MIN will mean reset the min value while
keeping the max at whatever it is. I think there could be cases where
this support could come in handy.

However, as in my previous email, by passing in only _CLAMP_RESET, we
should go and reset both.



Re: [PATCH v2] sched/features: Fix !CONFIG_JUMP_LABEL case

2020-10-13 Thread Patrick Bellasi


On Tue, Oct 13, 2020 at 07:31:14 +0200, Juri Lelli  
wrote...

> Commit 765cc3a4b224e ("sched/core: Optimize sched_feat() for
> !CONFIG_SCHED_DEBUG builds") made sched features static for
> !CONFIG_SCHED_DEBUG configurations, but overlooked the CONFIG_
> SCHED_DEBUG enabled and !CONFIG_JUMP_LABEL cases. For the latter echoing
> changes to /sys/kernel/debug/sched_features has the nasty effect of
> effectively changing what sched_features reports, but without actually
> changing the scheduler behaviour (since different translation units get
> different sysctl_sched_features).

Oops, yes, I think I missed properly checking that config :/
Good spot!

> Fix CONFIG_SCHED_DEBUG and !CONFIG_JUMP_LABEL configurations by properly
> restructuring ifdefs.
>
> Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for 
> !CONFIG_SCHED_DEBUG builds")
> Co-developed-by: Daniel Bristot de Oliveira 
> Signed-off-by: Daniel Bristot de Oliveira 
> Signed-off-by: Juri Lelli 

(did you get some wrong formatting for the changelog above?)

> ---
> v1->v2
>  - use CONFIG_JUMP_LABEL (and not the old HAVE_JUMP_LABEL) [Valentin]
> ---
>  kernel/sched/core.c  |  2 +-
>  kernel/sched/sched.h | 13 ++---
>  2 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3dc415f58bd7..a7949e3ed7e7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -44,7 +44,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
>  
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
>  
> -#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
> +#ifdef CONFIG_SCHED_DEBUG
>  /*
>   * Debugging: various feature bits
>   *
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 28709f6b0975..8d1ca65db3b0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1629,7 +1629,7 @@ enum {
>  
>  #undef SCHED_FEAT
>  
> -#if defined(CONFIG_SCHED_DEBUG) && defined(CONFIG_JUMP_LABEL)
> +#ifdef CONFIG_SCHED_DEBUG
>  
>  /*
>   * To support run-time toggling of sched features, all the translation units
> @@ -1637,6 +1637,7 @@ enum {
>   */
>  extern const_debug unsigned int sysctl_sched_features;
>  
> +#ifdef CONFIG_JUMP_LABEL
>  #define SCHED_FEAT(name, enabled)\
>  static __always_inline bool static_branch_##name(struct static_key *key) \
>  {\
> @@ -1649,7 +1650,13 @@ static __always_inline bool 
> static_branch_##name(struct static_key *key) \
>  extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
>  #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
>  
> -#else /* !(SCHED_DEBUG && CONFIG_JUMP_LABEL) */
> +#else /* !CONFIG_JUMP_LABEL */
> +
> +#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
> +
> +#endif /* CONFIG_JUMP_LABEL */
> +
> +#else /* !SCHED_DEBUG */
>  
>  /*
>   * Each translation unit has its own copy of sysctl_sched_features to allow
> @@ -1665,7 +1672,7 @@ static const_debug __maybe_unused unsigned int 
> sysctl_sched_features =
>  
>  #define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
>  
> -#endif /* SCHED_DEBUG && CONFIG_JUMP_LABEL */
> +#endif /* SCHED_DEBUG */
>  
>  extern struct static_key_false sched_numa_balancing;
>  extern struct static_key_false sched_schedstats;



Re: [PATCH v2 1/1] sched/uclamp: add SCHED_FLAG_UTIL_CLAMP_RESET flag to reset uclamp

2020-10-13 Thread Patrick Bellasi


Hi Yun,
thanks for sharing this new implementation.

On Mon, Oct 12, 2020 at 18:31:40 +0200, Yun Hsiang  
wrote...

> If the user wants to stop controlling uclamp and let the task inherit
> the value from the group, we need a method to reset.
>
> Add SCHED_FLAG_UTIL_CLAMP_RESET flag to allow the user to reset uclamp via
> sched_setattr syscall.

Looks like what you say here is not what you code, since you actually
add two new flags, _RESET_{MIN,MAX}.

I think we should instead favor a simple user-space interface, where
just one additional _RESET flag should be good enough.

> Signed-off-by: Yun Hsiang 
> ---
>  include/uapi/linux/sched.h |  9 -
>  kernel/sched/core.c| 16 
>  2 files changed, 20 insertions(+), 5 deletions(-)
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 3bac0a8ceab2..a12e88c362d8 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -132,6 +132,9 @@ struct clone_args {
>  #define SCHED_FLAG_KEEP_PARAMS   0x10
>  #define SCHED_FLAG_UTIL_CLAMP_MIN0x20
>  #define SCHED_FLAG_UTIL_CLAMP_MAX0x40
> +#define SCHED_FLAG_UTIL_CLAMP_RESET_MIN  0x80
> +#define SCHED_FLAG_UTIL_CLAMP_RESET_MAX  0x100

What about adding just SCHED_FLAG_UTIL_CLAMP_RESET 0x80 ...

>  
>  #define SCHED_FLAG_KEEP_ALL  (SCHED_FLAG_KEEP_POLICY | \
>SCHED_FLAG_KEEP_PARAMS)
> @@ -139,10 +142,14 @@ struct clone_args {
>  #define SCHED_FLAG_UTIL_CLAMP(SCHED_FLAG_UTIL_CLAMP_MIN | \
>SCHED_FLAG_UTIL_CLAMP_MAX)

... making it part of SCHED_FLAG_UTIL_CLAMP ...

>  
> +#define SCHED_FLAG_UTIL_CLAMP_RESET (SCHED_FLAG_UTIL_CLAMP_RESET_MIN | \
> + SCHED_FLAG_UTIL_CLAMP_RESET_MAX)
> +
>  #define SCHED_FLAG_ALL   (SCHED_FLAG_RESET_ON_FORK   | \
>SCHED_FLAG_RECLAIM | \
>SCHED_FLAG_DL_OVERRUN  | \
>SCHED_FLAG_KEEP_ALL| \
> -  SCHED_FLAG_UTIL_CLAMP)
> +  SCHED_FLAG_UTIL_CLAMP  | \
> +  SCHED_FLAG_UTIL_CLAMP_RESET)


... and use it in conjunction with the existing _CLAMP_{MIN,MAX} to know
which clamp should be reset?


>  #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..ed4cb412dde7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1207,15 +1207,22 @@ static void __setscheduler_uclamp(struct task_struct 
> *p,
>   uclamp_se_set(uc_se, clamp_value, false);
>   }
>  
> - if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> + if (likely(!(attr->sched_flags &
> + (SCHED_FLAG_UTIL_CLAMP | SCHED_FLAG_UTIL_CLAMP_RESET))))
>   return;

This check will not be changed, while we will have to add a bypass in
uclamp_validate().

>  
> - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET_MIN) {
> + uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> +   0, false);
> + } else if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> 	uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> attr->sched_util_min, true);
>   }
>

These checks also will have to be updated to check _RESET and
_{MIN,MAX} combinations.

A bonus point would be to make it possible to pass in just the _RESET flag if
we want to reset both clamps. IOW, passing in _RESET only should be
consumed as if we passed in _RESET|_MIN|_MAX.

Caveat, RT tasks have got a special 'reset value' for _MIN.
We should check and ensure __uclamp_update_util_min_rt_default() is
properly called for those tasks, which likely will require some
additional refactoring :/

> - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_RESET_MAX) {
> + uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> +   1024, false);
> + } else if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> 	uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> attr->sched_util_max, true);
>   }
> @@ -4901,7 +4908,8 @@ static int __sched_setscheduler(struct task_struct *p,
>   goto change;
>   if (dl_policy(policy) && dl_param_changed(p, attr))
>   goto change;
> - if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
> + if (attr->sched_flags &
> + (SCHED_FLAG_UTIL_CLAMP | 
> SCHED_FLAG_UTIL_CLAMP_RESET))
>   goto change;
>  
>   p->sched_reset_on_fork = reset_on_fork;



Re: [PATCH 1/1] sched/uclamp: release per-task uclamp control if user set to default value

2020-10-05 Thread Patrick Bellasi


Hi Yun, Dietmar,

On Mon, Oct 05, 2020 at 14:38:18 +0200, Dietmar Eggemann 
 wrote...

> + Patrick Bellasi 
> + Qais Yousef 
>
> On 02.10.20 07:38, Yun Hsiang wrote:
>> On Wed, Sep 30, 2020 at 03:12:51PM +0200, Dietmar Eggemann wrote:
>
> [...]
>
>>> On 28/09/2020 10:26, Yun Hsiang wrote:
>>>> If the user wants to release the util clamp and let cgroup to control it,
>>>> we need a method to reset.
>>>>
>>>> So if the user set the task uclamp to the default value (0 for UCLAMP_MIN
>>>> and 1024 for UCLAMP_MAX), reset the user_defined flag to release control.
>>>>
>>>> Signed-off-by: Yun Hsiang 
>>>
>>> could you explain with a little bit more detail why you would need this
>>> feature?
>>>
>>> Currently we assume that once the per-task uclamp (user-defined) values
>>> are set, you could only change the effective uclamp values of this task
>>> by (1) moving it into another taskgroup or (2) changing the system
>>> default uclamp values.
>>>
>> 
>> Assume a module that controls group (such as top-app in android) uclamp and
>> task A in the group.
>> Once task A set uclamp, it will not be affected by the group setting.

That's not true, and Dietmar's example below is correct.

We call it uclamp since the values are clamps, which are always
aggregated somehow at different levels. IOW, a task never has a fully
free choice of the final effective value.

> This depends on the uclamp values of A and /TG (the task group).
>
> Both uclamp values are max aggregated (max aggregation between
> system-wide, taskgroup and task values for UCLAMP_MIN and UCLAMP_MAX).
>
> (1) system-wide: /proc/sys/kernel/sched_util_clamp_[min,max]
>
> (2) taskgroup (hierarchy): /sys/fs/cgroup/cpu/TG/cpu.uclamp.[min,max]
>
> (3) task A:
>
> Example: [uclamp_min, uclamp_max]
>
> (1)  [1024, 1024]
>
> (2)  [25.00 (256), 75.00 (768)]
>
> (3a) [128, 512] : both values are not affected by /TG
>
> (3b) [384, 896] : both values are affected by /TG
>
>
>> If task A doesn't want to control itself anymore,

To be precise, in this case we should say: "if a task doesn't want to
give up anymore".

Indeed, the base idea is that a task can always and only
"ask for less". What it really gets (effective value) is the minimum
among its request, what the group allows and the system-wide value on
top, i.e. ref [4,5]:

   eff_value = MIN(system-wide, MIN(tg, task))
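
Or, to make that concrete with a trivial user-space style sketch, which
is just an illustration of the formula above and not the actual kernel
code:

---8<---
/* Effective clamp: the task can only "ask for less" */
static unsigned int uclamp_eff_sketch(unsigned int task_req,
				      unsigned int tg_value,
				      unsigned int sys_value)
{
	unsigned int eff = task_req < tg_value ? task_req : tg_value;

	return eff < sys_value ? eff : sys_value;
}

/*
 * E.g. with sys=1024, tg=256 (25%) and a task asking for 384, the
 * effective value is 256: the TG caps the task's request.
 */
---8<---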


>> it can not go back to the initial state to let the module(group) control.
>
> In case A changes its values e.g. from 3a to 3b it will go back to be
> controlled by /TG again (like it was when it had no user defined
> values).

True, however it's also true that strictly speaking once a task has
defined a per-task value, we will always aggregate/clamp that value wrt
the TG and system-wide values.

>> But the other tasks in the group will be affected by the group.

This is not clear to me.

All tasks in a group will be treated independently. All the tasks are
subject to the same _individual_ aggregation/clamping policy.

> Yes, in case they have no user defined values or have values greater
> than the one of /TG.
>
>> The policy might be
>> 1) if the task wants to control it's uclamp, use task uclamp value

Again, worth stressing: a task _never_ has full control of its clamps.
Precisely, a task only has the freedom to ask for less than what is
enforced at TG/system level.

IOW, task-specific uclamp values support only a "nice" policy, where a
task can only give up something. Either be _less_ boosted or _more_
capped, which in both cases corresponds to asking for _less_ CPU
bandwidth.

>> (but under group uclamp constraint)
>
> That would be example 3a.
>
>> 2) if the task doesn't want to control it's uclamp, use group uclamp value.
>
> That would be example 3b.
>
>> If the policy is proper, we need a reset method for per-task uclamp.
>> 
>>>> ---
>>>>  kernel/sched/core.c | 7 +--
>>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 9a2fbf98fd6f..fa63d70d783a 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -1187,6 +1187,7 @@ static void __setscheduler_uclamp(struct task_struct 
>>>> *p,
>>>>  const struct sched_attr *attr)
>>>>  {
>>>>enum uclamp_id clamp_id;
>>>> +  bool user_defined;
>>>>  
>>>>/*
>>>> * On scheduling class change, reset to defaul

Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus

2020-07-16 Thread Patrick Bellasi


Hi Vincent,

On Mon, Jul 13, 2020 at 14:59:51 +0200, Vincent Guittot 
 wrote...

> On Fri, 10 Jul 2020 at 21:59, Patrick Bellasi  
> wrote:
>> On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot 
>>  wrote...
>>
>> [...]
>>
>> >> > C) Existing control paths
>> >>
>> >> Assuming:
>> >>
>> >>  C: CFS task currently running on CPUx
>> >>  W: CFS task waking up on the same CPUx
>> >>
>> >> And considering the overall simplified workflow:
>> >>
>> >> core::try_to_wake_up()
>> >>
>> >>   // 1) Select on which CPU W will run
>> >>   core::select_task_rq()
>> >> fair::select_task_rq_fair()
>> >>
>> >>   // 2) Enqueue W on the selected CPU
>> >>   core::ttwu_queue()
>> >> core::ttwu_do_activate()
>> >>   core::activate_task()
>> >> core::enqueue_task()
>> >>   fair::enqueue_task_fair()
>> >> fair::enqueue_entity()
>> >>
>> >>   // 3) Set W's vruntime bonus
>> >>   fair::place_entity()
>> >> se->vruntime = ...
>> >>
>> >>   // 4) Check if C can be preempted by W
>> >>   core::ttwu_do_wakeup()
>> >> core::check_preempt_curr()
>> >>   fair::check_preempt_curr()
>> >> fair::check_preempt_wakeup(curr, se)
>> >>   fair::wakeup_preempt_entity(curr, se)
>> >> vdiff = curr.vruntime - se.vruntime
>> >> return vdiff > wakeup_gran(se)
>> >>
>> >> We see that W preempts C iff:
>> >>
>> >>vdiff > wakeup_gran(se)
>> >>
>> >> Since:
>> >>
>> >> enqueue_entity(cfs_rq, se, flags)
>> >>   place_entity(cfs_rq, se, initial=0)
>> >> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
>> >> vruntime = cfs_rq->min_vruntime - thresh
>> >> se->vruntime = max_vruntime(se->vruntime, vruntime)
>> >>
>> >> a waking task's W.vruntime can get a "vruntime bonus" up to:
>>  - 1/2 scheduler latency (w/  GENTLE_FAIR_SLEEPERS)
>>  - 1   scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
>> >>
>> >>
>> >> > D) Desired behavior
>> >>
>> >> The "vruntime bonus" (thresh) computed in place_entity() should have a
>> >> per-task definition, which defaults to the current implementation.
>> >>
>> >> A bigger vruntime bonus can be configured for latency sensitive tasks.
>> >> A smaller vruntime bonus can be configured for latency tolerant tasks.
>> >
>> > I'm not sure that adjusting what you called "vruntime bonus" is the
>> > right way to provide some latency because it doesn't only provide a
>> > wakeup latency bonus but also provides a runtime bonus.
>>
>> True, however that's what we already do, but _just_ in a hard-coded way.
>>
>> A task waking up from sleep gets a 1/2 sched_latency bonus, or a full one
>> w/o GENTLE_FAIR_SLEEPERS. Point is that not all tasks are the same: for some this
>
> From a nice and fair PoV, it's not a bonus but the opposite. In fact
> it's limiting how much credit, the task will keep from its sleep time.

I agree about 'limiting a credit'; thus, being a _credit_, IMO it is a
bonus, and the limiting happens only with GENTLE_FAIR_SLEEPERS.

So, in general, tasks _waking up_ get a (limited) credit, i.e. a
vruntime bonus.

From the FAIR PoV it is even more of a bonus since, AFAIU, all the
machinery is designed to give some vruntime bonus to _non_ CPU-bound /
batch tasks.

That's done to compensate for them being suspended and thus not having a
chance to consume all their fair CPU share in the previous activation.

> Also keep in mind that this value is clamped by its vruntime so a task
> can't get bonus

True, at wakeup we clamp it with the SE's (normalized) vruntime.

But still since we do:

   se->vruntime = max(se->vruntime, cfs_rq->min_vruntime - VRUNTIME_BONUS)
                      \--- A ---/   \------------- B -------------/

The bigger B is, the more likely we are to "penalize" the SE vruntime.


>> bonus can be not really required, for others too small.
>>
>> Regarding the 'runtime bonus' I think it's kind of unavoidable,
>> if we really want a latency sensitive task being scheduled
>> before the others.
>
> That's where I disa

Re: [SchedulerWakeupLatency] Per-task vruntime wakeup bonus

2020-07-10 Thread Patrick Bellasi


On Fri, Jul 10, 2020 at 15:21:48 +0200, Vincent Guittot 
 wrote...

> Hi Patrick,

Hi Vincent,

[...]

>> > C) Existing control paths
>>
>> Assuming:
>>
>>  C: CFS task currently running on CPUx
>>  W: CFS task waking up on the same CPUx
>>
>> And considering the overall simplified workflow:
>>
>> core::try_to_wake_up()
>>
>>   // 1) Select on which CPU W will run
>>   core::select_task_rq()
>> fair::select_task_rq_fair()
>>
>>   // 2) Enqueue W on the selected CPU
>>   core::ttwu_queue()
>> core::ttwu_do_activate()
>>   core::activate_task()
>> core::enqueue_task()
>>   fair::enqueue_task_fair()
>> fair::enqueue_entity()
>>
>>   // 3) Set W's vruntime bonus
>>   fair::place_entity()
>> se->vruntime = ...
>>
>>   // 4) Check if C can be preempted by W
>>   core::ttwu_do_wakeup()
>> core::check_preempt_curr()
>>   fair::check_preempt_curr()
>> fair::check_preempt_wakeup(curr, se)
>>   fair::wakeup_preempt_entity(curr, se)
>> vdiff = curr.vruntime - se.vruntime
>> return vdiff > wakeup_gran(se)
>>
>> We see that W preempts C iff:
>>
>>vdiff > wakeup_gran(se)
>>
>> Since:
>>
>> enqueue_entity(cfs_rq, se, flags)
>>   place_entity(cfs_rq, se, initial=0)
>> thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
>> vruntime = cfs_rq->min_vruntime - thresh
>> se->vruntime = max_vruntime(se->vruntime, vruntime)
>>
>> a waking task's W.vruntime can get a "vruntime bonus" up to:
>>  - 1/2 scheduler latency (w/  GENTLE_FAIR_SLEEPERS)
>>  - 1   scheduler latency (w/o GENTLE_FAIR_SLEEPERS)
>>
>>
>> > D) Desired behavior
>>
>> The "vruntime bonus" (thresh) computed in place_entity() should have a
>> per-task definition, which defaults to the current implementation.
>>
>> A bigger vruntime bonus can be configured for latency sensitive tasks.
>> A smaller vruntime bonus can be configured for latency tolerant tasks.
>
> I'm not sure that adjusting what you called "vruntime bonus" is the
> right way to provide some latency because it doesn't only provide a
> wakeup latency bonus but also provides a runtime bonus.

True, however that's what we already do, but _just_ in a hard-coded way.

A task waking up from sleep gets a 1/2 sched_latency bonus, or a full one
w/o GENTLE_FAIR_SLEEPERS. Point is that not all tasks are the same: for some
this bonus may not really be required, for others it may be too small.

Regarding the 'runtime bonus' I think it's kind of unavoidable,
if we really want a latency sensitive task being scheduled
before the others.

> It means that one can impact the running time by playing with
> latency_nice whereas the goal is only to impact the wakeup latency.

Well, I'm not sure how much you can really gain, considering that
this bonus is given only at wakeup time: the task would have to keep
suspending itself. It would get a better result by just asking for a
lower nice value.

Now, asking for a reduced nice value is RLIMIT_NICE and CAP_SYS_NICE
protected. The same will be for latency_nice.

Moreover, considering that by default tasks will get what we already
have hard-coded, or less of a bonus, I don't see how easy it would be to
abuse.

On the contrary, we can introduce a very useful knob to allow certain
tasks to voluntarily demote themselves and avoid annoying a currently
running task.

> Instead, it should weight the decision in wakeup_preempt_entity() and
> wakeup_gran()

In those functions we already take the task prio into consideration
(ref details at the end of this message).

Lower nice value tasks have more chances to preempt current since they
will have a smaller wakeup_gran, indeed:

we preempt  IFF   vdiff(se, current) > wakeup_gran(se)
                  \------- A -------/   \----- B ----/

While the task's prio affects B, in this proposal latency_nice works on the
A side of the equation above by making it a bit more task specific.

That said, it's true that both latency_nice and prio will ultimately
play a role on how much CPU bandwidth a task gets.

Question is: do we deem it useful to have an additional knob working on
the A side of the equation above?

Best,
Patrick



---8<--8<--8<--8<--8<--8<--8<--8<--8<--8<---

TL;DR: The nice value already affects the wakeup latency

As reported above:

   check_preempt_wakeup(rq, p, wake_flags)
 wakeup_preempt_entity(curr, se)
(d)vdiff = curr.vruntime - se.vruntime
(e)return vdiff > wakeup_gran(se)

we see that W preempts C iff:

   vdiff > wakeup_gran(se)

But:

   wakeup_gran(se)
 calc_delta_fair(delta=sysctl_sched_wakeup_granularity, se)
       __calc_delta(delta_exec=delta, weight=NICE_0_LOAD, lw=&se->load)
(c)  wakeup_gran = sched_wakeup_granularity * (NICE_0_LOAD / W.load.weight)

Thus, the wakeup granularity of W depends on:
 - the system-wide configured wakeup 

Re: [PATCH v5 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-30 Thread Patrick Bellasi


On Tue, Jun 30, 2020 at 17:40:34 +0200, Qais Yousef  
wrote...

> Hi Patrick
>
> On 06/30/20 16:55, Patrick Bellasi wrote:
>> 
>> Hi Qais,
>> sorry for commenting on v5 with a v6 already posted, but...
>> ... I cannot keep up with your re-spinning rate ;)
>
> I classified that as a nit really and doesn't affect correctness. We have
> different subjective view on what is better here. I did all the work in the
> past 2 weeks and I think as the author of this patch I have the right to keep
> my preference on subjective matters. I did consider your feedback and didn't
> ignore it and improved the naming and added a comment to make sure there's no
> confusion.
>
> We could nitpick the best name forever, but is it really that important?

Which leans toward confirming the impression I had while reading your
previous response, i.e. you stopped reading at the name change
observation, which would be _just_ nit-picking, although still worthwhile
IMHO.

Instead, I went further and asked you to consider a different approach:
not adding a new kernel symbol to represent a concept already there.

> I really don't see any added value for one approach or another here to start
> a long debate about it.

Then you could have just called out that instead of silently ignoring
the comment/proposal.

> The comments were small enough that I didn't see any controversy that
> warrants holding the patches longer. I agreed with your proposal to use
> uc_se->active and clarified why your other suggestions don't hold.
>
> You pointed that uclamp_is_enabled() confused you; and I responded that I'll
> change the name.

Perhaps I would not be the only one confused by having 'something_enabled()'
referring to 'something_used'.

> Sorry for not being explicit about answering the below, but
> I thought my answer implied that I don't prefer it.

Your answer was about a name change; I don't see the correlation with a
different approach... but it could be just me.

>> >> Thus, perhaps we can just use the same pattern used by the
>> >> sched_numa_balancing static key:
>> >> 
>> >>   $ git grep sched_numa_balancing
>> >>   kernel/sched/core.c:DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
>> >>   kernel/sched/core.c:		static_branch_enable(&sched_numa_balancing);
>> >>   kernel/sched/core.c:		static_branch_disable(&sched_numa_balancing);
>> >>   kernel/sched/core.c:	int state = static_branch_likely(&sched_numa_balancing);
>> >>   kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>> >>   kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>> >>   kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>> >>   kernel/sched/fair.c:	if (static_branch_unlikely(&sched_numa_balancing))
>> >>   kernel/sched/sched.h:extern struct static_key_false sched_numa_balancing;
>> >> 
>> >> IOW: unconditionally define sched_uclamp_used as non static in core.c,
>> >> and use it directly on schedutil too.
>> 
>> So, what about this instead of adding the (renamed) method above?
>
> I am sorry there's no written rule that says one should do it in a specific
> way. And AFAIK both way are implemented in the kernel. I appreciate your
> suggestion but as the person who did all the hard work, I think my preference
> matters here too.

You surely know that sometimes reviewing code can be "hard work" too, so I
would not go down that path at all in this discussion. Quite likely I
have a different "subjective" view on how Open Source development works.

> And actually with my approach when uclamp is not compiled in there's no need 
> to
> define an extra variable; and since uclamp_is_used() is defined as false for
> !CONFIG_UCLAMP_TASK, it'll help with DCE, so less likely to end up with dead
> code that'll never run in the final binary.

Good, this is the simple and small reply I've politely asked for.

Best,
Patrick



Re: [PATCH v5 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-30 Thread Patrick Bellasi


Hi Qais,
sorry for commenting on v5 with a v6 already posted, but...
... I cannot keep up with your re-spinning rate ;)

More importantly, perhaps you missed commenting on one of my previous
points.

Will have a better look at the rest of v6 later today.

Cheers,
Patrick

On Tue, Jun 30, 2020 at 11:46:24 +0200, Qais Yousef  
wrote...
> On 06/30/20 10:11, Patrick Bellasi wrote:
>> On Mon, Jun 29, 2020 at 18:26:33 +0200, Qais Yousef  
>> wrote...

[...]

>> > +
>> > +static inline bool uclamp_is_enabled(void)
>> > +{
>> > +  return static_branch_likely(&sched_uclamp_used);
>> > +}
>> 
>> Looks like here we mix up terms, which can be confusing.
>> AFAIKS, we use:
>> - *_enabled  for the sched class flags (compile time)
>> - *_used     for the user-space opting in (run time)
>
> I wanted to add a comment here.
>
> I can rename it to uclamp_is_used() if you want.

In my previous message I was mostly asking about this:

>> Thus, perhaps we can just use the same pattern used by the
>> sched_numa_balancing static key:
>> 
>>   $ git grep sched_numa_balancing
>>   kernel/sched/core.c:DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
>>   kernel/sched/core.c:		static_branch_enable(&sched_numa_balancing);
>>   kernel/sched/core.c:		static_branch_disable(&sched_numa_balancing);
>>   kernel/sched/core.c:	int state = static_branch_likely(&sched_numa_balancing);
>>   kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>>   kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>>   kernel/sched/fair.c:	if (!static_branch_likely(&sched_numa_balancing))
>>   kernel/sched/fair.c:	if (static_branch_unlikely(&sched_numa_balancing))
>>   kernel/sched/sched.h:extern struct static_key_false sched_numa_balancing;
>> 
>> IOW: unconditionally define sched_uclamp_used as non static in core.c,
>> and use it directly on schedutil too.

So, what about this instead of adding the (renamed) method above?



Re: [PATCH v5 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-30 Thread Patrick Bellasi


Hi Qais,
here are some more 2c from me...

On Mon, Jun 29, 2020 at 18:26:33 +0200, Qais Yousef  
wrote...

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 235b2cae00a0..8d80d6091d86 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -794,6 +794,26 @@ unsigned int sysctl_sched_uclamp_util_max = 
> SCHED_CAPACITY_SCALE;
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> +/*
> + * This static key is used to reduce the uclamp overhead in the fast path. It
> + * primarily disables the call to uclamp_rq_{inc, dec}() in
> + * enqueue/dequeue_task().
> + *
> + * This allows users to continue to enable uclamp in their kernel config with
> + * minimum uclamp overhead in the fast path.
> + *
> + * As soon as userspace modifies any of the uclamp knobs, the static key is
> + * enabled, since we have an actual users that make use of uclamp
> + * functionality.
> + *
> + * The knobs that would enable this static key are:
> + *
> + *   * A task modifying its uclamp value with sched_setattr().
> + *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
> + *   * An admin modifying the cgroup cpu.uclamp.{min, max}

I guess this list can be obtained with a grep or from the git changelog;
moreover, this text will require maintenance.

What about replacing this full comment with something shorter like:

---8<---
  Static key to reduce uclamp overhead in the fast path by disabling
  calls to uclamp_rq_{inc, dec}().
---8<---

> + */
> +DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);
> +
>  /* Integer rounded range for each bucket */
>  #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, 
> UCLAMP_BUCKETS)
>  
> @@ -994,9 +1014,30 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
> struct task_struct *p,
>   lockdep_assert_held(&rq->lock);
>  
>   bucket = &uc_rq->bucket[uc_se->bucket_id];
> - SCHED_WARN_ON(!bucket->tasks);
> +
> + /*
> +  * bucket->tasks could be zero if sched_uclamp_used was enabled while
> +  * the current task was running, hence we could end up with unbalanced
> +  * call to uclamp_rq_dec_id().
> +  *
>> +  * Need to be careful of the following enqueue/dequeue order
> +  * problem too
> +  *
> +  *  enqueue(taskA)
> +  *  // sched_uclamp_used gets enabled
> +  *  enqueue(taskB)
> +  *  dequeue(taskA)
> +  *  // bucket->tasks is now 0
> +  *  dequeue(taskB)
> +  *
> +  * where we could end up with uc_se->active of the task set to true and
> +  * the wrong bucket[uc_se->bucket_id].value.
> +  *
> +  * Hence always make sure we reset things properly.
> +  */
>   if (likely(bucket->tasks))
>   bucket->tasks--;
> +
>   uc_se->active = false;

Better than v4, what about just using this active flag?

---8<---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f360326861e..465a7645713b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -990,6 +990,13 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct 
task_struct *p,
 
	lockdep_assert_held(&rq->lock);
 
+   /*
+    * If a task was already enqueued at uclamp enable time,
+    * nothing has been accounted for it.
+*/
+   if (unlikely(!uc_se->active))
+   return;
+
	bucket = &uc_rq->bucket[uc_se->bucket_id];
SCHED_WARN_ON(!bucket->tasks);
if (likely(bucket->tasks))
---8<---

This will also allow keeping all the refcount checks we have,
e.g. the SCHED_WARN_ON().


>   /*
> @@ -1032,6 +1073,13 @@ static inline void uclamp_rq_inc(struct rq *rq, struct 
> task_struct *p)
>  {
>   enum uclamp_id clamp_id;
>  
> + /*
> +  * Avoid any overhead until uclamp is actually used by the userspace.
> +  * Including the branch if we use static_branch_likely()

I still find this last sentence hard to parse, but perhaps it's just me
still missing breakfast :)

> +  */
> + if (!static_branch_unlikely(&sched_uclamp_used))
> + return;

I'm also still wondering if the optimization is still working when we
have that ! in front.

Had a check at:

   
https://elixir.bootlin.com/linux/latest/source/include/linux/jump_label.h#L399

and AFAIU, it all boils down to cooking up a __branch_check() compiler hint,
and ISTR that those are "anti-patterns"?

That said we do have some usages for this pattern too:

$ git grep '!static_branch_unlikely' | wc -l   36
$ git grep 'static_branch_unlikely' | wc -l   220

?

> +
>   if (unlikely(!p->sched_class->uclamp_enabled))
>   return;
>  

[...]

> +/**
> + * uclamp_rq_util_with - clamp @util with @rq and @p effective uclamp values.
> + * @rq:  The rq to clamp against. Must not be NULL.
> + * @util:The util value to clamp.
> + * @p:   The task to clamp against. Can be NULL if you want to 
> clamp
> + *  

Re: [PATCH v4 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-26 Thread Patrick Bellasi


On Thu, Jun 25, 2020 at 17:43:52 +0200, Qais Yousef  
wrote...

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 235b2cae00a0..e2f1fffa013c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -794,6 +794,25 @@ unsigned int sysctl_sched_uclamp_util_max = 
> SCHED_CAPACITY_SCALE;
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> +/*
> + * This static key is used to reduce the uclamp overhead in the fast path. It
> + * only disables the call to uclamp_rq_{inc, dec}() in 
> enqueue/dequeue_task().
> + *
> + * This allows users to continue to enable uclamp in their kernel config with
> + * minimum uclamp overhead in the fast path.
> + *
> + * As soon as userspace modifies any of the uclamp knobs, the static key is
> + * enabled, since we have an actual users that make use of uclamp
> + * functionality.
> + *
> + * The knobs that would enable this static key are:
> + *
> + *   * A task modifying its uclamp value with sched_setattr().
> + *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
> + *   * An admin modifying the cgroup cpu.uclamp.{min, max}
> + */
> +static DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

This is me being choosy, but given that:

---8<---
% grep -e '_used[ )]' fair.c core.c
fair.c: return static_key_false(&__cfs_bandwidth_used);
fair.c: static_key_slow_inc_cpuslocked(&__cfs_bandwidth_used);
fair.c: static_key_slow_dec_cpuslocked(&__cfs_bandwidth_used);

% grep -e '_enabled[ )]' fair.c core.c
fair.c: if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
fair.c: if (!cfs_rq->runtime_enabled || cfs_rq->nr_running)
fair.c: if (!cfs_rq->runtime_enabled || cfs_rq->curr)
fair.c: if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
fair.c: cfs_rq->runtime_enabled = 0;
fair.c: cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
fair.c: if (!cfs_rq->runtime_enabled)
fair.c: cfs_rq->runtime_enabled = 0;
core.c: if (static_key_false((&paravirt_steal_rq_enabled))) {
core.c: if (unlikely(!p->sched_class->uclamp_enabled))
core.c: if (unlikely(!p->sched_class->uclamp_enabled))
core.c:  * Prevent race between setting of cfs_rq->runtime_enabled and
core.c: runtime_enabled = quota != RUNTIME_INF;
core.c: runtime_was_enabled = cfs_b->quota != RUNTIME_INF;
core.c: if (runtime_enabled && !runtime_was_enabled)
core.c: if (runtime_enabled)
core.c: cfs_rq->runtime_enabled = runtime_enabled;
core.c: if (runtime_was_enabled && !runtime_enabled)
---8<---

even just for consistency's sake, I would still prefer sched_uclamp_enabled for
the static key name.

> +
>  /* Integer rounded range for each bucket */
>  #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, 
> UCLAMP_BUCKETS)
>  
> @@ -994,9 +1013,16 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
> struct task_struct *p,
>   lockdep_assert_held(&rq->lock);
>  
>   bucket = &uc_rq->bucket[uc_se->bucket_id];
> - SCHED_WARN_ON(!bucket->tasks);
> - if (likely(bucket->tasks))
> - bucket->tasks--;
> +
> + /*
> +  * This could happen if sched_uclamp_used was enabled while the
> +  * current task was running, hence we could end up with unbalanced call
> +  * to uclamp_rq_dec_id().
> +  */
> + if (unlikely(!bucket->tasks))
> + return;
> +
> + bucket->tasks--;
>   uc_se->active = false;

In this chunk you are indeed changing the code.

Are we sure there are no issues with patterns like:

  enqueue(taskA)
  // uclamp gets enabled
  enqueue(taskB)
  dequeue(taskA)
  // bucket->tasks is now 0
  dequeue(taskB)

TaskB has been enqueued with uclamp enabled, thus it has got
uc_se->active == true and has enforced its clamp value at RQ level.

But with your change above we don't reset that anymore.

As per my previous proposal: why not just remove the SCHED_WARN_ON()?
That's the only real problem in the code above, since now we are no
longer guaranteed to have balanced inc/dec calls.
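
To be explicit, the alternative I'm suggesting is just the following
(a minimal sketch based on the hunk quoted above, keeping the defensive
check while dropping only the warning):

---8<---
	bucket = &uc_rq->bucket[uc_se->bucket_id];

	/*
	 * With the static key flipping at runtime, inc/dec can now be
	 * legitimately unbalanced, so a zero count is no longer worth a
	 * warning: keep the check, drop the SCHED_WARN_ON().
	 */
	if (likely(bucket->tasks))
		bucket->tasks--;

	uc_se->active = false;
---8<---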

[...]

> +bool uclamp_is_enabled(void)
> +{
> + return static_branch_likely(_uclamp_used);
> +}
> +

The above and the following changes are not necessary if we just
add the guards at the beginning of uclamp_rq_util_with(), instead of
what you do in PATCH1.

I think this will remove more overhead, by avoiding unnecessary
min/max clamping for both RT and CFS.
  
> diff --git a/kernel/sched/cpufreq_schedutil.c 
> b/kernel/sched/cpufreq_schedutil.c
> index 7fbaee24c824..3f4e296ccb67 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -210,7 +210,7 @@ unsigned long schedutil_cpu_util(int cpu, unsigned long 
> util_cfs,
>   unsigned long dl_util, util, irq;
>   struct rq *rq = cpu_rq(cpu);
>  
> - if (!IS_BUILTIN(CONFIG_UCLAMP_TASK) &&
> + if (!uclamp_is_enabled() &&
>   type == FREQUENCY_UTIL && rt_rq_is_runnable(>rt)) {
>   return max;
>   }
> diff --git 

Re: [PATCH v4 1/2] sched/uclamp: Fix initialization of struct uclamp_rq

2020-06-26 Thread Patrick Bellasi


On Thu, Jun 25, 2020 at 17:43:51 +0200, Qais Yousef  
wrote...

> struct uclamp_rq was zeroed out entirely in assumption that in the first
> call to uclamp_rq_inc() they'd be initialized correctly in accordance to
> default settings.

Perhaps I was not clear in my previous comment:

   https://lore.kernel.org/lkml/87sgekorfq.derkl...@matbug.net/

when I did say:

   Doesn't this mean the problem is more likely with
   uclamp_rq_util_with(), which should be guarded?

I did not mean that we have to guard the calls to that function but
instead that we should just make that function aware of uclamp being
opted in or not.

> But when next patch introduces a static key to skip
> uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
> will fail to perform any frequency changes because the
> rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
> means all rqs are capped to 0 by default.

The initialization you want to do here is needed because, with the
current approach, you keep calling the same uclamp_rq_util_with() and
keep doing min/max aggregations even when uclamp is not opted in.
But this also means that we have min/max aggregation _when not really
required_.

> Fix it by making sure we do proper initialization at init without
> relying on uclamp_rq_inc() doing it later.

My proposal was as simple as:

---8<---
  static __always_inline
  unsigned long uclamp_rq_util_with(struct rq *rq, unsigned long util,
  struct task_struct *p)
  {
unsigned long min_util = READ_ONCE(rq->uclamp[UCLAMP_MIN].value);
unsigned long max_util = READ_ONCE(rq->uclamp[UCLAMP_MAX].value);
  
+   if (!static_branch_unlikely(&sched_uclamp_used))
+   return rt_task(p) ? uclamp_none(UCLAMP_MAX) : util;
  
if (p) {
min_util = max(min_util, uclamp_eff_value(p, UCLAMP_MIN));
max_util = max(max_util, uclamp_eff_value(p, UCLAMP_MAX));
}
  
/*
 * Since CPU's {min,max}_util clamps are MAX aggregated considering
 * RUNNABLE tasks with _different_ clamps, we can end up with an
 * inversion. Fix it now when the clamps are applied.
 */
if (unlikely(min_util >= max_util))
return min_util;
  
return clamp(util, min_util, max_util);
  }
---8<---

Such a small change is more self-contained IMHO and does not remove
an existing optimization like the lazy RQ initialization at first
usage.

Moreover, it can be folded into the following patch, with all the other
static key shortcuts.



Re: [PATCH v2 2/2] sched/uclamp: Protect uclamp fast path code with static key

2020-06-24 Thread Patrick Bellasi


On Fri, Jun 19, 2020 at 19:20:11 +0200, Qais Yousef  
wrote...

[...]

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4265861e13e9..9ab22f699613 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -793,6 +793,25 @@ unsigned int sysctl_sched_uclamp_util_max = 
> SCHED_CAPACITY_SCALE;
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> +/*
> + * This static key is used to reduce the uclamp overhead in the fast path. It
> + * only disables the call to uclamp_rq_{inc, dec}() in 
> enqueue/dequeue_task().
> + *
> + * This allows users to continue to enable uclamp in their kernel config with
> + * minimum uclamp overhead in the fast path.
> + *
> + * As soon as userspace modifies any of the uclamp knobs, the static key is
> + * disabled, since we have an actual users that make use of uclamp
> + * functionality.
> + *
> + * The knobs that would disable this static key are:
> + *
> + *   * A task modifying its uclamp value with sched_setattr().
> + *   * An admin modifying the sysctl_sched_uclamp_{min, max} via procfs.
> + *   * An admin modifying the cgroup cpu.uclamp.{min, max}
> + */
> +static DEFINE_STATIC_KEY_TRUE(sched_uclamp_unused);

I would personally prefer a non negated semantic.

Why not using 'sched_uclamp_enabled'?

> +
>  /* Integer rounded range for each bucket */
>  #define UCLAMP_BUCKET_DELTA DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, 
> UCLAMP_BUCKETS)
>  
> @@ -993,9 +1012,16 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
> struct task_struct *p,
>   lockdep_assert_held(&rq->lock);
>  
>   bucket = &uc_rq->bucket[uc_se->bucket_id];
> - SCHED_WARN_ON(!bucket->tasks);
> - if (likely(bucket->tasks))
> - bucket->tasks--;
> +
> + /*
> +  * This could happen if sched_uclamp_unused was disabled while the
> +  * current task was running, hence we could end up with unbalanced call
> +  * to uclamp_rq_dec_id().
> +  */
> + if (unlikely(!bucket->tasks))
> + return;
> +
> + bucket->tasks--;

Since above you are not really changing the logic, why change the
code?

The SCHED_WARN_ON/if(likely) is a defensive programming thing.
I understand that SCHED_WARN_ON() can now be misleading because of the
unbalanced calls but... why not just remove it?

Maybe also add a note in the comment, but I don't see valid reasons to
change the code if the functionality is not changing.


>   uc_se->active = false;
>  
>   /*
> @@ -1031,6 +1057,13 @@ static inline void uclamp_rq_inc(struct rq *rq, struct 
> task_struct *p)
>  {
>   enum uclamp_id clamp_id;
>  
> + /*
> +  * Avoid any overhead until uclamp is actually used by the userspace.
> +  * Including the potential JMP if we use static_branch_unlikely()

The comment above (about unlikely) seems not to match the code?

> +  */
> + if (static_branch_likely(&sched_uclamp_unused))
> + return;

Moreover, something like:

   if (static_key_false(&sched_uclamp_enabled))
return;

is not just good enough?

> +
>   if (unlikely(!p->sched_class->uclamp_enabled))
>   return;

Since we already have these per sched_class gates, I'm wondering if it
could make sense to just re-purpose them.

Problem with the static key is that if just one RT task opts in, CFS
will still pay the overheads, and vice versa too.

So, an alternative approach could be to opt in sched classes on-demand.

The above if(unlikely) is not exactly like a static key, true, but I
assume we agree the overheads we are tackling are nothing compared to
that check, aren't they?
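
To make the idea a bit more concrete, a very rough sketch of a per-class
opt-in (all names below are invented for illustration, this is not meant
as a working patch):

---8<---
/*
 * One key per class, flipped the first time a task of that class sets a
 * clamp value: CFS tasks opting in do not add overhead to RT and vice versa.
 */
static DEFINE_STATIC_KEY_FALSE(sched_uclamp_used_fair);
static DEFINE_STATIC_KEY_FALSE(sched_uclamp_used_rt);

static __always_inline bool uclamp_class_used(struct task_struct *p)
{
	if (rt_task(p))
		return static_branch_unlikely(&sched_uclamp_used_rt);

	return static_branch_unlikely(&sched_uclamp_used_fair);
}
---8<---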


> @@ -1046,6 +1079,13 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
> task_struct *p)
>  {
>   enum uclamp_id clamp_id;
>  
> + /*
> +  * Avoid any overhead until uclamp is actually used by the userspace.
> +  * Including the potential JMP if we use static_branch_unlikely()
> +  */
> + if (static_branch_likely(&sched_uclamp_unused))
> + return;
> +
>   if (unlikely(!p->sched_class->uclamp_enabled))
>   return;
>  
> @@ -1155,9 +1195,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> *table, int write,
>   update_root_tg = true;
>   }
>  
> - if (update_root_tg)
> + if (update_root_tg) {
>   uclamp_update_root_tg();
>  
> + if (static_branch_unlikely(&sched_uclamp_unused))
> + static_branch_disable(&sched_uclamp_unused);
> + }
> +

Can we move the above into a function?

Something similar to set_schedstats(bool), what about uclamp_enable(bool)?
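
Something along these lines (just a sketch; uclamp_enable() is only a name
proposal modelled on set_schedstats(), and with the negated key from the
patch above "enable" maps to disabling sched_uclamp_unused):

---8<---
static void uclamp_enable(bool enable)
{
	if (enable)
		static_branch_disable(&sched_uclamp_unused);
	else
		static_branch_enable(&sched_uclamp_unused);
}
---8<---

The sysctl handler, the cgroup knobs and the sched_setattr() path could
then all call the same helper instead of open coding the key toggling.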

>   /*
>* We update all RUNNABLE tasks only when task groups are in use.
>* Otherwise, keep it simple and do just a lazy update at each next
> @@ -1221,6 +1265,9 @@ static void __setscheduler_uclamp(struct task_struct *p,
>   if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
>   return;
>  
> + 

Re: [PATCH v2 1/2] sched/uclamp: Fix initialization of strut uclamp_rq

2020-06-24 Thread Patrick Bellasi


Hi Qais,

On Fri, Jun 19, 2020 at 19:20:10 +0200, Qais Yousef  
wrote...

> struct uclamp_rq was zeroed out entirely in assumption that in the first
> call to uclamp_rq_inc() they'd be initialized correctly in accordance to
> default settings.
>
> But when next patch introduces a static key to skip
> uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
> will fail to perform any frequency changes because the
> rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such. Which
> means all rqs are capped to 0 by default.

Doesn't this mean the problem is more likely with uclamp_rq_util_with(),
which should be guarded?

Otherwise, we will also keep doing useless min/max aggregations each
time schedutil calls that function, thus not completely removing
uclamp overheads while user-space has not opted in.

What about dropping this and adding the guard in the following patch, along
with the others?

>
> Fix it by making sure we do proper initialization at init without
> relying on uclamp_rq_inc() doing it later.
>
> Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
> Signed-off-by: Qais Yousef 
> Cc: Juri Lelli 
> Cc: Vincent Guittot 
> Cc: Dietmar Eggemann 
> Cc: Steven Rostedt 
> Cc: Ben Segall 
> Cc: Mel Gorman 
> CC: Patrick Bellasi 
> Cc: Chris Redpath 
> Cc: Lukasz Luba 
> Cc: linux-kernel@vger.kernel.org
> ---
>  kernel/sched/core.c | 23 ++-
>  1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a43c84c27c6f..4265861e13e9 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1248,6 +1248,22 @@ static void uclamp_fork(struct task_struct *p)
>   }
>  }
>  
> +static void __init init_uclamp_rq(struct rq *rq)
> +{
> + enum uclamp_id clamp_id;
> + struct uclamp_rq *uc_rq = rq->uclamp;
> +
> + for_each_clamp_id(clamp_id) {
> + memset(uc_rq[clamp_id].bucket,
> +0,
> +sizeof(struct uclamp_bucket)*UCLAMP_BUCKETS);
> +
> + uc_rq[clamp_id].value = uclamp_none(clamp_id);
> + }
> +
> + rq->uclamp_flags = 0;
> +}
> +
>  static void __init init_uclamp(void)
>  {
>   struct uclamp_se uc_max = {};
> @@ -1256,11 +1272,8 @@ static void __init init_uclamp(void)
>  
>   mutex_init(&uclamp_mutex);
>  
> - for_each_possible_cpu(cpu) {
> - memset(&cpu_rq(cpu)->uclamp, 0,
> - sizeof(struct uclamp_rq)*UCLAMP_CNT);
> - cpu_rq(cpu)->uclamp_flags = 0;
> - }
> + for_each_possible_cpu(cpu)
> + init_uclamp_rq(cpu_rq(cpu));
>  
>   for_each_clamp_id(clamp_id) {
>   uclamp_se_set(&init_task.uclamp_req[clamp_id],



[SchedulerWakeupLatency] Per-task vruntime wakeup bonus

2020-06-23 Thread Patrick Bellasi


On Tue, Jun 23, 2020 at 09:29:03 +0200, Patrick Bellasi 
 wrote...

> .:: Scheduler Wakeup Path Requirements Collection Template
> ==
>
> A) Name

Runtime tunable vruntime wakeup bonus.


> B) Target behavior

All SCHED_OTHER tasks get the same (max) vruntime wakeup bonus. This
bonus affects the chance the task has to preempt the currently running
task. Some tasks, which are (known to be) latency tolerant, should have
a smaller chance to preempt a (known to be) latency sensitive task. To
the contrary, latency sensitive tasks should have a higher chance to
preempt a currently running latency tolerant task.

This task specific distinction is not provided by the current
implementation and all SCHED_OTHER tasks are handled according to the
same simple, system-wide and not run-time tunable policy.


> C) Existing control paths

Assuming:

 C: CFS task currently running on CPUx
 W: CFS task waking up on the same CPUx

And considering the overall simplified workflow:

core::try_to_wake_up()

  // 1) Select on which CPU W will run
  core::select_task_rq()
fair::select_task_rq_fair()

  // 2) Enqueue W on the selected CPU
  core::ttwu_queue()
core::ttwu_do_activate()
  core::activate_task()
core::enqueue_task()
  fair::enqueue_task_fair()
fair::enqueue_entity()

  // 3) Set W's vruntime bonus
  fair::place_entity()
se->vruntime = ...

  // 4) Check if C can be preempted by W
  core::ttwu_do_wakeup()
core::check_preempt_curr()
  fair::check_preempt_curr()
fair::check_preempt_wakeup(curr, se)
  fair::wakeup_preempt_entity(curr, se)
vdiff = curr.vruntime - se.vruntime
return vdiff > wakeup_gran(se)

We see that W preempts C iff:

   vdiff > wakeup_gran(se)

Since:

enqueue_entity(cfs_rq, se, flags)
  place_entity(cfs_rq, se, initial=0)
thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)
vruntime = cfs_rq->min_vruntime - thresh
se->vruntime = max_vruntime(se->vruntime, vruntime)

a waking task's W.vruntime can get a "vruntime bonus" up to:
 - 1   scheduler latency (w/  GENTLE_FAIR_SLEEPERS)
 - 1/2 scheduler latency (w/o GENTLE_FAIR_SLEEPERS)


> D) Desired behavior

The "vruntime bonus" (thresh) computed in place_entity() should have a
per-task definition, which defaults to the current implementation.

A bigger vruntime bonus can be configured for latency sensitive tasks.
A smaller vruntime bonus can be configured for latency tolerant tasks.
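
As a minimal sketch of the shape such a change could take (illustrative
only: latency_nice_to_bonus_pct() and se_latency_nice() are invented
helpers, everything else mirrors the current place_entity() logic):

---8<---
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		/*
		 * Scale the bonus by a per-task percentage derived from
		 * latency-nice: up to 100% for latency sensitive tasks,
		 * down to 0% for latency tolerant ones.
		 */
		thresh = thresh * latency_nice_to_bonus_pct(se_latency_nice(se)) / 100;

		vruntime -= thresh;
	}
---8<---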

TL;DR

The "vruntime bonus" is meant to give sleepers a compensation for the
service deficit due to them not having (possibly) fully consumed their
assigned fair CPU quota within the current sched_latency interval, see:

  commit 51e0304ce6e5 ("sched: Implement a gentler fair-sleepers feature")

The scheduler does that based on a conservative assumption: when a task
sleeps it gives up a portion (P) of its fair CPU bandwidth share in the
current sched_latency period.
Willing to be FAIR, i.e. each task gets a FAIR quota of the CPU in each
sched_latency period, the scheduler wants to give back P to waking
tasks.

However, striving to minimize overheads and complexity, the CFS
scheduler does that using a simple heuristic: each task waking up gets a
bonus, which is capped at one sched_latency period, independently from
"its nature".

What the scheduler completely disregards is that being completely FAIR
is not always necessary. Depending on the nature of a task, not all
tasks require a bonus. To the contrary:

 - a GENTLE_FAIR_SLEEPERS bonus given to a background task could result
   in preempting a latency sensitive currently running task

 - giving only 1/2 scheduler latency bonus to a latency sensitive task
   could result in that task being preempted before it completes its
   current activation.


> E) Existing knobs

The SCHED_FEAT(GENTLE_FAIR_SLEEPERS, true) defined vruntime bonus value
can be considered the current mainline default value.

This means that "all" CFS tasks waking up will get a

   0.5 * sysctl_sched_latency

vruntime bonus wrt the cfs_rq->min_vruntime.


> F) Existing knobs limitations

GENTLE_FAIR_SLEEPERS is a system-wide knob and it's not run-time
tunable on production systems (being a SCHED_DEBUG feature).

Thus, the sched_feature should be removed and replaced by a per-task
knob.


> G) Proportionality Analysis

The value of the vruntime bonus directly affects the chance a task has
to preempt the currently running task.

Indeed, from the code analysis in C:

  thresh = sysctl_sched_latency / (GENTLE_FAIR_SLEEPERS ? 2 : 1)

is the "wakeup bonus", which is used as:

  vruntime = cfs_rq->min_vruntime - thresh
  se->vruntime = max_vruntime(se->vruntime, vruntime)
  vdiff = curr.vrunti

Scheduler wakeup path tuning surface: Use-Cases and Requirements

2020-06-23 Thread Patrick Bellasi


Since last year's OSPM Summit we started conceiving the idea that task
wakeup path could be better tuned for certain classes of workloads
and usage scenarios. Various people showed interest for a possible
tuning interface for the scheduler wakeup path.


.:: The Problem
===

The discussions we had so far [1] have not been effective in clearly
identifying if a common tuning surface is possible. The last discussion
at this year's OSPM Summit [2,3] was also kind of inconclusive and left
us with the message: start by collecting the requirements and then see
what interface fits them the best.

General consensus is that a unified interface can be challenging and
maybe not feasible. However, generalisation is also a value
and we should strive for it whenever it's possible.

Someone might think that we did not put enough effort in the analysis of
requirements. Perhaps what we missed so far is also a structured and
organised way to collect requirements which also can help in factoring
out the things they have in common.


.:: The Proposal


This thread aims at providing a guided template for the description of
different task wakeup use-cases. It does that by setting a series of
questions aimed at precisely describing what's "currently broken", what
we would like to have instead and how we could achieve it.

What we propose here is that, for each wakeup use-case, we start
by replying to this email to provide the required details/comments for
a predefined list of questions. This will generate independent
discussion threads. Each thread will be free to focus on a specific
proposal but still all the thread will be reasoning around a common set
of fundamental concepts.

The hope is that, by representing all the use-cases as sufficiently
detailed responses to a common set of questions, once the discussion
settles down, we can more easily verify if there are common things
surfacing which then can be naturally wrapped into a unified user-space
API.

A first use-case description, following the template guidelines, will
be posted shortly after this message. This also will provide an example
for how to use the template.

NOTE: Whenever required, pseudo-code or simplified C can be used.

I hope this can drive a fruitful discussion in preparation for LPC!

Best,
Patrick


---8<--- For templates submissions: reply only to the following ---8<---


.:: Scheduler Wakeup Path Requirements Collection Template
==

A) Name: unique one-liner name for the proposed use-case

B) Target behaviour: one paragraph to describe the wakeup path issue

C) Existing control paths: reference to code paths to justify B)

   Assuming v5.6 as the reference kernel, this section should provide
   links to code paths such as, e.g.

   fair.c:3917
   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/fair.c?h=v5.6#n3917

   Alternatively code snippets can be added inline, e.g.

/*
 * The 'current' period is already promised to the current tasks,
 * however the extra weight of the new task will slow them down a
 * little, place the new task so that it fits in the slot that
 * stays open at the end.
 */
if (initial && sched_feat(START_DEBIT))
vruntime += sched_vslice(cfs_rq, se);

   NOTE: if the use-case exists only outside the mainline Linux kernel
 this section can stay empty

D) Desired behaviour: one paragraph to describe the desired update

   NOTE: the current mainline expression is assumed to be correct
 for existing use-cases. Thus, here we are looking for run-time
 tuning of those existing features.

E) Existing knobs (if any): reference to whatever existing tunable

   Some features can already be tuned, but perhaps only via compile time
   knobs, SCHED_FEATs or system wide tunable.
   If that's the case, we should document them here and explain how they
   currently work and what are (if any) the implicit assumptions, e.g.
   what is the currently implemented scheduling policy/heuristic.

F) Existing knobs (if any): one paragraph description of the limitations

   If the existing knobs are not up to the job for this use-case,
   shortly explain here why. It could be because a tuning surface is
   already there but it's hardcoded (e.g. compile time knob) or too
   coarse grained (e.g. a SCHED_FEAT).

G) Proportionality Analysis: check the nature of the target behavior

   Goal here is to verify and discuss if the behaviour (B) has a
   proportional nature: different values of the control knobs (E) are
   expected to produce different effects for (B).
   
   Special care should be taken to check if the target behaviour has an
   intrinsically "binary nature", i.e. only two values make really
   sense. In this case it would be very useful to argument why a
   generalisation towards a non-binary behaviours does NOT make sense.

H) Range 

Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-06-05 Thread Patrick Bellasi


On Fri, Jun 05, 2020 at 13:32:04 +0200, Qais Yousef  
wrote...

> On 06/05/20 09:55, Patrick Bellasi wrote:
>> On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef  
>> wrote...

[...]

>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 0464569f26a7..9f48090eb926 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
>> > struct task_struct *p,
>> >  * e.g. due to future modification, warn and fixup the expected 
>> > value.
>> >  */
>> > SCHED_WARN_ON(bucket->value > rq_clamp);
>> > +#if 0
>> > if (bucket->value >= rq_clamp) {
>> > bkt_clamp = uclamp_rq_max_value(rq, clamp_id, 
>> > uc_se->value);
>> > WRITE_ONCE(uc_rq->value, bkt_clamp);
>> > }
>> > +#endif
>> 
>> Yep, that's likely where we have most of the overhead at dequeue time,
>> since _sometimes_ we need to update the cpu's clamp value.
>> 
>> However, while running perf sched pipe, I expect:
>>  - all tasks to have the same clamp value
>>  - all CPUs to have _always_ at least one RUNNABLE task
>> 
>> Given these two conditions above, if the CPU is never "CFS idle" (i.e.
>> without RUNNABLE CFS tasks), the code above should never be triggered.
>> More on that later...
>
> So the cost is only incurred by idle cpus is what you're saying.

Not really, you pay the cost every time you need to reduce the CPU clamp
value. This can happen also on a busy CPU but only when you dequeue the
last task defining the current uclamp(cpu) value and the remaining
RUNNABLE tasks have a lower value.
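
A made up example, just to visualise it (clamp values are arbitrary):

  A: uclamp_min=512, B: uclamp_min=256, both RUNNABLE on CPUx
    => uclamp_min(CPUx) = max(512, 256) = 512
  dequeue(A)
    => A was the last task defining 512, thus uclamp_rq_max_value()
       has to rescan the buckets and uclamp_min(CPUx) drops to 256
  dequeue(B) while other uclamp_min=256 tasks are still RUNNABLE
    => no uclamp(cpu) update is required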

>> >  }
>> >
>> >  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>> >
>> >
>> >
>> > uclamp_rq_max_value() could be expensive as it loops over all buckets.
>> 
>> It loops over UCLAMP_CNT values which are defined to fit into a single
>
> I think you meant to say UCLAMP_BUCKETS which is defined 5 by default.

Right, UCLAMP_BUCKETS.

>> $L. That was the optimal space/time complexity compromise we found to
>> get the MAX of a set of values.
>
> It actually covers two cachelines, see below and my other email to
> Mel.

The two cache lines are covered if you consider both min and max clamps.
One single CLAMP_ID has a _size_ which fits into a single cache line.

However, to be precise:
- while uclamp_min spans a single cache line, uclamp_max is likely
  across two
- at enqueue/dequeue time we update both min/max, thus we can touch
  both cache lines

>> > Commenting this whole path out strangely doesn't just 'fix' it,
>> > but produces  better results to no-uclamp kernel :-/
>> >
>> > # ./perf bench -r 20 sched pipe -T -l 5
>> > Without uclamp:5039
>> > With uclamp:   4832
>> > With uclamp+patch: 5729
>> 
>> I explain it below: with that code removed you never decrease the CPU's
>> uclamp value. Thus, the first time you schedule an RT task you go to MAX
>> OPP and stay there forever.
>
> Okay.
>
>> 
>> > It might be because schedutil gets biased differently by uclamp..? If I 
>> > move to
>> > performance governor these numbers almost double.
>> >
>> > I don't know. But this promoted me to look closer and
>> 
>> Just to resume, when a task is dequeued we can have only these cases:
>> 
>> - uclamp(task) < uclamp(cpu):
>>   this happens when the task was co-scheduled with other tasks with
>>   higher clamp values which are still RUNNABLE.
>>   In this case there are no uclamp(cpu) updates.
>> 
>> - uclamp(task) == uclamp(cpu):
>>   this happens when the task was one of the tasks defining the current
>>   uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
>>   tasks clamp values.
>> 
>> In this last case we do _not_ always need to do a uclamp(cpu) update.
>> Indeed the update is required _only_ when that task was _the last_ task
>> defining the current uclamp(cpu) value.
>> 
>> In this case we use uclamp_rq_max_value() to do a linear scan of
>> UCLAMP_CNT values which fits into a single cache line.
>
> Again, I think you mean UCLAMP_BUCKETS here. Unless I missed something, they
> span 2 cahcelines on 64bit machines and 64b cacheline size.

Correct:
- s/UCLAMP_CNT/UCLAMP_BUCKETS/
- 1 cacheline per CLAMP_ID
- the array scan works on 1 CLAMP_ID:
  - spanning 1 cache line for uclamp_min
  - spanning (likely) two cache lines for uclamp_max

Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-06-05 Thread Patrick Bellasi


Hi Qais,

On Wed, Jun 03, 2020 at 18:52:00 +0200, Qais Yousef  
wrote...

> On 06/03/20 16:59, Vincent Guittot wrote:
>> When I want to stress the fast path i usually use "perf bench sched pipe -T "
>> The tip/sched/core on my arm octo core gives the following results for
>> 20 iterations of perf bench sched pipe -T -l 5
>> 
>> all uclamp config disabled  50035.4(+/- 0.334%)
>> all uclamp config enabled  48749.8(+/- 0.339%)   -2.64%

I usually run the same test but I don't remember such big numbers :/

>> It's quite easy to reproduce and probably easier to study the impact
>
> Thanks Vincent. This is very useful!
>
> I could reproduce that on my Juno.
>
> One of the codepath I was suspecting seems to affect it.
>
>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..9f48090eb926 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,10 +1063,12 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
> struct task_struct *p,
>  * e.g. due to future modification, warn and fixup the expected value.
>  */
> SCHED_WARN_ON(bucket->value > rq_clamp);
> +#if 0
> if (bucket->value >= rq_clamp) {
> bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> WRITE_ONCE(uc_rq->value, bkt_clamp);
> }
> +#endif

Yep, that's likely where we have most of the overhead at dequeue time,
since _sometimes_ we need to update the cpu's clamp value.

However, while running perf sched pipe, I expect:
 - all tasks to have the same clamp value
 - all CPUs to have _always_ at least one RUNNABLE task

Given these two conditions above, if the CPU is never "CFS idle" (i.e.
without RUNNABLE CFS tasks), the code above should never be triggered.
More on that later...

>  }
>
>  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>
>
>
> uclamp_rq_max_value() could be expensive as it loops over all buckets.

It loops over UCLAMP_CNT values which are defined to fit into a single
$L. That was the optimal space/time complexity compromise we found to
get the MAX of a set of values.

> Commenting this whole path out strangely doesn't just 'fix' it,
> but produces  better results to no-uclamp kernel :-/
>
> # ./perf bench -r 20 sched pipe -T -l 5
> Without uclamp:   5039
> With uclamp:  4832
> With uclamp+patch:5729

I explain it below: with that code removed you never decrease the CPU's
uclamp value. Thus, the first time you schedule an RT task you go to MAX
OPP and stay there forever.

> It might be because schedutil gets biased differently by uclamp..? If I move 
> to
> performance governor these numbers almost double.
>
> I don't know. But this promoted me to look closer and

Just to resume, when a task is dequeued we can have only these cases:

- uclamp(task) < uclamp(cpu):
  this happens when the task was co-scheduled with other tasks with
  higher clamp values which are still RUNNABLE.
  In this case there are no uclamp(cpu) updates.

- uclamp(task) == uclamp(cpu):
  this happens when the task was one of the tasks defining the current
  uclamp(cpu) value, which is defined to track the MAX of the RUNNABLE
  tasks clamp values.

In this last case we do _not_ always need to do a uclamp(cpu) update.
Indeed the update is required _only_ when that task was _the last_ task
defining the current uclamp(cpu) value.

In this case we use uclamp_rq_max_value() to do a linear scan of
UCLAMP_CNT values which fits into a single cache line.
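
For reference, the scan in question is essentially a reverse walk over the
bucket entries of a single clamp_id (UCLAMP_BUCKETS of them, as also noted
elsewhere in this thread), stopping at the first non-empty bucket; a
simplified sketch, not a verbatim copy of the kernel code:

---8<---
	for (bucket_id = UCLAMP_BUCKETS - 1; bucket_id >= 0; bucket_id--) {
		if (bucket[bucket_id].tasks)
			return bucket[bucket_id].value;
	}

	/* No RUNNABLE task left: fall back to the idle/default clamp value */
	return uclamp_idle_value(rq, clamp_id, clamp_value);
---8<---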

> I think I spotted a bug where in the if condition we check for '>='
> instead of '>', causing us to take the supposedly impossible fail safe
> path.

The fail safe path is when the '>' condition matches, which is what the
SCHED_WARN_ON tells us. Indeed, we never expect the clamp of one of the
RUNNABLE tasks to be bigger than uclamp(cpu). If that should happen we
WARN and fix the cpu clamp value for the best.

The normal path is instead '=' and, according to my previous recap,
it's expected to be executed _only_ when we dequeue the last task of the
clamp group defining the current uclamp(cpu) value.

> Mind trying with the below patch please?
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0464569f26a7..50d66d4016ff 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1063,7 +1063,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, 
> struct task_struct *p,
>  * e.g. due to future modification, warn and fixup the expected value.
>  */
> SCHED_WARN_ON(bucket->value > rq_clamp);
> -   if (bucket->value >= rq_clamp) {
> +   if (bucket->value > rq_clamp) {
> bkt_clamp = uclamp_rq_max_value(rq, clamp_id, uc_se->value);
> WRITE_ONCE(uc_rq->value, bkt_clamp);
> }

This patch is thus bogus: since we never expect uclamp(task) to be
bigger than uclamp(cpu), with this change we will never reset the clamp
value of a cpu.



Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-06-03 Thread Patrick Bellasi


Hi Dietmar,
thanks for sharing these numbers.

On Tue, Jun 02, 2020 at 18:46:00 +0200, Dietmar Eggemann 
 wrote...

[...]

> I ran these tests on 'Ubuntu 18.04 Desktop' on Intel E5-2690 v2
> (2 sockets * 10 cores * 2 threads) with powersave governor as:
>
> $ numactl -N 0 ./run-mmtests.sh XXX

Great setup, it's worth ruling out all possible noise sources (freq
scaling, thermal throttling, NUMA scheduler, etc.).
Wondering if disabling HT can also help here in reducing the results "noise"?

> w/ config-network-netperf-unbound.
>
> Running w/o 'numactl -N 0' gives slightly worse results.
>
> without-clamp  : CONFIG_UCLAMP_TASK is not set
> with-clamp : CONFIG_UCLAMP_TASK=y,
>  CONFIG_UCLAMP_TASK_GROUP is not set
> with-clamp-tskgrp  : CONFIG_UCLAMP_TASK=y,
>  CONFIG_UCLAMP_TASK_GROUP=y
>
>
> netperf-udp
> ./5.7.0-rc7./5.7.0-rc7
> ./5.7.0-rc7
>   without-clamp with-clamp  
> with-clamp-tskgrp

Can you please specify how to read the following scores? I gave my local
netperf a run and it reports Throughput, thus I would expect the higher
the better... but... this seems to be something different.

> Hmean send-64 153.62 (   0.00%)  151.80 *  -1.19%*  
> 155.60 *   1.28%*
> Hmean send-128306.77 (   0.00%)  306.27 *  -0.16%*  
> 309.39 *   0.85%*
> Hmean send-256608.54 (   0.00%)  604.28 *  -0.70%*  
> 613.42 *   0.80%*
> Hmean send-1024  2395.80 (   0.00%) 2365.67 *  -1.26%* 
> 2409.50 *   0.57%*
> Hmean send-2048  4608.70 (   0.00%) 4544.02 *  -1.40%* 
> 4665.96 *   1.24%*
> Hmean send-3312  7223.97 (   0.00%) 7158.88 *  -0.90%* 
> 7331.23 *   1.48%*
> Hmean send-4096  8729.53 (   0.00%) 8598.78 *  -1.50%* 
> 8860.47 *   1.50%*
> Hmean send-8192 14961.77 (   0.00%)14418.92 *  -3.63%*
> 14908.36 *  -0.36%*
> Hmean send-1638425799.50 (   0.00%)25025.64 *  -3.00%*
> 25831.20 *   0.12%*

If I read it as the lower the score the better, all the above results
tell us that with-clamp is even better, while with-clamp-tskgrp
is not that much worse.

The other way around (the higher the score the better) would look odd:
since we definitely add more code and complexity when uclamp has TG
support enabled, we would not expect better scores.

> Hmean recv-64 153.62 (   0.00%)  151.80 *  -1.19%*  
> 155.60 *   1.28%*
> Hmean recv-128306.77 (   0.00%)  306.27 *  -0.16%*  
> 309.39 *   0.85%*
> Hmean recv-256608.54 (   0.00%)  604.28 *  -0.70%*  
> 613.42 *   0.80%*
> Hmean recv-1024  2395.80 (   0.00%) 2365.67 *  -1.26%* 
> 2409.50 *   0.57%*
> Hmean recv-2048  4608.70 (   0.00%) 4544.02 *  -1.40%* 
> 4665.95 *   1.24%*
> Hmean recv-3312  7223.97 (   0.00%) 7158.88 *  -0.90%* 
> 7331.23 *   1.48%*
> Hmean recv-4096  8729.53 (   0.00%) 8598.78 *  -1.50%* 
> 8860.47 *   1.50%*
> Hmean recv-8192 14961.61 (   0.00%)14418.88 *  -3.63%*
> 14908.30 *  -0.36%*
> Hmean recv-1638425799.39 (   0.00%)25025.49 *  -3.00%*
> 25831.00 *   0.12%*
>
> netperf-tcp
>  
> Hmean 64  818.65 (   0.00%)  812.98 *  -0.69%*  
> 826.17 *   0.92%*
> Hmean 1281569.55 (   0.00%) 1555.79 *  -0.88%* 
> 1586.94 *   1.11%*
> Hmean 2562952.86 (   0.00%) 2915.07 *  -1.28%* 
> 2968.15 *   0.52%*
> Hmean 1024  10425.91 (   0.00%)10296.68 *  -1.24%*
> 10418.38 *  -0.07%*
> Hmean 2048  17454.51 (   0.00%)17369.57 *  -0.49%*
> 17419.24 *  -0.20%*
> Hmean 3312  22509.95 (   0.00%)9.69 *  -1.25%*
> 22373.32 *  -0.61%*
> Hmean 4096  25033.23 (   0.00%)24859.59 *  -0.69%*
> 24912.50 *  -0.48%*
> Hmean 8192  32080.51 (   0.00%)31744.51 *  -1.05%*
> 31800.45 *  -0.87%*
> Hmean 16384 36531.86 (   0.00%)37064.68 *   1.46%*
> 37397.71 *   2.37%*
>
> The diffs are smaller than on openSUSE Leap 15.1 and some of the
> uclamp taskgroup results are better?
>
> With this test setup we now can play with the uclamp code in
> enqueue_task() and dequeue_task().
>
> ---
>
> W/ config-network-netperf-unbound (only netperf-udp and buffer size 64):
>
> $ perf diff 5.7.0-rc7_without-clamp/perf.data 5.7.0-rc7_with-clamp/perf.data 
> | grep activate_task
>
> # Event 'cycles:ppp'
> #
> # Baseline  Delta Abs  Shared ObjectSymbol
>
>  0.02% +0.54%  [kernel.vmlinux] [k] activate_task
>  0.02% +0.38%  [kernel.vmlinux] [k] deactivate_task
>
> $ perf diff 5.7.0-rc7_without-clamp/perf.data 
> 5.7.0-rc7_with-clamp-tskgrp/perf.data | grep activate_task
>
>  0.02% +0.35%  [kernel.vmlinux] [k] 

Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-05-28 Thread Patrick Bellasi


[+Giovanni]

On Thu, May 28, 2020 at 20:29:14 +0200, Peter Zijlstra  
wrote...

> On Thu, May 28, 2020 at 05:51:31PM +0100, Qais Yousef wrote:

>> I had a humble try to catch the overhead but wasn't successful. The 
>> observation
>> wasn't missed by us too then.
>
> Right, I remember us doing benchmarks when we introduced all this and
> clearly we missed something. I would be good if Mel can share which
> benchmark hurt most so we can go have a look.

Indeed, it would be great to have a description of their test setup and
results. Perhaps Giovanni can also support us on that.



Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-05-15 Thread Patrick Bellasi


Hi Qais,
I see we are converging toward the final shape. :)

Function-wise the code looks OK to me now.

Lemme just point out a few more remarks and possible nit-picks.
I guess in the end it's up to you to decide if you wanna follow up with
a v6 and to the maintainers to decide how picky they wanna be.

Otherwise, FWIW, feel free to consider this a LGTM.

Best,
Patrick

On Mon, May 11, 2020 at 17:40:52 +0200, Qais Yousef  
wrote...

[...]

> +static inline void uclamp_sync_util_min_rt_default(struct task_struct *p,
> +enum uclamp_id clamp_id)
> +{
> + unsigned int default_util_min = sysctl_sched_uclamp_util_min_rt_default;
> + struct uclamp_se *uc_se;
> +
> + /* Only sync for UCLAMP_MIN and RT tasks */
> + if (clamp_id != UCLAMP_MIN || !rt_task(p))
> + return;
> +
> + uc_se = &p->uclamp_req[UCLAMP_MIN];

I went back to the v3 version, where this was done above:

   https://lore.kernel.org/lkml/20200429113255.ga19...@codeaurora.org/
   Message-ID: 20200429113255.ga19...@codeaurora.org

and I still don't see why we want to keep it after this first check.

IMO it's just not required and it makes the code a tiny bit uglier.

> +
> + /*
> +  * Only sync if user didn't override the default request and the sysctl
> +  * knob has changed.
> +  */
> + if (uc_se->user_defined || uc_se->value == default_util_min)
> + return;
> +

nit-pick: the two comments above are stating the obvious.

> + uclamp_se_set(uc_se, default_util_min, false);
> +}
> +
>  static inline struct uclamp_se
>  uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> @@ -907,8 +949,13 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id 
> clamp_id)
>  static inline struct uclamp_se
>  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> - struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> - struct uclamp_se uc_max = uclamp_default[clamp_id];
> + struct uclamp_se uc_req, uc_max;
> +
> + /* Sync up any change to sysctl_sched_uclamp_util_min_rt_default. */

same here: the comment is stating the obvious.

Maybe even just by using a different function name we could better document
the code, e.g. uclamp_rt_restrict(p, clamp_id);

This would implicitly convey the purpose: RT tasks can be somehow
further restricted, i.e. in addition to the TG restriction that follows.


> + uclamp_sync_util_min_rt_default(p, clamp_id);
> +
> + uc_req = uclamp_tg_restrict(p, clamp_id);
> + uc_max = uclamp_default[clamp_id];
>  
>   /* System default restrictions always apply */
>   if (unlikely(uc_req.value > uc_max.value))

[...]



Re: [PATCH v4 2/2] Documentation/sysctl: Document uclamp sysctl knobs

2020-05-11 Thread Patrick Bellasi


Hi Qais,

On Tue, May 05, 2020 at 16:56:37 +0200, Qais Yousef  
wrote...

>> > +sched_util_clamp_min_rt_default:
>> > +
>> > +
>> > +By default Linux is tuned for performance. Which means that RT tasks 
>> > always run
>> > +at the highest frequency and most capable (highest capacity) CPU (in
>> > +heterogeneous systems).
>> > +
>> > +Uclamp achieves this by setting the requested uclamp.min of all RT tasks 
>> > to
>> > +SCHED_CAPACITY_SCALE (1024) by default, which effectively boosts the 
>> > tasks to
>> > +run at the highest frequency and biases them to run on the biggest CPU.
>> > +
>> > +This knob allows admins to change the default behavior when uclamp is 
>> > being
>> > +used. In battery powered devices particularly, running at the maximum
>> > +capacity and frequency will increase energy consumption and shorten the 
>> > battery
>> > +life.
>> > +
>> > +This knob is only effective for RT tasks which the user hasn't modified 
>> > their
>> > +requested uclamp.min value via sched_setattr() syscall.
>> > +
>> > +This knob will not escape the constraint imposed by sched_util_clamp_min
>> > +defined above.
>> 
>> Perhaps it's worth specifying that this value is going to be clamped by
>> the values above? Otherwise it's a bit ambiguous to know what happens
>> when it's bigger than sched_util_clamp_min.
>
> Hmm for me that sentence says exactly what you're asking for.
>
> So what you want is
>
>   s/will not escape the constraint imposed by/will be clamped by/
>
> ?
>
> I'm not sure if this will help if the above is already ambiguous. Maybe if
> I explicitly say
>
>   ..will not escape the *range* constrained imposed by..
>
> sched_util_clamp_min is already defined as a range constraint, so hopefully it
> should hit the mark better now?

Right, that also can work.

>> 
>> > +Any modification is applied lazily on the next opportunity the scheduler 
>> > needs
>> > +to calculate the effective value of uclamp.min of the task.
>> ^
>> 
>> This is also an implementation detail, I would remove it.
>
> The idea is that this value is not updated 'immediately'/synchronously. So
> currently RUNNING tasks will not see the effect, which could generate 
> confusion
> when users trip over it. IMO giving an idea of how it's updated will help with
> expectation of the users. I doubt any will care, but I think it's an important
> behavior element that is worth conveying and documenting. I'd be happy to
> reword it if necessary.

Right, I agree on giving a hint about the lazy update. What I was pointing
out was mainly the reference to the 'effective' value. Maybe we can just
drop that word.

> I have this now
>
> """
>  984 This knob will not escape the range constraint imposed by 
> sched_util_clamp_min
>  985 defined above.
>  986
>  987 For example if
>  988
>  989 sched_util_clamp_min_rt_default = 800
>  990 sched_util_clamp_min = 600
>  991
>  992 Then the boost will be clamped to 600 because 800 is outside of the 
> permissible
>  993 range of [0:600]. This could happen for instance if a powersave mode will
>  994 restrict all boosts temporarily by modifying sched_util_clamp_min. As 
> soon as
>  995 this restriction is lifted, the requested sched_util_clamp_min_rt_default
>  996 will take effect.
>  997
>  998 Any modification is applied lazily to currently running tasks and should 
> be
>  999 visible by the next wakeup.
> """

That's better IMHO, I would just slightly change the last sentence to:

   Any modification is applied lazily to tasks and is effective
   starting from their next wakeup.

Best,
Patrick



Re: [PATCH v4 2/2] Documentation/sysctl: Document uclamp sysctl knobs

2020-05-03 Thread Patrick Bellasi


Hi Qais,

On Fri, May 01, 2020 at 13:49:27 +0200, Qais Yousef  
wrote...

[...]

> diff --git a/Documentation/admin-guide/sysctl/kernel.rst 
> b/Documentation/admin-guide/sysctl/kernel.rst
> index 0d427fd10941..521c18ce3d92 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -940,6 +940,54 @@ Enables/disables scheduler statistics. Enabling this 
> feature
>  incurs a small amount of overhead in the scheduler but is
>  useful for debugging and performance tuning.
>  
> +sched_util_clamp_min:
> +=
> +
> +Max allowed *minimum* utilization.
> +
> +Default value is SCHED_CAPACITY_SCALE (1024), which is the maximum possible
^^^

Mmm... I feel one of the two is an implementation detail which should
probably not be exposed?

The user perhaps needs to know the value (1024) but we don't need to
expose the internal representation.


> +value.
> +
> +It means that any requested uclamp.min value cannot be greater than
> +sched_util_clamp_min, i.e., it is restricted to the range
> +[0:sched_util_clamp_min].
> +
> +sched_util_clamp_max:
> +=
> +
> +Max allowed *maximum* utilization.
> +
> +Default value is SCHED_CAPACITY_SCALE (1024), which is the maximum possible
> +value.
> +
> +It means that any requested uclamp.max value cannot be greater than
> +sched_util_clamp_max, i.e., it is restricted to the range
> +[0:sched_util_clamp_max].
> +
> +sched_util_clamp_min_rt_default:
> +
> +
> +By default Linux is tuned for performance. Which means that RT tasks always 
> run
> +at the highest frequency and most capable (highest capacity) CPU (in
> +heterogeneous systems).
> +
> +Uclamp achieves this by setting the requested uclamp.min of all RT tasks to
> +SCHED_CAPACITY_SCALE (1024) by default, which effectively boosts the tasks to
> +run at the highest frequency and biases them to run on the biggest CPU.
> +
> +This knob allows admins to change the default behavior when uclamp is being
> +used. In battery powered devices particularly, running at the maximum
> +capacity and frequency will increase energy consumption and shorten the 
> battery
> +life.
> +
> +This knob is only effective for RT tasks which the user hasn't modified their
> +requested uclamp.min value via sched_setattr() syscall.
> +
> +This knob will not escape the constraint imposed by sched_util_clamp_min
> +defined above.

Perhaps it's worth specifying that this value is going to be clamped by
the values above? Otherwise it's a bit ambiguous to know what happens
when it's bigger than sched_util_clamp_min.

> +Any modification is applied lazily on the next opportunity the scheduler 
> needs
> +to calculate the effective value of uclamp.min of the task.
^

This is also an implementation detail, I would remove it.

>  
>  seccomp
>  ===


Best,
Patrick



Re: [PATCH v4 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-05-03 Thread Patrick Bellasi


Hi Qais,

few notes follows, but in general I like the way code is now organised.

On Fri, May 01, 2020 at 13:49:26 +0200, Qais Yousef  
wrote...

[...]

> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index d4f6215ee03f..e62cef019094 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -59,6 +59,7 @@ extern int sysctl_sched_rt_runtime;
>  #ifdef CONFIG_UCLAMP_TASK
>  extern unsigned int sysctl_sched_uclamp_util_min;
>  extern unsigned int sysctl_sched_uclamp_util_max;
> +extern unsigned int sysctl_sched_uclamp_util_min_rt_default;
>  #endif
>  
>  #ifdef CONFIG_CFS_BANDWIDTH
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 9a2fbf98fd6f..15d2978e1869 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -790,6 +790,26 @@ unsigned int sysctl_sched_uclamp_util_min = 
> SCHED_CAPACITY_SCALE;
>  /* Max allowed maximum utilization */
>  unsigned int sysctl_sched_uclamp_util_max = SCHED_CAPACITY_SCALE;
>  
> +/*
> + * By default RT tasks run at the maximum performance point/capacity of the
> + * system. Uclamp enforces this by always setting UCLAMP_MIN of RT tasks to
> + * SCHED_CAPACITY_SCALE.
> + *
> + * This knob allows admins to change the default behavior when uclamp is 
> being
> + * used. In battery powered devices, particularly, running at the maximum
> + * capacity and frequency will increase energy consumption and shorten the
> + * battery life.
> + *
> + * This knob only affects RT tasks that their uclamp_se->user_defined == 
> false.
> + *
> + * This knob will not override the system default sched_util_clamp_min 
> defined
> + * above.
> + *
> + * Any modification is applied lazily on the next attempt to calculate the
> + * effective value of the task.
> + */
> +unsigned int sysctl_sched_uclamp_util_min_rt_default = SCHED_CAPACITY_SCALE;
> +
>  /* All clamps are required to be less or equal than these values */
>  static struct uclamp_se uclamp_default[UCLAMP_CNT];
>  
> @@ -872,6 +892,28 @@ unsigned int uclamp_rq_max_value(struct rq *rq, enum 
> uclamp_id clamp_id,
>   return uclamp_idle_value(rq, clamp_id, clamp_value);
>  }
>  
> +static inline void uclamp_sync_util_min_rt_default(struct task_struct *p,
> +enum uclamp_id clamp_id)
> +{
> + struct uclamp_se *uc_se;
> +
> + /* Only sync for UCLAMP_MIN and RT tasks */
> + if (clamp_id != UCLAMP_MIN || likely(!rt_task(p)))
  ^^
Are we sure that likely makes any difference when used like that?

I believe you should either use:

if (likely(clamp_id != UCLAMP_MIN || !rt_task(p)))

or completely drop it.

> + return;
> +
> + uc_se = &p->uclamp_req[UCLAMP_MIN];

nit-pick: you can probably move this to declaration time.

The compiler will be smart enough to either postpone the init or, given
the likely() above, "pre-fetch" the value.

Anyway, the compiler is likely smarter than us. :)

> +
> + /*
> +  * Only sync if user didn't override the default request and the sysctl
> +  * knob has changed.
> +  */
> + if (unlikely(uc_se->user_defined) ||
> + likely(uc_se->value == sysctl_sched_uclamp_util_min_rt_default))
> + return;

Same here, I believe likely/unlikely work only when wrapping a full if()
condition. Thus, you should probably split the above into two separate
checks, which also makes for better inline documentation.
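
Concretely, something like this (same logic as above, just split so each
branch hint applies to a full condition):

---8<---
	if (unlikely(uc_se->user_defined))
		return;

	if (likely(uc_se->value == sysctl_sched_uclamp_util_min_rt_default))
		return;
---8<---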

> +
> + uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt_default, false);

Nit-pick: perhaps we can also improve readability a bit by defining, at
the beginning, an alias variable with a shorter name, e.g.

   unsigned int uclamp_min =  sysctl_sched_uclamp_util_min_rt_default;

?

> +}
> +
>  static inline struct uclamp_se
>  uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> @@ -907,8 +949,15 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id 
> clamp_id)
>  static inline struct uclamp_se
>  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> - struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> - struct uclamp_se uc_max = uclamp_default[clamp_id];
> + struct uclamp_se uc_req, uc_max;
> +
> + /*
> +  * Sync up any change to sysctl_sched_uclamp_util_min_rt_default value.
 ^
> +  */

nit-pick: we can use a single line comment if you drop the (useless)
'value' at the end.

> + uclamp_sync_util_min_rt_default(p, clamp_id);
> +
> + uc_req = uclamp_tg_restrict(p, clamp_id);
> + uc_max = uclamp_default[clamp_id];
>  
>   /* System default restrictions always apply */
>   if (unlikely(uc_req.value > uc_max.value))
> @@ -1114,12 +1163,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> *table, int write,
>   loff_t *ppos)
>  {
>   bool update_root_tg = false;
> - int 

Re: [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases

2019-10-21 Thread Patrick Bellasi
Hi Peter,

On 14-Oct 16:52, Peter Zijlstra wrote:
> 
> The energy aware schedutil patches remimded me this was still pending.
> 
> On Fri, Aug 02, 2019 at 10:47:25AM +0100, Patrick Bellasi wrote:
> > Hi Peter, Vincent,
> > is there anything different I can do on this?
> 
> I think both Vincent and me are basically fine with the patch, it was
> the Changelog/explanation for it that sat uneasy.
> 
> Specifically I think the 'confusion' around the PELT invariance stuff
> doesn't help.
> 
> I think that if you present it simply as making util_est directly follow
> upward motion and only decay on downward -- and the rationale for it --
> then it should be fine.

Ok, I'll update the commit message to remove the PELT related
ambiguity and post a new version soon.

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


Re: Usecases for the per-task latency-nice attribute

2019-09-18 Thread Patrick Bellasi


On Wed, Sep 18, 2019 at 16:22:32 +0100, Vincent Guittot wrote...

> On Wed, 18 Sep 2019 at 16:19, Patrick Bellasi  wrote:

[...]

>> $> Wakeup path tunings
>> ==
>>
>> Some additional possible use-cases was already discussed in [3]:
>>
>>  - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
>>depending on crossing certain pre-configured threshold of latency
>>niceness.
>>
>>  - dynamically bias the vruntime updates we do in place_entity()
>>depending on the actual latency niceness of a task.
>>
>>PeterZ thinks this is dangerous but that we can "(carefully) fumble a
>>bit there."
>
> I agree with Peter that we can easily break the fairness if we bias vruntime

Just to be more precise and also to better understand: here I'm
talking about turning the tweaks we already have for:

 - START_DEBIT
 - GENTLE_FAIR_SLEEPERS

a bit more parametric and proportional to the latency-nice of a task.

In principle, if a task declares a positive latency niceness, could we
not read this also as "I accept to be a bit penalised in terms of
fairness at wakeup time"?

Whatever tweaks we do there should affect anyway only one sched_latency
period... although I'm not yet sure if that's possible and how.

>>  - bias the decisions we take in check_preempt_tick() still depending
>>on a relative comparison of the current and wakeup task latency
>>niceness values.
>
> This one seems possible as it will mainly enable a task to preempt
> "earlier" the running task but will not break the fairness
> So the main impact will be the number of context switch between tasks
> to favor or not the scheduling latency

Preempting before is definitively a nice-to-have feature.

At the same time it would be interesting to support the case where a low
latency-nice task (e.g. TOP_APP) RUNNABLE on a CPU has better chances to
be executed up to completion without being preempted by a high
latency-nice task (e.g. BACKGROUND) waking up on its CPU.

For that to happen, we need a mechanism to "delay" the execution of a
less important RUNNABLE task up to a certain period.

It's impacting the fairness, true, but latency-nice in this case will
mean that we want to "complete faster", not just "start faster".

Is this definition something we can reason about?
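
For instance, a very rough sketch of that direction (latency_nice_of() and
the helper name are invented, and the baseline is only loosely modelled on
wakeup_preempt_entity(), so take it as an illustration of the intent):

---8<---
static bool latency_nice_preempt(struct sched_entity *curr, struct sched_entity *se)
{
	s64 vdiff = curr->vruntime - se->vruntime;
	s64 gran  = wakeup_gran(se);

	/*
	 * A more latency tolerant waker needs a bigger vruntime lead before
	 * it is allowed to preempt, which gives the latency sensitive
	 * current task a better chance to complete its current activation.
	 */
	if (latency_nice_of(task_of(se)) > latency_nice_of(task_of(curr)))
		gran *= 2;

	return vdiff > gran;
}
---8<---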

Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


Re: Usecases for the per-task latency-nice attribute

2019-09-18 Thread Patrick Bellasi
ith a different name, since jitters clashes
with other RT related concepts.

Maybe we don't even need a name at all, the other two attributes you
specify are good enough to identify those tasks: they are just "small
background" tasks.

  small  : because of their small util_est value
  background : because of their high latency-nice value

> already active core and thus refrains from waking up of a new core if
> possible. This requires tagging of tasks from the userspace hinting which
> tasks are un-important and thus waking-up a new core to minimize the
> latency is un-necessary for such tasks.
> As per the discussion on the posted RFC, it will be appropriate to use the
> task latency property where a task with the highest latency-nice value can
> be packed.

We should better define here what you mean by "highest" latency-nice
value: do you really mean the top of the range, e.g. 19?

Or...

> But for this specific use-cases, having just a binary value to know which
> task is latency-sensitive and which not is sufficient enough, but having a
> range is also a good way to go where above some threshold the task can be
> packed.

... yes, maybe we can reason about a "threshold based profiling" where,
for example, something like:

   /proc/sys/kernel/sched_packing_util_max: 200
   /proc/sys/kernel/sched_packing_latency_min : 17

means that a task with latency-nice >= 17 and util_est <= 200 will be packed?
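
For illustration only, the packing predicate implied above could then look
like this (the sysctl names just mirror the example, while task_util_est()
and the p->latency_nice field are assumptions borrowed from the RFC):

---8<---
static inline bool task_should_pack(struct task_struct *p)
{
	/*
	 * Pack on an already active core only tasks which are both small
	 * enough and latency tolerant enough.
	 */
	return p->latency_nice >= sysctl_sched_packing_latency_min &&
	       task_util_est(p) <= sysctl_sched_packing_util_max;
}
---8<---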


$> Wakeup path tunings
==

Some additional possible use-cases was already discussed in [3]:

 - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
   depending on crossing certain pre-configured threshold of latency
   niceness.
  
 - dynamically bias the vruntime updates we do in place_entity()
   depending on the actual latency niceness of a task.
  
   PeterZ thinks this is dangerous but that we can "(carefully) fumble a
   bit there."
  
 - bias the decisions we take in check_preempt_tick() still depending
   on a relative comparison of the current and wakeup task latency
   niceness values.

> References:
> ===
> [1]. https://lkml.org/lkml/2019/8/30/829
> [2]. https://lkml.org/lkml/2019/7/25/296

  [3]. Message-ID: <20190905114709.gm2...@hirez.programming.kicks-ass.net>
   
https://lore.kernel.org/lkml/20190905114709.gm2...@hirez.programming.kicks-ass.net/


Best,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


Re: linux-next: Tree for Sep 16 (kernel/sched/core.c)

2019-09-18 Thread Patrick Bellasi


On Wed, Sep 18, 2019 at 07:05:53 +0100, Ingo Molnar wrote...

> * Randy Dunlap  wrote:
>
>> On 9/17/19 6:38 AM, Patrick Bellasi wrote:
>> > 
>> > On Tue, Sep 17, 2019 at 08:52:42 +0100, Ingo Molnar wrote...
>> > 
>> >> * Randy Dunlap  wrote:
>> >>
>> >>> On 9/16/19 3:38 PM, Mark Brown wrote:
>> >>>> Hi all,
>> >>>>
>> >>>> Changes since 20190915:
>> >>>>
>> >>>
>> >>> on x86_64:
>> >>>
>> >>> when CONFIG_CGROUPS is not set:
>> > 
>> > Hi Randy,
>> > thanks for the report.
>> > 
>> >>>   CC  kernel/sched/core.o
>> >>> ../kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
>> >>> ../kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
>> >>>   struct css_task_iter it;
>> >>>^~
>> >>>   CC  kernel/printk/printk_safe.o
>> >>> ../kernel/sched/core.c:1084:2: error: implicit declaration of function 
>> >>> ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? 
>> >>> [-Werror=implicit-function-declaration]
>> >>>   css_task_iter_start(css, 0, &it);
>> >>>   ^~~
>> >>>   __sg_page_iter_start
>> >>> ../kernel/sched/core.c:1085:14: error: implicit declaration of function 
>> >>> ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? 
>> >>> [-Werror=implicit-function-declaration]
>> >>>   while ((p = css_task_iter_next(&it))) {
>> >>>   ^~
>> >>>   __sg_page_iter_next
>> >>> ../kernel/sched/core.c:1091:2: error: implicit declaration of function 
>> >>> ‘css_task_iter_end’; did you mean ‘get_task_cred’? 
>> >>> [-Werror=implicit-function-declaration]
>> >>>   css_task_iter_end(&it);
>> >>>   ^
>> >>>   get_task_cred
>> >>> ../kernel/sched/core.c:1081:23: warning: unused variable ‘it’ 
>> >>> [-Wunused-variable]
>> >>>   struct css_task_iter it;
>> >>>^~
>> >>>
>> >>
>> >> I cannot reproduce this build failue: I took Linus's latest which has all 
>> >> the -next scheduler commits included (ad062195731b), and an x86-64 "make 
>> >> defconfig" and a disabling of CONFIG_CGROUPS still resuls in a kernel 
>> >> that builds fine.
>> > 
>> > Same here Ingo, I cannot reproduce on arm64 and !CONFIG_CGROUPS and
>> > testing on tip/sched/core.
>> > 
>> > However, if you like, the following patch can make that code a
>> > bit more "robust".
>> > 
>> > Best,
>> > Patrick
>> > 
>> > ---8<---
>> > From 7e17b7bb08dd8dfc57e01c2a7b6875439eb47cbe Mon Sep 17 00:00:00 2001
>> > From: Patrick Bellasi 
>> > Date: Tue, 17 Sep 2019 14:12:10 +0100
>> > Subject: [PATCH 1/1] sched/core: uclamp: Fix compile error on 
>> > !CONFIG_CGROUPS
>> > 
>> > Randy reported a compiler error on x86_64 and !CONFIG_CGROUPS which is due
>> > to uclamp_update_active_tasks() using the undefined css_task_iter().
>> > 
>> > Since uclamp_update_active_tasks() is used only when cgroup support is
>> > enabled, fix that by properly guarding that function at compile time.
>> > 
>> > Signed-off-by: Patrick Bellasi 
>> > Link: 
>> > https://lore.kernel.org/lkml/1898d3c9-1997-17ce-a022-a5e28c8dc...@infradead.org/
>> > Fixes: commit babbe170e05 ("sched/uclamp: Update CPU's refcount on TG's 
>> > clamp changes")
>> 
>> Acked-by: Randy Dunlap  # build-tested
>> 
>> Thanks.
>
> Build failures like this one shouldn't depend on the compiler version - 
> and it's still a mystery how and why this build bug triggered - we cannot 
> apply the fix without knowing the answer to those questions.

Right, but it's also quite strange that it's not triggering without the
guarding above. The only definition of struct css_task_iter I can see is
the one provided in:

   include/linux/cgroup.h:50
   
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/cgroup.h?h=35f7a95266153b1cf0caca3aa9661cb721864527#n50

which is CONFIG_CGROUPS guarded.

> Can you reproduce the build bug with Linus's latest tree? If not, which 
> part of -next triggers the build failure?

I tried again using this morning's Linus tree headed at:

  commit 35f7a9526615 ("Merge tag 'devprop-5.4-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")

and compilation actually fails for me too.

Everything is fine in v5.3 with !CONFIG_CGROUPS and a git bisect
between v5.3 and Linus master points to:

  commit babbe170e053c ("sched/uclamp: Update CPU's refcount on TG's clamp 
changes")

So, I think it's really my fault for not properly testing !CONFIG_CGROUPS,
which gets selected by default via CONFIG_SCHED_AUTOGROUP.

The patch above fixes the compilation error, hope this helps.

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: linux-next: Tree for Sep 16 (kernel/sched/core.c)

2019-09-17 Thread Patrick Bellasi


On Tue, Sep 17, 2019 at 08:52:42 +0100, Ingo Molnar wrote...

> * Randy Dunlap  wrote:
>
>> On 9/16/19 3:38 PM, Mark Brown wrote:
>> > Hi all,
>> > 
>> > Changes since 20190915:
>> > 
>> 
>> on x86_64:
>> 
>> when CONFIG_CGROUPS is not set:

Hi Randy,
thanks for the report.

>>   CC  kernel/sched/core.o
>> ../kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
>> ../kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
>>   struct css_task_iter it;
>>^~
>>   CC  kernel/printk/printk_safe.o
>> ../kernel/sched/core.c:1084:2: error: implicit declaration of function 
>> ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? 
>> [-Werror=implicit-function-declaration]
>>   css_task_iter_start(css, 0, );
>>   ^~~
>>   __sg_page_iter_start
>> ../kernel/sched/core.c:1085:14: error: implicit declaration of function 
>> ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? 
>> [-Werror=implicit-function-declaration]
>>   while ((p = css_task_iter_next())) {
>>   ^~
>>   __sg_page_iter_next
>> ../kernel/sched/core.c:1091:2: error: implicit declaration of function 
>> ‘css_task_iter_end’; did you mean ‘get_task_cred’? 
>> [-Werror=implicit-function-declaration]
>>   css_task_iter_end();
>>   ^
>>   get_task_cred
>> ../kernel/sched/core.c:1081:23: warning: unused variable ‘it’ 
>> [-Wunused-variable]
>>   struct css_task_iter it;
>>^~
>> 
>
> I cannot reproduce this build failue: I took Linus's latest which has all 
> the -next scheduler commits included (ad062195731b), and an x86-64 "make 
> defconfig" and a disabling of CONFIG_CGROUPS still resuls in a kernel 
> that builds fine.

Same here Ingo, I cannot reproduce on arm64 and !CONFIG_CGROUPS and
testing on tip/sched/core.

However, if you like, the following patch can make that code a
bit more "robust".

Best,
Patrick

---8<---
From 7e17b7bb08dd8dfc57e01c2a7b6875439eb47cbe Mon Sep 17 00:00:00 2001
From: Patrick Bellasi 
Date: Tue, 17 Sep 2019 14:12:10 +0100
Subject: [PATCH 1/1] sched/core: uclamp: Fix compile error on !CONFIG_CGROUPS

Randy reported a compiler error on x86_64 and !CONFIG_CGROUPS which is due
to uclamp_update_active_tasks() using the undefined css_task_iter().

Since uclamp_update_active_tasks() is used only when cgroup support is
enabled, fix that by properly guarding that function at compile time.

Signed-off-by: Patrick Bellasi 
Link: 
https://lore.kernel.org/lkml/1898d3c9-1997-17ce-a022-a5e28c8dc...@infradead.org/
Fixes: commit babbe170e05 ("sched/uclamp: Update CPU's refcount on TG's clamp 
changes")
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3c7b90bcbe4e..14873ad4b34a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,7 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
 static inline void
 uclamp_update_active(struct task_struct *p, enum uclamp_id clamp_id)
 {
@@ -1091,7 +1092,6 @@ uclamp_update_active_tasks(struct cgroup_subsys_state 
*css,
css_task_iter_end();
 }
 
-#ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
 {
-- 
2.22.0
---8<---

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...

> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>> There's a few things wrong there; I really feel that if we call it nice,
>>> it should be like nice. Otherwise we should call it latency-bias and not
>>> have the association with nice to confuse people.
>>>
>>> Secondly; the default should be in the middle of the range. Naturally
>>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>>> want [0,1024], then the default really should be 512, but personally I
>>> like 0 better as a default, in which case we need negative numbers.
>>>
>>> This is important because we want to be able to bias towards less
>>> importance to (tail) latency as well as more importantance to (tail)
>>> latency.
>>>
>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>> 
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>> 
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>> 
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>> 
>> Give the two extremes above, would not be much simpler and intuitive to
>> have 0 implementing the FB/Android (no latency) case and 1024 the
>> (max latency) Oracle case?
>> 
>
> For something like latency-, I don't see the point of having
> such a wide range. The nice range is probably more than enough - and before
> even bothering about the range, we should probably agree on what the range
> should represent.
>
> If it's niceness, I read it as: positive latency-nice value means we're
> nice to latency, means we reduce it. So the further up you go, the more you
> restrict your wakeup scan. I think it's quite easy to map that into the
> code: current behaviour at 0, with a decreasing scan mask size as we go
> towards +19. I don't think anyone needs 512 steps to tune this.
>
> I don't know what logic we'd follow for negative values though. Maybe
> latency-nice -20 means always going through the slowpath, but what of the
> intermediate values?

Yep, I think so far we are all converging towards the idea of using a
signed range. Regarding the range itself, yes: 1024 looks very
oversized, but +-20 is still something which leaves room for a bit of
flexibility and it also better matches the idea that we don't want to
"enumerate behaviours" but just expose a knob. To map certain "bias"
values we could benefit from a slightly larger range.

> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be

For the wakeup path there is also the TurboSched proposal by Parth:

   Message-ID: <20190725070857.6639-1-pa...@linux.ibm.com> 
   https://lore.kernel.org/lkml/20190725070857.6639-1-pa...@linux.ibm.com/

we should keep in mind.

> applied elsewhere (e.g. load-balance, something like task_hot() and its
> use of sysctl_sched_migration_cost).

For LB, can you come up with a better description of which usages you
think could benefit from a "per task" or "per task-group" latency niceness?

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:40:30 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:
>
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>> 
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>> 
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>
> I'm arguing that signed _is_ more human friendly ;-)

... but you are not human. :)

>> Give the two extremes above, would not be much simpler and intuitive to
>> have 0 implementing the FB/Android (no latency) case and 1024 the
>> (max latency) Oracle case?
>
> See, I find the signed thing more natural, negative is a bias away from
> latency sensitive, positive is a bias towards latency sensitive.
>
> Also; 0 is a good default value ;-)

Yes, that's appealing indeed.

>> Moreover, we will never match completely the nice semantic, give that
>> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
>> usage change for each step?
>
> Only because we were nice when implementing it. Posix leaves it
> unspecified and we could change it at any time. The only real semantics
> is a relative 'weight' (opengroup uses the term 'favourable').

Good to know, I was considering it a POSIX requirement.

>> Could changing the name to "latency-tolerance" break the tie by marking
>> its difference wrt prior/nice levels? AFAIR, that was also the original
>> proposal [1] by PaulT during the OSPM discussion.
>
> latency torrerance could still be a signed entity, positive would
> signify we're more tolerant of latency (ie. less sensitive) while
> negative would be less tolerant (ie. more sensitive).

Right.

>> For latency-nice instead we will likely base our biasing strategies on
>> some predefined (maybe system-wide configurable) const thresholds.
>
> I'm not quite sure; yes, for some of these things, like the idle search
> on wakeup, certainly. But say for wakeup-preemption, we could definitely
> make it a task relative attribute.

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:30:02 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>> On 09/05/19 12:46, Peter Zijlstra wrote:
>
>> > This is important because we want to be able to bias towards less
>> > importance to (tail) latency as well as more importantance to (tail)
>> > latency.
>> > 
>> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>> 
>> Another use case I'm considering is using latency-nice to prefer an idle CPU 
>> if
>> latency-nice is set otherwise go for the most energy efficient CPU.
>> 
>> Ie: sacrifice (some) energy for latency.
>> 
>> The way I see interpreting latency-nice here as a binary switch. But
>> maybe we can use the range to select what (some) energy to sacrifice
>> mean here. Hmmm.
>
> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).
>
> In your case, I'm thinking you mean >0, we want to lower the latency.
>
> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
>
> But as patrick said; we're very much exporting a BIAS knob, not a set of
> behaviours.

Right, although I think behaviours could still be exposed, but via a
different and configurable interface, using thresholds.

Either at compile time or via procfs we can maybe expose and properly
document what happens in the scheduler if/when a task has a "latency
niceness" crossing a given threshold.

For example, by setting something like:

   /proc/sys/kernel/sched_cfs_latency_idle = 1000

we state that the task is going to be scheduled according to the
SCHED_IDLE policy.

  ( ( (tomatoes target here) ) )
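
Just to sketch the idea, assuming a per-task latency_nice value in
[0..1024] and the knob above exported as sysctl_sched_cfs_latency_idle
(names are illustrative only):

  /*
   * Tasks crossing the threshold are treated like SCHED_IDLE entities,
   * e.g. by giving them the minimum load weight in the CFS pick path.
   */
  static bool task_is_latency_idle(struct task_struct *p)
  {
          return p->latency_nice >= sysctl_sched_cfs_latency_idle;
  }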

I'm also not sure we want to commit to user-space APIs how we internally
map/translate a "latency niceness" value into a scheduler behaviour
bias. Maybe better not, at least at the very beginning.

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:13:47 +0100, Qais Yousef wrote...

> On 09/05/19 12:46, Peter Zijlstra wrote:
>> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>> 
>> > > From just reading the above, I would expect it to have the range
>> > > [-20,19] just like normal nice. Apparently this is not so.
>> > 
>> > Regarding the range for the latency-nice values, I guess we have two
>> > options:
>> > 
>> >   - [-20..19], which makes it similar to priorities
>> >   downside: we quite likely end up with a kernel space representation
>> >   which does not match the user-space one, e.g. look at
>> >   task_struct::prio.
>> > 
>> >   - [0..1024], which makes it more similar to a "percentage"
>> > 
>> > Being latency-nice a new concept, we are not constrained by POSIX and
>> > IMHO the [0..1024] scale is a better fit.
>> > 
>> > That will translate into:
>> > 
>> >   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>> >   policies are disabled and we wakeup up as fast as possible
>> > 
>> >   latency-nice=1024 : maximum niceness, where for example we can imaging
>> >   to turn switch a CFS task to be SCHED_IDLE?
>> 
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>> 
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>> 
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importantance to (tail)
>> latency.
>> 
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Another use case I'm considering is using latency-nice to prefer an idle CPU 
> if
> latency-nice is set otherwise go for the most energy efficient CPU.
>
> Ie: sacrifice (some) energy for latency.
>
> The way I see interpreting latency-nice here as a binary switch. But maybe we
> can use the range to select what (some) energy to sacrifice mean here. Hmmm.

I see this concept possibly evolving into something more than just a
binary switch. I'm not yet convinced it makes sense and/or it's possible
but, in principle, I was thinking about these possible usages for CFS
tasks:

 - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
   depending on crossing certain pre-configured thresholds of latency
   niceness.

 - dynamically bias the vruntime updates we do in place_entity()
   depending on the actual latency niceness of a task.

 - bias the decisions we take in check_preempt_tick() still depending
   on a relative comparison of the current and wakeup task latency
   niceness values.

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 11:46:16 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>
>> > From just reading the above, I would expect it to have the range
>> > [-20,19] just like normal nice. Apparently this is not so.
>> 
>> Regarding the range for the latency-nice values, I guess we have two
>> options:
>> 
>>   - [-20..19], which makes it similar to priorities
>>   downside: we quite likely end up with a kernel space representation
>>   which does not match the user-space one, e.g. look at
>>   task_struct::prio.
>> 
>>   - [0..1024], which makes it more similar to a "percentage"
>> 
>> Being latency-nice a new concept, we are not constrained by POSIX and
>> IMHO the [0..1024] scale is a better fit.
>> 
>> That will translate into:
>> 
>>   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>>   policies are disabled and we wakeup up as fast as possible
>> 
>>   latency-nice=1024 : maximum niceness, where for example we can imaging
>>   to turn switch a CFS task to be SCHED_IDLE?
>
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
>
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. but if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
>
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importantance to (tail)
> latency.
>
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.

Right, we have this dualism to deal with and current mainline behaviour
is somehow in the middle.

BTW, the FB requirement is the same we have in Android.
We want some CFS tasks to have very small latency and a low chance
to be preempted by the wake-up of less-important "background" tasks.

I'm not totally against the usage of a signed range, but I'm thinking
that since we are introducing a new (non POSIX) concept we can get the
chance to make it more human friendly.

Given the two extremes above, would it not be much simpler and more
intuitive to have 0 implement the FB/Android (no latency) case and 1024
the (max latency) Oracle case?

Moreover, we will never completely match the nice semantics, given that
a 1 nice unit has a proper mathematical meaning: isn't it something like
a 10% CPU usage change for each step?

For latency-nice instead we will likely base our biasing strategies on
some predefined (maybe system-wide configurable) const thresholds.

Could changing the name to "latency-tolerance" break the tie by marking
its difference wrt prio/nice levels? AFAIR, that was also the original
proposal [1] by PaulT during the OSPM discussion.

Best,
Patrick

[1] https://youtu.be/oz43thSFqmk?t=1302

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 0/9] Task latency-nice

2019-09-05 Thread Patrick Bellasi


On Fri, Aug 30, 2019 at 18:49:35 +0100, subhra mazumdar wrote...

> Introduce new per task property latency-nice for controlling scalability
> in scheduler idle CPU search path.

As per my comments in the other message, we should try to better split the
"latency nice" concept introduction (API and mechanisms) from its usage
in different scheduler code paths.

This distinction should be evident from the cover letter too. What you
use "latency nice" for is just one possible use-case, thus it's
important to make sure it's generic enough to fit other usages too.

> Valid latency-nice values are from 1 to
> 100 indicating 1% to 100% search of the LLC domain in select_idle_cpu. New
> CPU cgroup file cpu.latency-nice is added as an interface to set and get.
> All tasks in the same cgroup share the same latency-nice value. Using a
> lower latency-nice value can help latency intolerant tasks e.g very short
> running OLTP threads where full LLC search cost can be significant compared
> to run time of the threads. The default latency-nice value is 5.
>
> In addition to latency-nice, it also adds a new sched feature SIS_CORE to
> be able to disable idle core search altogether which is costly and hurts
> more than it helps in short running workloads.

I don't see why your latency-nice cannot be used to implement what you
get with NO_SIS_CORE.

IMHO, "latency nice" should be exposed as a pretty generic concept which
progressively enables different "levels of biasing" both at wake-up time
and load-balance time.

Why is it not possible to implement the SIS_CORE/NO_SIS_CORE switch
just as different threshold values for the latency-nice value of a task?
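
Just to make the question more concrete, this is roughly what I mean,
assuming a per-task latency_nice field and an illustrative
sysctl_sched_sis_core_latency threshold, in select_idle_sibling():

  /*
   * Scan for a fully idle core only when the waking task is latency
   * sensitive enough to justify the extra cost.
   */
  if (p->latency_nice <= sysctl_sched_sis_core_latency) {
          i = select_idle_core(p, sd, target);
          if ((unsigned)i < nr_cpumask_bits)
                  return i;
  }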

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 4/9] sched: SIS_CORE to disable idle core search

2019-09-05 Thread Patrick Bellasi


On Fri, Aug 30, 2019 at 18:49:39 +0100, subhra mazumdar wrote...

> Use SIS_CORE to disable idle core search. For some workloads
> select_idle_core becomes a scalability bottleneck, removing it improves
> throughput. Also there are workloads where disabling it can hurt latency,
> so need to have an option.
>
> Signed-off-by: subhra mazumdar 
> ---
>  kernel/sched/fair.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c31082d..23ec9c6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6268,9 +6268,11 @@ static int select_idle_sibling(struct task_struct *p, 
> int prev, int target)
>   if (!sd)
>   return target;
>  
> - i = select_idle_core(p, sd, target);
> - if ((unsigned)i < nr_cpumask_bits)
> - return i;
> + if (sched_feat(SIS_CORE)) {
> + i = select_idle_core(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
>  
>   i = select_idle_cpu(p, sd, target);
>   if ((unsigned)i < nr_cpumask_bits)

This looks like it should be squashed with the previous one, or with
whatever code you'll add to define when this "biasing" is to be used or not.

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 3/9] sched: add sched feature to disable idle core search

2019-09-05 Thread Patrick Bellasi


On Fri, Aug 30, 2019 at 18:49:38 +0100, subhra mazumdar wrote...

> Add a new sched feature SIS_CORE to have an option to disable idle core
> search (select_idle_core).
>
> Signed-off-by: subhra mazumdar 
> ---
>  kernel/sched/features.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 858589b..de4d506 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_AVG_CPU, false)
>  SCHED_FEAT(SIS_PROP, true)
> +SCHED_FEAT(SIS_CORE, true)

Why do we need a sched_feature? If you think there are systems in
which the usage of latency-nice does not make sense in "Select Idle
Sibling", then we should probably better add a new Kconfig option.

If that's the case, you can probably use the init/Kconfig's
"Scheduler features" section, recently added by:

  commit 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")

>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...

> On 9/4/19 11:02 PM, Tim Chen wrote:
>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>> 
>> 
>> Subhra,
>> 
>> Thanks for posting the patchset.  Having a latency nice hint
>> is useful beyond idle load balancing.  I can think of other
>> application scenarios, like scheduling batch machine learning AVX 512
>> processes with latency sensitive processes.  AVX512 limits the frequency
>> of the CPU and it is best to avoid latency sensitive task on the
>> same core with AVX512.  So latency nice hint allows the scheduler
>> to have a criteria to determine the latency sensitivity of a task
>> and arrange latency sensitive tasks away from AVX512 tasks.
>> 
>
>
> Hi Tim and Subhra,
>
> This patchset seems to be interesting for my TurboSched patches as well
> where I try to pack jitter tasks on fewer cores to get higher Turbo 
> Frequencies.
> Well, the problem I face is that we sometime end up putting multiple jitter 
> tasks on a core
> running some latency sensitive application which may see performance 
> degradation.
> So my plan was to classify such tasks to be latency sensitive thereby hinting 
> the load
> balancer to not put tasks on such cores.
>
> TurboSched: https://lkml.org/lkml/2019/7/25/296
>
>> You configure the latency hint on a cgroup basis.
>> But I think not all tasks in a cgroup necessarily have the same
>> latency sensitivity.
>> 
>> For example, I can see that cgroup can be applied on a per user basis,
>> and the user could run different tasks that have different latency 
>> sensitivity.
>> We may also need a way to configure latency sensitivity on a per task basis 
>> instead on
>> a per cgroup basis.
>> 
>
> AFAIU, the problem defined above intersects with my patches as well where the 
> interface
> is required to classify the jitter tasks. I have already tried few methods 
> like
> syscall and cgroup to classify such tasks and maybe something like that can 
> be adopted
> with these patchset as well.

Agree, these two patchsets are definitely overlapping in terms of
mechanisms and APIs to expose to userspace. You two guys seem to target
different goals but the general approach should be:

 - expose a single and abstract concept to user-space:
   latency-nice or latency-tolerance, as PaulT proposed at OSPM

 - map this concept in kernel-space to different kinds of bias, both at
   wakeup time and load-balance time, and use it for both RT and CFS tasks.

That's my understanding at least ;)

I guess we will have interesting discussions at the upcoming LPC to
figure out a solution fitting all needs.

> Thanks,
> Parth

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi
ency_nice_write_u64,
> + },
>   { } /* Terminate */
>  };
>  
> @@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
>   .write = cpu_max_write,
>   },
>  #endif
> + {
> + .name = "latency-nice",
> + .read_u64 = cpu_latency_nice_read_u64,
> + .write_u64 = cpu_latency_nice_write_u64,
> + },
>   { } /* terminate */
>  };
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..b08d00c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, 
> struct task_group *parent)
>   goto err;
>  
>   tg->shares = NICE_0_LOAD;
> + tg->latency_nice = LATENCY_NICE_DEFAULT;
   
Maybe NICE_0_LATENCY would be better, to be more consistent?


>   init_cfs_bandwidth(tg_cfs_bandwidth(tg));
>  
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq 
> *this_rq) { }
>  #define NICE_0_LOAD  (1L << NICE_0_LOAD_SHIFT)
>  
>  /*
> + * Latency-nice default value
> + */
> +#define  LATENCY_NICE_DEFAULT   5
> +#define  LATENCY_NICE_MIN   1
> +#define  LATENCY_NICE_MAX   100

Values 1 and 5 look kind of arbitrary.
For the range specifically, I already commented in this other message:

   Message-ID: <87r24v2i14@arm.com>
   https://lore.kernel.org/lkml/87r24v2i14@arm.com/

> +
> +/*
>   * Single value that decides SCHED_DEADLINE internal math precision.
>   * 10 -> just above 1us
>   * 9  -> just above 0.5us
> @@ -362,6 +369,7 @@ struct cfs_bandwidth {
>  /* Task group related information */
>  struct task_group {
>   struct cgroup_subsys_state css;
> + u64 latency_nice;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   /* schedulable entities of this group on each CPU */


Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...

> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
>
> *sigh*, no. We start with a normal per task attribute, and then later,
> if it is needed and makes sense, we add it to cgroups.

FWIW, to add on top of what Peter says, we used this same approach for
uclamp and it proved to be a very effective way to come up with a good
design. General principles have been:

 - a system wide API [1] (under /proc/sys/kernel/sched_*) defines
   default values for all tasks affected by that feature.
   This interface has to define also upper bounds for task specific
   values. Thus, in the case of latency-nice, it should be set by
   default to the MIN value, since that's the current mainline
   behaviour: all tasks are latency sensitive.

 - a per-task API [2] (via the sched_setattr() syscall) can be used to
   relax the system wide setting thus implementing a "nice" policy.

 - a per-taskgroup API [3] (via cpu controller's attributes) can be used
   to relax the system-wide settings and restrict the per-task API.

The above features are worth adding in that exact order.
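
As an example of the per-task API step, user-space could relax the
system default with something like the following, assuming a future
sched_attr::sched_latency_nice field and a SCHED_FLAG_LATENCY_NICE
flag (neither exists today, the names are illustrative only):

  /*
   * Needs <unistd.h>, <sys/syscall.h> and <stdio.h>, plus the
   * struct sched_attr definition and SCHED_FLAG_* from the uapi
   * <linux/sched*.h> headers, as per sched_setattr(2).
   */
  static int set_latency_nice(int nice_val)
  {
          struct sched_attr attr = {
                  .size               = sizeof(attr),
                  .sched_flags        = SCHED_FLAG_LATENCY_NICE,
                  .sched_latency_nice = nice_val,
          };

          /* pid == 0: act on the calling task */
          return syscall(SYS_sched_setattr, 0, &attr, 0);
  }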

> Also, your Changelog fails on pretty much every point. It doesn't
> explain why, it doesn't describe anything and so on.

On the description side, I guess it's worth mentioning somewhere which
scheduling classes this feature can be useful for. It's worth
mentioning that it can apply only to:

 - CFS tasks: for example, at wakeup time a task with a high
   latency-nice should avoid preempting a low latency-nice task,
   maybe by mapping the latency-nice value into a proper vruntime
   normalization value (see the sketch after this list).

 - RT tasks: for example, at wakeup time a task with a high
   latency-nice value could avoid preempting a CFS task.
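
For the CFS case in the first bullet, a minimal wakeup-preemption
sketch, again assuming a per-task latency_nice field (illustrative
only):

  /*
   * Do not let a more latency tolerant wakee preempt a more latency
   * sensitive current task.
   */
  static bool latency_nice_allows_preempt(struct task_struct *curr,
                                          struct task_struct *wakee)
  {
          return wakee->latency_nice <= curr->latency_nice;
  }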

I'm sure there will be discussion about some of these features; that's
why it's important in the proposal presentation to keep a well defined
distinction between the "mechanisms and API" and how we use the new
concept to "bias" some scheduler policies.

> From just reading the above, I would expect it to have the range
> [-20,19] just like normal nice. Apparently this is not so.

Regarding the range for the latency-nice values, I guess we have two
options:

  - [-20..19], which makes it similar to priorities
  downside: we quite likely end up with a kernel space representation
  which does not match the user-space one, e.g. look at
  task_struct::prio.

  - [0..1024], which makes it more similar to a "percentage"

Being latency-nice a new concept, we are not constrained by POSIX and
IMHO the [0..1024] scale is a better fit.

That will translate into:

  latency-nice=0 : default (current mainline) behaviour, all "biasing"
  policies are disabled and we wake up as fast as possible

  latency-nice=1024 : maximum niceness, where for example we can imagine
  switching a CFS task to be SCHED_IDLE?

Best,
Patrick

[1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
[2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support 
utilization clamping")
[3] 5 patches in today's tip/sched/core up to:
    commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp 
changes")

-- 
#include 

Patrick Bellasi


[tip: sched/core] sched/uclamp: Propagate parent clamps

2019-09-03 Thread tip-bot2 for Patrick Bellasi
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0b60ba2dd342016e4e717dbaa4ca9af3a43f4434
Gitweb:
https://git.kernel.org/tip/0b60ba2dd342016e4e717dbaa4ca9af3a43f4434
Author:Patrick Bellasi 
AuthorDate:Thu, 22 Aug 2019 14:28:07 +01:00
Committer: Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:38 +02:00

sched/uclamp: Propagate parent clamps

In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires to properly propagate and aggregate
parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation to ensure all user-space writes
never fail while still always tracking the most restrictive values.

Signed-off-by: Patrick Bellasi 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Alessio Balsini 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Quentin Perret 
Cc: Rafael J . Wysocki 
Cc: Steve Muckle 
Cc: Suren Baghdasaryan 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c  | 44 +++-
 kernel/sched/sched.h |  2 ++-
 2 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c186abe..8855481 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1166,6 +1166,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+   root_task_group.uclamp[clamp_id] = uc_max;
 #endif
}
 }
@@ -6824,6 +6825,7 @@ static inline void alloc_uclamp_sched_group(struct 
task_group *tg,
for_each_clamp_id(clamp_id) {
uclamp_se_set(>uclamp_req[clamp_id],
  uclamp_none(clamp_id), false);
+   tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
 #endif
 }
@@ -7070,6 +7072,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset 
*tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+   struct cgroup_subsys_state *top_css = css;
+   struct uclamp_se *uc_parent = NULL;
+   struct uclamp_se *uc_se = NULL;
+   unsigned int eff[UCLAMP_CNT];
+   unsigned int clamp_id;
+   unsigned int clamps;
+
+   css_for_each_descendant_pre(css, top_css) {
+   uc_parent = css_tg(css)->parent
+   ? css_tg(css)->parent->uclamp : NULL;
+
+   for_each_clamp_id(clamp_id) {
+   /* Assume effective clamps matches requested clamps */
+   eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+   /* Cap effective clamps with parent's effective clamps 
*/
+   if (uc_parent &&
+   eff[clamp_id] > uc_parent[clamp_id].value) {
+   eff[clamp_id] = uc_parent[clamp_id].value;
+   }
+   }
+   /* Ensure protection is always capped by limit */
+   eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+   /* Propagate most restrictive effective clamps */
+   clamps = 0x0;
+   uc_se = css_tg(css)->uclamp;
+   for_each_clamp_id(clamp_id) {
+   if (eff[clamp_id] == uc_se[clamp_id].value)
+   continue;
+   uc_se[clamp_id].value = eff[clamp_id];
+   uc_se[clamp_id].bucket_id = 
uclamp_bucket_id(eff[clamp_id]);
+   clamps |= (0x1 << clamp_id);
+   }
+   if (!clamps)
+   css = css_rightmost_descendant(css);
+   }
+}
 
 /*
  * Integer 10^N with a given N exponent by casting to integer the literal "1eN"
@@ -7138,6 +7179,9 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file 
*of, char *buf,
 */
   

[tip: sched/core] sched/uclamp: Extend CPU's cgroup controller

2019-09-03 Thread tip-bot2 for Patrick Bellasi
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 2480c093130f64ac3a410504fa8b3db1fc4b87ce
Gitweb:
https://git.kernel.org/tip/2480c093130f64ac3a410504fa8b3db1fc4b87ce
Author:Patrick Bellasi 
AuthorDate:Thu, 22 Aug 2019 14:28:06 +01:00
Committer: Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:37 +02:00

sched/uclamp: Extend CPU's cgroup controller

The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run at least at a
  minimum frequency which corresponds to the uclamp.min
  utilization

- uclamp.max: defines the maximum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run up to a
  maximum frequency which corresponds to the uclamp.max
  utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depend on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
relate updates.

Signed-off-by: Patrick Bellasi 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Alessio Balsini 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Quentin Perret 
Cc: Rafael J . Wysocki 
Cc: Steve Muckle 
Cc: Suren Baghdasaryan 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar 
---
 Documentation/admin-guide/cgroup-v2.rst |  34 -
 init/Kconfig|  22 +++-
 kernel/sched/core.c | 193 ++-
 kernel/sched/sched.h|   8 +-
 4 files changed, 253 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005..5f1c266 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit 
models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the freq

[tip: sched/core] sched/uclamp: Propagate system defaults to the root group

2019-09-03 Thread tip-bot2 for Patrick Bellasi
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 7274a5c1bbec45f06f1fff4b8c8b5855b6cc189d
Gitweb:
https://git.kernel.org/tip/7274a5c1bbec45f06f1fff4b8c8b5855b6cc189d
Author:Patrick Bellasi 
AuthorDate:Thu, 22 Aug 2019 14:28:08 +01:00
Committer: Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:38 +02:00

sched/uclamp: Propagate system defaults to the root group

The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

 - the root group represents "system resources" which are always
   entirely available from the cgroup standpoint.

 - when tuning/restricting "system resources" makes sense, tuning must
   be done using a system wide API which should also be available when
   control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping supports already the concepts of:

 - system defaults: which define the maximum possible clamp values
   usable by tasks.

 - effective clamps: which allows a parent cgroup to constrain (maybe
   temporarily) its descendants without losing the information related
   to the values "requested" from them.

Exploit these two concepts and bind them together in such a way that,
whenever system default are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

When cgroups are in use, force an update of all the RUNNABLE tasks.
Otherwise, keep things simple and do just a lazy update next time each
task will be enqueued.
Do that since we assume a more strict resource control is required when
cgroups are in use. This allows also to keep "effective" clamp values
updated in case we need to expose them to user-space.

Signed-off-by: Patrick Bellasi 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Alessio Balsini 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Quentin Perret 
Cc: Rafael J . Wysocki 
Cc: Steve Muckle 
Cc: Suren Baghdasaryan 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: https://lkml.kernel.org/r/20190822132811.31294-4-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8855481..e6800fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,10 +1017,30 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css);
+static void uclamp_update_root_tg(void)
+{
+   struct task_group *tg = _task_group;
+
+   uclamp_se_set(>uclamp_req[UCLAMP_MIN],
+ sysctl_sched_uclamp_util_min, false);
+   uclamp_se_set(>uclamp_req[UCLAMP_MAX],
+ sysctl_sched_uclamp_util_max, false);
+
+   rcu_read_lock();
+   cpu_util_update_eff(_task_group.css);
+   rcu_read_unlock();
+}
+#else
+static void uclamp_update_root_tg(void) { }
+#endif
+
 int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
 {
+   bool update_root_tg = false;
int old_min, old_max;
int result;
 
@@ -1043,16 +1063,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
*table, int write,
if (old_min != sysctl_sched_uclamp_util_min) {
uclamp_se_set(_default[UCLAMP_MIN],
  sysctl_sched_uclamp_util_min, false);
+   update_root_tg = true;
}
if (old_max != sysctl_sched_uclamp_util_max) {
uclamp_se_set(_default[UCLAMP_MAX],
  sysctl_sched_uclamp_util_max, false);
+   update_root_tg = true;
}
 
+   if (update_root_tg)
+   uclamp_update_root_tg();
+
/*
-* Updating all the RUNNABLE task is expensive, keep it simple and do
-* just a lazy update at each next enqueue time.
+* We update all RUNNABLE tasks only when task groups are in use.
+* Otherwise, keep it simple and do just a lazy update at each next
+* task enqueue time.
 */
+
goto done;
 
 undo:


[tip: sched/core] sched/uclamp: Use TG's clamps to restrict TASK's clamps

2019-09-03 Thread tip-bot2 for Patrick Bellasi
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 3eac870a324728e5d1711840dad70bcd37f3
Gitweb:
https://git.kernel.org/tip/3eac870a324728e5d1711840dad70bcd37f3
Author:Patrick Bellasi 
AuthorDate:Thu, 22 Aug 2019 14:28:09 +01:00
Committer: Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:39 +02:00

sched/uclamp: Use TG's clamps to restrict TASK's clamps

When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}queued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows to:

1. ensure cgroup clamps are always used to restrict task specific requests,
   i.e. boosted not more than its TG effective protection and capped at
   least as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
   less than what enforced by their TG effective limits and protections

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, system
defaults are still enforced.

Signed-off-by: Patrick Bellasi 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Alessio Balsini 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Quentin Perret 
Cc: Rafael J . Wysocki 
Cc: Steve Muckle 
Cc: Suren Baghdasaryan 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: https://lkml.kernel.org/r/20190822132811.31294-5-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e6800fe..c32ac07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -873,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned 
int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+   struct uclamp_se uc_max;
+
+   /*
+* Tasks in autogroups or root task group will be
+* restricted by system defaults.
+*/
+   if (task_group_is_autogroup(task_group(p)))
+   return uc_req;
+   if (task_group(p) == _task_group)
+   return uc_req;
+
+   uc_max = task_group(p)->uclamp[clamp_id];
+   if (uc_req.value > uc_max.value || !uc_req.user_defined)
+   return uc_max;
+#endif
+
+   return uc_req;
+}
+
 /*
  * The effective clamp bucket index of a task depends on, by increasing
  * priority:
  * - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
 {
-   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+   struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
 
/* System default restrictions always apply */


[tip: sched/core] sched/uclamp: Always use 'enum uclamp_id' for clamp_id values

2019-09-03 Thread tip-bot2 for Patrick Bellasi
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 0413d7f33e60751570fd6c179546bde2f7d82dcb
Gitweb:
https://git.kernel.org/tip/0413d7f33e60751570fd6c179546bde2f7d82dcb
Author:Patrick Bellasi 
AuthorDate:Thu, 22 Aug 2019 14:28:11 +01:00
Committer: Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:40 +02:00

sched/uclamp: Always use 'enum uclamp_id' for clamp_id values

The supported clamp indexes are defined in 'enum clamp_id', however, because
of the code logic in some of the first utilization clamping series versions,
sometimes we needed to use 'unsigned int' to represent indices.

This is no longer required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.

Fix it with a bulk rename now that we have all the bits merged.

Signed-off-by: Patrick Bellasi 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Alessio Balsini 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Quentin Perret 
Cc: Rafael J . Wysocki 
Cc: Steve Muckle 
Cc: Suren Baghdasaryan 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: https://lkml.kernel.org/r/20190822132811.31294-7-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c  | 38 +++---
 kernel/sched/sched.h |  2 +-
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 55a1c07..3c7b90b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -810,7 +810,7 @@ static inline unsigned int 
uclamp_bucket_base_value(unsigned int clamp_value)
return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
 }
 
-static inline unsigned int uclamp_none(int clamp_id)
+static inline enum uclamp_id uclamp_none(enum uclamp_id clamp_id)
 {
if (clamp_id == UCLAMP_MIN)
return 0;
@@ -826,7 +826,7 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
 }
 
 static inline unsigned int
-uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
+uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
  unsigned int clamp_value)
 {
/*
@@ -842,7 +842,7 @@ uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
return uclamp_none(UCLAMP_MIN);
 }
 
-static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
 unsigned int clamp_value)
 {
/* Reset max-clamp retention only on idle exit */
@@ -853,8 +853,8 @@ static inline void uclamp_idle_reset(struct rq *rq, 
unsigned int clamp_id,
 }
 
 static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
-unsigned int clamp_value)
+enum uclamp_id uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
+  unsigned int clamp_value)
 {
struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
int bucket_id = UCLAMP_BUCKETS - 1;
@@ -874,7 +874,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned 
int clamp_id,
 }
 
 static inline struct uclamp_se
-uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_req = p->uclamp_req[clamp_id];
 #ifdef CONFIG_UCLAMP_TASK_GROUP
@@ -906,7 +906,7 @@ uclamp_tg_restrict(struct task_struct *p, unsigned int 
clamp_id)
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
-uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
+uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
@@ -918,7 +918,7 @@ uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
return uc_req;
 }
 
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_eff;
 
@@ -942,7 +942,7 @@ unsigned int uclamp_eff_value(struct task_struct *p, 
unsigned int clamp_id)
  * for each bucket when all its RUNNABLE tasks require the same clamp.
  */
 static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
-   unsigned int clamp_id)
+   enum uclamp_id clamp_id)
 {
struct uclamp_rq *uc_rq = >uclamp[clamp_id];
struct uclamp_se *uc_se = >uclamp[clamp_id];
@@ -980,7 +980,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct 
task_struct *p,
  * enforce the expected state and warn.
  */
 static inline void uclamp_rq_dec_id(struct rq

[tip: sched/core] sched/uclamp: Update CPU's refcount on TG's clamp changes

2019-09-03 Thread tip-bot2 for Patrick Bellasi
The following commit has been merged into the sched/core branch of tip:

Commit-ID: babbe170e053c6ec2343751749995b7b9fd5fd2c
Gitweb:
https://git.kernel.org/tip/babbe170e053c6ec2343751749995b7b9fd5fd2c
Author:Patrick Bellasi 
AuthorDate:Thu, 22 Aug 2019 14:28:10 +01:00
Committer: Ingo Molnar 
CommitterDate: Tue, 03 Sep 2019 09:17:40 +02:00

sched/uclamp: Update CPU's refcount on TG's clamp changes

On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Alessio Balsini 
Cc: Dietmar Eggemann 
Cc: Joel Fernandes 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Quentin Perret 
Cc: Rafael J . Wysocki 
Cc: Steve Muckle 
Cc: Suren Baghdasaryan 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: Vincent Guittot 
Cc: Viresh Kumar 
Link: https://lkml.kernel.org/r/20190822132811.31294-6-patrick.bell...@arm.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 55 +++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c32ac07..55a1c07 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,54 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+   struct rq_flags rf;
+   struct rq *rq;
+
+   /*
+* Lock the task and the rq where the task is (or was) queued.
+*
+* We might lock the (previous) rq of a !RUNNABLE task, but that's the
+* price to pay to safely serialize util_{min,max} updates with
+* enqueues, dequeues and migration operations.
+* This is the same locking schema used by __set_cpus_allowed_ptr().
+*/
+   rq = task_rq_lock(p, &rf);
+
+   /*
+* Setting the clamp bucket is serialized by task_rq_lock().
+* If the task is not yet RUNNABLE and its task_struct is not
+* affecting a valid clamp bucket, the next time it's enqueued,
+* it will already see the updated clamp bucket value.
+*/
+   if (!p->uclamp[clamp_id].active) {
+   uclamp_rq_dec_id(rq, p, clamp_id);
+   uclamp_rq_inc_id(rq, p, clamp_id);
+   }
+
+   task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css,
+  unsigned int clamps)
+{
+   struct css_task_iter it;
+   struct task_struct *p;
+   unsigned int clamp_id;
+
+   css_task_iter_start(css, 0, &it);
+   while ((p = css_task_iter_next(&it))) {
+   for_each_clamp_id(clamp_id) {
+   if ((0x1 << clamp_id) & clamps)
+   uclamp_update_active(p, clamp_id);
+   }
+   }
+   css_task_iter_end(&it);
+}
+
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
@@ -7160,8 +7208,13 @@ static void cpu_util_update_eff(struct 
cgroup_subsys_state *css)
	uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
-   if (!clamps)
+   if (!clamps) {
css = css_rightmost_descendant(css);
+   continue;
+   }
+
+   /* Immediately update descendants RUNNABLE tasks */
+   uclamp_update_active_tasks(css, clamps);
}
 }
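
A side note for readers skimming the hunk above: cpu_util_update_eff() records
which clamp indexes actually changed by setting one bit per index in "clamps",
and uclamp_update_active_tasks() then refreshes only those indexes for each
task it visits. A stand-alone sketch of that filter (enum values assumed to
match the series, i.e. UCLAMP_MIN=0 and UCLAMP_MAX=1):

  #include <stdio.h>

  enum uclamp_id { UCLAMP_MIN = 0, UCLAMP_MAX, UCLAMP_CNT };

  int main(void)
  {
          unsigned int clamps = 0;

          /* pretend only the max clamp changed for this group */
          clamps |= (0x1 << UCLAMP_MAX);

          for (unsigned int clamp_id = 0; clamp_id < UCLAMP_CNT; clamp_id++) {
                  if ((0x1 << clamp_id) & clamps)
                          printf("refresh RUNNABLE tasks for clamp_id %u\n", clamp_id);
          }
          return 0;
  }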
 


Re: [PATCH v14 5/6] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

2019-09-02 Thread Patrick Bellasi


On Fri, Aug 30, 2019 at 09:48:34 +, Peter Zijlstra wrote...

> On Thu, Aug 22, 2019 at 02:28:10PM +0100, Patrick Bellasi wrote:
>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 04fc161e4dbe..fc2dc86a2abe 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, 
>> struct task_struct *p)
>>  uclamp_rq_dec_id(rq, p, clamp_id);
>>  }
>>  
>> +static inline void
>> +uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
>> +{
>> +struct rq_flags rf;
>> +struct rq *rq;
>> +
>> +/*
>> + * Lock the task and the rq where the task is (or was) queued.
>> + *
>> + * We might lock the (previous) rq of a !RUNNABLE task, but that's the
>> + * price to pay to safely serialize util_{min,max} updates with
>> + * enqueues, dequeues and migration operations.
>> + * This is the same locking schema used by __set_cpus_allowed_ptr().
>> + */
>> +rq = task_rq_lock(p, &rf);
>
> Since modifying cgroup parameters is priv only, this should be OK I
> suppose. Priv can already DoS the system anyway.

Are you referring to the possibility of DoS-ing the scheduler by
repeatedly writing cgroup attributes?

Because, in that case, I think cgroup attributes could also be written by
non-privileged users. It all depends on how they are mounted and how
permissions are set, doesn't it?

Anyway, I'm not sure we can fix that here... and in principle we could
trigger the same DoS by setting CPU affinities, which is exposed to
user-space. Isn't it?

>> +/*
>> + * Setting the clamp bucket is serialized by task_rq_lock().
>> + * If the task is not yet RUNNABLE and its task_struct is not
>> + * affecting a valid clamp bucket, the next time it's enqueued,
>> + * it will already see the updated clamp bucket value.
>> + */
>> +if (!p->uclamp[clamp_id].active)
>> +goto done;
>> +
>> +uclamp_rq_dec_id(rq, p, clamp_id);
>> +uclamp_rq_inc_id(rq, p, clamp_id);
>> +
>> +done:
>
> I'm thinking that:
>
>   if (p->uclamp[clamp_id].active) {
>   uclamp_rq_dec_id(rq, p, clamp_id);
>   uclamp_rq_inc_id(rq, p, clamp_id);
>   }
>
> was too obvious? ;-)

Yep, right... I think there was some more code in prev versions but I
forgot to get rid of that "goto" pattern after some change.

>> +
>> +task_rq_unlock(rq, p, &rf);
>> +}

Cheers,
Patrick



Re: [PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-09-02 Thread Patrick Bellasi


On Fri, Aug 30, 2019 at 09:45:05 +, Peter Zijlstra wrote...

> On Thu, Aug 22, 2019 at 02:28:06PM +0100, Patrick Bellasi wrote:
>> +#define _POW10(exp) ((unsigned int)1e##exp)
>> +#define POW10(exp) _POW10(exp)
>
> What is this magic? You're forcing a float literal into an integer.
> Surely that deserves a comment!

Yes, I'm introducing the two constants:
  UCLAMP_PERCENT_SHIFT,
  UCLAMP_PERCENT_SCALE
similar to what we have for CAPACITY. Moreover, I need both 100*100 (for
the scale) and 100 further down in the code for the: 

percent = div_u64_rem(percent, POW10(UCLAMP_PERCENT_SHIFT), );

used in cpu_uclamp_print().

That's why adding a compile time support to compute a 10^N is useful.

C provides the "1eN" literal, I just convert it to integer and to do
that at compile time I need a two level macros.

What if I add this comment just above the macro definitions:

/*
 * Integer 10^N with a given N exponent by casting to integer the literal "1eN"
 * C expression. Since there is no way to convert a macro argument (N) into a
 * character constant, use two levels of macros.
 */

is this clear enough?
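
For completeness, a minimal user-space sketch (not kernel code) of the same
trick; the indirection matters because arguments adjacent to ## are not
macro-expanded, so a single-level macro would paste the name
UCLAMP_PERCENT_SHIFT instead of its value:

  #include <stdio.h>

  #define _POW10(exp) ((unsigned int)1e##exp)
  #define POW10(exp)  _POW10(exp)

  #define UCLAMP_PERCENT_SHIFT 2

  int main(void)
  {
          /* POW10(UCLAMP_PERCENT_SHIFT) -> _POW10(2) -> ((unsigned int)1e2) == 100 */
          printf("%u\n", POW10(UCLAMP_PERCENT_SHIFT));
          return 0;
  }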

>
>> +struct uclamp_request {
>> +#define UCLAMP_PERCENT_SHIFT2
>> +#define UCLAMP_PERCENT_SCALE(100 * POW10(UCLAMP_PERCENT_SHIFT))
>> +s64 percent;
>> +u64 util;
>> +int ret;
>> +};
>> +
>> +static inline struct uclamp_request
>> +capacity_from_percent(char *buf)
>> +{
>> +struct uclamp_request req = {
>> +.percent = UCLAMP_PERCENT_SCALE,
>> +.util = SCHED_CAPACITY_SCALE,
>> +.ret = 0,
>> +};
>> +
>> +buf = strim(buf);
>> +if (strncmp("max", buf, 4)) {
>
> That is either a bug, and you meant to write: strncmp(buf, "max", 3),
> or it is not, and then you could've written: strcmp(buf, "max")

I don't think it's a bug.

The usage of 4 is intentional, to force a '\0' check while using
strncmp(). Otherwise, strncmp(buf, "max", 3) would accept also strings
starting by "max", which we don't want.

> But as written it doesn't make sense.

The code is safe but I agree that strcmp() does just the same and it
does not generate confusion. That's actually a pretty good example
on how it's not always better to use strncmp() instead of strcmp().
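
For completeness, a quick user-space sketch of the difference: the 4-byte
compare includes the terminating NUL of "max", which is what rejects longer
strings, while a 3-byte compare only checks the prefix:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          const char *buf = "maximum";

          /* '\0' vs 'i' differs at the 4th byte: "maximum" is rejected */
          printf("%d\n", strncmp("max", buf, 4) == 0);    /* prints 0 */
          /* only the "max" prefix is checked: "maximum" is accepted */
          printf("%d\n", strncmp(buf, "max", 3) == 0);    /* prints 1 */
          return 0;
  }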

Cheers,
Patrick



[PATCH v14 6/6] sched/core: uclamp: always use enum uclamp_id for clamp_id values

2019-08-22 Thread Patrick Bellasi
The supported clamp indexes are defined in enum clamp_id; however, because
of the code logic in some of the first versions of the utilization clamping
series, we sometimes needed to use unsigned int to represent indexes.

This is no longer required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.

Fix it with a bulk rename now that we have all the bits merged.

Signed-off-by: Patrick Bellasi 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
---
 kernel/sched/core.c  | 38 +++---
 kernel/sched/sched.h |  2 +-
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fc2dc86a2abe..269c14ad4473 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -810,7 +810,7 @@ static inline unsigned int 
uclamp_bucket_base_value(unsigned int clamp_value)
return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
 }
 
-static inline unsigned int uclamp_none(int clamp_id)
+static inline enum uclamp_id uclamp_none(enum uclamp_id clamp_id)
 {
if (clamp_id == UCLAMP_MIN)
return 0;
@@ -826,7 +826,7 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
 }
 
 static inline unsigned int
-uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
+uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
  unsigned int clamp_value)
 {
/*
@@ -842,7 +842,7 @@ uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
return uclamp_none(UCLAMP_MIN);
 }
 
-static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
 unsigned int clamp_value)
 {
/* Reset max-clamp retention only on idle exit */
@@ -853,8 +853,8 @@ static inline void uclamp_idle_reset(struct rq *rq, 
unsigned int clamp_id,
 }
 
 static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
-unsigned int clamp_value)
+enum uclamp_id uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
+  unsigned int clamp_value)
 {
struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
int bucket_id = UCLAMP_BUCKETS - 1;
@@ -874,7 +874,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned 
int clamp_id,
 }
 
 static inline struct uclamp_se
-uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_req = p->uclamp_req[clamp_id];
 #ifdef CONFIG_UCLAMP_TASK_GROUP
@@ -906,7 +906,7 @@ uclamp_tg_restrict(struct task_struct *p, unsigned int 
clamp_id)
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
-uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
+uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
@@ -918,7 +918,7 @@ uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
return uc_req;
 }
 
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_eff;
 
@@ -942,7 +942,7 @@ unsigned int uclamp_eff_value(struct task_struct *p, 
unsigned int clamp_id)
  * for each bucket when all its RUNNABLE tasks require the same clamp.
  */
 static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
-   unsigned int clamp_id)
+   enum uclamp_id clamp_id)
 {
	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -980,7 +980,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct 
task_struct *p,
  * enforce the expected state and warn.
  */
 static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
-   unsigned int clamp_id)
+   enum uclamp_id clamp_id)
 {
	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -1019,7 +1019,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct 
task_struct *p,
 
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
-   unsigned int clamp_id;
+   enum uclamp_id clamp_id;
 
if (unlikely(!p->sched_class->uclamp_enabled))
return;
@@ -1034,7 +1034,7 @@ static inline void uclamp_rq_inc(struct rq *rq, struct 
task_struct *p)
 
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
-   unsigned int clamp_id;
+   enum uclamp_id clamp_id;
 
if (unlikely(!p->
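
For reference while reading the rename above, the clamp indexes and the
iterator used throughout these hunks look roughly as follows (sketched from
the headers of that period; see include/linux/sched.h and kernel/sched/sched.h
in the tree for the authoritative definitions):

  /* include/linux/sched.h (sketch) */
  enum uclamp_id {
          UCLAMP_MIN = 0,         /* minimum utilization */
          UCLAMP_MAX,             /* maximum utilization */
          UCLAMP_CNT              /* utilization clamp constraints count */
  };

  /* kernel/sched/sched.h (sketch) */
  #define for_each_clamp_id(clamp_id) \
          for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)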

[PATCH v14 3/6] sched/core: uclamp: Propagate system defaults to root group

2019-08-22 Thread Patrick Bellasi
The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

 - the root group represents "system resources" which are always
   entirely available from the cgroup standpoint.

 - when tuning/restricting "system resources" makes sense, tuning must
   be done using a system wide API which should also be available when
   control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping supports already the concepts of:

 - system defaults: which define the maximum possible clamp values
   usable by tasks.

 - effective clamps: which allows a parent cgroup to constraint (maybe
   temporarily) its descendants without losing the information related
   to the values "requested" from them.

Exploit these two concepts and bind them together in such a way that,
whenever system default are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

When cgroups are in use, force an update of all the RUNNABLE tasks.
Otherwise, keep things simple and do just a lazy update next time each
task will be enqueued.
Do that since we assume a more strict resource control is required when
cgroups are in use. This allows also to keep "effective" clamp values
updated in case we need to expose them to user-space.

Signed-off-by: Patrick Bellasi 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8dab64247597..3ca054ad3f3e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,10 +1017,30 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css);
+static void uclamp_update_root_tg(void)
+{
+   struct task_group *tg = &root_task_group;
+
+   uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
+ sysctl_sched_uclamp_util_min, false);
+   uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
+ sysctl_sched_uclamp_util_max, false);
+
+   rcu_read_lock();
+   cpu_util_update_eff(&root_task_group.css);
+   rcu_read_unlock();
+}
+#else
+static void uclamp_update_root_tg(void) { }
+#endif
+
 int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
 {
+   bool update_root_tg = false;
int old_min, old_max;
int result;
 
@@ -1043,16 +1063,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
*table, int write,
if (old_min != sysctl_sched_uclamp_util_min) {
	uclamp_se_set(&uclamp_default[UCLAMP_MIN],
  sysctl_sched_uclamp_util_min, false);
+   update_root_tg = true;
}
if (old_max != sysctl_sched_uclamp_util_max) {
	uclamp_se_set(&uclamp_default[UCLAMP_MAX],
  sysctl_sched_uclamp_util_max, false);
+   update_root_tg = true;
}
 
+   if (update_root_tg)
+   uclamp_update_root_tg();
+
/*
-* Updating all the RUNNABLE task is expensive, keep it simple and do
-* just a lazy update at each next enqueue time.
+* We update all RUNNABLE tasks only when task groups are in use.
+* Otherwise, keep it simple and do just a lazy update at each next
+* task enqueue time.
 */
+
goto done;
 
 undo:
-- 
2.22.0
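
The system defaults propagated to the root task group by this patch are the
ones exposed through the sched_util_clamp_{min,max} sysctls. A minimal
user-space sketch of tuning the default boost, assuming the usual procfs
paths and sufficient privileges:

  #include <stdio.h>

  int main(void)
  {
          /* request a system-wide minimum clamp of 128 (SCHED_CAPACITY_SCALE is 1024) */
          FILE *f = fopen("/proc/sys/kernel/sched_util_clamp_min", "w");

          if (!f)
                  return 1;
          fprintf(f, "%d\n", 128);
          return fclose(f) ? 1 : 0;
  }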



[PATCH v14 4/6] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

2019-08-22 Thread Patrick Bellasi
When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}qeued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows to:

1. ensure cgroup clamps are always used to restrict task specific requests,
   i.e. boosted not more than its TG effective protection and capped at
   least as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
   less than what enforced by their TG effective limits and protections

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, system
defaults are still enforced.

Signed-off-by: Patrick Bellasi 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3ca054ad3f3e..04fc161e4dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -873,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned 
int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+   struct uclamp_se uc_max;
+
+   /*
+* Tasks in autogroups or root task group will be
+* restricted by system defaults.
+*/
+   if (task_group_is_autogroup(task_group(p)))
+   return uc_req;
+   if (task_group(p) == &root_task_group)
+   return uc_req;
+
+   uc_max = task_group(p)->uclamp[clamp_id];
+   if (uc_req.value > uc_max.value || !uc_req.user_defined)
+   return uc_max;
+#endif
+
+   return uc_req;
+}
+
 /*
  * The effective clamp bucket index of a task depends on, by increasing
  * priority:
  * - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
 {
-   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+   struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
 
/* System default restrictions always apply */
-- 
2.22.0
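
To see the "nice-like" aggregation from this patch with concrete numbers,
here is a stand-alone toy version of the uclamp_tg_restrict() decision
(types and values made up for illustration): a request above the TG
effective value, or a value never set by the task, falls back to the TG
value, while a lower explicit request is kept.

  #include <stdbool.h>
  #include <stdio.h>

  struct uclamp_se_toy { unsigned int value; bool user_defined; };

  static unsigned int effective(struct uclamp_se_toy req, struct uclamp_se_toy tg)
  {
          if (req.value > tg.value || !req.user_defined)
                  return tg.value;
          return req.value;
  }

  int main(void)
  {
          struct uclamp_se_toy tg = { .value = 512, .user_defined = false };

          printf("%u\n", effective((struct uclamp_se_toy){ 800, true  }, tg)); /* 512 */
          printf("%u\n", effective((struct uclamp_se_toy){ 200, true  }, tg)); /* 200 */
          printf("%u\n", effective((struct uclamp_se_toy){ 200, false }, tg)); /* 512 */
          return 0;
  }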



[PATCH v14 5/6] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

2019-08-22 Thread Patrick Bellasi
On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c | 58 -
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 04fc161e4dbe..fc2dc86a2abe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+   struct rq_flags rf;
+   struct rq *rq;
+
+   /*
+* Lock the task and the rq where the task is (or was) queued.
+*
+* We might lock the (previous) rq of a !RUNNABLE task, but that's the
+* price to pay to safely serialize util_{min,max} updates with
+* enqueues, dequeues and migration operations.
+* This is the same locking schema used by __set_cpus_allowed_ptr().
+*/
+   rq = task_rq_lock(p, &rf);
+
+   /*
+* Setting the clamp bucket is serialized by task_rq_lock().
+* If the task is not yet RUNNABLE and its task_struct is not
+* affecting a valid clamp bucket, the next time it's enqueued,
+* it will already see the updated clamp bucket value.
+*/
+   if (!p->uclamp[clamp_id].active)
+   goto done;
+
+   uclamp_rq_dec_id(rq, p, clamp_id);
+   uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
+   task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css,
+  unsigned int clamps)
+{
+   struct css_task_iter it;
+   struct task_struct *p;
+   unsigned int clamp_id;
+
+   css_task_iter_start(css, 0, &it);
+   while ((p = css_task_iter_next(&it))) {
+   for_each_clamp_id(clamp_id) {
+   if ((0x1 << clamp_id) & clamps)
+   uclamp_update_active(p, clamp_id);
+   }
+   }
+   css_task_iter_end(&it);
+}
+
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
@@ -7160,8 +7211,13 @@ static void cpu_util_update_eff(struct 
cgroup_subsys_state *css)
	uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
-   if (!clamps)
+   if (!clamps) {
css = css_rightmost_descendant(css);
+   continue;
+   }
+
+   /* Immediately update descendants RUNNABLE tasks */
+   uclamp_update_active_tasks(css, clamps);
}
 }
 
-- 
2.22.0



[PATCH v14 2/6] sched/core: uclamp: Propagate parent clamps

2019-08-22 Thread Patrick Bellasi
In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires to properly propagate and aggregate
parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation to ensure all user-space write
never fails while still always tracking the most restrictive values.

Signed-off-by: Patrick Bellasi 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c  | 44 
 kernel/sched/sched.h |  2 ++
 2 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b610e1a4cda..8dab64247597 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1166,6 +1166,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+   root_task_group.uclamp[clamp_id] = uc_max;
 #endif
}
 }
@@ -6824,6 +6825,7 @@ static inline void alloc_uclamp_sched_group(struct 
task_group *tg,
for_each_clamp_id(clamp_id) {
+   uclamp_se_set(&tg->uclamp_req[clamp_id],
  uclamp_none(clamp_id), false);
+   tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
 #endif
 }
@@ -7070,6 +7072,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset 
*tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+   struct cgroup_subsys_state *top_css = css;
+   struct uclamp_se *uc_parent = NULL;
+   struct uclamp_se *uc_se = NULL;
+   unsigned int eff[UCLAMP_CNT];
+   unsigned int clamp_id;
+   unsigned int clamps;
+
+   css_for_each_descendant_pre(css, top_css) {
+   uc_parent = css_tg(css)->parent
+   ? css_tg(css)->parent->uclamp : NULL;
+
+   for_each_clamp_id(clamp_id) {
+   /* Assume effective clamps matches requested clamps */
+   eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+   /* Cap effective clamps with parent's effective clamps */
+   if (uc_parent &&
+   eff[clamp_id] > uc_parent[clamp_id].value) {
+   eff[clamp_id] = uc_parent[clamp_id].value;
+   }
+   }
+   /* Ensure protection is always capped by limit */
+   eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+   /* Propagate most restrictive effective clamps */
+   clamps = 0x0;
+   uc_se = css_tg(css)->uclamp;
+   for_each_clamp_id(clamp_id) {
+   if (eff[clamp_id] == uc_se[clamp_id].value)
+   continue;
+   uc_se[clamp_id].value = eff[clamp_id];
+   uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
+   clamps |= (0x1 << clamp_id);
+   }
+   if (!clamps)
+   css = css_rightmost_descendant(css);
+   }
+}
 
 #define _POW10(exp) ((unsigned int)1e##exp)
 #define POW10(exp) _POW10(exp)
@@ -7133,6 +7174,9 @@ static ssize_t cpu_uclamp_write(struct kernfs_open_file 
*of, char *buf,
 */
tg->uclamp_pct[clamp_id] = req.percent;
 
+   /* Update effective clamps to track the most restrictive value */
+   cpu_util_update_eff(of_css(of));
+
rcu_read_unlock();
	mutex_unlock(&uclamp_mutex);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae1be61fb279..5b343112a47b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -397,6 +397,8 @@ struct task_group {
unsigned intuclamp_pct[UCLAMP_CNT];
/* Clamp values requested for a task group */
struct uclamp_seuclamp_req[UCLAMP_CNT];
+   /* Effective clamp values used for a task group */
+   struct uclamp_seuclamp[UCLAMP_CNT];
 #endif
 
 };
-- 
2.22.0
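
A worked example of the propagation rule implemented by cpu_util_update_eff()
above, with made-up values: each effective clamp is the minimum of the group
request and the parent's effective value, and the protection is then capped
by the limit.

  #include <stdio.h>

  static unsigned int min_u(unsigned int a, unsigned int b)
  {
          return a < b ? a : b;
  }

  int main(void)
  {
          unsigned int parent_min = 200, parent_max = 600;  /* parent effective clamps */
          unsigned int req_min = 800, req_max = 400;        /* child requested clamps */

          unsigned int eff_min = min_u(req_min, parent_min);      /* 200 */
          unsigned int eff_max = min_u(req_max, parent_max);      /* 400 */

          /* ensure the protection is always capped by the limit */
          eff_min = min_u(eff_min, eff_max);                      /* 200 */

          printf("effective: min=%u max=%u\n", eff_min, eff_max);
          return 0;
  }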



[PATCH v14 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-08-22 Thread Patrick Bellasi
The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run at least at a
 minimum frequency which corresponds to the uclamp.min
 utilization

- uclamp.max: defines the maximum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run up to a
 maximum frequency which corresponds to the uclamp.max
 utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depend on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
related updates.

Signed-off-by: Patrick Bellasi 
Reviewed-by: Michal Koutny 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v14:
 Message-ID: <20190806161133.ga18...@blackbody.suse.cz>
 - move uclamp_mutex usage here from the following patch
---
 Documentation/admin-guide/cgroup-v2.rst |  34 +
 init/Kconfig|  22 +++
 kernel/sched/core.c | 188 +++-
 kernel/sched/sched.h|   8 +
 4 files changed, 248 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005aa981..5f1c266131b0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit 
models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup.  Be aware that system management software may already
@@ -1016,6 +1023,33 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.rst for de
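
As a user-space sketch of writing the new attributes described in this
changelog (the cgroup path is hypothetical; the files accept a percentage,
decimals allowed, or the string "max"):

  #include <fcntl.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
          /* hypothetical delegated cgroup; adjust the path for your setup */
          const char *path = "/sys/fs/cgroup/mygroup/cpu.uclamp.min";
          const char *val = "50";         /* request a 50% minimum utilization */
          int fd = open(path, O_WRONLY);

          if (fd < 0)
                  return 1;
          if (write(fd, val, strlen(val)) != (ssize_t)strlen(val))
                  return 1;
          return close(fd) ? 1 : 0;
  }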

[PATCH v14 0/6] Add utilization clamping support (CGroups API)

2019-08-22 Thread Patrick Bellasi
eries provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==

[1] Energy Aware Scheduling

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1

[2] Expressing per-task/per-cgroup performance hints
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/


Patrick Bellasi (6):
  sched/core: uclamp: Extend CPU's cgroup controller
  sched/core: uclamp: Propagate parent clamps
  sched/core: uclamp: Propagate system defaults to root group
  sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  sched/core: uclamp: Update CPU's refcount on TG's clamp changes
  sched/core: uclamp: always use enum uclamp_id for clamp_id values

 Documentation/admin-guide/cgroup-v2.rst |  34 +++
 init/Kconfig|  22 ++
 kernel/sched/core.c | 375 ++--
 kernel/sched/sched.h|  12 +-
 4 files changed, 421 insertions(+), 22 deletions(-)

-- 
2.22.0



Re: [PATCH v13 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-08-08 Thread Patrick Bellasi


On Tue, Aug 06, 2019 at 17:11:34 +0100, Michal Koutný wrote...

> On Fri, Aug 02, 2019 at 10:08:48AM +0100, Patrick Bellasi 
>  wrote:
>> +static ssize_t cpu_uclamp_write(struct kernfs_open_file *of, char *buf,
>> +size_t nbytes, loff_t off,
>> +enum uclamp_id clamp_id)
>> +{
>> +struct uclamp_request req;
>> +struct task_group *tg;
>> +
>> +req = capacity_from_percent(buf);
>> +if (req.ret)
>> +return req.ret;
>> +
>> +rcu_read_lock();
> This should be the uclamp_mutex.
>
> (The compound results of the series is correct as the lock is introduced
> in "sched/core: uclamp: Propagate parent clamps".
> This is just for the happiness of cherry-pickers/bisectors.)

Right, will move the uclamp_mutex introduction in this patch instead of
in the following one.

>> +static inline void cpu_uclamp_print(struct seq_file *sf,
>> +enum uclamp_id clamp_id)
>> +{
>> [...]
>> +rcu_read_lock();
>> +tg = css_tg(seq_css(sf));
>> +util_clamp = tg->uclamp_req[clamp_id].value;
>> +rcu_read_unlock();
> Why is the rcu_read_lock() needed here? (I'm considering the comment in
> of_css() that should apply here (and it seems that similar uses in other
> seq_file handlers also skip this).)

So it looks like, since we are in the context of a file operation,
all the cgroup attribute read/write functions are implicitly safe.

IOW, we don't need an RCU lock since the TG data structures are
guaranteed to be available till the end of the read/write operation.

That seems to make sense... I'm wondering if keeping the RCU lock is
still a precaution for possible future code/assumption changes or just
an unnecessary overhead?

>> @@ -7369,6 +7506,20 @@ static struct cftype cpu_legacy_files[] = {
>> [...]
>> +.name = "uclamp.min",
>> [...]
>> +.name = "uclamp.max",
> I don't see technical reasons why uclamp couldn't work on legacy
> hierarchy and Tejun acked the series, despite that I'll ask -- should
> the new attributes be exposed in v1 controller hierarchy (taking into
> account the legacy API is sort of frozen and potential maintenance needs
> spanning both hierarchies)?

Not sure to get what you mean here: I'm currently exposing uclamp to
both v1 and v2 hierarchies.

Best,
Patrick

--
#include 

Patrick Bellasi


Re: [PATCH v13 2/6] sched/core: uclamp: Propagate parent clamps

2019-08-08 Thread Patrick Bellasi


On Tue, Aug 06, 2019 at 17:11:53 +0100, Michal Koutný wrote...

> On Fri, Aug 02, 2019 at 10:08:49AM +0100, Patrick Bellasi 
>  wrote:
>> @@ -7095,6 +7149,7 @@ static ssize_t cpu_uclamp_write(struct 
>> kernfs_open_file *of, char *buf,
>>  if (req.ret)
>>  return req.ret;
>>  
>> +mutex_lock(&uclamp_mutex);
>>  rcu_read_lock();
>>  
>>  tg = css_tg(of_css(of));
>> @@ -7107,7 +7162,11 @@ static ssize_t cpu_uclamp_write(struct 
>> kernfs_open_file *of, char *buf,
>>   */
>>  tg->uclamp_pct[clamp_id] = req.percent;
>>  
>> +/* Update effective clamps to track the most restrictive value */
>> +cpu_util_update_eff(of_css(of));
>> +
>>  rcu_read_unlock();
>> +mutex_unlock(&uclamp_mutex);
> Following my remarks to "[PATCH v13 1/6] sched/core: uclamp: Extend
> CPU's cgroup", I wonder if the rcu_read_lock() couldn't be moved right
> before cpu_util_update_eff(). And by extension rcu_read_(un)lock could
> be hidden into cpu_util_update_eff() closer to its actual need.

Well, if I've understood your comment in the previous message correctly,
I would say that at this stage we don't need RCU locks at all.

The reason is that cpu_util_update_eff() gets called only from
cpu_uclamp_write(), i.e. from an ongoing write operation on a cgroup
attribute, so the task group is guaranteed to be available.

We will eventually need to take the RCU lock only further down the stack,
when uclamp_update_active_tasks() gets called to update the RUNNABLE tasks
on a rq... or perhaps we don't need it at all, since we already take
task_rq_lock() for each task we visit.

Is that correct?

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v13 0/6] Add utilization clamping support (CGroups API)

2019-08-06 Thread Patrick Bellasi


On Tue, Aug 06, 2019 at 17:12:06 +0100, Michal Koutný wrote...

> On Fri, Aug 02, 2019 at 10:08:47AM +0100, Patrick Bellasi 
>  wrote:
>> Patrick Bellasi (6):
>>   sched/core: uclamp: Extend CPU's cgroup controller
>>   sched/core: uclamp: Propagate parent clamps
>>   sched/core: uclamp: Propagate system defaults to root group
>>   sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
>>   sched/core: uclamp: Update CPU's refcount on TG's clamp changes
>>   sched/core: uclamp: always use enum uclamp_id for clamp_id values

Hi Michal!

> Thank you Patrick for your patience.

Thanks to you for your reviews.

> I used the time to revisit the series once again and I think the RCU
> locks can be streamlined a bit.

I'll have a look at those, thanks!

> If you find that correct, feel free to add my Reviewed-by to the
> updated series (for 1/6 and legacy, I'm just asking).

Sure, actually sorry for not having already added that tag in the
current version, it will be there in v14 ;)

> Michal

Cheers,
Patrick

--
#include 

Patrick Bellasi



Re: [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases

2019-08-02 Thread Patrick Bellasi
Hi Peter, Vincent,
is there anything different I can do on this?

Cheers,
Patrick

On 28-Jun 15:00, Patrick Bellasi wrote:
> On 28-Jun 14:38, Peter Zijlstra wrote:
> > On Fri, Jun 28, 2019 at 11:08:14AM +0100, Patrick Bellasi wrote:
> > > On 26-Jun 13:40, Vincent Guittot wrote:
> > > > Hi Patrick,
> > > > 
> > > > On Thu, 20 Jun 2019 at 17:06, Patrick Bellasi  
> > > > wrote:
> > > > >
> > > > > The estimated utilization for a task is currently defined based on:
> > > > >  - enqueued: the utilization value at the end of the last activation
> > > > >  - ewma: an exponential moving average which samples are the 
> > > > > enqueued values
> > > > >
> > > > > According to this definition, when a task suddenly change it's 
> > > > > bandwidth
> > > > > requirements from small to big, the EWMA will need to collect multiple
> > > > > samples before converging up to track the new big utilization.
> > > > >
> > > > > Moreover, after the PELT scale invariance update [1], in the above 
> > > > > scenario we
> > > > > can see that the utilization of the task has a significant drop from 
> > > > > the first
> > > > > big activation to the following one. That's implied by the new 
> > > > > "time-scaling"
> > > > 
> > > > Could you give us more details about this? I'm not sure to understand
> > > > what changes between the 1st big activation and the following one ?
> > > 
> > > We are after a solution for the problem Douglas Raillard discussed at
> > > OSPM, specifically the "Task util drop after 1st idle" highlighted in
> > > slide 6 of his presentation:
> > > 
> > >   
> > > http://retis.sssup.it/ospm-summit/Downloads/02_05-Douglas_Raillard-How_can_we_make_schedutil_even_more_effective.pdf
> > > 
> > 
> > So I see the problem, and I don't hate the patch, but I'm still
> > struggling to understand how exactly it related to the time-scaling
> > stuff. Afaict the fundamental problem here is layering two averages. The
> > second (EWMA in our case) will always lag/delay the input of the first
> > (PELT).
> > 
> > The time-scaling thing might make matters worse, because that helps PELT
> > ramp up faster, but that is not the primary issue.
> 
> Sure, we like the new time-scaling PELT which ramps up faster and, as
> long as we have idle time, it's better in predicting what would be the
> utilization as if we was running at max OPP.
> 
> However, the experiment above shows that:
> 
>  - despite the task being a 75% after a certain activation, it takes
>multiple activations for PELT to actually enter that range.
> 
>  - the first activation ends at 665, 10% short wrt the configured
>utilization
> 
>  - while the PELT signal converge toward the 75%, we have some pretty
>consistent drops at wakeup time, especially after the first big
>activation.
> 
> > Or am I missing something?
> 
> I'm not sure the above happens because of a problem in the new
> time-scaling PELT, I actually think it's kind of expected given the
> way we re-scale time contributions depending on the current OPPs.
> 
> It's just that a 375 drops in utilization with just 1.1ms sleep time
> looks to me more related to the time-scaling invariance then just the
> normal/expected PELT decay.
> 
>Could it be an out-of-sync issue between the PELT time scaling code
>and capacity scaling code?
>Perhaps due to some OPP changes/notification going wrong?
> 
> Sorry for not being much more useful on that, maybe Vincent has some
> better ideas.
> 
> The only thing I've kind of convinced myself is that an EWMA on
> util_est does not make a lot of sense for increasing utilization
> tracking.
> 
> Best,
> Patrick
> 
> -- 
> #include 
> 
> Patrick Bellasi

-- 
#include 

Patrick Bellasi
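
To put numbers on Peter's point above about layering two averages, here is a
toy user-space sketch of an EWMA with weight 1/4 (roughly the weight util_est
uses) fed with a step from 100 to 750; the values are illustrative only and
not taken from the kernel:

  #include <stdio.h>

  int main(void)
  {
          unsigned int ewma = 100;        /* estimate before the step */
          unsigned int enqueued = 750;    /* new per-activation utilization */

          for (int i = 1; i <= 8; i++) {
                  /* ewma += (enqueued - ewma) / 4, implemented with shifts */
                  ewma = ewma - (ewma >> 2) + (enqueued >> 2);
                  printf("activation %d: ewma = %u\n", i, ewma);
          }
          return 0;
  }

Even with the faster time-scaled PELT feeding it, the estimate needs several
activations to approach the new 750 level, which is the lag being discussed.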


[PATCH v13 3/6] sched/core: uclamp: Propagate system defaults to root group

2019-08-02 Thread Patrick Bellasi
The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

 - the root group represents "system resources" which are always
   entirely available from the cgroup standpoint.

 - when tuning/restricting "system resources" makes sense, tuning must
   be done using a system wide API which should also be available when
   control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping supports already the concepts of:

 - system defaults: which define the maximum possible clamp values
   usable by tasks.

 - effective clamps: which allows a parent cgroup to constraint (maybe
   temporarily) its descendants without losing the information related
   to the values "requested" from them.

Exploit these two concepts and bind them together in such a way that,
whenever system default are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

When cgroups are in use, force an update of all the RUNNABLE tasks.
Otherwise, keep things simple and do just a lazy update next time each
task will be enqueued.
Do that since we assume a more strict resource control is required when
cgroups are in use. This allows also to keep "effective" clamp values
updated in case we need to expose them to user-space.

Signed-off-by: Patrick Bellasi 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v13:
 Message-ID: <20190725114126.ga4...@blackbody.suse.cz>
 - comment rewording: update all RUNNABLE tasks on system default
   changes only when cgroups are in use.
---
 kernel/sched/core.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index de8886ce0f65..5683c8639b4a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,10 +1017,30 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css);
+static void uclamp_update_root_tg(void)
+{
+   struct task_group *tg = &root_task_group;
+
+   uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
+ sysctl_sched_uclamp_util_min, false);
+   uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
+ sysctl_sched_uclamp_util_max, false);
+
+   rcu_read_lock();
+   cpu_util_update_eff(&root_task_group.css);
+   rcu_read_unlock();
+}
+#else
+static void uclamp_update_root_tg(void) { }
+#endif
+
 int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
 {
+   bool update_root_tg = false;
int old_min, old_max;
int result;
 
@@ -1043,16 +1063,23 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
*table, int write,
if (old_min != sysctl_sched_uclamp_util_min) {
	uclamp_se_set(&uclamp_default[UCLAMP_MIN],
  sysctl_sched_uclamp_util_min, false);
+   update_root_tg = true;
}
if (old_max != sysctl_sched_uclamp_util_max) {
	uclamp_se_set(&uclamp_default[UCLAMP_MAX],
  sysctl_sched_uclamp_util_max, false);
+   update_root_tg = true;
}
 
+   if (update_root_tg)
+   uclamp_update_root_tg();
+
/*
-* Updating all the RUNNABLE task is expensive, keep it simple and do
-* just a lazy update at each next enqueue time.
+* We update all RUNNABLE tasks only when task groups are in use.
+* Otherwise, keep it simple and do just a lazy update at each next
+* task enqueue time.
 */
+
goto done;
 
 undo:
-- 
2.22.0



[PATCH v13 0/6] Add utilization clamping support (CGroups API)

2019-08-02 Thread Patrick Bellasi
thermal headroom for more important tasks

This series does not present this additional usage of utilization clamping but
it's an integral part of the EAS feature set, where [2] is one of its main
components.

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==

[1] Energy Aware Scheduling

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1

[2] Expressing per-task/per-cgroup performance hints
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/


Patrick Bellasi (6):
  sched/core: uclamp: Extend CPU's cgroup controller
  sched/core: uclamp: Propagate parent clamps
  sched/core: uclamp: Propagate system defaults to root group
  sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  sched/core: uclamp: Update CPU's refcount on TG's clamp changes
  sched/core: uclamp: always use enum uclamp_id for clamp_id values

 Documentation/admin-guide/cgroup-v2.rst |  34 +++
 init/Kconfig|  22 ++
 kernel/sched/core.c | 375 ++--
 kernel/sched/sched.h|  12 +-
 4 files changed, 421 insertions(+), 22 deletions(-)

-- 
2.22.0



[PATCH v13 6/6] sched/core: uclamp: always use enum uclamp_id for clamp_id values

2019-08-02 Thread Patrick Bellasi
The supported clamp indexes are defined in enum clamp_id; however, because
of the code logic in some of the first versions of the utilization clamping
series, we sometimes needed to use unsigned int to represent indexes.

This is no longer required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.

Fix it with a bulk rename now that we have all the bits merged.

Signed-off-by: Patrick Bellasi 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
---
 kernel/sched/core.c  | 38 +++---
 kernel/sched/sched.h |  2 +-
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8cc1198e7199..b6241b6ac133 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -810,7 +810,7 @@ static inline unsigned int 
uclamp_bucket_base_value(unsigned int clamp_value)
return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
 }
 
-static inline unsigned int uclamp_none(int clamp_id)
+static inline enum uclamp_id uclamp_none(enum uclamp_id clamp_id)
 {
if (clamp_id == UCLAMP_MIN)
return 0;
@@ -826,7 +826,7 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
 }
 
 static inline unsigned int
-uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
+uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
  unsigned int clamp_value)
 {
/*
@@ -842,7 +842,7 @@ uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
return uclamp_none(UCLAMP_MIN);
 }
 
-static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
 unsigned int clamp_value)
 {
/* Reset max-clamp retention only on idle exit */
@@ -853,8 +853,8 @@ static inline void uclamp_idle_reset(struct rq *rq, 
unsigned int clamp_id,
 }
 
 static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
-unsigned int clamp_value)
+enum uclamp_id uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
+  unsigned int clamp_value)
 {
struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
int bucket_id = UCLAMP_BUCKETS - 1;
@@ -874,7 +874,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned 
int clamp_id,
 }
 
 static inline struct uclamp_se
-uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_req = p->uclamp_req[clamp_id];
 #ifdef CONFIG_UCLAMP_TASK_GROUP
@@ -906,7 +906,7 @@ uclamp_tg_restrict(struct task_struct *p, unsigned int 
clamp_id)
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
-uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
+uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
@@ -918,7 +918,7 @@ uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
return uc_req;
 }
 
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_eff;
 
@@ -942,7 +942,7 @@ unsigned int uclamp_eff_value(struct task_struct *p, 
unsigned int clamp_id)
  * for each bucket when all its RUNNABLE tasks require the same clamp.
  */
 static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
-   unsigned int clamp_id)
+   enum uclamp_id clamp_id)
 {
	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -980,7 +980,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct 
task_struct *p,
  * enforce the expected state and warn.
  */
 static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
-   unsigned int clamp_id)
+   enum uclamp_id clamp_id)
 {
	struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
	struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -1019,7 +1019,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct 
task_struct *p,
 
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
-   unsigned int clamp_id;
+   enum uclamp_id clamp_id;
 
if (unlikely(!p->sched_class->uclamp_enabled))
return;
@@ -1034,7 +1034,7 @@ static inline void uclamp_rq_inc(struct rq *rq, struct 
task_struct *p)
 
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
-   unsigned int clamp_id;
+   enum uclamp_id clamp_id;
 
if (unlikely(!p->sched_class->uclamp_en

[PATCH v13 4/6] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

2019-08-02 Thread Patrick Bellasi
When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}qeued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows to:

1. ensure cgroup clamps are always used to restrict task specific requests,
   i.e. boosted not more than its TG effective protection and capped at
   least as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
   less than what enforced by their TG effective limits and protections

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group. While, for tasks in the root group or in an autogroup, system
defaults are still enforced.

Signed-off-by: Patrick Bellasi 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5683c8639b4a..106cf69d70cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -873,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned 
int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+   struct uclamp_se uc_max;
+
+   /*
+* Tasks in autogroups or root task group will be
+* restricted by system defaults.
+*/
+   if (task_group_is_autogroup(task_group(p)))
+   return uc_req;
+   if (task_group(p) == &root_task_group)
+   return uc_req;
+
+   uc_max = task_group(p)->uclamp[clamp_id];
+   if (uc_req.value > uc_max.value || !uc_req.user_defined)
+   return uc_max;
+#endif
+
+   return uc_req;
+}
+
 /*
  * The effective clamp bucket index of a task depends on, by increasing
  * priority:
  * - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
 {
-   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+   struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
 
/* System default restrictions always apply */
-- 
2.22.0



[PATCH v13 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-08-02 Thread Patrick Bellasi
The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run at least at a
 minimum frequency which corresponds to the uclamp.min
 utilization

- uclamp.max: defines the maximum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run up to a
 maximum frequency which corresponds to the uclamp.max
 utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depend on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Validate local consistency by enforcing uclamp.min < uclamp.max.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Signed-off-by: Patrick Bellasi 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v13:
 Message-ID: <20190725114104.ga32...@blackbody.suse.cz>
 - move common code from cpu_uclamp_{min,max}_write() into 
cpu_uclamp_write(clamp_id)
---
 Documentation/admin-guide/cgroup-v2.rst |  34 +
 init/Kconfig|  22 
 kernel/sched/core.c | 167 +++-
 kernel/sched/sched.h|   8 ++
 4 files changed, 230 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005aa981..5f1c266131b0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit 
models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup.  Be aware that system management software may already
@@ -1016,6 +1023,33 @@ All time durations are in microseconds.
Shows pressure stall information for CPU. See
Documentation/accounting/psi.rst for details.
 
+  cpu.uclamp.min
+A read-write single value file whi

[PATCH v13 2/6] sched/core: uclamp: Propagate parent clamps

2019-08-02 Thread Patrick Bellasi
In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires properly propagating and aggregating
parent attributes down to their descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation time to ensure user-space writes
never fail while still always tracking the most restrictive values.
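
To make the propagation rule above concrete, here is a small stand-alone
sketch (illustration only, not kernel code) of the per-level computation,
using made-up values in the 0..1024 utilization scale:

#include <stdio.h>

static unsigned int umin(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	/* Parent's already computed effective clamps (assumed values) */
	unsigned int parent_eff_min = 200, parent_eff_max = 400;
	/* Child's requested clamps, as written by user-space */
	unsigned int child_req_min = 600, child_req_max = 300;

	/* Effective value: the smaller of the request and parent's effective */
	unsigned int eff_min = umin(child_req_min, parent_eff_min);
	unsigned int eff_max = umin(child_req_max, parent_eff_max);

	/* The protection (min) is always capped by the limit (max) */
	eff_min = umin(eff_min, eff_max);

	/* Prints: eff_min=200 eff_max=300 */
	printf("eff_min=%u eff_max=%u\n", eff_min, eff_max);
	return 0;
}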

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
related updates.

Signed-off-by: Patrick Bellasi 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c  | 65 ++--
 kernel/sched/sched.h |  2 ++
 2 files changed, 64 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 40cd7567e4d9..de8886ce0f65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool 
update_load)
 }
 
 #ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
 /* Max allowed minimum utilization */
 unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 
@@ -1010,10 +1022,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, 
int write,
loff_t *ppos)
 {
int old_min, old_max;
-   static DEFINE_MUTEX(mutex);
int result;
 
-   mutex_lock(&mutex);
+   mutex_lock(&uclamp_mutex);
old_min = sysctl_sched_uclamp_util_min;
old_max = sysctl_sched_uclamp_util_max;
 
@@ -1048,7 +1059,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, 
int write,
sysctl_sched_uclamp_util_min = old_min;
sysctl_sched_uclamp_util_max = old_max;
 done:
-   mutex_unlock(&mutex);
+   mutex_unlock(&uclamp_mutex);
 
return result;
 }
@@ -1137,6 +1148,8 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;
 
+   mutex_init(&uclamp_mutex);
+
for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
cpu_rq(cpu)->uclamp_flags = 0;
@@ -1153,6 +1166,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+   root_task_group.uclamp[clamp_id] = uc_max;
 #endif
}
 }
@@ -6799,6 +6813,7 @@ static inline void alloc_uclamp_sched_group(struct 
task_group *tg,
for_each_clamp_id(clamp_id) {
uclamp_se_set(&tg->uclamp_req[clamp_id],
  uclamp_none(clamp_id), false);
+   tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
 #endif
 }
@@ -7045,6 +7060,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset 
*tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+   struct cgroup_subsys_state *top_css = css;
+   struct uclamp_se *uc_parent = NULL;
+   struct uclamp_se *uc_se = NULL;
+   unsigned int eff[UCLAMP_CNT];
+   unsigned int clamp_id;
+   unsigned int clamps;
+
+   css_for_each_descendant_pre(css, top_css) {
+   uc_parent = css_tg(css)->parent
+   ? css_tg(css)->parent->uclamp : NULL;
+
+   for_each_clamp_id(clamp_id) {
+   /* Assume effective clamps matches requested clamps */
+   eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+   /* Cap effective clamps with parent's effective clamps 
*/
+   if (uc_parent &&
+   eff[clamp_id] > uc_parent[clamp_id].value) {
+   eff[clamp_id] = uc_parent[clamp_id].value;
+   }
+   

[PATCH v13 5/6] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

2019-08-02 Thread Patrick Bellasi
On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi 
Acked-by: Tejun Heo 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c | 58 -
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 106cf69d70cc..8cc1198e7199 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+   struct rq_flags rf;
+   struct rq *rq;
+
+   /*
+* Lock the task and the rq where the task is (or was) queued.
+*
+* We might lock the (previous) rq of a !RUNNABLE task, but that's the
+* price to pay to safely serialize util_{min,max} updates with
+* enqueues, dequeues and migration operations.
+* This is the same locking schema used by __set_cpus_allowed_ptr().
+*/
rq = task_rq_lock(p, &rf);
+
+   /*
+* Setting the clamp bucket is serialized by task_rq_lock().
+* If the task is not yet RUNNABLE and its task_struct is not
+* affecting a valid clamp bucket, the next time it's enqueued,
+* it will already see the updated clamp bucket value.
+*/
+   if (!p->uclamp[clamp_id].active)
+   goto done;
+
+   uclamp_rq_dec_id(rq, p, clamp_id);
+   uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css,
+  unsigned int clamps)
+{
+   struct css_task_iter it;
+   struct task_struct *p;
+   unsigned int clamp_id;
+
css_task_iter_start(css, 0, &it);
while ((p = css_task_iter_next(&it))) {
+   for_each_clamp_id(clamp_id) {
+   if ((0x1 << clamp_id) & clamps)
+   uclamp_update_active(p, clamp_id);
+   }
+   }
css_task_iter_end(&it);
+}
+
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
@@ -7148,8 +7199,13 @@ static void cpu_util_update_eff(struct 
cgroup_subsys_state *css)
uc_se[clamp_id].bucket_id = 
uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
-   if (!clamps)
+   if (!clamps) {
css = css_rightmost_descendant(css);
+   continue;
+   }
+
+   /* Immediately update descendants RUNNABLE tasks */
+   uclamp_update_active_tasks(css, clamps);
}
 }
 
-- 
2.22.0



Re: [PATCH v12 0/6] Add utilization clamping support (CGroups API)

2019-08-01 Thread Patrick Bellasi
On 29-Jul 13:06, Tejun Heo wrote:
> Hello,

Hi Tejun,

> Looks good to me.  On cgroup side,
> 
> Acked-by: Tejun Heo 

Thanks!

Happy we converged toward something working for you.

I'll add the two small changes suggested by Michal and respin a v13
with your ACK on the series.

Cheers Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v12 3/6] sched/core: uclamp: Propagate system defaults to root group

2019-08-01 Thread Patrick Bellasi
On 25-Jul 13:41, Michal Koutný wrote:
> On Thu, Jul 18, 2019 at 07:17:45PM +0100, Patrick Bellasi 
>  wrote:
> > The clamp values are not tunable at the level of the root task group.
> > That's for two main reasons:
> > 
> >  - the root group represents "system resources" which are always
> >entirely available from the cgroup standpoint.
> > 
> >  - when tuning/restricting "system resources" makes sense, tuning must
> >be done using a system wide API which should also be available when
> >control groups are not.
> > 
> > When a system wide restriction is available, cgroups should be aware of
> > its value in order to know exactly how much "system resources" are
> > available for the subgroups.
> IIUC, the global default would apply in uclamp_eff_get(), so this
> propagation isn't strictly necessary in order to apply to tasks (that's
> how it works under !CONFIG_UCLAMP_TASK_GROUP).

That's right.

> The reason is that effective value (which isn't exposed currently) in a
> group takes into account this global restriction, right?

Yep, well admittedly in this area things changed in a slightly confusing way.

Up to v10:
 - effective values were exposed to userspace
 - system defaults were enforced only at enqueue time

Now instead:
 - effective values are not exposed anymore (because of Tejun's request)
 - system defaults are applied to the root group and propagated down
   the hierarchy to all effective values

Both solutions are functionally correct but, in the first case, the
cgroup's effective values were not really reflecting what a task would
get while, in the current solution, we force update all effective
values while not exposing them anymore.

However, I think this solution is better in keeping information more
consistent and should create less confusion if in the future we decide
to expose effective values to user-space.

Thoughts?

> > @@ -1043,12 +1063,17 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> > *table, int write,
> > [...]
> > +   if (update_root_tg)
> > +   uclamp_update_root_tg();
> > +
> > /*
> >  * Updating all the RUNNABLE task is expensive, keep it simple and do
> >  * just a lazy update at each next enqueue time.
> Since uclamp_update_root_tg() traverses down to
> uclamp_update_active_tasks() is this comment half true now?

Right, this comment is now wrong. We update all RUNNABLE tasks on
system default changes. However, despite the above comment, it's
difficult to say how expensive that operation can be.

It really depends on how many RUNNABLE tasks we have, the number of
CPUs and also how many tasks are not already clamped by a more
restrictive "effective" value. Thus, for the time being, we can
consider the above statement speculation and add a simple change if,
in the future, this is reported as a real issue justifying a
lazy update.

The upside is that with the current implementation we have a
stricter control on tasks. Even long running tasks can be clamped on
sysadmin demand without waiting for them to sleep.

Does that make sense?

If it does, I'll drop the above comment in v13.

Cheers Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v12 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-08-01 Thread Patrick Bellasi
On 25-Jul 13:41, Michal Koutný wrote:
> On Thu, Jul 18, 2019 at 07:17:43PM +0100, Patrick Bellasi 
>  wrote:
> > +static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
> > +   char *buf, size_t nbytes,
> > +   loff_t off)
> > +{
> > [...]
> > +static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
> > +   char *buf, size_t nbytes,
> > +   loff_t off)
> > +{
> > [...]
> These two functions are almost identical yet not trivial. I think it
> wouldn be better to have the code at one place only and distinguish by
> the passed clamp_id.

Good point, since the removal of the boundary checks on values we now
have two identical methods. I'll factor out the common code into a
single function.

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


[PATCH v12 1/6] sched/core: uclamp: Extend CPU's cgroup controller

2019-07-18 Thread Patrick Bellasi
The cgroup CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However, this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
which allow enforcing utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered,
  i.e. the RUNNABLE tasks of this group will run at least at a minimum
  frequency which corresponds to the uclamp.min utilization

- uclamp.max: defines the maximum utilization which should be considered,
  i.e. the RUNNABLE tasks of this group will run up to a maximum
  frequency which corresponds to the uclamp.max utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depend on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus making it possible to control and restrict
   task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Validate local consistency by enforcing uclamp.min < uclamp.max.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v12:
 Message-ID: <20190715133801.yohhd2hywzsv3uyf@e110439-lin>
 - track requested cgroup's percentage to mask conversion rounding to userspace
 - introduce UCLAMP_PERCENT_{SHIFT,SCALE} to avoid hardcoded constants
 - s/uclamp_scale_from_percent()/capacity_from_percent()/
 - move range check from cpu_uclamp_{min,max}_write() to capacity_from_percent()
 Message-ID: <20190718152327.vmnds3kpagh2xz2r@e110439-lin>
 - fix percentage's decimals format string
---
 Documentation/admin-guide/cgroup-v2.rst |  34 +
 init/Kconfig|  22 +++
 kernel/sched/core.c | 175 +++-
 kernel/sched/sched.h|   8 ++
 4 files changed, 238 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst 
b/Documentation/admin-guide/cgroup-v2.rst
index 3b29005aa981..5f1c266131b0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit 
models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup.  Be aware 

[PATCH v12 0/6] Add utilization clamping support (CGroups API)

2019-07-18 Thread Patrick Bellasi
pecific property the scheduler uses to know
how much CPU bandwidth a task requires, at least as long as there is idle time.
Thus, the utilization clamp values, defined either per-task or per-task_group,
can represent tasks to the scheduler as being bigger (or smaller) than what
they actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

 - boosting: try to run small/foreground tasks on higher-capacity CPUs to
   complete them faster despite being less energy efficient.

 - capping: try to run big/background tasks on low-capacity CPUs to save power
   and thermal headroom for more important tasks

This series does not present this additional usage of utilization clamping but
it's an integral part of the EAS feature set, where [2] is one of its main
components.

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==

[1] Energy Aware Scheduling

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1

[2] Expressing per-task/per-cgroup performance hints
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/


Patrick Bellasi (6):
  sched/core: uclamp: Extend CPU's cgroup controller
  sched/core: uclamp: Propagate parent clamps
  sched/core: uclamp: Propagate system defaults to root group
  sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  sched/core: uclamp: Update CPU's refcount on TG's clamp changes
  sched/core: uclamp: always use enum uclamp_id for clamp_id values

 Documentation/admin-guide/cgroup-v2.rst |  34 +++
 init/Kconfig|  22 ++
 kernel/sched/core.c | 382 ++--
 kernel/sched/sched.h|  12 +-
 4 files changed, 430 insertions(+), 20 deletions(-)

-- 
2.22.0



[PATCH v12 4/6] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

2019-07-18 Thread Patrick Bellasi
When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}queued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
makes it possible to:

1. ensure cgroup clamps are always used to restrict task specific requests,
   i.e. a task is boosted not more than its TG effective protection and
   capped at least as much as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
   less than what is enforced by their TG effective limits and protections

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group, while for tasks in the root group or in an autogroup, system
defaults are still enforced.
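
For illustration, a minimal stand-alone sketch (not kernel code) of the
restriction described above; values are in the 0..1024 utilization scale and
the group limit of 614 (~60%) is just an assumed example:

#include <stdio.h>
#include <stdbool.h>

/*
 * A user-defined task clamp is used as long as it does not exceed the task
 * group's effective clamp, otherwise the group value wins; tasks without a
 * user-defined request simply inherit the group's effective value.
 */
static unsigned int tg_restrict(unsigned int task_req, bool user_defined,
				unsigned int tg_eff)
{
	if (!user_defined || task_req > tg_eff)
		return tg_eff;
	return task_req;
}

int main(void)
{
	unsigned int tg_eff_max = 614;

	/* A task asking for less than the group limit keeps its request */
	printf("%u\n", tg_restrict(512, true, tg_eff_max));	/* 512 */
	/* A task asking for more is capped at the group limit */
	printf("%u\n", tg_restrict(800, true, tg_eff_max));	/* 614 */
	/* A task with no user-defined request inherits the group value */
	printf("%u\n", tg_restrict(1024, false, tg_eff_max));	/* 614 */
	return 0;
}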

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v12:
 Message-ID: <20190716143435.iwwd6fjr3udlqol4@e110439-lin>
 - remove not required and confusing sentence from the above changelog
---
 kernel/sched/core.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e9231b089d5c..426736b2c4d7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -873,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned 
int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+   struct uclamp_se uc_max;
+
+   /*
+* Tasks in autogroups or root task group will be
+* restricted by system defaults.
+*/
+   if (task_group_is_autogroup(task_group(p)))
+   return uc_req;
if (task_group(p) == &root_task_group)
+   return uc_req;
+
+   uc_max = task_group(p)->uclamp[clamp_id];
+   if (uc_req.value > uc_max.value || !uc_req.user_defined)
+   return uc_max;
+#endif
+
+   return uc_req;
+}
+
 /*
  * The effective clamp bucket index of a task depends on, by increasing
  * priority:
  * - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
 {
-   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+   struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
 
/* System default restrictions always apply */
-- 
2.22.0



[PATCH v12 3/6] sched/core: uclamp: Propagate system defaults to root group

2019-07-18 Thread Patrick Bellasi
The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

 - the root group represents "system resources" which are always
   entirely available from the cgroup standpoint.

 - when tuning/restricting "system resources" makes sense, tuning must
   be done using a system wide API which should also be available when
   control groups are not.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping already supports the concepts of:

 - system defaults: which define the maximum possible clamp values
   usable by tasks.

 - effective clamps: which allow a parent cgroup to constrain (maybe
   temporarily) its descendants without losing the information related
   to the values "requested" from them.

Exploit these two concepts and bind them together in such a way that,
whenever system defaults are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v12:
 Message-ID: <20190716143417.us3xhksrsaxsl2ok@e110439-lin>
 - add missing RCU read locks across cpu_util_update_eff() call from
   uclamp_update_root_tg()
---
 kernel/sched/core.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 08f5a0c205c6..e9231b089d5c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,10 +1017,30 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css);
+static void uclamp_update_root_tg(void)
+{
+   struct task_group *tg = &root_task_group;
+
+   uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
+ sysctl_sched_uclamp_util_min, false);
+   uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
+ sysctl_sched_uclamp_util_max, false);
+
+   rcu_read_lock();
+   cpu_util_update_eff(&root_task_group.css);
+   rcu_read_unlock();
+}
+#else
+static void uclamp_update_root_tg(void) { }
+#endif
+
 int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
 {
+   bool update_root_tg = false;
int old_min, old_max;
int result;
 
@@ -1043,12 +1063,17 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
*table, int write,
if (old_min != sysctl_sched_uclamp_util_min) {
uclamp_se_set(&uclamp_default[UCLAMP_MIN],
  sysctl_sched_uclamp_util_min, false);
+   update_root_tg = true;
}
if (old_max != sysctl_sched_uclamp_util_max) {
uclamp_se_set(&uclamp_default[UCLAMP_MAX],
  sysctl_sched_uclamp_util_max, false);
+   update_root_tg = true;
}
 
+   if (update_root_tg)
+   uclamp_update_root_tg();
+
/*
 * Updating all the RUNNABLE task is expensive, keep it simple and do
 * just a lazy update at each next enqueue time.
-- 
2.22.0



[PATCH v12 5/6] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

2019-07-18 Thread Patrick Bellasi
On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c | 58 -
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 426736b2c4d7..26ac1cbec0be 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, struct 
task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+   struct rq_flags rf;
+   struct rq *rq;
+
+   /*
+* Lock the task and the rq where the task is (or was) queued.
+*
+* We might lock the (previous) rq of a !RUNNABLE task, but that's the
+* price to pay to safely serialize util_{min,max} updates with
+* enqueues, dequeues and migration operations.
+* This is the same locking schema used by __set_cpus_allowed_ptr().
+*/
+   rq = task_rq_lock(p, &rf);
+
+   /*
+* Setting the clamp bucket is serialized by task_rq_lock().
+* If the task is not yet RUNNABLE and its task_struct is not
+* affecting a valid clamp bucket, the next time it's enqueued,
+* it will already see the updated clamp bucket value.
+*/
+   if (!p->uclamp[clamp_id].active)
+   goto done;
+
+   uclamp_rq_dec_id(rq, p, clamp_id);
+   uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
+   task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css,
+  unsigned int clamps)
+{
+   struct css_task_iter it;
+   struct task_struct *p;
+   unsigned int clamp_id;
+
+   css_task_iter_start(css, 0, &it);
+   while ((p = css_task_iter_next(&it))) {
+   for_each_clamp_id(clamp_id) {
+   if ((0x1 << clamp_id) & clamps)
+   uclamp_update_active(p, clamp_id);
+   }
+   }
+   css_task_iter_end(&it);
+}
+
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
@@ -7091,8 +7142,13 @@ static void cpu_util_update_eff(struct 
cgroup_subsys_state *css)
uc_se[clamp_id].bucket_id = 
uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
-   if (!clamps)
+   if (!clamps) {
css = css_rightmost_descendant(css);
+   continue;
+   }
+
+   /* Immediately update descendants RUNNABLE tasks */
+   uclamp_update_active_tasks(css, clamps);
}
 }
 
-- 
2.22.0



[PATCH v12 6/6] sched/core: uclamp: always use enum uclamp_id for clamp_id values

2019-07-18 Thread Patrick Bellasi
The supported clamp indexes are defined in enum uclamp_id; however, because
of the code logic in some of the first versions of the utilization clamping
series, sometimes we needed to use unsigned int to represent indexes.

This is no longer required since the final version of the uclamp_* APIs can
always use the proper enum uclamp_id type.

Fix it with a bulk rename now that we have all the bits merged.
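
For reference, the clamp index enum the rename converges on is the one
already merged with the per-task bits (sketch):

enum uclamp_id {
	UCLAMP_MIN = 0,		/* Minimum utilization */
	UCLAMP_MAX,		/* Maximum utilization */
	UCLAMP_CNT		/* Utilization clamp constraints count */
};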

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 

---
Changes in v12:
 Message-ID: <20190716140319.hdmgcuevnpwdqobl@e110439-lin>
 - added in this series
---
 kernel/sched/core.c  | 38 +++---
 kernel/sched/sched.h |  2 +-
 2 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26ac1cbec0be..1e6cf9499d49 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -810,7 +810,7 @@ static inline unsigned int 
uclamp_bucket_base_value(unsigned int clamp_value)
return UCLAMP_BUCKET_DELTA * uclamp_bucket_id(clamp_value);
 }
 
-static inline unsigned int uclamp_none(int clamp_id)
+static inline enum uclamp_id uclamp_none(enum uclamp_id clamp_id)
 {
if (clamp_id == UCLAMP_MIN)
return 0;
@@ -826,7 +826,7 @@ static inline void uclamp_se_set(struct uclamp_se *uc_se,
 }
 
 static inline unsigned int
-uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
+uclamp_idle_value(struct rq *rq, enum uclamp_id clamp_id,
  unsigned int clamp_value)
 {
/*
@@ -842,7 +842,7 @@ uclamp_idle_value(struct rq *rq, unsigned int clamp_id,
return uclamp_none(UCLAMP_MIN);
 }
 
-static inline void uclamp_idle_reset(struct rq *rq, unsigned int clamp_id,
+static inline void uclamp_idle_reset(struct rq *rq, enum uclamp_id clamp_id,
 unsigned int clamp_value)
 {
/* Reset max-clamp retention only on idle exit */
@@ -853,8 +853,8 @@ static inline void uclamp_idle_reset(struct rq *rq, 
unsigned int clamp_id,
 }
 
 static inline
-unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
-unsigned int clamp_value)
+enum uclamp_id uclamp_rq_max_value(struct rq *rq, enum uclamp_id clamp_id,
+  unsigned int clamp_value)
 {
struct uclamp_bucket *bucket = rq->uclamp[clamp_id].bucket;
int bucket_id = UCLAMP_BUCKETS - 1;
@@ -874,7 +874,7 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned 
int clamp_id,
 }
 
 static inline struct uclamp_se
-uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_req = p->uclamp_req[clamp_id];
 #ifdef CONFIG_UCLAMP_TASK_GROUP
@@ -906,7 +906,7 @@ uclamp_tg_restrict(struct task_struct *p, unsigned int 
clamp_id)
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
-uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
+uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
@@ -918,7 +918,7 @@ uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
return uc_req;
 }
 
-unsigned int uclamp_eff_value(struct task_struct *p, unsigned int clamp_id)
+enum uclamp_id uclamp_eff_value(struct task_struct *p, enum uclamp_id clamp_id)
 {
struct uclamp_se uc_eff;
 
@@ -942,7 +942,7 @@ unsigned int uclamp_eff_value(struct task_struct *p, 
unsigned int clamp_id)
  * for each bucket when all its RUNNABLE tasks require the same clamp.
  */
 static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
-   unsigned int clamp_id)
+   enum uclamp_id clamp_id)
 {
struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -980,7 +980,7 @@ static inline void uclamp_rq_inc_id(struct rq *rq, struct 
task_struct *p,
  * enforce the expected state and warn.
  */
 static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
-   unsigned int clamp_id)
+   enum uclamp_id clamp_id)
 {
struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
struct uclamp_se *uc_se = &p->uclamp[clamp_id];
@@ -1019,7 +1019,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct 
task_struct *p,
 
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
-   unsigned int clamp_id;
+   enum uclamp_id clamp_id;
 
if (unlikely(!p->sched_class->uclamp_enabled))
return;
@@ -1034,7 +1034,7 @@ static inline void uclamp_rq_inc(struct rq *rq, struct 
task_struct *p)
 
 static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
 {
-   unsigned int clamp_id;
+

[PATCH v12 2/6] sched/core: uclamp: Propagate parent clamps

2019-07-18 Thread Patrick Bellasi
In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires to properly propagate and aggregate
parent attributes down to its descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization).

Do that at effective clamps propagation to ensure all user-space write
never fails while still always tracking the most restrictive values.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
relate updates.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v12:
 Message-ID: <20190716140706.vuggfigjlys44lkp@e110439-lin>
 - use a dedicated variable for parent restrictions
 - make more explicit in the documentation that the requested "protection" is
   always capped by the requested "limit"
 Message-ID: <20190716175542.p7vs2muslyuez6lq@e110439-lin>
 - use the newly added uclamp_mutex to serialize the sysfs write callback
---
 kernel/sched/core.c  | 70 ++--
 kernel/sched/sched.h |  2 ++
 2 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcc32afe53cb..08f5a0c205c6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool 
update_load)
 }
 
 #ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
 /* Max allowed minimum utilization */
 unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 
@@ -1010,10 +1022,9 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, 
int write,
loff_t *ppos)
 {
int old_min, old_max;
-   static DEFINE_MUTEX(mutex);
int result;
 
-   mutex_lock(&mutex);
+   mutex_lock(&uclamp_mutex);
old_min = sysctl_sched_uclamp_util_min;
old_max = sysctl_sched_uclamp_util_max;
 
@@ -1048,7 +1059,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, 
int write,
sysctl_sched_uclamp_util_min = old_min;
sysctl_sched_uclamp_util_max = old_max;
 done:
-   mutex_unlock(&mutex);
+   mutex_unlock(&uclamp_mutex);
 
return result;
 }
@@ -1137,6 +1148,8 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;
 
+   mutex_init(&uclamp_mutex);
+
for_each_possible_cpu(cpu) {
memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
cpu_rq(cpu)->uclamp_flags = 0;
@@ -1153,6 +1166,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+   root_task_group.uclamp[clamp_id] = uc_max;
 #endif
}
 }
@@ -6740,6 +6754,7 @@ static inline void alloc_uclamp_sched_group(struct 
task_group *tg,
for_each_clamp_id(clamp_id) {
uclamp_se_set(&tg->uclamp_req[clamp_id],
  uclamp_none(clamp_id), false);
+   tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
 #endif
 }
@@ -6990,6 +7005,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset 
*tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+   struct cgroup_subsys_state *top_css = css;
+   struct uclamp_se *uc_parent = NULL;
+   struct uclamp_se *uc_se = NULL;
+   unsigned int eff[UCLAMP_CNT];
+   unsigned int clamp_id;
+   unsigned int clamps;
+
+   css_for_each_descendant_pre(css, top_css) {
+   uc_parent = css_tg(css)->parent
+   ? css_tg(css)->parent->uclamp : NULL;
+
+   for_each_clamp_id(clamp_id) {
+   /* Assume effective clamps matches requested cla

Re: [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller

2019-07-18 Thread Patrick Bellasi
On 18-Jul 07:52, Tejun Heo wrote:
> Hello, Patrick.
> 
> On Mon, Jul 08, 2019 at 09:43:53AM +0100, Patrick Bellasi wrote:
> > +static inline void cpu_uclamp_print(struct seq_file *sf,
> > +   enum uclamp_id clamp_id)
> > +{
> > +   struct task_group *tg;
> > +   u64 util_clamp;
> > +   u64 percent;
> > +   u32 rem;
> > +
> > +   rcu_read_lock();
> > +   tg = css_tg(seq_css(sf));
> > +   util_clamp = tg->uclamp_req[clamp_id].value;
> > +   rcu_read_unlock();
> > +
> > +   if (util_clamp == SCHED_CAPACITY_SCALE) {
> > +   seq_puts(sf, "max\n");
> > +   return;
> > +   }
> > +
> > +   percent = uclamp_percent_from_scale(util_clamp);
> > +   percent = div_u64_rem(percent, 100, &rem);
> > +   seq_printf(sf, "%llu.%u\n", percent, rem);
> 
> "%llu.%02u" otherwise 20.01 gets printed as 20.1

Yup!... good point! :)

Since we already collected a lot of feedback, I've got a v12 ready for posting.
Maybe you'd better wait for that before going on with the review.

Thanks,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group

2019-07-16 Thread Patrick Bellasi
On 16-Jul 17:36, Michal Koutný wrote:
> On Tue, Jul 16, 2019 at 03:34:17PM +0100, Patrick Bellasi 
>  wrote:
> > > cpu_util_update_eff internally calls css_for_each_descendant_pre() so
> > > this should be protected with rcu_read_lock().
> > 
> > Right, good catch! Will add in v12.
> When I responded to your other patch, it occurred to me that since
> cpu_util_update_eff goes writing down to child csses, it should take
> also uclamp_mutex here to avoid race with direct cgroup attribute
> writers.

Yep, I should drop the "dedicated" mutex we have now in
sysctl_sched_uclamp_handler() and use the uclamp_mutex we already
have.

Thanks, Patrick


-- 
#include 

Patrick Bellasi


Re: [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps

2019-07-16 Thread Patrick Bellasi
On 16-Jul 17:29, Michal Koutný wrote:
> On Tue, Jul 16, 2019 at 03:07:06PM +0100, Patrick Bellasi 
>  wrote:
> > That note comes from the previous review cycle and it's based on a
> > request from Tejun to align uclamp behaviors with the way the
> > delegation model is supposed to work.
> I saw and hopefully understood that reasoning -- uclamp.min has the
> protection semantics and uclamp.max the limit semantics.
> 
> However, what took me some time to comprehend when the effected
> uclamp.min and uclamp.max cross over, i.e. that uclamp.min is then bound
> by uclamp.max (besides parent's uclamp.min). Your commit message
> explains that and I think it's relevant for the kernel docs file
> itself.

Right, I've just added a paragraph to the cpu.uclamp.min documentation.

> > You right, the synchronization is introduced by a later patch:
> > 
> >sched/core: uclamp: Update CPU's refcount on TG's clamp changes
> I saw that lock but didn't realize __setscheduler_uclamp() touches only
> task's struct uclamp_se, none of task_group's/css's (which is under
> uclamp_mutex). That seems correct.

Right, the mutex is used only on the cgroup side. That's because the
CGroup API can affect multiple tasks running on different CPUs, thus
we wanna make sure we don't come up with race conditions. In that
path we can also afford to go a bit slower.

In the fast path instead we rely on the rq-locks to ensure
serialization on RUNNABLE tasks clamp updates.

Coming from the __setscheduler_uclamp() side however we don't sync
RUNNABLE tasks immediately. We delay the update to next enqueue
opportunity.

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

2019-07-16 Thread Patrick Bellasi
On 15-Jul 18:42, Michal Koutný wrote:
> On Mon, Jul 08, 2019 at 09:43:56AM +0100, Patrick Bellasi 
>  wrote:
> > This mimics what already happens for a task's CPU affinity mask when the
> > task is also in a cpuset, i.e. cgroup attributes are always used to
> > restrict per-task attributes.
> If I am not mistaken when set_schedaffinity(2) call is made that results
> in an empty cpuset, the call fails with EINVAL [1].
> 
> If I track the code correctly, the values passed to sched_setattr(2) are
> checked against the trivial validity (umin <= umax) and later on, they
> are adjusted to match the effective clamping of the containing
> task_group. Is that correct?
> 
> If the user attempted to sched_setattr [a, b], and the effective uclamp
> was [c, d] such that [a, b] ∩ [c, d] = ∅, the set uclamp will be
> silently moved out of their intended range. Wouldn't it be better to
> return with EINVAL too when the intersection is empty (since the user
> supplied range won't be attained)?

You're right for the cpuset case, but I don't think we ever end up with
an "empty" set in the case of utilization clamping.

We limit clamps hierarchically in such a way that:

  clamp[clamp_id] = min(task::clamp[clamp_id],
tg::clamp[clamp_id],
system::clamp[clamp_id])

and we ensure, on top of the above that:

  clamp[UCLAMP_MIN] = min(clamp[UCLAMP_MIN], clamp[UCLAMP_MAX])

Since it's all and only about "capping" values, at the very extreme
case you can end up with:

  clamp[UCLAMP_MIN] = clamp[UCLAMP_MAX] = 0

but that's still a valid configuration.

Am I missing something?

Otherwise, I think the changelog sentence you quoted is just
misleading.  I'll remove it from v12 since it does not really clarify
anything more than the rest of the message.

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group

2019-07-16 Thread Patrick Bellasi
On 15-Jul 18:42, Michal Koutný wrote:
> On Mon, Jul 08, 2019 at 09:43:55AM +0100, Patrick Bellasi 
>  wrote:
> > +static void uclamp_update_root_tg(void)
> > +{
> > +   struct task_group *tg = &root_task_group;
> > +
> > +   uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
> > + sysctl_sched_uclamp_util_min, false);
> > +   uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
> > + sysctl_sched_uclamp_util_max, false);
> > +
> > +   cpu_util_update_eff(&root_task_group.css);
> > +}
> cpu_util_update_eff internally calls css_for_each_descendant_pre() so
> this should be protected with rcu_read_lock().

Right, good catch! Will add in v12.

Cheers,
Patrick


-- 
#include 

Patrick Bellasi


Re: [PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps

2019-07-16 Thread Patrick Bellasi
Hi Michal,

On 15-Jul 18:42, Michal Koutný wrote:
> On Mon, Jul 08, 2019 at 09:43:54AM +0100, Patrick Bellasi 
>  wrote:
> > Since it's possible for a cpu.uclamp.min value to be bigger than the
> > cpu.uclamp.max value, ensure local consistency by restricting each
> > "protection"
> > (i.e. min utilization) with the corresponding "limit" (i.e. max
> > utilization).
> I think this constraint should be mentioned in the Documentation/

That note comes from the previous review cycle and it's based on a
request from Tejun to align uclamp behaviors with the way the
delegation model is supposed to work.

I guess this part of the documentation:
   
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html?highlight=protections#resource-distribution-models
should already cover the expected uclamp min/max behaviors.

However, I guess "repetita iuvant" in this case. I'll call this out
explicitly in the description of cpu.uclamp.min.

> > +static void cpu_util_update_eff(struct cgroup_subsys_state *css)
> > +{
> > +   struct cgroup_subsys_state *top_css = css;
> > +   struct uclamp_se *uc_se = NULL;
> > +   unsigned int eff[UCLAMP_CNT];
> > +   unsigned int clamp_id;
> > +   unsigned int clamps;
> > +
> > +   css_for_each_descendant_pre(css, top_css) {
> > +   uc_se = css_tg(css)->parent
> > +   ? css_tg(css)->parent->uclamp : NULL;
> > +
> > +   for_each_clamp_id(clamp_id) {
> > +   /* Assume effective clamps matches requested clamps */
> > +   eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
> > +   /* Cap effective clamps with parent's effective clamps 
> > */
> > +   if (uc_se &&
> > +   eff[clamp_id] > uc_se[clamp_id].value) {
> > +   eff[clamp_id] = uc_se[clamp_id].value;
> > +   }
> > +   }
> > +   /* Ensure protection is always capped by limit */
> > +   eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
> > +
> > +   /* Propagate most restrictive effective clamps */
> > +   clamps = 0x0;
> > +   uc_se = css_tg(css)->uclamp;
> (Nitpick only, reassigning child where was parent before decreases
> readibility. IMO)

I did not check, but I think the compiler will figure out it can still
use a single pointer for both assignments.
I'll let the compiler do its job and add a dedicated stack var
for the parent pointer.


> > +   for_each_clamp_id(clamp_id) {
> > +   if (eff[clamp_id] == uc_se[clamp_id].value)
> > +   continue;
> > +   uc_se[clamp_id].value = eff[clamp_id];
> > +   uc_se[clamp_id].bucket_id = 
> > uclamp_bucket_id(eff[clamp_id]);
> Shouldn't these writes be synchronized with writes from
> __setscheduler_uclamp()?

You're right, the synchronization is introduced by a later patch:

   sched/core: uclamp: Update CPU's refcount on TG's clamp changes

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v11 0/5] Add utilization clamping support (CGroups API)

2019-07-16 Thread Patrick Bellasi
On 15-Jul 18:51, Michal Koutný wrote:
> Hello Patrick.

Hi Michal,

> I took a look at your series and I've posted some notes to your patches.

thanks for your review!

> One applies more to the series overall -- I see there is enum uclamp_id
> defined but at many places (local variables, function args) int or
> unsigned int is used. Besides the inconsistency, I think it'd be nice to
> use the enum at these places.

Right, I think in some of the original versions I had a few code paths
where it was not possible to use enum values. That no longer seems to be the case.

Since this change is likely affecting also core bits already merged in
5.3, in v12 I'm going to add a bulk rename patch at the end of the
series, so that we can keep a better tracking of this change.

> (Also, I may suggest CCing ML cgro...@vger.kernel.org where more eyes
> may be available to the cgroup part of your series.)

Good point, I'll add that for the upcoming v12 posting.

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller

2019-07-15 Thread Patrick Bellasi
On 08-Jul 12:08, Quentin Perret wrote:
> Hi Patrick,

Hi Quentin!

> On Monday 08 Jul 2019 at 09:43:53 (+0100), Patrick Bellasi wrote:
> > +static inline int uclamp_scale_from_percent(char *buf, u64 *value)
> > +{
> > +   *value = SCHED_CAPACITY_SCALE;
> > +
> > +   buf = strim(buf);
> > +   if (strncmp("max", buf, 4)) {
> > +   s64 percent;
> > +   int ret;
> > +
> > +   ret = cgroup_parse_float(buf, 2, &percent);
> > +   if (ret)
> > +   return ret;
> > +
> > +   percent <<= SCHED_CAPACITY_SHIFT;
> > +   *value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
> > +   }
> > +
> > +   return 0;
> > +}
> > +
> > +static inline u64 uclamp_percent_from_scale(u64 value)
> > +{
> > +   return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
> > +}
> 
> FWIW, I tried the patches and realized these conversions result in a
> 'funny' behaviour from a user's perspective. Things like this happen:
> 
>$ echo 20 > cpu.uclamp.min
>$ cat cpu.uclamp.min
>20.2
>$ echo 20.2 > cpu.uclamp.min
>$ cat cpu.uclamp.min
>20.21
> 
> Having looked at the code, I get why this is happening, but I'm not sure
> if a random user will. It's not an issue per se, but it's just a bit
> weird.

Yes, that's what we get if we need to use a "two decimal digit
precision percentage" to represent a 1024 range in kernel space.

I don't think the "percent <=> utilization" conversion code can be
made more robust. The only possible alternative I see to get back
exactly what we write in, is to store the actual request in kernel
space, alongside its conversion to the SCHED_CAPACITY_SCALE required by the
actual scheduler code.
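
For completeness, the "20" -> "20.2" round trip reported above can be
reproduced with a small stand-alone sketch of the same arithmetic
(illustration only, using SCHED_CAPACITY_SCALE = 1024 and two decimal
digits, i.e. percentages scaled by 100):

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SCALE	1024ULL

static uint64_t div_round_closest(uint64_t x, uint64_t d)
{
	return (x + d / 2) / d;
}

int main(void)
{
	uint64_t percent = 2000;	/* "20" parsed with 2 decimal digits */
	uint64_t scale, back;

	scale = div_round_closest(percent * SCHED_CAPACITY_SCALE, 10000);
	back  = div_round_closest(scale * 10000, SCHED_CAPACITY_SCALE);

	/* Prints: scale=205 back=20.2 (i.e. 20.02 once padded with %02) */
	printf("scale=%llu back=%llu.%llu\n",
	       (unsigned long long)scale,
	       (unsigned long long)(back / 100),
	       (unsigned long long)(back % 100));
	return 0;
}
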

Something along these lines (on top of what we have in this series):

---8<---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ddc5fcd4b9cf..82b28cfa5c3f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7148,40 +7148,35 @@ static void cpu_util_update_eff(struct 
cgroup_subsys_state *css)
}
 }

-static inline int uclamp_scale_from_percent(char *buf, u64 *value)
+static inline int uclamp_scale_from_percent(char *buf, s64 *percent, u64 
*scale)
 {
-   *value = SCHED_CAPACITY_SCALE;
+   *scale = SCHED_CAPACITY_SCALE;

buf = strim(buf);
if (strncmp("max", buf, 4)) {
-   s64 percent;
int ret;

-   ret = cgroup_parse_float(buf, 2, &percent);
+   ret = cgroup_parse_float(buf, 2, percent);
if (ret)
return ret;

-   percent <<= SCHED_CAPACITY_SHIFT;
-   *value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
+   *scale = *percent << SCHED_CAPACITY_SHIFT;
+   *scale = DIV_ROUND_CLOSEST_ULL(*scale, 10000);
}

return 0;
 }

-static inline u64 uclamp_percent_from_scale(u64 value)
-{
-   return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
-}
-
 static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
 {
struct task_group *tg;
u64 min_value;
+   s64 percent;
int ret;

-   ret = uclamp_scale_from_percent(buf, &min_value);
+   ret = uclamp_scale_from_percent(buf, &percent, &min_value);
if (ret)
return ret;
if (min_value > SCHED_CAPACITY_SCALE)
@@ -7197,6 +7192,9 @@ static ssize_t cpu_uclamp_min_write(struct 
kernfs_open_file *of,
/* Update effective clamps to track the most restrictive value */
cpu_util_update_eff(of_css(of));

+   /* Keep track of the actual requested value */
+   tg->uclamp_pct[UCLAMP_MIN] = percent;
+
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);

@@ -7209,9 +7207,10 @@ static ssize_t cpu_uclamp_max_write(struct 
kernfs_open_file *of,
 {
struct task_group *tg;
u64 max_value;
+   s64 percent;
int ret;

-   ret = uclamp_scale_from_percent(buf, &max_value);
+   ret = uclamp_scale_from_percent(buf, &percent, &max_value);
if (ret)
return ret;
if (max_value > SCHED_CAPACITY_SCALE)
@@ -7227,6 +7226,9 @@ static ssize_t cpu_uclamp_max_write(struct 
kernfs_open_file *of,
/* Update effective clamps to track the most restrictive value */
cpu_util_update_eff(of_css(of));

+   /* Keep track of the actual requested value */
+   tg->uclamp_pct[UCLAMP_MAX] = percent;
+
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);

@@ -7251,7 +7253,7 @@ static inline void cpu_uclamp_print(struct seq_file *sf,
return;
}

-   percent = uclamp_percent_from_scale(util_clamp);
+   pe

Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware

2019-07-09 Thread Patrick Bellasi
On 08-Jul 14:46, Douglas Raillard wrote:
> Hi Patrick,
> 
> On 7/8/19 12:09 PM, Patrick Bellasi wrote:
> > On 03-Jul 17:36, Douglas Raillard wrote:
> > > On 7/2/19 4:51 PM, Peter Zijlstra wrote:
> > > > On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:

[...]

> > You are also correct in pointing out that in the steady state
> > ramp_boost will not be triggered in that steady state.
> > 
> > IMU, that's for two main reasons:
> >   a) it's very likely that enqueued <= util_avg
> >   b) even in case enqueued should turn out to be _slightly_ bigger then
> >  util_avg, the corresponding (proportional) ramp_boost would be so
> >  tiny to not have any noticeable effect on OPP selection.
> > 
> > Am I correct on point b) above?
> 
> Assuming you meant "util_avg slightly bigger than enqueued" (which is when 
> boosting triggers),
> then yes since ramp_boost effect is proportional to "task_ue.enqueue - 
> task_u". It makes it robust
> against that.

Right :)

> > Could you maybe come up with some experimental numbers related to that
> > case specifically?
> 
> With:
> * an rt-app task ramping up from 5% to 75% util in one big step. The
> whole cycle is 0.6s long (0.3s at 5% followed by 0.3s at 75%). This
> cycle is repeated 20 times and the average of boosting is taken.
> 
> * a hikey 960 (this impact the frequency at which the test runs at
> the beginning of 75% phase, which impacts the number of missed
> activations before the util ramped up).
> 
> * assuming an OPP exists for each util value (i.e. 1024 OPPs, so the
> effect of boost on consumption is not impacted by OPP capacities
> granularity)
> 
> Then the boosting feature would increase the average power
> consumption by 3.1%, out of which 0.12% can be considered "spurious
> boosting" due to the util taking some time to really converge to its
> steady state value.
>
> In practice, the impact of small boosts will be even lower since
> they will less likely trigger the selection of a high OPP due to OPP
> capacity granularity > 1 util unit.

That's ok for the energy side: you estimate a ~3% worst case more
energy on that specific target.

By boosting I expect the negative slack to improve.
Do you also have numbers/stats related to the negative slack?
Can you share a percentage figure for that improvement?

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware

2019-07-08 Thread Patrick Bellasi
On 03-Jul 14:38, Douglas Raillard wrote:
> Hi Peter,
> 
> On 7/2/19 4:44 PM, Peter Zijlstra wrote:
> > On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:
> > > Make schedutil cpufreq governor energy-aware.
> > > 
> > > - patch 1 introduces a function to retrieve a frequency given a base
> > >frequency and an energy cost margin.
> > > - patch 2 links Energy Model perf_domain to sugov_policy.
> > > - patch 3 updates get_next_freq() to make use of the Energy Model.
> > 
> > > 
> > > 1) Selecting the highest possible frequency for a given cost. Some
> > > platforms can have lower frequencies that are less efficient than
> > > higher ones, in which case they should be skipped for most purposes.
> > > They can still be useful to give more freedom to thermal throttling
> > > mechanisms, but not under normal circumstances.
> > > note: the EM framework will warn about such OPPs "hertz/watts ratio
> > > non-monotonically decreasing"
> > 
> > Humm, for some reason I was thinking we explicitly skipped those OPPs
> > and they already weren't used.
> > 
> > This isn't in fact so, and these first few patches make it so?
> 
> That's correct, the cost information about each OPP has been introduced 
> recently in mainline
> by the energy model series. Without that info, the only way to skip them that 
> comes to my
> mind is to set a policy min frequency, since these inefficient OPPs are 
> usually located
> at the lower end.

Perhaps it's also worth pointing out that the alternative approach you
mention above is a system wide solution.

The ramp_boost thingy you propose, instead, is a more fine grained
mechanism which could be extended in the future to have a per-task
side. IOW, it could contribute to better user-space hints, for
example to ramp_boost certain tasks and not others.

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH v2 0/5] sched/cpufreq: Make schedutil energy aware

2019-07-08 Thread Patrick Bellasi
On 03-Jul 17:36, Douglas Raillard wrote:
> On 7/2/19 4:51 PM, Peter Zijlstra wrote:
> > On Thu, Jun 27, 2019 at 06:15:58PM +0100, Douglas RAILLARD wrote:

[...]

> > I'm not immediately seeing how it is transient; that is, PELT has a
> > wobble in its steady state, is that accounted for?
> > 
> 
> The transient-ness of the ramp boost I'm introducing comes from the fact that 
> for a
> periodic task at steady state, task_ue.enqueued <= task_u when the task is 
> executing.
^^^

I find your above "at steady state" a bit confusing.

The condition "task_ue.enqueue <= task_u" is true only for the task's
first big activation after a series of small activations, e.g. a task
switching from 20% to 70%.

That's the transient state you refer to, isn't it?

> That is because task_ue.enqueued is sampled at dequeue time, precisely at the 
> moment
> at which task_u is reaching its max for that task.

Right, so in the example above we will have enqueued=20% while task_u
is going above that and converging towards 70%.
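
Just to make the point above concrete, here is a minimal user-space
sketch of the trigger condition (illustrative only: the helper name,
the 1024-based numbers and the purely proportional mapping are my
assumptions, not the actual patch code):

  #include <stdio.h>

  /*
   * Boost only on the "increase transient": the current util_avg has
   * grown above the util_est.enqueued value sampled at the previous
   * dequeue. At steady state enqueued >= util_avg, so the boost is 0.
   */
  static unsigned long ramp_boost(unsigned long util_avg,
                                  unsigned long util_est_enqueued)
  {
          return util_avg > util_est_enqueued ?
                  util_avg - util_est_enqueued : 0;
  }

  int main(void)
  {
          /* 20% -> 70% step: enqueued still holds ~20% of 1024 */
          printf("transient boost: %lu\n", ramp_boost(500, 205));
          /* converged at ~70%: wobbles keep util_avg <= enqueued */
          printf("steady state boost: %lu\n", ramp_boost(700, 717));
          return 0;
  }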

> Since we only take into account positive boosts, ramp boost will
> only have an impact in the "increase transients".

Right.

I think Peter was referring to the smallish wobbles we see when the
task already converged to 70%. If that's the case I would say they are
already fully covered also by the current util_est.

You are also correct in pointing out that ramp_boost will not be
triggered in that steady state.

IMU, that's for two main reasons:
 a) it's very likely that enqueued <= util_avg
 b) even in case enqueued should turn out to be _slightly_ bigger than
util_avg, the corresponding (proportional) ramp_boost would be so
tiny to not have any noticeable effect on OPP selection.

Am I correct on point b) above?

Could you maybe come up with some experimental numbers related to that
case specifically?

Best,
Patrick

-- 
#include 

Patrick Bellasi


[PATCH v11 4/5] sched/core: uclamp: Use TG's clamps to restrict TASK's clamps

2019-07-08 Thread Patrick Bellasi
When a task specific clamp value is configured via sched_setattr(2), this
value is accounted in the corresponding clamp bucket every time the task is
{en,de}queued. However, when cgroups are also in use, the task specific
clamp values could be restricted by the task_group (TG) clamp values.

Update uclamp_cpu_inc() to aggregate task and TG clamp values. Every time a
task is enqueued, it's accounted in the clamp bucket tracking the smaller
clamp between the task specific value and its TG effective value. This
allows us to:

1. ensure cgroup clamps are always used to restrict task specific requests,
   i.e. a task is boosted no more than its TG effective protection and
   capped at least as much as its TG effective limit.

2. implement a "nice-like" policy, where tasks are still allowed to request
   less than what is enforced by their TG effective limits and protections

This mimics what already happens for a task's CPU affinity mask when the
task is also in a cpuset, i.e. cgroup attributes are always used to
restrict per-task attributes.

Do this by exploiting the concept of "effective" clamp, which is already
used by a TG to track parent enforced restrictions.

Apply task group clamp restrictions only to tasks belonging to a child
group, while for tasks in the root group or in an autogroup only system
defaults are enforced.
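
The aggregation rule implemented by uclamp_tg_restrict() below boils
down to the following standalone sketch (simplified types, for
illustration only):

  struct clamp {
          unsigned int value;
          int user_defined;       /* set via sched_setattr() */
  };

  /*
   * The TG effective value always restricts the task request and it is
   * also used as the default when the task never set a clamp of its own.
   */
  static struct clamp tg_restrict(struct clamp task_req, struct clamp tg_eff)
  {
          if (task_req.value > tg_eff.value || !task_req.user_defined)
                  return tg_eff;
          return task_req;
  }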

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c | 28 +++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 276f9c2f6103..2591a70c85cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -873,16 +873,42 @@ unsigned int uclamp_rq_max_value(struct rq *rq, unsigned int clamp_id,
return uclamp_idle_value(rq, clamp_id, clamp_value);
 }
 
+static inline struct uclamp_se
+uclamp_tg_restrict(struct task_struct *p, unsigned int clamp_id)
+{
+   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+   struct uclamp_se uc_max;
+
+   /*
+* Tasks in autogroups or root task group will be
+* restricted by system defaults.
+*/
+   if (task_group_is_autogroup(task_group(p)))
+   return uc_req;
+   if (task_group(p) == &root_task_group)
+   return uc_req;
+
+   uc_max = task_group(p)->uclamp[clamp_id];
+   if (uc_req.value > uc_max.value || !uc_req.user_defined)
+   return uc_max;
+#endif
+
+   return uc_req;
+}
+
 /*
  * The effective clamp bucket index of a task depends on, by increasing
  * priority:
  * - the task specific clamp value, when explicitly requested from userspace
+ * - the task group effective clamp value, for tasks not either in the root
+ *   group or in an autogroup
  * - the system default clamp value, defined by the sysadmin
  */
 static inline struct uclamp_se
 uclamp_eff_get(struct task_struct *p, unsigned int clamp_id)
 {
-   struct uclamp_se uc_req = p->uclamp_req[clamp_id];
+   struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
struct uclamp_se uc_max = uclamp_default[clamp_id];
 
/* System default restrictions always apply */
-- 
2.21.0



[PATCH v11 3/5] sched/core: uclamp: Propagate system defaults to root group

2019-07-08 Thread Patrick Bellasi
The clamp values are not tunable at the level of the root task group.
That's for two main reasons:

 - the root group represents "system resources" which are always
   entirely available from the cgroup standpoint.

 - when tuning/restricting "system resources" makes sense, tuning must
   be done using a system wide API which should also be available when
   control groups are not in use.

When a system wide restriction is available, cgroups should be aware of
its value in order to know exactly how much "system resources" are
available for the subgroups.

Utilization clamping already supports the concepts of:

 - system defaults: which define the maximum possible clamp values
   usable by tasks.

 - effective clamps: which allow a parent cgroup to constrain (maybe
   temporarily) its descendants without losing the information related
   to the values "requested" by them.

Exploit these two concepts and bind them together in such a way that,
whenever system defaults are tuned, the new values are propagated to
(possibly) restrict or relax the "effective" value of nested cgroups.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
---
 kernel/sched/core.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec91f4518752..276f9c2f6103 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1017,12 +1017,30 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+#ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css);
+static void uclamp_update_root_tg(void)
+{
+   struct task_group *tg = &root_task_group;
+
+   uclamp_se_set(&tg->uclamp_req[UCLAMP_MIN],
+ sysctl_sched_uclamp_util_min, false);
+   uclamp_se_set(&tg->uclamp_req[UCLAMP_MAX],
+ sysctl_sched_uclamp_util_max, false);
+
+   cpu_util_update_eff(&root_task_group.css);
+}
+#else
+static void uclamp_update_root_tg(void) { }
+#endif
+
 int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
 {
-   int old_min, old_max;
+   bool update_root_tg = false;
static DEFINE_MUTEX(mutex);
+   int old_min, old_max;
int result;
 
	mutex_lock(&mutex);
@@ -1044,12 +1062,17 @@ int sysctl_sched_uclamp_handler(struct ctl_table *table, int write,
if (old_min != sysctl_sched_uclamp_util_min) {
	uclamp_se_set(&uclamp_default[UCLAMP_MIN],
  sysctl_sched_uclamp_util_min, false);
+   update_root_tg = true;
}
if (old_max != sysctl_sched_uclamp_util_max) {
	uclamp_se_set(&uclamp_default[UCLAMP_MAX],
  sysctl_sched_uclamp_util_max, false);
+   update_root_tg = true;
}
 
+   if (update_root_tg)
+   uclamp_update_root_tg();
+
/*
 * Updating all the RUNNABLE task is expensive, keep it simple and do
 * just a lazy update at each next enqueue time.
-- 
2.21.0



[PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps

2019-07-08 Thread Patrick Bellasi
In order to properly support hierarchical resources control, the cgroup
delegation model requires that attribute writes from a child group never
fail but still are locally consistent and constrained based on parent's
assigned resources. This requires properly propagating and aggregating
parent attributes down to their descendants.

Implement this mechanism by adding a new "effective" clamp value for each
task group. The effective clamp value is defined as the smaller value
between the clamp value of a group and the effective clamp value of its
parent. This is the actual clamp value enforced on tasks in a task group.

Since it's possible for a cpu.uclamp.min value to be bigger than the
cpu.uclamp.max value, ensure local consistency by restricting each
"protection" (i.e. min utilization) with the corresponding "limit"
(i.e. max utilization). Do that at effective clamps propagation time to
ensure user-space writes never fail while still always tracking the
most restrictive values.
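
In other words, the propagation rule can be summarized by the following
sketch (plain arrays instead of the kernel data structures, for
illustration only):

  #define UCLAMP_MIN 0
  #define UCLAMP_MAX 1

  /*
   * Effective clamps: the group request is capped by the parent's
   * effective value, then the protection (min) is further capped by the
   * limit (max) to keep the group locally consistent.
   */
  static void compute_effective(const unsigned int req[2],
                                const unsigned int parent_eff[2],
                                unsigned int eff[2])
  {
          int i;

          for (i = 0; i < 2; i++)
                  eff[i] = req[i] < parent_eff[i] ? req[i] : parent_eff[i];

          if (eff[UCLAMP_MIN] > eff[UCLAMP_MAX])
                  eff[UCLAMP_MIN] = eff[UCLAMP_MAX];
  }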

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v11:
 Message-ID: <20190624174607.gq657...@devbig004.ftw2.facebook.com>
 - Removed user-space uclamp.{min.max}.effective API
 - Ensure group limits always clamps group protections
---
 kernel/sched/core.c  | 65 
 kernel/sched/sched.h |  2 ++
 2 files changed, 67 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17ebdaaf7cd9..ec91f4518752 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -773,6 +773,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
 }
 
 #ifdef CONFIG_UCLAMP_TASK
+/*
+ * Serializes updates of utilization clamp values
+ *
+ * The (slow-path) user-space triggers utilization clamp value updates which
+ * can require updates on (fast-path) scheduler's data structures used to
+ * support enqueue/dequeue operations.
+ * While the per-CPU rq lock protects fast-path update operations, user-space
+ * requests are serialized using a mutex to reduce the risk of conflicting
+ * updates or API abuses.
+ */
+static DEFINE_MUTEX(uclamp_mutex);
+
 /* Max allowed minimum utilization */
 unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
 
@@ -1137,6 +1149,8 @@ static void __init init_uclamp(void)
unsigned int clamp_id;
int cpu;
 
+   mutex_init(&uclamp_mutex);
+
for_each_possible_cpu(cpu) {
	memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
cpu_rq(cpu)->uclamp_flags = 0;
@@ -1153,6 +1167,7 @@ static void __init init_uclamp(void)
uclamp_default[clamp_id] = uc_max;
 #ifdef CONFIG_UCLAMP_TASK_GROUP
root_task_group.uclamp_req[clamp_id] = uc_max;
+   root_task_group.uclamp[clamp_id] = uc_max;
 #endif
}
 }
@@ -6738,6 +6753,7 @@ static inline void alloc_uclamp_sched_group(struct task_group *tg,
for_each_clamp_id(clamp_id) {
	uclamp_se_set(&tg->uclamp_req[clamp_id],
  uclamp_none(clamp_id), false);
+   tg->uclamp[clamp_id] = parent->uclamp[clamp_id];
}
 #endif
 }
@@ -6988,6 +7004,45 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
 }
 
 #ifdef CONFIG_UCLAMP_TASK_GROUP
+static void cpu_util_update_eff(struct cgroup_subsys_state *css)
+{
+   struct cgroup_subsys_state *top_css = css;
+   struct uclamp_se *uc_se = NULL;
+   unsigned int eff[UCLAMP_CNT];
+   unsigned int clamp_id;
+   unsigned int clamps;
+
+   css_for_each_descendant_pre(css, top_css) {
+   uc_se = css_tg(css)->parent
+   ? css_tg(css)->parent->uclamp : NULL;
+
+   for_each_clamp_id(clamp_id) {
+   /* Assume effective clamps matches requested clamps */
+   eff[clamp_id] = css_tg(css)->uclamp_req[clamp_id].value;
+   /* Cap effective clamps with parent's effective clamps */
+   if (uc_se &&
+   eff[clamp_id] > uc_se[clamp_id].value) {
+   eff[clamp_id] = uc_se[clamp_id].value;
+   }
+   }
+   /* Ensure protection is always capped by limit */
+   eff[UCLAMP_MIN] = min(eff[UCLAMP_MIN], eff[UCLAMP_MAX]);
+
+   /* Propagate most restrictive effective clamps */
+   clamps = 0x0;
+   uc_se = css_tg(css)->uclamp;
+   for_each_clamp_id(clamp_id) {
+   if (eff[clamp_id] == uc_se[clamp_id].value)
+   continue;
+   uc_se[clamp_id].value = eff[clamp_id];
+   uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
+   clamps |= (0x1 << clamp_id);
+   }
+   if (!clamps)

[PATCH v11 5/5] sched/core: uclamp: Update CPU's refcount on TG's clamp changes

2019-07-08 Thread Patrick Bellasi
On updates of task group (TG) clamp values, ensure that these new values
are enforced on all RUNNABLE tasks of the task group, i.e. all RUNNABLE
tasks are immediately boosted and/or capped as requested.

Do that each time we update effective clamps from cpu_util_update_eff().
Use the *cgroup_subsys_state (css) to walk the list of tasks in each
affected TG and update their RUNNABLE tasks.
Update each task by using the same mechanism used for cpu affinity masks
updates, i.e. by taking the rq lock.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v11:
 Message-ID: <20190624174607.gq657...@devbig004.ftw2.facebook.com>
 - Ensure group limits always clamps group protection
---
 kernel/sched/core.c | 58 -
 1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2591a70c85cf..ddc5fcd4b9cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1043,6 +1043,57 @@ static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
uclamp_rq_dec_id(rq, p, clamp_id);
 }
 
+static inline void
+uclamp_update_active(struct task_struct *p, unsigned int clamp_id)
+{
+   struct rq_flags rf;
+   struct rq *rq;
+
+   /*
+* Lock the task and the rq where the task is (or was) queued.
+*
+* We might lock the (previous) rq of a !RUNNABLE task, but that's the
+* price to pay to safely serialize util_{min,max} updates with
+* enqueues, dequeues and migration operations.
+* This is the same locking schema used by __set_cpus_allowed_ptr().
+*/
+   rq = task_rq_lock(p, &rf);
+
+   /*
+* Setting the clamp bucket is serialized by task_rq_lock().
+* If the task is not yet RUNNABLE and its task_struct is not
+* affecting a valid clamp bucket, the next time it's enqueued,
+* it will already see the updated clamp bucket value.
+*/
+   if (!p->uclamp[clamp_id].active)
+   goto done;
+
+   uclamp_rq_dec_id(rq, p, clamp_id);
+   uclamp_rq_inc_id(rq, p, clamp_id);
+
+done:
+
+   task_rq_unlock(rq, p, &rf);
+}
+
+static inline void
+uclamp_update_active_tasks(struct cgroup_subsys_state *css,
+  unsigned int clamps)
+{
+   struct css_task_iter it;
+   struct task_struct *p;
+   unsigned int clamp_id;
+
+   css_task_iter_start(css, 0, &it);
+   while ((p = css_task_iter_next(&it))) {
+   for_each_clamp_id(clamp_id) {
+   if ((0x1 << clamp_id) & clamps)
+   uclamp_update_active(p, clamp_id);
+   }
+   }
+   css_task_iter_end(&it);
+}
+
 #ifdef CONFIG_UCLAMP_TASK_GROUP
 static void cpu_util_update_eff(struct cgroup_subsys_state *css);
 static void uclamp_update_root_tg(void)
@@ -7087,8 +7138,13 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
	uc_se[clamp_id].bucket_id = uclamp_bucket_id(eff[clamp_id]);
clamps |= (0x1 << clamp_id);
}
-   if (!clamps)
+   if (!clamps) {
css = css_rightmost_descendant(css);
+   continue;
+   }
+
+   /* Immediately update descendants RUNNABLE tasks */
+   uclamp_update_active_tasks(css, clamps);
}
 }
 
-- 
2.21.0



[PATCH v11 0/5] Add utilization clamping support (CGroups API)

2019-07-08 Thread Patrick Bellasi
ey actually are.

Utilization clamping thus enables interesting additional optimizations, for
example on asymmetric capacity systems like Arm big.LITTLE and DynamIQ CPUs,
where:

 - boosting: try to run small/foreground tasks on higher-capacity CPUs to
   complete them faster despite being less energy efficient.

 - capping: try to run big/background tasks on low-capacity CPUs to save power
   and thermal headroom for more important tasks

This series does not present this additional usage of utilization clamping but
it's an integral part of the EAS feature set, where [2] is one of its main
components.

Android kernels use SchedTune, a solution similar to utilization clamping, to
bias both 'frequency selection' and 'task placement'. This series provides the
foundation to add similar features to mainline while focusing, for the
time being, just on schedutil integration.


References
==========

[1] Energy Aware Scheduling

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/scheduler/sched-energy.txt?h=v5.1

[2] Expressing per-task/per-cgroup performance hints
Linux Plumbers Conference 2018
https://linuxplumbersconf.org/event/2/contributions/128/


Patrick Bellasi (5):
  sched/core: uclamp: Extend CPU's cgroup controller
  sched/core: uclamp: Propagate parent clamps
  sched/core: uclamp: Propagate system defaults to root group
  sched/core: uclamp: Use TG's clamps to restrict TASK's clamps
  sched/core: uclamp: Update CPU's refcount on TG's clamp changes

 Documentation/admin-guide/cgroup-v2.rst |  30 +++
 init/Kconfig|  22 ++
 kernel/sched/core.c | 335 +++-
 kernel/sched/sched.h|   8 +
 4 files changed, 392 insertions(+), 3 deletions(-)

-- 
2.21.0



[PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller

2019-07-08 Thread Patrick Bellasi
The cgroup CPU bandwidth controller allows assigning a specified
(maximum) bandwidth to the tasks of a group. However, this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running at while executing
that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Given the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better define the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow enforcing utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run at least at a
 minimum frequency which corresponds to the uclamp.min
 utilization

- uclamp.max: defines the maximum utilization which should be considered
  i.e. the RUNNABLE tasks of this group will run up to a
 maximum frequency which corresponds to the uclamp.max
 utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depend on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Validate local consistency by enforcing uclamp.min < uclamp.max.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Signed-off-by: Patrick Bellasi 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Tejun Heo 

---
Changes in v11:
 Message-ID: <20190624175215.gr657...@devbig004.ftw2.facebook.com>
 - remove checks for cpu_uclamp_{min,max}_write() from root group
 - remove enforcement for "protection" being smaller than "limits"
 - rephrase uclamp extension description to avoid explicit
   mentioning of the bandwidth concept
---
 Documentation/admin-guide/cgroup-v2.rst |  30 +
 init/Kconfig|  22 
 kernel/sched/core.c | 161 +++-
 kernel/sched/sched.h|   6 +
 4 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index a5c845338d6d..1d49426b4c1e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -951,6 +951,13 @@ controller implements weight and absolute bandwidth limit models for
 normal scheduling policy and absolute bandwidth allocation model for
 realtime scheduling policy.
 
+In all the above models, cycles distribution is defined only on a temporal
+base and it does not account for the frequency at which tasks are executed.
+The (optional) utilization clamping support allows to hint the schedutil
+cpufreq governor about the minimum desired frequency which should always be
+provided by a CPU, as well as the maximum desired frequency, which should not
+be exceeded by a CPU.
+
 WARNING: cgroup2 doesn't yet support control of realtime processes and
 the cpu controller can only be enabled when all RT processes are in
 the root cgroup.  Be aware that system management software may already
@@ -1016,6 +1023,29 @@ All time durations are in microseconds.
Shows pressure sta

Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

2019-07-02 Thread Patrick Bellasi
On 01-Jul 17:01, Subhra Mazumdar wrote:
> 
> On 7/1/19 6:55 AM, Patrick Bellasi wrote:
> > On 01-Jul 11:02, Peter Zijlstra wrote:
> > > On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
> > > > Hi,
> > > > 
> > > > Resending this patchset, will be good to get some feedback. Any 
> > > > suggestions
> > > > that will make it more acceptable are welcome. We have been shipping 
> > > > this
> > > > with Unbreakable Enterprise Kernel in Oracle Linux.
> > > > 
> > > > Current select_idle_sibling first tries to find a fully idle core using
> > > > select_idle_core which can potentially search all cores and if it fails 
> > > > it
> > > > finds any idle cpu using select_idle_cpu. select_idle_cpu can 
> > > > potentially
> > > > search all cpus in the llc domain. This doesn't scale for large llc 
> > > > domains
> > > > and will only get worse with more cores in future.
> > > > 
> > > > This patch solves the scalability problem by:
> > > >   - Setting an upper and lower limit of idle cpu search in 
> > > > select_idle_cpu
> > > > to keep search time low and constant
> > > >   - Adding a new sched feature SIS_CORE to disable select_idle_core
> > > > 
> > > > Additionally it also introduces a new per-cpu variable next_cpu to track
> > > > the limit of search so that every time search starts from where it 
> > > > ended.
> > > > This rotating search window over cpus in LLC domain ensures that idle
> > > > cpus are eventually found in case of high load.
> > > Right, so we had a wee conversation about this patch series at OSPM, and
> > > I don't see any of that reflected here :-(
> > > 
> > > Specifically, given that some people _really_ want the whole L3 mask
> > > scanned to reduce tail latency over raw throughput, while you guys
> > > prefer the other way around, it was proposed to extend the task model.
> > > 
> > > Specifically something like a latency-nice was mentioned (IIRC) where a
> > Right, AFAIR PaulT suggested to add support for the concept of a task
> > being "latency tolerant": meaning we can spend more time to search for
> > a CPU and/or avoid preempting the current task.
> > 
> Wondering if searching and preempting needs will ever be conflicting?

I guess the winning point is that we don't commit behaviors to
userspace, but just abstract concepts which are turned into biases.

I don't see conflicts right now: if you are latency tolerant that
means you can spend more time to try finding a better CPU (e.g. we can
use the energy model to compare multiple CPUs) _and/or_ give the
current task a better chance to complete by delaying its preemption.

> Otherwise sounds like a good direction to me. For the searching aspect, can
> we map latency nice values to the % of cores we search in select_idle_cpu?
> Thus the search cost can be controlled by latency nice value.

I guess that's worth a try; the only caveat I see is that it turns the
bias into something very platform specific. Meaning, the same
latency-nice value on different machines can have very different
results.

Would it not be better to try finding a more platform independent mapping?

Maybe something time bounded, e.g. the higher the latency-nice the more
time we can spend looking for CPUs?
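
Something along these lines, just to fix the idea (the interface, the
budget mapping and the [-20, 19] latency-nice range are completely made
up, this is not a proposal against the actual select_idle_cpu() code):

  /*
   * Stop scanning for an idle CPU once a latency-nice dependent time
   * budget is exhausted; higher latency-nice => bigger budget.
   */
  static int scan_idle_cpu_bounded(int this_cpu, int nr_cpus,
                                   int latency_nice,
                                   int (*cpu_is_idle)(int cpu),
                                   unsigned long long (*now_ns)(void))
  {
          unsigned long long budget_ns = 500ULL * (latency_nice + 21);
          unsigned long long start = now_ns();
          int i, cpu;

          for (i = 1; i < nr_cpus; i++) {
                  cpu = (this_cpu + i) % nr_cpus;
                  if (cpu_is_idle(cpu))
                          return cpu;
                  if (now_ns() - start > budget_ns)
                          break;  /* budget exhausted, fall back */
          }
          return this_cpu;
  }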

> But the issue is if more latency tolerant workloads set to less
> search, we still need some mechanism to achieve good spread of
> threads.

I don't get this example: why should more latency tolerant workloads
require less search?

> Can we keep the sliding window mechanism in that case?

Which one? Sorry, I did not go through the patches; can you briefly
summarize the idea?

> Also will latency nice do anything for select_idle_core and
> select_idle_smt?

I guess in principle the same bias can be used at different levels,
maybe with different mappings.

In the mobile world use-case we will likely use it only to switch from
select_idle_sibling to the energy aware slow path. And perhaps to see
if we can bias the wakeup preemption granularity.

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RESEND PATCH v3 0/7] Improve scheduler scalability for fast path

2019-07-01 Thread Patrick Bellasi
On 01-Jul 11:02, Peter Zijlstra wrote:
> On Wed, Jun 26, 2019 at 06:29:12PM -0700, subhra mazumdar wrote:
> > Hi,
> > 
> > Resending this patchset, will be good to get some feedback. Any suggestions
> > that will make it more acceptable are welcome. We have been shipping this
> > with Unbreakable Enterprise Kernel in Oracle Linux.
> > 
> > Current select_idle_sibling first tries to find a fully idle core using
> > select_idle_core which can potentially search all cores and if it fails it
> > finds any idle cpu using select_idle_cpu. select_idle_cpu can potentially
> > search all cpus in the llc domain. This doesn't scale for large llc domains
> > and will only get worse with more cores in future.
> > 
> > This patch solves the scalability problem by:
> >  - Setting an upper and lower limit of idle cpu search in select_idle_cpu
> >to keep search time low and constant
> >  - Adding a new sched feature SIS_CORE to disable select_idle_core
> > 
> > Additionally it also introduces a new per-cpu variable next_cpu to track
> > the limit of search so that every time search starts from where it ended.
> > This rotating search window over cpus in LLC domain ensures that idle
> > cpus are eventually found in case of high load.
> 
> Right, so we had a wee conversation about this patch series at OSPM, and
> I don't see any of that reflected here :-(
> 
> Specifically, given that some people _really_ want the whole L3 mask
> scanned to reduce tail latency over raw throughput, while you guys
> prefer the other way around, it was proposed to extend the task model.
> 
> Specifically something like a latency-nice was mentioned (IIRC) where a

Right, AFAIR PaulT suggested to add support for the concept of a task
being "latency tolerant": meaning we can spend more time to search for
a CPU and/or avoid preempting the current task.

> task can give a bias but not specify specific behaviour. This is very
> important since we don't want to be ABI tied to specific behaviour.

I like the idea of biasing, especially considering we are still in the
domain of the FAIR scheduler. If something more mandatory should be
required there are other classes which are likely more appropriate.

> Some of the things we could tie to this would be:
> 
>   - select_idle_siblings; -nice would scan more than +nice,

Just to be sure, you are not proposing to use the nice value we
already have, i.e.
  p->{static,normal}_prio
but instead a new similar concept, right?

Otherwise, the pro would be that we don't touch userspace, but as a con
we would have side effects, e.g. on bandwidth allocation, while I think
we don't want to mix "specific behaviors" with "biases".

>   - wakeup preemption; when the wakee has a relative smaller
> latency-nice value than the current running task, it might preempt
> sooner and the other way around of course.

I think we currently have a single system-wide parameter for that:

   sched_wakeup_granularity_ns ==> sysctl_sched_wakeup_granularity

which is used in:

   wakeup_gran()        for the wakeup path
   check_preempt_tick() for the periodic tick

that's where it should be possible to extend the heuristics with some
biasing based on the latency-nice attribute of a task, right?
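
For example, one (purely illustrative) way to fold such a bias into the
granularity used by wakeup_gran() could be something like the sketch
below; the scaling law and the [-20, 19] latency-nice range are my
assumptions:

  /*
   * Scale the base wakeup granularity with the wakee's latency-nice:
   * negative values (latency sensitive) shrink it, so the wakee preempts
   * sooner; positive values grow it.
   */
  static unsigned long long wakeup_gran_biased(unsigned long long base_ns,
                                               int latency_nice)
  {
          return base_ns + (long long)base_ns * latency_nice / 20;
  }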

>   - pack-vs-spread; +nice would pack more with like tasks (since we
> already spread by default [0] I don't think -nice would affect much
> here).

That will be very useful for the Android case too.
In Android we used to call it "prefer_idle"; that's probably not
the best name, but it is conceptually similar.

In Android we would use a latency-nice concept to go for either the
fast (select_idle_siblings) or the slow (energy aware) path.

> Hmmm?

Just one more requirement I think it's worth considering from the
beginning: CGroups support.

That would be a very welcome interface, just because it is so much more
convenient (and safe) to set these biases on a group of tasks depending
on their role in the system.

Do you have any idea on how we can expose such a "latency-nice"
property via CGroups? It's very similar to cpu.shares but it does not
represent a resource which can be partitioned.

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH] sched/fair: util_est: fast ramp-up EWMA on utilization increases

2019-07-01 Thread Patrick Bellasi
On 30-Jun 10:43, Vincent Guittot wrote:
> On Fri, 28 Jun 2019 at 16:10, Patrick Bellasi  wrote:
> > On 28-Jun 15:51, Vincent Guittot wrote:
> > > On Fri, 28 Jun 2019 at 14:38, Peter Zijlstra  wrote:
> > > > On Fri, Jun 28, 2019 at 11:08:14AM +0100, Patrick Bellasi wrote:
> > > > > On 26-Jun 13:40, Vincent Guittot wrote:

Hi Vincent,

[...]

> > > AFAICT, it's not related to the time-scaling
> > >
> > > In fact the big 1st activation happens because task runs at low OPP
> > > and hasn't enough time to finish its running phase before the time to
> > > begin the next one happens. This means that the task will run several
> > > computations phase in one go which is no more a 75% task.
> >
> > But in that case, running multiple activations back to back, should we
> > not expect the util_avg to exceed the 75% mark?
> 
> But task starts with a very low value and Pelt needs time to ramp up.

Of course...

[...]

> > > Once cpu reaches a high enough OPP that enable to have sleep phase
> > > between each running phases, the task load tracking comes back to the
> > > normal slope increase (the one that would have happen if task would
> > > have jump from 5% to 75% but already running at max OPP)
> >
> >
> > Indeed, I can see from the plots a change in slope. But there is also
> > that big drop after the first big activation: 375 units in 1.1ms.
> >
> > Is that expected? I guess yes, since we fix the clock_pelt with the
> > lost_idle_time.

... but, I guess Peter was mainly asking about the point above: is
that "big" drop after the first activation related to time-scaling or
not?

Cheers,
Patrick

-- 
#include 

Patrick Bellasi

