Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-14 Thread Patrick Bellasi
On 13-Sep 21:14, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 06:52:02PM +0100, Patrick Bellasi wrote:
> > On 12-Sep 19:42, Peter Zijlstra wrote:
> > > On Wed, Sep 12, 2018 at 06:35:15PM +0100, Patrick Bellasi wrote:
> > > > On 12-Sep 18:12, Peter Zijlstra wrote:
> > > 
> > > > > No idea; but if you want to go all fancy you can replace the whole
> > > > > uclamp_map thing with something like:
> > > > > 
> > > > > struct uclamp_map {
> > > > >   union {
> > > > >   struct {
> > > > >   unsigned long v : 10;
> > > > >   unsigned long c : BITS_PER_LONG - 10;
> > > > >   };
> > > > >   atomic_long_t s;
> > > > >   };
> > > > > };
> > > > 
> > > > That sounds really cool and scary at the same time :)
> > > > 
> > > > The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
> > > > or that we use it to track a percentage value (i.e. [0..100]).
> > > 
> > > Or we pick 11 bits, it seems unlikely that capacity will ever be larger than 2k.
> > 
> > Just remembered a past experience where we had unaligned access traps
> > on some machine because... don't you see any potential issue with
> > using bitfields like you suggest above?
> > 
> > I'm thinking to:
> > 
> >    commit 317d359df95d ("sched/core: Force proper alignment of 'struct util_est'")
> 
> There should not be (and I'm still confused by that particular commit
> you reference). If we access everything through the uclamp_map::s, and
> only use the bitfields to interpret the results, it all 'works'.

Yes, the problem above was different... still I was wondering if there
could be bitfield-related alignment issues lurking somewhere.
But, as you say, if we always R/W atomically via uclamp_map::s there
should be none.
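
For reference, here is a minimal sketch of that access pattern, just to
convince myself (the union layout is the one you propose above, with the
11-bit variant; uclamp_map_snapshot() is a made-up helper, not from the
series):

struct uclamp_map {
	union {
		struct {
			unsigned long v : 11;			/* clamp value */
			unsigned long c : BITS_PER_LONG - 11;	/* se refcount */
		};
		atomic_long_t s;
	};
};

static inline struct uclamp_map uclamp_map_snapshot(struct uclamp_map *m)
{
	struct uclamp_map snap;

	/* A single atomic load: no torn reads, no alignment traps. */
	snap.s.counter = atomic_long_read(&m->s);
	return snap;
}

Readers then decode snap.v and snap.c from the local copy, so the
bitfields only ever interpret a word that was loaded atomically.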

> The tricky thing we did earlier was trying to use u64 accesses for 2
> u32 variables. And somehow ia64 didn't get the alignment right.

Right, np... sorry for the noise.

-- 
#include <best/regards.h>

Patrick Bellasi


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-14 Thread Patrick Bellasi
On 13-Sep 21:20, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 06:42:09PM +0100, Patrick Bellasi wrote:
> > On 12-Sep 18:24, Peter Zijlstra wrote:
> > > On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
> 
> > > >  {
> > > > +   int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > > > +   int lower_bound, upper_bound;
> > > > +   struct uclamp_se *uc_se;
> > > > +   int result = 0;
> > > 
> > > I think the thing would become much more readable if you set
> > > lower/upper_bound right here.
> 
> > Actually it could also make sense to have them before the mutex ;)
> 
> Indeed.
> 
> +   upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> +   ? attr->sched_util_max
> +   : p->uclamp[UCLAMP_MAX].value;
> +
> +   if (upper_bound == UCLAMP_NOT_VALID)
> +   upper_bound = SCHED_CAPACITY_SCALE;
> +   if (attr->sched_util_min > upper_bound) {
> +   result = -EINVAL;
> +   goto done;
> +   }
> +
> +   result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> +   if (result == -ENOSPC) {
> +   pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> +   goto done;
> +   }
> +   group_id[UCLAMP_MIN] = result;
> +   }
> +   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> +   lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> +   ? attr->sched_util_min
> +   : p->uclamp[UCLAMP_MIN].value;
> +
> +   if (lower_bound == UCLAMP_NOT_VALID)
> +   lower_bound = 0;
> +   if (attr->sched_util_max < lower_bound ||
> +   attr->sched_util_max > SCHED_CAPACITY_SCALE) {
> +   result = -EINVAL;
> +   goto done;
> +   }
> 
> That would end up something like:
> 
>   unsigned int lower_bound = p->uclamp[UCLAMP_MIN].value;
>   unsigned int upper_bound = p->uclamp[UCLAMP_MAX].value;
> 
>   if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
>   lower_bound = attr->sched_util_min;
> 
>   if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
>   upper_bound = attr->sched_util_max;
> 
>   if (lower_bound > upper_bound ||
>   upper_bound > SCHED_CAPACITY_SCALE)
>   return -EINVAL;
> 
>   mutex_lock(...);

Yes... much cleaner, thanks.

-- 
#include <best/regards.h>

Patrick Bellasi


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-13 Thread Peter Zijlstra
On Wed, Sep 12, 2018 at 06:42:09PM +0100, Patrick Bellasi wrote:
> On 12-Sep 18:24, Peter Zijlstra wrote:
> > On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:

> > >  {
> > > + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > > + int lower_bound, upper_bound;
> > > + struct uclamp_se *uc_se;
> > > + int result = 0;
> > 
> > I think the thing would become much more readable if you set
> > lower/upper_bound right here.

> Actually it could also make sense to have them before the mutex ;)

Indeed.

+   upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
+   ? attr->sched_util_max
+   : p->uclamp[UCLAMP_MAX].value;
+   
+   if (upper_bound == UCLAMP_NOT_VALID)
+   upper_bound = SCHED_CAPACITY_SCALE;
+   if (attr->sched_util_min > upper_bound) {
+   result = -EINVAL;
+   goto done;
+   }
+ 
+   result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
+   if (result == -ENOSPC) {
+   pr_err(UCLAMP_ENOSPC_FMT, "MIN");
+   goto done;
+   }
+   group_id[UCLAMP_MIN] = result;
+   }
+   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
+   lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
+   ? attr->sched_util_min
+   : p->uclamp[UCLAMP_MIN].value;
+ 
+   if (lower_bound == UCLAMP_NOT_VALID)
+   lower_bound = 0;
+   if (attr->sched_util_max < lower_bound ||
+   attr->sched_util_max > SCHED_CAPACITY_SCALE) {
+   result = -EINVAL;
+   goto done;
+   }

That would end up something like:

unsigned int lower_bound = p->uclamp[UCLAMP_MIN].value;
unsigned int upper_bound = p->uclamp[UCLAMP_MAX].value;

if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
lower_bound = attr->sched_util_min;

if (sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
upper_bound = attr->sched_util_max;

if (lower_bound > upper_bound ||
upper_bound > SCHED_CAPACITY_SCALE)
return -EINVAL;

mutex_lock(...);




Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-13 Thread Peter Zijlstra
On Wed, Sep 12, 2018 at 06:52:02PM +0100, Patrick Bellasi wrote:
> On 12-Sep 19:42, Peter Zijlstra wrote:
> > On Wed, Sep 12, 2018 at 06:35:15PM +0100, Patrick Bellasi wrote:
> > > On 12-Sep 18:12, Peter Zijlstra wrote:
> > 
> > > > No idea; but if you want to go all fancy you can replace the whole
> > > > uclamp_map thing with something like:
> > > > 
> > > > struct uclamp_map {
> > > > union {
> > > > struct {
> > > > unsigned long v : 10;
> > > > unsigned long c : BITS_PER_LONG - 10;
> > > > };
> > > > atomic_long_t s;
> > > > };
> > > > };
> > > 
> > > That sounds really cool and scary at the same time :)
> > > 
> > > The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
> > > or that we use it to track a percentage value (i.e. [0..100]).
> > 
> > Or we pick 11 bits, it seems unlikely that capacity will ever be larger than 2k.
> 
> Just remembered a past experience where we had unaligned access traps
> on some machine because... don't you see any potential issue with
> using bitfields like you suggest above?
> 
> I'm thinking to:
> 
>    commit 317d359df95d ("sched/core: Force proper alignment of 'struct util_est'")

There should not be (and I'm still confused by that particular commit
you reference). If we access everything through the uclamp_map::s, and
only use the bitfields to interpret the results, it all 'works'.

The tricky thing we did earlier was trying to use u64 accesses for 2
u32 variables. And somehow ia64 didn't get the alignment right.


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-12 Thread Patrick Bellasi
On 12-Sep 19:42, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 06:35:15PM +0100, Patrick Bellasi wrote:
> > On 12-Sep 18:12, Peter Zijlstra wrote:
> 
> > > No idea; but if you want to go all fancy you can replace the whole
> > > uclamp_map thing with something like:
> > > 
> > > struct uclamp_map {
> > >   union {
> > >   struct {
> > >   unsigned long v : 10;
> > >   unsigned long c : BITS_PER_LONG - 10;
> > >   };
> > >   atomic_long_t s;
> > >   };
> > > };
> > 
> > That sounds really cool and scary at the same time :)
> > 
> > The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
> > or that we use it to track a percentage value (i.e. [0..100]).
> 
> Or we pick 11 bits, it seems unlikely that capacity will ever be larger than 2k.

Just remembered a past experience where we had unaligned access traps
on some machine because... don't you see any potential issue with
using bitfields like you suggest above?

I'm thinking to:

   commit 317d359df95d ("sched/core: Force proper alignment of 'struct util_est'")

-- 
#include <best/regards.h>

Patrick Bellasi


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-12 Thread Peter Zijlstra
On Wed, Sep 12, 2018 at 06:35:15PM +0100, Patrick Bellasi wrote:
> On 12-Sep 18:12, Peter Zijlstra wrote:

> > No idea; but if you want to go all fancy you can replace the whole
> > uclamp_map thing with something like:
> > 
> > struct uclamp_map {
> > union {
> > struct {
> > unsigned long v : 10;
> > unsigned long c : BITS_PER_LONG - 10;
> > };
> > atomic_long_t s;
> > };
> > };
> 
> That sounds really cool and scary at the same time :)
> 
> The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
> or that we use it to track a percentage value (i.e. [0..100]).

Or we pick 11 bits, it seems unlikely that capacity will ever be larger than 2k.

> One of the last patches introduces percentage values to userspace.
> But, I was considering that in kernel space we should always track
> full scale utilization values.
> 
> The c:(BITS_PER_LONG-10) restricts the number of concurrently active
> SEs refcounting the same clamp value, which for some 32-bit systems is
> only 4 million among tasks and cgroups... maybe still reasonable...

Yeah, on 32bit having 4M tasks seems 'unlikely'.
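
Spelling out the bit budget (plain arithmetic on the layout above):

	BITS_PER_LONG = 32, v : 10  ->  c : 22 bits  ->  max se_count = 2^22 - 1 (~4.2M)
	BITS_PER_LONG = 32, v : 11  ->  c : 21 bits  ->  max se_count = 2^21 - 1 (~2.1M)
	BITS_PER_LONG = 64, v : 11  ->  c : 53 bits  ->  effectively unbounded

So even the 11-bit variant still leaves ~2M concurrent SEs per clamp
value on 32-bit.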


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-12 Thread Patrick Bellasi
On 12-Sep 18:24, Peter Zijlstra wrote:
> On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
> >  static inline int __setscheduler_uclamp(struct task_struct *p,
> > const struct sched_attr *attr)
> 
> Bit large for inline now.

Yes, Suren also already pointed that out... already gone in my v5 ;)

> >  {
> > +   int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > +   int lower_bound, upper_bound;
> > +   struct uclamp_se *uc_se;
> > +   int result = 0;
> 
> I think the thing would become much more readable if you set
> lower/upper_bound right here.

Do you mean the bits I've ---8<---ed below?

> >
> > +   mutex_lock(&uclamp_mutex);
> >
> > +   /* Find a valid group_id for each required clamp value */
> > +   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {

---8<---
> > +   upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> > +   ? attr->sched_util_max
> > +   : p->uclamp[UCLAMP_MAX].value;
> > +
> > +   if (upper_bound == UCLAMP_NOT_VALID)
> > +   upper_bound = SCHED_CAPACITY_SCALE;
> > +   if (attr->sched_util_min > upper_bound) {
> > +   result = -EINVAL;
> > +   goto done;
> > +   }
---8<---

Actually it could also make sense to have them before the mutex ;)

> > +
> > +   result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> > +   if (result == -ENOSPC) {
> > +   pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> 
> AFAICT this is an unpriv part of the syscall; and you can spam the log
> without limits. Not good.

Good point, will check this better.
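
One option I'm considering (just a sketch; maybe the message should go
away entirely and we rely on the -ENOSPC return value) is the
rate-limited printk variant:

	result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
	if (result == -ENOSPC) {
		/* Rate-limited: unprivileged callers cannot flood the log. */
		pr_err_ratelimited(UCLAMP_ENOSPC_FMT, "MIN");
		goto done;
	}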

[...]

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-12 Thread Patrick Bellasi
On 12-Sep 18:12, Peter Zijlstra wrote:
> On Wed, Sep 12, 2018 at 04:56:19PM +0100, Patrick Bellasi wrote:
> > On 12-Sep 15:49, Peter Zijlstra wrote:
> > > On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
> 
> > > > +/**
> > > > + * uclamp_map: reference counts a utilization "clamp value"
> > > > + * @value:    the utilization "clamp value" required
> > > > + * @se_count: the number of scheduling entities requiring the "clamp value"
> > > > + * @se_lock:  serialize reference count updates by protecting se_count
> > > 
> > > Why do you have a spinlock to serialize a single value? Don't we have
> > > atomics for that?
> > 
> > There are some code paths where it's used to protect clamp groups
> > mapping and initialization, e.g.
> > 
> >   uclamp_group_get()
> >   spin_lock()
> >   // initialize clamp group (if required) and then...
         ^^^
         This is actually a couple of function calls

> >   se_count += 1
> >   spin_unlock()
> > 
> > Almost all these paths are triggered from user-space and protected
> > by a global uclamp_mutex, except for the fork/exit paths.
> > 
> > To serialize these paths I'm using the spinlock above, does it make
> > sense? Can we use the global uclamp_mutex on fork/exit too?
> 
> OK, then your comment is misleading; it serializes both fields.

Yes... that definitely needs an update.

> > One additional observation is that, if in the future we want to add a
> > kernel space API (e.g. a driver asking for a new clamp value), maybe we
> > will need to have a serialized non-sleeping uclamp_group_get() API?
> 
> No idea; but if you want to go all fancy you can replace the whole
> uclamp_map thing with something like:
> 
> struct uclamp_map {
>   union {
>   struct {
>   unsigned long v : 10;
>   unsigned long c : BITS_PER_LONG - 10;
>   };
>   atomic_long_t s;
>   };
> };

That sounds really cool and scary at the same time :)

The v:10 requires that we never set SCHED_CAPACITY_SCALE>1024
or that we use it to track a percentage value (i.e. [0..100]).

One of the last patches introduces percentage values to userspace.
But, I was considering that in kernel space we should always track
full scale utilization values.

The c:(BITS_PER_LONG-10) restricts the number of concurrently active
SEs refcounting the same clamp value, which for some 32-bit systems is
only 4 million among tasks and cgroups... maybe still reasonable...


> And use uclamp_map::c == 0 as unused (as per normal refcount
> semantics) and atomic_long_cmpxchg() the whole thing using
> uclamp_map::s.

Yes... that could work for the uclamp_map updates, but as I noted
above, I think I have other calls serialized by that lock... will look
more closely into what you suggest, thanks!
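
If I understand the suggestion correctly, the refcount fast path would
be something like this rough sketch (names invented; the clamp group
initialization would still need separate care):

	static int uclamp_map_tryget(struct uclamp_map *m, unsigned long value)
	{
		struct uclamp_map old, new;
		long prev = atomic_long_read(&m->s);

		for (;;) {
			old.s.counter = prev;
			/* c == 0 means the slot is unused and can be claimed. */
			if (old.c && old.v != value)
				return -EBUSY;
			new = old;
			new.v = value;
			new.c = old.c + 1;
			prev = atomic_long_cmpxchg(&m->s, old.s.counter,
						   new.s.counter);
			if (prev == old.s.counter)
				return 0;
		}
	}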


[...]

> > > What's the purpose of that cacheline align statement?
> > 
> > In uclamp_maps, we still need to scan the array when a clamp value is
> > changed from user-space, i.e. the cases reported above. Thus, that
> > alignment is just to ensure that we minimize the number of cache lines
> > used. Does that make sense?
> > 
> > Maybe that alignment is implicitly generated by the compiler?
> 
> It is not, but if it really is a slow path, we shouldn't care about
> alignment.

Ok, will remove it.

> > > Note that without that apparently superfluous lock, it would be 8*12 =
> > > 96 bytes, which is 1.5 lines and would indeed suggest you default to
> > > GROUP_COUNT=7 to fill 2 lines.
> > 
> > Yes, will check better if we can count on just the uclamp_mutex
> 
> Well, if we don't care about performance (slow path) then keeping the
> lock is fine, just the comment and alignment are misleading.

Ok

[...]

Cheers,
Patrick

-- 
#include <best/regards.h>

Patrick Bellasi


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-12 Thread Peter Zijlstra
On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
>  static inline int __setscheduler_uclamp(struct task_struct *p,
>   const struct sched_attr *attr)

Bit large for inline now.

>  {
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + int lower_bound, upper_bound;
> + struct uclamp_se *uc_se;
> + int result = 0;

I think the thing would become much more readable if you set
lower/upper_bound right here.
>  
> + mutex_lock(&uclamp_mutex);
>  
> + /* Find a valid group_id for each required clamp value */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> + ? attr->sched_util_max
> + : p->uclamp[UCLAMP_MAX].value;
> +
> + if (upper_bound == UCLAMP_NOT_VALID)
> + upper_bound = SCHED_CAPACITY_SCALE;
> + if (attr->sched_util_min > upper_bound) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MIN");

AFAICT this is an unpriv part of the syscall; and you can spam the log
without limits. Not good.

> + goto done;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> + ? attr->sched_util_min
> + : p->uclamp[UCLAMP_MIN].value;
> +
> + if (lower_bound == UCLAMP_NOT_VALID)
> + lower_bound = 0;
> + if (attr->sched_util_max < lower_bound ||
> + attr->sched_util_max > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MAX");
> + goto done;
> + }
> + group_id[UCLAMP_MAX] = result;
> + }
> +
> + /* Update each required clamp group */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + uc_se = &p->uclamp[UCLAMP_MIN];
> + uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
> +  uc_se, attr->sched_util_min);
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + uc_se = &p->uclamp[UCLAMP_MAX];
> + uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX],
> +  uc_se, attr->sched_util_max);
> + }
> +
> +done:
> + mutex_unlock(&uclamp_mutex);
> +
> + return result;
> +}


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-12 Thread Peter Zijlstra
On Wed, Sep 12, 2018 at 04:56:19PM +0100, Patrick Bellasi wrote:
> On 12-Sep 15:49, Peter Zijlstra wrote:
> > On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:

> > > +/**
> > > + * uclamp_map: reference counts a utilization "clamp value"
> > > + * @value:    the utilization "clamp value" required
> > > + * @se_count: the number of scheduling entities requiring the "clamp value"
> > > + * @se_lock:  serialize reference count updates by protecting se_count
> > 
> > Why do you have a spinlock to serialize a single value? Don't we have
> > atomics for that?
> 
> There are some code paths where it's used to protect clamp groups
> mapping and initialization, e.g.
> 
>   uclamp_group_get()
>   spin_lock()
>   // initialize clamp group (if required) and then...
>   se_count += 1
>   spin_unlock()
> 
> Almost all these paths are triggered from user-space and protected
> by a global uclamp_mutex, except for the fork/exit paths.
> 
> To serialize these paths I'm using the spinlock above, does it make
> sense? Can we use the global uclamp_mutex on fork/exit too?

OK, then your comment is misleading; it serializes both fields.

> One additional observation is that, if in the future we want to add a
> kernel space API (e.g. a driver asking for a new clamp value), maybe we
> will need to have a serialized non-sleeping uclamp_group_get() API?

No idea; but if you want to go all fancy you can replace the whole
uclamp_map thing with something like:

struct uclamp_map {
union {
struct {
unsigned long v : 10;
unsigned long c : BITS_PER_LONG - 10;
};
atomic_long_t s;
};
};

And use uclamp_map::c == 0 as unused (as per normal refcount
semantics) and atomic_long_cmpxchg() the whole thing using
uclamp_map::s.

> > > + *                          uclamp_maps is a matrix of
> > > + *          +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
> > > + *          |                                |
> > > + *          |                /---------------+---------------\
> > > + *          |               +------------+       +------------+
> > > + *          |  / UCLAMP_MIN | value      |       | value      |
> > > + *          |  |            | se_count   |.......| se_count   |
> > > + *          |  |            +------------+       +------------+
> > > + *          +--+            +------------+       +------------+
> > > + *             |            | value      |       | value      |
> > > + *             \ UCLAMP_MAX | se_count   |.......| se_count   |
> > > + *                          +-----^------+       +----^-------+
> > > + *                                |                   |
> > > + *                      uc_map =  +                   |
> > > + *      &uclamp_maps[clamp_id][0]                     +
> > > + *                                         clamp_value =
> > > + *                                            uc_map[group_id].value
> > > + */
> > > +static struct uclamp_map uclamp_maps[UCLAMP_CNT]
> > > + [CONFIG_UCLAMP_GROUPS_COUNT + 1]
> > > + ____cacheline_aligned_in_smp;
> > > +
> > 
> > I'm still completely confused by all this.
> > 
> > sizeof(uclamp_map) = 12
> > 
> > that array is 2*6=12 of those, so the whole thing is 144 bytes, which is
> > more than 2 (64-byte) cachelines.
> 
> This data structure is *not* used in the hot-path, that's why I did not
> care about fitting it exactly into a few cache lines.
> 
> It's used to map a user-space "clamp value" into a kernel-space "clamp
> group" when user-space:
>  - changes a task specific clamp value
>  - changes a cgroup clamp value
>  - a task forks/exits
> 
> I assume we can consider all those as "slow" code paths, is that correct?

Yep.

> > What's the purpose of that cacheline align statement?
> 
> In uclamp_maps, we still need to scan the array when a clamp value is
> changed from user-space, i.e. the cases reported above. Thus, that
> alignment is just to ensure that we minimize the number of cache lines
> used. Does that make sense?
> 
> Maybe that alignment is implicitly generated by the compiler?

It is not, but if it really is a slow path, we shouldn't care about
alignment.

> > Note that without that apparently superfluous lock, it would be 8*12 =
> > 96 bytes, which is 1.5 lines and would indeed suggest you default to
> > GROUP_COUNT=7 to fill 2 lines.
> 
> Yes, will check better if we can count on just the uclamp_mutex

Well, if we don't care about performance (slow path) then keeping the
lock is fine, just the comment and alignment are misleading.

> > Why are the min and max things torn up like that?  I'm fairly sure I
> > asked some of that last time; but the above comments only try to explain
> > what, not why.
> 
> We use that organization to speed up scanning for clamp values of the
> same clamp_id. That's more 

Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-12 Thread Patrick Bellasi
On 12-Sep 15:49, Peter Zijlstra wrote:
> On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
> > +/**
> > + * Utilization's clamp group
> > + *
> > + * A utilization clamp group maps a "clamp value" (value), i.e.
> > + * util_{min,max}, to a "clamp group index" (group_id).
> > + */
> > +struct uclamp_se {
> > +   unsigned int value;
> > +   unsigned int group_id;
> > +};
> 
> > +/**
> > + * uclamp_map: reference counts a utilization "clamp value"
> > + * @value:    the utilization "clamp value" required
> > + * @se_count: the number of scheduling entities requiring the "clamp value"
> > + * @se_lock:  serialize reference count updates by protecting se_count
> 
> Why do you have a spinlock to serialize a single value? Don't we have
> atomics for that?

There are some code paths where it's used to protect clamp groups
mapping and initialization, e.g.

  uclamp_group_get()
  spin_lock()
  // initialize clamp group (if required) and then...
  se_count += 1
  spin_unlock()

Almost all these paths are triggered from user-space and protected
by a global uclamp_mutex, except for the fork/exit paths.

To serialize these paths I'm using the spinlock above, does it make
sense? Can we use the global uclamp_mutex on fork/exit too?

One additional observation is that, if in the future we want to add a
kernel space API (e.g. a driver asking for a new clamp value), maybe we
will need to have a serialized non-sleeping uclamp_group_get() API?

> > + */
> > +struct uclamp_map {
> > +   int value;
> > +   int se_count;
> > +   raw_spinlock_t se_lock;
> > +};
> > +
> > +/**
> > + * uclamp_maps: maps each SE's "clamp value" into a CPU's "clamp group"
> > + *
> > + * Since only a limited number of different "clamp values" are supported, we
> > + * need to map each different clamp value into a "clamp group" (group_id) to
> > + * be used by the per-CPU accounting in the fast-path, when tasks are
> > + * enqueued and dequeued.
> > + * We also support different kinds of utilization clamping, min and max
> > + * utilization for example, each representing what we call a "clamp index"
> > + * (clamp_id).
> > + *
> > + * A matrix is thus required to map "clamp values" to "clamp groups"
> > + * (group_id), for each "clamp index" (clamp_id), where:
> > + * - rows are indexed by clamp_id and they collect the clamp groups for a
> > + *   given clamp index
> > + * - columns are indexed by group_id and they collect the clamp values which
> > + *   map to that clamp group
> > + *
> > + * Thus, the column index of a given (clamp_id, value) pair represents the
> > + * clamp group (group_id) used by the fast-path's per-CPU accounting.
> > + *
> > + * NOTE: first clamp group (group_id=0) is reserved for tracking of non
> > + * clamped tasks.  Thus we allocate one more slot than the value of
> > + * CONFIG_UCLAMP_GROUPS_COUNT.
> > + *
> > + * Here is the map layout and, right below, how entries are accessed by the
> > + * following code.
> > + *
> > + *                          uclamp_maps is a matrix of
> > + *          +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
> > + *          |                                |
> > + *          |                /---------------+---------------\
> > + *          |               +------------+       +------------+
> > + *          |  / UCLAMP_MIN | value      |       | value      |
> > + *          |  |            | se_count   |.......| se_count   |
> > + *          |  |            +------------+       +------------+
> > + *          +--+            +------------+       +------------+
> > + *             |            | value      |       | value      |
> > + *             \ UCLAMP_MAX | se_count   |.......| se_count   |
> > + *                          +-----^------+       +----^-------+
> > + *                                |                   |
> > + *                      uc_map =  +                   |
> > + *      &uclamp_maps[clamp_id][0]                     +
> > + *                                         clamp_value =
> > + *                                            uc_map[group_id].value
> > + */
> > +static struct uclamp_map uclamp_maps[UCLAMP_CNT]
> > +   [CONFIG_UCLAMP_GROUPS_COUNT + 1]
> > +   ____cacheline_aligned_in_smp;
> > +
> 
> I'm still completely confused by all this.
> 
> sizeof(uclamp_map) = 12
> 
> that array is 2*6=12 of those, so the whole thing is 144 bytes, which is
> more than 2 (64-byte) cachelines.

This data structure is *not* used in the hot-path, that's why I did not
care about fitting it exactly into a few cache lines.

It's used to map a user-space "clamp value" into a kernel-space "clamp
group" when user-space:
 - changes a task specific clamp value
 - changes a cgroup clamp value
 - a task forks/exits

I assume we can consider all those as "slow" code paths, is that correct?
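
Concretely, each of those paths boils down to a linear scan of one row
of the matrix, roughly like this (a reconstructed sketch, not the exact
patch code):

	static int uclamp_group_find(int clamp_id, unsigned int clamp_value)
	{
		struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
		int free_group_id = UCLAMP_NOT_VALID;
		int group_id;

		for (group_id = 0; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; group_id++) {
			/* Remember the first currently unused clamp group. */
			if (free_group_id == UCLAMP_NOT_VALID &&
			    !uc_map[group_id].se_count)
				free_group_id = group_id;
			/* Reuse a group already tracking this clamp value. */
			if (uc_map[group_id].value == clamp_value)
				return group_id;
		}
		if (free_group_id == UCLAMP_NOT_VALID)
			return -ENOSPC;
		return free_group_id;
	}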

At enqueue/dequeue time we use instead struct uclamp_cpu, introduced

Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-12 Thread Peter Zijlstra
On Tue, Aug 28, 2018 at 02:53:10PM +0100, Patrick Bellasi wrote:
> +/**
> + * Utilization's clamp group
> + *
> + * A utilization clamp group maps a "clamp value" (value), i.e.
> + * util_{min,max}, to a "clamp group index" (group_id).
> + */
> +struct uclamp_se {
> + unsigned int value;
> + unsigned int group_id;
> +};

> +/**
> + * uclamp_map: reference counts a utilization "clamp value"
> + * @value:    the utilization "clamp value" required
> + * @se_count: the number of scheduling entities requiring the "clamp value"
> + * @se_lock:  serialize reference count updates by protecting se_count

Why do you have a spinlock to serialize a single value? Don't we have
atomics for that?

> + */
> +struct uclamp_map {
> + int value;
> + int se_count;
> + raw_spinlock_t se_lock;
> +};
> +
> +/**
> + * uclamp_maps: maps each SE's "clamp value" into a CPU's "clamp group"
> + *
> + * Since only a limited number of different "clamp values" are supported, we
> + * need to map each different clamp value into a "clamp group" (group_id) to
> + * be used by the per-CPU accounting in the fast-path, when tasks are
> + * enqueued and dequeued.
> + * We also support different kinds of utilization clamping, min and max
> + * utilization for example, each representing what we call a "clamp index"
> + * (clamp_id).
> + *
> + * A matrix is thus required to map "clamp values" to "clamp groups"
> + * (group_id), for each "clamp index" (clamp_id), where:
> + * - rows are indexed by clamp_id and they collect the clamp groups for a
> + *   given clamp index
> + * - columns are indexed by group_id and they collect the clamp values which
> + *   map to that clamp group
> + *
> + * Thus, the column index of a given (clamp_id, value) pair represents the
> + * clamp group (group_id) used by the fast-path's per-CPU accounting.
> + *
> + * NOTE: first clamp group (group_id=0) is reserved for tracking of non
> + * clamped tasks.  Thus we allocate one more slot than the value of
> + * CONFIG_UCLAMP_GROUPS_COUNT.
> + *
> + * Here is the map layout and, right below, how entries are accessed by the
> + * following code.
> + *
> + *                          uclamp_maps is a matrix of
> + *          +------- UCLAMP_CNT by CONFIG_UCLAMP_GROUPS_COUNT+1 entries
> + *          |                                |
> + *          |                /---------------+---------------\
> + *          |               +------------+       +------------+
> + *          |  / UCLAMP_MIN | value      |       | value      |
> + *          |  |            | se_count   |.......| se_count   |
> + *          |  |            +------------+       +------------+
> + *          +--+            +------------+       +------------+
> + *             |            | value      |       | value      |
> + *             \ UCLAMP_MAX | se_count   |.......| se_count   |
> + *                          +-----^------+       +----^-------+
> + *                                |                   |
> + *                      uc_map =  +                   |
> + *      &uclamp_maps[clamp_id][0]                     +
> + *                                         clamp_value =
> + *                                            uc_map[group_id].value
> + */
> + */
> +static struct uclamp_map uclamp_maps[UCLAMP_CNT]
> + [CONFIG_UCLAMP_GROUPS_COUNT + 1]
> + ____cacheline_aligned_in_smp;
> +

I'm still completely confused by all this.

sizeof(uclamp_map) = 12

that array is 2*6=12 of those, so the whole thing is 144 bytes, which is
more than 2 (64-byte) cachelines. What's the purpose of that cacheline
align statement?

Note that without that apparently superfluous lock, it would be 8*12 =
96 bytes, which is 1.5 lines and would indeed suggest you default to
GROUP_COUNT=7 to fill 2 lines.
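
Spelled out, with the defaults as I read them (UCLAMP_CNT=2,
CONFIG_UCLAMP_GROUPS_COUNT=5, so 2*(5+1)=12 entries):

	with se_lock:    12 * 12 bytes = 144 bytes  ->  ~2.25 cachelines
	without se_lock: 12 *  8 bytes =  96 bytes  ->   1.5 cachelines
	GROUPS_COUNT=7:  2*(7+1) * 8 bytes = 128 bytes = exactly 2 lines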

Why are the min and max things torn up like that?  I'm fairly sure I
asked some of that last time; but the above comments only try to explain
what, not why.


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-12 Thread Patrick Bellasi
Hi Suren,

On 08-Sep 16:47, Suren Baghdasaryan wrote:

[...]

> > + * A clamp group is not free if there is at least one SE which is sing a 
> > clamp
> 
> typo in the sentence

Right, s/is sing/is using/

+1

[...]

> > +static int
> > +uclamp_group_find(int clamp_id, unsigned int clamp_value)
> > +{
> > +   struct uclamp_map *uc_map = &uclamp_maps[clamp_id][0];
> > +   int free_group_id = UCLAMP_NOT_VALID;
> > +   unsigned int group_id = 0;
> > +
> > +   for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
> > +   /* Keep track of first free clamp group */
> > +   if (uclamp_group_available(clamp_id, group_id)) {
> > +   if (free_group_id == UCLAMP_NOT_VALID)
> > +   free_group_id = group_id;
> > +   continue;
> > +   }
> 
> Not a big improvement but reordering the two conditions in this loop
> would avoid finding and recording free_group_id if the very first
> group is the one we are looking for.

Right, indeed with:

   uclamp_group_put()
   uclamp_group_reset()
   uclamp_group_init()

we always ensure that:

   uc_map[group_id].value == UCLAMP_NOT_VALID

for free groups. Thus, it should be safe to swap these two checks, and
we likely save a few instructions in the common case of non-clamped
tasks.

+1

I'll also get the chance to remove the two simple comments. ;)

> > +   /* Return index of first group with same clamp value */
> > +   if (uc_map[group_id].value == clamp_value)
> > +   return group_id;
> > +   }
> > +
> > +   if (likely(free_group_id != UCLAMP_NOT_VALID))
> > +   return free_group_id;
> > +
> > +   return -ENOSPC;
> > +}
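
Concretely, the swap being agreed on could look like this (a sketch of
the loop only, not the final patch):

	for ( ; group_id <= CONFIG_UCLAMP_GROUPS_COUNT; ++group_id) {
		/* Return the index of the first group with a matching value */
		if (uc_map[group_id].value == clamp_value)
			return group_id;
		/* Otherwise keep track of the first free clamp group */
		if (free_group_id == UCLAMP_NOT_VALID &&
		    uclamp_group_available(clamp_id, group_id))
			free_group_id = group_id;
	}

With the value match tested first, the common case of an
already-mapped clamp value returns without touching free_group_id at
all.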

[...]

> > +static inline void uclamp_group_put(int clamp_id, int group_id)

> Is the size and the number of invocations of this function small
> enough for inlining? Same goes for uclamp_group_get() and especially
> for __setscheduler_uclamp().

Right... yes, we could let the compiler do its job and remove inline
from these functions... at least for those not in the critical path.

+1

[...]

> > +   if (likely(uc_map[group_id].se_count))
> > +   uc_map[group_id].se_count -= 1;
> > +#ifdef SCHED_DEBUG
> > +   else {
> 
> nit: no need for braces
> 
> > +   WARN(1, "invalid SE clamp group [%d:%d] refcount\n",
> > +clamp_id, group_id);

Since the above statement is multi-line, we actually need the braces
to meet coding-style requirements.

> > +   }
> > +#endif

[...]

> > +static void uclamp_fork(struct task_struct *p, bool reset)
> > +{
> > +   int clamp_id;
> > +
> > +   if (unlikely(!p->sched_class->uclamp_enabled))
> > +   return;
> > +
> > +   for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
> > +   int next_group_id = p->uclamp[clamp_id].group_id;
> > +   struct uclamp_se *uc_se = &p->uclamp[clamp_id];
> 
> Might be easier to read if after the above assignment you use
> uc_se->xxx instead of p->uclamp[clamp_id].xxx in the code below.

Yes, that's actually the intent of the above assignment... but I
forgot a couple of usages! +1

> > +
> > +   if (unlikely(reset)) {
> > +   next_group_id = 0;
> > +   p->uclamp[clamp_id].value = uclamp_none(clamp_id);
> > +   }
> > +
> > +   p->uclamp[clamp_id].group_id = UCLAMP_NOT_VALID;
> > +   uclamp_group_get(clamp_id, next_group_id, uc_se,
> > +p->uclamp[clamp_id].value);
> > +   }
> > +}
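
For reference, a sketch of the loop with the suggested uc_se usage
applied throughout (hypothetical rewrite, same behaviour):

	for (clamp_id = 0; clamp_id < UCLAMP_CNT; ++clamp_id) {
		struct uclamp_se *uc_se = &p->uclamp[clamp_id];
		int next_group_id = uc_se->group_id;

		if (unlikely(reset)) {
			next_group_id = 0;
			uc_se->value = uclamp_none(clamp_id);
		}

		uc_se->group_id = UCLAMP_NOT_VALID;
		uclamp_group_get(clamp_id, next_group_id, uc_se, uc_se->value);
	}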

[...]

> Thanks,
> Suren.

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-08 Thread Suren Baghdasaryan
Hi Patrick!

On Tue, Aug 28, 2018 at 6:53 AM, Patrick Bellasi
 wrote:
> Utilization clamping requires each CPU to know which clamp values are
> assigned to tasks that are currently RUNNABLE on that CPU.
> Multiple tasks can be assigned the same clamp value and tasks with
> different clamp values can be concurrently active on the same CPU.
> Thus, a proper data structure is required to support a fast and
> efficient aggregation of the clamp values required by the currently
> RUNNABLE tasks.
>
> For this purpose we use a per-CPU array of reference counters,
> where each slot is used to account how many tasks requiring a certain
> clamp value are currently RUNNABLE on each CPU.
> Each clamp value corresponds to a "clamp index" which identifies the
> position within the array of reference counters.
>
>                                   :
>         (user-space changes)      :      (kernel space / scheduler)
>                                   :
>              SLOW PATH            :             FAST PATH
>                                   :
>     task_struct::uclamp::value    :    sched/core::enqueue/dequeue
>                                   :         cpufreq_schedutil
>                                   :
>    +----------------+    +--------------------+    +-------------------+
>    |      TASK      |    |     CLAMP GROUP    |    |    CPU CLAMPS     |
>    +----------------+    +--------------------+    +-------------------+
>    |                |    |   clamp_{min,max}  |    |  clamp_{min,max}  |
>    | util_{min,max} |    |      se_count      |    |    tasks count    |
>    +----------------+    +--------------------+    +-------------------+
>                                   :
>             +------------------>  :  +------------------->
>     group_id = map(clamp_value)   :   ref_count(group_id)
>                                   :
>                                   :
>
> Let's introduce the support to map tasks to "clamp groups".
> Specifically we introduce the required functions to translate a
> "clamp value" into a clamp's "group index" (group_id).
>
> Only a limited number of (different) clamp values are supported since:
> 1. there are usually only a few classes of workloads for which it makes
>sense to boost/limit to different frequencies,
>e.g. background vs foreground, interactive vs low-priority
> 2. it allows a simpler and more memory/time efficient tracking of
>the per-CPU clamp values in the fast path.
>
> The number of possible different clamp values is currently defined at
> compile time. Thus, setting a new clamp value for a task can result in
> a -ENOSPC error if it would exceed the maximum number of different
> clamp values supported.
>
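
A minimal illustration of the per-CPU refcounting idea described
above (the names and layout here are made up for the example, not the
series' actual data structures):

#include <linux/percpu.h>

struct clamp_slot {
	unsigned int value;	/* clamp value tracked by this slot */
	unsigned int tasks;	/* RUNNABLE tasks requiring that value */
};

static DEFINE_PER_CPU(struct clamp_slot,
		      clamp_slots[UCLAMP_CNT][CONFIG_UCLAMP_GROUPS_COUNT + 1]);

/* Fast path: enqueue/dequeue only bump the local CPU's counter */
static inline void clamp_slot_inc(int clamp_id, int group_id)
{
	this_cpu_inc(clamp_slots[clamp_id][group_id].tasks);
}

static inline void clamp_slot_dec(int clamp_id, int group_id)
{
	this_cpu_dec(clamp_slots[clamp_id][group_id].tasks);
}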
> Signed-off-by: Patrick Bellasi 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Paul Turner 
> Cc: Suren Baghdasaryan 
> Cc: Todd Kjos 
> Cc: Joel Fernandes 
> Cc: Juri Lelli 
> Cc: Quentin Perret 
> Cc: Dietmar Eggemann 
> Cc: Morten Rasmussen 
> Cc: linux-kernel@vger.kernel.org
> Cc: linux...@vger.kernel.org
>
> ---
> Changes in v4:
>  Message-ID: <20180814112509.gb2...@codeaurora.org>
>  - add uclamp_exit_task() to release clamp refcount from do_exit()
>  Message-ID: <20180816133249.GA2964@e110439-lin>
>  - keep the WARN but beautify that code a bit
>  Message-ID: <20180413082648.gp4...@hirez.programming.kicks-ass.net>
>  - move uclamp_enabled at the top of sched_class to keep it on the same
>cache line of other main wakeup time callbacks
>  Others:
>  - init uclamp for the init_task and refcount its clamp groups
>  - add uclamp specific fork time code into uclamp_fork
>  - add support for SCHED_FLAG_RESET_ON_FORK
>default clamps are now set for init_task and inherited/reset at
>fork time (when the flag is set for the parent)
>  - enable uclamp only for FAIR tasks, RT class will be enabled only
>by a following patch which also integrates the class with schedutil
>  - define uclamp_maps cacheline_aligned_in_smp
>  - in uclamp_group_get() ensure to include uclamp_group_available() and
>uclamp_group_init() into the atomic section defined by:
>   uc_map[next_group_id].se_lock
>  - do not use mutex_lock(_mutex) in uclamp_exit_task
>which is also not needed since refcounting is already guarded by
>the uc_map[group_id].se_lock spinlock
>  - rebased on v4.19-rc1
>
> Changes in v3:
>  Message-ID: 
> 
>  - rename UCLAMP_NONE into UCLAMP_NOT_VALID
>  - remove unnecessary checks in uclamp_group_find()
>  - add WARN on unlikely un-referenced decrement in uclamp_group_put()
>  - make __setscheduler_uclamp() able to set just one clamp value
>  - make __setscheduler_uclamp() fail if both clamps are required but
>there are no clamp groups available for one of them
>  - remove uclamp_group_find() from uclamp_group_get() which now takes a
>group_id as a parameter
>  Others:
>  - rebased on tip/sched/core
> Changes in v2:
>  - rebased on v4.18-rc4
>  - set UCLAMP_GROUPS_COUNT=2 by default
>which allows to fit all 

Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-06 Thread Juri Lelli
On 06/09/18 14:48, Patrick Bellasi wrote:
> Hi Juri!
> 
> On 05-Sep 12:45, Juri Lelli wrote:
> > Hi,
> > 
> > On 28/08/18 14:53, Patrick Bellasi wrote:
> > 
> > [...]
> > 
> > >  static inline int __setscheduler_uclamp(struct task_struct *p,
> > >   const struct sched_attr *attr)
> > >  {
> > > - if (attr->sched_util_min > attr->sched_util_max)
> > > - return -EINVAL;
> > > - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> > > - return -EINVAL;
> > > + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > > + int lower_bound, upper_bound;
> > > + struct uclamp_se *uc_se;
> > > + int result = 0;
> > >  
> > > - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> > > - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> > > + mutex_lock(&uclamp_mutex);
> > 
> > This is going to get called from an rcu_read_lock() section, which is a
> > no-go for using mutexes:
> > 
> >  sys_sched_setattr ->
> >rcu_read_lock()
> >...
> >sched_setattr() ->
> >  __sched_setscheduler() ->
> >...
> >__setscheduler_uclamp() ->
> >  ...
> >  mutex_lock()
> 
> Right, great catch, thanks!
> 
> > Guess you could fix the issue by getting the task struct after
> > find_process_by_pid() in sys_sched_setattr() and then calling
> > sched_setattr() after rcu_read_unlock() (putting the task struct at
> > the end). Peter actually suggested this mod to solve a different
> > issue.
> 
> I guess you mean something like this?
> 
> ---8<---
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5792,10 +5792,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct 
> sched_attr __user *, uattr,
> rcu_read_lock();
> retval = -ESRCH;
> p = find_process_by_pid(pid);
> -   if (p != NULL)
> -   retval = sched_setattr(p, &attr);
> +   if (likely(p))
> +   get_task_struct(p);
> rcu_read_unlock();
>  
> +   if (likely(p)) {
> +   retval = sched_setattr(p, &attr);
> +   put_task_struct(p);
> +   }
> +
> return retval;
>  }
> ---8<---

This should do the job, yes.


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-06 Thread Patrick Bellasi
On 06-Sep 10:17, Juri Lelli wrote:
> On 28/08/18 14:53, Patrick Bellasi wrote:
> 
> [...]
> 
> >  static inline int __setscheduler_uclamp(struct task_struct *p,
> > const struct sched_attr *attr)
> >  {
> > -   if (attr->sched_util_min > attr->sched_util_max)
> > -   return -EINVAL;
> > -   if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> > -   return -EINVAL;
> > +   int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > +   int lower_bound, upper_bound;
> > +   struct uclamp_se *uc_se;
> > +   int result = 0;
> >  
> > -   p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> > -   p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> > +   mutex_lock(&uclamp_mutex);
> >  
> > -   return 0;
> > +   /* Find a valid group_id for each required clamp value */
> > +   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> > +   upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> > +   ? attr->sched_util_max
> > +   : p->uclamp[UCLAMP_MAX].value;
> > +
> > +   if (upper_bound == UCLAMP_NOT_VALID)
> > +   upper_bound = SCHED_CAPACITY_SCALE;
> > +   if (attr->sched_util_min > upper_bound) {
> > +   result = -EINVAL;
> > +   goto done;
> > +   }
> > +
> > +   result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> > +   if (result == -ENOSPC) {
> > +   pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> > +   goto done;
> > +   }
> > +   group_id[UCLAMP_MIN] = result;
> > +   }
> > +   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> > +   lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> > +   ? attr->sched_util_min
> > +   : p->uclamp[UCLAMP_MIN].value;
> > +
> > +   if (lower_bound == UCLAMP_NOT_VALID)
> > +   lower_bound = 0;
> > +   if (attr->sched_util_max < lower_bound ||
> > +   attr->sched_util_max > SCHED_CAPACITY_SCALE) {
> > +   result = -EINVAL;
> > +   goto done;
> > +   }
> > +
> > +   result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max);
> > +   if (result == -ENOSPC) {
> > +   pr_err(UCLAMP_ENOSPC_FMT, "MAX");
> > +   goto done;
> > +   }
> > +   group_id[UCLAMP_MAX] = result;
> > +   }
> 
> Don't you have to reset result to 0 here (it seems what follows cannot
> fail anymore)? Otherwise this function will return the latest
> uclamp_group_find() return value, which will be interpreted as an
> error if not 0.

Yes, weird... it used to work in my tests just because I return from
the caller on !0 and:
 - that's just good enough to set my uclamps
 - my current rt-app integration does not check the syscall return value
 - my tests never request the clamp_group mapped on group_id=0

:(

Will fix this, the rt-app integration, and the tests!
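
For reference, the minimal fix is probably just clearing the stale
lookup value once the last fallible check has passed (a sketch against
the quoted code):

	/* ... group_id[] lookups above, each of which may set result ... */

	/* Nothing below can fail: don't leak the last lookup's value */
	result = 0;

	/* Update each required clamp group */
	if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
		uc_se = &p->uclamp[UCLAMP_MIN];
		uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
				 uc_se, attr->sched_util_min);
	}

so the success path falls through and returns 0.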

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-06 Thread Patrick Bellasi
Hi Juri!

On 05-Sep 12:45, Juri Lelli wrote:
> Hi,
> 
> On 28/08/18 14:53, Patrick Bellasi wrote:
> 
> [...]
> 
> >  static inline int __setscheduler_uclamp(struct task_struct *p,
> > const struct sched_attr *attr)
> >  {
> > -   if (attr->sched_util_min > attr->sched_util_max)
> > -   return -EINVAL;
> > -   if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> > -   return -EINVAL;
> > +   int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> > +   int lower_bound, upper_bound;
> > +   struct uclamp_se *uc_se;
> > +   int result = 0;
> >  
> > -   p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> > -   p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> > +   mutex_lock(&uclamp_mutex);
> 
> This is going to get called from an rcu_read_lock() section, which is a
> no-go for using mutexes:
> 
>  sys_sched_setattr ->
>rcu_read_lock()
>...
>sched_setattr() ->
>  __sched_setscheduler() ->
>...
>__setscheduler_uclamp() ->
>  ...
>mutex_lock()

Right, great catch, thanks!

> Guess you could fix the issue by getting the task struct after
> find_process_by_pid() in sys_sched_setattr() and then calling
> sched_setattr() after rcu_read_unlock() (putting the task struct at
> the end). Peter actually suggested this mod to solve a different
> issue.

I guess you mean something like this?

---8<---
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5792,10 +5792,15 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct 
sched_attr __user *, uattr,
rcu_read_lock();
retval = -ESRCH;
p = find_process_by_pid(pid);
-   if (p != NULL)
> -   retval = sched_setattr(p, &attr);
+   if (likely(p))
+   get_task_struct(p);
rcu_read_unlock();
 
+   if (likely(p)) {
> +   retval = sched_setattr(p, &attr);
+   put_task_struct(p);
+   }
+
return retval;
 }
---8<---

Cheers,
Patrick

-- 
#include 

Patrick Bellasi


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-06 Thread Juri Lelli
On 28/08/18 14:53, Patrick Bellasi wrote:

[...]

>  static inline int __setscheduler_uclamp(struct task_struct *p,
>   const struct sched_attr *attr)
>  {
> - if (attr->sched_util_min > attr->sched_util_max)
> - return -EINVAL;
> - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> - return -EINVAL;
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + int lower_bound, upper_bound;
> + struct uclamp_se *uc_se;
> + int result = 0;
>  
> - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> + mutex_lock(&uclamp_mutex);
>  
> - return 0;
> + /* Find a valid group_id for each required clamp value */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + upper_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX)
> + ? attr->sched_util_max
> + : p->uclamp[UCLAMP_MAX].value;
> +
> + if (upper_bound == UCLAMP_NOT_VALID)
> + upper_bound = SCHED_CAPACITY_SCALE;
> + if (attr->sched_util_min > upper_bound) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MIN, attr->sched_util_min);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MIN");
> + goto done;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + lower_bound = (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN)
> + ? attr->sched_util_min
> + : p->uclamp[UCLAMP_MIN].value;
> +
> + if (lower_bound == UCLAMP_NOT_VALID)
> + lower_bound = 0;
> + if (attr->sched_util_max < lower_bound ||
> + attr->sched_util_max > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto done;
> + }
> +
> + result = uclamp_group_find(UCLAMP_MAX, attr->sched_util_max);
> + if (result == -ENOSPC) {
> + pr_err(UCLAMP_ENOSPC_FMT, "MAX");
> + goto done;
> + }
> + group_id[UCLAMP_MAX] = result;
> + }

Don't you have to reset result to 0 here (it seems what follows cannot
fail anymore)? Otherwise this function will return the latest
uclamp_group_find() return value, which will be interpreted as an
error if not 0.

> +
> + /* Update each required clamp group */
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> + uc_se = &p->uclamp[UCLAMP_MIN];
> + uclamp_group_get(UCLAMP_MIN, group_id[UCLAMP_MIN],
> +  uc_se, attr->sched_util_min);
> + }
> + if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> + uc_se = &p->uclamp[UCLAMP_MAX];
> + uclamp_group_get(UCLAMP_MAX, group_id[UCLAMP_MAX],
> +  uc_se, attr->sched_util_max);
> + }
> +
> +done:
> + mutex_unlock(&uclamp_mutex);
> +
> + return result;
> +}
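
For context, a sketch of how a task could request these clamps from
user space, assuming the sched_attr extension and the flag names this
series proposes (the struct layout and flag values below are
assumptions mirroring the posted uapi, not a released interface):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

/* mirrors the extended uapi struct posted with this series */
struct sched_attr {
	unsigned int size;
	unsigned int sched_policy;
	unsigned long long sched_flags;
	int sched_nice;
	unsigned int sched_priority;
	unsigned long long sched_runtime;
	unsigned long long sched_deadline;
	unsigned long long sched_period;
	unsigned int sched_util_min;	/* added by this series */
	unsigned int sched_util_max;	/* added by this series */
};

#define SCHED_FLAG_UTIL_CLAMP_MIN	0x20	/* assumed value */
#define SCHED_FLAG_UTIL_CLAMP_MAX	0x40	/* assumed value */

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_flags = SCHED_FLAG_UTIL_CLAMP_MIN |
			   SCHED_FLAG_UTIL_CLAMP_MAX;
	attr.sched_util_min = 128;	/* boost to at least 128/1024 capacity */
	attr.sched_util_max = 512;	/* cap at 512/1024 capacity */

	if (syscall(SYS_sched_setattr, getpid(), &attr, 0))
		perror("sched_setattr");
	return 0;
}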

Best,

- Juri


Re: [PATCH v4 02/16] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-09-05 Thread Juri Lelli
Hi,

On 28/08/18 14:53, Patrick Bellasi wrote:

[...]

>  static inline int __setscheduler_uclamp(struct task_struct *p,
>   const struct sched_attr *attr)
>  {
> - if (attr->sched_util_min > attr->sched_util_max)
> - return -EINVAL;
> - if (attr->sched_util_max > SCHED_CAPACITY_SCALE)
> - return -EINVAL;
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + int lower_bound, upper_bound;
> + struct uclamp_se *uc_se;
> + int result = 0;
>  
> - p->uclamp[UCLAMP_MIN] = attr->sched_util_min;
> - p->uclamp[UCLAMP_MAX] = attr->sched_util_max;
> + mutex_lock(&uclamp_mutex);

This is going to get called from an rcu_read_lock() section, which is a
no-go for using mutexes:

 sys_sched_setattr ->
   rcu_read_lock()
   ...
   sched_setattr() ->
 __sched_setscheduler() ->
   ...
   __setscheduler_uclamp() ->
 ...
 mutex_lock()

Guess you could fix the issue by getting the task struct after
find_process_by_pid() in sys_sched_setattr() and then calling
sched_setattr() after rcu_read_unlock() (putting the task struct at
the end). Peter actually suggested this mod to solve a different
issue.

Best,

- Juri

