Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-06 Thread Valentin Schneider
On 06/09/2019 18:10, Parth Shah wrote:
> Right, CPU capacity can solve the problem of indicating the thermal throttle 
> to the scheduler.
> AFAIU, the patchset from Thara changes CPU capacity to reflect Thermal 
> headroom of the CPU.
> This is a nice mitigation but,
> 1. Sometimes a single task is responsible for the Thermal heatup of the core, 
> reducing the
>CPU capacity of all the CPUs in the core is not optimal when just moving 
> such single
>task to other core can allow us to remain within thermal headroom. This is 
> important
>for the servers especially where there are upto 8 threads.> 2. Given the 
> implementation in the patches and its integration with EAS, it seems difficult
>to adapt to servers, where CPU capacity itself is in doubt.
>https://lkml.org/lkml/2019/5/15/1402
> 

I'd nuance this to *SMT* capacity (which isn't just servers). The thing is
that it's difficult to come up with a sensible scheme to describe the base
capacity of a single logical CPU. But yeah, valid point.

>>
>> For active balance, we actually already have a condition that moves a task
>> to a less capacity-pressured CPU (although it is somewhat specific). So if
>> thermal pressure follows that task (e.g. it's doing tons of vector/float),
>> it will be rotated around.
> 
> Agree. But this should break in certain conditions like when we have multiple 
> tasks
> in a core with almost equal utilization among which one is just doing vector 
> operations.
> LB can pick and move any task with equal probability if the capacity is 
> reduced here.
> 

Right, if/when we get things like per-unit signals (wasn't there something
about tracking AVX a few months back?) then we'll be able to make
more informed decisions, for now we'll need some handholding (read: task
classification).

>>
>> However there should be a point made on latency vs throughput. If you
>> care about latency you probably do not want to active balance your task. If
> 
> Can you please elaborate on why not to consider active balance for latency 
> sensitive tasks?
> Because, sometimes finding a thermally cool core is beneficial when Turbo 
> frequency
> range is around 20% above rated ones.
> 

This goes back to my reply to Patrick further up the thread.

Right now active balance can happen just because we've been imbalanced for
some time and repeatedly failed to migrate anything. After 3 (IIRC) successive
failed attempts, we'll active balance the running task of the remote rq we
decided was busiest.

If that happens to be a latency sensitive task, that's not great - active
balancing means stopping that task's execution, so we're going to add some
latency to this latency-sensitive task. My proposal was to further ratelimit
active balance (e.g. require more failed attempts) when the task that would be
preempted is latency-sensitive.

My point is: if that task is doing fine where it is, why preempt it? That's
just introducing latency IMO (keeping in mind that those balance attempts
could happen despite not having any thermal pressure).

If you care about performance (e.g. a minimum level of throughput), to me
that is a separate (though perhaps not entirely distinct) property.

>> you care about throughput, it should be specified in some way (util-clamp
>> says hello!).
>>
> 
> yes I do care for latency and throughput both. :-)

Don't we all!

> but I'm wondering how uclamp can solve the problem for throughput.
> If I make the thermally hot tasks to appear bigger than other tasks then 
> reducing
> CPU capacity can allow such tasks to move around the chip.
> But this will require the utilization value to be relatively large compared 
> to the other
> tasks in the core. Or other task's uclamp.max can be lowered to make such 
> task rotate.
> If I got it right, then this will be a difficult UCLAMP usecase from user 
> perspective, right?
> I feel like I'm missing something here.
> 

Hmm perhaps I was jumping the gun here. What I was getting to is if you have
something like misfit that migrates tasks to CPUs of higher capacity than the
one they are on, you could use uclamp to flag them.

You could translate your throughput requirement as a uclamp.min of e.g. 80%,
and if the CPU capacity goes below that (or close within a margin) then you'd
try to migrate the task to a CPU of higher capacity (i.e. not or less 
thermally pressured).

This doesn't have to involve your less throughput-sensitive tasks, since you
would only tag and take action for your throughput-sensitive tasks.


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-06 Thread Parth Shah



On 9/6/19 7:43 PM, Valentin Schneider wrote:
> On 06/09/2019 13:45, Parth Shah wrote:> 
>> I guess there is some usecase in case of thermal throttling.
>> If a task is heating up the core then in ideal scenarios POWER systems 
>> throttle
>> down to rated frequency.
>> In such case, if the task is latency sensitive (min latency nice), we can 
>> move the
>> task around the chip to heat up the chip uniformly allowing me to gain more 
>> performance
>> with sustained higher frequency.
>> With this, we will require the help from active load balancer and 
>> latency-nice
>> classification on per task and/or group basis.
>>
>> Hopefully, this might be useful for other arch as well, right?
>>
> 
> Most of the functionality is already there, we're only really missing thermal
> pressure awareness. There was [1] but it seems to have died.
> 
> 
> At least with CFS load balancing, if thermal throttling is correctly
> reflected as a CPU capacity reduction you will tend to move things away from
> that CPU, since load is balanced over capacities.
> 

Right, CPU capacity can solve the problem of indicating the thermal throttle to 
the scheduler.
AFAIU, the patchset from Thara changes CPU capacity to reflect Thermal headroom 
of the CPU.
This is a nice mitigation but,
1. Sometimes a single task is responsible for the Thermal heatup of the core, 
reducing the
   CPU capacity of all the CPUs in the core is not optimal when just moving 
such single
   task to other core can allow us to remain within thermal headroom. This is 
important
   for the servers especially where there are upto 8 threads.
2. Given the implementation in the patches and its integration with EAS, it 
seems difficult
   to adapt to servers, where CPU capacity itself is in doubt.
   https://lkml.org/lkml/2019/5/15/1402

> 
> For active balance, we actually already have a condition that moves a task
> to a less capacity-pressured CPU (although it is somewhat specific). So if
> thermal pressure follows that task (e.g. it's doing tons of vector/float),
> it will be rotated around.

Agree. But this should break in certain conditions like when we have multiple 
tasks
in a core with almost equal utilization among which one is just doing vector 
operations.
LB can pick and move any task with equal probability if the capacity is reduced 
here.

> 
> However there should be a point made on latency vs throughput. If you
> care about latency you probably do not want to active balance your task. If

Can you please elaborate on why not to consider active balance for latency 
sensitive tasks?
Because, sometimes finding a thermally cool core is beneficial when Turbo 
frequency
range is around 20% above rated ones.

> you care about throughput, it should be specified in some way (util-clamp
> says hello!).
> 

yes I do care for latency and throughput both. :-)
but I'm wondering how uclamp can solve the problem for throughput.
If I make the thermally hot tasks to appear bigger than other tasks then 
reducing
CPU capacity can allow such tasks to move around the chip.
But this will require the utilization value to be relatively large compared to 
the other
tasks in the core. Or other task's uclamp.max can be lowered to make such task 
rotate.
If I got it right, then this will be a difficult UCLAMP usecase from user 
perspective, right?
I feel like I'm missing something here.

> It sort of feels like you'd want an extension of misfit migration (salesman
> hat goes on from here) - misfit moves tasks that are CPU bound (IOW their
> util is >= 80% of the CPU capacity) to CPUs of higher capacity. It's only
> enabled for systems with asymmetric capacities, but could be enabled globally
> for "dynamically-created asymmetric capacities" (IOW RT/IRQ/thermal pressure
> on SMP systems).> On top of that, if we make misfit consider e.g. uclamp.min 
> (I don't think
> that's already the case), then you have your throughput knob to have *some* 
> designated tasks move away from (thermal & else) pressure. 
> 
> 
> [1]: 
> https://lore.kernel.org/lkml/1555443521-579-1-git-send-email-thara.gopin...@linaro.org/
> 

Thanks,
Parth



Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-06 Thread Vincent Guittot
On Fri, 6 Sep 2019 at 16:13, Valentin Schneider
 wrote:
>
> On 06/09/2019 13:45, Parth Shah wrote:>
> > I guess there is some usecase in case of thermal throttling.
> > If a task is heating up the core then in ideal scenarios POWER systems 
> > throttle
> > down to rated frequency.
> > In such case, if the task is latency sensitive (min latency nice), we can 
> > move the
> > task around the chip to heat up the chip uniformly allowing me to gain more 
> > performance
> > with sustained higher frequency.
> > With this, we will require the help from active load balancer and 
> > latency-nice
> > classification on per task and/or group basis.
> >
> > Hopefully, this might be useful for other arch as well, right?
> >
>
> Most of the functionality is already there, we're only really missing thermal
> pressure awareness. There was [1] but it seems to have died.

Thara still works on it but she has been sidetracked on other
activities and  It takes more time than expected to run all tests with
different averaging window and process the results
>
>
> At least with CFS load balancing, if thermal throttling is correctly
> reflected as a CPU capacity reduction you will tend to move things away from
> that CPU, since load is balanced over capacities.
>
>
> For active balance, we actually already have a condition that moves a task
> to a less capacity-pressured CPU (although it is somewhat specific). So if
> thermal pressure follows that task (e.g. it's doing tons of vector/float),
> it will be rotated around.
>
> However there should be a point made on latency vs throughput. If you
> care about latency you probably do not want to active balance your task. If
> you care about throughput, it should be specified in some way (util-clamp
> says hello!).
>
> It sort of feels like you'd want an extension of misfit migration (salesman
> hat goes on from here) - misfit moves tasks that are CPU bound (IOW their
> util is >= 80% of the CPU capacity) to CPUs of higher capacity. It's only
> enabled for systems with asymmetric capacities, but could be enabled globally
> for "dynamically-created asymmetric capacities" (IOW RT/IRQ/thermal pressure
> on SMP systems).
>
> On top of that, if we make misfit consider e.g. uclamp.min (I don't think
> that's already the case), then you have your throughput knob to have *some*
> designated tasks move away from (thermal & else) pressure.
>
>
> [1]: 
> https://lore.kernel.org/lkml/1555443521-579-1-git-send-email-thara.gopin...@linaro.org/


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-06 Thread Valentin Schneider
On 06/09/2019 13:45, Parth Shah wrote:> 
> I guess there is some usecase in case of thermal throttling.
> If a task is heating up the core then in ideal scenarios POWER systems 
> throttle
> down to rated frequency.
> In such case, if the task is latency sensitive (min latency nice), we can 
> move the
> task around the chip to heat up the chip uniformly allowing me to gain more 
> performance
> with sustained higher frequency.
> With this, we will require the help from active load balancer and latency-nice
> classification on per task and/or group basis.
> 
> Hopefully, this might be useful for other arch as well, right?
> 

Most of the functionality is already there, we're only really missing thermal
pressure awareness. There was [1] but it seems to have died.


At least with CFS load balancing, if thermal throttling is correctly
reflected as a CPU capacity reduction you will tend to move things away from
that CPU, since load is balanced over capacities.


For active balance, we actually already have a condition that moves a task
to a less capacity-pressured CPU (although it is somewhat specific). So if
thermal pressure follows that task (e.g. it's doing tons of vector/float),
it will be rotated around.

However there should be a point made on latency vs throughput. If you
care about latency you probably do not want to active balance your task. If
you care about throughput, it should be specified in some way (util-clamp
says hello!).

It sort of feels like you'd want an extension of misfit migration (salesman
hat goes on from here) - misfit moves tasks that are CPU bound (IOW their
util is >= 80% of the CPU capacity) to CPUs of higher capacity. It's only
enabled for systems with asymmetric capacities, but could be enabled globally
for "dynamically-created asymmetric capacities" (IOW RT/IRQ/thermal pressure
on SMP systems).

On top of that, if we make misfit consider e.g. uclamp.min (I don't think
that's already the case), then you have your throughput knob to have *some* 
designated tasks move away from (thermal & else) pressure. 


[1]: 
https://lore.kernel.org/lkml/1555443521-579-1-git-send-email-thara.gopin...@linaro.org/


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-06 Thread Parth Shah



On 9/5/19 6:37 PM, Patrick Bellasi wrote:
> 
> On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...
> 
>> On 05/09/2019 12:18, Patrick Bellasi wrote:
 There's a few things wrong there; I really feel that if we call it nice,
 it should be like nice. Otherwise we should call it latency-bias and not
 have the association with nice to confuse people.

 Secondly; the default should be in the middle of the range. Naturally
 this would be a signed range like nice [-(x+1),x] for some x. but if you
 want [0,1024], then the default really should be 512, but personally I
 like 0 better as a default, in which case we need negative numbers.

 This is important because we want to be able to bias towards less
 importance to (tail) latency as well as more importantance to (tail)
 latency.

 Specifically, Oracle wants to sacrifice (some) latency for throughput.
 Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>>>
>>> Right, we have this dualism to deal with and current mainline behaviour
>>> is somehow in the middle.
>>>
>>> BTW, the FB requirement is the same we have in Android.
>>> We want some CFS tasks to have very small latency and a low chance
>>> to be preempted by the wake-up of less-important "background" tasks.
>>>
>>> I'm not totally against the usage of a signed range, but I'm thinking
>>> that since we are introducing a new (non POSIX) concept we can get the
>>> chance to make it more human friendly.
>>>
>>> Give the two extremes above, would not be much simpler and intuitive to
>>> have 0 implementing the FB/Android (no latency) case and 1024 the
>>> (max latency) Oracle case?
>>>
>>
>> For something like latency-, I don't see the point of having
>> such a wide range. The nice range is probably more than enough - and before
>> even bothering about the range, we should probably agree on what the range
>> should represent.
>>
>> If it's niceness, I read it as: positive latency-nice value means we're
>> nice to latency, means we reduce it. So the further up you go, the more you
>> restrict your wakeup scan. I think it's quite easy to map that into the
>> code: current behaviour at 0, with a decreasing scan mask size as we go
>> towards +19. I don't think anyone needs 512 steps to tune this.
>>
>> I don't know what logic we'd follow for negative values though. Maybe
>> latency-nice -20 means always going through the slowpath, but what of the
>> intermediate values?
> 
> Yep, I think so fare we are all converging towards the idea to use the
> a signed range. Regarding the range itself, yes: 1024 looks very
> oversized, but +-20 is still something which leave room for a bit of
> flexibility and it also better matches the idea that we don't want to
> "enumerate behaviours" but just expose a knob. To map certain "bias" we
> could benefit from a slightly larger range.
> 
>> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
> 
> For the wakeup path there is also the TurboSched proposal by Parth:
> 
>Message-ID: <20190725070857.6639-1-pa...@linux.ibm.com> 
>https://lore.kernel.org/lkml/20190725070857.6639-1-pa...@linux.ibm.com/
> 
> we should keep in mind.
> 
>> applied elsewhere (e.g. load-balance, something like task_hot() and its
>> use of sysctl_sched_migration_cost).
> 
> For LB can you come up with some better description of what usages you
> see could benefit from a "per task" or "per task-group" latency niceness?
> 

I guess there is some usecase in case of thermal throttling.
If a task is heating up the core then in ideal scenarios POWER systems throttle
down to rated frequency.
In such case, if the task is latency sensitive (min latency nice), we can move 
the
task around the chip to heat up the chip uniformly allowing me to gain more 
performance
with sustained higher frequency.
With this, we will require the help from active load balancer and latency-nice
classification on per task and/or group basis.

Hopefully, this might be useful for other arch as well, right?

> Best,
> Patrick
> 

Thanks,
Parth



Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-06 Thread Parth Shah



On 9/5/19 3:15 PM, Patrick Bellasi wrote:
> 
> On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...
> 
>> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>>
>> *sigh*, no. We start with a normal per task attribute, and then later,
>> if it is needed and makes sense, we add it to cgroups.
> 
> FWIW, to add on top of what Peter says, we used this same approach for
> uclamp and it proved to be a very effective way to come up with a good
> design. General principles have been:
> 
>  - a system wide API [1] (under /proc/sys/kernel/sched_*) defines
>default values for all tasks affected by that feature.
>This interface has to define also upper bounds for task specific
>values. Thus, in the case of latency-nice, it should be set by
>default to the MIN value, since that's the current mainline
>behaviour: all tasks are latency sensitive.
> 
>  - a per-task API [2] (via the sched_setattr() syscall) can be used to
>relax the system wide setting thus implementing a "nice" policy.
> 
>  - a per-taskgroup API [3] (via cpu controller's attributes) can be used
>to relax the system-wide settings and restrict the per-task API.
> 
> The above features are worth to be added in that exact order.
> 
>> Also, your Changelog fails on pretty much every point. It doesn't
>> explain why, it doesn't describe anything and so on.
> 
> On the description side, I guess it's worth to mention somewhere to
> which scheduling classes this feature can be useful for. It's worth to
> mention that it can apply only to:
> 
>  - CFS tasks: for example, at wakeup time a task with an high
>latency-nice should avoid to preempt a low latency-nice task.
>Maybe by mapping the latency nice value into proper vruntime
>normalization value?
> 

If I got this correct, does this also mean that a task's latency-nice
will be mapped to prio/nice.
i.e, task with min-latency-nice will have highest priority?

>  - RT tasks: for example, at wakeup time a task with an high
>latency-nice value could avoid to preempt a CFS task.
> 

So, will this make CFS task to precede RT task?
and cause priority inversion?

> I'm sure there will be discussion about some of these features, that's
> why it's important in the proposal presentation to keep a well defined
> distinction among the "mechanisms and API" and how we use the new
> concept to "bias" some scheduler policies.
> 
>> From just reading the above, I would expect it to have the range
>> [-20,19] just like normal nice. Apparently this is not so.
> 
> Regarding the range for the latency-nice values, I guess we have two
> options:
> 
>   - [-20..19], which makes it similar to priorities
>   downside: we quite likely end up with a kernel space representation
>   which does not match the user-space one, e.g. look at
>   task_struct::prio.
> 
>   - [0..1024], which makes it more similar to a "percentage"
> 
> Being latency-nice a new concept, we are not constrained by POSIX and
> IMHO the [0..1024] scale is a better fit.
> 
> That will translate into:
> 
>   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>   policies are disabled and we wakeup up as fast as possible
> 
>   latency-nice=1024 : maximum niceness, where for example we can imaging
>   to turn switch a CFS task to be SCHED_IDLE?
> 
> Best,
> Patrick
> 
> [1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
> [2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support 
> utilization clamping")
> [3] 5 patches in today's tip/sched/core up to:
> commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp 
> changes")
> 



Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-06 Thread Parth Shah



On 9/5/19 3:41 PM, Patrick Bellasi wrote:
> 
> On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...
> 
>> On 9/4/19 11:02 PM, Tim Chen wrote:
>>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
 Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
 "latency-nice" which is shared by all the threads in that Cgroup.
>>>
>>>
>>> Subhra,
>>>
>>> Thanks for posting the patchset.  Having a latency nice hint
>>> is useful beyond idle load balancing.  I can think of other
>>> application scenarios, like scheduling batch machine learning AVX 512
>>> processes with latency sensitive processes.  AVX512 limits the frequency
>>> of the CPU and it is best to avoid latency sensitive task on the
>>> same core with AVX512.  So latency nice hint allows the scheduler
>>> to have a criteria to determine the latency sensitivity of a task
>>> and arrange latency sensitive tasks away from AVX512 tasks.
>>>
>>
>>
>> Hi Tim and Subhra,
>>
>> This patchset seems to be interesting for my TurboSched patches as well
>> where I try to pack jitter tasks on fewer cores to get higher Turbo 
>> Frequencies.
>> Well, the problem I face is that we sometime end up putting multiple jitter 
>> tasks on a core
>> running some latency sensitive application which may see performance 
>> degradation.
>> So my plan was to classify such tasks to be latency sensitive thereby 
>> hinting the load
>> balancer to not put tasks on such cores.
>>
>> TurboSched: https://lkml.org/lkml/2019/7/25/296
>>
>>> You configure the latency hint on a cgroup basis.
>>> But I think not all tasks in a cgroup necessarily have the same
>>> latency sensitivity.
>>>
>>> For example, I can see that cgroup can be applied on a per user basis,
>>> and the user could run different tasks that have different latency 
>>> sensitivity.
>>> We may also need a way to configure latency sensitivity on a per task basis 
>>> instead on
>>> a per cgroup basis.
>>>
>>
>> AFAIU, the problem defined above intersects with my patches as well where 
>> the interface
>> is required to classify the jitter tasks. I have already tried few methods 
>> like
>> syscall and cgroup to classify such tasks and maybe something like that can 
>> be adopted
>> with these patchset as well.
> 
> Agree, these two patchest are definitively overlapping in terms of
> mechanisms and APIs to expose to userspace. You to guys seems to target
> different goals but the general approach should be:
> 
>  - expose a single and abstract concept to user-space
>latency-nice or latency-tolerant as PaulT proposed at OSPM
> 

I agree. Both the patchset tries to classify a tasks for some purpose for 
better latency.
TurboSched requires the classification of whether the task is jitter and should 
not be given
enough resources/frequency. This is a boolean value.
Whereas, latency-nice is a range. So does that mean that a max-latency-nice 
task is a jitter?

I was thinking of not doing jitter packing on a core occupying
min-latency-nice (i.e, latency sensitive) task (until there are other busier 
cores).

Given this, we can expose a single per-task attribute to the user by a syscall, 
right?

>  - map this concept in kernel-space to different kind of bias, both at
>wakeup time and load-balance time, and use both for RT and CFS tasks.
> 
> That's my understanding at least ;)
> 
> I guess we will have interesting discussions at the upcoming LPC to
> figure out a solution fitting all needs.

Definitely.

> 
>> Thanks,
>> Parth
> 
> Best,
> Patrick
> 



Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Valentin Schneider
On 05/09/2019 14:07, Patrick Bellasi wrote:
> Yep, I think so fare we are all converging towards the idea to use the
> a signed range. Regarding the range itself, yes: 1024 looks very
> oversized, but +-20 is still something which leave room for a bit of
> flexibility and it also better matches the idea that we don't want to
> "enumerate behaviours" but just expose a knob. To map certain "bias" we
> could benefit from a slightly larger range.
> 
>> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
> 
> For the wakeup path there is also the TurboSched proposal by Parth:
> 
>Message-ID: <20190725070857.6639-1-pa...@linux.ibm.com> 
>https://lore.kernel.org/lkml/20190725070857.6639-1-pa...@linux.ibm.com/
> 
> we should keep in mind.
> 
>> applied elsewhere (e.g. load-balance, something like task_hot() and its
>> use of sysctl_sched_migration_cost).
> 
> For LB can you come up with some better description of what usages you
> see could benefit from a "per task" or "per task-group" latency niceness?
> 

task_hot() "ratelimits" migrations of tasks that were running up until
very recently (hence "cache hot"), but the knob is system wide. It might
make sense to exploit a per-task attribute to tune this (e.g. go wild if
not latency sensitive, otherwise stay away for longer).

We could perhaps even apply it to active load balance to similarly stay
away from latency sensitive tasks. Right now this is gated by a
sched_domain-wide attribute (nr_balance_failed). We could tweak this to
requiring more (less) failed attempts before interrupting latency (in)
sensitive tasks.

I'm sure we can come up with even more creative ways to pour even more
heuristics in there ;)

> Best,
> Patrick
> 


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Qais Yousef
On 09/05/19 13:48, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 12:40:01PM +0100, Patrick Bellasi wrote:
> > Right, although I think behaviours could still be exported but via a
> > different and configurable interface, using thresholds.
> 
> I would try _really_ hard to avoid pinning down behaviour. The more you
> do that, the less you can change.

While I agree with that but I find there's a contradiction between not
'pinning down behavior' and 'easy and clear way to bias latency sensitive from
end user's perspective'.

Maybe we should protect this with a kconfig + experimental tag until trial
and error show the best way forward?

--
Qais Yousef


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:46:37 +0100, Valentin Schneider wrote...

> On 05/09/2019 12:18, Patrick Bellasi wrote:
>>> There's a few things wrong there; I really feel that if we call it nice,
>>> it should be like nice. Otherwise we should call it latency-bias and not
>>> have the association with nice to confuse people.
>>>
>>> Secondly; the default should be in the middle of the range. Naturally
>>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>>> want [0,1024], then the default really should be 512, but personally I
>>> like 0 better as a default, in which case we need negative numbers.
>>>
>>> This is important because we want to be able to bias towards less
>>> importance to (tail) latency as well as more importantance to (tail)
>>> latency.
>>>
>>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>> 
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>> 
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>> 
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>> 
>> Give the two extremes above, would not be much simpler and intuitive to
>> have 0 implementing the FB/Android (no latency) case and 1024 the
>> (max latency) Oracle case?
>> 
>
> For something like latency-, I don't see the point of having
> such a wide range. The nice range is probably more than enough - and before
> even bothering about the range, we should probably agree on what the range
> should represent.
>
> If it's niceness, I read it as: positive latency-nice value means we're
> nice to latency, means we reduce it. So the further up you go, the more you
> restrict your wakeup scan. I think it's quite easy to map that into the
> code: current behaviour at 0, with a decreasing scan mask size as we go
> towards +19. I don't think anyone needs 512 steps to tune this.
>
> I don't know what logic we'd follow for negative values though. Maybe
> latency-nice -20 means always going through the slowpath, but what of the
> intermediate values?

Yep, I think so fare we are all converging towards the idea to use the
a signed range. Regarding the range itself, yes: 1024 looks very
oversized, but +-20 is still something which leave room for a bit of
flexibility and it also better matches the idea that we don't want to
"enumerate behaviours" but just expose a knob. To map certain "bias" we
could benefit from a slightly larger range.

> AFAICT this RFC only looks at wakeups, but I guess latency-nice can be

For the wakeup path there is also the TurboSched proposal by Parth:

   Message-ID: <20190725070857.6639-1-pa...@linux.ibm.com> 
   https://lore.kernel.org/lkml/20190725070857.6639-1-pa...@linux.ibm.com/

we should keep in mind.

> applied elsewhere (e.g. load-balance, something like task_hot() and its
> use of sysctl_sched_migration_cost).

For LB can you come up with some better description of what usages you
see could benefit from a "per task" or "per task-group" latency niceness?

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 12:40:01PM +0100, Patrick Bellasi wrote:
> Right, although I think behaviours could still be exported but via a
> different and configurable interface, using thresholds.

I would try _really_ hard to avoid pinning down behaviour. The more you
do that, the less you can change.


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Valentin Schneider
On 05/09/2019 12:18, Patrick Bellasi wrote:
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>>
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>>
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importantance to (tail)
>> latency.
>>
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> 
> Right, we have this dualism to deal with and current mainline behaviour
> is somehow in the middle.
> 
> BTW, the FB requirement is the same we have in Android.
> We want some CFS tasks to have very small latency and a low chance
> to be preempted by the wake-up of less-important "background" tasks.
> 
> I'm not totally against the usage of a signed range, but I'm thinking
> that since we are introducing a new (non POSIX) concept we can get the
> chance to make it more human friendly.
> 
> Give the two extremes above, would not be much simpler and intuitive to
> have 0 implementing the FB/Android (no latency) case and 1024 the
> (max latency) Oracle case?
> 

For something like latency-, I don't see the point of having
such a wide range. The nice range is probably more than enough - and before
even bothering about the range, we should probably agree on what the range
should represent.

If it's niceness, I read it as: positive latency-nice value means we're
nice to latency, means we reduce it. So the further up you go, the more you
restrict your wakeup scan. I think it's quite easy to map that into the
code: current behaviour at 0, with a decreasing scan mask size as we go
towards +19. I don't think anyone needs 512 steps to tune this.

I don't know what logic we'd follow for negative values though. Maybe
latency-nice -20 means always going through the slowpath, but what of the
intermediate values?

AFAICT this RFC only looks at wakeups, but I guess latency-nice can be
applied elsewhere (e.g. load-balance, something like task_hot() and its
use of sysctl_sched_migration_cost).

> Moreover, we will never match completely the nice semantic, give that
> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
> usage change for each step?
> 
> For latency-nice instead we will likely base our biasing strategies on
> some predefined (maybe system-wide configurable) const thresholds.
> 
> Could changing the name to "latency-tolerance" break the tie by marking
> its difference wrt prior/nice levels? AFAIR, that was also the original
> proposal [1] by PaulT during the OSPM discussion.
> 
> Best,
> Patrick
> 
> [1] https://youtu.be/oz43thSFqmk?t=1302
> 


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Qais Yousef
On 09/05/19 13:30, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> > On 09/05/19 12:46, Peter Zijlstra wrote:
> 
> > > This is important because we want to be able to bias towards less
> > > importance to (tail) latency as well as more importantance to (tail)
> > > latency.
> > > 
> > > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> > 
> > Another use case I'm considering is using latency-nice to prefer an idle 
> > CPU if
> > latency-nice is set otherwise go for the most energy efficient CPU.
> > 
> > Ie: sacrifice (some) energy for latency.
> > 
> > The way I see interpreting latency-nice here as a binary switch. But
> > maybe we can use the range to select what (some) energy to sacrifice
> > mean here. Hmmm.
> 
> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).

I meant I want to use it as a binary.

> 
> In your case, I'm thinking you mean >0, we want to lower the latency.

Yes. As long as there's an easy way to say: does this task care about latency
or not I'm good.

> 
> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
> 
> But as patrick said; we're very much exporting a BIAS knob, not a set of
> behaviours.

Agreed. I just wanted to say that the way this range is going to be
interpreted will differ from path to path and we need to consider that in the
final mapping. Especially from the final user's perspective of what setting
this value ultimately means to them.

--
Qais Yousef


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 12:30:52PM +0100, Patrick Bellasi wrote:

> I see this concept possibly evolving into something more then just a
> binary switch. Not yet convinced if it make sense and/or it's possible
> but, in principle, I was thinking about these possible usages for CFS
> tasks:
> 
>  - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
>depending on crossing certain pre-configured threshold of latency
>niceness.

A big part of BATCH is wakeup preemption (batch doesn't preempt itself),
and wakeup preemption is a task-task propery and can thus be completely
relative.

>  - dynamically bias the vruntime updates we do in place_entity()
>depending on the actual latency niceness of a task.

That is dangerous; theory says we should keep track of the 0-lag point
and place it back where we found it. BFQ does this correctly IIRC, but
for CFS I've never done that because 'expensive'.

But yes, we could (carefully) fumble a bit there.

>  - bias the decisions we take in check_preempt_tick() still depending
>on a relative comparison of the current and wakeup task latency
>niceness values.

Ack.

Placing relative and absolute behaviour on the same scale is going to be
'fun' :-)


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:40:30 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:
>
>> Right, we have this dualism to deal with and current mainline behaviour
>> is somehow in the middle.
>> 
>> BTW, the FB requirement is the same we have in Android.
>> We want some CFS tasks to have very small latency and a low chance
>> to be preempted by the wake-up of less-important "background" tasks.
>> 
>> I'm not totally against the usage of a signed range, but I'm thinking
>> that since we are introducing a new (non POSIX) concept we can get the
>> chance to make it more human friendly.
>
> I'm arguing that signed _is_ more human friendly ;-)

... but you are not human. :)

>> Give the two extremes above, would not be much simpler and intuitive to
>> have 0 implementing the FB/Android (no latency) case and 1024 the
>> (max latency) Oracle case?
>
> See, I find the signed thing more natural, negative is a bias away from
> latency sensitive, positive is a bias towards latency sensitive.
>
> Also; 0 is a good default value ;-)

Yes, that's appealing indeed.

>> Moreover, we will never match completely the nice semantic, give that
>> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
>> usage change for each step?
>
> Only because we were nice when implementing it. Posix leaves it
> unspecified and we could change it at any time. The only real semantics
> is a relative 'weight' (opengroup uses the term 'favourable').

Good to know, I was considering it a POXIS requirement.

>> Could changing the name to "latency-tolerance" break the tie by marking
>> its difference wrt prior/nice levels? AFAIR, that was also the original
>> proposal [1] by PaulT during the OSPM discussion.
>
> latency torrerance could still be a signed entity, positive would
> signify we're more tolerant of latency (ie. less sensitive) while
> negative would be less tolerant (ie. more sensitive).

Right.

>> For latency-nice instead we will likely base our biasing strategies on
>> some predefined (maybe system-wide configurable) const thresholds.
>
> I'm not quite sure; yes, for some of these things, like the idle search
> on wakeup, certainly. But say for wakeup-preemption, we could definitely
> make it a task relative attribute.

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 12:18:55PM +0100, Patrick Bellasi wrote:

> Right, we have this dualism to deal with and current mainline behaviour
> is somehow in the middle.
> 
> BTW, the FB requirement is the same we have in Android.
> We want some CFS tasks to have very small latency and a low chance
> to be preempted by the wake-up of less-important "background" tasks.
> 
> I'm not totally against the usage of a signed range, but I'm thinking
> that since we are introducing a new (non POSIX) concept we can get the
> chance to make it more human friendly.

I'm arguing that signed _is_ more human friendly ;-)

> Give the two extremes above, would not be much simpler and intuitive to
> have 0 implementing the FB/Android (no latency) case and 1024 the
> (max latency) Oracle case?

See, I find the signed thing more natural, negative is a bias away from
latency sensitive, positive is a bias towards latency sensitive.

Also; 0 is a good default value ;-)

> Moreover, we will never match completely the nice semantic, give that
> a 1 nice unit has a proper math meaning, isn't something like 10% CPU
> usage change for each step?

Only because we were nice when implementing it. Posix leaves it
unspecified and we could change it at any time. The only real semantics
is a relative 'weight' (opengroup uses the term 'favourable').

> Could changing the name to "latency-tolerance" break the tie by marking
> its difference wrt prior/nice levels? AFAIR, that was also the original
> proposal [1] by PaulT during the OSPM discussion.

latency torrerance could still be a signed entity, positive would
signify we're more tolerant of latency (ie. less sensitive) while
negative would be less tolerant (ie. more sensitive).

> For latency-nice instead we will likely base our biasing strategies on
> some predefined (maybe system-wide configurable) const thresholds.

I'm not quite sure; yes, for some of these things, like the idle search
on wakeup, certainly. But say for wakeup-preemption, we could definitely
make it a task relative attribute.


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:30:02 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
>> On 09/05/19 12:46, Peter Zijlstra wrote:
>
>> > This is important because we want to be able to bias towards less
>> > importance to (tail) latency as well as more importantance to (tail)
>> > latency.
>> > 
>> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>> 
>> Another use case I'm considering is using latency-nice to prefer an idle CPU 
>> if
>> latency-nice is set otherwise go for the most energy efficient CPU.
>> 
>> Ie: sacrifice (some) energy for latency.
>> 
>> The way I see interpreting latency-nice here as a binary switch. But
>> maybe we can use the range to select what (some) energy to sacrifice
>> mean here. Hmmm.
>
> It cannot be binary, per definition is must be ternary, that is, <0, ==0
> and >0 (or middle value if you're of that persuasion).
>
> In your case, I'm thinking you mean >0, we want to lower the latency.
>
> Anyway; there were a number of things mentioned at OSPM that we could
> tie into this thing and finding sensible mappings is going to be a bit
> of trial and error I suppose.
>
> But as patrick said; we're very much exporting a BIAS knob, not a set of
> behaviours.

Right, although I think behaviours could still be exported but via a
different and configurable interface, using thresholds.

Either at compile time or via procfs maybe we can expose and properly
document what happen in the scheduler if/when a task has a "latency
niceness" crossing a given threshold.

For example, by setting something like:

   /proc/sys/kernel/sched_cfs_latency_idle = 1000

we state that the task is going to be scheduled according to the
SCHED_IDLE policy.

  ( ( (tomatoes target here) ) )

Not sure also if we wanna commit to user-space APIs how we internally
map/translate a "latency niceness" value into a scheduler behaviour
bias. Maybe better not at least at the very beginning.

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 12:13:47 +0100, Qais Yousef wrote...

> On 09/05/19 12:46, Peter Zijlstra wrote:
>> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>> 
>> > > From just reading the above, I would expect it to have the range
>> > > [-20,19] just like normal nice. Apparently this is not so.
>> > 
>> > Regarding the range for the latency-nice values, I guess we have two
>> > options:
>> > 
>> >   - [-20..19], which makes it similar to priorities
>> >   downside: we quite likely end up with a kernel space representation
>> >   which does not match the user-space one, e.g. look at
>> >   task_struct::prio.
>> > 
>> >   - [0..1024], which makes it more similar to a "percentage"
>> > 
>> > Being latency-nice a new concept, we are not constrained by POSIX and
>> > IMHO the [0..1024] scale is a better fit.
>> > 
>> > That will translate into:
>> > 
>> >   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>> >   policies are disabled and we wakeup up as fast as possible
>> > 
>> >   latency-nice=1024 : maximum niceness, where for example we can imaging
>> >   to turn switch a CFS task to be SCHED_IDLE?
>> 
>> There's a few things wrong there; I really feel that if we call it nice,
>> it should be like nice. Otherwise we should call it latency-bias and not
>> have the association with nice to confuse people.
>> 
>> Secondly; the default should be in the middle of the range. Naturally
>> this would be a signed range like nice [-(x+1),x] for some x. but if you
>> want [0,1024], then the default really should be 512, but personally I
>> like 0 better as a default, in which case we need negative numbers.
>> 
>> This is important because we want to be able to bias towards less
>> importance to (tail) latency as well as more importantance to (tail)
>> latency.
>> 
>> Specifically, Oracle wants to sacrifice (some) latency for throughput.
>> Facebook OTOH seems to want to sacrifice (some) throughput for latency.
>
> Another use case I'm considering is using latency-nice to prefer an idle CPU 
> if
> latency-nice is set otherwise go for the most energy efficient CPU.
>
> Ie: sacrifice (some) energy for latency.
>
> The way I see interpreting latency-nice here as a binary switch. But maybe we
> can use the range to select what (some) energy to sacrifice mean here. Hmmm.

I see this concept possibly evolving into something more then just a
binary switch. Not yet convinced if it make sense and/or it's possible
but, in principle, I was thinking about these possible usages for CFS
tasks:

 - dynamically tune the policy of a task among SCHED_{OTHER,BATCH,IDLE}
   depending on crossing certain pre-configured threshold of latency
   niceness.

 - dynamically bias the vruntime updates we do in place_entity()
   depending on the actual latency niceness of a task.

 - bias the decisions we take in check_preempt_tick() still depending
   on a relative comparison of the current and wakeup task latency
   niceness values.

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 12:13:47PM +0100, Qais Yousef wrote:
> On 09/05/19 12:46, Peter Zijlstra wrote:

> > This is important because we want to be able to bias towards less
> > importance to (tail) latency as well as more importantance to (tail)
> > latency.
> > 
> > Specifically, Oracle wants to sacrifice (some) latency for throughput.
> > Facebook OTOH seems to want to sacrifice (some) throughput for latency.
> 
> Another use case I'm considering is using latency-nice to prefer an idle CPU 
> if
> latency-nice is set otherwise go for the most energy efficient CPU.
> 
> Ie: sacrifice (some) energy for latency.
> 
> The way I see interpreting latency-nice here as a binary switch. But
> maybe we can use the range to select what (some) energy to sacrifice
> mean here. Hmmm.

It cannot be binary, per definition is must be ternary, that is, <0, ==0
and >0 (or middle value if you're of that persuasion).

In your case, I'm thinking you mean >0, we want to lower the latency.

Anyway; there were a number of things mentioned at OSPM that we could
tie into this thing and finding sensible mappings is going to be a bit
of trial and error I suppose.

But as patrick said; we're very much exporting a BIAS knob, not a set of
behaviours.


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 11:46:16 +0100, Peter Zijlstra wrote...

> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
>
>> > From just reading the above, I would expect it to have the range
>> > [-20,19] just like normal nice. Apparently this is not so.
>> 
>> Regarding the range for the latency-nice values, I guess we have two
>> options:
>> 
>>   - [-20..19], which makes it similar to priorities
>>   downside: we quite likely end up with a kernel space representation
>>   which does not match the user-space one, e.g. look at
>>   task_struct::prio.
>> 
>>   - [0..1024], which makes it more similar to a "percentage"
>> 
>> Being latency-nice a new concept, we are not constrained by POSIX and
>> IMHO the [0..1024] scale is a better fit.
>> 
>> That will translate into:
>> 
>>   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>>   policies are disabled and we wakeup up as fast as possible
>> 
>>   latency-nice=1024 : maximum niceness, where for example we can imaging
>>   to turn switch a CFS task to be SCHED_IDLE?
>
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
>
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. but if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
>
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importantance to (tail)
> latency.
>
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.

Right, we have this dualism to deal with and current mainline behaviour
is somehow in the middle.

BTW, the FB requirement is the same we have in Android.
We want some CFS tasks to have very small latency and a low chance
to be preempted by the wake-up of less-important "background" tasks.

I'm not totally against the usage of a signed range, but I'm thinking
that since we are introducing a new (non POSIX) concept we can get the
chance to make it more human friendly.

Give the two extremes above, would not be much simpler and intuitive to
have 0 implementing the FB/Android (no latency) case and 1024 the
(max latency) Oracle case?

Moreover, we will never match completely the nice semantic, give that
a 1 nice unit has a proper math meaning, isn't something like 10% CPU
usage change for each step?

For latency-nice instead we will likely base our biasing strategies on
some predefined (maybe system-wide configurable) const thresholds.

Could changing the name to "latency-tolerance" break the tie by marking
its difference wrt prior/nice levels? AFAIR, that was also the original
proposal [1] by PaulT during the OSPM discussion.

Best,
Patrick

[1] https://youtu.be/oz43thSFqmk?t=1302

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Qais Yousef
On 09/05/19 12:46, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:
> 
> > > From just reading the above, I would expect it to have the range
> > > [-20,19] just like normal nice. Apparently this is not so.
> > 
> > Regarding the range for the latency-nice values, I guess we have two
> > options:
> > 
> >   - [-20..19], which makes it similar to priorities
> >   downside: we quite likely end up with a kernel space representation
> >   which does not match the user-space one, e.g. look at
> >   task_struct::prio.
> > 
> >   - [0..1024], which makes it more similar to a "percentage"
> > 
> > Being latency-nice a new concept, we are not constrained by POSIX and
> > IMHO the [0..1024] scale is a better fit.
> > 
> > That will translate into:
> > 
> >   latency-nice=0 : default (current mainline) behaviour, all "biasing"
> >   policies are disabled and we wakeup up as fast as possible
> > 
> >   latency-nice=1024 : maximum niceness, where for example we can imaging
> >   to turn switch a CFS task to be SCHED_IDLE?
> 
> There's a few things wrong there; I really feel that if we call it nice,
> it should be like nice. Otherwise we should call it latency-bias and not
> have the association with nice to confuse people.
> 
> Secondly; the default should be in the middle of the range. Naturally
> this would be a signed range like nice [-(x+1),x] for some x. but if you
> want [0,1024], then the default really should be 512, but personally I
> like 0 better as a default, in which case we need negative numbers.
> 
> This is important because we want to be able to bias towards less
> importance to (tail) latency as well as more importantance to (tail)
> latency.
> 
> Specifically, Oracle wants to sacrifice (some) latency for throughput.
> Facebook OTOH seems to want to sacrifice (some) throughput for latency.

Another use case I'm considering is using latency-nice to prefer an idle CPU if
latency-nice is set otherwise go for the most energy efficient CPU.

Ie: sacrifice (some) energy for latency.

The way I see interpreting latency-nice here as a binary switch. But maybe we
can use the range to select what (some) energy to sacrifice mean here. Hmmm.

--
Qais Yousef


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 11:05:18AM +0100, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b52ed1a..365c928 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq 
> > *this_rq) { }
> >  #define NICE_0_LOAD(1L << NICE_0_LOAD_SHIFT)
> >  
> >  /*
> > + * Latency-nice default value
> > + */
> > +#defineLATENCY_NICE_DEFAULT5
> > +#defineLATENCY_NICE_MIN1
> > +#defineLATENCY_NICE_MAX100
> 
> Values 1 and 5 looks kind of arbitrary.

Yes, and like I just wrote, completely and utterly wrong.


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 10:45:27AM +0100, Patrick Bellasi wrote:

> > From just reading the above, I would expect it to have the range
> > [-20,19] just like normal nice. Apparently this is not so.
> 
> Regarding the range for the latency-nice values, I guess we have two
> options:
> 
>   - [-20..19], which makes it similar to priorities
>   downside: we quite likely end up with a kernel space representation
>   which does not match the user-space one, e.g. look at
>   task_struct::prio.
> 
>   - [0..1024], which makes it more similar to a "percentage"
> 
> Being latency-nice a new concept, we are not constrained by POSIX and
> IMHO the [0..1024] scale is a better fit.
> 
> That will translate into:
> 
>   latency-nice=0 : default (current mainline) behaviour, all "biasing"
>   policies are disabled and we wakeup up as fast as possible
> 
>   latency-nice=1024 : maximum niceness, where for example we can imaging
>   to turn switch a CFS task to be SCHED_IDLE?

There's a few things wrong there; I really feel that if we call it nice,
it should be like nice. Otherwise we should call it latency-bias and not
have the association with nice to confuse people.

Secondly; the default should be in the middle of the range. Naturally
this would be a signed range like nice [-(x+1),x] for some x. but if you
want [0,1024], then the default really should be 512, but personally I
like 0 better as a default, in which case we need negative numbers.

This is important because we want to be able to bias towards less
importance to (tail) latency as well as more importantance to (tail)
latency.

Specifically, Oracle wants to sacrifice (some) latency for throughput.
Facebook OTOH seems to want to sacrifice (some) throughput for latency.


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 07:15:34 +0100, Parth Shah wrote...

> On 9/4/19 11:02 PM, Tim Chen wrote:
>> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>>> "latency-nice" which is shared by all the threads in that Cgroup.
>> 
>> 
>> Subhra,
>> 
>> Thanks for posting the patchset.  Having a latency nice hint
>> is useful beyond idle load balancing.  I can think of other
>> application scenarios, like scheduling batch machine learning AVX 512
>> processes with latency sensitive processes.  AVX512 limits the frequency
>> of the CPU and it is best to avoid latency sensitive task on the
>> same core with AVX512.  So latency nice hint allows the scheduler
>> to have a criteria to determine the latency sensitivity of a task
>> and arrange latency sensitive tasks away from AVX512 tasks.
>> 
>
>
> Hi Tim and Subhra,
>
> This patchset seems to be interesting for my TurboSched patches as well
> where I try to pack jitter tasks on fewer cores to get higher Turbo 
> Frequencies.
> Well, the problem I face is that we sometime end up putting multiple jitter 
> tasks on a core
> running some latency sensitive application which may see performance 
> degradation.
> So my plan was to classify such tasks to be latency sensitive thereby hinting 
> the load
> balancer to not put tasks on such cores.
>
> TurboSched: https://lkml.org/lkml/2019/7/25/296
>
>> You configure the latency hint on a cgroup basis.
>> But I think not all tasks in a cgroup necessarily have the same
>> latency sensitivity.
>> 
>> For example, I can see that cgroup can be applied on a per user basis,
>> and the user could run different tasks that have different latency 
>> sensitivity.
>> We may also need a way to configure latency sensitivity on a per task basis 
>> instead on
>> a per cgroup basis.
>> 
>
> AFAIU, the problem defined above intersects with my patches as well where the 
> interface
> is required to classify the jitter tasks. I have already tried few methods 
> like
> syscall and cgroup to classify such tasks and maybe something like that can 
> be adopted
> with these patchset as well.

Agree, these two patchest are definitively overlapping in terms of
mechanisms and APIs to expose to userspace. You to guys seems to target
different goals but the general approach should be:

 - expose a single and abstract concept to user-space
   latency-nice or latency-tolerant as PaulT proposed at OSPM

 - map this concept in kernel-space to different kind of bias, both at
   wakeup time and load-balance time, and use both for RT and CFS tasks.

That's my understanding at least ;)

I guess we will have interesting discussions at the upcoming LPC to
figure out a solution fitting all needs.

> Thanks,
> Parth

Best,
Patrick

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


We already commented on adding the cgroup API after the per-task API.

However, for the cgroup bits will be super important to have

 [ +tejun ]

in CC since here we are at discussing the idea to add a new cpu
controller's attribute.

There are opinions about which kind of attributes can be added to
cgroups and I'm sure a "latency-nice" attribute will generate an
interesting discussion. :)

LPC is coming up, perhaps we can get the chance to have a chat with
Tejun about the manoeuvring space in this area.

On Fri, Aug 30, 2019 at 18:49:36 +0100, subhra mazumdar wrote...

> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.
>
> Signed-off-by: subhra mazumdar 
> ---
>  include/linux/sched.h |  1 +
>  kernel/sched/core.c   | 40 
>  kernel/sched/fair.c   |  1 +
>  kernel/sched/sched.h  |  8 
>  4 files changed, 50 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1183741..b4a79c3 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -631,6 +631,7 @@ struct task_struct {
>   int static_prio;
>   int normal_prio;
>   unsigned intrt_priority;
> + u64 latency_nice;

I guess we can save some bit here... or, if we are very brave, maybe we
can explore the possibility to pack all prios into a single u64?

 ( ( (tomatoes target here) ) )

>   const struct sched_class*sched_class;
>   struct sched_entity se;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427..47969bc 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5976,6 +5976,7 @@ void __init sched_init(void)
>   init_dl_rq(>dl);
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   root_task_group.shares = ROOT_TASK_GROUP_LOAD;
> + root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
>   INIT_LIST_HEAD(>leaf_cfs_rq_list);
>   rq->tmp_alone_branch = >leaf_cfs_rq_list;
>   /*
> @@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, 
> int type)
>*/
>   tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
> struct task_group, css);
> + tsk->latency_nice = tg->latency_nice;
>   tg = autogroup_task_group(tsk, tg);
>   tsk->sched_task_group = tg;
>  
> @@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct 
> cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
>  
> +static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
> +  struct cftype *cft)
> +{
> + struct task_group *tg = css_tg(css);
> +
> + return tg->latency_nice;
> +}
> +
> +static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
> +   struct cftype *cft, u64 latency_nice)
> +{
> + struct task_group *tg = css_tg(css);
> + struct css_task_iter it;
> + struct task_struct *p;
> +
> + if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
> + return -ERANGE;
> +
> + tg->latency_nice = latency_nice;
> +
> + css_task_iter_start(css, 0, );
> + while ((p = css_task_iter_next()))
> + p->latency_nice = latency_nice;

Once (and if) the cgroup API is added we can avoid this (potentially
massive) "update on write" in favour of an "on demand composition at
wakeup-time".

We don't care about updating the latency-nice of NON RUNNABLE tasks,
do we?

AFAIK, we need that value only (or mostly) at wakeup time. Thus, when a
task wakeup up we can easily compose (and eventually cache) it's
current latency-nice value by considering, in priority order:

  - the system wide upper-bound
  - the task group restriction
  - the task specific relaxation

Something similar to what we already do for uclamp composition with this
patch currently in tip/sched/core:

   commit 3eac870a3247 ("sched/uclamp: Use TG's clamps to restrict TASK's 
clamps")


> + css_task_iter_end();
> +
> + return 0;
> +}
> +
>  static struct cftype cpu_legacy_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   {
> @@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
>   .write_u64 = cpu_rt_period_write_uint,
>   },
>  #endif
> + {
> + .name = "latency-nice",
> + .read_u64 = cpu_latency_nice_read_u64,
> + .write_u64 = cpu_latency_nice_write_u64,
> + },
>   { } /* Terminate */
>  };
>  
> @@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
>   .write = cpu_max_write,
>   },
>  #endif
> + {
> + .name = "latency-nice",
> + .read_u64 = cpu_latency_nice_read_u64,
> + .write_u64 = cpu_latency_nice_write_u64,
> + },
>   { 

Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Patrick Bellasi


On Thu, Sep 05, 2019 at 09:31:27 +0100, Peter Zijlstra wrote...

> On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
>
> *sigh*, no. We start with a normal per task attribute, and then later,
> if it is needed and makes sense, we add it to cgroups.

FWIW, to add on top of what Peter says, we used this same approach for
uclamp and it proved to be a very effective way to come up with a good
design. General principles have been:

 - a system wide API [1] (under /proc/sys/kernel/sched_*) defines
   default values for all tasks affected by that feature.
   This interface has to define also upper bounds for task specific
   values. Thus, in the case of latency-nice, it should be set by
   default to the MIN value, since that's the current mainline
   behaviour: all tasks are latency sensitive.

 - a per-task API [2] (via the sched_setattr() syscall) can be used to
   relax the system wide setting thus implementing a "nice" policy.

 - a per-taskgroup API [3] (via cpu controller's attributes) can be used
   to relax the system-wide settings and restrict the per-task API.

The above features are worth to be added in that exact order.

> Also, your Changelog fails on pretty much every point. It doesn't
> explain why, it doesn't describe anything and so on.

On the description side, I guess it's worth to mention somewhere to
which scheduling classes this feature can be useful for. It's worth to
mention that it can apply only to:

 - CFS tasks: for example, at wakeup time a task with an high
   latency-nice should avoid to preempt a low latency-nice task.
   Maybe by mapping the latency nice value into proper vruntime
   normalization value?

 - RT tasks: for example, at wakeup time a task with an high
   latency-nice value could avoid to preempt a CFS task.

I'm sure there will be discussion about some of these features, that's
why it's important in the proposal presentation to keep a well defined
distinction among the "mechanisms and API" and how we use the new
concept to "bias" some scheduler policies.

> From just reading the above, I would expect it to have the range
> [-20,19] just like normal nice. Apparently this is not so.

Regarding the range for the latency-nice values, I guess we have two
options:

  - [-20..19], which makes it similar to priorities
  downside: we quite likely end up with a kernel space representation
  which does not match the user-space one, e.g. look at
  task_struct::prio.

  - [0..1024], which makes it more similar to a "percentage"

Being latency-nice a new concept, we are not constrained by POSIX and
IMHO the [0..1024] scale is a better fit.

That will translate into:

  latency-nice=0 : default (current mainline) behaviour, all "biasing"
  policies are disabled and we wakeup up as fast as possible

  latency-nice=1024 : maximum niceness, where for example we can imaging
  to turn switch a CFS task to be SCHED_IDLE?

Best,
Patrick

[1] commit e8f14172c6b1 ("sched/uclamp: Add system default clamps")
[2] commit a509a7cd7974 ("sched/uclamp: Extend sched_setattr() to support 
utilization clamping")
[3] 5 patches in today's tip/sched/core up to:
commit babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp 
changes")

-- 
#include 

Patrick Bellasi


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Peter Zijlstra
On Fri, Aug 30, 2019 at 10:49:36AM -0700, subhra mazumdar wrote:
> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.

*sigh*, no. We start with a normal per task attribute, and then later,
if it is needed and makes sense, we add it to cgroups.

Also, your Changelog fails on pretty much every point. It doesn't
explain why, it doesn't describe anything and so on.

>From just reading the above, I would expect it to have the range
[-20,19] just like normal nice. Apparently this is not so.


Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-05 Thread Parth Shah



On 9/4/19 11:02 PM, Tim Chen wrote:
> On 8/30/19 10:49 AM, subhra mazumdar wrote:
>> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
>> "latency-nice" which is shared by all the threads in that Cgroup.
> 
> 
> Subhra,
> 
> Thanks for posting the patchset.  Having a latency nice hint
> is useful beyond idle load balancing.  I can think of other
> application scenarios, like scheduling batch machine learning AVX 512
> processes with latency sensitive processes.  AVX512 limits the frequency
> of the CPU and it is best to avoid latency sensitive task on the
> same core with AVX512.  So latency nice hint allows the scheduler
> to have a criteria to determine the latency sensitivity of a task
> and arrange latency sensitive tasks away from AVX512 tasks.
> 


Hi Tim and Subhra,

This patchset seems to be interesting for my TurboSched patches as well
where I try to pack jitter tasks on fewer cores to get higher Turbo Frequencies.
Well, the problem I face is that we sometime end up putting multiple jitter 
tasks on a core
running some latency sensitive application which may see performance 
degradation.
So my plan was to classify such tasks to be latency sensitive thereby hinting 
the load
balancer to not put tasks on such cores.

TurboSched: https://lkml.org/lkml/2019/7/25/296

> You configure the latency hint on a cgroup basis.
> But I think not all tasks in a cgroup necessarily have the same
> latency sensitivity.
> 
> For example, I can see that cgroup can be applied on a per user basis,
> and the user could run different tasks that have different latency 
> sensitivity.
> We may also need a way to configure latency sensitivity on a per task basis 
> instead on
> a per cgroup basis.
> 

AFAIU, the problem defined above intersects with my patches as well where the 
interface
is required to classify the jitter tasks. I have already tried few methods like
syscall and cgroup to classify such tasks and maybe something like that can be 
adopted
with these patchset as well.


Thanks,
Parth



Re: [RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-09-04 Thread Tim Chen
On 8/30/19 10:49 AM, subhra mazumdar wrote:
> Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
> "latency-nice" which is shared by all the threads in that Cgroup.


Subhra,

Thanks for posting the patchset.  Having a latency nice hint
is useful beyond idle load balancing.  I can think of other
application scenarios, like scheduling batch machine learning AVX 512
processes with latency sensitive processes.  AVX512 limits the frequency
of the CPU and it is best to avoid latency sensitive task on the
same core with AVX512.  So latency nice hint allows the scheduler
to have a criteria to determine the latency sensitivity of a task
and arrange latency sensitive tasks away from AVX512 tasks.

You configure the latency hint on a cgroup basis.
But I think not all tasks in a cgroup necessarily have the same
latency sensitivity.

For example, I can see that cgroup can be applied on a per user basis,
and the user could run different tasks that have different latency sensitivity.
We may also need a way to configure latency sensitivity on a per task basis 
instead on
a per cgroup basis.

Tim


> @@ -631,6 +631,7 @@ struct task_struct {
>   int static_prio;
>   int normal_prio;
>   unsigned intrt_priority;
> + u64 latency_nice;

Does it need to be 64 bit?  Max latency nice is only 100.

>  
>   const struct sched_class*sched_class;
>   struct sched_entity se;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 874c427..47969bc 100644

> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b52ed1a..365c928 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq 
> *this_rq) { }
>  #define NICE_0_LOAD  (1L << NICE_0_LOAD_SHIFT)
>  
>  /*
> + * Latency-nice default value
> + */

Will be useful to add comments to let reader know 
that higher latency nice number means a task is more 
latency tolerant.

Is there a reason for setting the default to be a low
value of 5?

Seems like we will default to only to search the
same core for idle cpu on a smaller system, 
as we only search 5% of the cpu span of the target sched domain.

> +#define  LATENCY_NICE_DEFAULT5
> +#define  LATENCY_NICE_MIN1
> +#define  LATENCY_NICE_MAX100
> +


[RFC PATCH 1/9] sched,cgroup: Add interface for latency-nice

2019-08-30 Thread subhra mazumdar
Add Cgroup interface for latency-nice. Each CPU Cgroup adds a new file
"latency-nice" which is shared by all the threads in that Cgroup.

Signed-off-by: subhra mazumdar 
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 40 
 kernel/sched/fair.c   |  1 +
 kernel/sched/sched.h  |  8 
 4 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1183741..b4a79c3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -631,6 +631,7 @@ struct task_struct {
int static_prio;
int normal_prio;
unsigned intrt_priority;
+   u64 latency_nice;
 
const struct sched_class*sched_class;
struct sched_entity se;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..47969bc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5976,6 +5976,7 @@ void __init sched_init(void)
init_dl_rq(>dl);
 #ifdef CONFIG_FAIR_GROUP_SCHED
root_task_group.shares = ROOT_TASK_GROUP_LOAD;
+   root_task_group.latency_nice = LATENCY_NICE_DEFAULT;
INIT_LIST_HEAD(>leaf_cfs_rq_list);
rq->tmp_alone_branch = >leaf_cfs_rq_list;
/*
@@ -6345,6 +6346,7 @@ static void sched_change_group(struct task_struct *tsk, 
int type)
 */
tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
  struct task_group, css);
+   tsk->latency_nice = tg->latency_nice;
tg = autogroup_task_group(tsk, tg);
tsk->sched_task_group = tg;
 
@@ -6812,6 +6814,34 @@ static u64 cpu_rt_period_read_uint(struct 
cgroup_subsys_state *css,
 }
 #endif /* CONFIG_RT_GROUP_SCHED */
 
+static u64 cpu_latency_nice_read_u64(struct cgroup_subsys_state *css,
+struct cftype *cft)
+{
+   struct task_group *tg = css_tg(css);
+
+   return tg->latency_nice;
+}
+
+static int cpu_latency_nice_write_u64(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 latency_nice)
+{
+   struct task_group *tg = css_tg(css);
+   struct css_task_iter it;
+   struct task_struct *p;
+
+   if (latency_nice < LATENCY_NICE_MIN || latency_nice > LATENCY_NICE_MAX)
+   return -ERANGE;
+
+   tg->latency_nice = latency_nice;
+
+   css_task_iter_start(css, 0, );
+   while ((p = css_task_iter_next()))
+   p->latency_nice = latency_nice;
+   css_task_iter_end();
+
+   return 0;
+}
+
 static struct cftype cpu_legacy_files[] = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -6848,6 +6878,11 @@ static struct cftype cpu_legacy_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
 #endif
+   {
+   .name = "latency-nice",
+   .read_u64 = cpu_latency_nice_read_u64,
+   .write_u64 = cpu_latency_nice_write_u64,
+   },
{ } /* Terminate */
 };
 
@@ -7015,6 +7050,11 @@ static struct cftype cpu_files[] = {
.write = cpu_max_write,
},
 #endif
+   {
+   .name = "latency-nice",
+   .read_u64 = cpu_latency_nice_read_u64,
+   .write_u64 = cpu_latency_nice_write_u64,
+   },
{ } /* terminate */
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..b08d00c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10479,6 +10479,7 @@ int alloc_fair_sched_group(struct task_group *tg, 
struct task_group *parent)
goto err;
 
tg->shares = NICE_0_LOAD;
+   tg->latency_nice = LATENCY_NICE_DEFAULT;
 
init_cfs_bandwidth(tg_cfs_bandwidth(tg));
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..365c928 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -143,6 +143,13 @@ static inline void cpu_load_update_active(struct rq 
*this_rq) { }
 #define NICE_0_LOAD(1L << NICE_0_LOAD_SHIFT)
 
 /*
+ * Latency-nice default value
+ */
+#defineLATENCY_NICE_DEFAULT5
+#defineLATENCY_NICE_MIN1
+#defineLATENCY_NICE_MAX100
+
+/*
  * Single value that decides SCHED_DEADLINE internal math precision.
  * 10 -> just above 1us
  * 9  -> just above 0.5us
@@ -362,6 +369,7 @@ struct cfs_bandwidth {
 /* Task group related information */
 struct task_group {
struct cgroup_subsys_state css;
+   u64 latency_nice;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each CPU */
-- 
2.9.3