Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Sun, Nov 13, 2016 at 12:53:28PM -0600, Christoph Lameter wrote:
> On Fri, 11 Nov 2016, Peter Zijlstra wrote:
>
> > On Fri, Nov 11, 2016 at 12:46:37PM -0600, Christoph Lameter wrote:
> > > On Thu, 10 Nov 2016, Daniel Vacek wrote:
> > >
> > > > I believe Daniel's patches are the best thing we can do in the
> > > > current situation, as the behavior now seems rather buggy and does
> > > > not provide the expectations set when rt throttling was merged with
> > > > a default budget of 95% of CPU time. And if you configure it so
> > > > that it does (by disabling RT_RUNTIME_SHARE), it forces CPUs to go
> > > > idle needlessly while there can still be an rt task running that is
> > > > not really starving anyone. At least until a proper rework of rt
> > > > scheduling with a DL server is implemented.
> > >
> > > This looks like a fix for a bug and the company I work for is
> > > suffering as a result. Could we please merge that ASAP?
> >
> > What bug? And no, the patch has as many technical issues as it has
> > conceptual ones.
>
> There is a deadlock, Peter!!!

Describe please? Also, have you tried disabling RT_RUNTIME_SHARE?
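For reference, RT_RUNTIME_SHARE is a sched feature bit that can be
toggled at runtime through debugfs; a minimal sketch, assuming debugfs
is mounted at /sys/kernel/debug:

  # cat /sys/kernel/debug/sched_features   # lists RT_RUNTIME_SHARE or NO_RT_RUNTIME_SHARE
  # echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features

Writing a feature name with the NO_ prefix clears that bit; writing the
bare name sets it again.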
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Fri, 11 Nov 2016, Peter Zijlstra wrote:

> On Fri, Nov 11, 2016 at 12:46:37PM -0600, Christoph Lameter wrote:
> > On Thu, 10 Nov 2016, Daniel Vacek wrote:
> >
> > > I believe Daniel's patches are the best thing we can do in the
> > > current situation, as the behavior now seems rather buggy and does
> > > not provide the expectations set when rt throttling was merged with
> > > a default budget of 95% of CPU time. And if you configure it so
> > > that it does (by disabling RT_RUNTIME_SHARE), it forces CPUs to go
> > > idle needlessly while there can still be an rt task running that is
> > > not really starving anyone. At least until a proper rework of rt
> > > scheduling with a DL server is implemented.
> >
> > This looks like a fix for a bug and the company I work for is
> > suffering as a result. Could we please merge that ASAP?
>
> What bug? And no, the patch has as many technical issues as it has
> conceptual ones.

There is a deadlock, Peter!!!
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Fri, Nov 11, 2016 at 12:46:37PM -0600, Christoph Lameter wrote:
> On Thu, 10 Nov 2016, Daniel Vacek wrote:
>
> > I believe Daniel's patches are the best thing we can do in the current
> > situation, as the behavior now seems rather buggy and does not provide
> > the expectations set when rt throttling was merged with a default
> > budget of 95% of CPU time. And if you configure it so that it does (by
> > disabling RT_RUNTIME_SHARE), it forces CPUs to go idle needlessly
> > while there can still be an rt task running that is not really
> > starving anyone. At least until a proper rework of rt scheduling with
> > a DL server is implemented.
>
> This looks like a fix for a bug and the company I work for is suffering
> as a result. Could we please merge that ASAP?

What bug? And no, the patch has as many technical issues as it has
conceptual ones.
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Thu, 10 Nov 2016, Daniel Vacek wrote:

> I believe Daniel's patches are the best thing we can do in the current
> situation, as the behavior now seems rather buggy and does not provide
> the expectations set when rt throttling was merged with a default budget
> of 95% of CPU time. And if you configure it so that it does (by
> disabling RT_RUNTIME_SHARE), it forces CPUs to go idle needlessly while
> there can still be an rt task running that is not really starving
> anyone. At least until a proper rework of rt scheduling with a DL server
> is implemented.

This looks like a fix for a bug and the company I work for is suffering
as a result. Could we please merge that ASAP?
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On 11/08/2016 08:50 PM, Peter Zijlstra wrote:
> > The problem is that using RT_RUNTIME_SHARE a CPU will almost always
> > borrow enough runtime to make a CPU-intensive rt task run forever...
> > well, not forever, but until the system crashes because a kworker
> > starved on this CPU. Kworkers are sched fair by design and users do
> > not always have a way to avoid them on an isolated CPU, for example.
> >
> > The user can then disable RT_RUNTIME_SHARE, but then the CPU will go
> > idle for (period - runtime) at each period... throwing CPU time in
> > the trash.
>
> So why is this a problem? You really should not be running that many
> FIFO tasks to begin with.

I agree that a spinning real-time task is not a good practice, but there
are people using it and they have their own reasons/metrics/evaluations
motivating them.

> So I'm willing to take out (or at least default disable)
> RT_RUNTIME_SHARE. But other than this, this never really worked to
> begin with. So it cannot be a regression. And we've lived this long
> with the 'problem'.

I agree! It would work perfectly in the absence of tasks pinned to a
processor, but that is not the case. Trying to serve the users that want
as much CPU time for -rt tasks as possible, the proposed patch seems to
be a better solution than RT_RUNTIME_SHARE, and it is way simpler! Even
though it is not as good as a DL server would be in the future, it seems
to be useful until then...

> We really should be doing the right thing here, not make a bigger mess.

I see, agree, and am anxious to have it! :-). Tommaso and I discussed DL
servers implementing such rt throttling. The most complicated point so
far (as Rostedt pointed out in another e-mail) will be to have DL servers
with arbitrary affinity, or serving tasks with arbitrary affinity. For
example, one DL server pinned to each CPU providing bandwidth for fair
tasks to run for (rt_period - rt_runtime) us at each rt_period... it
will take some time until someone proposes it.

-- Daniel
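For reference, the runtime/period budget discussed above is exposed as
two sysctls; a sketch of inspecting and adjusting the default 95% budget
(values in microseconds):

  # cat /proc/sys/kernel/sched_rt_period_us    # 1000000 by default
  # cat /proc/sys/kernel/sched_rt_runtime_us   # 950000 by default, i.e. 95%
  # echo 980000 > /proc/sys/kernel/sched_rt_runtime_us
  # echo -1 > /proc/sys/kernel/sched_rt_runtime_us   # disables rt throttling entirely

With the defaults and RT_RUNTIME_SHARE disabled, a CPU running only rt
tasks sits idle for period - runtime = 50 ms out of every second.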
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016, Steven Rostedt wrote:

> On Mon, 7 Nov 2016 21:06:50 +0100
> Daniel Bristot de Oliveira wrote:
>
> > The throttling allowed the kworker to run, but once the kworker went
> > to sleep, the RT tasks started to work again. In the previous
> > behavior, the system would either go idle, or the kworker would
> > starve because the runtime became infinite for RR tasks.
>
> I'm confused? Are you saying that RR tasks don't get throttled in the
> current code? That sounds like a bug to me.

Good. That's what I wanted to hear after all these justifications that
the system is just behaving as designed.
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Tue, Nov 08, 2016 at 08:29:49PM +0100, Daniel Bristot de Oliveira wrote:
> On 11/08/2016 07:05 PM, Peter Zijlstra wrote:
> > > I know what we want to do, but there's some momentous problems that
> > > need to be solved first.
> >
> > Like what?
>
> The problem is that using RT_RUNTIME_SHARE a CPU will almost always
> borrow enough runtime to make a CPU-intensive rt task run forever...
> well, not forever, but until the system crashes because a kworker
> starved on this CPU. Kworkers are sched fair by design and users do not
> always have a way to avoid them on an isolated CPU, for example.
>
> The user can then disable RT_RUNTIME_SHARE, but then the CPU will go
> idle for (period - runtime) at each period... throwing CPU time in the
> trash.

So why is this a problem? You really should not be running that many
FIFO tasks to begin with.

So I'm willing to take out (or at least default disable)
RT_RUNTIME_SHARE. But other than this, this never really worked to begin
with. So it cannot be a regression. And we've lived this long with the
'problem'.

And that means this is a 'feature', and that means I say no. We really
should be doing the right thing here, not make a bigger mess.
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On 11/08/2016 07:05 PM, Peter Zijlstra wrote:
> > I know what we want to do, but there's some momentous problems that
> > need to be solved first.
>
> Like what?

The problem is that using RT_RUNTIME_SHARE a CPU will almost always
borrow enough runtime to make a CPU-intensive rt task run forever...
well, not forever, but until the system crashes because a kworker starved
on this CPU. Kworkers are sched fair by design and users do not always
have a way to avoid them on an isolated CPU, for example.

The user can then disable RT_RUNTIME_SHARE, but then the CPU will go idle
for (period - runtime) at each period... throwing CPU time in the trash.

> > Until then, we may be forced to continue with
> > hacks.
>
> Well, the more ill-specified hacks we put in, the harder it will be to
> replace them, because people will end up depending on them.

The proposed patch seems to be the expected behavior for users of rt
throttling - they want a safeguard for fair tasks while allowing -rt
tasks to run as much as possible.

I see (and completely agree) that a DL server for fair/rt tasks would be
the best way to deal with this problem, but it will take some time until
such a solution exists :-(. We even discussed this at Retis today, but
yeah, it will take some time even in the best case.

(thinking aloud... a DL server would react like the proposed patch, in
the sense that it would not be activated without tasks to run, and would
return the CPU to other tasks if the tasks inside the server finish
their job before the end of the DL server runtime...)

-- Daniel
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Tue, Nov 08, 2016 at 12:17:10PM -0500, Steven Rostedt wrote:
> On Tue, 8 Nov 2016 17:51:33 +0100
> Peter Zijlstra wrote:
>
> > You really should already know this.
>
> I know what we want to do, but there's some momentous problems that
> need to be solved first.

Like what?

> Until then, we may be forced to continue with
> hacks.

Well, the more ill-specified hacks we put in, the harder it will be to
replace them, because people will end up depending on them.
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Tue, 8 Nov 2016 17:51:33 +0100
Peter Zijlstra wrote:

> You really should already know this.

I know what we want to do, but there's some momentous problems that need
to be solved first. Until then, we may be forced to continue with hacks.

> As stands, the current rt cgroup code (and all this throttling code) is
> a giant mess (as in, it's not actually correct from an RT PoV). We
> should not make it worse by adding random hacks to it.
>
> The right way to go about doing this is by replacing it with something
> better, like the proposed DL server for FIFO tasks -- which is entirely
> non-trivial as well; see the existing discussion on that.

Right. The biggest issue that I see is how to assign affinities to FIFO
tasks and use a DL server to keep them from starving other tasks.

> I'm not entirely sure what this patch was supposed to fix, but it could
> be running CFS tasks with higher priority than RT for a window, instead

I'm a bit confused by the above sentence. Do you mean that this patch
causes CFS tasks to run for a period with a higher priority than RT?
Well, currently we have both CFS tasks and the "idle" task run higher
than RT, but this patch changes that to be just CFS tasks.

> of throttling RT tasks. This seems fairly ill specified, but something
> like that could easily be done with an explicit or slack-time DL server
> for CFS tasks.

If we can have a DL scheduler that can handle arbitrary affinities, then
all of this could be solved with that.

-- Steve
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Tue, Nov 08, 2016 at 09:07:40AM -0500, Steven Rostedt wrote:
> On Tue, 8 Nov 2016 12:59:58 +0100
> Peter Zijlstra wrote:
>
> > No, none of this stands a chance of being accepted.
> >
> > This is making bad code worse.
>
> Peter,
>
> Instead of a flat-out rejection, can you please provide some
> constructive criticism to let those that are working on this know what
> would be accepted? And what their next steps should be.
>
> There's obviously a problem with the current code; what steps do you
> recommend to fix it?

You really should already know this.

As stands, the current rt cgroup code (and all this throttling code) is
a giant mess (as in, it's not actually correct from an RT PoV). We
should not make it worse by adding random hacks to it.

The right way to go about doing this is by replacing it with something
better, like the proposed DL server for FIFO tasks -- which is entirely
non-trivial as well; see the existing discussion on that.

I'm not entirely sure what this patch was supposed to fix, but it could
be running CFS tasks with higher priority than RT for a window, instead
of throttling RT tasks. This seems fairly ill specified, but something
like that could easily be done with an explicit or slack-time DL server
for CFS tasks.
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Tue, 8 Nov 2016 12:59:58 +0100
Peter Zijlstra wrote:

> No, none of this stands a chance of being accepted.
>
> This is making bad code worse.

Peter,

Instead of a flat-out rejection, can you please provide some constructive
criticism to let those that are working on this know what would be
accepted? And what their next steps should be.

There's obviously a problem with the current code; what steps do you
recommend to fix it?

-- Steve
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
No, none of this stands a chance of being accepted.

This is making bad code worse.
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
Hi Daniel,

On 07/11/16 14:51, Daniel Bristot de Oliveira wrote:
> On 11/07/2016 11:31 AM, Tommaso Cucinotta wrote:

[...]

> > -) only issue might be that, if a non-RT task wakes up after the
> > unthrottle, it will have to wait, but worst-case it will have a
> > chance in the next throttling window
>
> In the current default behavior (RT_RUNTIME_SHARE), in a domain with
> more than two CPUs, the worst case easily becomes "infinity," because a
> CPU can borrow runtime from another CPU. There is no guarantee of
> minimum latency for non-rt tasks. Anyway, if the user wants to provide
> such a guarantee, they just need not enable this feature, while
> disabling RT_RUNTIME_SHARE (or run the non-rt task as a deadline
> task ;-))

I could only skim through the patch, so please forgive me if I'm talking
gibberish, but I think what Tommaso is saying is that with your current
approach, if an unlucky OTHER task wakes up just after you unthrottled an
rt_rq (by replenishment), it will have to wait until the next throttling
event. I agree that this is still better than the current status, and
that you can still configure the system to avoid this from happening.

What I'm wondering, though, is if we could modify your implementation and
only do the replenishment when the replenishment timer actually fires,
but let RT tasks continue to run, while their rt_rq is throttled, if no
OTHER task is present or wakes up. I guess this will complicate things,
and maybe doesn't buy us much; just an idea. :)

Otherwise, the patch looks good and useful to me.

Best,

- Juri
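For reference, Daniel's aside about running the non-rt task as a
deadline task can be done from the shell with chrt (util-linux 2.27 or
newer); a sketch with illustrative 10 ms runtime / 100 ms period values
(times in nanoseconds; ./non_rt_task is a placeholder):

  # chrt --deadline --sched-runtime 10000000 --sched-deadline 100000000 \
         --sched-period 100000000 0 ./non_rt_task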
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
Hi all,

since GRUB reclaiming has been mentioned, I am going to add some comments
on it :)

On Mon, 7 Nov 2016 14:51:37 +0100 Daniel Bristot de Oliveira wrote:

[...]

> The sum of allocated runtime for all DL tasks will not be greater than
> the RT throttling enforcement runtime. The DL scheduler admission
> control already avoids this by limiting the amount of CPU time all DL
> tasks can consume (see init_dl_bw()). So, DL tasks avoid the "global"
> throttling beforehand - in the admission control.
>
> GRUB might implement something <> for the DEADLINE scheduler.
> With GRUB, a deadline task will have more runtime than previously
> set/granted. But I am quite sure it will still be bounded by the sum of
> the already allocated DL runtime, which will continue being smaller
> than "to_ratio(global_rt_period(), global_rt_runtime())".

Well, it's not exactly like this... In the original GRUB algorithm [1]
(which was uni-processor only), the tasks were able to reclaim 100% of
the CPU time (in other words: with the original GRUB algorithm,
SCHED_DEADLINE tasks can starve all of the non-deadline tasks).

But in the patchset I submitted, I modified the algorithm to reclaim
only a specified fraction of the CPU time [2] (so that some CPU time is
left for non-deadline tasks). See patch 5/6 in my latest submission
(v3). I set the percentage of reclaimable CPU time equal to
"to_ratio(global_rt_period(), global_rt_runtime())" (so deadline tasks
can consume up to this fraction), but this can be changed if needed.

Finally, notice that if we are interested in hard schedulability (hard
respect of all the deadlines - this is not something that SCHED_DEADLINE
currently does), then the reclaiming algorithm must be modified and can
reclaim a smaller amount of CPU time (see [3,4] for details).

Luca

[1] Lipari, G., & Baruah, S. (2000). Greedy reclamation of unused
bandwidth in constant-bandwidth servers. In Real-Time Systems, 2000.
Euromicro RTS 2000. 12th Euromicro Conference on (pp. 193-200). IEEE.
[2] Abeni, L., Lelli, J., Scordino, C., & Palopoli, L. (2014, October).
Greedy CPU reclaiming for SCHED_DEADLINE. In Proceedings of the
Real-Time Linux Workshop (RTLWS), Dusseldorf, Germany.
[3] Abeni, L., Lipari, G., Parri, A., & Sun, Y. (2016, April). Multicore
CPU reclaiming: parallel or sequential?. In Proceedings of the 31st
Annual ACM Symposium on Applied Computing (pp. 1877-1884). ACM.
[4] https://arxiv.org/abs/1512.01984

> Am I missing something?
>
> > -) only issue might be that, if a non-RT task wakes up after the
> > unthrottle, it will have to wait, but worst-case it will have a
> > chance in the next throttling window
>
> In the current default behavior (RT_RUNTIME_SHARE), in a domain with
> more than two CPUs, the worst case easily becomes "infinity," because a
> CPU can borrow runtime from another CPU. There is no guarantee of
> minimum latency for non-rt tasks. Anyway, if the user wants to provide
> such a guarantee, they just need not enable this feature, while
> disabling RT_RUNTIME_SHARE (or run the non-rt task as a deadline
> task ;-))
>
> > -) an alternative to unthrottling might be temporary class downgrade
> > to sched_other, but that might be much more complex; instead this
> > one of Daniel's looks quite simple
>
> Yeah, decreasing the priority of the task would be something way more
> complicated and prone to errors. RT tasks would need to reduce their
> priority to a level higher than the IDLE task, but lower than
> SCHED_IDLE...
>
> > -) when considering also DEADLINE tasks, it might be good to think
> > about how we'd like the throttling of DEADLINE and RT tasks to
> > inter-relate, e.g.:
>
> Currently, DL tasks are limited (in the bw control) to the global RT
> throttling limit...
>
> I think that this might be an extension to GRUB... that is, extending
> the current behavior... so... things for the future - and IMHO it is
> another topic - way more challenging.
>
> Comments are welcome :-)
>
> -- Daniel
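For reference, to_ratio() expresses the runtime/period ratio as a
fixed-point fraction (scaled by 2^20), so with the default global values
the bound Luca and Daniel refer to is simply

  U_max = global_rt_runtime() / global_rt_period()
        = 950000 us / 1000000 us = 0.95

i.e. DL tasks (and, with [2], GRUB reclaiming) are admitted only up to
95% of the CPU time.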
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 21:33:02 +0100
Daniel Bristot de Oliveira wrote:

> On 11/07/2016 09:16 PM, Steven Rostedt wrote:
> > I'm confused? Are you saying that RR tasks don't get throttled in the
> > current code? That sounds like a bug to me.
>
> If RT_RUNTIME_SHARE is enabled, the CPU on which the RR tasks are
> running (and pinned) will borrow RT runtime from another CPU, allowing
> the RR tasks to run forever. For example:
>
> [root@kiron debug]# cat /proc/sched_debug | grep rt_runtime
>   .rt_runtime: 950.00
>   .rt_runtime: 950.00
>   .rt_runtime: 950.00
>   .rt_runtime: 950.00
> [root@kiron debug]# echo RT_RUNTIME_SHARE > sched_features
> [root@kiron debug]# taskset -c 2 chrt -r 5 /home/bristot/f &
> [1] 23908
> [root@kiron debug]# taskset -c 2 chrt -r 5 /home/bristot/f &
> [2] 23915
> [root@kiron debug]# cat /proc/sched_debug | grep rt_runtime
>   .rt_runtime: 900.00
>   .rt_runtime: 950.00
>   .rt_runtime: 1000.00
>   .rt_runtime: 950.00
>
> You see? The rt_runtime of CPU 2 borrowed time from CPU 0.
>
> It is not a BUG but a feature (no jokes haha). With RT_RUNTIME_SHARE,
> the rt_runtime is effectively a global runtime. It works fine for tasks
> that can migrate... but that is not the case for per-cpu kworkers.

This still looks like a bug, or not the expected result. Perhaps we
shouldn't share when tasks are pinned. It doesn't make sense. It's like
pinning two deadline tasks to the same CPU, giving them 100% of that CPU,
and saying that it's really just 1/nr_cpus of usage, which would have the
same effect.

OK, it appears this is specific to RT_RUNTIME_SHARE, which is what causes
this strange behavior, and even more rationale to make this a default
option and perhaps even turn RT_RUNTIME_SHARE off by default.

-- Steve
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On 11/07/2016 09:16 PM, Steven Rostedt wrote:
> I'm confused? Are you saying that RR tasks don't get throttled in the
> current code? That sounds like a bug to me.

If RT_RUNTIME_SHARE is enabled, the CPU on which the RR tasks are running
(and pinned) will borrow RT runtime from another CPU, allowing the RR
tasks to run forever. For example:

[root@kiron debug]# cat /proc/sched_debug | grep rt_runtime
  .rt_runtime: 950.00
  .rt_runtime: 950.00
  .rt_runtime: 950.00
  .rt_runtime: 950.00
[root@kiron debug]# echo RT_RUNTIME_SHARE > sched_features
[root@kiron debug]# taskset -c 2 chrt -r 5 /home/bristot/f &
[1] 23908
[root@kiron debug]# taskset -c 2 chrt -r 5 /home/bristot/f &
[2] 23915
[root@kiron debug]# cat /proc/sched_debug | grep rt_runtime
  .rt_runtime: 900.00
  .rt_runtime: 950.00
  .rt_runtime: 1000.00
  .rt_runtime: 950.00

You see? The rt_runtime of CPU 2 borrowed time from CPU 0.

It is not a BUG but a feature (no jokes haha). With RT_RUNTIME_SHARE, the
rt_runtime is effectively a global runtime. It works fine for tasks that
can migrate... but that is not the case for per-cpu kworkers.

-- Daniel
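For reproduction, /home/bristot/f above is presumably just a CPU-bound
busy loop; a stand-in sketch using only standard tools:

  # cat > f <<'EOF'
  #!/bin/sh
  while :; do :; done
  EOF
  # chmod +x f
  # taskset -c 2 chrt -r 5 ./f &   # pin a SCHED_RR prio-5 spinner to CPU 2
  # taskset -c 2 chrt -r 5 ./f &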
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 21:06:50 +0100
Daniel Bristot de Oliveira wrote:

> The throttling allowed the kworker to run, but once the kworker went to
> sleep, the RT tasks started to work again. In the previous behavior,
> the system would either go idle, or the kworker would starve because
> the runtime became infinite for RR tasks.

I'm confused? Are you saying that RR tasks don't get throttled in the
current code? That sounds like a bug to me.

-- Steve
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On 11/07/2016 09:00 PM, Steven Rostedt wrote:
> On Mon, 7 Nov 2016 13:54:12 -0600 (CST)
> Christoph Lameter wrote:
>
> > On Mon, 7 Nov 2016, Steven Rostedt wrote:
> >
> > > On Mon, 7 Nov 2016 13:30:15 -0600 (CST)
> > > Christoph Lameter wrote:
> > >
> > > > SCHED_RR tasks alternately running on one cpu can cause endless
> > > > deferral of kworker threads. With the global effect of the OS
> > > > processing reserved, it may be the case that the processor we are
> > > > executing on never gets any time. And if that kworker thread's
> > > > role is releasing a mutex (like the cgroup_lock), then deadlocks
> > > > can result.
> > >
> > > I believe SCHED_RR tasks will still throttle if they use up too
> > > much of the CPU. But I still don't see how this patch helps your
> > > situation.
> >
> > The kworker thread will be able to make progress? Or am I not reading
> > this correctly?
>
> If the kworker is SCHED_OTHER, then it will be able to make progress if
> the RT tasks are throttled.
>
> What Daniel's patch does is turn off throttling of the RT tasks if
> there's no other task on the run queue.

Here is an example of two spinning RR tasks (f-22466 & f-22473) and an
other task (o-22506):

f-22466 [002] d... 79045.641364: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=o next_pid=22506 next_prio=120
o-22506 [002] d... 79045.690379: sched_switch: prev_comm=o prev_pid=22506 prev_prio=120 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79045.725359: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79045.825356: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79045.925350: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79046.025346: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79046.125346: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79046.225337: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79046.325333: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79046.425328: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79046.525324: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79046.625319: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79046.641320: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=o next_pid=22506 next_prio=120
o-22506 [002] d... 79046.690335: sched_switch: prev_comm=o prev_pid=22506 prev_prio=120 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94

The throttling is per-rq, so even if the RR tasks keep switching between
each other, the throttling will take place if there is any sched other
task. In Christoph's case, the other task will be the kworker, like
below:

f-22466 [002] d... 79294.430542: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79294.483539: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=kworker/2:1 next_pid=22198 next_prio=120
kworker/2:1-22198 [002] d... 79294.483544: sched_switch: prev_comm=kworker/2:1 prev_pid=22198 prev_prio=120 prev_state=S ==> next_comm=f next_pid=22473 next_prio=94
f-22473 [002] d... 79294.530537: sched_switch: prev_comm=f prev_pid=22473 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22466 next_prio=94
f-22466 [002] d... 79294.630541: sched_switch: prev_comm=f prev_pid=22466 prev_prio=94 prev_state=R ==> next_comm=f next_pid=22473 next_prio=94

The throttling allowed the kworker to run, but once the kworker went to
sleep, the RT tasks started to work again. In the previous behavior, the
system would either go idle, or the kworker would starve because the
runtime became infinite for RR tasks.

-- Daniel
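For reference, traces like the above come from the sched_switch
tracepoint and can be captured with ftrace; a sketch, assuming debugfs
is mounted:

  # cd /sys/kernel/debug/tracing
  # echo 1 > events/sched/sched_switch/enable
  # cat trace_pipe      # the [002] column identifies the CPU
  # echo 0 > events/sched/sched_switch/enable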
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 13:30:15 -0600 (CST)
Christoph Lameter wrote:

> SCHED_RR tasks alternately running on one cpu can cause endless
> deferral of kworker threads. With the global effect of the OS
> processing reserved, it may be the case that the processor we are
> executing on never gets any time. And if that kworker thread's role is
> releasing a mutex (like the cgroup_lock), then deadlocks can result.

I believe SCHED_RR tasks will still throttle if they use up too much of
the CPU. But I still don't see how this patch helps your situation.

-- Steve
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 13:54:12 -0600 (CST)
Christoph Lameter wrote:

> On Mon, 7 Nov 2016, Steven Rostedt wrote:
>
> > On Mon, 7 Nov 2016 13:30:15 -0600 (CST)
> > Christoph Lameter wrote:
> >
> > > SCHED_RR tasks alternately running on one cpu can cause endless
> > > deferral of kworker threads. With the global effect of the OS
> > > processing reserved, it may be the case that the processor we are
> > > executing on never gets any time. And if that kworker thread's role
> > > is releasing a mutex (like the cgroup_lock), then deadlocks can
> > > result.
> >
> > I believe SCHED_RR tasks will still throttle if they use up too much
> > of the CPU. But I still don't see how this patch helps your
> > situation.
>
> The kworker thread will be able to make progress? Or am I not reading
> this correctly?

If the kworker is SCHED_OTHER, then it will be able to make progress if
the RT tasks are throttled.

What Daniel's patch does is turn off throttling of the RT tasks if
there's no other task on the run queue.

-- Steve
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016, Steven Rostedt wrote: > On Mon, 7 Nov 2016 13:30:15 -0600 (CST) > Christoph Lameter wrote: > > > SCHED_RR tasks alternately running on one cpu can cause endless deferral of > > kworker threads. With the global effect of the OS processing reserved > > it may be the case that the processor we are executing on never gets any > > time. And if that kworker thread's role is releasing a mutex (like the > > cgroup_lock) then deadlocks can result. > > I believe SCHED_RR tasks will still throttle if they use up too much of > the CPU. But I still don't see how this patch helps your situation. The kworker thread will be able to make progress? Or am I not reading this correctly?
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016, Steven Rostedt wrote: > On Mon, 7 Nov 2016 10:55:38 -0600 (CST) > Christoph Lameter wrote: > > > On Mon, 7 Nov 2016, Daniel Bristot de Oliveira wrote: > > > > > With these two options set, the user will guarantee some runtime > > > for non-rt-tasks on all CPUs, while keeping real-time tasks running > > > as much as possible. > > > > Excellent, this would improve the situation with deadlocks as a result of > > cgroup_locks not being released due to lack of workqueue processing. > > ?? What deadlocks do you see? I mean, can you show the situation that > throttling RT tasks will cause deadlock? > > Sorry, but I'm just not seeing it. SCHED_RR tasks alternately running on one cpu can cause endless deferral of kworker threads. With the global effect of the OS processing reserved it may be the case that the processor we are executing on never gets any time. And if that kworker thread's role is releasing a mutex (like the cgroup_lock) then deadlocks can result.
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 19:49:03 +0100 Daniel Bristot de Oliveira wrote: > On 11/07/2016 07:32 PM, Steven Rostedt wrote: > >> Excellent, this would improve the situation with deadlocks as a result of > >> > cgroup_locks not being released due to lack of workqueue processing. > > ?? What deadlocks do you see? I mean, can you show the situation that > > throttling RT tasks will cause deadlock? > > > > Sorry, but I'm just not seeing it. > > It is not a deadlock in the theoretical sense of the word, but rather > a side effect of the starvation - one that looks like a deadlock. > > There is a case where the removal of a cgroup dir calls > lru_add_drain_all(), which might schedule a kworker on the CPU that is > running the spinning-rt task. The kworker will starve - because kworkers are > SCHED_OTHER by design - and lru_add_drain_all() will wait forever while > holding the cgroup lock, and this will cause a lot of problems for other > tasks. I understand the issue with not throttling an RT task, but this patch is about not throttling! That is, what scenario is there that will cause a "deadlock" or deadlock-like situation to happen when we *do* throttle, where not throttling will work better, as this patch would have? -- Steve
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On 11/07/2016 07:32 PM, Steven Rostedt wrote: >> Excellent, this would improve the situation with deadlocks as a result of >> > cgroup_locks not being released due to lack of workqueue processing. > ?? What deadlocks do you see? I mean, can you show the situation that > throttling RT tasks will cause deadlock? > > Sorry, but I'm just not seeing it. It is not a deadlock in the theoretical sense of the word, but rather a side effect of the starvation - one that looks like a deadlock. There is a case where the removal of a cgroup dir calls lru_add_drain_all(), which might schedule a kworker on the CPU that is running the spinning-rt task. The kworker will starve - because kworkers are SCHED_OTHER by design - and lru_add_drain_all() will wait forever while holding the cgroup lock, and this will cause a lot of problems for other tasks. This problem was fixed on -rt using an -rt specific locking mechanism, but the problem still exists in the non-rt kernel. Btw, this is just one example of the side effects of the starvation of non-rt tasks. We could list more examples... -- Daniel
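A sketch of how that starvation can be observed, while an RT hog like the spinner shown earlier monopolizes a CPU. The cgroup mount point and the use of the v1 memory controller are assumptions for illustration, not details from this thread; adjust the path for your system and run as root.

#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical test directory; adjust to your cgroup mount */
	const char *dir = "/sys/fs/cgroup/memory/greed-test";
	struct timespec t0, t1;

	if (mkdir(dir, 0755) == -1) {
		perror("mkdir");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	/*
	 * The removal path may block in lru_add_drain_all() waiting for
	 * per-CPU kworkers; with an unthrottled RT hog it never returns.
	 */
	if (rmdir(dir) == -1)
		perror("rmdir");
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("rmdir took %.3f s\n",
	       (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}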
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 13:30:46 -0500 Steven Rostedt wrote: > On Mon, 7 Nov 2016 12:22:21 -0600 > Clark Williams wrote: > > > I'm still reviewing the patch, but I have to wonder why bother with making > > it a scheduler feature? > > > > The SCHED_FIFO definition allows a fifo thread to starve others > > because a fifo task will run until it yields. Throttling was added as > > a safety valve to allow starved SCHED_OTHER tasks to get some cpu > > time. Adding this unconditionally gets us a safety valve for > > throttling a badly written fifo task, but allows the fifo task to > > continue to consume cpu cycles if it's not starving anyone. > > > > Or am I missing something that's blazingly obvious? > > Or I say make it the default. If people want the old behavior, they can > modify SCHED_FEATURES to do so. > Ok, I can see wanting the previous behavior. Clark
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On 11/07/2016 07:30 PM, Steven Rostedt wrote: >> I'm still reviewing the patch, but I have to wonder why bother with making >> it a scheduler feature? >> > >> > The SCHED_FIFO definition allows a fifo thread to starve others >> > because a fifo task will run until it yields. Throttling was added as >> > a safety valve to allow starved SCHED_OTHER tasks to get some cpu >> > time. Adding this unconditionally gets us a safety valve for >> > throttling a badly written fifo task, but allows the fifo task to >> > continue to consume cpu cycles if it's not starving anyone. >> > >> > Or am I missing something that's blazingly obvious? > Or I say make it the default. If people want the old behavior, they can > modify SCHED_FEATURES to do so. I added it as a feature to keep the current behavior by default. Currently, we have two throttling modes: With RT_RUNTIME_SHARE (default): before throttling, try to borrow some runtime from another CPU. Without RT_RUNTIME_SHARE: throttle the RT task, even if there is nothing else to do. The problem with the first is that a CPU can easily borrow enough runtime to let the spinning rt task run forever, allowing the starvation of non-rt tasks and hence invalidating the mechanism. The problem with the second is that (with the default values) the CPU will be idle 5% of the time. IMHO, the balanced behavior is the GREED option with RT_RUNTIME_SHARE disabled: the non-rt tasks will be able to run, while avoiding the CPU going idle. We can turn it on by default by changing the default feature flags. Moreover, AFAICS, these sched options are static keys, so they are very low overhead conditions. -- Daniel
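For concreteness, the 5% figure comes from the default global knobs, sched_rt_runtime_us = 950000 and sched_rt_period_us = 1000000. A small helper (a sketch, not part of the patch) that prints the resulting budget:

#include <stdio.h>

static int read_long(const char *path, long *val)
{
	FILE *f = fopen(path, "r");
	int ok = 0;

	if (f) {
		ok = (fscanf(f, "%ld", val) == 1);
		fclose(f);
	}
	return ok;
}

int main(void)
{
	long runtime, period;

	if (!read_long("/proc/sys/kernel/sched_rt_runtime_us", &runtime) ||
	    !read_long("/proc/sys/kernel/sched_rt_period_us", &period)) {
		fprintf(stderr, "could not read the RT bandwidth sysctls\n");
		return 1;
	}
	if (runtime == -1)	/* -1 disables RT throttling entirely */
		printf("RT throttling disabled\n");
	else
		printf("RT budget: %ld/%ld us = %.1f%% of each period\n",
		       runtime, period, 100.0 * runtime / period);
	return 0;
}

With the defaults this prints a 95.0% budget; without RT_RUNTIME_SHARE the remaining 5% is exactly the per-period idle gap that RT_RUNTIME_GREED reclaims when no SCHED_OTHER task is runnable.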
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 10:55:38 -0600 (CST) Christoph Lameter wrote: > On Mon, 7 Nov 2016, Daniel Bristot de Oliveira wrote: > > > With these two options set, the user will guarantee some runtime > > for non-rt-tasks on all CPUs, while keeping real-time tasks running > > as much as possible. > > Excellent, this would improve the situation with deadlocks as a result of > cgroup_locks not being released due to lack of workqueue processing. ?? What deadlocks do you see? I mean, can you show the situation that throttling RT tasks will cause deadlock? Sorry, but I'm just not seeing it. -- Steve
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 12:22:21 -0600 Clark Williams wrote: > I'm still reviewing the patch, but I have to wonder why bother with making it > a scheduler feature? > > The SCHED_FIFO definition allows a fifo thread to starve others > because a fifo task will run until it yields. Throttling was added as > a safety valve to allow starved SCHED_OTHER tasks to get some cpu > time. Adding this unconditionally gets us a safety valve for > throttling a badly written fifo task, but allows the fifo task to > continue to consume cpu cycles if it's not starving anyone. > > Or am I missing something that's blazingly obvious? Or I say make it the default. If people want the old behavior, they can modify SCHED_FEATURES to do so. -- Steve
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 19:03:08 +0100 Tommaso Cucinotta wrote: > On 07/11/2016 14:51, Daniel Bristot de Oliveira wrote: > > Hi Tommaso, > > Hi, > > I'm cc-ing Luca for GRUB et al., pls find a few further notes below... Thanks Tommaso! I've seen the email on the linux-rt-users mailing list, and I'll reply there about GRUB... Thanks, Luca > > > On 11/07/2016 11:31 AM, Tommaso Cucinotta wrote: > >> as anticipated live to Daniel: > >> -) +1 for the general concept, we'd need something similar also for > >> SCHED_DEADLINE > > > > In short: the sum of the runtime of deadline tasks will not be > > greater than the "to_ratio(global_rt_period(), > > global_rt_runtime())" - see init_dl_bw(). Therefore, the DL rq will > > not be throttled by the RT throttling mechanism. > > > > Extended: RT tasks' throttling aims to bound, for all CPUs of a > > domain - when RT_RUNTIME_SHARE is enabled; or per-rq - > > when RT_RUNTIME_SHARE is disabled; the amount of time that RT > > tasks can run continuously, in such a way as to provide some CPU time > > for non-real-time tasks to run. RT tasks need this global/local > > throttling mechanism to avoid the starvation of non-rt tasks, > > because RT tasks do not have a limited runtime - an RT task (or > > taskset) can run for an infinite runtime. > > > > DL tasks' throttling has another meaning. DL tasks' throttling aims > > to prevent *a* DL task from running for more than *its own* > > pre-allocated runtime. > > sure, and having an option to let it run for longer, if there's > nothing else running in the system, is still interesting for pretty > much similar reasons to those being discussed in this thread ... > > > The sum of allocated runtime for all DL tasks will not be greater > > than the RT throttling enforcement runtime. The DL scheduler admission > > control already ensures this by limiting the amount of CPU time all > > DL tasks can consume (see init_dl_bw()). So, DL tasks avoid > > the "global" throttling beforehand - in the admission control. > > > > GRUB might implement something <> for the DEADLINE > > scheduler. With GRUB, a deadline task will have more runtime than > > previously set/granted. > > yes, the main difference being: GRUB will let a DL task run for longer > than its own runtime, but still let it starve anything below (RT as > well as OTHER tasks); perhaps Luca (cc) has some further comment on > this... > > Thanks, > > T. > > > But I am quite sure it will still be bounded by the sum > > of the already allocated DL runtime, which will continue being > > smaller than "to_ratio(global_rt_period(), global_rt_runtime())". > > > > Am I missing something? > > > >> -) only issue might be that, if a non-RT task wakes up after the > >> unthrottle, it will have to wait, but worst-case it will have a > >> chance in the next throttling window > > > > In the current default behavior (RT_RUNTIME_SHARE), in a domain > > with more than two CPUs, the worst case easily becomes "infinity," > > because a CPU can borrow runtime from another CPU. There is no > > guarantee of minimum latency for non-rt tasks. Anyway, if the user > > wants to provide such a guarantee, they just need not enable this > > feature, while disabling RT_RUNTIME_SHARE (or run the non-rt task > > as a deadline task ;-)) > > > >> -) an alternative to unthrottling might be temporary class > >> downgrade to sched_other, but that might be much more complex, > >> instead this Daniel's one looks quite simple > > > > Yeah, decreasing the priority of the task would be way more > > complicated and prone to errors. RT tasks would need to reduce their > > priority to a level higher than the IDLE task, but lower than > > SCHED_IDLE... > > > >> -) when considering also DEADLINE tasks, it might be good to think > >> about how we'd like the throttling of DEADLINE and RT tasks to > >> inter-relate, e.g.: > > > > Currently, DL tasks are limited (in the bw control) to the global RT > > throttling limit... > > > > I think that this might be an extension to GRUB... that is > > extending the current behavior... so... things for the future - and > > IMHO it is another topic - way more challenging. > > > > Comments are welcome :-) > > > > -- Daniel
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016 09:17:55 +0100 Daniel Bristot de Oliveira wrote: > The rt throttling mechanism prevents the starvation of non-real-time > tasks by CPU intensive real-time tasks. In terms of percentage, > the default behavior allows real-time tasks to run up to 95% of a > given period, leaving the other 5% of the period for non-real-time > tasks. In the absence of non-rt tasks, the system goes idle for 5% > of the period. > > Although this behavior works fine for the purpose of avoiding > bad real-time tasks that can hang the system, some greedy users > want to allow the real-time task to continue running when no > non-real-time tasks are starving. In other words, they do not want to > see the system going idle. > > This patch implements the RT_RUNTIME_GREED scheduler feature for greedy > users (TM). When enabled, this feature will check if non-rt tasks are > starving before throttling the real-time task. If the real-time task > becomes throttled, it will be unthrottled as soon as the system goes > idle, or when the next period starts, whichever comes first. > > This feature is enabled with the following command: > # echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features > I'm still reviewing the patch, but I have to wonder why bother with making it a scheduler feature? The SCHED_FIFO definition allows a fifo thread to starve others because a fifo task will run until it yields. Throttling was added as a safety valve to allow starved SCHED_OTHER tasks to get some cpu time. Adding this unconditionally gets us a safety valve for throttling a badly written fifo task, but allows the fifo task to continue to consume cpu cycles if it's not starving anyone. Or am I missing something that's blazingly obvious? Clark
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On 07/11/2016 14:51, Daniel Bristot de Oliveira wrote: Hi Tommaso, Hi, I'm cc-ing Luca for GRUB et al., pls find a few further notes below... On 11/07/2016 11:31 AM, Tommaso Cucinotta wrote: as anticipated live to Daniel: -) +1 for the general concept, we'd need something similar also for SCHED_DEADLINE In short: the sum of the runtime of deadline tasks will not be greater than the "to_ratio(global_rt_period(), global_rt_runtime())" - see init_dl_bw(). Therefore, the DL rq will not be throttled by the RT throttling mechanism. Extended: RT tasks' throttling aims to bound, for all CPUs of a domain - when RT_RUNTIME_SHARE is enabled; or per-rq - when RT_RUNTIME_SHARE is disabled; the amount of time that RT tasks can run continuously, in such a way as to provide some CPU time for non-real-time tasks to run. RT tasks need this global/local throttling mechanism to avoid the starvation of non-rt tasks, because RT tasks do not have a limited runtime - an RT task (or taskset) can run for an infinite runtime. DL tasks' throttling has another meaning. DL tasks' throttling aims to prevent *a* DL task from running for more than *its own* pre-allocated runtime. sure, and having an option to let it run for longer, if there's nothing else running in the system, is still interesting for pretty much similar reasons to those being discussed in this thread ... The sum of allocated runtime for all DL tasks will not be greater than the RT throttling enforcement runtime. The DL scheduler admission control already ensures this by limiting the amount of CPU time all DL tasks can consume (see init_dl_bw()). So, DL tasks avoid the "global" throttling beforehand - in the admission control. GRUB might implement something <> for the DEADLINE scheduler. With GRUB, a deadline task will have more runtime than previously set/granted. yes, the main difference being: GRUB will let a DL task run for longer than its own runtime, but still let it starve anything below (RT as well as OTHER tasks); perhaps Luca (cc) has some further comment on this... Thanks, T. But I am quite sure it will still be bounded by the sum of the already allocated DL runtime, which will continue being smaller than "to_ratio(global_rt_period(), global_rt_runtime())". Am I missing something? -) only issue might be that, if a non-RT task wakes up after the unthrottle, it will have to wait, but worst-case it will have a chance in the next throttling window In the current default behavior (RT_RUNTIME_SHARE), in a domain with more than two CPUs, the worst case easily becomes "infinity," because a CPU can borrow runtime from another CPU. There is no guarantee of minimum latency for non-rt tasks. Anyway, if the user wants to provide such a guarantee, they just need not enable this feature, while disabling RT_RUNTIME_SHARE (or run the non-rt task as a deadline task ;-)) -) an alternative to unthrottling might be temporary class downgrade to sched_other, but that might be much more complex, instead this Daniel's one looks quite simple Yeah, decreasing the priority of the task would be way more complicated and prone to errors. RT tasks would need to reduce their priority to a level higher than the IDLE task, but lower than SCHED_IDLE... -) when considering also DEADLINE tasks, it might be good to think about how we'd like the throttling of DEADLINE and RT tasks to inter-relate, e.g.: Currently, DL tasks are limited (in the bw control) to the global RT throttling limit... I think that this might be an extension to GRUB... that is extending the current behavior... so... things for the future - and IMHO it is another topic - way more challenging. Comments are welcome :-) -- Daniel -- Tommaso Cucinotta, Computer Engineering PhD Associate Professor at the Real-Time Systems Laboratory (ReTiS) Scuola Superiore Sant'Anna, Pisa, Italy http://retis.sssup.it/people/tommaso
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
On Mon, 7 Nov 2016, Daniel Bristot de Oliveira wrote: > With these two options set, the user will guarantee some runtime > for non-rt-tasks on all CPUs, while keeping real-time tasks running > as much as possible. Excellent, this would improve the situation with deadlocks as a result of cgroup_locks not being released due to lack of workqueue processing.
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
Hi Tommaso, On 11/07/2016 11:31 AM, Tommaso Cucinotta wrote: > as anticipated live to Daniel: > -) +1 for the general concept, we'd need something similar also for > SCHED_DEADLINE In short: the sum of the runtime of deadline tasks will not be greater than the "to_ratio(global_rt_period(), global_rt_runtime())" - see init_dl_bw(). Therefore, the DL rq will not be throttled by the RT throttling mechanism. Extended: RT tasks' throttling aims to bound, for all CPUs of a domain - when RT_RUNTIME_SHARE is enabled; or per-rq - when RT_RUNTIME_SHARE is disabled; the amount of time that RT tasks can run continuously, in such a way as to provide some CPU time for non-real-time tasks to run. RT tasks need this global/local throttling mechanism to avoid the starvation of non-rt tasks, because RT tasks do not have a limited runtime - an RT task (or taskset) can run for an infinite runtime. DL tasks' throttling has another meaning. DL tasks' throttling aims to prevent *a* DL task from running for more than *its own* pre-allocated runtime. The sum of allocated runtime for all DL tasks will not be greater than the RT throttling enforcement runtime. The DL scheduler admission control already ensures this by limiting the amount of CPU time all DL tasks can consume (see init_dl_bw()). So, DL tasks avoid the "global" throttling beforehand - in the admission control. GRUB might implement something <> for the DEADLINE scheduler. With GRUB, a deadline task will have more runtime than previously set/granted. But I am quite sure it will still be bounded by the sum of the already allocated DL runtime, which will continue being smaller than "to_ratio(global_rt_period(), global_rt_runtime())". Am I missing something? > -) only issue might be that, if a non-RT task wakes up after the > unthrottle, it will have to wait, but worst-case it will have a chance > in the next throttling window In the current default behavior (RT_RUNTIME_SHARE), in a domain with more than two CPUs, the worst case easily becomes "infinity," because a CPU can borrow runtime from another CPU. There is no guarantee of minimum latency for non-rt tasks. Anyway, if the user wants to provide such a guarantee, they just need not enable this feature, while disabling RT_RUNTIME_SHARE (or run the non-rt task as a deadline task ;-)) > -) an alternative to unthrottling might be temporary class downgrade to > sched_other, but that might be much more complex, instead this Daniel's > one looks quite simple Yeah, decreasing the priority of the task would be way more complicated and prone to errors. RT tasks would need to reduce their priority to a level higher than the IDLE task, but lower than SCHED_IDLE... > -) when considering also DEADLINE tasks, it might be good to think about > how we'd like the throttling of DEADLINE and RT tasks to inter-relate, > e.g.: Currently, DL tasks are limited (in the bw control) to the global RT throttling limit... I think that this might be an extension to GRUB... that is extending the current behavior... so... things for the future - and IMHO it is another topic - way more challenging. Comments are welcome :-) -- Daniel
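To make the admission-control point concrete: a task requests DL bandwidth with sched_setattr(), and the kernel refuses (EBUSY) any request that would push the total DL utilization over the cap derived from the global RT knobs. A hedged sketch follows, with the struct spelled out by hand since glibc provides no wrapper; it assumes kernel headers that define __NR_sched_setattr, and the 10ms/100ms parameters are arbitrary illustration values.

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* all three in nanoseconds */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  =  10 * 1000 * 1000;	/* 10 ms every ... */
	attr.sched_deadline = 100 * 1000 * 1000;	/* ... 100 ms      */
	attr.sched_period   = 100 * 1000 * 1000;

	if (syscall(__NR_sched_setattr, 0, &attr, 0) == -1) {
		/* EBUSY: the requested bandwidth exceeds the DL cap */
		fprintf(stderr, "sched_setattr: %s\n", strerror(errno));
		return 1;
	}
	for (;;)
		sched_yield();	/* stand-in for the periodic DL job body */
}

Because init_dl_bw() derives the cap from global_rt_runtime()/global_rt_period(), the sum of all admitted runtimes stays below the RT throttling budget, which is why the DL rq never hits the "global" throttle.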
Re: [PATCH] sched/rt: RT_RUNTIME_GREED sched feature
as anticipated live to Daniel: -) +1 for the general concept, we'd need something similar also for SCHED_DEADLINE -) only issue might be that, if a non-RT task wakes up after the unthrottle, it will have to wait, but worst-case it will have a chance in the next throttling window -) an alternative to unthrottling might be temporary class downgrade to sched_other, but that might be much more complex, instead this Daniel's one looks quite simple -) when considering also DEADLINE tasks, it might be good to think about how we'd like the throttling of DEADLINE and RT tasks to inter-relate, e.g.: a) DEADLINE unthrottles if there's no RT nor OTHER tasks? what if there's an unthrottled RT? b) DEADLINE throttles by downgrading to OTHER? c) DEADLINE throttles by downgrading to RT (RR/FIFO and what prio?) My2c, thanks! T. On 07/11/2016 09:17, Daniel Bristot de Oliveira wrote: The rt throttling mechanism prevents the starvation of non-real-time tasks by CPU intensive real-time tasks. In terms of percentage, the default behavior allows real-time tasks to run up to 95% of a given period, leaving the other 5% of the period for non-real-time tasks. In the absence of non-rt tasks, the system goes idle for 5% of the period. Although this behavior works fine for the purpose of avoiding bad real-time tasks that can hang the system, some greedy users want to allow the real-time task to continue running when no non-real-time tasks are starving. In other words, they do not want to see the system going idle. This patch implements the RT_RUNTIME_GREED scheduler feature for greedy users (TM). When enabled, this feature will check if non-rt tasks are starving before throttling the real-time task. If the real-time task becomes throttled, it will be unthrottled as soon as the system goes idle, or when the next period starts, whichever comes first. This feature is enabled with the following command: # echo RT_RUNTIME_GREED > /sys/kernel/debug/sched_features The user might also want to disable the RT_RUNTIME_SHARE logic, to keep all CPUs with the same rt_runtime: # echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features With these two options set, the user will guarantee some runtime for non-rt-tasks on all CPUs, while keeping real-time tasks running as much as possible. The feature is disabled by default, keeping the current behavior.
Signed-off-by: Daniel Bristot de Oliveira
Reviewed-by: Steven Rostedt
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Christoph Lameter
Cc: linux-rt-users
Cc: LKML

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 42d4027..c4c62ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3275,7 +3275,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie
 		if (unlikely(!p))
 			p = idle_sched_class.pick_next_task(rq, prev, cookie);

-		return p;
+		if (likely(p != RETRY_TASK))
+			return p;
 	}

 again:
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 69631fa..3bd7a6d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -66,6 +66,7 @@ SCHED_FEAT(RT_PUSH_IPI, true)
 SCHED_FEAT(FORCE_SD_OVERLAP, false)
 SCHED_FEAT(RT_RUNTIME_SHARE, true)
+SCHED_FEAT(RT_RUNTIME_GREED, false)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 5405d3f..0f23e06 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -26,6 +26,10 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
 static struct task_struct *
 pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct pin_cookie cookie)
 {
+	if (sched_feat(RT_RUNTIME_GREED))
+		if (try_to_unthrottle_rt_rq(&rq->rt))
+			return RETRY_TASK;
+
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
 	schedstat_inc(rq->sched_goidle);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2516b8d..a6961a5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -631,6 +631,22 @@ static inline struct rt_bandwidth *sched_rt_bandwidth(struct rt_rq *rt_rq)

 #endif /* CONFIG_RT_GROUP_SCHED */

+static inline void unthrottle_rt_rq(struct rt_rq *rt_rq)
+{
+	rt_rq->rt_time = 0;
+	rt_rq->rt_throttled = 0;
+	sched_rt_rq_enqueue(rt_rq);
+}
+
+int try_to_unthrottle_rt_rq(struct rt_rq *rt_rq)
+{
+	if (rt_rq_throttled(rt_rq)) {
+		unthrottle_rt_rq(rt_rq);
+		return 1;
+	}
+	return 0;
+}
+
 bool sched_rt_bandwidth_account(struct rt_rq *rt_rq)
 {
 	struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);
@@ -920,6 +936,18 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 	 * but accrue some time due to boosting.
 	 */
 	if