Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-20 Thread Paul E. McKenney
On Thu, Jul 20, 2017 at 04:19:40PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 20, 2017 at 05:50:54AM -0700, Paul E. McKenney wrote:
> > > 
> > > static void cpuidle_idle_call()
> > > {
> > >   rcu_idle_enter()
> > >   ..
> > >   rcu_idle_exit()
> > > }
> > > 
> > > I want
> > > 
> > > static void cpuidle_idle_call()
> > > {
> > >   if (tick stopped)
> > >   rcu_idle_enter()
> > >   ..
> > >   if (tick stopped)
> > >   rcu_idle_exit()
> > > }
> > >
> > > Or checking tick stop can be put into rcu_idle_enter/exit
> > 
> > The answer is the traditional "it depends".
> > 
> > If the above change was all that you did, that would be a bug in the
> > case where the predicted short idle time turned out in reality to be an
> > indefinite idle time.
> 
> Can't be, you didn't disable the tick after all, so you're guaranteed to
> get interrupted by the tick and try again.

I will reserve judgment on that until I see the patch.  But to your point,
I would indeed hope that it works that way.  ;-)

Thanx, Paul



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-20 Thread Peter Zijlstra
On Thu, Jul 20, 2017 at 05:50:54AM -0700, Paul E. McKenney wrote:
> > 
> > static void cpuidle_idle_call()
> > {
> > rcu_idle_enter()
> > ..
> > rcu_idle_exit()
> > }
> > 
> > I want
> > 
> > static void cpuidle_idle_call()
> > {
> > if (tick stopped)
> > rcu_idle_enter()
> > ..
> > if (tick stopped)
> > rcu_idle_exit()
> > }
> >
> > Or checking tick stop can be put into rcu_idle_enter/exit
> 
> The answer is the traditional "it depends".
> 
> If the above change was all that you did, that would be a bug in the
> case where the predicted short idle time turned out in reality to be an
> indefinite idle time.

Can't be, you didn't disable the tick after all, so you're guaranteed to
get interrupted by the tick and try again.




Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-20 Thread Arjan van de Ven

On 7/20/2017 1:11 AM, Thomas Gleixner wrote:

On Thu, 20 Jul 2017, Li, Aubrey wrote:

Don't get me wrong, even if a fast path is acceptable, we still need to
figure out if the coming idle is short and when to switch. I'm just worried
that if irq timings are not an ideal statistic, we may have to skip them too.


There is no ideal solution ever.

Let's sit back and look at that from the big picture first before dismissing
a particular item upfront.

The current NOHZ implementation does:

predict = nohz_predict(timers, rcu, arch, irqwork);

if ((predict - now) > X)
stop_tick()

The C-State machinery does something like:

predict = cstate_predict(next_timer, scheduler);

cstate = cstate_select(predict);

That disconnect is part of the problem. What we really want is:

predict = idle_predict(timers, rcu, arch, irqwork, scheduler, irq timings);


Two separate predictors are clearly a recipe for badness.

(likewise, C and P states try to estimate "performance sensitivity" and
sometimes estimate in opposite directions)

To be honest, performance sensitivity estimation is probably 10x more critical
for C-state selection than idle duration; a lot of modern hardware will do the
energy efficiency stuff in a microcontroller when it coordinates between
multiple cores in the system on C and P states.

(both x86 and ARM have such microcontrollers nowadays, at least for the
higher performance designs)



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-20 Thread Arjan van de Ven

On 7/20/2017 5:50 AM, Paul E. McKenney wrote:

To make this work reasonably, you would also need some way to check for
the case where the predicted idle time is short but the real idle time
is very long.


So for the case where you predict very short but it is actually "indefinite",
the real solution likely is that we set a timer some time in the future
(say 100 msec, or some other value that is long but not indefinite)
where we wake up the system and make a new prediction,
since clearly we were insanely wrong in the prediction and should try
again.

That, or we turn the prediction from a single value into a range of
(expected, upper bound),

where the upper bound is likely the next timer or other going-to-happen event.






Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-20 Thread Paul E. McKenney
On Thu, Jul 20, 2017 at 09:40:49AM +0800, Li, Aubrey wrote:
> On 2017/7/19 22:48, Paul E. McKenney wrote:
> > On Wed, Jul 19, 2017 at 01:44:06PM +0800, Li, Aubrey wrote:
> >> On 2017/7/18 23:20, Paul E. McKenney wrote:
> >>
>  2) for rcu idle enter/exit, I measured the details which Paul provided, and
>  the result matches with what I have measured before, nothing notable found.
>  But it still makes more sense if we can make rcu idle enter/exit hooked with
>  tick off. (it's possible other workloads behave differently)
> >>>
> >>> Again, assuming that RCU is informed of CPUs in the kernel, regardless
> >>> of whether or not the tick is on at that point in time.
> >>>
> >> Yeah, I see, no problem for a normal idle.
> >>
> >> But for a short idle, we want to return to the task ASAP. Even though RCU cost
> >> is not notable, it would still be better for me if we can save some cycles in
> >> idle entry and idle exit.
> >>
> >> Do we have any problem if we skip RCU idle enter/exit under a fast idle scenario?
> >> My understanding is, if the tick is not stopped, then we don't need to inform RCU in
> >> the idle path; it can be informed at irq exit.
> > 
> > Indeed, the problem arises when the tick is stopped.
> 
> My question is, does the problem arise when the tick is *not* stopped (skipping
> nohz idle)?
> 
> instead of 
> 
> static void cpuidle_idle_call()
> {
>   rcu_idle_enter()
>   ..
>   rcu_idle_exit()
> }
> 
> I want
> 
> static void cpuidle_idle_call()
> {
>   if (tick stopped)
>   rcu_idle_enter()
>   ..
>   if (tick stopped)
>   rcu_idle_exit()
> }
> 
> Or checking tick stop can be put into rcu_idle_enter/exit

The answer is the traditional "it depends".

If the above change was all that you did, that would be a bug in the
case where the predicted short idle time turned out in reality to be an
indefinite idle time.  RCU would indefinitely believe that the CPU was
non-idle, and would wait for it to report a quiescent state, which it
would not, even given the scheduling-clock interrupt (it is idle, so it
knows RCU already knows that it is idle, you see).  RCU would eventually
send it a resched IPI, which would probably get things going again,
but the best you could hope for was extra resched IPIs and excessively
long grace periods.

To make this work reasonably, you would also need some way to check for
the case where the predicted idle time is short but the real idle time
is very long.

Thanx, Paul



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-20 Thread Thomas Gleixner
On Thu, 20 Jul 2017, Li, Aubrey wrote:
> Don't get me wrong, even if a fast path is acceptable, we still need to
> figure out if the coming idle is short and when to switch. I'm just worried
> that if irq timings are not an ideal statistic, we may have to skip them too.

There is no ideal solution ever.

Let's sit back and look at that from the big picture first before dismissing
a particular item upfront.

The current NOHZ implementation does:

predict = nohz_predict(timers, rcu, arch, irqwork);

if ((predict - now) > X)
stop_tick()

The C-State machinery does something like:

predict = cstate_predict(next_timer, scheduler);

cstate = cstate_select(predict);

That disconnect is part of the problem. What we really want is:

predict = idle_predict(timers, rcu, arch, irqwork, scheduler, irq timings);

and use that prediction for both the NOHZ and the C-State decision
function. That's the first thing which needs to be addressed.

Once that is done, you can look into the prediction functions and optimize
that or tweak the bits and pieces there and decide which predictors work
best for a particular workload.

As long as you just look into a particular predictor function and do not
address the underlying conceptual issues first, the outcome is very much
predictable: It's going to be useless crap.

Thanks,

tglx





Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Li, Aubrey
On 2017/7/19 15:55, Thomas Gleixner wrote:
> On Wed, 19 Jul 2017, Li, Aubrey wrote:
>> On 2017/7/18 15:19, Thomas Gleixner wrote:
>>> You can very well avoid it by taking the irq timings or whatever other
>>> information into account for the NOHZ decision.
>>>
>> If I read the source correctly, irq timing statistics computation happens at
>> idle time. Sadly, this is what Andi Kleen worried about. People keep putting
>> more and more slow stuff in the idle path, not realizing it could be a critical path.
> 
> Can you please stop this right here? We all realize that there is an issue,
> we are not as stupid as you might think.
> 
> But we disagree that the proper solution is to add an ad hoc hack which
> shortcuts stuff and creates a maintenance problem in the long run.
> 
> Just dismissing a proposal based on 'oh this adds more code to the idle
> path' and then yelling 'you all do not get it' is not going to solve this
> in any way.
> 
> You simply refuse to look at the big picture and you are just not willing
> to analyze the involved bits and pieces and find a solution which allows to
> serve both the NOHZ power saving and the interrupt driven workloads best.
> 
> All you are doing is providing data about the status quo, and then
> inferring from that that you need an extra magic code path. That
> information is just an inventory of the current behaviour, and the current
> behaviour is caused by the current implementation. There is nothing set in
> stone with that implementation and it can be changed.
> 
> But you are just in refusal mode and instead of sitting down and doing the
> hard work, you accuse other people that they are not realizing the problem
> and insist on your sloppy hackery. That's simply not the way it works.

Don't get me wrong, even if a fast path is acceptable, we still need to
figure out if the coming idle is short and when to switch. I'm just worried
that if irq timings are not an ideal statistic, we may have to skip them too.

Give me a little time to digest the details of irq timings and obtain some
data to see if they're suitable for my case.

Thanks,
-Aubrey


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Li, Aubrey
On 2017/7/19 22:48, Paul E. McKenney wrote:
> On Wed, Jul 19, 2017 at 01:44:06PM +0800, Li, Aubrey wrote:
>> On 2017/7/18 23:20, Paul E. McKenney wrote:
>>
 2) for rcu idle enter/exit, I measured the details which Paul provided, and
 the result matches with what I have measured before, nothing notable found.
 But it still makes more sense if we can make rcu idle enter/exit hooked with
 tick off. (it's possible other workloads behave differently)
>>>
>>> Again, assuming that RCU is informed of CPUs in the kernel, regardless
>>> of whether or not the tick is on at that point in time.
>>>
>> Yeah, I see, no problem for a normal idle.
>>
>> But for a short idle, we want to return to the task ASAP. Even though RCU cost
>> is not notable, it would still be better for me if we can save some cycles in
>> idle entry and idle exit.
>>
>> Do we have any problem if we skip RCU idle enter/exit under a fast idle scenario?
>> My understanding is, if the tick is not stopped, then we don't need to inform RCU in
>> the idle path; it can be informed at irq exit.
> 
> Indeed, the problem arises when the tick is stopped.

My question is, does the problem arise when the tick is *not* stopped (skipping
nohz idle)?

instead of 

static void cpuidle_idle_call()
{
rcu_idle_enter()
..
rcu_idle_exit()
}

I want

static void cpuidle_idle_call()
{
if (tick stopped)
rcu_idle_enter()
..
if (tick stopped)
rcu_idle_exit()
}

Or checking tick stop can be put into rcu_idle_enter/exit

Thanks,
-Aubrey


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Paul E. McKenney
On Wed, Jul 19, 2017 at 10:03:22AM -0500, Christopher Lameter wrote:
> On Wed, 19 Jul 2017, Paul E. McKenney wrote:
> 
> > > Do we have any problem if we skip RCU idle enter/exit under a fast idle scenario?
> > > My understanding is, if the tick is not stopped, then we don't need to inform RCU in
> > > the idle path; it can be informed at irq exit.
> >
> > Indeed, the problem arises when the tick is stopped.
> 
> Well is there a boundary when you would want the notification calls? I
> would think that even an idle period of a couple of seconds does not
> necessarily require a callback to rcu. Had some brokenness here where RCU
> calls did not occur for hours or so. At some point the system ran out of
> memory, but that's far off.

Yeah, I should spell this out more completely, shouldn't I?  And get
it into the documentation if it isn't already there...  Where it is
currently at best hinted at.  :-/

1.  If a CPU is either idle or executing in usermode, and RCU believes
it is non-idle, the scheduling-clock tick had better be running.
Otherwise, you will get RCU CPU stall warnings.  Or at best,
very long (11-second) grace periods, with a pointless IPI waking
the CPU each time.

2.  If a CPU is in a portion of the kernel that executes RCU read-side
critical sections, and RCU believes this CPU to be idle, you can get
random memory corruption.  DON'T DO THIS!!!

This is one reason to test with lockdep, which will complain
about this sort of thing.

3.  If a CPU is in a portion of the kernel that is absolutely
positively no-joking guaranteed to never execute any RCU read-side
critical sections, and RCU believes this CPU to be idle,
no problem.  This sort of thing is used by some architectures
for light-weight exception handlers, which can then avoid the
overhead of rcu_irq_enter() and rcu_irq_exit().  Some go further
and avoid the entireties of irq_enter() and irq_exit().

Just make very sure you are running some of your tests with
CONFIG_PROVE_RCU=y.

4.  If a CPU is executing in the kernel with the scheduling-clock
interrupt disabled and RCU believes this CPU to be non-idle,
and if the CPU goes idle (from an RCU perspective) every few
jiffies, no problem.  It is usually OK for there to be the
occasional gap between idle periods of up to a second or so.

If the gap grows too long, you get RCU CPU stall warnings.

5.  If a CPU is either idle or executing in usermode, and RCU believes
it to be idle, of course no problem.

6.  If a CPU is executing in the kernel, the kernel code
path is passing through quiescent states at a reasonable
frequency (preferably about once per few jiffies, but the
occasional excursion to a second or so is usually OK) and the
scheduling-clock interrupt is enabled, of course no problem.

If the gap between a successive pair of quiescent states grows
too long, you get RCU CPU stall warnings.

For purposes of comparison, the default time until RCU CPU stall warning
in mainline is 21 seconds.  A number of distros set this to 60 seconds.
Back in the 90s, the analogous timeout for DYNIX/ptx was 1.5 seconds.  :-/

Hence "a second or so" instead of your "a couple of seconds".  ;-)

Please see below for the corresponding patch to RCU's requirements.
Thoughts?

Thanx, Paul



commit 693c9bfa43f92570dd362d8834440b418bbb994a
Author: Paul E. McKenney 
Date:   Wed Jul 19 09:52:58 2017 -0700

doc: Set down RCU's scheduling-clock-interrupt needs

This commit documents the situations in which RCU needs the
scheduling-clock interrupt to be enabled, along with the consequences
of failing to meet RCU's needs in this area.

Signed-off-by: Paul E. McKenney 

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 95b30fa25d56..7980bee5607f 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2080,6 +2080,8 @@ Some of the relevant points of interest are as follows:
Scheduler and RCU.
Tracing and RCU.
Energy Efficiency.
+   
+   Scheduling-Clock Interrupts and RCU.
Memory Efficiency.

Performance, Scalability, Response Time, and Reliability.
@@ -2532,6 +2534,113 @@ I learned of many of these requirements via angry phone calls:
 Flaming me on the Linux-kernel mailing list was apparently not
 sufficient to fully vent their ire at RCU's energy-efficiency bugs!
 
+
+Scheduling-Clock Interrupts and RCU
+
+
+The kernel transitions between in-kernel non-idle execution, userspace
+execution, and the idle loop.
+Depending on 

Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Paul E. McKenney
On Wed, Jul 19, 2017 at 10:03:22AM -0500, Christopher Lameter wrote:
> On Wed, 19 Jul 2017, Paul E. McKenney wrote:
> 
> > > Do we have any problem if we skip RCU idle enter/exit under a fast idle 
> > > scenario?
> > > My understanding is, if tick is not stopped, then we don't need inform 
> > > RCU in
> > > idle path, it can be informed in irq exit.
> >
> > Indeed, the problem arises when the tick is stopped.
> 
> Well is there a boundary when you would want the notification calls? I
> would think that even an idle period of a couple of seconds does not
> necessarily require a callback to rcu. Had some brokenness here where RCU
> calls did not occur for hours or so. At some point the system ran out of
> memory but thats far off.

Yeah, I should spell this out more completely, shouldn't I?  And get
it into the documentation if it isn't already there...  Where it is
currently at best hinted at.  :-/

1.  If a CPU is either idle or executing in usermode, and RCU believes
it is non-idle, the scheduling-clock tick had better be running.
Otherwise, you will get RCU CPU stall warnings.  Or at best,
very long (11-second) grace periods, with a pointless IPI waking
the CPU each time.

2.  If a CPU is in a portion of the kernel that executes RCU read-side
critical sections, and RCU believes this CPU to be idle, you can get
random memory corruption.  DON'T DO THIS!!!

This is one reason to test with lockdep, which will complain
about this sort of thing.

3.  If a CPU is in a portion of the kernel that is absolutely
positively no-joking guaranteed to never execute any RCU read-side
critical sections, and RCU believes this CPU to to be idle,
no problem.  This sort of thing is used by some architectures
for light-weight exception handlers, which can then avoid the
overhead of rcu_irq_enter() and rcu_irq_exit().  Some go further
and avoid the entireties of irq_enter() and irq_exit().

Just make very sure you are running some of your tests with
CONFIG_PROVE_RCU=y.

4.  If a CPU is executing in the kernel with the scheduling-clock
interrupt disabled and RCU believes this CPU to be non-idle,
and if the CPU goes idle (from an RCU perspective) every few
jiffies, no problem.  It is usually OK for there to be the
occasional gap between idle periods of up to a second or so.

If the gap grows too long, you get RCU CPU stall warnings.

5.  If a CPU is either idle or executing in usermode, and RCU believes
it to be idle, of course no problem.

6.  If a CPU is executing in the kernel, the kernel code
path is passing through quiescent states at a reasonable
frequency (preferably about once per few jiffies, but the
occasional excursion to a second or so is usually OK) and the
scheduling-clock interrupt is enabled, of course no problem.

If the gap between a successive pair of quiescent states grows
too long, you get RCU CPU stall warnings.

For purposes of comparison, the default time until RCU CPU stall warning
in mainline is 21 seconds.  A number of distros set this to 60 seconds.
Back in the 90s, the analogous timeout for DYNIX/ptx was 1.5 seconds.  :-/

Hence "a second or so" instead of your "a couple of seconds".  ;-)

Please see below for the corresponding patch to RCU's requirements.
Thoughts?

Thanx, Paul



commit 693c9bfa43f92570dd362d8834440b418bbb994a
Author: Paul E. McKenney 
Date:   Wed Jul 19 09:52:58 2017 -0700

doc: Set down RCU's scheduling-clock-interrupt needs

This commit documents the situations in which RCU needs the
scheduling-clock interrupt to be enabled, along with the consequences
of failing to meet RCU's needs in this area.

Signed-off-by: Paul E. McKenney 

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 95b30fa25d56..7980bee5607f 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2080,6 +2080,8 @@ Some of the relevant points of interest are as follows:
Scheduler and RCU.
Tracing and RCU.
Energy Efficiency.
+   
+   Scheduling-Clock Interrupts and RCU.
Memory Efficiency.

Performance, Scalability, Response Time, and Reliability.
@@ -2532,6 +2534,113 @@ I learned of many of these requirements via angry phone calls:
 Flaming me on the Linux-kernel mailing list was apparently not
 sufficient to fully vent their ire at RCU's energy-efficiency bugs!
 
+
+Scheduling-Clock Interrupts and RCU
+
+
+The kernel transitions between in-kernel non-idle execution, userspace
+execution, and the idle loop.
+Depending on 

Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Christopher Lameter
On Wed, 19 Jul 2017, Paul E. McKenney wrote:

> > Do we have any problem if we skip RCU idle enter/exit under a fast idle 
> > scenario?
> > My understanding is, if tick is not stopped, then we don't need inform RCU 
> > in
> > idle path, it can be informed in irq exit.
>
> Indeed, the problem arises when the tick is stopped.

Well, is there a boundary when you would want the notification calls? I
would think that even an idle period of a couple of seconds does not
necessarily require a callback to RCU. Had some brokenness here where RCU
calls did not occur for hours or so. At some point the system ran out of
memory, but that's far off.




Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Paul E. McKenney
On Wed, Jul 19, 2017 at 03:43:07PM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 12, 2017 at 08:56:51AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 12, 2017 at 01:54:51PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 11, 2017 at 11:09:31AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Jul 11, 2017 at 06:34:22PM +0200, Peter Zijlstra wrote:
> > > 
> > > > > But I think we can at the very least do this; it only gets called from
> > > > > kernel/sched/idle.c and both callsites have IRQs explicitly disabled 
> > > > > by
> > > > > that point.
> > > > > 
> > > > > 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 51d4c3acf32d..dccf2dc8155a 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -843,13 +843,8 @@ static void rcu_eqs_enter(bool user)
> > > > >   */
> > > > >  void rcu_idle_enter(void)
> > > > >  {
> > > > > - unsigned long flags;
> > > > > -
> > > > > - local_irq_save(flags);
> > > > 
> > > > With this addition, I am all for it:
> > > > 
> > > > RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_idle_enter() invoked with irqs 
> > > > enabled!!!");
> > > > 
> > > > If you are OK with this addition, may I please have your Signed-off-by?
> > > 
> > > Sure,
> > > 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > 
> > Very good, I have queued the patch below.  I left out the removal of
> > the export as I need to work out why the export was there.  If it turns
> > out not to be needed, I will remove the related ones as well.
> > 
> > Fair enough?
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > commit 95f3e587ce6388028a51f0c852800fca944e7032
> > Author: Peter Zijlstra (Intel) 
> > Date:   Wed Jul 12 07:59:54 2017 -0700
> > 
> > rcu: Make rcu_idle_enter() rely on callers disabling irqs
> > 
> > All callers to rcu_idle_enter() have irqs disabled, so there is no
> > point in rcu_idle_enter disabling them again.  This commit therefore
> > replaces the irq disabling with a RCU_LOCKDEP_WARN().
> > 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 6de6b1c5ee53..c78b076653ce 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -840,11 +840,8 @@ static void rcu_eqs_enter(bool user)
> >   */
> >  void rcu_idle_enter(void)
> >  {
> > -   unsigned long flags;
> > -
> > -   local_irq_save(flags);
> > +   RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_idle_enter() invoked with irqs 
> > enabled!!!");
> 
> Nice! I really need to make that LOCKDEP_ASSERT_IRQS_DISABLED() patchset 
> because there
> are many places where either we have WARN_ON_ONCE(!irqs_disabled()) or we 
> need it but we were too
> shy to do it for performance reason.

If you create it, let me know, I could switch to it.

Thanx, Paul

> > rcu_eqs_enter(false);
> > -   local_irq_restore(flags);
> >  }
> >  EXPORT_SYMBOL_GPL(rcu_idle_enter);
> >  
> > 
> 



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Paul E. McKenney
On Wed, Jul 19, 2017 at 01:44:06PM +0800, Li, Aubrey wrote:
> On 2017/7/18 23:20, Paul E. McKenney wrote:
> 
> >> 2) for rcu idle enter/exit, I measured the details which Paul provided, and
> >> the result matches with what I have measured before, nothing notable found.
> >> But it still makes more sense if we can make rcu idle enter/exit hooked 
> >> with
> >> tick off. (it's possible other workloads behave differently)
> > 
> > Again, assuming that RCU is informed of CPUs in the kernel, regardless
> > of whether or not the tick is on that that point in time.
> > 
> Yeah, I see, no problem for a normal idle.
> 
> But for a short idle, we want to return to the task ASAP. Even though RCU cost
> is not notable, it would still be better for me if we can save some cycles in
> idle entry and idle exit.
> 
> Do we have any problem if we skip RCU idle enter/exit under a fast idle 
> scenario?
> My understanding is, if tick is not stopped, then we don't need inform RCU in
> idle path, it can be informed in irq exit.

Indeed, the problem arises when the tick is stopped.

Thanx, Paul
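The point being debated above can be sketched as follows.  This is a
hypothetical kernel-style outline of Aubrey's proposal, not merged code,
and it assumes tick_nohz_tick_stopped() as the "is the tick off" test:

/* Hypothetical sketch of the proposal under discussion: only tell RCU
 * about idle when the tick has actually been stopped.  With the tick
 * still running, the CPU is guaranteed to be interrupted again soon,
 * so irq entry/exit keeps RCU sufficiently informed.
 */
static void cpuidle_idle_call(void)
{
	bool tick_stopped = tick_nohz_tick_stopped();

	if (tick_stopped)
		rcu_idle_enter();

	/* ... select and enter a C-state ... */

	if (tick_stopped)
		rcu_idle_exit();
}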



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Peter Zijlstra
On Tue, Jul 18, 2017 at 11:14:57AM +0800, Li, Aubrey wrote:
> On 2017/7/18 3:23, Peter Zijlstra wrote:
> > On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote:
> >>> And as said; Daniel has been working on a better predictor -- now he's
> >>> probably not used it on the network workload you're looking at, so that
> >>> might be something to consider.
> >>
> >> Deriving a better idle predictor is a bit orthogonal to fast idle.
> > 
> > No. If you want a different C state selected we need to fix the current
> > C state selector. We're not going to tinker.
> > 
> > And the predictor is probably the most fundamental part of the whole C
> > state selection logic.
> > 
> > Now I think the problem is that the current predictor goes for an
> > average idle duration. This means that we, on average, get it wrong 50%
> > of the time. For performance that's bad.
> > 
> > If you want to improve the worst case, we need to consider a cumulative
> > distribution function, and staying with the Gaussian assumption already
> > present, that would mean using:
> > 
> > CDF(x) = (1/2) * [ 1 + erf( (x - mu) / (sigma * sqrt(2)) ) ]
> > 
> > Where, per the normal convention mu is the average and sigma^2 the
> > variance. See also:
> > 
> >   https://en.wikipedia.org/wiki/Normal_distribution
> > 
> > We then solve CDF(x) = n% to find the x for which we get it wrong n% of
> > the time (IIRC something like: 'mu - 2sigma' ends up being 5% or so).
> > 
> > This conceptually gets us better exit latency for the cases where we got
> > it wrong before, and practically pushes down the estimate which gets us
> > C1 longer.
> > 
> > Of course, this all assumes a Gaussian distribution to begin with, if we
> > get bimodal (or worse) distributions we can still get it wrong. To fix
> > that, we'd need to do something better than what we currently have.
> > 
> Maybe you are talking about applying some machine learning algorithm online
> to fit a multivariate normal distribution, :)

See here for an implementation of what I spoke about above:

  
https://lkml.kernel.org/r/20170719133940.uytsixvfgpmo3...@hirez.programming.kicks-ass.net

Very much statistics 101.


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Frederic Weisbecker
On Wed, Jul 12, 2017 at 08:56:51AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 12, 2017 at 01:54:51PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 11, 2017 at 11:09:31AM -0700, Paul E. McKenney wrote:
> > > On Tue, Jul 11, 2017 at 06:34:22PM +0200, Peter Zijlstra wrote:
> > 
> > > > But I think we can at the very least do this; it only gets called from
> > > > kernel/sched/idle.c and both callsites have IRQs explicitly disabled by
> > > > that point.
> > > > 
> > > > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 51d4c3acf32d..dccf2dc8155a 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -843,13 +843,8 @@ static void rcu_eqs_enter(bool user)
> > > >   */
> > > >  void rcu_idle_enter(void)
> > > >  {
> > > > -   unsigned long flags;
> > > > -
> > > > -   local_irq_save(flags);
> > > 
> > > With this addition, I am all for it:
> > > 
> > > RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_idle_enter() invoked with irqs 
> > > enabled!!!");
> > > 
> > > If you are OK with this addition, may I please have your Signed-off-by?
> > 
> > Sure,
> > 
> > Signed-off-by: Peter Zijlstra (Intel) 
> 
> Very good, I have queued the patch below.  I left out the removal of
> the export as I need to work out why the export was there.  If it turns
> out not to be needed, I will remove the related ones as well.
> 
> Fair enough?
> 
>   Thanx, Paul
> 
> 
> 
> commit 95f3e587ce6388028a51f0c852800fca944e7032
> Author: Peter Zijlstra (Intel) 
> Date:   Wed Jul 12 07:59:54 2017 -0700
> 
> rcu: Make rcu_idle_enter() rely on callers disabling irqs
> 
> All callers to rcu_idle_enter() have irqs disabled, so there is no
> point in rcu_idle_enter disabling them again.  This commit therefore
> replaces the irq disabling with a RCU_LOCKDEP_WARN().
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 6de6b1c5ee53..c78b076653ce 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -840,11 +840,8 @@ static void rcu_eqs_enter(bool user)
>   */
>  void rcu_idle_enter(void)
>  {
> - unsigned long flags;
> -
> - local_irq_save(flags);
> + RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_idle_enter() invoked with irqs 
> enabled!!!");

Nice! I really need to make that LOCKDEP_ASSERT_IRQS_DISABLED() patchset 
because there
are many places where either we have WARN_ON_ONCE(!irqs_disabled()) or we need 
it but we were too
shy to do it for performance reason.

>   rcu_eqs_enter(false);
> - local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(rcu_idle_enter);
>  
> 



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Thomas Gleixner
On Wed, 19 Jul 2017, Li, Aubrey wrote:
> On 2017/7/18 15:19, Thomas Gleixner wrote:
> > You can very well avoid it by taking the irq timings or whatever other
> > information into account for the NOHZ decision.
> > 
> If I read the source correctly, irq timing statistics computation happens at
> idle time. Sadly, this is what Andi Kleen worried about. People keep putting
> more and more slow stuff in idle path, not realizing it could be a critical 
> path.

Can you please stop this right here? We all realize that there is an issue,
we are not as stupid as you might think.

But we disagree that the proper solution is to add an ad hoc hack which
shortcuts stuff and creates a maintenance problem in the long run.

Just dismissing a proposal based on 'oh this adds more code to the idle
path' and then yelling 'you all do not get it' is not going to solve this
in any way.

You simply refuse to look at the big picture and you are just not willing
to analyze the involved bits and pieces and find a solution which allows to
serve both the NOHZ power saving and the interrupt driven workloads best.

All you are doing is providing data about the status quo, and then
inferring from that that you need an extra magic code path. That
information is just an inventory of the current behaviour, and the current
behaviour is caused by the current implementation. There is nothing set in
stone with that implementation, and it can be changed.

But you are just in refusal mode and instead of sitting down and doing the
hard work, you accuse other people that they are not realizing the problem
and insist on your sloppy hackery. That's simply not the way it works.

Thanks,

tglx


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-19 Thread Li, Aubrey
On 2017/7/18 15:19, Thomas Gleixner wrote:
> On Mon, 17 Jul 2017, Andi Kleen wrote:
>> On Tue, Jul 18, 2017 at 08:43:53AM +0200, Thomas Gleixner wrote:
>>> On Mon, 17 Jul 2017, Andi Kleen wrote:
>>>
> We need a tradeoff here IMHO. I'll check Daniel's work to understand 
> how/if
> it's better than menu governor.

 I still would like to see how the fast path without the C1 heuristic works.

 Fast pathing is a different concept from a better predictor. IMHO we need
 both, but the first is likely lower hanging fruit.
>>>
>>> Hacking something on the side is always the lower hanging fruit as it
>>> avoids solving the hard problems. As Peter said already, that's not going
>>> to happen unless there is a real technical reason why the general path
>>> cannot be fixed. So far there is no proof for that.
>>
>> You didn't look at Aubrey's data?
>>
>> There are some unavoidable slow operations in the current path -- e.g.
>> reprograming the timer for NOHZ. But we don't need that for really 
>> short idle periods, because as you pointed out they never get woken
>> up by the tick.
>>
>> Similar for other things like RCU.
>>
>> I don't see how you can avoid that other than without a fast path mechanism.
> 
> You can very well avoid it by taking the irq timings or whatever other
> information into account for the NOHZ decision.
> 
If I read the source correctly, irq timing statistics computation happens at
idle time. Sadly, this is what Andi Kleen worried about. People keep putting
more and more slow stuff in the idle path, not realizing it could be a
critical path.

/*
 * The computation happens at idle time. When the CPU is not idle, the
 * interrupts' timestamps are stored in the circular buffer, when the
 * CPU goes idle and this routine is called, all the buffer's values
 * are injected in the statistical model continuing to extend the
 * statistics from the previous busy-idle cycle.
 */

Thanks,
-Aubrey



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Li, Aubrey
On 2017/7/18 23:20, Paul E. McKenney wrote:

>> 2) for rcu idle enter/exit, I measured the details which Paul provided, and
>> the result matches with what I have measured before, nothing notable found.
>> But it still makes more sense if we can make rcu idle enter/exit hooked with
>> tick off. (it's possible other workloads behave differently)
> 
> Again, assuming that RCU is informed of CPUs in the kernel, regardless
> of whether or not the tick is on that that point in time.
> 
Yeah, I see, no problem for a normal idle.

But for a short idle, we want to return to the task ASAP. Even though RCU cost
is not notable, it would still be better for me if we can save some cycles in
idle entry and idle exit.

Do we have any problem if we skip RCU idle enter/exit under a fast idle
scenario?
My understanding is, if the tick is not stopped, then we don't need to inform
RCU in the idle path; it can be informed at irq exit.

Thanks,
-Aubrey


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Peter Zijlstra
On Mon, Jul 17, 2017 at 12:48:38PM -0700, Arjan van de Ven wrote:
> On 7/17/2017 12:23 PM, Peter Zijlstra wrote:
> > Of course, this all assumes a Gaussian distribution to begin with, if we
> > get bimodal (or worse) distributions we can still get it wrong. To fix
> > that, we'd need to do something better than what we currently have.
> > 
> 
> fwiw some time ago I made a chart for predicted vs actual so you can sort
> of judge the distribution of things visually
> 
> http://git.fenrus.org/tmp/linux2.png

That shows we get it wrong a lot of the time (about 50%, as per the
average) and that moving the line has benefit. For performance you
really don't want to pick the deeper idle state, so you want to bias
your pick towards a shallower state.

Using the CDF approach you can quantify by how much you want it moved.


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Peter Zijlstra
On Tue, Jul 18, 2017 at 09:37:57AM -0700, Arjan van de Ven wrote:

> that's just a matter of fixing the C1 and later thresholds to line up right.
> shrug that's the most trivial thing to do, it's a number in a table.

Well, they represent a physical measure, namely the break-even-time. If
you go muck with them you don't have anything left.  This is tinkering
of the worst possible kind.

Fix the estimator if you want behavioural changes.

> some distros do those tunings anyway when they don't like the upstream tunings

*shudder*


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Peter Zijlstra
On Tue, Jul 18, 2017 at 02:56:47PM +0800, Li, Aubrey wrote:

> 3) for tick nohz idle, we want to skip it if the coming idle is short. If we
> can skip the tick nohz idle, we then skip all the items depending on it. But
> there are two hard points:
> 
> 3.1) how to compute the period of the coming idle. My current proposal is to
> use two factors in the current idle menu governor. There are two possible
> options from Peter and Thomas: one is to use the scheduler idle estimate,
> which is task activity based; the other is to use the statistics generated
> from the irq timings work.
> 
> 3.2) how to determine if the idle is short or long. My current proposal is to
> use a tunable value via /sys, while Peter prefers an auto-adjust mechanism. I
> didn't get the details of an auto-adjust mechanism yet.

So the problem is that the cost of NOHZ enter/exit are for a large part
determined by (micro) architecture costs of programming timer hardware.

A single static threshold will never be the right value across all the
various machines we run Linux on.

So my suggestion was simply timing the cost of doing those functions
every time we do them and keeping a running average of their cost. Then
use that measured cost as a basis for selecting when to skip them. For
example if the estimated idle period (by whatever estimate we end up
using) is less than 4 times the cost of doing NOHZ, don't do NOHZ.


Note how both tick_nohz_idle_{enter,exit}() already take a timestamp at
entry; all you need to do is take another one at exit and subtract.


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Arjan van de Ven

On 7/18/2017 9:36 AM, Peter Zijlstra wrote:

On Tue, Jul 18, 2017 at 08:29:40AM -0700, Arjan van de Ven wrote:


the most obvious way to do this (for me, maybe I'm naive) is to add another
C state, lets call it "C1-lite" with its own thresholds and power levels etc,
and just let that be picked naturally based on the heuristics.
(if we want to improve the heuristics, that's fine and always welcome but that
is completely orthogonal in my mind)


C1-lite would then have a threshold < C1, whereas I understood the
desire to be for the fast-idle crud to have a larger threshold than C1
currently has.

That is, from what I understood, they want C1 selected *longer*.


that's just a matter of fixing the C1 and later thresholds to line up right.

shrug that's the most trivial thing to do, it's a number in a table.

some distros do those tunings anyway when they don't like the upstream tunings



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Peter Zijlstra
On Tue, Jul 18, 2017 at 08:29:40AM -0700, Arjan van de Ven wrote:
> 
> the most obvious way to do this (for me, maybe I'm naive) is to add another
> C state, lets call it "C1-lite" with its own thresholds and power levels etc,
> and just let that be picked naturally based on the heuristics.
> (if we want to improve the heuristics, that's fine and always welcome but that
> is completely orthogonal in my mind)

C1-lite would then have a threshold < C1, whereas I understood the
desire to be for the fast-idle crud to have a larger threshold than C1
currently has.

That is, from what I understood, they want C1 selected *longer*.


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Arjan van de Ven

On 7/18/2017 8:20 AM, Paul E. McKenney wrote:

3.2) how to determine if the idle is short or long. My current proposal is to
use a tunable value via /sys, while Peter prefers an auto-adjust mechanism. I
didn't get the details of an auto-adjust mechanism yet



the most obvious way to do this (for me, maybe I'm naive) is to add another
C state, lets call it "C1-lite" with its own thresholds and power levels etc,
and just let that be picked naturally based on the heuristics.
(if we want to improve the heuristics, that's fine and always welcome but that
is completely orthogonal in my mind)

this C1-lite would then skip some of the idle steps like the nohz logic. How we
plumb that ... might end up being a flag or whatever, we'll figure that out
easily.

as long as "real C1" has a break-even time that is appropriate compared to
C1-lite, we'll only pick C1-lite for very very short idles, as desired...
but we don't end up creating a parallel infra for picking states; that part
just does not make sense to me tbh. I have yet to see any reason why C1-lite
couldn't be just another C-state for everything except the actual place where
we do the "go idle" last bit of logic.

(Also note that for extreme short idles, today we just spinloop (C0), so by
this argument we should also do a C0-lite.. or make this C0 always the lite
variant)






Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Paul E. McKenney
On Tue, Jul 18, 2017 at 02:56:47PM +0800, Li, Aubrey wrote:
> On 2017/7/18 14:43, Thomas Gleixner wrote:
> > On Mon, 17 Jul 2017, Andi Kleen wrote:
> > 
> >>> We need a tradeoff here IMHO. I'll check Daniel's work to understand
> >>> how/if it's better than menu governor.
> >>
> >> I still would like to see how the fast path without the C1 heuristic works.
> >>
> >> Fast pathing is a different concept from a better predictor. IMHO we need
> >> both, but the first is likely lower hanging fruit.
> > 
> > Hacking something on the side is always the lower hanging fruit as it
> > avoids solving the hard problems. As Peter said already, that's not going
> > to happen unless there is a real technical reason why the general path
> > cannot be fixed. So far there is no proof for that.
> > 
> Let me try to make a summary, please correct me if I was wrong.
> 
> 1) for quiet_vmstat, we are agreed to move to another place where tick is
> really stopped.
> 
> 2) for rcu idle enter/exit, I measured the details which Paul provided, and
> the result matches with what I have measured before, nothing notable found.
> But it still makes more sense if we can make rcu idle enter/exit hooked with
> tick off. (it's possible other workloads behave differently)

Again, assuming that RCU is informed of CPUs in the kernel, regardless
of whether or not the tick is on at that point in time.

Thanx, Paul

> 3) for tick nohz idle, we want to skip it if the coming idle is short. If we
> can skip the tick nohz idle, we then skip all the items depending on it. But
> there are two hard points:
> 
> 3.1) how to compute the period of the coming idle. My current proposal is to
> use two factors in the current idle menu governor. There are two possible
> options from Peter and Thomas: one is to use the scheduler idle estimate,
> which is task activity based; the other is to use the statistics generated
> from the irq timings work.
> 
> 3.2) how to determine if the idle is short or long. My current proposal is to
> use a tunable value via /sys, while Peter prefers an auto-adjust mechanism. I
> didn't get the details of an auto-adjust mechanism yet.
> 
> 4) for idle loop, my proposal introduces a simple one to use default idle
> routine directly, while Peter and Thomas suggest we fix c-state selection
> in the existing idle path.
> 
> Thanks,
> -Aubrey
> 



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Peter Zijlstra
On Tue, Jul 18, 2017 at 11:14:57AM +0800, Li, Aubrey wrote:
> On 2017/7/18 3:23, Peter Zijlstra wrote:
> > On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote:
> >>> And as said; Daniel has been working on a better predictor -- now he's
> >>> probably not used it on the network workload you're looking at, so that
> >>> might be something to consider.
> >>
> >> Deriving a better idle predictor is a bit orthogonal to fast idle.
> > 
> > No. If you want a different C state selected we need to fix the current
> > C state selector. We're not going to tinker.
> > 
> > And the predictor is probably the most fundamental part of the whole C
> > state selection logic.
> > 
> > Now I think the problem is that the current predictor goes for an
> > average idle duration. This means that we, on average, get it wrong 50%
> > of the time. For performance that's bad.
> > 
> > If you want to improve the worst case, we need to consider a cumulative
> > distribution function, and staying with the Gaussian assumption already
> > present, that would mean using:
> > 
> >                  1            x - mu
> >         CDF(x) = - [ 1 + erf( ------------- ) ]
> >                  2            sigma sqrt(2)
> > 
> > Where, per the normal convention mu is the average and sigma^2 the
> > variance. See also:
> > 
> >   https://en.wikipedia.org/wiki/Normal_distribution
> > 
> > We then solve CDF(x) = n% to find the x for which we get it wrong n% of
> > the time (IIRC something like: 'mu - 2sigma' ends up being 5% or so).
> > 
> > This conceptually gets us better exit latency for the cases where we got
> > it wrong before, and practically pushes down the estimate which gets us
> > C1 longer.
> > 
> > Of course, this all assumes a Gaussian distribution to begin with, if we
> > get bimodal (or worse) distributions we can still get it wrong. To fix
> > that, we'd need to do something better than what we currently have.
> > 
> Maybe you are talking about applying some machine learning algorithm online
> to fit a multivariate normal distribution, :)

Nah, nothing that fancy..

Something that _could_ work and deals with arbitrary distributions is
buckets divided on the actual C-state selection boundaries and a
(cyclic) array of the N most recent idle times.

Something like so:

struct est_data {
	u8 array[64];
	u8 *dist;
	u8 idx;
};

DEFINE_PER_CPU(struct est_data, est_data);

void est_init(void)
{
	int size = drv->state_count;
	int cpu;

	for_each_possible_cpu(cpu) {
		per_cpu(est_data, cpu).dist = kzalloc(size, GFP_KERNEL);
		// handle errors
	}
}

u8 est_duration_2_state(u64 duration)
{
	int i;

	for (i = 0; i < drv->state_count; i++) {
		if (duration / 1024 < drv->state[i].target_residency)
			return i;
	}

	return i - 1;
}

void est_contemplate(u64 duration)
{
	struct est_data *ed = this_cpu_ptr(&est_data);
	int state = est_duration_2_state(duration);
	int idx = ed->idx++ % ARRAY_SIZE(ed->array);

	ed->dist[ed->array[idx]]--;
	ed->array[idx] = state;
	ed->dist[ed->array[idx]]++;
}

int est_state(int pct)
{
	struct est_data *ed = this_cpu_ptr(&est_data);
	int limit = pct * ARRAY_SIZE(ed->array) / 100; /* XXX move div out of line */
	int i, cnt = 0, last = 0;

	/* CDF */
	for (i = 0; i < drv->state_count; i++) {
		cnt += ed->dist[i];
		if (cnt > limit)
			break;
		last = i;
	}

	return last;
}


> Well, back to the problem, when the scheduler picks up idle thread, it does
> not look at the history, nor make the prediction. So it's possible it has
> to switch back a task ASAP when it's going into idle(very common under some
> workloads).
> 
> That is, (idle_entry + idle_exit) > idle. If the system has multiple
> hardware idle states, then:
> 
> (idle_entry + idle_exit + HW_entry + HW_exit) > HW_sleep
> 
> So we eventually want the idle path lighter than what we currently have.
> A complex predictor may have high accuracy, but the cost could be high as 
> well.

I never suggested anything complex. The current menu thing uses an
average, all I said is if instead of the average you use something less,
say 'avg - 2*stdev' (it already computes the stdev), you get something
which, assuming Gaussian, is wrong on exit latency less than ~5% of the time.

The above, also simple thing, uses a generic distribution function,
which works because it uses the exact boundaries we're limited to
anyway.

Of course, the above needs to be augmented with IRQ bits etc..


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Thomas Gleixner
On Mon, 17 Jul 2017, Andi Kleen wrote:
> On Tue, Jul 18, 2017 at 08:43:53AM +0200, Thomas Gleixner wrote:
> > On Mon, 17 Jul 2017, Andi Kleen wrote:
> > 
> > > > We need a tradeoff here IMHO. I'll check Daniel's work to understand
> > > > how/if it's better than menu governor.
> > > 
> > > I still would like to see how the fast path without the C1 heuristic works.
> > > 
> > > Fast pathing is a different concept from a better predictor. IMHO we need
> > > both, but the first is likely lower hanging fruit.
> > 
> > Hacking something on the side is always the lower hanging fruit as it
> > avoids solving the hard problems. As Peter said already, that's not going
> > to happen unless there is a real technical reason why the general path
> > cannot be fixed. So far there is no proof for that.
> 
> You didn't look at Aubrey's data?

I did, but that data is no proof that it is unfixable. It's just data
describing the current situation, not more not less.

> There are some unavoidable slow operations in the current path -- e.g.

That's the whole point: current path, IOW current implementation.

This implementation is not set in stone and we rather fix it than just
creating a side channel and leave everything else as is.

Thanks,

tglx






Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Thomas Gleixner
On Mon, 17 Jul 2017, Andi Kleen wrote:
> On Tue, Jul 18, 2017 at 08:43:53AM +0200, Thomas Gleixner wrote:
> > On Mon, 17 Jul 2017, Andi Kleen wrote:
> > 
> > > > We need a tradeoff here IMHO. I'll check Daniel's work to understand
> > > > how/if it's better than menu governor.
> > > 
> > > I still would like to see how the fast path without the C1 heuristic works.
> > > 
> > > Fast pathing is a different concept from a better predictor. IMHO we need
> > > both, but the first is likely lower hanging fruit.
> > 
> > Hacking something on the side is always the lower hanging fruit as it
> > avoids solving the hard problems. As Peter said already, that's not going
> > to happen unless there is a real technical reason why the general path
> > cannot be fixed. So far there is no proof for that.
> 
> You didn't look at Aubrey's data?
> 
> There are some unavoidable slow operations in the current path -- e.g.
> reprograming the timer for NOHZ. But we don't need that for really 
> short idle periods, because as you pointed out they never get woken
> up by the tick.
> 
> Similar for other things like RCU.
> 
> I don't see how you can avoid that other than without a fast path mechanism.

You can very well avoid it by taking the irq timings or whatever other
information into account for the NOHZ decision.

Thanks,

tglx


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Andi Kleen
On Tue, Jul 18, 2017 at 08:43:53AM +0200, Thomas Gleixner wrote:
> On Mon, 17 Jul 2017, Andi Kleen wrote:
> 
> > > We need a tradeoff here IMHO. I'll check Daniel's work to understand
> > > how/if it's better than menu governor.
> > 
> > I still would like to see how the fast path without the C1 heuristic works.
> > 
> > Fast pathing is a different concept from a better predictor. IMHO we need
> > both, but the first is likely lower hanging fruit.
> 
> Hacking something on the side is always the lower hanging fruit as it
> avoids solving the hard problems. As Peter said already, that's not going
> to happen unless there is a real technical reason why the general path
> cannot be fixed. So far there is no proof for that.

You didn't look at Aubrey's data?

There are some unavoidable slow operations in the current path -- e.g.
reprogramming the timer for NOHZ. But we don't need that for really 
short idle periods, because as you pointed out they never get woken
up by the tick.

Similar for other things like RCU.

I don't see how you can avoid that other than with a fast path mechanism.

Clearly these operations are eventually needed, just not all the time
for short sleeps.

Now in theory you could have lots of little fast paths in all the individual
operations that check this individually, but I don't see how that is better than
a single simple fast path.

-Andi


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Li, Aubrey
On 2017/7/18 14:43, Thomas Gleixner wrote:
> On Mon, 17 Jul 2017, Andi Kleen wrote:
> 
>>> We need a tradeoff here IMHO. I'll check Daniel's work to understand how/if
>>> it's better than menu governor.
>>
>> I still would like to see how the fast path without the C1 heuristic works.
>>
>> Fast pathing is a different concept from a better predictor. IMHO we need
>> both, but the first is likely lower hanging fruit.
> 
> Hacking something on the side is always the lower hanging fruit as it
> avoids solving the hard problems. As Peter said already, that's not going
> to happen unless there is a real technical reason why the general path
> cannot be fixed. So far there is no proof for that.
> 
Let me try to make a summary; please correct me if I'm wrong.

1) for quiet_vmstat, we agreed to move it to a place where the tick is
really stopped.

2) for rcu idle enter/exit, I measured the details Paul provided, and the
results match what I measured before; nothing notable was found. But it still
makes more sense to hook rcu idle enter/exit to the tick being stopped.
(It's possible other workloads behave differently.)

3) for tick nohz idle, we want to skip it if the coming idle period is short.
If we can skip tick nohz idle, we then skip all the items that depend on it.
But there are two hard points:

3.1) how to compute the length of the coming idle period. My current proposal
is to use two factors from the current menu governor. There are two possible
alternatives from Peter and Thomas: one is to use the scheduler's idle
estimate, which is task-activity based; the other is to use the statistics
generated by the irq timings work.

3.2) how to determine whether the idle period is short or long. My current
proposal is a tunable value via /sys, while Peter prefers an auto-adjust
mechanism. I haven't worked out the details of an auto-adjust mechanism yet.

4) for the idle loop, my proposal introduces a simple one that uses the
default idle routine directly, while Peter and Thomas suggest fixing C-state
selection in the existing idle path.
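On 3.2, a toy sketch of the contrast between a fixed /sys tunable and a
self-adjusting cutoff (pure illustration: the class, constants, and thresholds
below are all invented, not kernel code):

```python
# Invented illustration of 3.2: fixed tunable vs. auto-adjusted cutoff.
FIXED_THRESHOLD_US = 200.0          # what a /sys tunable might hold

class AutoThreshold:
    """Adapt the short/long idle cutoff from recently observed idles."""
    def __init__(self, init_us=200.0, alpha=0.125):
        self.avg = init_us          # EWMA of observed idle lengths
        self.alpha = alpha

    def update(self, observed_idle_us):
        # Exponentially weighted moving average of real idle durations.
        self.avg += self.alpha * (observed_idle_us - self.avg)

    def is_short(self, predicted_idle_us):
        # "Short" here means clearly below the recent average idle period.
        return predicted_idle_us < 0.5 * self.avg

t = AutoThreshold()
for idle in (50.0, 60.0, 40.0, 55.0):   # a burst of short idles
    t.update(idle)
print(t.is_short(30.0))    # True: 30us is short relative to recent history
print(t.is_short(120.0))   # False
```

The point of the sketch is only that an auto-adjusted cutoff follows the
workload, while the fixed tunable has to be guessed per machine.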

Thanks,
-Aubrey


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-18 Thread Thomas Gleixner
On Mon, 17 Jul 2017, Andi Kleen wrote:

> > We need a tradeoff here IMHO. I'll check Daniel's work to understand how/if
> > it's better than menu governor.
> 
> I still would like to see how the fast path without the C1 heuristic works.
> 
> Fast pathing is a different concept from a better predictor. IMHO we need
> both, but the first is likely lower hanging fruit.

Hacking something on the side is always the lower hanging fruit as it
avoids solving the hard problems. As Peter said already, that's not going
to happen unless there is a real technical reason why the general path
cannot be fixed. So far there is no proof for that.

Thanks,

tglx


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Andi Kleen
> We need a tradeoff here IMHO. I'll check Daniel's work to understand how/if
> it's better than menu governor.

I still would like to see how the fast path without the C1 heuristic works.

Fast pathing is a different concept from a better predictor. IMHO we need
both, but the first is likely lower hanging fruit.

-Andi


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Li, Aubrey
On 2017/7/18 3:48, Arjan van de Ven wrote:
> On 7/17/2017 12:23 PM, Peter Zijlstra wrote:
>> Of course, this all assumes a Gaussian distribution to begin with, if we
>> get bimodal (or worse) distributions we can still get it wrong. To fix
>> that, we'd need to do something better than what we currently have.
>>
> 
> fwiw some time ago I made a chart for predicted vs actual so you can sort
> of judge the distribution of things visually
> 
> http://git.fenrus.org/tmp/linux2.png
> 
> 
This does not look like a Gaussian, does it? I mean abs(expected - actual).

Thanks,
-Aubrey


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Li, Aubrey
On 2017/7/18 3:23, Peter Zijlstra wrote:
> On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote:
>>> And as said; Daniel has been working on a better predictor -- now he's
>>> probably not used it on the network workload you're looking at, so that
>>> might be something to consider.
>>
>> Deriving a better idle predictor is a bit orthogonal to fast idle.
> 
> No. If you want a different C state selected we need to fix the current
> C state selector. We're not going to tinker.
> 
> And the predictor is probably the most fundamental part of the whole C
> state selection logic.
> 
> Now I think the problem is that the current predictor goes for an
> average idle duration. This means that we, on average, get it wrong 50%
> of the time. For performance that's bad.
> 
> If you want to improve the worst case, we need to consider a cumulative
> distribution function, and staying with the Gaussian assumption already
> present, that would mean using:
> 
> CDF(x) = (1/2) * [ 1 + erf( (x - mu) / (sigma * sqrt(2)) ) ]
> 
> Where, per the normal convention mu is the average and sigma^2 the
> variance. See also:
> 
>   https://en.wikipedia.org/wiki/Normal_distribution
> 
> We then solve CDF(x) = n% to find the x for which we get it wrong n% of
> the time (IIRC something like: 'mu - 2sigma' ends up being 5% or so).
> 
> This conceptually gets us better exit latency for the cases where we got
> it wrong before, and practically pushes down the estimate which gets us
> C1 longer.
> 
> Of course, this all assumes a Gaussian distribution to begin with, if we
> get bimodal (or worse) distributions we can still get it wrong. To fix
> that, we'd need to do something better than what we currently have.
> 
Maybe you are talking about applying some machine learning algorithm online
to fit a multivariate normal distribution, :)

Well, back to the problem: when the scheduler picks the idle thread, it does
not look at the history, nor make a prediction. So it may have to switch back
to a task ASAP right after going idle (very common under some workloads).

That is, the overhead can exceed the idle period itself:
(idle_entry + idle_exit) > idle. If the system has multiple
hardware idle states, then:

(idle_entry + idle_exit + HW_entry + HW_exit) > HW_sleep

So we eventually want the idle path lighter than what we currently have.
A complex predictor may have high accuracy, but the cost could be high as well.

We need a tradeoff here IMHO. I'll check Daniel's work to understand how/if
it's better than menu governor.
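As a toy model of that tradeoff (every latency number below is invented for
illustration, and the state names are just placeholders): pick the deepest
state whose total entry/exit overhead still fits in the predicted idle
period, otherwise stay on the fast path.

```python
# Toy model of the inequality above: entering a HW idle state only pays
# off if the predicted sleep exceeds all software + hardware overheads.
STATES = [               # (name, HW entry+exit overhead in us) - invented
    ("C1", 2.0),
    ("C3", 20.0),
    ("C6", 100.0),
]

def select_state(predicted_idle_us, sw_overhead_us=3.0):
    best = None
    for name, hw_overhead in STATES:
        if predicted_idle_us > sw_overhead_us + hw_overhead:
            best = name          # deeper states come later in the list
    return best                  # None => not even C1 pays off; poll instead

print(select_state(4.0))    # None -> fast path / polling
print(select_state(30.0))   # C3
print(select_state(500.0))  # C6
```

A heavier predictor just raises `sw_overhead_us`, which is exactly the cost
being discussed here.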

Thanks,
-Aubrey






Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Thomas Gleixner
On Mon, 17 Jul 2017, Arjan van de Ven wrote:
> On 7/17/2017 12:46 PM, Thomas Gleixner wrote:
> > That's where Daniel Lezcanos work of predicting interrupts comes in and
> > that's the right solution to the problem. The core infrastructure has been
> > merged, just the idle/cpufreq users are not there yet. All you need to do
> > is to select CONFIG_IRQ_TIMINGS and use the statistics generated there.
> > 
> yes ;-)

:)

> also note that the predictor does not need to be perfect; on most systems C
> states are an order of magnitude apart in terms of
> power/performance/latency, so if you get the general order of magnitude
> right the predictor is doing its job.

So it would be interesting just to enable the irq timings stuff and compare
the outcome as a first step. That should be reasonably simple to implement
and would give us also some information of how that code behaves on larger
systems.
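Whatever form the real statistics take, the basic idea can be sketched as a
per-interrupt arrival-time predictor (a toy EWMA illustration only; the actual
CONFIG_IRQ_TIMINGS code keeps more elaborate per-IRQ statistics):

```python
# Toy interrupt-arrival predictor: keep an EWMA of the inter-arrival
# time per IRQ line, and guess "next event = last timestamp + EWMA".
class IrqPredictor:
    def __init__(self, alpha=0.25):
        self.alpha = alpha
        self.last_ts = None
        self.avg_delta = None

    def record(self, ts_us):
        # Fold each observed inter-arrival delta into the running average.
        if self.last_ts is not None:
            delta = ts_us - self.last_ts
            if self.avg_delta is None:
                self.avg_delta = delta
            else:
                self.avg_delta += self.alpha * (delta - self.avg_delta)
        self.last_ts = ts_us

    def next_event(self):
        if self.avg_delta is None:
            return None           # not enough history yet
        return self.last_ts + self.avg_delta

p = IrqPredictor()
for ts in (0.0, 100.0, 200.0, 300.0):   # a steady 100us interrupt source
    p.record(ts)
print(p.next_event())   # 400.0
```

For an I/O-heavy CPU this estimate can be far shorter than the next tick,
which is the whole argument for feeding it into the NOHZ decision.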

Thanks,

tglx


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Arjan van de Ven

On 7/17/2017 12:53 PM, Thomas Gleixner wrote:

On Mon, 17 Jul 2017, Arjan van de Ven wrote:

On 7/17/2017 12:23 PM, Peter Zijlstra wrote:

Of course, this all assumes a Gaussian distribution to begin with, if we
get bimodal (or worse) distributions we can still get it wrong. To fix
that, we'd need to do something better than what we currently have.



fwiw some time ago I made a chart for predicted vs actual so you can sort
of judge the distribution of things visually


Predicted by what?


this chart was with the current linux predictor

http://git.fenrus.org/tmp/timer.png is what you get if you JUST use the next timer ;-)
(which is what Linux was doing way back)



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Thomas Gleixner
On Mon, 17 Jul 2017, Arjan van de Ven wrote:
> On 7/17/2017 12:23 PM, Peter Zijlstra wrote:
> > Of course, this all assumes a Gaussian distribution to begin with, if we
> > get bimodal (or worse) distributions we can still get it wrong. To fix
> > that, we'd need to do something better than what we currently have.
> > 
> 
> fwiw some time ago I made a chart for predicted vs actual so you can sort
> of judge the distribution of things visually

Predicted by what?

Thanks,

tglx


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Arjan van de Ven

On 7/17/2017 12:46 PM, Thomas Gleixner wrote:

On Mon, 17 Jul 2017, Arjan van de Ven wrote:

On 7/17/2017 12:23 PM, Peter Zijlstra wrote:

Now I think the problem is that the current predictor goes for an
average idle duration. This means that we, on average, get it wrong 50%
of the time. For performance that's bad.


that's not really what it does; it looks at next tick
and then discounts that based on history;
(with different discounts for different order of magnitude)


next tick is the worst thing to look at for interrupt heavy workloads as


well it was better than what was there before (without discount and without
detecting repeated patterns)


the next tick (as computed by the nohz code) can be far away, while the I/O
interrupts come in at a high frequency.

That's where Daniel Lezcanos work of predicting interrupts comes in and
that's the right solution to the problem. The core infrastructure has been
merged, just the idle/cpufreq users are not there yet. All you need to do
is to select CONFIG_IRQ_TIMINGS and use the statistics generated there.



yes ;-)

also note that the predictor does not need to be perfect; on most systems
C states are an order of magnitude apart in terms of
power/performance/latency, so if you get the general order of magnitude
right the predictor is doing its job.

(this is not universally true, but the physics of power gating etc. tend to
drive toward this conclusion; the cost of implementing an extra state very
close to another one means the HW folks will usually drop the less
power-saving state of the two to save cost and testing effort)



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Arjan van de Ven

On 7/17/2017 12:23 PM, Peter Zijlstra wrote:

Of course, this all assumes a Gaussian distribution to begin with, if we
get bimodal (or worse) distributions we can still get it wrong. To fix
that, we'd need to do something better than what we currently have.



fwiw some time ago I made a chart for predicted vs actual so you can sort
of judge the distribution of things visually

http://git.fenrus.org/tmp/linux2.png



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Thomas Gleixner
On Mon, 17 Jul 2017, Arjan van de Ven wrote:
> On 7/17/2017 12:23 PM, Peter Zijlstra wrote:
> > Now I think the problem is that the current predictor goes for an
> > average idle duration. This means that we, on average, get it wrong 50%
> > of the time. For performance that's bad.
> 
> that's not really what it does; it looks at next tick
> and then discounts that based on history;
> (with different discounts for different order of magnitude)

next tick is the worst thing to look at for interrupt heavy workloads as
the next tick (as computed by the nohz code) can be far away, while the I/O
interrupts come in at a high frequency.

That's where Daniel Lezcanos work of predicting interrupts comes in and
that's the right solution to the problem. The core infrastructure has been
merged, just the idle/cpufreq users are not there yet. All you need to do
is to select CONFIG_IRQ_TIMINGS and use the statistics generated there.

Thanks,

tglx


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Arjan van de Ven

On 7/17/2017 12:23 PM, Peter Zijlstra wrote:

Now I think the problem is that the current predictor goes for an
average idle duration. This means that we, on average, get it wrong 50%
of the time. For performance that's bad.


that's not really what it does; it looks at next tick
and then discounts that based on history;
(with different discounts for different orders of magnitude)
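A rough sketch of that "next tick discounted by history" idea (a heavy
simplification for illustration; the bucket granularity and learning rate
here are invented, and this is not the actual menu-governor code):

```python
import math

# Keep a correction factor per order of magnitude of the next-timer
# distance, and scale the next-timer estimate by it.
NBUCKETS = 6                    # buckets spanning roughly 1us..1s
correction = [1.0] * NBUCKETS   # learned ratio actual/predicted per bucket

def bucket(next_timer_us):
    # Order of magnitude of the next timer event, clamped to the table.
    return min(NBUCKETS - 1, max(0, int(math.log10(max(next_timer_us, 1.0)))))

def predict(next_timer_us):
    return next_timer_us * correction[bucket(next_timer_us)]

def learn(next_timer_us, measured_idle_us, alpha=0.25):
    # Move the bucket's correction toward the observed actual/predicted ratio.
    b = bucket(next_timer_us)
    ratio = measured_idle_us / max(next_timer_us, 1.0)
    correction[b] += alpha * (ratio - correction[b])

# A device interrupt keeps cutting 1000us timer sleeps down to ~100us:
for _ in range(20):
    learn(1000.0, 100.0)
print(predict(1000.0) < 200.0)   # True: the discount has been learned
```

Interrupts that repeatedly cut the sleep short drag the bucket's correction
down, which is the "discount based on history" part.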




Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Peter Zijlstra
On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote:
> > And as said; Daniel has been working on a better predictor -- now he's
> > probably not used it on the network workload you're looking at, so that
> > might be something to consider.
> 
> Deriving a better idle predictor is a bit orthogonal to fast idle.

No. If you want a different C state selected we need to fix the current
C state selector. We're not going to tinker.

And the predictor is probably the most fundamental part of the whole C
state selection logic.

Now I think the problem is that the current predictor goes for an
average idle duration. This means that we, on average, get it wrong 50%
of the time. For performance that's bad.

If you want to improve the worst case, we need to consider a cumulative
distribution function, and staying with the Gaussian assumption already
present, that would mean using:

CDF(x) = (1/2) * [ 1 + erf( (x - mu) / (sigma * sqrt(2)) ) ]

Where, per the normal convention mu is the average and sigma^2 the
variance. See also:

  https://en.wikipedia.org/wiki/Normal_distribution

We then solve CDF(x) = n% to find the x for which we get it wrong n% of
the time (IIRC something like: 'mu - 2sigma' ends up being 5% or so).

This conceptually gets us better exit latency for the cases where we got
it wrong before, and practically pushes down the estimate which gets us
C1 longer.

Of course, this all assumes a Gaussian distribution to begin with, if we
get bimodal (or worse) distributions we can still get it wrong. To fix
that, we'd need to do something better than what we currently have.
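For concreteness, solving CDF(x) = n% can be done directly with the standard
library; a small sketch (the mu/sigma values are made up, not from any real
idle trace):

```python
from statistics import NormalDist

# Illustrative idle-duration statistics (microseconds): mu is the mean,
# sigma the standard deviation of observed idle periods.
mu, sigma = 100.0, 30.0
d = NormalDist(mu, sigma)

# Solve CDF(x) = n% for x: the estimate that is too high only n% of the time.
x5 = d.inv_cdf(0.05)            # 5th percentile: mu - 1.645*sigma
print(x5)                       # ~50.7 us

# 'mu - 2*sigma' actually corresponds to CDF ~ 2.3%:
print(NormalDist().cdf(-2.0))   # ~0.0228
```

Using such a low quantile instead of the mean is exactly what pushes the
estimate down and keeps the CPU in C1 longer.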


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Li, Aubrey
On 2017/7/17 21:58, Peter Zijlstra wrote:
> On Mon, Jul 17, 2017 at 09:24:51PM +0800, Li, Aubrey wrote:
>> On 2017/7/14 12:05, Paul E. McKenney wrote:
>>>
>>> More specifically: rcu_needs_cpu(), rcu_prepare_for_idle(),
>>> rcu_cleanup_after_idle(), rcu_eqs_enter(), rcu_eqs_enter_common(),
>>> rcu_dynticks_eqs_enter(), do_nocb_deferred_wakeup(),
>>> rcu_dynticks_task_enter(), rcu_eqs_exit(), rcu_eqs_exit_common(),
>>> rcu_dynticks_task_exit(), rcu_dynticks_eqs_exit().
>>>
>>> The first three (rcu_needs_cpu(), rcu_prepare_for_idle(), and
>>> rcu_cleanup_after_idle()) should not be significant unless you have
>>> CONFIG_RCU_FAST_NO_HZ=y.  If you do, it would be interesting to learn
>>> how often invoke_rcu_core() is invoked from rcu_prepare_for_idle()
>>> and rcu_cleanup_after_idle(), as this can raise softirq.  Also
>>> rcu_accelerate_cbs() and rcu_try_advance_all_cbs().
>>>
>>> Knowing which of these is causing the most trouble might help me
>>> reduce the overhead in the current idle path.
>>
>> I measured two cases, nothing notable found.
> 
> So skipping rcu_idle_{enter,exit}() is not in fact needed at all?
> 

I think it would make more sense if we still put them under the case
where the tick is really stopped.

Thanks,
-Aubrey


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Peter Zijlstra
On Mon, Jul 17, 2017 at 09:24:51PM +0800, Li, Aubrey wrote:
> On 2017/7/14 12:05, Paul E. McKenney wrote:
> >
> > More specifically: rcu_needs_cpu(), rcu_prepare_for_idle(),
> > rcu_cleanup_after_idle(), rcu_eqs_enter(), rcu_eqs_enter_common(),
> > rcu_dynticks_eqs_enter(), do_nocb_deferred_wakeup(),
> > rcu_dynticks_task_enter(), rcu_eqs_exit(), rcu_eqs_exit_common(),
> > rcu_dynticks_task_exit(), rcu_dynticks_eqs_exit().
> >
> > The first three (rcu_needs_cpu(), rcu_prepare_for_idle(), and
> > rcu_cleanup_after_idle()) should not be significant unless you have
> > CONFIG_RCU_FAST_NO_HZ=y.  If you do, it would be interesting to learn
> > how often invoke_rcu_core() is invoked from rcu_prepare_for_idle()
> > and rcu_cleanup_after_idle(), as this can raise softirq.  Also
> > rcu_accelerate_cbs() and rcu_try_advance_all_cbs().
> >
> > Knowing which of these is causing the most trouble might help me
> > reduce the overhead in the current idle path.
> 
> I measured two cases, nothing notable found.

So skipping rcu_idle_{enter,exit}() is not in fact needed at all?


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Li, Aubrey
On 2017/7/17 17:21, Peter Zijlstra wrote:
> On Fri, Jul 14, 2017 at 09:03:14AM -0700, Andi Kleen wrote:
>> fast idle doesn't have an upper bound.
>>
>> If the prediction exceeds the fast idle threshold any C state can be used.
>>
>> It's just another state (fast C1), but right now it has an own threshold
>> which may be different from standard C1.
> 
> Given it uses the same estimate we end up with:
> 
> 
> Now, unless you're mister Turnbull, C2 will never get selected when
> fast_threshold > C2_threshold.
> 
> Which is wrong. If you want to effectively scale the selection of C1,
> why not also change the C2 and further criteria.
> 
That's not our intention, I think. As long as the predicted idle period
is > threshold, we'll enter the normal idle path: any supported C-state
can be selected, and the tick can also be turned off for power saving.
Any deferrable work can be invoked as well, because we'll sleep long enough.

Thanks,
-Aubrey


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Li, Aubrey
On 2017/7/14 12:05, Paul E. McKenney wrote:
>
> More specifically: rcu_needs_cpu(), rcu_prepare_for_idle(),
> rcu_cleanup_after_idle(), rcu_eqs_enter(), rcu_eqs_enter_common(),
> rcu_dynticks_eqs_enter(), do_nocb_deferred_wakeup(),
> rcu_dynticks_task_enter(), rcu_eqs_exit(), rcu_eqs_exit_common(),
> rcu_dynticks_task_exit(), rcu_dynticks_eqs_exit().
>
> The first three (rcu_needs_cpu(), rcu_prepare_for_idle(), and
> rcu_cleanup_after_idle()) should not be significant unless you have
> CONFIG_RCU_FAST_NO_HZ=y.  If you do, it would be interesting to learn
> how often invoke_rcu_core() is invoked from rcu_prepare_for_idle()
> and rcu_cleanup_after_idle(), as this can raise softirq.  Also
> rcu_accelerate_cbs() and rcu_try_advance_all_cbs().
>
> Knowing which of these is causing the most trouble might help me
> reduce the overhead in the current idle path.

I measured two cases, nothing notable found.

One case is CONFIG_NO_HZ_IDLE=y, where the following functions are just empty:

rcu_prepare_for_idle(): NULL
rcu_cleanup_after_idle():   NULL
do_nocb_deferred_wakeup():  NULL
rcu_dynticks_task_enter():  NULL
rcu_dynticks_task_exit():   NULL

And the following functions were traced separately. Each function was
traced 3 times with Intel PT; each sampling period is 1 second. "num" is
the number of times the function was invoked in that second, and
(min, max, avg) give the function duration in nanoseconds.

rcu_needs_cpu():
1) num: 6110    min: 3   max: 564    avg: 17.0
2) num: 16535   min: 3   max: 683    avg: 18.0
3) num: 8815    min: 3   max: 394    avg: 20.0

rcu_eqs_enter():
1) num: 7956    min: 17  max: 656    avg: 32.0
2) num: 9170    min: 17  max: 1075   avg: 35.0
3) num: 8604    min: 17  max: 859    avg: 29.0

rcu_eqs_enter_common():
1) num: 14676   min: 15  max: 620    avg: 28.0
2) num: 11180   min: 15  max: 795    avg: 30.0
3) num: 11484   min: 15  max: 725    avg: 29.0

rcu_dynticks_eqs_enter():
1) num: 11035   min: 10  max: 580    avg: 17.0
2) num: 15518   min: 10  max: 456    avg: 16.0
3) num: 15320   min: 10  max: 454    avg: 19.0

rcu_eqs_exit():
1) num: 11080   min: 14  max: 893    avg: 23.0
2) num: 13526   min: 14  max: 640    avg: 23.0
3) num: 12534   min: 14  max: 630    avg: 22.0

rcu_eqs_exit_common():
1) num: 18002   min: 12  max: 553    avg: 17.0
2) num: 10570   min: 11  max: 485    avg: 17.0
3) num: 13628   min: 11  max: 567    avg: 16.0

rcu_dynticks_eqs_exit():
1) num: 11195   min: 11  max: 436    avg: 16.0
2) num: 11808   min: 10  max: 506    avg: 16.0
3) num: 8132    min: 10  max: 546    avg: 15.0

==

The other case is CONFIG_NO_HZ_FULL; I also enabled the config options
required to make all of the functions non-empty.

rcu_needs_cpu():
1) num: 8530    min: 5   max: 770    avg: 13.0
2) num: 9965    min: 5   max: 518    avg: 12.0
3) num: 12503   min: 5   max: 755    avg: 16.0

rcu_prepare_for_idle():
1) num: 11662   min: 5   max: 684    avg: 9.0
2) num: 15294   min: 5   max: 676    avg: 9.0
3) num: 14332   min: 5   max: 524    avg: 9.0

rcu_cleanup_after_idle():
1) num: 13584   min: 4   max: 657    avg: 6.0
2) num: 9102    min: 4   max: 529    avg: 5.0
3) num: 10648   min: 4   max: 471    avg: 6.0

rcu_eqs_enter():
1) num: 14222   min: 26  max: 745    avg: 54.0
2) num: 12502   min: 26  max: 650    avg: 53.0
3) num: 11834   min: 26  max: 863    avg: 52.0

rcu_eqs_enter_common():
1) num: 16792   min: 24  max: 973    avg: 43.0
2) num: 19755   min: 24  max: 898    avg: 45.0
3) num: 8167    min: 24  max: 722    avg: 42.0

rcu_dynticks_eqs_enter():
1) num: 11605   min: 10  max: 532    avg: 14.0
2) num: 10438   min: 9   max: 554    avg: 14.0
3) num: 19816   min: 9   max: 701    avg: 14.0

do_nocb_deferred_wakeup():
1) num: 15348   min: 1   max: 459    avg: 3.0
2) num: 12822   min: 1   max: 564    avg: 4.0
3) num: 8272    min: 0   max: 458    avg: 3.0

rcu_dynticks_task_enter():
1) num: 6358    min: 1   max: 268    avg: 1.0
2) num: 11128   min: 1   max: 360    avg: 1.0
3) num: 20516   min: 1   max: 214    avg: 1.0

rcu_eqs_exit():
1) num: 16117   min: 20  max: 782    avg: 43.0
2) num: 11042   min: 20  max: 775    avg: 47.0
3) num: 16499   min: 20  max: 752    avg: 46.0

rcu_eqs_exit_common():
1) num: 12584   min: 17  max: 703    avg: 28.0
2) num: 17412   min: 17  max: 759    avg: 28.0
3) num: 16733   min: 17  max: 798    avg: 29.0

rcu_dynticks_task_exit():
1) num: 11730   min: 1   max: 528    avg: 4.0
2) num: 18840   min: 1   max: 581    avg: 5.0
3) num: 9815    min: 1   max: 381    avg: 4.0

rcu_dynticks_eqs_exit():
1) num: 10902   min: 9   max: 557    avg: 13.0
2) num: 19474   min: 9   max: 563    avg: 13.0
3) num: 11865   min: 9   max: 672    avg: 12.0

Please let me know if any of this data looks unreasonable; I can revisit it.


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-17 Thread Peter Zijlstra
On Fri, Jul 14, 2017 at 09:03:14AM -0700, Andi Kleen wrote:
> fast idle doesn't have an upper bound.
> 
> If the prediction exceeds the fast idle threshold any C state can be used.
> 
> It's just another state (fast C1), but right now it has an own threshold
> which may be different from standard C1.

Given it uses the same estimate we end up with:

select_c_state(idle_est)
{
	if (idle_est < fast_threshold)
		return C1;

	if (idle_est < C1_threshold)
		return C1;
	if (idle_est < C2_threshold)
		return C2;
	/* ... */

	return C6;
}

Now, unless you're mister Turnbull, C2 will never get selected when
fast_threshold > C2_threshold.

Which is wrong. If you want to effectively scale the selection of C1,
why not also change the C2 and further criteria.


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-14 Thread Andi Kleen
> And as said; Daniel has been working on a better predictor -- now he's
> probably not used it on the network workload you're looking at, so that
> might be something to consider.

Deriving a better idle predictor is a bit orthogonal to fast idle.

It would be a good idea to set the fast idle threshold to be the
same as intel_idle would use on that platform and rerun the benchmarks 

Then the Cx pattern should be mostly identical, and fast idle can be
evaluated on its own benefits.

-Andi


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-14 Thread Peter Zijlstra
On Fri, Jul 14, 2017 at 08:53:56AM -0700, Andi Kleen wrote:
> > > - if (cpuidle_not_available(drv, dev)) {
> > > + if (cpuidle_not_available(drv, dev) || this_is_a_fast_idle) {
> > >   default_idle_call();
> > >   goto exit_idle;
> > >   }
> > 
> > No, that's wrong. We want to fix the normal C state selection process to
> > pick the right C state.
> > 
> > The fast-idle criteria could cut off a whole bunch of available C
> > states. We need to understand why our current C state pick is wrong and
> > amend the algorithm to do better. Not just bolt something on the side.
> 
> Fast idle uses the same predictor as the current C state governor.
> 
> The only difference is that it uses a different threshold for C1.
> Likely that's the cause. If it was using the same threshold the
> decision would be the same.

Right, so it's selecting C1 for longer. That in turn means we could now
never select C2, because the fast-idle threshold is longer than our C2
time.

Which I feel is wrong: if we're getting C1 wrong, what says we're
getting the rest right?

> The thresholds are coming either from the tables in intel idle,
> or from ACPI (let's assume the first)
> 
> That means either: the intel idle C1 threshold on the system Aubrey
> tested on is too high, or the fast idle threshold is too low.

Or our predictor is doing it wrong. It could be that it's over-estimating
idle duration. For example, suppose we have an idle distribution of:

40% < C1
60% > C2

And we end up selecting C2. Even though in many of our sleeps we really
wanted C1.

And as said; Daniel has been working on a better predictor -- now he's
probably not used it on the network workload you're looking at, so that
might be something to consider.



Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-14 Thread Andi Kleen
On Fri, Jul 14, 2017 at 05:58:53PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 14, 2017 at 08:52:28AM -0700, Arjan van de Ven wrote:
> > On 7/14/2017 8:38 AM, Peter Zijlstra wrote:
> > > No, that's wrong. We want to fix the normal C state selection process to
> > > pick the right C state.
> > > 
> > > The fast-idle criteria could cut off a whole bunch of available C
> > > states. We need to understand why our current C state pick is wrong and
> > > amend the algorithm to do better. Not just bolt something on the side.
> > 
> > I can see a fast path through selection if you know the upper bound of any
> > selection is just 1 state.

fast idle doesn't have an upper bound.

If the prediction exceeds the fast idle threshold any C state can be used.

It's just another state (fast C1), but right now it has an own threshold
which may be different from standard C1.

> > 
> > But also, how much of this is about "C1 be fast" versus "selecting C1 is 
> > slow"
> 
> I got the impression its about we need to select C1 for longer. But the
> fact that the patches don't in fact answer any of these questions,
> they're wrong in principle ;-)

For that workload. Tuning idle thresholds is a complex trade off.

-Andi


Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods

2017-07-14 Thread Peter Zijlstra
On Fri, Jul 14, 2017 at 08:52:28AM -0700, Arjan van de Ven wrote:
> On 7/14/2017 8:38 AM, Peter Zijlstra wrote:
> > No, that's wrong. We want to fix the normal C state selection process to
> > pick the right C state.
> > 
> > The fast-idle criteria could cut off a whole bunch of available C
> > states. We need to understand why our current C state pick is wrong and
> > amend the algorithm to do better. Not just bolt something on the side.
> 
> I can see a fast path through selection if you know the upper bound of any
> selection is just 1 state.
> 
> But also, how much of this is about "C1 be fast" versus "selecting C1 is slow"

I got the impression it's about needing to select C1 for longer. But
since the patches don't in fact answer any of these questions, they're
wrong in principle ;-)

