Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Thu, Jul 20, 2017 at 04:19:40PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 20, 2017 at 05:50:54AM -0700, Paul E. McKenney wrote:
> > > static void cpuidle_idle_call()
> > > {
> > >         rcu_idle_enter()
> > >         ..
> > >         rcu_idle_exit()
> > > }
> > >
> > > I want
> > >
> > > static void cpuidle_idle_call()
> > > {
> > >         if (tick stopped)
> > >                 rcu_idle_enter()
> > >         ..
> > >         if (tick stopped)
> > >                 rcu_idle_exit()
> > > }
> > >
> > > Or checking tick stop can be put into rcu_idle_enter/exit
> >
> > The answer is the traditional "it depends".
> >
> > If the above change was all that you did, that would be a bug in the
> > case where the predicted short idle time turned out to in reality be
> > an indefinite idle time.
>
> Can't be, you didn't disable the tick after all, so you're guaranteed to
> get interrupted by the tick and try again.

I will reserve judgment on that until I see the patch.  But to your
point, I would indeed hope that it works that way.  ;-)

							Thanx, Paul
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Thu, Jul 20, 2017 at 05:50:54AM -0700, Paul E. McKenney wrote:
> > static void cpuidle_idle_call()
> > {
> >         rcu_idle_enter()
> >         ..
> >         rcu_idle_exit()
> > }
> >
> > I want
> >
> > static void cpuidle_idle_call()
> > {
> >         if (tick stopped)
> >                 rcu_idle_enter()
> >         ..
> >         if (tick stopped)
> >                 rcu_idle_exit()
> > }
> >
> > Or checking tick stop can be put into rcu_idle_enter/exit
>
> The answer is the traditional "it depends".
>
> If the above change was all that you did, that would be a bug in the
> case where the predicted short idle time turned out to in reality be an
> indefinite idle time.

Can't be, you didn't disable the tick after all, so you're guaranteed to
get interrupted by the tick and try again.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/20/2017 1:11 AM, Thomas Gleixner wrote:
> On Thu, 20 Jul 2017, Li, Aubrey wrote:
>> Don't get me wrong, even if a fast path is acceptable, we still need to
>> figure out if the coming idle is short and when to switch. I'm just
>> worried that if irq timings is not an ideal statistic, we have to skip
>> it too.
>
> There is no ideal solution ever. Lets sit back and look at that from the
> big picture first before dismissing a particular item upfront.
>
> The current NOHZ implementation does:
>
>         predict = nohz_predict(timers, rcu, arch, irqwork);
>
>         if ((predict - now) > X)
>                 stop_tick()
>
> The C-State machinery does something like:
>
>         predict = cstate_predict(next_timer, scheduler);
>
>         cstate = cstate_select(predict);
>
> That disconnect is part of the problem. What we really want is:
>
>         predict = idle_predict(timers, rcu, arch, irqwork, scheduler, irq timings);

two separate predictors is clearly a recipe for badness. (likewise, C and
P states try to estimate "performance sensitivity" and sometimes estimate
in opposite directions)

to be honest, performance sensitivity estimation is probably 10x more
critical for C state selection than idle duration; a lot of modern
hardware will do the energy efficiency work in a microcontroller, which
coordinates between the multiple cores in the system on C and P states.
(both x86 and ARM have such microcontrollers nowadays, at least for the
higher performance designs)
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/20/2017 5:50 AM, Paul E. McKenney wrote:
> To make this work reasonably, you would also need some way to check for
> the case where the predicted idle time is short but the real idle time
> is very long.

so for the case where you predict very short but the idle is actually
"indefinite", the real solution likely is that we set a timer some time
in the future (say 100msec, or some other value that is long but not
indefinite) at which we wake the system up and make a new prediction,
since clearly we were insanely wrong in the prediction and should try
again.

that or we turn the prediction from a single value into a range of
(expected, upper bound), where the upper bound is likely the next timer
or other going-to-happen event.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Thu, Jul 20, 2017 at 09:40:49AM +0800, Li, Aubrey wrote:
> On 2017/7/19 22:48, Paul E. McKenney wrote:
>> On Wed, Jul 19, 2017 at 01:44:06PM +0800, Li, Aubrey wrote:
>>> On 2017/7/18 23:20, Paul E. McKenney wrote:
>>>>> 2) for rcu idle enter/exit, I measured the details which Paul
>>>>> provided, and the result matches with what I have measured before,
>>>>> nothing notable found. But it still makes more sense if we can make
>>>>> rcu idle enter/exit hooked with tick off. (it's possible other
>>>>> workloads behave differently)
>>>>
>>>> Again, assuming that RCU is informed of CPUs in the kernel, regardless
>>>> of whether or not the tick is on at that point in time.
>>>>
>>> Yeah, I see, no problem for a normal idle.
>>>
>>> But for a short idle, we want to return to the task ASAP. Even though
>>> RCU cost is not notable, it would still be better for me if we can save
>>> some cycles in idle entry and idle exit.
>>>
>>> Do we have any problem if we skip RCU idle enter/exit under a fast idle
>>> scenario? My understanding is, if tick is not stopped, then we don't
>>> need to inform RCU in idle path, it can be informed in irq exit.
>>
>> Indeed, the problem arises when the tick is stopped.
>
> My question is, does the problem arise when the tick is *not* stopped
> (skipping nohz idle)?
>
> instead of
>
>     static void cpuidle_idle_call()
>     {
>             rcu_idle_enter()
>             ..
>             rcu_idle_exit()
>     }
>
> I want
>
>     static void cpuidle_idle_call()
>     {
>             if (tick stopped)
>                     rcu_idle_enter()
>             ..
>             if (tick stopped)
>                     rcu_idle_exit()
>     }
>
> Or checking tick stop can be put into rcu_idle_enter/exit

The answer is the traditional "it depends".

If the above change was all that you did, that would be a bug in the case
where the predicted short idle time turned out to in reality be an
indefinite idle time.

RCU would indefinitely believe that the CPU was non-idle, and would wait
for it to report a quiescent state, which it would not, even given the
scheduling-clock interrupt (it is idle, so it knows RCU already knows
that it is idle, you see). RCU would eventually send it a resched IPI,
which would probably get things going again, but the best you could hope
for would be extra resched IPIs and excessively long grace periods.

To make this work reasonably, you would also need some way to check for
the case where the predicted idle time is short but the real idle time is
very long.

							Thanx, Paul
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Thu, 20 Jul 2017, Li, Aubrey wrote:
> Don't get me wrong, even if a fast path is acceptable, we still need to
> figure out if the coming idle is short and when to switch. I'm just
> worried about if irq timings is not an ideal statistics, we have to skip
> it too.

There is no ideal solution ever. Lets sit back and look at that from the
big picture first before dismissing a particular item upfront.

The current NOHZ implementation does:

	predict = nohz_predict(timers, rcu, arch, irqwork);

	if ((predict - now) > X)
		stop_tick()

The C-State machinery does something like:

	predict = cstate_predict(next_timer, scheduler);

	cstate = cstate_select(predict);

That disconnect is part of the problem. What we really want is:

	predict = idle_predict(timers, rcu, arch, irqwork, scheduler, irq timings);

and use that prediction for both the NOHZ and the C-State decision
function. That's the first thing which needs to be addressed.

Once that is done, you can look into the prediction functions and
optimize that or tweak the bits and pieces there and decide which
predictors work best for a particular workload.

As long as you just look into a particular predictor function and do not
address the underlying conceptual issues first, the outcome is very much
predictable: It's going to be useless crap.

Thanks,

	tglx
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/19 15:55, Thomas Gleixner wrote:
> On Wed, 19 Jul 2017, Li, Aubrey wrote:
>> On 2017/7/18 15:19, Thomas Gleixner wrote:
>>> You can very well avoid it by taking the irq timings or whatever other
>>> information into account for the NOHZ decision.
>>>
>> If I read the source correctly, irq timing statistics computation
>> happens at idle time. Sadly, this is what Andi Kleen worried about.
>> People keep putting more and more slow stuff in idle path, not
>> realizing it could be a critical path.
>
> Can you please stop this right here? We all realize that there is an
> issue, we are not as stupid as you might think.
>
> But we disagree that the proper solution is to add an ad hoc hack which
> shortcuts stuff and creates a maintenance problem in the long run.
>
> Just dismissing a proposal based on 'oh this adds more code to the idle
> path' and then yelling 'you all do not get it' is not going to solve
> this in any way.
>
> You simply refuse to look at the big picture and you are just not
> willing to analyze the involved bits and pieces and find a solution
> which allows to serve both the NOHZ power saving and the interrupt
> driven workloads best.
>
> All you are doing is providing data about the status quo, and then
> deferring from that, that you need an extra magic code path. That
> information is just an inventory of the current behaviour, and the
> current behaviour is caused by the current implementation. There is
> nothing set in stone with that implementation and it can be changed.
>
> But you are just in refusal mode and instead of sitting down and doing
> the hard work, you accuse other people that they are not realizing the
> problem and insist on your sloppy hackery. That's simply not the way it
> works.

Don't get me wrong, even if a fast path is acceptable, we still need to
figure out if the coming idle is short and when to switch. I'm just
worried that if irq timings is not an ideal statistic, we have to skip it
too.
Let me take a little time to digest the details of irq timings and obtain
some data to see if it's suitable for my case.

Thanks,
-Aubrey
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/19 22:48, Paul E. McKenney wrote:
> On Wed, Jul 19, 2017 at 01:44:06PM +0800, Li, Aubrey wrote:
>> On 2017/7/18 23:20, Paul E. McKenney wrote:
>>>> 2) for rcu idle enter/exit, I measured the details which Paul
>>>> provided, and the result matches with what I have measured before,
>>>> nothing notable found. But it still makes more sense if we can make
>>>> rcu idle enter/exit hooked with tick off. (it's possible other
>>>> workloads behave differently)
>>>
>>> Again, assuming that RCU is informed of CPUs in the kernel, regardless
>>> of whether or not the tick is on at that point in time.
>>>
>> Yeah, I see, no problem for a normal idle.
>>
>> But for a short idle, we want to return to the task ASAP. Even though
>> RCU cost is not notable, it would still be better for me if we can save
>> some cycles in idle entry and idle exit.
>>
>> Do we have any problem if we skip RCU idle enter/exit under a fast idle
>> scenario? My understanding is, if tick is not stopped, then we don't
>> need to inform RCU in idle path, it can be informed in irq exit.
>
> Indeed, the problem arises when the tick is stopped.

My question is, does the problem arise when the tick is *not* stopped
(skipping nohz idle)?

instead of

    static void cpuidle_idle_call()
    {
            rcu_idle_enter()
            ..
            rcu_idle_exit()
    }

I want

    static void cpuidle_idle_call()
    {
            if (tick stopped)
                    rcu_idle_enter()
            ..
            if (tick stopped)
                    rcu_idle_exit()
    }

Or checking tick stop can be put into rcu_idle_enter/exit

Thanks,
-Aubrey
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Wed, Jul 19, 2017 at 10:03:22AM -0500, Christopher Lameter wrote:
> On Wed, 19 Jul 2017, Paul E. McKenney wrote:
>>> Do we have any problem if we skip RCU idle enter/exit under a fast
>>> idle scenario? My understanding is, if tick is not stopped, then we
>>> don't need to inform RCU in idle path, it can be informed in irq exit.
>>
>> Indeed, the problem arises when the tick is stopped.
>
> Well is there a boundary where you would want the notification calls? I
> would think that even an idle period of a couple of seconds does not
> necessarily require a callback to RCU. We had some brokenness here where
> RCU calls did not occur for hours or so. At some point the system ran
> out of memory, but that's far off.

Yeah, I should spell this out more completely, shouldn't I? And get it
into the documentation if it isn't already there... Where it is currently
at best hinted at. :-/

1.	If a CPU is either idle or executing in usermode, and RCU believes
	it is non-idle, the scheduling-clock tick had better be running.
	Otherwise, you will get RCU CPU stall warnings. Or at best, very
	long (11-second) grace periods, with a pointless IPI waking the
	CPU each time.

2.	If a CPU is in a portion of the kernel that executes RCU
	read-side critical sections, and RCU believes this CPU to be
	idle, you can get random memory corruption. DON'T DO THIS!!!
	This is one reason to test with lockdep, which will complain
	about this sort of thing.

3.	If a CPU is in a portion of the kernel that is absolutely
	positively no-joking guaranteed to never execute any RCU
	read-side critical sections, and RCU believes this CPU to be
	idle, no problem. This sort of thing is used by some
	architectures for light-weight exception handlers, which can
	then avoid the overhead of rcu_irq_enter() and rcu_irq_exit().
	Some go further and avoid the entireties of irq_enter() and
	irq_exit(). Just make very sure you are running some of your
	tests with CONFIG_PROVE_RCU=y.

4.	If a CPU is executing in the kernel with the scheduling-clock
	interrupt disabled and RCU believes this CPU to be non-idle, and
	if the CPU goes idle (from an RCU perspective) every few jiffies,
	no problem. It is usually OK for there to be the occasional gap
	between idle periods of up to a second or so. If the gap grows
	too long, you get RCU CPU stall warnings.

5.	If a CPU is either idle or executing in usermode, and RCU
	believes it to be idle, of course no problem.

6.	If a CPU is executing in the kernel, the kernel code path is
	passing through quiescent states at a reasonable frequency
	(preferably about once per few jiffies, but the occasional
	excursion to a second or so is usually OK) and the
	scheduling-clock interrupt is enabled, of course no problem. If
	the gap between a successive pair of quiescent states grows too
	long, you get RCU CPU stall warnings.

For purposes of comparison, the default time until RCU CPU stall warning
in mainline is 21 seconds. A number of distros set this to 60 seconds.
Back in the 90s, the analogous timeout for DYNIX/ptx was 1.5 seconds. :-/
Hence "a second or so" instead of your "a couple of seconds". ;-)

Please see below for the corresponding patch to RCU's requirements.
Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

commit 693c9bfa43f92570dd362d8834440b418bbb994a
Author: Paul E. McKenney
Date:   Wed Jul 19 09:52:58 2017 -0700

    doc: Set down RCU's scheduling-clock-interrupt needs

    This commit documents the situations in which RCU needs the
    scheduling-clock interrupt to be enabled, along with the consequences
    of failing to meet RCU's needs in this area.

    Signed-off-by: Paul E. McKenney

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 95b30fa25d56..7980bee5607f 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2080,6 +2080,8 @@ Some of the relevant points of interest are as follows:
 Scheduler and RCU.
 Tracing and RCU.
 Energy Efficiency.
+
+Scheduling-Clock Interrupts and RCU.
 Memory Efficiency.
 Performance, Scalability, Response Time, and Reliability.
@@ -2532,6 +2534,113 @@ I learned of many of these requirements via angry phone calls:
 Flaming me on the Linux-kernel mailing list was apparently not
 sufficient to fully vent their ire at RCU's energy-efficiency bugs!
+
+Scheduling-Clock Interrupts and RCU
+
+The kernel transitions between in-kernel non-idle execution, userspace
+execution, and the idle loop.
+Depending on
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Wed, 19 Jul 2017, Paul E. McKenney wrote:

> > Do we have any problem if we skip RCU idle enter/exit under a fast idle
> > scenario?
> > My understanding is, if tick is not stopped, then we don't need inform
> > RCU in idle path, it can be informed in irq exit.
>
> Indeed, the problem arises when the tick is stopped.

Well, is there a boundary when you would want the notification calls? I
would think that even an idle period of a couple of seconds does not
necessarily require a callback to RCU.

Had some brokenness here where RCU calls did not occur for hours or so. At
some point the system ran out of memory, but that's far off.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Wed, Jul 19, 2017 at 03:43:07PM +0200, Frederic Weisbecker wrote:
> On Wed, Jul 12, 2017 at 08:56:51AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 12, 2017 at 01:54:51PM +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 11, 2017 at 11:09:31AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Jul 11, 2017 at 06:34:22PM +0200, Peter Zijlstra wrote:
> > > > >
> > > > > But I think we can at the very least do this; it only gets called
> > > > > from kernel/sched/idle.c and both callsites have IRQs explicitly
> > > > > disabled by that point.
> > > > >
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 51d4c3acf32d..dccf2dc8155a 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -843,13 +843,8 @@ static void rcu_eqs_enter(bool user)
> > > > >   */
> > > > >  void rcu_idle_enter(void)
> > > > >  {
> > > > > -	unsigned long flags;
> > > > > -
> > > > > -	local_irq_save(flags);
> > > >
> > > > With this addition, I am all for it:
> > > >
> > > > RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_idle_enter() invoked with irqs enabled!!!");
> > > >
> > > > If you are OK with this addition, may I please have your Signed-off-by?
> > >
> > > Sure,
> > >
> > > Signed-off-by: Peter Zijlstra (Intel)
> >
> > Very good, I have queued the patch below.  I left out the removal of
> > the export as I need to work out why the export was there.  If it turns
> > out not to be needed, I will remove the related ones as well.
> >
> > Fair enough?
> >
> > 							Thanx, Paul
> >
> > commit 95f3e587ce6388028a51f0c852800fca944e7032
> > Author: Peter Zijlstra (Intel)
> > Date:   Wed Jul 12 07:59:54 2017 -0700
> >
> >     rcu: Make rcu_idle_enter() rely on callers disabling irqs
> >
> >     All callers to rcu_idle_enter() have irqs disabled, so there is no
> >     point in rcu_idle_enter disabling them again.  This commit therefore
> >     replaces the irq disabling with a RCU_LOCKDEP_WARN().
> >
> >     Signed-off-by: Peter Zijlstra (Intel)
> >     Signed-off-by: Paul E. McKenney
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 6de6b1c5ee53..c78b076653ce 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -840,11 +840,8 @@ static void rcu_eqs_enter(bool user)
> >   */
> >  void rcu_idle_enter(void)
> >  {
> > -	unsigned long flags;
> > -
> > -	local_irq_save(flags);
> > +	RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_idle_enter() invoked with irqs enabled!!!");
>
> Nice! I really need to make that LOCKDEP_ASSERT_IRQS_DISABLED() patchset
> because there are many places where either we have
> WARN_ON_ONCE(!irqs_disabled()) or we need it but we were too shy to do it
> for performance reason.

If you create it, let me know, I could switch to it.

							Thanx, Paul

> >  	rcu_eqs_enter(false);
> > -	local_irq_restore(flags);
> >  }
> >  EXPORT_SYMBOL_GPL(rcu_idle_enter);
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Wed, Jul 19, 2017 at 01:44:06PM +0800, Li, Aubrey wrote:
> On 2017/7/18 23:20, Paul E. McKenney wrote:
>
> >> 2) for rcu idle enter/exit, I measured the details which Paul provided,
> >> and the result matches with what I have measured before, nothing
> >> notable found. But it still makes more sense if we can make rcu idle
> >> enter/exit hooked with tick off. (it's possible other workloads behave
> >> differently)
> >
> > Again, assuming that RCU is informed of CPUs in the kernel, regardless
> > of whether or not the tick is on at that point in time.
>
> Yeah, I see, no problem for a normal idle.
>
> But for a short idle, we want to return to the task ASAP. Even though RCU
> cost is not notable, it would still be better for me if we can save some
> cycles in idle entry and idle exit.
>
> Do we have any problem if we skip RCU idle enter/exit under a fast idle
> scenario?
> My understanding is, if tick is not stopped, then we don't need inform RCU
> in idle path, it can be informed in irq exit.

Indeed, the problem arises when the tick is stopped.

							Thanx, Paul
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Tue, Jul 18, 2017 at 11:14:57AM +0800, Li, Aubrey wrote:
> On 2017/7/18 3:23, Peter Zijlstra wrote:
> > On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote:
> >>> And as said; Daniel has been working on a better predictor -- now he's
> >>> probably not used it on the network workload you're looking at, so that
> >>> might be something to consider.
> >>
> >> Deriving a better idle predictor is a bit orthogonal to fast idle.
> >
> > No. If you want a different C state selected we need to fix the current
> > C state selector. We're not going to tinker.
> >
> > And the predictor is probably the most fundamental part of the whole C
> > state selection logic.
> >
> > Now I think the problem is that the current predictor goes for an
> > average idle duration. This means that we, on average, get it wrong 50%
> > of the time. For performance that's bad.
> >
> > If you want to improve the worst case, we need to consider a cumulative
> > distribution function, and staying with the Gaussian assumption already
> > present, that would mean using:
> >
> >            1             x - mu
> >   CDF(x) = - [ 1 + erf( ------------- ) ]
> >            2            sigma sqrt(2)
> >
> > Where, per the normal convention, mu is the average and sigma^2 the
> > variance. See also:
> >
> >   https://en.wikipedia.org/wiki/Normal_distribution
> >
> > We then solve CDF(x) = n% to find the x for which we get it wrong n% of
> > the time (IIRC something like 'mu - 2sigma' ends up being 5% or so).
> >
> > This conceptually gets us better exit latency for the cases where we got
> > it wrong before, and practically pushes down the estimate, which gets us
> > C1 longer.
> >
> > Of course, this all assumes a Gaussian distribution to begin with; if we
> > get bimodal (or worse) distributions we can still get it wrong. To fix
> > that, we'd need to do something better than what we currently have.
>
> Maybe you are talking about applying some machine learning algorithm
> online to fit a multivariate normal distribution, :)

See here for an implementation of what I spoke about above:

  https://lkml.kernel.org/r/20170719133940.uytsixvfgpmo3...@hirez.programming.kicks-ass.net

Very much statistics 101.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Wed, Jul 12, 2017 at 08:56:51AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 12, 2017 at 01:54:51PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 11, 2017 at 11:09:31AM -0700, Paul E. McKenney wrote:
> > > On Tue, Jul 11, 2017 at 06:34:22PM +0200, Peter Zijlstra wrote:
> > > >
> > > > But I think we can at the very least do this; it only gets called
> > > > from kernel/sched/idle.c and both callsites have IRQs explicitly
> > > > disabled by that point.
> > > >
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 51d4c3acf32d..dccf2dc8155a 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -843,13 +843,8 @@ static void rcu_eqs_enter(bool user)
> > > >   */
> > > >  void rcu_idle_enter(void)
> > > >  {
> > > > -	unsigned long flags;
> > > > -
> > > > -	local_irq_save(flags);
> > >
> > > With this addition, I am all for it:
> > >
> > > RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_idle_enter() invoked with irqs enabled!!!");
> > >
> > > If you are OK with this addition, may I please have your Signed-off-by?
> >
> > Sure,
> >
> > Signed-off-by: Peter Zijlstra (Intel)
>
> Very good, I have queued the patch below.  I left out the removal of
> the export as I need to work out why the export was there.  If it turns
> out not to be needed, I will remove the related ones as well.
>
> Fair enough?
>
> 							Thanx, Paul
>
> commit 95f3e587ce6388028a51f0c852800fca944e7032
> Author: Peter Zijlstra (Intel)
> Date:   Wed Jul 12 07:59:54 2017 -0700
>
>     rcu: Make rcu_idle_enter() rely on callers disabling irqs
>
>     All callers to rcu_idle_enter() have irqs disabled, so there is no
>     point in rcu_idle_enter disabling them again.  This commit therefore
>     replaces the irq disabling with a RCU_LOCKDEP_WARN().
>
>     Signed-off-by: Peter Zijlstra (Intel)
>     Signed-off-by: Paul E. McKenney
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 6de6b1c5ee53..c78b076653ce 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -840,11 +840,8 @@ static void rcu_eqs_enter(bool user)
>   */
>  void rcu_idle_enter(void)
>  {
> -	unsigned long flags;
> -
> -	local_irq_save(flags);
> +	RCU_LOCKDEP_WARN(!irqs_disabled(), "rcu_idle_enter() invoked with irqs enabled!!!");

Nice! I really need to make that LOCKDEP_ASSERT_IRQS_DISABLED() patchset,
because there are many places where either we have
WARN_ON_ONCE(!irqs_disabled()) or we need it but we were too shy to do it
for performance reasons.

>  	rcu_eqs_enter(false);
> -	local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(rcu_idle_enter);
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Wed, 19 Jul 2017, Li, Aubrey wrote:
> On 2017/7/18 15:19, Thomas Gleixner wrote:
> > You can very well avoid it by taking the irq timings or whatever other
> > information into account for the NOHZ decision.
>
> If I read the source correctly, irq timing statistics computation happens
> at idle time. Sadly, this is what Andi Kleen worried about. People keep
> putting more and more slow stuff in idle path, not realizing it could be
> a critical path.

Can you please stop this right here? We all realize that there is an
issue; we are not as stupid as you might think. But we disagree that the
proper solution is to add an ad hoc hack which shortcuts stuff and creates
a maintenance problem in the long run.

Just dismissing a proposal based on 'oh, this adds more code to the idle
path' and then yelling 'you all do not get it' is not going to solve this
in any way.

You simply refuse to look at the big picture, and you are just not willing
to analyze the involved bits and pieces and find a solution which allows
us to serve both the NOHZ power saving and the interrupt-driven workloads
best.

All you are doing is providing data about the status quo and then
inferring from that that you need an extra magic code path. That
information is just an inventory of the current behaviour, and the current
behaviour is caused by the current implementation. There is nothing set in
stone with that implementation, and it can be changed.

But you are just in refusal mode, and instead of sitting down and doing
the hard work, you accuse other people of not realizing the problem and
insist on your sloppy hackery. That's simply not the way it works.

Thanks,

	tglx
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/18 15:19, Thomas Gleixner wrote:
> On Mon, 17 Jul 2017, Andi Kleen wrote:
>> On Tue, Jul 18, 2017 at 08:43:53AM +0200, Thomas Gleixner wrote:
>>> On Mon, 17 Jul 2017, Andi Kleen wrote:
>>>>> We need a tradeoff here IMHO. I'll check Daniel's work to understand
>>>>> how/if it's better than menu governor.
>>>>
>>>> I still would like to see how the fast path without the C1 heuristic
>>>> works. Fast pathing is a different concept from a better predictor.
>>>> IMHO we need both, but the first is likely lower hanging fruit.
>>>
>>> Hacking something on the side is always the lower hanging fruit as it
>>> avoids solving the hard problems. As Peter said already, that's not
>>> going to happen unless there is a real technical reason why the
>>> general path cannot be fixed. So far there is no proof for that.
>>
>> You didn't look at Aubrey's data?
>>
>> There are some unavoidable slow operations in the current path -- e.g.
>> reprograming the timer for NOHZ. But we don't need that for really
>> short idle periods, because as you pointed out they never get woken
>> up by the tick.
>>
>> Similar for other things like RCU.
>>
>> I don't see how you can avoid that other than without a fast path
>> mechanism.
>
> You can very well avoid it by taking the irq timings or whatever other
> information into account for the NOHZ decision.

If I read the source correctly, irq timing statistics computation happens
at idle time. Sadly, this is what Andi Kleen worried about. People keep
putting more and more slow stuff in idle path, not realizing it could be a
critical path.

/*
 * The computation happens at idle time. When the CPU is not idle, the
 * interrupts' timestamps are stored in the circular buffer, when the
 * CPU goes idle and this routine is called, all the buffer's values
 * are injected in the statistical model continuing to extend the
 * statistics from the previous busy-idle cycle.
 */

Thanks,
-Aubrey
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/18 23:20, Paul E. McKenney wrote:
>> 2) for rcu idle enter/exit, I measured the details which Paul provided,
>> and the result matches with what I have measured before, nothing notable
>> found. But it still makes more sense if we can make rcu idle enter/exit
>> hooked with tick off. (it's possible other workloads behave differently)
>
> Again, assuming that RCU is informed of CPUs in the kernel, regardless
> of whether or not the tick is on at that point in time.

Yeah, I see, no problem for a normal idle.

But for a short idle, we want to return to the task ASAP. Even though RCU
cost is not notable, it would still be better for me if we can save some
cycles in idle entry and idle exit.

Do we have any problem if we skip RCU idle enter/exit under a fast idle
scenario?

My understanding is, if tick is not stopped, then we don't need to inform
RCU in idle path; it can be informed in irq exit.

Thanks,
-Aubrey
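[Editor's illustration: the tick-conditional notification Aubrey is proposing can be modelled with user-space stubs. None of these names are the real kernel functions; `tick_stopped` stands in for `tick_nohz_tick_stopped()`, and the counters exist only so the behaviour can be checked.]

```c
#include <stdbool.h>

int rcu_enters;            /* simulated rcu_idle_enter() calls */
int rcu_exits;             /* simulated rcu_idle_exit() calls */
bool tick_stopped;         /* stand-in for tick_nohz_tick_stopped() */

void fake_rcu_idle_enter(void) { rcu_enters++; }
void fake_rcu_idle_exit(void)  { rcu_exits++; }

/*
 * Sketch of the conditional idle entry: only tell RCU about idle when
 * the tick has actually been stopped.  With the tick still running, the
 * CPU keeps taking scheduling-clock interrupts and RCU learns about its
 * state on irq exit anyway.
 */
void cpuidle_idle_call_sketch(void)
{
	if (tick_stopped)
		fake_rcu_idle_enter();

	/* ... the actual idle state would be entered here ... */

	if (tick_stopped)
		fake_rcu_idle_exit();
}
```

As Paul notes, the open question is what happens when a predicted-short idle turns out to be long; the tick, still enabled, bounds how long the CPU can stay unaccounted for.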
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Mon, Jul 17, 2017 at 12:48:38PM -0700, Arjan van de Ven wrote:
> On 7/17/2017 12:23 PM, Peter Zijlstra wrote:
> > Of course, this all assumes a Gaussian distribution to begin with; if we
> > get bimodal (or worse) distributions we can still get it wrong. To fix
> > that, we'd need to do something better than what we currently have.
>
> fwiw some time ago I made a chart for predicted vs actual so you can sort
> of judge the distribution of things visually
>
> http://git.fenrus.org/tmp/linux2.png

That shows we get it wrong a lot of times (about 50%, as per the average)
and moving the line has benefit.

Since for performance you really don't want to pick the deeper idle state,
you want to bias your pick towards a shallower state. Using the CDF
approach you can quantify by how much you want it moved.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Tue, Jul 18, 2017 at 09:37:57AM -0700, Arjan van de Ven wrote:
> that's just a matter of fixing the C1 and later thresholds to line up
> right. shrug, that's the most trivial thing to do, it's a number in a
> table.

Well, they represent a physical measure, namely the break-even time. If
you go muck with them, you don't have anything left. This is tinkering of
the worst possible kind.

Fix the estimator if you want behavioural changes.

> some distros do those tunings anyway when they don't like the upstream
> tunings

*shudder*
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Tue, Jul 18, 2017 at 02:56:47PM +0800, Li, Aubrey wrote:
> 3) for tick nohz idle, we want to skip if the coming idle is short. If we
> can skip the tick nohz idle, we then skip all the items depending on it.
> But, there are two hard points:
>
> 3.1) how to compute the period of the coming idle. My current proposal is
> to use two factors in the current idle menu governor. There are two
> possible options from Peter and Thomas: the one is to use the scheduler
> idle estimate, which is task activity based, the other is to use the
> statistics generated from the irq timings work.
>
> 3.2) how to determine if the idle is short or long. My current proposal
> is to use a tunable value via /sys, while Peter prefers an auto-adjust
> mechanism. I didn't get the details of an auto-adjust mechanism yet.

So the problem is that the cost of NOHZ enter/exit is for a large part
determined by (micro)architectural costs of programming timer hardware. A
single static threshold will never be the right value across all the
various machines we run Linux on.

So my suggestion was simply timing the cost of doing those functions every
time we do them and keeping a running average of their cost. Then use that
measured cost as a basis for selecting when to skip them.

For example, if the estimated idle period (by whatever estimate we end up
using) is less than 4 times the cost of doing NOHZ, don't do NOHZ.

Note how both tick_nohz_idle_{enter,exit}() already take a timestamp at
entry; all you need to do is take another one at exit and subtract.
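[Editor's illustration: the auto-adjusting threshold Peter describes, as a user-space sketch. The factor of 4 and the subtract-two-timestamps measurement are from the mail; the 1/8 exponential weighting and all names here are assumptions, not kernel code.]

```c
/* Running average cost of one NOHZ enter+exit, e.g. in nanoseconds. */
unsigned long long nohz_cost_avg;

/*
 * Fold one measured enter+exit cost into the running average with a
 * 1/8 weight: avg += (sample - avg) / 8.  The first sample seeds it.
 */
void nohz_cost_update(unsigned long long sample)
{
	if (!nohz_cost_avg) {
		nohz_cost_avg = sample;
	} else {
		long long diff = (long long)sample - (long long)nohz_cost_avg;

		nohz_cost_avg += diff / 8;
	}
}

/*
 * Peter's rule of thumb: only bother stopping the tick when the
 * predicted idle period is at least 4x what NOHZ enter/exit costs
 * on this machine, as measured above.
 */
int nohz_worthwhile(unsigned long long predicted_idle_ns)
{
	return predicted_idle_ns >= 4 * nohz_cost_avg;
}
```

Because the average is measured per machine, the skip threshold self-tunes to the (micro)architectural cost of reprogramming the timer hardware, instead of relying on one static /sys value.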
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/18/2017 9:36 AM, Peter Zijlstra wrote:
> On Tue, Jul 18, 2017 at 08:29:40AM -0700, Arjan van de Ven wrote:
> > the most obvious way to do this (for me, maybe I'm naive) is to add another
> > C state, lets call it "C1-lite" with its own thresholds and power levels etc,
> > and just let that be picked naturally based on the heuristics.
> > (if we want to improve the heuristics, that's fine and always welcome but that
> > is completely orthogonal in my mind)
>
> C1-lite would then have a threshold < C1, whereas I understood the desire to
> be for the fast-idle crud to have a larger threshold than C1 currently has.
>
> That is, from what I understood, they want C1 selected *longer*.

that's just a matter of fixing the C1 and later thresholds to line up right. shrug that's the most trivial thing to do, it's a number in a table. some distros do those tunings anyway when they don't like the upstream tunings
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Tue, Jul 18, 2017 at 08:29:40AM -0700, Arjan van de Ven wrote: > > the most obvious way to do this (for me, maybe I'm naive) is to add another > C state, lets call it "C1-lite" with its own thresholds and power levels etc, > and just let that be picked naturally based on the heuristics. > (if we want to improve the heuristics, that's fine and always welcome but that > is completely orthogonal in my mind) C1-lite would then have a threshold < C1, whereas I understood the desire to be for the fast-idle crud to have a larger threshold than C1 currently has. That is, from what I understood, they want C1 selected *longer*.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/18/2017 8:20 AM, Paul E. McKenney wrote:
> 3.2) how to determine if the idle is short or long. My current proposal is to
> use a tunable value via /sys, while Peter prefers an auto-adjust mechanism. I
> didn't get the details of an auto-adjust mechanism yet

the most obvious way to do this (for me, maybe I'm naive) is to add another C state, lets call it "C1-lite" with its own thresholds and power levels etc, and just let that be picked naturally based on the heuristics. (if we want to improve the heuristics, that's fine and always welcome but that is completely orthogonal in my mind) this C1-lite would then skip some of the idle steps like the nohz logic. How we plumb that ... might end up being a flag or whatever, we'll figure that out easily. as long as "real C1" has a break even time that is appropriate compared to C1-lite, we'll only pick C1-lite for very very short idles like is desired... but we don't end up creating a parallel infra for picking states, that part just does not make sense to me tbh I have yet to see any reason why C1-lite couldn't be just another C-state for everything except the actual place where we do the "go idle" last bit of logic. (Also note that for extreme short idles, today we just spinloop (C0), so by this argument we should also do a C0-lite.. or make this C0 always the lite variant)
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Tue, Jul 18, 2017 at 02:56:47PM +0800, Li, Aubrey wrote: > On 2017/7/18 14:43, Thomas Gleixner wrote: > > On Mon, 17 Jul 2017, Andi Kleen wrote: > > > >>> We need a tradeoff here IMHO. I'll check Daniel's work to understand > >>> how/if > >>> it's better than menu governor. > >> > >> I still would like to see how the fast path without the C1 heuristic works. > >> > >> Fast pathing is a different concept from a better predictor. IMHO we need > >> both, but the first is likely lower hanging fruit. > > > > Hacking something on the side is always the lower hanging fruit as it > > avoids solving the hard problems. As Peter said already, that's not going > > to happen unless there is a real technical reason why the general path > > cannot be fixed. So far there is no proof for that. > > > Let me try to make a summary, please correct me if I was wrong. > > 1) for quiet_vmstat, we are agreed to move to another place where tick is > really stopped. > > 2) for rcu idle enter/exit, I measured the details which Paul provided, and > the result matches with what I have measured before, nothing notable found. > But it still makes more sense if we can make rcu idle enter/exit hooked with > tick off. (it's possible other workloads behave differently) Again, assuming that RCU is informed of CPUs in the kernel, regardless of whether or not the tick is on at that point in time. Thanx, Paul > 3) for tick nohz idle, we want to skip if the coming idle is short. If we can > skip the tick nohz idle, we then skip all the items depending on it. But, > there > are two hard points: > > 3.1) how to compute the period of the coming idle. My current proposal is to > use two factors in the current idle menu governor. There are two possible > options from Peter and Thomas, the one is to use scheduler idle estimate, > which > is task activity based, the other is to use the statistics generated from irq > timings work. > > 3.2) how to determine if the idle is short or long. 
My current proposal is to > use a tunable value via /sys, while Peter prefers an auto-adjust mechanism. I > didn't get the details of an auto-adjust mechanism yet > > 4) for idle loop, my proposal introduces a simple one to use default idle > routine directly, while Peter and Thomas suggest we fix c-state selection > in the existing idle path. > > Thanks, > -Aubrey >
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Tue, Jul 18, 2017 at 11:14:57AM +0800, Li, Aubrey wrote: > On 2017/7/18 3:23, Peter Zijlstra wrote: > > On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote: > >>> And as said; Daniel has been working on a better predictor -- now he's > >>> probably not used it on the network workload you're looking at, so that > >>> might be something to consider. > >> > >> Deriving a better idle predictor is a bit orthogonal to fast idle. > > > > No. If you want a different C state selected we need to fix the current > > C state selector. We're not going to tinker. > > > > And the predictor is probably the most fundamental part of the whole C > > state selection logic. > > > > Now I think the problem is that the current predictor goes for an > > average idle duration. This means that we, on average, get it wrong 50% > > of the time. For performance that's bad. > > > > If you want to improve the worst case, we need to consider a cumulative > > distribution function, and staying with the Gaussian assumption already > > present, that would mean using: > > > > CDF(x) = 1/2 * [ 1 + erf( (x - mu) / (sigma * sqrt(2)) ) ] > > > > Where, per the normal convention, mu is the average and sigma^2 the > > variance. See also: > > > > https://en.wikipedia.org/wiki/Normal_distribution > > > > We then solve CDF(x) = n% to find the x for which we get it wrong n% of > > the time (IIRC something like: 'mu - 2sigma' ends up being 5% or so). > > > > This conceptually gets us better exit latency for the cases where we got > > it wrong before, and practically pushes down the estimate which gets us > > C1 longer. > > > > Of course, this all assumes a Gaussian distribution to begin with, if we > > get bimodal (or worse) distributions we can still get it wrong. To fix > > that, we'd need to do something better than what we currently have. > > > Maybe you are talking about applying some machine learning algorithm online > to fit a multivariate normal distribution, :) Nah, nothing that fancy.. 
Something that _could_ work and deals with arbitrary distributions is buckets divided on the actual C-state selection boundaries and a (cyclic) array of the N most recent idle times. Something like so:

struct est_data {
	u8	array[64];
	u8	*dist;
	u8	idx;
};

DEFINE_PER_CPU(struct est_data, est_data);

void est_init(void)
{
	int size = drv->state_count;
	int cpu;

	for_each_possible_cpu(cpu) {
		per_cpu(est_data, cpu).dist = kzalloc(size, GFP_KERNEL);
		/* handle errors */
	}
}

u8 est_duration_2_state(u64 duration)
{
	int i;

	for (i = 0; i < drv->state_count; i++) {
		if (duration/1024 < drv->state[i].target_residency)
			return i;
	}
	return i-1;
}

void est_contemplate(u64 duration)
{
	struct est_data *ed = this_cpu_ptr(&est_data);
	int state = est_duration_2_state(duration);
	int idx = (ed->idx++) % ARRAY_SIZE(ed->array);

	ed->dist[ed->array[idx]]--;
	ed->array[idx] = state;
	ed->dist[ed->array[idx]]++;
}

int est_state(int pct)
{
	struct est_data *ed = this_cpu_ptr(&est_data);
	int limit = pct * ARRAY_SIZE(ed->array) / 100; /* XXX move div out of line */
	int cnt = 0, last = 0;
	int i;

	/* CDF */
	for (i = 0; i < drv->state_count; i++) {
		cnt += ed->dist[i];
		if (cnt > limit)
			break;
		last = i;
	}

	return last;
}

> Well, back to the problem, when the scheduler picks up idle thread, it does > not look at the history, nor make the prediction. So it's possible it has > to switch back a task ASAP when it's going into idle(very common under some > workloads). > > That is, (idle_entry + idle_exit) > idle. If the system has multiple > hardware idle states, then: > > (idle_entry + idle_exit + HW_entry + HW_exit) > HW_sleep > > So we eventually want the idle path lighter than what we currently have. > A complex predictor may have high accuracy, but the cost could be high as > well. I never suggested anything complex. The current menu thing uses an average, all I said is if instead of the average you use something less, say 'avg - 2*stdev' (it already computes the stdev) you get something, which assuming Gaussian, is less than ~5% wrong on exit latency. 
The above, also simple thing, uses a generic distribution function, which works because it uses the exact boundaries we're limited to anyway. Of course, the above needs to be augmented with IRQ bits etc..
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Mon, 17 Jul 2017, Andi Kleen wrote: > On Tue, Jul 18, 2017 at 08:43:53AM +0200, Thomas Gleixner wrote: > > On Mon, 17 Jul 2017, Andi Kleen wrote: > > > > > > We need a tradeoff here IMHO. I'll check Daniel's work to understand > > > > how/if > > > > it's better than menu governor. > > > > > > I still would like to see how the fast path without the C1 heuristic > > > works. > > > > > > Fast pathing is a different concept from a better predictor. IMHO we need > > > both, but the first is likely lower hanging fruit. > > > > Hacking something on the side is always the lower hanging fruit as it > > avoids solving the hard problems. As Peter said already, that's not going > > to happen unless there is a real technical reason why the general path > > cannot be fixed. So far there is no proof for that. > > You didn't look at Aubrey's data? I did, but that data is no proof that it is unfixable. It's just data describing the current situation, not more not less. > There are some unavoidable slow operations in the current path -- e.g. That's the whole point: current path, IOW current implementation. This implementation is not set in stone and we rather fix it than just creating a side channel and leave everything else as is. Thanks, tglx
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Mon, 17 Jul 2017, Andi Kleen wrote: > On Tue, Jul 18, 2017 at 08:43:53AM +0200, Thomas Gleixner wrote: > > On Mon, 17 Jul 2017, Andi Kleen wrote: > > > > > > We need a tradeoff here IMHO. I'll check Daniel's work to understand > > > > how/if > > > > it's better than menu governor. > > > > > > I still would like to see how the fast path without the C1 heuristic > > > works. > > > > > > Fast pathing is a different concept from a better predictor. IMHO we need > > > both, but the first is likely lower hanging fruit. > > > > Hacking something on the side is always the lower hanging fruit as it > > avoids solving the hard problems. As Peter said already, that's not going > > to happen unless there is a real technical reason why the general path > > cannot be fixed. So far there is no proof for that. > > You didn't look at Aubrey's data? > > There are some unavoidable slow operations in the current path -- e.g. > reprograming the timer for NOHZ. But we don't need that for really > short idle periods, because as you pointed out they never get woken > up by the tick. > > Similar for other things like RCU. > > I don't see how you can avoid that other than without a fast path mechanism. You can very well avoid it by taking the irq timings or whatever other information into account for the NOHZ decision. Thanks, tglx
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Tue, Jul 18, 2017 at 08:43:53AM +0200, Thomas Gleixner wrote: > On Mon, 17 Jul 2017, Andi Kleen wrote: > > > > We need a tradeoff here IMHO. I'll check Daniel's work to understand > > > how/if > > > it's better than menu governor. > > > > I still would like to see how the fast path without the C1 heuristic works. > > > > Fast pathing is a different concept from a better predictor. IMHO we need > > both, but the first is likely lower hanging fruit. > > Hacking something on the side is always the lower hanging fruit as it > avoids solving the hard problems. As Peter said already, that's not going > to happen unless there is a real technical reason why the general path > cannot be fixed. So far there is no proof for that. You didn't look at Aubrey's data? There are some unavoidable slow operations in the current path -- e.g. reprograming the timer for NOHZ. But we don't need that for really short idle periods, because as you pointed out they never get woken up by the tick. Similar for other things like RCU. I don't see how you can avoid that other than without a fast path mechanism. Clearly these operations are eventually needed, just not all the time for short sleeps. Now in theory you could have lots of little fast paths in all the individual operations that check this individually, but I don't see how that is better than a single simple fast path. -Andi
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/18 14:43, Thomas Gleixner wrote: > On Mon, 17 Jul 2017, Andi Kleen wrote: > >>> We need a tradeoff here IMHO. I'll check Daniel's work to understand how/if >>> it's better than menu governor. >> >> I still would like to see how the fast path without the C1 heuristic works. >> >> Fast pathing is a different concept from a better predictor. IMHO we need >> both, but the first is likely lower hanging fruit. > > Hacking something on the side is always the lower hanging fruit as it > avoids solving the hard problems. As Peter said already, that's not going > to happen unless there is a real technical reason why the general path > cannot be fixed. So far there is no proof for that. > Let me try to make a summary, please correct me if I was wrong. 1) for quiet_vmstat, we are agreed to move to another place where tick is really stopped. 2) for rcu idle enter/exit, I measured the details which Paul provided, and the result matches with what I have measured before, nothing notable found. But it still makes more sense if we can make rcu idle enter/exit hooked with tick off. (it's possible other workloads behave differently) 3) for tick nohz idle, we want to skip if the coming idle is short. If we can skip the tick nohz idle, we then skip all the items depending on it. But, there are two hard points: 3.1) how to compute the period of the coming idle. My current proposal is to use two factors in the current idle menu governor. There are two possible options from Peter and Thomas, the one is to use scheduler idle estimate, which is task activity based, the other is to use the statistics generated from irq timings work. 3.2) how to determine if the idle is short or long. My current proposal is to use a tunable value via /sys, while Peter prefers an auto-adjust mechanism. 
I didn't get the details of an auto-adjust mechanism yet 4) for idle loop, my proposal introduces a simple one to use default idle routine directly, while Peter and Thomas suggest we fix c-state selection in the existing idle path. Thanks, -Aubrey
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Mon, 17 Jul 2017, Andi Kleen wrote: > > We need a tradeoff here IMHO. I'll check Daniel's work to understand how/if > > it's better than menu governor. > > I still would like to see how the fast path without the C1 heuristic works. > > Fast pathing is a different concept from a better predictor. IMHO we need > both, but the first is likely lower hanging fruit. Hacking something on the side is always the lower hanging fruit as it avoids solving the hard problems. As Peter said already, that's not going to happen unless there is a real technical reason why the general path cannot be fixed. So far there is no proof for that. Thanks, tglx
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
> We need a tradeoff here IMHO. I'll check Daniel's work to understand how/if > it's better than menu governor. I still would like to see how the fast path without the C1 heuristic works. Fast pathing is a different concept from a better predictor. IMHO we need both, but the first is likely lower hanging fruit. -Andi
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/18 3:48, Arjan van de Ven wrote: > On 7/17/2017 12:23 PM, Peter Zijlstra wrote: >> Of course, this all assumes a Gaussian distribution to begin with, if we >> get bimodal (or worse) distributions we can still get it wrong. To fix >> that, we'd need to do something better than what we currently have. >> > > fwiw some time ago I made a chart for predicted vs actual so you can sort > of judge the distribution of things visually > > http://git.fenrus.org/tmp/linux2.png > > This does not look like a Gaussian, does it? I mean abs(expected - actual). Thanks, -Aubrey
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/18 3:23, Peter Zijlstra wrote: > On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote: >>> And as said; Daniel has been working on a better predictor -- now he's >>> probably not used it on the network workload you're looking at, so that >>> might be something to consider. >> >> Deriving a better idle predictor is a bit orthogonal to fast idle. > > No. If you want a different C state selected we need to fix the current > C state selector. We're not going to tinker. > > And the predictor is probably the most fundamental part of the whole C > state selection logic. > > Now I think the problem is that the current predictor goes for an > average idle duration. This means that we, on average, get it wrong 50% > of the time. For performance that's bad. > > If you want to improve the worst case, we need to consider a cumulative > distribution function, and staying with the Gaussian assumption already > present, that would mean using: > > CDF(x) = 1/2 * [ 1 + erf( (x - mu) / (sigma * sqrt(2)) ) ] > > Where, per the normal convention mu is the average and sigma^2 the > variance. See also: > > https://en.wikipedia.org/wiki/Normal_distribution > > We then solve CDF(x) = n% to find the x for which we get it wrong n% of > the time (IIRC something like: 'mu - 2sigma' ends up being 5% or so). > > This conceptually gets us better exit latency for the cases where we got > it wrong before, and practically pushes down the estimate which gets us > C1 longer. > > Of course, this all assumes a Gaussian distribution to begin with, if we > get bimodal (or worse) distributions we can still get it wrong. To fix > that, we'd need to do something better than what we currently have. > Maybe you are talking about applying some machine learning algorithm online to fit a multivariate normal distribution, :) Well, back to the problem, when the scheduler picks up idle thread, it does not look at the history, nor make the prediction. 
So it's possible it has to switch back to a task ASAP when it's going into idle (very common under some workloads). That is, (idle_entry + idle_exit) > idle. If the system has multiple hardware idle states, then: (idle_entry + idle_exit + HW_entry + HW_exit) > HW_sleep. So we eventually want the idle path to be lighter than what we currently have. A complex predictor may have high accuracy, but its cost could be high as well. We need a tradeoff here IMHO. I'll check Daniel's work to understand how/if it's better than the menu governor. Thanks, -Aubrey
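Aubrey's inequality amounts to a simple break-even check. A minimal sketch; the function and parameter names are illustrative, not kernel symbols:

```c
/* A C-state only pays off if the predicted idle residency exceeds the
 * software plus hardware entry/exit cost; otherwise we are in Aubrey's
 * (idle_entry + idle_exit + HW_entry + HW_exit) > HW_sleep case. */
int cstate_pays_off(unsigned int predicted_idle_us,
		    unsigned int sw_entry_exit_us,
		    unsigned int hw_entry_exit_us)
{
	return predicted_idle_us > sw_entry_exit_us + hw_entry_exit_us;
}
```

This is also why a heavyweight predictor cuts both ways: its runtime cost adds to the software entry/exit term it is trying to amortize.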
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Mon, 17 Jul 2017, Arjan van de Ven wrote: > On 7/17/2017 12:46 PM, Thomas Gleixner wrote: > > That's where Daniel Lezcano's work of predicting interrupts comes in and > > that's the right solution to the problem. The core infrastructure has been > > merged, just the idle/cpufreq users are not there yet. All you need to do > > is to select CONFIG_IRQ_TIMINGS and use the statistics generated there. > > > yes ;-) :) > also note that the predictor does not need to be perfect, on most systems C > states are an order of magnitude apart in terms of > power/performance/latency so if you get the general order of magnitude > right the predictor is doing its job. So it would be interesting just to enable the irq timings stuff and compare the outcome as a first step. That should be reasonably simple to implement and would give us also some information of how that code behaves on larger systems. Thanks, tglx
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/17/2017 12:53 PM, Thomas Gleixner wrote: On Mon, 17 Jul 2017, Arjan van de Ven wrote: On 7/17/2017 12:23 PM, Peter Zijlstra wrote: Of course, this all assumes a Gaussian distribution to begin with, if we get bimodal (or worse) distributions we can still get it wrong. To fix that, we'd need to do something better than what we currently have. fwiw some time ago I made a chart for predicted vs actual so you can sort of judge the distribution of things visually Predicted by what? this chart was with the current linux predictor http://git.fenrus.org/tmp/timer.png is what you get if you JUST use the next timer ;-) (which way back linux was doing)
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Mon, 17 Jul 2017, Arjan van de Ven wrote: > On 7/17/2017 12:23 PM, Peter Zijlstra wrote: > > Of course, this all assumes a Gaussian distribution to begin with, if we > > get bimodal (or worse) distributions we can still get it wrong. To fix > > that, we'd need to do something better than what we currently have. > > > > fwiw some time ago I made a chart for predicted vs actual so you can sort > of judge the distribution of things visually Predicted by what? Thanks, tglx
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/17/2017 12:46 PM, Thomas Gleixner wrote: On Mon, 17 Jul 2017, Arjan van de Ven wrote: On 7/17/2017 12:23 PM, Peter Zijlstra wrote: Now I think the problem is that the current predictor goes for an average idle duration. This means that we, on average, get it wrong 50% of the time. For performance that's bad. that's not really what it does; it looks at next tick and then discounts that based on history; (with different discounts for different order of magnitude) next tick is the worst thing to look at for interrupt heavy workloads as well it was better than what was there before (without discount and without detecting repeated patterns) the next tick (as computed by the nohz code) can be far away, while the I/O interrupts come in at a high frequency. That's where Daniel Lezcano's work of predicting interrupts comes in and that's the right solution to the problem. The core infrastructure has been merged, just the idle/cpufreq users are not there yet. All you need to do is to select CONFIG_IRQ_TIMINGS and use the statistics generated there. yes ;-) also note that the predictor does not need to be perfect, on most systems C states are an order of magnitude apart in terms of power/performance/latency so if you get the general order of magnitude right the predictor is doing its job. (this is not universally true, but physics of power gating/etc tend to drive to this conclusion; the cost of implementing an extra state very close to another state means that the HW folks are unlikely to do the less power saving state of the two to save their cost and testing effort)
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/17/2017 12:23 PM, Peter Zijlstra wrote: Of course, this all assumes a Gaussian distribution to begin with, if we get bimodal (or worse) distributions we can still get it wrong. To fix that, we'd need to do something better than what we currently have. fwiw some time ago I made a chart for predicted vs actual so you can sort of judge the distribution of things visually http://git.fenrus.org/tmp/linux2.png
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Mon, 17 Jul 2017, Arjan van de Ven wrote: > On 7/17/2017 12:23 PM, Peter Zijlstra wrote: > > Now I think the problem is that the current predictor goes for an > > average idle duration. This means that we, on average, get it wrong 50% > > of the time. For performance that's bad. > > that's not really what it does; it looks at next tick > and then discounts that based on history; > (with different discounts for different order of magnitude) next tick is the worst thing to look at for interrupt-heavy workloads as the next tick (as computed by the nohz code) can be far away, while the I/O interrupts come in at a high frequency. That's where Daniel Lezcano's work of predicting interrupts comes in and that's the right solution to the problem. The core infrastructure has been merged, just the idle/cpufreq users are not there yet. All you need to do is to select CONFIG_IRQ_TIMINGS and use the statistics generated there. Thanks, tglx
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 7/17/2017 12:23 PM, Peter Zijlstra wrote: Now I think the problem is that the current predictor goes for an average idle duration. This means that we, on average, get it wrong 50% of the time. For performance that's bad. that's not really what it does; it looks at next tick and then discounts that based on history; (with different discounts for different order of magnitude)
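A toy version of the discounting Arjan describes; this is not the actual menu-governor code, and the single percentage correction factor is a simplification of the per-magnitude discounts he mentions:

```c
/* Toy predictor: start from the time until the next tick and scale it
 * by a correction factor learned from how past predictions compared
 * with the idle durations actually observed. */
unsigned int predict_idle_us(unsigned int next_tick_us,
			     unsigned int correction_pct)
{
	/* widen before multiplying to avoid 32-bit overflow */
	return (unsigned int)((unsigned long long)next_tick_us *
			      correction_pct / 100);
}
```

A real implementation keeps separate correction factors per order of magnitude of the next-tick estimate, which is what lets it discount short interrupt-driven idles differently from long quiescent ones.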
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Fri, Jul 14, 2017 at 09:26:19AM -0700, Andi Kleen wrote: > > And as said; Daniel has been working on a better predictor -- now he's > > probably not used it on the network workload you're looking at, so that > > might be something to consider. > > Deriving a better idle predictor is a bit orthogonal to fast idle. No. If you want a different C state selected we need to fix the current C state selector. We're not going to tinker. And the predictor is probably the most fundamental part of the whole C state selection logic. Now I think the problem is that the current predictor goes for an average idle duration. This means that we, on average, get it wrong 50% of the time. For performance that's bad. If you want to improve the worst case, we need to consider a cumulative distribution function, and staying with the Gaussian assumption already present, that would mean using: CDF(x) = 1/2 * [ 1 + erf( (x - mu) / (sigma * sqrt(2)) ) ] Where, per the normal convention mu is the average and sigma^2 the variance. See also: https://en.wikipedia.org/wiki/Normal_distribution We then solve CDF(x) = n% to find the x for which we get it wrong n% of the time (IIRC something like: 'mu - 2sigma' ends up being 5% or so). This conceptually gets us better exit latency for the cases where we got it wrong before, and practically pushes down the estimate which gets us C1 longer. Of course, this all assumes a Gaussian distribution to begin with, if we get bimodal (or worse) distributions we can still get it wrong. To fix that, we'd need to do something better than what we currently have.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/17 21:58, Peter Zijlstra wrote: > On Mon, Jul 17, 2017 at 09:24:51PM +0800, Li, Aubrey wrote: >> On 2017/7/14 12:05, Paul E. McKenney wrote: >>> >>> More specifically: rcu_needs_cpu(), rcu_prepare_for_idle(), >>> rcu_cleanup_after_idle(), rcu_eqs_enter(), rcu_eqs_enter_common(), >>> rcu_dynticks_eqs_enter(), do_nocb_deferred_wakeup(), >>> rcu_dynticks_task_enter(), rcu_eqs_exit(), rcu_eqs_exit_common(), >>> rcu_dynticks_task_exit(), rcu_dynticks_eqs_exit(). >>> >>> The first three (rcu_needs_cpu(), rcu_prepare_for_idle(), and >>> rcu_cleanup_after_idle()) should not be significant unless you have >>> CONFIG_RCU_FAST_NO_HZ=y. If you do, it would be interesting to learn >>> how often invoke_rcu_core() is invoked from rcu_prepare_for_idle() >>> and rcu_cleanup_after_idle(), as this can raise softirq. Also >>> rcu_accelerate_cbs() and rcu_try_advance_all_cbs(). >>> >>> Knowing which of these is causing the most trouble might help me >>> reduce the overhead in the current idle path. >> >> I measured two cases, nothing notable found. > > So skipping rcu_idle_{enter,exit}() is not in fact needed at all? > I think it would make more sense if we still put them under the case where tick is really stopped. Thanks, -Aubrey
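Aubrey's suggestion can be sketched as follows. The globals and the hook bodies below are user-space stand-ins for the kernel interfaces (tick_stopped would really be tick_nohz_tick_stopped()), purely for illustration:

```c
/* stand-ins for kernel state and hooks, for illustration only */
int tick_stopped;
int rcu_idle_enters, rcu_idle_depth;

void rcu_idle_enter(void) { rcu_idle_enters++; rcu_idle_depth++; }
void rcu_idle_exit(void)  { rcu_idle_depth--; }

/* only do the RCU idle bookkeeping when the tick is really stopped;
 * short idles with the tick still running skip it entirely */
void cpuidle_idle_call(void)
{
	if (tick_stopped)
		rcu_idle_enter();
	/* ... the selected C-state would be entered here ... */
	if (tick_stopped)
		rcu_idle_exit();
}
```

With the tick still running, a mispredicted long idle is bounded by the next tick, which re-enters the idle loop and re-evaluates the condition.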
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Mon, Jul 17, 2017 at 09:24:51PM +0800, Li, Aubrey wrote: > On 2017/7/14 12:05, Paul E. McKenney wrote: > > > > More specifically: rcu_needs_cpu(), rcu_prepare_for_idle(), > > rcu_cleanup_after_idle(), rcu_eqs_enter(), rcu_eqs_enter_common(), > > rcu_dynticks_eqs_enter(), do_nocb_deferred_wakeup(), > > rcu_dynticks_task_enter(), rcu_eqs_exit(), rcu_eqs_exit_common(), > > rcu_dynticks_task_exit(), rcu_dynticks_eqs_exit(). > > > > The first three (rcu_needs_cpu(), rcu_prepare_for_idle(), and > > rcu_cleanup_after_idle()) should not be significant unless you have > > CONFIG_RCU_FAST_NO_HZ=y. If you do, it would be interesting to learn > > how often invoke_rcu_core() is invoked from rcu_prepare_for_idle() > > and rcu_cleanup_after_idle(), as this can raise softirq. Also > > rcu_accelerate_cbs() and rcu_try_advance_all_cbs(). > > > > Knowing which of these is causing the most trouble might help me > > reduce the overhead in the current idle path. > > I measured two cases, nothing notable found. So skipping rcu_idle_{enter,exit}() is not in fact needed at all?
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/17 17:21, Peter Zijlstra wrote: > On Fri, Jul 14, 2017 at 09:03:14AM -0700, Andi Kleen wrote: >> fast idle doesn't have an upper bound. >> >> If the prediction exceeds the fast idle threshold any C state can be used. >> >> It's just another state (fast C1), but right now it has an own threshold >> which may be different from standard C1. > > Given it uses the same estimate we end up with: > > > Now, unless you're mister Turnbull, C2 will never get selected when > fast_threshold > C2_threshold. > > Which is wrong. If you want to effectively scale the selection of C1, > why not also change the C2 and further criteria. > That's not our intention, I think. As long as the predicted coming idle period > threshold, we'll enter the normal idle path, where any supported C-state can be selected, and the tick can also be turned off for power saving. Any deferrable stuff can be invoked as well because we'll sleep long enough. Thanks, -Aubrey
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On 2017/7/14 12:05, Paul E. McKenney wrote: > > More specifically: rcu_needs_cpu(), rcu_prepare_for_idle(), > rcu_cleanup_after_idle(), rcu_eqs_enter(), rcu_eqs_enter_common(), > rcu_dynticks_eqs_enter(), do_nocb_deferred_wakeup(), > rcu_dynticks_task_enter(), rcu_eqs_exit(), rcu_eqs_exit_common(), > rcu_dynticks_task_exit(), rcu_dynticks_eqs_exit(). > > The first three (rcu_needs_cpu(), rcu_prepare_for_idle(), and > rcu_cleanup_after_idle()) should not be significant unless you have > CONFIG_RCU_FAST_NO_HZ=y. If you do, it would be interesting to learn > how often invoke_rcu_core() is invoked from rcu_prepare_for_idle() > and rcu_cleanup_after_idle(), as this can raise softirq. Also > rcu_accelerate_cbs() and rcu_try_advance_all_cbs(). > > Knowing which of these is causing the most trouble might help me > reduce the overhead in the current idle path. I measured two cases, nothing notable found. The one is CONFIG_NO_HZ_IDLE=y, so the following functions are just empty:
rcu_prepare_for_idle(): NULL
rcu_cleanup_after_idle(): NULL
do_nocb_deferred_wakeup(): NULL
rcu_dynticks_task_enter(): NULL
rcu_dynticks_task_exit(): NULL
And the following functions are traced separately; for each function I traced 3 times by intel_PT, and for each run the sampling period is 1 second. num is the number of times the function is invoked in 1 second; (min, max, avg) are the function durations in nanoseconds. 
rcu_needs_cpu():
  1) num: 6110   min: 3   max: 564   avg: 17.0
  2) num: 16535  min: 3   max: 683   avg: 18.0
  3) num: 8815   min: 3   max: 394   avg: 20.0

rcu_eqs_enter():
  1) num: 7956   min: 17  max: 656   avg: 32.0
  2) num: 9170   min: 17  max: 1075  avg: 35.0
  3) num: 8604   min: 17  max: 859   avg: 29.0

rcu_eqs_enter_common():
  1) num: 14676  min: 15  max: 620   avg: 28.0
  2) num: 11180  min: 15  max: 795   avg: 30.0
  3) num: 11484  min: 15  max: 725   avg: 29.0

rcu_dynticks_eqs_enter():
  1) num: 11035  min: 10  max: 580   avg: 17.0
  2) num: 15518  min: 10  max: 456   avg: 16.0
  3) num: 15320  min: 10  max: 454   avg: 19.0

rcu_eqs_exit():
  1) num: 11080  min: 14  max: 893   avg: 23.0
  2) num: 13526  min: 14  max: 640   avg: 23.0
  3) num: 12534  min: 14  max: 630   avg: 22.0

rcu_eqs_exit_common():
  1) num: 18002  min: 12  max: 553   avg: 17.0
  2) num: 10570  min: 11  max: 485   avg: 17.0
  3) num: 13628  min: 11  max: 567   avg: 16.0

rcu_dynticks_eqs_exit():
  1) num: 11195  min: 11  max: 436   avg: 16.0
  2) num: 11808  min: 10  max: 506   avg: 16.0
  3) num: 8132   min: 10  max: 546   avg: 15.0

==

The second case is CONFIG_NO_HZ_FULL; I also enabled the required config
options so that none of the functions are empty.
rcu_needs_cpu():
  1) num: 8530   min: 5   max: 770   avg: 13.0
  2) num: 9965   min: 5   max: 518   avg: 12.0
  3) num: 12503  min: 5   max: 755   avg: 16.0

rcu_prepare_for_idle():
  1) num: 11662  min: 5   max: 684   avg: 9.0
  2) num: 15294  min: 5   max: 676   avg: 9.0
  3) num: 14332  min: 5   max: 524   avg: 9.0

rcu_cleanup_after_idle():
  1) num: 13584  min: 4   max: 657   avg: 6.0
  2) num: 9102   min: 4   max: 529   avg: 5.0
  3) num: 10648  min: 4   max: 471   avg: 6.0

rcu_eqs_enter():
  1) num: 14222  min: 26  max: 745   avg: 54.0
  2) num: 12502  min: 26  max: 650   avg: 53.0
  3) num: 11834  min: 26  max: 863   avg: 52.0

rcu_eqs_enter_common():
  1) num: 16792  min: 24  max: 973   avg: 43.0
  2) num: 19755  min: 24  max: 898   avg: 45.0
  3) num: 8167   min: 24  max: 722   avg: 42.0

rcu_dynticks_eqs_enter():
  1) num: 11605  min: 10  max: 532   avg: 14.0
  2) num: 10438  min: 9   max: 554   avg: 14.0
  3) num: 19816  min: 9   max: 701   avg: 14.0

do_nocb_deferred_wakeup():
  1) num: 15348  min: 1   max: 459   avg: 3.0
  2) num: 12822  min: 1   max: 564   avg: 4.0
  3) num: 8272   min: 0   max: 458   avg: 3.0

rcu_dynticks_task_enter():
  1) num: 6358   min: 1   max: 268   avg: 1.0
  2) num: 11128  min: 1   max: 360   avg: 1.0
  3) num: 20516  min: 1   max: 214   avg: 1.0

rcu_eqs_exit():
  1) num: 16117  min: 20  max: 782   avg: 43.0
  2) num: 11042  min: 20  max: 775   avg: 47.0
  3) num: 16499  min: 20  max: 752   avg: 46.0

rcu_eqs_exit_common():
  1) num: 12584  min: 17  max: 703   avg: 28.0
  2) num: 17412  min: 17  max: 759   avg: 28.0
  3) num: 16733  min: 17  max: 798   avg: 29.0

rcu_dynticks_task_exit():
  1) num: 11730  min: 1   max: 528   avg: 4.0
  2) num: 18840  min: 1   max: 581   avg: 5.0
  3) num: 9815   min: 1   max: 381   avg: 4.0

rcu_dynticks_eqs_exit():
  1) num: 10902  min: 9   max: 557   avg: 13.0
  2) num: 19474  min: 9   max: 563   avg: 13.0
  3) num: 11865  min: 9   max: 672   avg: 12.0

Please let me know if any of the data looks unreasonable; I can revisit.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Fri, Jul 14, 2017 at 09:03:14AM -0700, Andi Kleen wrote:
> fast idle doesn't have an upper bound.
>
> If the prediction exceeds the fast idle threshold any C state can be used.
>
> It's just another state (fast C1), but right now it has an own threshold
> which may be different from standard C1.

Given it uses the same estimate we end up with:

	select_c_state(idle_est)
	{
		if (idle_est < fast_threshold)
			return C1;
		if (idle_est < C1_threshold)
			return C1;
		if (idle_est < C2_threshold)
			return C2;
		/* ... */
		return C6;
	}

Now, unless you're mister Turnbull, C2 will never get selected when
fast_threshold > C2_threshold.

Which is wrong. If you want to effectively scale the selection of C1,
why not also change the C2 and further criteria.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
> And as said; Daniel has been working on a better predictor -- now he's
> probably not used it on the network workload you're looking at, so that
> might be something to consider.

Deriving a better idle predictor is somewhat orthogonal to fast idle.

It would be a good idea to set the fast idle threshold to the same value
intel_idle would use on that platform and rerun the benchmarks. Then the
Cx pattern should be mostly identical, and fast idle can be evaluated on
its own merits.

-Andi
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Fri, Jul 14, 2017 at 08:53:56AM -0700, Andi Kleen wrote:
> > > -	if (cpuidle_not_available(drv, dev)) {
> > > +	if (cpuidle_not_available(drv, dev) || this_is_a_fast_idle) {
> > > 		default_idle_call();
> > > 		goto exit_idle;
> > > 	}
> >
> > No, that's wrong. We want to fix the normal C state selection process to
> > pick the right C state.
> >
> > The fast-idle criteria could cut off a whole bunch of available C
> > states. We need to understand why our current C state pick is wrong and
> > amend the algorithm to do better. Not just bolt something on the side.
>
> Fast idle uses the same predictor as the current C state governor.
>
> The only difference is that it uses a different threshold for C1.
> Likely that's the cause. If it was using the same threshold the
> decision would be the same.

Right, so it's selecting C1 for longer. That in turn means we could now
never select C2, because the fast-idle threshold is longer than our C2
time. Which I feel is wrong; because if we're getting C1 wrong, what says
we're then getting the rest right?

> The thresholds are coming either from the tables in intel idle,
> or from ACPI (let's assume the first)
>
> That means either: the intel idle C1 threshold on the system Aubrey
> tested on is too high, or the fast idle threshold is too low.

Or our predictor is doing it wrong. It could be it's over-estimating the
idle duration. For example, suppose we have an idle distribution of:

  40% < C1
  60% > C2

and we end up selecting C2, even though in many of our sleeps we really
wanted C1.

And as said, Daniel has been working on a better predictor -- now he's
probably not used it on the network workload you're looking at, so that
might be something to consider.
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Fri, Jul 14, 2017 at 05:58:53PM +0200, Peter Zijlstra wrote:
> On Fri, Jul 14, 2017 at 08:52:28AM -0700, Arjan van de Ven wrote:
> > On 7/14/2017 8:38 AM, Peter Zijlstra wrote:
> > > No, that's wrong. We want to fix the normal C state selection process to
> > > pick the right C state.
> > >
> > > The fast-idle criteria could cut off a whole bunch of available C
> > > states. We need to understand why our current C state pick is wrong and
> > > amend the algorithm to do better. Not just bolt something on the side.
> >
> > I can see a fast path through selection if you know the upper bound of any
> > selection is just 1 state.

Fast idle doesn't have an upper bound. If the prediction exceeds the
fast idle threshold, any C state can be used.

It's just another state (fast C1), but right now it has its own
threshold, which may be different from standard C1.

> > But also, how much of this is about "C1 be fast" versus "selecting C1 is
> > slow"
>
> I got the impression its about we need to select C1 for longer. But the
> fact that the patches don't in fact answer any of these questions,
> they're wrong in principle ;-)

For that workload. Tuning idle thresholds is a complex trade-off.

-Andi
Re: [RFC PATCH v1 00/11] Create fast idle path for short idle periods
On Fri, Jul 14, 2017 at 08:52:28AM -0700, Arjan van de Ven wrote:
> On 7/14/2017 8:38 AM, Peter Zijlstra wrote:
> > No, that's wrong. We want to fix the normal C state selection process to
> > pick the right C state.
> >
> > The fast-idle criteria could cut off a whole bunch of available C
> > states. We need to understand why our current C state pick is wrong and
> > amend the algorithm to do better. Not just bolt something on the side.
>
> I can see a fast path through selection if you know the upper bound of any
> selection is just 1 state.
>
> But also, how much of this is about "C1 be fast" versus "selecting C1 is slow"

I got the impression it's about needing to select C1 for longer. But
given that the patches don't in fact answer any of these questions,
they're wrong in principle ;-)