Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 08/22/2012 05:10 PM, Ingo Molnar wrote:
>
> * Matthew Garrett wrote:
>
>> [...]
>>
>> Our power consumption is worse than under other operating
>> systems is almost entirely because only one of our three GPU
>> drivers implements any kind of useful power management. [...]
>
> ... and because our CPU frequency and C state selection logic is
> making pretty much the worst possible decisions (on x86 at
> least).
>
> Regardless, you cannot possibly seriously suggest that because
> there's even greater suckage elsewhere for some workloads we
> should not even bother with improving the situation here.
>
> Anyway, I agree with Alan that actual numbers matter.

Sure. We'd better turn the ideas into code, and then let benchmarks and data speak.

> Thanks,
>
> Ingo

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/22/2012 6:21 AM, Matthew Garrett wrote:
> On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
>> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
>>> For my dinky dual core laptop, I suspect you're right, but for a more
>>> powerful laptop, I'd expect spread/don't to be noticeable.
>>
>> yeah if you don't spread, you will waste some power.
>> but.. current linux behavior is to spread.
>> so we can only make it worse.
>
> Right. For a single socket system the only thing you can do is use two
> threads in preference to using two cores. That'll keep an extra core in
> a deep C state for longer, at the cost of keeping the package out of a
> deep C state for longer. There might be a win if the two processes
> benefit from improved L1 cache locality, or if you're talking about

basically "if HT sharing would be good for performance" ;-)

(btw this is good news, it means this is not an actual power/performance tradeoff,
but a "get it right" tradeoff)
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > For my dinky dual core laptop, I suspect you're right, but for a more
> > powerful laptop, I'd expect spread/don't to be noticeable.
>
> yeah if you don't spread, you will waste some power.
> but.. current linux behavior is to spread.
> so we can only make it worse.

Right. For a single socket system the only thing you can do is use two threads in preference to using two cores. That'll keep an extra core in a deep C state for longer, at the cost of keeping the package out of a deep C state for longer. There might be a win if the two processes benefit from improved L1 cache locality, or if you're talking about short periodic work, but for the majority of cases I'd expect Arjan to be completely correct here. Things get more interesting with multi-socket systems, but that's beyond the laptop use case.

--
Matthew Garrett | mj...@srcf.ucam.org
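[Editorial sketch] The "two threads versus two cores" choice can be illustrated with a small Python example. It assumes the sibling information comes from the sysfs files `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the `TOPOLOGY` table below is a made-up four-thread, two-core layout, and the helper names are hypothetical. It only shows which logical CPUs each placement would pick, not what the scheduler actually does.

```python
from typing import Dict, List, Optional, Tuple


def parse_cpu_list(text: str) -> List[int]:
    """Parse a sysfs CPU list such as "0,2" or "0-1" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pick_packed_pair(siblings: Dict[int, str]) -> Optional[Tuple[int, int]]:
    """Two logical CPUs on the same physical core (HT siblings), if any."""
    for cpu in sorted(siblings):
        threads = parse_cpu_list(siblings[cpu])
        if len(threads) >= 2:
            return threads[0], threads[1]
    return None


def pick_spread_pair(siblings: Dict[int, str]) -> Optional[Tuple[int, int]]:
    """Two logical CPUs on different physical cores, if any."""
    cores = sorted({tuple(parse_cpu_list(value)) for value in siblings.values()})
    if len(cores) >= 2:
        return cores[0][0], cores[1][0]
    return None


# Hypothetical layout: cpu0/cpu2 share one core, cpu1/cpu3 share the other.
TOPOLOGY = {0: "0,2", 1: "1,3", 2: "0,2", 3: "1,3"}

if __name__ == "__main__":
    print("packed (one core busy):", pick_packed_pair(TOPOLOGY))
    print("spread (two cores busy):", pick_spread_pair(TOPOLOGY))
```

Packing leaves the second core free to stay in a deep C state, while spreading gives each task a core to itself; which one finishes sooner and lets the whole package idle is exactly what the thread says needs measuring.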
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Wed, 2012-08-22 at 06:02 -0700, Arjan van de Ven wrote:
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
> >
> >> I'd like to see actual numbers and evidence on a wide range of workloads
> >> that the spread/don't spread thing is even measurable given that you've also
> >> got to factor in effects like completing faster and turning everything
> >> off. I'd *really* like to see such evidence on a laptop, which is your
> >> one cited case it might work.
> >
> > For my dinky dual core laptop, I suspect you're right, but for a more
> > powerful laptop, I'd expect spread/don't to be noticeable.
>
> yeah if you don't spread, you will waste some power.
> but.. current linux behavior is to spread.
> so we can only make it worse.

Hm, so I can stop fretting about select_idle_sibling(). Good.

> > Yeah, hard numbers would be nice to see.
> >
> > If I had a powerful laptop, I'd kill irq balancing, and all but periodic
> > load balancing, and expect to see a positive result.
>
> I'd expect to see a negative result ;-)

Ok, so I have my head on backward. Gives a different perspective :)

-Mike
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
>
>> I'd like to see actual numbers and evidence on a wide range of workloads
>> that the spread/don't spread thing is even measurable given that you've also
>> got to factor in effects like completing faster and turning everything
>> off. I'd *really* like to see such evidence on a laptop, which is your
>> one cited case it might work.
>
> For my dinky dual core laptop, I suspect you're right, but for a more
> powerful laptop, I'd expect spread/don't to be noticeable.

yeah if you don't spread, you will waste some power.
but.. current linux behavior is to spread.
so we can only make it worse.

> Yeah, hard numbers would be nice to see.
>
> If I had a powerful laptop, I'd kill irq balancing, and all but periodic
> load balancing, and expect to see a positive result.

I'd expect to see a negative result ;-)
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
> It can be more than an irrelevance if the CPU is saturated - say
> a game running on a mobile device very commonly saturates the
> CPU. A third of the energy is spent in the CPU, sometimes more.

If the CPU is saturated you have already lost. What are you going to do - the CPU is saturated - slow it down? Then it'll use more power.

> > You *can't* fix PM in one place. [...]
>
> Preferably one project, not one place - but at least don't go
> down the false path of:
>
>   " Policy always belongs in user-space so the kernel can
>     continue to do a shitty job even for pieces it could
>     understand better ..."
>
> My opinion is that it depends, and I also think that we are so
> bad currently (on x86) that we can do little harm by trying to
> do things better.

All the evidence I've seen says we are doing the kernel side stuff right.

> > [...] Power management is a top to bottom thing. It starts in
> > the hardware and propagates right to the top of the user space
> > stack.
>
> Partly because it's misdesigned: in practice there's very little
> true user policy about power saving:

It's not about policy, it's about code behaviour. You have to fix every single piece of code.

> - On mobile devices I almost never tweak policy as a user -
>   sometimes I override screen brightness but that's all (and
>   it's trivial compared to all the many other things that go
>   on).

Put a single badly broken app on an Android device and your battery life will plough. That's despite Android having some highly active management policies to minimise the effect.

It works out of the box because someone spent a huge amount of time with a power meter and monitoring tools beating up whoever was top of the wakeup lists.

> it should all work. There aren't millions of people out there
> wanting to tweak the heck out of PM.

Don't confuse policy managed by userspace with buttons for users to tweak. Userspace understands things like "would it be better to drop video quality or burn more power" and has access to info the kernel can't even begin to evaluate.

> People prefer no knobs at all - they want good defaults and they
> want at most a single, intuitive, actionable control to override
> the automation in 1% of the use cases, such as screen brightness.

That's a different discussion.

> > A single stupid behaviour in a desktop app is all it needs to
> > knock the odd hour or two off your battery life. Something as
> > mundane as refreshing a bit of the display all the time
> > keeping the GPU and CPU from sleeping well.
>
> Even with highly powertop-optimized systems that have no such
> app and have very low wakeup rates we still lag behind the
> competition.

Actually we don't. Well, not if your distro is put together properly, and has the relevant SATA patches and the like merged. Stock Fedora may be pants but if so that's a distro problem.

> So why not move most pieces into one well-informed code domain
> (the kernel) and only expose high level controls, instead of
> expecting user-space to get it all right.

Because the kernel doesn't have the information needed. You'd have to add megabytes of code to the kernel - including things like video playback engines.

> Then the 'only' job of user-space would be to not be silly when
> implementing their functionality. (and there's nothing
> intimately PM about that.)

That sounds like ignorance.

> Kernel design decisions *matter*:

Of course they do, but it's a tiny part of the story. The power management function mathematically has a large number of important inputs for which the kernel cannot deduce the values without massive layering violations.

Also, inconveniently for your worldview but as demonstrated in every case and by everyone who has dug into it, you also have to fix all the wakeup sources on each level. That's the reality. From the moment you wake for an event that was not strictly needed you are essentially attempting to mitigate a failure, not trying to deal with the actual problem.

> Look for example how moving X lowlevel drivers from user-space
> into kernel-space enabled GPU level power management to begin
> with. With the old X method it was essentially impossible. Now
> it's at least possible.

Actually it was perfectly possible before for what the cards of the time could do. The kernel GPU stuff is for DMA and IRQ handling. It happens to be a good place to do PM.

> Or look at how Android adding a high-level interface like
> suspend blockers materially improved the power saving situation
> for them.

Blockers are not policy. The blocking *policy* is managed elsewhere. They are a tool for freezing stuff that is being rude.

> This learned helplessness that "the kernel can do nothing about
> PM" is somewhat annoying :-)

Sorry, was that a different thread I didn't read?

The inability to learn from both the past and basic systems theory is what I find rather more irritating. Plus your mistaken belief that we are worse than the other
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Wed, Aug 22, 2012 at 11:10:13AM +0200, Ingo Molnar wrote:
>
> * Matthew Garrett wrote:
>
> > [...]
> >
> > Our power consumption is worse than under other operating
> > systems is almost entirely because only one of our three GPU
> > drivers implements any kind of useful power management. [...]
>
> ... and because our CPU frequency and C state selection logic is
> making pretty much the worst possible decisions (on x86 at
> least).

You have figures showing that our C state residence is worse than, say, Windows? Because my own testing says that we're way better at that. Could we be better? Sure. Is it why we're worse? No.

> Regardless, you cannot possibly seriously suggest that because
> there's even greater suckage elsewhere for some workloads we
> should not even bother with improving the situation here.

I'm enthusiastic about improving the scheduler's behaviour. I'm unenthusiastic about putting in automatic hacks related to AC state.

--
Matthew Garrett | mj...@srcf.ucam.org
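[Editorial sketch] "Figures" for C state residence can be gathered by sampling cumulative per-state idle times twice and dividing the difference by the sampling interval. The Python sketch below assumes the counters are in microseconds, as in the cpuidle sysfs files `/sys/devices/system/cpu/cpuN/cpuidle/stateM/time`; the `BEFORE` and `AFTER` snapshots are invented numbers for illustration, not measurements, and the function names are hypothetical.

```python
from typing import Dict


def residency_share(before: Dict[str, int], after: Dict[str, int],
                    interval_us: int) -> Dict[str, float]:
    """Fraction of the interval spent in each idle state.

    `before` and `after` map a state name to its cumulative idle time in
    microseconds, sampled `interval_us` microseconds apart.
    """
    return {
        state: (after[state] - before.get(state, 0)) / interval_us
        for state in after
    }


def busy_share(shares: Dict[str, float]) -> float:
    """Whatever is not idle residency is time the CPU spent running (C0)."""
    return 1.0 - sum(shares.values())


# Invented snapshots taken 5 seconds apart on one CPU.
BEFORE = {"C1": 1_000_000, "C3": 5_000_000, "C6": 20_000_000}
AFTER = {"C1": 1_200_000, "C3": 5_300_000, "C6": 24_000_000}

if __name__ == "__main__":
    shares = residency_share(BEFORE, AFTER, interval_us=5_000_000)
    for state, share in shares.items():
        print(f"{state}: {share:.1%}")
    print(f"busy: {busy_share(shares):.1%}")
```

Comparing such shares between two systems running the same workload is one way to back up or refute a claim about deep C state residence.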
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Alan Cox wrote:

> > With deep enough C states it's rather relevant whether we
> > continue to burn +50W for a couple more milliseconds or
> > not, and whether we have the right information from the
> > scheduler and timer subsystem about how long the next idle
> > period is expected to be and how bursty a given task is.
>
> 50W for 2ms here and there is an irrelevance compared with
> burning a continual half a watt due to the upstream tree lacking
> some of the SATA power patches for example.

It can be more than an irrelevance if the CPU is saturated - say a game running on a mobile device very commonly saturates the CPU. A third of the energy is spent in the CPU, sometimes more.

> It's the classic "standby mode" problem - energy efficiency
> has time as a factor and there are a lot of milliseconds in 5
> hours. That means anything continually on rapidly dominates
> the problem space.
>
> > > PM means fixing the stack top to bottom, and it's a whackamole
> > > game, each one you fix you find the next. You have to sort the
> > > entire stack from desktop apps to kernel.
> >
> > Moving 'policy' into user-space has been an utter failure,
> > mostly because there's not a single project/subsystem
> > responsible for getting a good result to users. This is why
> > I resist the "policy should not be in the kernel" meme here.
>
> You *can't* fix PM in one place. [...]

Preferably one project, not one place - but at least don't go down the false path of:

  " Policy always belongs in user-space so the kernel can
    continue to do a shitty job even for pieces it could
    understand better ..."

My opinion is that it depends, and I also think that we are so bad currently (on x86) that we can do little harm by trying to do things better.

> [...] Power management is a top to bottom thing. It starts in
> the hardware and propagates right to the top of the user space
> stack.

Partly because it's misdesigned: in practice there's very little true user policy about power saving:

 - On mobile devices I almost never tweak policy as a user - sometimes I override screen brightness but that's all (and it's trivial compared to all the many other things that go on).

 - On a laptop I'd love to never have to tweak it either - running fast when on AC and running efficient when on battery is a perfectly fine life-time default for me.

90% of the "policy" comes with the *form factor* - i.e. it's something the hardware and thus the kernel could intimately know about. Yes, there are exceptions and there are servers.

The mobile device user mostly *only cares about battery life*, for a given amount of real utility provided by the device.

The "user policy" fetish here is a serious misunderstanding of how it should all work. There aren't millions of people out there wanting to tweak the heck out of PM.

People prefer no knobs at all - they want good defaults and they want at most a single, intuitive, actionable control to override the automation in 1% of the use cases, such as screen brightness.

> A single stupid behaviour in a desktop app is all it needs to
> knock the odd hour or two off your battery life. Something as
> mundane as refreshing a bit of the display all the time
> keeping the GPU and CPU from sleeping well.

Even with highly powertop-optimized systems that have no such app and have very low wakeup rates we still lag behind the competition.

> Most distros haven't managed to do power management properly
> because it is this entire integration problem. Every single
> piece of the puzzle has to be in place before you get any
> serious gain.

Most certainly. So why not move most pieces into one well-informed code domain (the kernel) and only expose high level controls, instead of expecting user-space to get it all right.

Then the 'only' job of user-space would be to not be silly when implementing their functionality. (and there's nothing intimately PM about that.)

> It's not a kernel v user thing. The kernel can't fix it,
> random bits of userspace can't fix it. This is effectively a
> "product level" integration problem.

Of course the kernel can fix many parts by offering automation like automatically shutting down unused interfaces (and offering better ABIs if that is not possible due to some poor historic choice), choosing frequencies and C states wisely, etc.

Kernel design decisions *matter*:

Look for example how moving X lowlevel drivers from user-space into kernel-space enabled GPU level power management to begin with. With the old X method it was essentially impossible. Now it's at least possible.

Or look at how Android adding a high-level interface like suspend blockers materially improved the power saving situation for them.

This learned helplessness that "the kernel can do nothing about PM" is somewhat annoying :-)

Thanks,

	Ingo
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
> With deep enough C states it's rather relevant whether we
> continue to burn +50W for a couple more milliseconds or not,
> and whether we have the right information from the scheduler and
> timer subsystem about how long the next idle period is expected
> to be and how bursty a given task is.

50W for 2ms here and there is an irrelevance compared with burning a continual half a watt due to the upstream tree lacking some of the SATA power patches for example.

It's the classic "standby mode" problem - energy efficiency has time as a factor and there are a lot of milliseconds in 5 hours. That means anything continually on rapidly dominates the problem space.

> > PM means fixing the stack top to bottom, and it's a whackamole
> > game, each one you fix you find the next. You have to sort the
> > entire stack from desktop apps to kernel.
>
> Moving 'policy' into user-space has been an utter failure,
> mostly because there's not a single project/subsystem
> responsible for getting a good result to users. This is why
> I resist the "policy should not be in the kernel" meme here.

You *can't* fix PM in one place. Power management is a top to bottom thing. It starts in the hardware and propagates right to the top of the user space stack.

A single stupid behaviour in a desktop app is all it needs to knock the odd hour or two off your battery life. Something as mundane as refreshing a bit of the display all the time keeping the GPU and CPU from sleeping well.

Most distros haven't managed to do power management properly because it is this entire integration problem. Every single piece of the puzzle has to be in place before you get any serious gain.

It's not a kernel v user thing. The kernel can't fix it, random bits of userspace can't fix it. This is effectively a "product level" integration problem.

Alan
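[Editorial sketch] The arithmetic behind the "standby mode" argument can be spelled out using only the figures quoted in the message (50 W, 2 ms, half a watt, 5 hours). The Python snippet below multiplies power by time for each case; the names are illustrative, and it says nothing about how often such 2 ms bursts actually occur on a real machine.

```python
def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy is power multiplied by time (1 W for 1 s is 1 J)."""
    return power_watts * seconds


FIVE_HOURS_S = 5 * 3600  # 18,000 seconds

# One extra 2 ms spent at +50 W: roughly 0.1 J.
BURST_J = energy_joules(50.0, 0.002)

# A continual 0.5 W drain over 5 hours: 9,000 J.
CONTINUOUS_J = energy_joules(0.5, FIVE_HOURS_S)

# Number of 2 ms bursts needed to waste as much as the continual drain,
# and how frequently they would have to happen over the same 5 hours.
BURSTS_TO_MATCH = CONTINUOUS_J / BURST_J
BURSTS_PER_SECOND = BURSTS_TO_MATCH / FIVE_HOURS_S

if __name__ == "__main__":
    print(f"one burst:        {BURST_J:.3f} J")
    print(f"continual drain:  {CONTINUOUS_J:.0f} J")
    print(f"bursts to match:  {BURSTS_TO_MATCH:.0f}")
    print(f"bursts per second: {BURSTS_PER_SECOND:.1f}")
```

On these numbers the half-watt drain costs as much as about 90,000 such bursts, or roughly five every second for the full five hours, which is the sense in which anything continually on dominates.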
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Matthew Garrett wrote:

> [...]
>
> Our power consumption is worse than under other operating
> systems is almost entirely because only one of our three GPU
> drivers implements any kind of useful power management. [...]

... and because our CPU frequency and C state selection logic is making pretty much the worst possible decisions (on x86 at least).

Regardless, you cannot possibly seriously suggest that because there's even greater suckage elsewhere for some workloads we should not even bother with improving the situation here.

Anyway, I agree with Alan that actual numbers matter.

Thanks,

	Ingo
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Alan Cox wrote:

> > Why? Good scheduling is useful even in isolation.
>
> For power - I suspect it's damn near irrelevant except on a
> big big machine.

With deep enough C states it's rather relevant whether we continue to burn +50W for a couple more milliseconds or not, and whether we have the right information from the scheduler and timer subsystem about how long the next idle period is expected to be and how bursty a given task is.

'Balance for energy efficiency' obviously ties into the C state and frequency selection logic, which is rather detached right now, running its own (imperfect) scheduling metrics logic and making pretty much the worst possible C state and frequency decisions in typical everyday desktop workloads.

> Unless you've sorted out your SATA, fixed your phy handling,
> optimised your desktop for wakeups and worked down the big
> wakeup causes one by one it's turd polishing.
>
> PM means fixing the stack top to bottom, and it's a whackamole
> game, each one you fix you find the next. You have to sort the
> entire stack from desktop apps to kernel.

Moving 'policy' into user-space has been an utter failure, mostly because there's not a single project/subsystem responsible for getting a good result to users. This is why I resist the "policy should not be in the kernel" meme here.

> However benchmarks talk - so let's have some benchmarks ... on
> a laptop.

Agreed.

Thanks,

	Ingo
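[Editorial sketch] Why the expected length of the next idle period matters for C state selection can be shown with a deliberately simplified model: pick the deepest state whose target residency fits inside the predicted idle time and whose exit latency stays within a latency limit. This is a sketch in Python, loosely following how idle governors are usually described, and is not the kernel's actual code; the `IdleState` type, the state table, and all numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int      # time needed to wake up from this state
    target_residency_us: int  # minimum stay for the state to pay off


def pick_idle_state(states: List[IdleState], predicted_idle_us: int,
                    latency_limit_us: int) -> IdleState:
    """Pick the deepest state that fits the prediction and latency limit.

    `states` must be ordered from shallowest to deepest; the shallowest
    state is the fallback when nothing deeper qualifies.
    """
    chosen = states[0]
    for state in states:
        if state.target_residency_us > predicted_idle_us:
            continue  # the idle period is too short for this state to pay off
        if state.exit_latency_us > latency_limit_us:
            continue  # waking up would take longer than callers tolerate
        chosen = state
    return chosen


# Invented state table, shallow to deep.
STATES = [
    IdleState("C1", exit_latency_us=2, target_residency_us=2),
    IdleState("C3", exit_latency_us=80, target_residency_us=200),
    IdleState("C6", exit_latency_us=200, target_residency_us=800),
]

if __name__ == "__main__":
    for predicted in (100, 500, 5000):
        state = pick_idle_state(STATES, predicted, latency_limit_us=1000)
        print(f"predicted idle {predicted} us -> {state.name}")
```

A prediction that is too short keeps the CPU in a shallow state and wastes power, while one that is too long pays the deep state's exit latency for nothing, which is why better information from the scheduler and timer subsystem could help.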
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Alan Cox a...@lxorguk.ukuu.org.uk wrote: Why? Good scheduling is useful even in isolation. For power - I suspect it's damn near irrelevant except on a big big machine. With deep enough C states it's rather relevant whether we continue to burn +50W for a couple of more milliseconds or not, and whether we have the right information from the scheduler and timer subsystem about how long the next idle period is expected to be and how bursty a given task is. 'Balance for energy efficiency' obviously ties into the C state and frequency selection logic, which is rather detached right now, running its own (imperfect) scheduling metrics logic and doing pretty much the worst possible C state and frequency decisions in typical everyday desktop workloads. Unless you've sorted out your SATA, fixed your phy handling, optimised your desktop for wakeups and worked down the big wakeup causes one by one it's turd polishing. PM means fixing the stack top to bottom, and its a whackamole game, each one you fix you find the next. You have to sort the entire stack from desktop apps to kernel. Moving 'policy' into user-space has been an utter failure, mostly because there's not a single project/subsystem responsible for getting a good result to users. This is why I resist policy should not be in the kernel meme here. However benchmarks talk - so lets have some benchmarks ... on a laptop. Agreed. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Matthew Garrett mj...@srcf.ucam.org wrote: [...] Our power consumption is worse than under other operating systems is almost entirely because only one of our three GPU drivers implements any kind of useful power management. [...] ... and because our CPU frequency and C state selection logic is doing pretty much the worst possible decisions (on x86 at least). Regardless, you cannot possibly seriously suggest that because there's even greater suckage elsewhere for some workloads we should not even bother with improving the situation here. Anyway, I agree with Alan that actual numbers matter. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
With deep enough C states it's rather relevant whether we continue to burn +50W for a couple of more milliseconds or not, and whether we have the right information from the scheduler and timer subsystem about how long the next idle period is expected to be and how bursty a given task is. 50W for 2mS here and there is an irrelevance compared with burning a continual half a watt due to the upstream tree lack some of the SATA power patches for example. It's the classic standby mode problem - energy efficiency has time as a factor and there are a lot of milliseconds in 5 hours. That means anything continually on rapidly dominates the problem space. PM means fixing the stack top to bottom, and its a whackamole game, each one you fix you find the next. You have to sort the entire stack from desktop apps to kernel. Moving 'policy' into user-space has been an utter failure, mostly because there's not a single project/subsystem responsible for getting a good result to users. This is why I resist policy should not be in the kernel meme here. You *can't* fix PM in one place. Power management is a top to bottom thing. It starts in the hardware and propogates right to the top of the user space stack. A single stupid behaviour in a desktop app is all it needs to knock the odd hour or two off your battery life. Something is mundane as refreshing a bit of the display all the time keeping the GPU and CPU from sleeping well. Most distros haven't managed to do power management properly because it is this entire integration problem. Every single piece of the puzzle has to be in place before you get any serious gain. It's not a kernel v user thing. The kernel can't fix it, random bits of userspace can't fix it. This is effectively a product level integration problem. 
Alan -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Alan Cox a...@lxorguk.ukuu.org.uk wrote:

> > With deep enough C states it's rather relevant whether we continue to burn +50W for a couple of more milliseconds or not, and whether we have the right information from the scheduler and timer subsystem about how long the next idle period is expected to be and how bursty a given task is.
>
> 50W for 2mS here and there is an irrelevance compared with burning a continual half a watt due to the upstream tree lacking some of the SATA power patches for example.

It can be more than an irrelevance if the CPU is saturated - say a game running on a mobile device very commonly saturates the CPU. A third of the energy is spent in the CPU, sometimes more.

> It's the classic standby mode problem - energy efficiency has time as a factor and there are a lot of milliseconds in 5 hours. That means anything continually on rapidly dominates the problem space.
>
> > > PM means fixing the stack top to bottom, and it's a whackamole game, each one you fix you find the next. You have to sort the entire stack from desktop apps to kernel.
> >
> > Moving 'policy' into user-space has been an utter failure, mostly because there's not a single project/subsystem responsible for getting a good result to users. This is why I resist the "policy should not be in the kernel" meme here.
>
> You *can't* fix PM in one place. [...]

Preferably one project, not one place - but at least don't go down the false path of:

  "Policy always belongs into user-space so the kernel can continue to do a shitty job even for pieces it could understand better" ...

My opinion is that it depends, and I also think that we are so bad currently (on x86) that we can do little harm by trying to do things better.

> [...] Power management is a top to bottom thing. It starts in the hardware and propagates right to the top of the user space stack.

Partly because it's misdesigned: in practice there's very little true user policy about power saving:

 - On mobile devices I almost never tweak policy as a user - sometimes I override screen brightness but that's all (and it's trivial compared to all the many other things that go on).

 - On a laptop I'd love to never have to tweak it either - running fast when on AC and running efficient when on battery is a perfectly fine life-time default for me.

90% of the policy comes with the *form factor* - i.e. it's something the hardware and thus the kernel could intimately know about.

Yes, there are exceptions and there are servers.

The mobile device user mostly *only cares about battery life*, for a given amount of real utility provided by the device.

The user policy fetish here is a serious misunderstanding of how it should all work. There aren't millions of people out there wanting to tweak the heck out of PM.

People prefer no knobs at all - they want good defaults and they want at most a single, intuitive, actionable control to override the automation in 1% of the usecases, such as screen brightness.

> A single stupid behaviour in a desktop app is all it needs to knock the odd hour or two off your battery life. Something as mundane as refreshing a bit of the display all the time keeping the GPU and CPU from sleeping well.

Even with highly powertop-optimized systems that have no such app and have very low wakeup rates we still lag behind the competition.

> Most distros haven't managed to do power management properly because it is this entire integration problem. Every single piece of the puzzle has to be in place before you get any serious gain.

Most certainly.

So why not move most pieces into one well-informed code domain (the kernel) and only expose high level controls, instead of expecting user-space to get it all right.

Then the 'only' job of user-space would be to not be silly when implementing their functionality. (and there's nothing intimately PM about that.)

> It's not a kernel v user thing. The kernel can't fix it, random bits of userspace can't fix it. This is effectively a product level integration problem.

Of course the kernel can fix many parts by offering automation like automatically shutting down unused interfaces (and offering better ABIs if that is not possible due to some poor historic choice), choosing frequencies and C states wisely, etc.

Kernel design decisions *matter*:

Look for example how moving X lowlevel drivers from user-space into kernel-space enabled GPU level power management to begin with. With the old X method it was essentially impossible. Now it's at least possible.

Or look at how Android adding a high-level interface like suspend blockers materially improved the power saving situation for them.

This learned helplessness that the kernel can do nothing about PM is somewhat annoying :-)

Thanks,

	Ingo
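For scale, the two figures traded in this exchange can be set side by side with E = P x t. The following is a minimal sketch that uses only the numbers quoted in the thread (50 W for 2 ms, half a watt over 5 hours); the function name and the integer milliwatt/millisecond units are choices made for this illustration, not anything from the kernel.

```python
def energy_microjoules(power_mw: int, duration_ms: int) -> int:
    """Energy in microjoules: mW * ms = uJ (integer arithmetic, no rounding)."""
    return power_mw * duration_ms


# One 2 ms burst at 50 W (figure quoted in the thread).
burst_uj = energy_microjoules(50_000, 2)

# A continual 0.5 W drain over 5 hours (figures quoted in the thread).
five_hours_ms = 5 * 60 * 60 * 1000
drain_uj = energy_microjoules(500, five_hours_ms)

# How many such bursts add up to the continual drain.
bursts_to_match = drain_uj // burst_uj

print(f"one burst:       {burst_uj / 1e6} J")
print(f"continual drain: {drain_uj / 1e6} J")
print(f"bursts to match: {bursts_to_match}")
```

Under these numbers one burst costs 0.1 J and the continual drain costs 9000 J, so it takes 90,000 bursts (five per second for the whole five hours) to match it. Whether a saturated CPU produces such bursts that often is the point in dispute.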
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Wed, Aug 22, 2012 at 11:10:13AM +0200, Ingo Molnar wrote:
> * Matthew Garrett mj...@srcf.ucam.org wrote:
> > [...]
> > Our power consumption is worse than under other operating systems almost entirely because only one of our three GPU drivers implements any kind of useful power management. [...]
>
> ... and because our CPU frequency and C state selection logic is making pretty much the worst possible decisions (on x86 at least).

You have figures showing that our C state residence is worse than, say, Windows? Because my own testing says that we're way better at that. Could we be better? Sure. Is it why we're worse? No.

> Regardless, you cannot possibly seriously suggest that because there's even greater suckage elsewhere for some workloads we should not even bother with improving the situation here.

I'm enthusiastic about improving the scheduler's behaviour. I'm unenthusiastic about putting in automatic hacks related to AC state.

-- 
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
> It can be more than an irrelevance if the CPU is saturated - say a game running on a mobile device very commonly saturates the CPU. A third of the energy is spent in the CPU, sometimes more.

If the CPU is saturated you already lost. What are you going to do - the CPU is saturated - slow it down, then it'll use more power.

> > You *can't* fix PM in one place. [...]
>
> Preferably one project, not one place - but at least don't go down the false path of:
>
>   "Policy always belongs into user-space so the kernel can continue to do a shitty job even for pieces it could understand better" ...
>
> My opinion is that it depends, and I also think that we are so bad currently (on x86) that we can do little harm by trying to do things better.

All the evidence I've seen says we are doing the kernel side stuff right.

> > [...] Power management is a top to bottom thing. It starts in the hardware and propagates right to the top of the user space stack.
>
> Partly because it's misdesigned: in practice there's very little true user policy about power saving:

It's not about policy, it's about code behaviour. You have to fix every single piece of code.

>  - On mobile devices I almost never tweak policy as a user - sometimes I override screen brightness but that's all (and it's trivial compared to all the many other things that go on).

Put a single badly broken app on an Android device and your battery life will plough. That's despite Android having some highly active management policies to minimise the effect.

It works out of the box because someone spent a huge amount of time with a power meter and monitoring tools beating up whoever was top of the wakeup lists.

> it should all work. There aren't millions of people out there wanting to tweak the heck out of PM.

Don't confuse policy managed by the userspace and buttons for users to tweak. Userspace understands things like "would it be better to drop video quality or burn more power" and has access to info the kernel can't even begin to evaluate.

> People prefer no knobs at all - they want good defaults and they want at most a single, intuitive, actionable control to override the automation in 1% of the usecases, such as screen brightness.

That's a different discussion.

> > A single stupid behaviour in a desktop app is all it needs to knock the odd hour or two off your battery life. Something as mundane as refreshing a bit of the display all the time keeping the GPU and CPU from sleeping well.
>
> Even with highly powertop-optimized systems that have no such app and have very low wakeup rates we still lag behind the competition.

Actually we don't. Well not if your distro is put together properly, and has the relevant SATA patches and the like merged. Stock Fedora may be pants but if so that's a distro problem.

> So why not move most pieces into one well-informed code domain (the kernel) and only expose high level controls, instead of expecting user-space to get it all right.

Because the kernel doesn't have the information needed. You'd have to add megabytes of code to the kernel - including things like video playback engines.

> Then the 'only' job of user-space would be to not be silly when implementing their functionality. (and there's nothing intimately PM about that.)

That sounds like ignorance

> Kernel design decisions *matter*:

Of course they do but it's a tiny part of the story. The power management function mathematically has a large number of important inputs for which the kernel cannot deduce the values without massive layering violations.

Also inconveniently for your worldview but as demonstrated in every case and by everyone who has dug into it, you also have to fix all the wakeup sources on each level.

That's the reality. From the moment you wake for an event that was not strictly needed you are essentially attempting to mitigate a failure not trying to deal with the actual problem.

> Look for example how moving X lowlevel drivers from user-space into kernel-space enabled GPU level power management to begin with. With the old X method it was essentially impossible. Now it's at least possible.

Actually it was perfectly possible before for what the cards of the time could do. The kernel GPU stuff is for DMA and IRQ handling. It happens to be a good place to do PM.

> Or look at how Android adding a high-level interface like suspend blockers materially improved the power saving situation for them.

Blockers are not policy. The blocking *policy* is managed elsewhere. They are a tool for freezing stuff that is being rude.

> This learned helplessness that the kernel can do nothing about PM is somewhat annoying :-)

Sorry was that a different thread I didn't read ?

The inability to learn from both the past and basic systems theory is what I find rather more irritating. Plus your mistaken belief that we are worse than the other OS's on this. We are not. If your system sucks then instrument it,
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
> > I'd like to see actual numbers and evidence on a wide range of workloads that the spread/don't spread thing is even measurable given that you've also got to factor in effects like completing faster and turning everything off. I'd *really* like to see such evidence on a laptop, which is your one cited case it might work.
>
> For my dinky dual core laptop, I suspect you're right, but for a more powerful laptop, I'd expect spread/don't to be noticeable.

yeah if you don't spread, you will waste some power. but.. current linux behavior is to spread. so we can only make it worse.

> Yeah, hard numbers would be nice to see.
>
> If I had a powerful laptop, I'd kill irq balancing, and all but periodic load balancing, and expect to see a positive result.

I'd expect to see a negative result ;-)
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Wed, 2012-08-22 at 06:02 -0700, Arjan van de Ven wrote:
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
> > > I'd like to see actual numbers and evidence on a wide range of workloads that the spread/don't spread thing is even measurable given that you've also got to factor in effects like completing faster and turning everything off. I'd *really* like to see such evidence on a laptop, which is your one cited case it might work.
> >
> > For my dinky dual core laptop, I suspect you're right, but for a more powerful laptop, I'd expect spread/don't to be noticeable.
>
> yeah if you don't spread, you will waste some power. but.. current linux behavior is to spread. so we can only make it worse.

Hm, so I can stop fretting about select_idle_sibling(). Good.

> > Yeah, hard numbers would be nice to see.
> >
> > If I had a powerful laptop, I'd kill irq balancing, and all but periodic load balancing, and expect to see a positive result.
>
> I'd expect to see a negative result ;-)

Ok, so I have my head on backward. Gives a different perspective :)

-Mike
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > For my dinky dual core laptop, I suspect you're right, but for a more powerful laptop, I'd expect spread/don't to be noticeable.
>
> yeah if you don't spread, you will waste some power. but.. current linux behavior is to spread. so we can only make it worse.

Right. For a single socket system the only thing you can do is use two threads in preference to using two cores. That'll keep an extra core in a deep C state for longer, at the cost of keeping the package out of a deep C state for longer. There might be a win if the two processes benefit from improved L1 cache locality, or if you're talking about short periodic work, but for the majority of cases I'd expect Arjan to be completely correct here.

Things get more interesting with multi-socket systems, but that's beyond the laptop use case.

-- 
Matthew Garrett | mj...@srcf.ucam.org
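The single-socket option described above (two hardware threads of one core rather than two separate cores) can be tried from userspace when gathering the numbers the thread keeps asking for. This is a rough sketch, assuming a Linux system that exposes `thread_siblings_list` under sysfs; it pins a process with `os.sched_setaffinity` and is not the scheduler's own placement logic. The helper names are made up for this example.

```python
import os
from typing import Set


def parse_cpu_list(text: str) -> Set[int]:
    """Parse a sysfs CPU list such as "0,4" or "0-1,8-9" into a set of CPU ids."""
    cpus: Set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def siblings_of(cpu: int) -> Set[int]:
    """Hardware threads sharing a core with `cpu`, as reported by Linux sysfs."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as handle:
        return parse_cpu_list(handle.read())


def pin_to_one_core(pid: int = 0, cpu: int = 0) -> Set[int]:
    """Restrict `pid` (0 means the calling process) to the threads of one core."""
    core_threads = siblings_of(cpu)
    os.sched_setaffinity(pid, core_threads)
    return core_threads
```

Calling `pin_to_one_core()` in a benchmark harness and comparing measured power against an unpinned run would give one data point; whether it wins or loses is what the posters disagree about.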
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/22/2012 6:21 AM, Matthew Garrett wrote:
> On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
> > On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > > For my dinky dual core laptop, I suspect you're right, but for a more powerful laptop, I'd expect spread/don't to be noticeable.
> >
> > yeah if you don't spread, you will waste some power. but.. current linux behavior is to spread. so we can only make it worse.
>
> Right. For a single socket system the only thing you can do is use two threads in preference to using two cores. That'll keep an extra core in a deep C state for longer, at the cost of keeping the package out of a deep C state for longer. There might be a win if the two processes benefit from improved L1 cache locality, or if you're talking about

basically "if HT sharing would be good for performance" ;-)

(btw this is good news, it means this is not an actual power/performance tradeoff, but a "get it right" tradeoff)
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
> I'd like to see actual numbers and evidence on a wide range of workloads that the spread/don't spread thing is even measurable given that you've also got to factor in effects like completing faster and turning everything off. I'd *really* like to see such evidence on a laptop, which is your one cited case it might work.

For my dinky dual core laptop, I suspect you're right, but for a more powerful laptop, I'd expect spread/don't to be noticeable.

Yeah, hard numbers would be nice to see.

If I had a powerful laptop, I'd kill irq balancing, and all but periodic load balancing, and expect to see a positive result. Dunno what fickle electron gods would _really_ do with those prayers though.

-Mike
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
> Why? Good scheduling is useful even in isolation.

For power - I suspect it's damn near irrelevant except on a big big machine.

Unless you've sorted out your SATA, fixed your phy handling, optimised your desktop for wakeups and worked down the big wakeup causes one by one it's turd polishing.

PM means fixing the stack top to bottom, and it's a whackamole game, each one you fix you find the next. You have to sort the entire stack from desktop apps to kernel.

However benchmarks talk - so let's have some benchmarks ... on a laptop.

Alan
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Tue, Aug 21, 2012 at 08:23:46PM +0200, Ingo Molnar wrote:
> * Matthew Garrett wrote:
> > The scheduler is unaware of whether I care about a process finishing quickly or whether I care about it consuming less power.
>
> You are posing them as if the two were mutually exclusive, while in reality they are not necessarily exclusive: it's quite possible that the highest (non-turbo) CPU frequency happens to be the most energy efficient one for a CPU with a particular workload ...

You just put in a proviso that makes them mutually exclusive. If I want it done fast, I want it done in the highest turbo CPU frequency. If I don't want it done fast, I want it done in the most efficient CPU frequency. They're probably not the same thing.

> You also missed the bit of my mail where I suggested that such user preferences and tolerances can be communicated to the scheduler via a policy toggle - which the scheduler would take into account.

Yes. And that toggle should be the thing that defines the policy under all circumstances.

> > Ok so what you're actually telling me here is that you don't understand anything about power management and where our problems are.
>
> Huh? In practice we suck today in terms of energy efficiency. That covers both scheduling and other areas.
>
> Saying this out loud does not tell anything about my understanding of power management...
>
> So please outline a technical point.

Our power consumption is worse than under other operating systems almost entirely because only one of our three GPU drivers implements any kind of useful power management. The power saving functionality that we expose to userspace is already used when it's safe to do so. So blaming our userspace policy management for our higher power consumption means that you can't possibly understand where the problems actually are, which indicates that you probably shouldn't be trying to tell me about optimal approaches to power management.

> You mean the code is in drivers? Or the problem is in drivers?

The problem is in the drivers.

> > sched_mt_powersave was inherently going to have a huge impact on performance, and with modern chips that would result in the platform consuming more power. It was a feature that was useful for a small number of generations of desktop CPUs - I don't think it would ever skew the power/performance ratio in a useful direction on mobile hardware. But feel free to blame userspace for hardware design.
>
> FYI, sched_mt_powersave is *GONE* in recent kernels, because it basically never worked. This thread is about designing and implementing something that actually works.

Yes. You asked me whether userspace ever used the knobs that the kernel exposed. I said no, because the only knob relevant for laptops would never improve energy efficiency on laptops. It is therefore impossible to use this as an example of userspace policy management not doing the right thing.

-- 
Matthew Garrett | mj...@srcf.ucam.org
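The "fastest" versus "most efficient" frequency distinction argued above can be made concrete with a toy model: energy for a fixed amount of work is (platform power + CPU power) x (work / frequency). The sketch below assumes the platform can power down once the work completes. Every number in it is invented for illustration and is not a measurement of any real CPU; the names are hypothetical too.

```python
# Hypothetical operating points: frequency in MHz -> active CPU power in mW.
# Invented numbers for illustration only; 3200 stands in for a turbo state.
CPU_POWER_MW = {800: 4_000, 1600: 9_000, 2400: 15_000, 3200: 30_000}

# Hypothetical power burned by the rest of the platform while the task runs.
PLATFORM_POWER_MW = 10_000


def energy_mj(freq_mhz: int, megacycles: float) -> float:
    """Energy in millijoules to retire `megacycles` of work at one frequency."""
    seconds = megacycles / freq_mhz  # Mcycles / (Mcycles per second)
    return (PLATFORM_POWER_MW + CPU_POWER_MW[freq_mhz]) * seconds  # mW * s = mJ


def fastest_frequency() -> int:
    """The operating point that finishes the work soonest."""
    return max(CPU_POWER_MW)


def most_efficient_frequency(megacycles: float = 1000.0) -> int:
    """The operating point that spends the least energy on the same work."""
    return min(CPU_POWER_MW, key=lambda freq: energy_mj(freq, megacycles))


print("fastest:", fastest_frequency(), "MHz")
print("most efficient:", most_efficient_frequency(), "MHz")
```

With these made-up figures the two answers differ (the turbo point is fastest, the highest non-turbo point is cheapest in energy), which is compatible with both posters' statements. Different power curves would move the minimum, which is why the thread keeps returning to measured numbers.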
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Matthew Garrett wrote:

> On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett wrote:
> > > The scheduler's behaviour is going to have a minimal impact on power consumption on laptops. Other things are much more important - backlight level, ASPM state, that kind of thing. So why special case the scheduler? [...]
> >
> > I'm not special casing the scheduler - but we are talking about scheduler policies here, so *if* it makes sense to handle this dynamically then obviously the scheduler wants to use system state information when/if the kernel can get it.
> >
> > Your argument is as if you said that the shape of a car's side view mirrors is not important to its top speed, because the overall shape of the chassis and engine power are much more important.
> >
> > But we are designing side view mirrors here, so we might as well do a good job there.
>
> If the kernel is going to make power choices automatically then it should do it everywhere, not piecemeal.

Why? Good scheduling is useful even in isolation.

> The scheduler is unaware of whether I care about a process finishing quickly or whether I care about it consuming less power.

You are posing them as if the two were mutually exclusive, while in reality they are not necessarily exclusive: it's quite possible that the highest (non-turbo) CPU frequency happens to be the most energy efficient one for a CPU with a particular workload ...

You also missed the bit of my mail where I suggested that such user preferences and tolerances can be communicated to the scheduler via a policy toggle - which the scheduler would take into account.

I suggest using sane defaults, such as being energy efficient on battery power (within a sane threshold) and maximizing throughput on AC power (within a sane threshold).

That would go a *long* way improving the current mess. If Linux power efficiency was so good today then I'd not ask for kernel driven defaults - but the reality is that in terms of process scheduling we suck today (and have sucked for the last 10 years) so pretty much any approach will improve things.

> > > > The thing is, when I use Linux on a laptop then AC/battery is *the* main policy input.
> > >
> > > And it's already well handled from userspace, as it has to be.
> >
> > Not according to the developers switching away from Linux desktop distros in droves, because MacOSX or Win7 has 30%+ better battery efficiency.
>
> Ok so what you're actually telling me here is that you don't understand anything about power management and where our problems are.

Huh? In practice we suck today in terms of energy efficiency. That covers both scheduling and other areas.

Saying this out loud does not tell anything about my understanding of power management...

So please outline a technical point.

> > The scheduler might be a small part of the picture, but it's certainly a part of it.
>
> It's in the drivers, which is where it has been since we went tickless.

You mean the code is in drivers? Or the problem is in drivers?

Both are true currently - this discussion is about the future, to make the scheduler aware of power concerns, as the scheduler (and the timer subsystem) already calculates various interesting metrics that matter to energy efficient scheduling.

> > > No, because sched_mt_powersave usually crippled performance more than it saved power and nobody makes multi-socket laptops.
> >
> > That's a user-space policy management fail right there: why wasn't this fixed? If the default policy is in the kernel we can at least fix it in one place for the most common cases. If it's spread out amongst multiple projects then progress only happens at glacial speed ...
>
> sched_mt_powersave was inherently going to have a huge impact on performance, and with modern chips that would result in the platform consuming more power. It was a feature that was useful for a small number of generations of desktop CPUs - I don't think it would ever skew the power/performance ratio in a useful direction on mobile hardware. But feel free to blame userspace for hardware design.

FYI, sched_mt_powersave is *GONE* in recent kernels, because it basically never worked. This thread is about designing and implementing something that actually works.

Thanks,

	Ingo
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> * Matthew Garrett wrote:
> > The scheduler's behaviour is going to have a minimal impact on power consumption on laptops. Other things are much more important - backlight level, ASPM state, that kind of thing. So why special case the scheduler? [...]
>
> I'm not special casing the scheduler - but we are talking about scheduler policies here, so *if* it makes sense to handle this dynamically then obviously the scheduler wants to use system state information when/if the kernel can get it.
>
> Your argument is as if you said that the shape of a car's side view mirrors is not important to its top speed, because the overall shape of the chassis and engine power are much more important.
>
> But we are designing side view mirrors here, so we might as well do a good job there.

If the kernel is going to make power choices automatically then it should do it everywhere, not piecemeal.

> > [...] This is going to be hugely more important on multi-socket systems, where your policy is usually going to be dictated by the specific workload that you're running at the time. [...]
>
> If only we had some kernel subsystem that is intimately familiar with the workloads running on the system and could act accordingly and with low latency.
>
> We could name that subsystem in some intuitive fashion: it switches and schedules workloads, so how about calling it the 'scheduler'?

The scheduler is unaware of whether I care about a process finishing quickly or whether I care about it consuming less power.

> > [...] The exception is in cases where your rack is overcommitted for power and your rack management unit is telling you to reduce power consumption since otherwise it's going to have to cut the power to one of the machines in the rack in the next few seconds.
>
> ( That must be some ACPI middleware driven crap, right? Not really the Linux kernel's problem. )

It's as much the Linux kernel's problem as AC/battery decisions are - ie, it's not.

> > > The thing is, when I use Linux on a laptop then AC/battery is *the* main policy input.
> >
> > And it's already well handled from userspace, as it has to be.
>
> Not according to the developers switching away from Linux desktop distros in droves, because MacOSX or Win7 has 30%+ better battery efficiency.

Ok so what you're actually telling me here is that you don't understand anything about power management and where our problems are.

> The scheduler might be a small part of the picture, but it's certainly a part of it.

It's in the drivers, which is where it has been since we went tickless.

> > No, because sched_mt_powersave usually crippled performance more than it saved power and nobody makes multi-socket laptops.
>
> That's a user-space policy management fail right there: why wasn't this fixed? If the default policy is in the kernel we can at least fix it in one place for the most common cases. If it's spread out amongst multiple projects then progress only happens at glacial speed ...

sched_mt_powersave was inherently going to have a huge impact on performance, and with modern chips that would result in the platform consuming more power. It was a feature that was useful for a small number of generations of desktop CPUs - I don't think it would ever skew the power/performance ratio in a useful direction on mobile hardware. But feel free to blame userspace for hardware design.

-- 
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Matthew Garrett wrote:

> On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett wrote:
> > > [...] AC/battery is just not an important power management policy input when compared to various other things.
> >
> > Such as?
>
> The scheduler's behaviour is going to have a minimal impact on power consumption on laptops. Other things are much more important - backlight level, ASPM state, that kind of thing. So why special case the scheduler? [...]

I'm not special casing the scheduler - but we are talking about scheduler policies here, so *if* it makes sense to handle this dynamically then obviously the scheduler wants to use system state information when/if the kernel can get it.

Your argument is as if you said that the shape of a car's side view mirrors is not important to its top speed, because the overall shape of the chassis and engine power are much more important.

But we are designing side view mirrors here, so we might as well do a good job there.

> [...] This is going to be hugely more important on multi-socket systems, where your policy is usually going to be dictated by the specific workload that you're running at the time. [...]

If only we had some kernel subsystem that is intimately familiar with the workloads running on the system and could act accordingly and with low latency.

We could name that subsystem in some intuitive fashion: it switches and schedules workloads, so how about calling it the 'scheduler'?

> [...] The exception is in cases where your rack is overcommitted for power and your rack management unit is telling you to reduce power consumption since otherwise it's going to have to cut the power to one of the machines in the rack in the next few seconds.

( That must be some ACPI middleware driven crap, right? Not really the Linux kernel's problem. )

> > The thing is, when I use Linux on a laptop then AC/battery is *the* main policy input.
>
> And it's already well handled from userspace, as it has to be.

Not according to the developers switching away from Linux desktop distros in droves, because MacOSX or Win7 has 30%+ better battery efficiency.

The scheduler might be a small part of the picture, but it's certainly a part of it.

> > > Userspace has been doing a perfectly reasonable job of determining policy here.
> >
> > Has it properly switched the scheduler's balancing between power-efficient and performance-maximizing strategies when for example a laptop's AC got unplugged/replugged?
>
> No, because sched_mt_powersave usually crippled performance more than it saved power and nobody makes multi-socket laptops.

That's a user-space policy management fail right there: why wasn't this fixed? If the default policy is in the kernel we can at least fix it in one place for the most common cases. If it's spread out amongst multiple projects then progress only happens at glacial speed ...

Thanks,

	Ingo
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
> > That's a fundamentally uninteresting thing for the kernel to know about. [...]
>
> I disagree.

The kernel has no idea of the power architecture leading up to the plug socket. The kernel has no idea of the policy concerns of the user.

> > [...] AC/battery is just not an important power management policy input when compared to various other things.
>
> Such as?
>
> The thing is, when I use Linux on a laptop then AC/battery is *the* main policy input.

Along with distance likely to be travelled without a socket being available, whether you remembered the charger, and a pile of other things ('can I get this built before Linus wakes up').

The kernel isn't capable of computing these other factors. The userspace can at least make an educated guess.

In the business space it's even more complicated because battery/mains may well only be visible via SNMP queries to the power systems and the bigger concern may well be heat efficiency. If you are running a cloud your policy considerations also include things like your current spot electricity price, outside temperature and your current spot compute price chargeable.

> > Userspace has been doing a perfectly reasonable job of determining policy here.
>
> Has it properly switched the scheduler's balancing between power-efficient and performance-maximizing strategies when for example a laptop's AC got unplugged/replugged?

You work for Red Hat, maybe you should ask your distro people if they do. While you are at it perhaps also some of the ATA power management that will probably be an order of magnitude more significant could get included ;)

Seriously. On a typical laptop the things you can do about power are dominated by the backlight, by disk power (eg idle SATA links), by USB device power downs where possible, by turning off any unused phys and by not having the CPU wake up, which means fixing the desktop apps to behave sensibly.

I'd like to see actual numbers and evidence on a wide range of workloads that the spread/don't spread thing is even measurable given that you've also got to factor in effects like completing faster and turning everything off. I'd *really* like to see such evidence on a laptop, which is your one cited case it might work.

> > Your suggestions aren't a working default mechanism.
>
> In what way?

For one if the default behaviour is that when I get on the train and am on battery my video playback begins to stutter due to some kernel magic then I shall be unamused and file it as a regression.

Policy is userspace - the desktop can figure out I'm watching movies and what this means, the kernel can't.

I'd also note there have been repeated attempts to put power management policy on various OS's by putting the power management policy

- in the hardware
- in SMM handlers
- in the kernel

and every single one has been a failure because those parts of the system never have enough information nor do they have enough variety of control to manage the complexity of input state.

It's a single policy file for a distro to do scheduler configuration based upon power events. One trivial 'drop it here' shell script. The difference then being the desktop can be taught to do overrides and policy properly.

It might be the kernel has important knowledge about what "schedule for efficiency" means and even to be able to ask the kernel to do that - but it has no idea what the right policy is at any given moment.

ie even if there is a /sys/mumble/schedule_for_efficiency the

	echo "1" >
and
	echo "0" >

belong in a script

Alan
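The "drop it here" policy script described above might look roughly like the following. This is a sketch under stated assumptions: `/sys/mumble/schedule_for_efficiency` is the message's own placeholder and not a real kernel interface, the power-supply directory name varies between machines, and the function names are invented for this example. It is written in Python rather than shell only so that the decision logic can be checked separately from the sysfs write.

```python
from pathlib import Path
from typing import Optional

# Placeholder knob taken from the message above; it is NOT a real sysfs path.
KNOB = Path("/sys/mumble/schedule_for_efficiency")


def knob_value(on_ac: bool, user_override: Optional[bool] = None) -> str:
    """Policy decision: an explicit desktop/user override beats the AC state."""
    if user_override is not None:
        efficient = user_override
    else:
        efficient = not on_ac
    return "1" if efficient else "0"


def read_on_ac(supply_dir: Path) -> bool:
    """Read the `online` attribute of one power-supply directory under sysfs."""
    return (supply_dir / "online").read_text().strip() == "1"


def apply_policy(supply_dir: Path, knob: Path = KNOB,
                 user_override: Optional[bool] = None) -> str:
    """Equivalent of the echo "1" / echo "0" in the message, run on a power event."""
    value = knob_value(read_on_ac(supply_dir), user_override)
    knob.write_text(value)
    return value
```

A distro hook would call something like `apply_policy(Path("/sys/class/power_supply/AC"))` on plug and unplug events (the `AC` name is an assumption), while a desktop that knows a movie is playing could pass `user_override=False` to keep throughput up on battery. That override is the part the message argues only userspace can decide.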
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> * Matthew Garrett wrote:
> > [...] AC/battery is just not an important power management
> > policy input when compared to various other things.
>
> Such as?

The scheduler's behaviour is going to have a minimal impact on power
consumption on laptops. Other things are much more important - backlight
level, ASPM state, that kind of thing. So why special case the
scheduler? This is going to be hugely more important on multi-socket
systems, where your policy is usually going to be dictated by the
specific workload that you're running at the time. The exception is in
cases where your rack is overcommitted for power and your rack
management unit is telling you to reduce power consumption since
otherwise it's going to have to cut the power to one of the machines in
the rack in the next few seconds.

> The thing is, when I use Linux on a laptop then AC/battery is
> *the* main policy input.

And it's already well handled from userspace, as it has to be.

> > Userspace has been doing a perfectly reasonable job of
> > determining policy here.
>
> Has it properly switched the scheduler's balancing between
> power-efficient and performance-maximizing strategies when for
> example a laptop's AC got unplugged/replugged?

No, because sched_mt_powersave usually crippled performance more than it
saved power and nobody makes multi-socket laptops.

--
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
>>> A modern kernel better know what state the system is in: on
>>> battery or on AC power.
>>
>> That's a fundamentally uninteresting thing for the kernel to
>> know about. [...]
>
> I disagree.

and I'll agree with Matthew and disagree with you ;-)

>
>> [...] AC/battery is just not an important power management
>> policy input when compared to various other things.
>
> Such as?
>
> The thing is, when I use Linux on a laptop then AC/battery is
> *the* main policy input.

I think you're wrong there.

First of all, not the whole world is a laptop. Phones and servers are
very different than laptops in this sense.
In a phone, when you're charging, you want to be EXTRA power efficient
in many ways (since charging creates heat, and that heat will take away
your thermal budget).
In a datacenter, you're either on AC or DC all the time, and power
efficiency still matters.

And even on a laptop.. heat production matters even when on AC...
laptops are more and more like phones that way.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Matthew Garrett wrote:

> On Tue, Aug 21, 2012 at 11:42:04AM +0200, Ingo Molnar wrote:
> > * Matthew Garrett wrote:
> > > [...] Putting this kind of policy in the kernel is an awful
> > > idea. [...]
> >
> > A modern kernel better know what state the system is in: on
> > battery or on AC power.
>
> That's a fundamentally uninteresting thing for the kernel to
> know about. [...]

I disagree.

> [...] AC/battery is just not an important power management
> policy input when compared to various other things.

Such as?

The thing is, when I use Linux on a laptop then AC/battery is
*the* main policy input.

> > > [...] It should never be altering policy itself, [...]
> >
> > The kernel/scheduler simply offers sensible defaults where
> > it can. User-space can augment/modify/override that in any
> > which way it wishes to.
> >
> > This stuff has not been properly sorted out in the last 10+
> > years since we have battery driven devices, so we might as
> > well start with the kernel offering sane default behavior
> > where it can ...
>
> Userspace has been doing a perfectly reasonable job of
> determining policy here.

Has it properly switched the scheduler's balancing between
power-efficient and performance-maximizing strategies when for
example a laptop's AC got unplugged/replugged?

> > > [...] because it'll get it wrong and people will file bugs
> > > complaining that it got it wrong and the biggest case
> > > where you *need* to be able to handle switching between
> > > performance and power optimisations (your rack management
> > > unit just told you that you're going to have to drop power
> > > consumption by 20W) is one where the kernel doesn't have
> > > all the information it needs to do this. So why bother at
> > > all?
> >
> > The point is to have a working default mechanism.
>
> Your suggestions aren't a working default mechanism.

In what way?

Thanks,

	Ingo
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Tue, Aug 21, 2012 at 11:42:04AM +0200, Ingo Molnar wrote:
> * Matthew Garrett wrote:
> > [...] Putting this kind of policy in the kernel is an awful
> > idea. [...]
>
> A modern kernel better know what state the system is in: on
> battery or on AC power.

That's a fundamentally uninteresting thing for the kernel to know about.
AC/battery is just not an important power management policy input when
compared to various other things.

> > [...] It should never be altering policy itself, [...]
>
> The kernel/scheduler simply offers sensible defaults where it
> can. User-space can augment/modify/override that in any which
> way it wishes to.
>
> This stuff has not been properly sorted out in the last 10+
> years since we have battery driven devices, so we might as well
> start with the kernel offering sane default behavior where it
> can ...

Userspace has been doing a perfectly reasonable job of determining
policy here.

> > [...] because it'll get it wrong and people will file bugs
> > complaining that it got it wrong and the biggest case where
> > you *need* to be able to handle switching between performance
> > and power optimisations (your rack management unit just told
> > you that you're going to have to drop power consumption by
> > 20W) is one where the kernel doesn't have all the information
> > it needs to do this. So why bother at all?
>
> The point is to have a working default mechanism.

Your suggestions aren't a working default mechanism.

--
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 21 August 2012 02:58, Alex Shi wrote:
> On 08/20/2012 11:36 PM, Vincent Guittot wrote:
>
>>> > What you want is to keep track of a per-cpu utilization level (inverse
>>> > of idle-time) and using PJT's per-task runnable avg see if placing the
>>> > new task on will exceed the utilization limit.
>>> >
>>> > I think some of the Linaro people actually played around with this,
>>> > Vincent?
>> Sorry for the late reply but I had almost no network access during the
>> last weeks.
>>
>> So Linaro also works on a power aware scheduler as Peter mentioned.
>>
>> Based on previous tests, we have concluded that the main drawback of
>> the (now removed) old power scheduler was that we had no way to tell
>> the difference between short and long running tasks whereas it's a key
>> input (at least for phone) for deciding to pack tasks and for
>> selecting the core on an asymmetric system.
>
>
> It is hard to estimate the future from a general point of view, but
> from a hack point of view, maybe you can add something to hint this
> from task_struct. :)
>

The per-task load tracking patchsets give you a good view of the last
dozen ms.

>> One additional key piece of information is the power distribution in
>> the system, which can have a finer granularity than the current
>> sched_domain description. Peter's proposal was to use a
>> SHARE_POWERLINE flag similarly to flags that already describe if a
>> sched_domain shares resources or cpu capacity.
>
>
> Seems I missed this. What's the difference with the current
> SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES?

SD_SHARE_CPUPOWER is set in a sched domain at SMT level (sharing some
part of the physical core).
SD_SHARE_PKG_RESOURCES is set at MC level (sharing some resources like
cache and memory access).

>
>>
>> With these 2 new pieces of information, we can have a 1st power saving
>> scheduler which spreads or packs tasks across core and package
>
>
> Fine, I'd like to test them on X86, plus SMT and NUMA :)
>
>>
>> Vincent
>
>
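The packing test quoted at the top of this message (track per-CPU utilization, then check whether a task's runnable average would push a CPU over a limit) can be sketched as a toy model. This is not the Linaro or per-task load tracking code: the function name `pick_cpu_for_packing`, the 0.8 limit, and the "prefer the busiest CPU that still fits" rule are all assumptions made for illustration, with utilization expressed as a fraction between 0 and 1.

```python
from typing import Optional, Sequence


def pick_cpu_for_packing(cpu_util: Sequence[float],
                         task_load: float,
                         limit: float = 0.8) -> Optional[int]:
    """Pick a CPU on which to pack a new task.

    cpu_util[i] is the fraction of time CPU i was not idle; task_load is
    the task's recent runnable average on the same scale. Returns the
    index of the busiest CPU that stays at or under `limit` after
    adding the task, or None if no CPU can absorb it (the caller would
    then fall back to spreading).
    """
    best = None
    for cpu, util in enumerate(cpu_util):
        if util + task_load > limit:
            continue  # placing the task here would exceed the limit
        if best is None or util > cpu_util[best]:
            best = cpu  # prefer already-busy CPUs so idle ones stay idle
    return best
```

For example, with utilizations `[0.5, 0.1, 0.75]` and a task load of `0.2`, CPU 2 is rejected and CPU 0 is chosen over the nearly idle CPU 1. A short-running task (small runnable average) packs easily, while a long-running one tends to get `None`, which is the short/long distinction the message describes as the missing input.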
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Matthew Garrett wrote:

> On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:
>
> > If the answer is 'yes' then there's clear cases where the kernel
> > (should) automatically know the events where we switch from
> > balancing for performance to balancing for power:
>
> No. We can't identify all of these cases and we can't identify
> corner cases. [...]

There's no need to identify 'all' of these cases - but if the
kernel knows then it can have intelligent default behavior.

> [...] Putting this kind of policy in the kernel is an awful
> idea. [...]

A modern kernel better know what state the system is in: on
battery or on AC power.

> [...] It should never be altering policy itself, [...]

The kernel/scheduler simply offers sensible defaults where it
can. User-space can augment/modify/override that in any which
way it wishes to.

This stuff has not been properly sorted out in the last 10+
years since we have battery driven devices, so we might as well
start with the kernel offering sane default behavior where it
can ...

> [...] because it'll get it wrong and people will file bugs
> complaining that it got it wrong and the biggest case where
> you *need* to be able to handle switching between performance
> and power optimisations (your rack management unit just told
> you that you're going to have to drop power consumption by
> 20W) is one where the kernel doesn't have all the information
> it needs to do this. So why bother at all?

The point is to have a working default mechanism.

Thanks,

	Ingo
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:

> I'd like to see actual numbers and evidence on a wide range of workloads
> that the spread/don't spread thing is even measurable given that you've
> also got to factor in effects like completing faster and turning
> everything off. I'd *really* like to see such evidence on a laptop,
> which is your one cited case it might work.

For my dinky dual core laptop, I suspect you're right, but for a more
powerful laptop, I'd expect spread/don't to be noticeable. Yeah, hard
numbers would be nice to see.

If I had a powerful laptop, I'd kill irq balancing, and all but periodic
load balancing, and expect to see a positive result. Dunno what fickle
electron gods would _really_ do with those prayers though.

-Mike
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Matthew Garrett m...@redhat.com wrote:

> On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett mj...@srcf.ucam.org wrote:
> > > [...] AC/battery is just not an important power management
> > > policy input when compared to various other things.
> >
> > Such as?
>
> The scheduler's behaviour is going to have a minimal impact on
> power consumption on laptops. Other things are much more
> important - backlight level, ASPM state, that kind of thing.
> So why special case the scheduler? [...]

I'm not special casing the scheduler - but we are talking about
scheduler policies here, so *if* it makes sense to handle this
dynamically then obviously the scheduler wants to use system
state information when/if the kernel can get it.

Your argument is as if you said that the shape of a car's side
view mirrors is not important to its top speed, because the
overall shape of the chassis and engine power are much more
important.

But we are designing side view mirrors here, so we might as well
do a good job there.

> [...] This is going to be hugely more important on
> multi-socket systems, where your policy is usually going to be
> dictated by the specific workload that you're running at the
> time. [...]

If only we had some kernel subsystem that is intimately
familiar with the workloads running on the system and could act
accordingly and with low latency.

We could name that subsystem in some intuitive fashion: it
switches and schedules workloads, so how about calling it the
'scheduler'?

> [...] The exception is in cases where your rack is
> overcommitted for power and your rack management unit is
> telling you to reduce power consumption since otherwise it's
> going to have to cut the power to one of the machines in the
> rack in the next few seconds.

( That must be some ACPI middleware driven crap, right? Not
  really the Linux kernel's problem. )

> > The thing is, when I use Linux on a laptop then AC/battery
> > is *the* main policy input.
>
> And it's already well handled from userspace, as it has to be.

Not according to the developers switching away from Linux
desktop distros in droves, because MacOSX or Win7 has 30%+
better battery efficiency.

The scheduler might be a small part of the picture, but it's
certainly a part of it.

> > > Userspace has been doing a perfectly reasonable job of
> > > determining policy here.
> >
> > Has it properly switched the scheduler's balancing between
> > power-efficient and performance-maximizing strategies when
> > for example a laptop's AC got unplugged/replugged?
>
> No, because sched_mt_powersave usually crippled performance
> more than it saved power and nobody makes multi-socket
> laptops.

That's a user-space policy management fail right there: why
wasn't this fixed? If the default policy is in the kernel we can
at least fix it in one place for the most common cases. If it's
spread out amongst multiple projects then progress only happens
at glacial speed ...

Thanks,

	Ingo
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> * Matthew Garrett m...@redhat.com wrote:
> > The scheduler's behaviour is going to have a minimal impact on
> > power consumption on laptops. Other things are much more
> > important - backlight level, ASPM state, that kind of thing.
> > So why special case the scheduler? [...]
>
> I'm not special casing the scheduler - but we are talking about
> scheduler policies here, so *if* it makes sense to handle this
> dynamically then obviously the scheduler wants to use system
> state information when/if the kernel can get it.
>
> Your argument is as if you said that the shape of a car's side
> view mirrors is not important to its top speed, because the
> overall shape of the chassis and engine power are much more
> important.
>
> But we are designing side view mirrors here, so we might as well
> do a good job there.

If the kernel is going to make power choices automatically then it
should do it everywhere, not piecemeal.

> > [...] This is going to be hugely more important on
> > multi-socket systems, where your policy is usually going to be
> > dictated by the specific workload that you're running at the
> > time. [...]
>
> If only we had some kernel subsystem that is intimately
> familiar with the workloads running on the system and could act
> accordingly and with low latency.
>
> We could name that subsystem in some intuitive fashion: it
> switches and schedules workloads, so how about calling it the
> 'scheduler'?

The scheduler is unaware of whether I care about a process finishing
quickly or whether I care about it consuming less power.

> > [...] The exception is in cases where your rack is
> > overcommitted for power and your rack management unit is
> > telling you to reduce power consumption since otherwise it's
> > going to have to cut the power to one of the machines in the
> > rack in the next few seconds.
>
> ( That must be some ACPI middleware driven crap, right? Not
>   really the Linux kernel's problem. )

It's as much the Linux kernel's problem as AC/battery decisions are -
ie, it's not.

> > > The thing is, when I use Linux on a laptop then AC/battery
> > > is *the* main policy input.
> >
> > And it's already well handled from userspace, as it has to be.
>
> Not according to the developers switching away from Linux
> desktop distros in droves, because MacOSX or Win7 has 30%+
> better battery efficiency.

Ok so what you're actually telling me here is that you don't understand
anything about power management and where our problems are.

> The scheduler might be a small part of the picture, but it's
> certainly a part of it.

It's in the drivers, which is where it has been since we went tickless.

> > No, because sched_mt_powersave usually crippled performance
> > more than it saved power and nobody makes multi-socket
> > laptops.
>
> That's a user-space policy management fail right there: why
> wasn't this fixed? If the default policy is in the kernel we can
> at least fix it in one place for the most common cases. If it's
> spread out amongst multiple projects then progress only happens
> at glacial speed ...

sched_mt_powersave was inherently going to have a huge impact on
performance, and with modern chips that would result in the platform
consuming more power. It was a feature that was useful for a small
number of generations of desktop CPUs - I don't think it would ever skew
the power/performance ratio in a useful direction on mobile hardware.
But feel free to blame userspace for hardware design.

--
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Matthew Garrett mj...@srcf.ucam.org wrote:
> On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett m...@redhat.com wrote:
> > > The scheduler's behaviour is going to have a minimal impact on power consumption on laptops. Other things are much more important - backlight level, ASPM state, that kind of thing. So why special case the scheduler? [...]
> >
> > I'm not special casing the scheduler - but we are talking about scheduler policies here, so *if* it makes sense to handle this dynamically then obviously the scheduler wants to use system state information when/if the kernel can get it.
> >
> > Your argument is as if you said that the shape of a car's side view mirrors is not important to its top speed, because the overall shape of the chassis and engine power are much more important. But we are designing side view mirrors here, so we might as well do a good job there.
>
> If the kernel is going to make power choices automatically then it should do it everywhere, not piecemeal.

Why? Good scheduling is useful even in isolation.

> The scheduler is unaware of whether I care about a process finishing quickly or whether I care about it consuming less power.

You are posing them as if the two were mutually exclusive, while in reality they are not necessarily exclusive: it's quite possible that the highest (non-turbo) CPU frequency happens to be the most energy efficient one for a CPU with a particular workload ...

You also missed the bit of my mail where I suggested that such user preferences and tolerances can be communicated to the scheduler via a policy toggle - which the scheduler would take into account.

I suggest to use sane defaults, such as being energy efficient on battery power (within a sane threshold) and maximizing throughput on AC power (within a sane threshold). That would go a *long* way toward improving the current mess.

If Linux power efficiency was so good today then I'd not ask for kernel driven defaults - but the reality is that in terms of process scheduling we suck today (and have sucked for the last 10 years) so pretty much any approach will improve things.

> > > > The thing is, when I use Linux on a laptop then AC/battery is *the* main policy input.
> > >
> > > And it's already well handled from userspace, as it has to be.
> >
> > Not according to the developers switching away from Linux desktop distros in droves, because MacOSX or Win7 has 30%+ better battery efficiency.
>
> Ok so what you're actually telling me here is that you don't understand anything about power management and where our problems are.

Huh? In practice we suck today in terms of energy efficiency. That covers both scheduling and other areas. Saying this out loud does not tell anything about my understanding of power management... So please outline a technical point.

> > The scheduler might be a small part of the picture, but it's certainly a part of it.
>
> It's in the drivers, which is where it has been since we went tickless.

You mean the code is in drivers? Or the problem is in drivers? Both are true currently - this discussion is about the future, to make the scheduler aware of power concerns, as the scheduler (and the timer subsystem) already calculates various interesting metrics that matter to energy efficient scheduling.

> > > No, because sched_mt_powersave usually crippled performance more than it saved power and nobody makes multi-socket laptops.
> >
> > That's a user-space policy management fail right there: why wasn't this fixed? If the default policy is in the kernel we can at least fix it in one place for the most common cases. If it's spread out amongst multiple projects then progress only happens at glacial speed ...
>
> sched_mt_powersave was inherently going to have a huge impact on performance, and with modern chips that would result in the platform consuming more power. It was a feature that was useful for a small number of generations of desktop CPUs - I don't think it would ever skew the power/performance ratio in a useful direction on mobile hardware. But feel free to blame userspace for hardware design.

FYI, sched_mt_powersave is *GONE* in recent kernels, because it basically never worked. This thread is about designing and implementing something that actually works.

Thanks,

    Ingo
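[Editorial sketch, not part of the thread] Ingo's remark that the highest non-turbo frequency can also be the most energy-efficient one follows from simple arithmetic: for a fixed amount of CPU-bound work, energy is power times time, and time scales roughly as 1/frequency. The C fragment below illustrates that comparison. The struct, the function name and any numbers fed to it are hypothetical; real platforms do not necessarily behave this linearly.

```c
#include <stddef.h>

/* One hypothetical operating point: clock in MHz, package power in mW. */
struct op_point {
    unsigned int mhz;
    unsigned int milliwatts;
};

/*
 * Assumption: the work is CPU-bound, so run time is proportional to 1/mhz
 * and the ratio milliwatts/mhz is proportional to the energy per unit of
 * work. Returns the index of the operating point with the lowest ratio.
 */
size_t most_efficient_point(const struct op_point *tbl, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++) {
        /* mW[i]/MHz[i] < mW[best]/MHz[best], cross-multiplied. */
        if ((unsigned long long)tbl[i].milliwatts * tbl[best].mhz <
            (unsigned long long)tbl[best].milliwatts * tbl[i].mhz)
            best = i;
    }
    return best;
}
```

With an invented table of 800 MHz at 6 W, 1600 MHz at 9 W, 2400 MHz at 12 W and a 3000 MHz turbo point at 21 W, the 2400 MHz entry has the lowest energy per unit of work, which is the situation Ingo describes. Matthew's counterpoint still holds in this model: the fastest point (turbo) and the most efficient point are different entries.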
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Tue, Aug 21, 2012 at 08:23:46PM +0200, Ingo Molnar wrote:
> * Matthew Garrett mj...@srcf.ucam.org wrote:
> > The scheduler is unaware of whether I care about a process finishing quickly or whether I care about it consuming less power.
>
> You are posing them as if the two were mutually exclusive, while in reality they are not necessarily exclusive: it's quite possible that the highest (non-turbo) CPU frequency happens to be the most energy efficient one for a CPU with a particular workload ...

You just put in a proviso that makes them mutually exclusive. If I want it done fast, I want it done in the highest turbo CPU frequency. If I don't want it done fast, I want it done in the most efficient CPU frequency. They're probably not the same thing.

> You also missed the bit of my mail where I suggested that such user preferences and tolerances can be communicated to the scheduler via a policy toggle - which the scheduler would take into account.

Yes. And that toggle should be the thing that defines the policy under all circumstances.

> > Ok so what you're actually telling me here is that you don't understand anything about power management and where our problems are.
>
> Huh? In practice we suck today in terms of energy efficiency. That covers both scheduling and other areas. Saying this out aloud does not tell anything about my understanding of power management... So please outline a technical point.

That our power consumption is worse than under other operating systems is almost entirely because only one of our three GPU drivers implements any kind of useful power management. The power saving functionality that we expose to userspace is already used when it's safe to do so. So blaming our userspace policy management for our higher power consumption means that you can't possibly understand where the problems actually are, which indicates that you probably shouldn't be trying to tell me about optimal approaches to power management.

> You mean the code is in drivers? Or the problem is in drivers?

The problem is in the drivers.

> > sched_mt_powersave was inherently going to have a huge impact on performance, and with modern chips that would result in the platform consuming more power. It was a feature that was useful for a small number of generations of desktop CPUs - I don't think it would ever skew the power/performance ratio in a useful direction on mobile hardware. But feel free to blame userspace for hardware design.
>
> FYI, sched_mt_powersave is *GONE* in recent kernels, because it basically never worked. This thread is about designing and implementing something that actually works.

Yes. You asked me whether userspace ever used the knobs that the kernel exposed. I said no, because the only knob relevant for laptops would never improve energy efficiency on laptops. It is therefore impossible to use this as an example of userspace policy management not doing the right thing.

-- 
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
> Why? Good scheduling is useful even in isolation.

For power - I suspect it's damn near irrelevant except on a big big machine.

Unless you've sorted out your SATA, fixed your phy handling, optimised your desktop for wakeups and worked down the big wakeup causes one by one it's turd polishing.

PM means fixing the stack top to bottom, and it's a whackamole game, each one you fix you find the next. You have to sort the entire stack from desktop apps to kernel.

However benchmarks talk - so let's have some benchmarks ... on a laptop.

Alan
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 08/20/2012 11:47 PM, Vincent Guittot wrote:
> On 16 August 2012 07:03, Alex Shi wrote:
>> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>>
>>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>>
>>>> power aware scheduling), this proposal will adopt the
>>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>>
>>> Are there workloads in which "power" might provide more performance than
>>> "performance"? If so, don't use these terms.
>>>
>>
>> By design, the power scheme should have no chance of better performance.
>
> A side effect of packing small tasks on one core is that you always
> use the core with the lowest C-state which will minimize the wake up
> latency so you can sometimes get better results than performance mode
> which will try to use another core in another cluster which will take
> more time to wake up than waiting for the end of the current task.
>

Sure. In some scenarios, packing tasks into a smaller domain will bring a performance benefit.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 08/20/2012 11:36 PM, Vincent Guittot wrote:
>> > What you want is to keep track of a per-cpu utilization level (inverse
>> > of idle-time) and using PJTs per-task runnable avg see if placing the
>> > new task on will exceed the utilization limit.
>> >
>> > I think some of the Linaro people actually played around with this,
>> > Vincent?
> Sorry for the late reply but I had almost no network access during last weeks.
>
> So Linaro also works on a power aware scheduler as Peter mentioned.
>
> Based on previous tests, we have concluded that main drawback of the
> (now removed) old power scheduler was that we had no way to make
> difference between short and long running tasks whereas it's a key
> input (at least for phone) for deciding to pack tasks and for
> selecting the core on an asymmetric system.

It is hard to estimate the future from a general viewpoint, but as a hack, maybe you can add something to task_struct to hint at this. :)

> One additional key information is the power distribution in the system
> which can have a finer granularity than current sched_domain
> description. Peter's proposal was to use a SHARE_POWERLINE flag
> similarly to flags that already describe if a sched_domain share
> resources or cpu capacity.

Seems I missed this. What's the difference from the current SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES?

>
> With these 2 new information, we can have a 1st power saving scheduler
> which spread or packed tasks across core and package

Fine, I'd like to test them on X86, plus SMT and NUMA :)

>
> Vincent
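[Editorial sketch, not part of the thread] The check Peter describes (per-cpu utilization plus the task's tracked runnable average, compared against a utilization limit) can be pictured as the small C function below. Everything here is an illustrative assumption: the function name, the plain-array representation, the 0..1024 utilization scale, and in particular the tie-break of preferring the most loaded CPU that still fits, which the thread does not specify.

```c
#include <stddef.h>

/*
 * Pick a CPU for a waking task under a packing policy.
 * cpu_util[] and task_avg are on the same hypothetical 0..1024 scale.
 * Idle CPUs (utilization 0) are skipped so they can stay asleep.
 * Among busy CPUs, the most loaded one whose utilization plus task_avg
 * stays at or below limit is chosen. Returns -1 if no busy CPU fits,
 * meaning an idle CPU would have to be woken.
 */
int pick_packing_cpu(const unsigned int *cpu_util, size_t ncpus,
                     unsigned int task_avg, unsigned int limit)
{
    int best = -1;

    for (size_t i = 0; i < ncpus; i++) {
        if (cpu_util[i] == 0)
            continue;                   /* leave idle CPUs asleep */
        if (cpu_util[i] + task_avg > limit)
            continue;                   /* would exceed the limit */
        if (best < 0 || cpu_util[i] > cpu_util[best])
            best = (int)i;
    }
    return best;
}
```

This also shows why Vincent's short-versus-long task distinction matters: the decision is only as good as task_avg, and a task with no history gives the check nothing to work with.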
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Mon, 20 Aug 2012, Matthew Garrett wrote:
> On Mon, Aug 20, 2012 at 03:47:54PM +, Christoph Lameter wrote:
>
> > So please make sure that there are obvious and easy ways to switch this
> > stuff off or provide a "low latency" knob that keeps the system from
> > assuming that idle time means that full performance is not needed.
>
> That seems like an issue for cpuidle, not the scheduler. Does pm_qos not
> already do what you want?

Don't know. A simple solution is not to compile power management into the kernel.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:
> If the answer is 'yes' then there's clear cases where the kernel
> (should) automatically know the events where we switch from
> balancing for performance to balancing for power:

No. We can't identify all of these cases and we can't identify corner cases. Putting this kind of policy in the kernel is an awful idea. It should never be altering policy itself, because it'll get it wrong and people will file bugs complaining that it got it wrong and the biggest case where you *need* to be able to handle switching between performance and power optimisations (your rack management unit just told you that you're going to have to drop power consumption by 20W) is one where the kernel doesn't have all the information it needs to do this. So why bother at all?

-- 
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 17 August 2012 10:43, Paul Turner wrote:
> On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra wrote:
>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>> Since there is no power saving consideration in scheduler CFS, I has a
>>> very rough idea for enabling a new power saving schema in CFS.
>>
>> Adding Thomas, he always delights poking holes in power schemes.
>>
>>> It bases on the following assumption:
>>> 1, If there are many task crowd in system, just let few domain cpus
>>> running and let other cpus idle can not save power. Let all cpu take the
>>> load, finish tasks early, and then get into idle. will save more power
>>> and have better user experience.
>>
>> I'm not sure this is a valid assumption. I've had it explained to me by
>> various people that race-to-idle isn't always the best thing. It has to
>> do with the cost of switching power states and the duration of execution
>> and other such things.
>>
>>> 2, schedule domain, schedule group perfect match the hardware, and
>>> the power consumption unit. So, pull tasks out of a domain means
>>> potentially this power consumption unit idle.
>>
>> I'm not sure I understand what you're saying, sorry.
>>
>>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
>> between the two based on AC/BAT, UPS status and simple things like that.
>> But this seems like a later concern, you have to have something to pick
>> between before you can pick :-)
>>
>>> And in scheduling, 2 place will care the policy, load_balance() and in
>>> task fork/exec: select_task_rq_fair().
>>
>> ack
>>
>>> Here is some pseudo code try to explain the proposal behaviour in
>>> load_balance() and select_task_rq_fair();
>>
>> Oh man.. A few words outlining the general idea would've been nice.
>>
>>> load_balance() {
>>>     update_sd_lb_stats(); //get busiest group, idlest group data.
>>>
>>>     if (sd->nr_running > sd's capacity) {
>>>         //power saving policy is not suitable for
>>>         //this scenario, it runs like performance policy
>>>         mv tasks from busiest cpu in busiest group to
>>>         idlest cpu in idlest group;
>>
>> Once upon a time we talked about adding a factor to the capacity for
>> this. So say you'd allow 2*capacity before overflowing and waking
>> another power group.
>>
>> But I think we should not go on nr_running here, PJTs per-entity load
>> tracking stuff gives us much better measures -- also, repost that series
>> already Paul! :-)
>
> Yes -- I just got back from Africa this week. It's updated for almost
> all the previous comments but I ran out of time before I left to
> re-post. I'm just about caught up enough that I should be able to get
> this done over the upcoming weekend. Monday at the latest.
>
>>
>> Also, I'm not sure this is entirely correct, the thing you want to do
>> for power aware stuff is to minimize the number of active power domains,
>> this means you don't want idlest, you want least busy non-idle.
>>
>>>     } else {// the sd has enough capacity to hold all tasks.
>>>         if (sg->nr_running > sg's capacity) {
>>>             //imbalanced between groups
>>>             if (schedule policy == performance) {
>>>                 //when 2 busiest group at same busy
>>>                 //degree, need to prefer the one has
>>>                 // softest group??
>>>                 move tasks from busiest group to
>>>                 idletest group;
>>
>> So I'd leave the currently implemented scheme as performance, and I
>> don't think the above describes the current state.
>>
>>>             } else if (schedule policy == power)
>>>                 move tasks from busiest group to
>>>                 idlest group until busiest is just full
>>>                 of capacity.
>>>                 //the busiest group can balance
>>>                 //internally after next time LB,
>>
>> There's another thing we need to do, and that is collect tasks in a
>> minimal amount of power domains. The old code (that got deleted) did
>> something like that, you can revive some of the that code if needed -- I
>> just killed everything to be able to start with a clean slate.
>>
>>
>>>         } else {
>>>             //all groups has enough capacity for its tasks.
>>>             if (schedule policy == performance)
>>>                 //all tasks may has enough cpu
>>>                 //resources to run,
>>>                 //mv tasks from busiest to idlest group?
>>>                 //no, at this time,
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Mon, Aug 20, 2012 at 03:47:54PM +, Christoph Lameter wrote:

> So please make sure that there are obvious and easy ways to switch this
> stuff off or provide a "low latency" knob that keeps the system from
> assuming that idle time means that full performance is not needed.

That seems like an issue for cpuidle, not the scheduler. Does pm_qos not already do what you want?

-- 
Matthew Garrett | mj...@srcf.ucam.org
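[Editorial sketch, not part of the thread] For readers unfamiliar with the pm_qos interface Matthew points to: userspace can register a CPU wake-up latency request by opening /dev/cpu_dma_latency, writing a 32-bit value in microseconds, and keeping the descriptor open for as long as the request should hold. The helper below is a minimal sketch of that pattern; the function name is hypothetical, and the path is a parameter only so the function can be exercised without the real device. This constrains cpuidle's C-state choice. It is not claimed to cover hardware-internal behaviour such as the memory channel power-down Christoph mentions.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Ask the PM QoS layer to keep CPU wake-up latency at or below max_usec.
 * The request stays active only while the returned descriptor is open.
 * Returns the descriptor, or -1 on failure.
 */
int hold_cpu_latency(const char *path, int32_t max_usec)
{
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    if (write(fd, &max_usec, sizeof(max_usec)) != (ssize_t)sizeof(max_usec)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A latency-sensitive daemon would call hold_cpu_latency("/dev/cpu_dma_latency", 0) at startup (this normally needs root) and close the descriptor on exit to drop the request.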
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
One issue that is often forgotten is that there are users who want lowest latency and not highest performance. Our systems sit idle for most of the time but when a specific event occurs (typically a packet is received) they must react in the fastest way possible.

On every new generation of hardware and software we keep on running into various mechanisms that automatically power down when idle for a long time (to save power...). And it's pretty hard to figure these things out given the complexity of modern hardware. F.e. for the Sandybridges we found that the memory channel powers down after 2 milliseconds idle time and that was unaffected by any of the bios config options. Similar mechanisms exist in the kernel but those are easier to discover since there is source.

So please make sure that there are obvious and easy ways to switch this stuff off or provide a "low latency" knob that keeps the system from assuming that idle time means that full performance is not needed.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 16 August 2012 07:03, Alex Shi wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>
>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Are there workloads in which "power" might provide more performance than
>> "performance"? If so, don't use these terms.
>>
>
> By design, the power scheme should have no chance of better performance.

A side effect of packing small tasks on one core is that you always use the core with the lowest C-state, which will minimize the wake-up latency, so you can sometimes get better results than performance mode, which will try to use another core in another cluster that will take more time to wake up than waiting for the end of the current task.

Vincent
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 15 August 2012 13:05, Peter Zijlstra wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit. So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack
>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>>     update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>>     if (sd->nr_running > sd's capacity) {
>>         //power saving policy is not suitable for
>>         //this scenario, it runs like performance policy
>>         mv tasks from busiest cpu in busiest group to
>>         idlest cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)
>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>>     } else {// the sd has enough capacity to hold all tasks.
>>         if (sg->nr_running > sg's capacity) {
>>             //imbalanced between groups
>>             if (schedule policy == performance) {
>>                 //when 2 busiest group at same busy
>>                 //degree, need to prefer the one has
>>                 // softest group??
>>                 move tasks from busiest group to
>>                 idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>>             } else if (schedule policy == power)
>>                 move tasks from busiest group to
>>                 idlest group until busiest is just full
>>                 of capacity.
>>                 //the busiest group can balance
>>                 //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>
>>         } else {
>>             //all groups has enough capacity for its tasks.
>>             if (schedule policy == performance)
>>                 //all tasks may has enough cpu
>>                 //resources to run,
>>                 //mv tasks from busiest to idlest group?
>>                 //no, at this time, it's better to keep
>>                 //the task on current cpu.
>>                 //so, it is maybe better to do balance
>>                 //in each of groups
>>                 for_each_imbalance_groups()
>>                     move tasks from busiest cpu to
>>                     idlest cpu in each of groups;
>>
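[Editorial sketch, not part of the thread] Peter's two comments on the pseudocode (allow some factor times capacity before waking another power group, and under the power policy prefer the "least busy non-idle" group over the idlest one) can be combined into a small target-selection function. This is a sketch in plain C with hypothetical types and names. It uses nr_running only because the quoted pseudocode does; Peter explicitly suggests per-entity load tracking would be a better measure.

```c
#include <stddef.h>

enum balance_policy { POLICY_PERFORMANCE, POLICY_POWER };

struct group_stat {
    unsigned int nr_running;  /* runnable tasks in the group */
    unsigned int capacity;    /* tasks the group can run at once (nonzero) */
};

/*
 * Choose the group that should receive a task.
 * performance: the group with the lowest load relative to its capacity.
 * power: the least busy group that is already running something and is
 *        still below overflow_factor * capacity. If no such group exists,
 *        fall back to the performance choice, which may wake an idle group.
 */
int pick_target_group(const struct group_stat *g, size_t n,
                      enum balance_policy policy,
                      unsigned int overflow_factor)
{
    int idlest = -1, packed = -1;

    for (size_t i = 0; i < n; i++) {
        /* Ratios nr_running/capacity are compared by cross-multiplying. */
        if (idlest < 0 ||
            g[i].nr_running * g[idlest].capacity <
            g[idlest].nr_running * g[i].capacity)
            idlest = (int)i;

        if (g[i].nr_running == 0 ||
            g[i].nr_running >= overflow_factor * g[i].capacity)
            continue;           /* idle, or already at the overflow mark */
        if (packed < 0 ||
            g[i].nr_running * g[packed].capacity <
            g[packed].nr_running * g[i].capacity)
            packed = (int)i;
    }
    return (policy == POLICY_POWER && packed >= 0) ? packed : idlest;
}
```

With three groups holding 4, 1 and 0 tasks (capacity 4 each) and a factor of 1, the performance policy picks the idle third group while the power policy picks the second, leaving the idle group asleep.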
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/20/2012 1:06 AM, Ingo Molnar wrote:
>
> There's also cases where the kernel has insufficient information
> from the hardware and from the admin about the preferred
> characteristics/policy of the system - a tweakable fallback knob
> might be provided for that sad case.
>
> The point is, that knob is not the policy setting and it's not
> the main mechanism. It's a fallback.

if we call the knob "powersave", it better save power...
if we call it "group together" or "spread out".. no problem with that.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Mon, 2012-08-20 at 10:06 +0200, Ingo Molnar wrote:
> > > I was really more thinking of something useful for the
> > > laptops out there, when they pull the power cord it makes
> > > sense to try and keep CPUs asleep until the one that's awake
> > > is saturated.
>
> s/CPU/core ?

I was thinking logical cpus, but whatever really.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Arjan van de Ven wrote:
> On 8/15/2012 8:04 AM, Peter Zijlstra wrote:
>
> > This all sounds far too complicated.. we're talking about
> > simple spreading and packing balancers without deep arch
> > knowledge and knobs, we couldn't possibly evaluate anything
> > like that.
> >
> > I was really more thinking of something useful for the
> > laptops out there, when they pull the power cord it makes
> > sense to try and keep CPUs asleep until the one that's awake
> > is saturated.

s/CPU/core ?

> as long as you don't do that on machines with an Intel CPU..
> since that'd be the worst case behavior for tasks that run for
> more than 100 usec. (e.g. not interrupts, but almost
> everything else)

The question is, do we need to balance for 'power saving', on systems that care more about power use than they care about peak performance/throughput, at all?

If the answer is 'no' then things get rather simple.

If the answer is 'yes' then there's clear cases where the kernel (should) automatically know the events where we switch from balancing for performance to balancing for power:

 - the system boots up on battery

 - the system was on AC but the cord has been pulled and the system is now on battery

 - the administrator configures the system on AC to be power-conscious.

( and the opposite direction events wants the scheduler to switch from 'balancing for power' to 'balancing for performance'. )

There's also cases where the kernel has insufficient information from the hardware and from the admin about the preferred characteristics/policy of the system - a tweakable fallback knob might be provided for that sad case.

The point is, that knob is not the policy setting and it's not the main mechanism. It's a fallback.

Thanks,

    Ingo
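[Editorial sketch, not part of the thread] The switching events Ingo lists reduce to a small decision function: an explicit administrator setting wins, otherwise the power source decides. The C sketch below is only an illustration of that rule. The enum and function names are hypothetical, and letting the administrator force performance while on battery is an assumption that goes beyond the cases listed in the mail.

```c
#include <stdbool.h>

enum sched_goal { BALANCE_FOR_PERFORMANCE, BALANCE_FOR_POWER };
enum admin_pref { ADMIN_NONE, ADMIN_PERFORMANCE, ADMIN_POWER };

/*
 * Decide what the balancer should optimise for.
 * An explicit administrator preference overrides the power source.
 * With no preference, battery means power and AC means performance.
 */
enum sched_goal current_goal(bool on_battery, enum admin_pref pref)
{
    if (pref == ADMIN_POWER)
        return BALANCE_FOR_POWER;
    if (pref == ADMIN_PERFORMANCE)
        return BALANCE_FOR_PERFORMANCE;
    return on_battery ? BALANCE_FOR_POWER : BALANCE_FOR_PERFORMANCE;
}
```

Booting on battery and pulling the cord both map to current_goal(true, ADMIN_NONE). The "power-conscious on AC" case maps to current_goal(false, ADMIN_POWER). Matthew's objection elsewhere in the thread is about inputs this function does not have, such as a rack management unit demanding a power cut.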
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
* Arjan van de Ven ar...@linux.intel.com wrote: On 8/15/2012 8:04 AM, Peter Zijlstra wrote: This all sounds far too complicated.. we're talking about simple spreading and packing balancers without deep arch knowledge and knobs, we couldn't possibly evaluate anything like that. I was really more thinking of something useful for the laptops out there, when they pull the power cord it makes sense to try and keep CPUs asleep until the one that's awake is saturated. s/CPU/core ? as long as you don't do that on machines with an Intel CPU.. since that'd be the worst case behavior for tasks that run for more than 100 usec. (e.g. not interrupts, but almost everything else) The question is, do we need to balance for 'power saving', on systems that care more about power use than they care about peak performance/throughput, at all? If the answer is 'no' then things get rather simple. If the answer is 'yes' then there's clear cases where the kernel (should) automatically know the events where we switch from balancing for performance to balancing for power: - the system boots up on battery - the system was on AC but the cord has been pulled and the system is now on battery - the administrator configures the system on AC to be power-conscious. ( and the opposite direction events wants the scheduler to switch from 'balancing for power' to 'balancing for performance'. ) There's also cases where the kernel has insufficient information from the hardware and from the admin about the preferred characteristics/policy of the system - a tweakable fallback knob might be provided for that sad case. The point is, that knob is not the policy setting and it's not the main mechanism. It's a fallback. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Mon, 2012-08-20 at 10:06 +0200, Ingo Molnar wrote:
> > I was really more thinking of something useful for the laptops out there, when they pull the power cord it makes sense to try and keep CPUs asleep until the one that's awake is saturated.
>
> s/CPU/core ?

I was thinking logical cpus, but whatever really.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/20/2012 1:06 AM, Ingo Molnar wrote:
> There's also cases where the kernel has insufficient information from the hardware and from the admin about the preferred characteristics/policy of the system - a tweakable fallback knob might be provided for that sad case.
>
> The point is, that knob is not the policy setting and it's not the main mechanism. It's a fallback.

if we call the knob "powersave", it better save power...
if we call it "group together" or "spread out".. no problem with that.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 15 August 2012 13:05, Peter Zijlstra a.p.zijls...@chello.nl wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
> > Since there is no power saving consideration in the CFS scheduler, I have a very rough idea for enabling a new power saving scheme in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
> > It is based on the following assumptions:
> > 1, If there are many tasks crowding the system, letting only a few domains' cpus run while the other cpus idle cannot save power. Letting all cpus take the load, finish the tasks early, and then go idle will save more power and give a better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by various people that race-to-idle isn't always the best thing. It has to do with the cost of switching power states and the duration of execution and other such things.
>
> > 2, The schedule domains and schedule groups perfectly match the hardware and the power consumption units. So pulling tasks out of a domain potentially means that this power consumption unit goes idle.
>
> I'm not sure I understand what you're saying, sorry.
>
> > So, following what Peter mentioned in commit 8e7fbcbc22c (sched: Remove stale power aware scheduling), this proposal will adopt the sched_balance_policy concept and use 2 kinds of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches between the two based on AC/BAT, UPS status and simple things like that. But this seems like a later concern, you have to have something to pick between before you can pick :-)
>
> > And in scheduling, 2 places will care about the policy: load_balance() and, on task fork/exec, select_task_rq_fair().
>
> ack
>
> > Here is some pseudo code that tries to explain the proposed behaviour in load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
> > load_balance() {
> >     update_sd_lb_stats(); //get busiest group, idlest group data.
> >
> >     if (sd->nr_running > sd's capacity) {
> >         //power saving policy is not suitable for
> >         //this scenario, it runs like performance policy
> >         mv tasks from busiest cpu in busiest group to
> >         idlest cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for this. So say you'd allow 2*capacity before overflowing and waking another power group.
>
> But I think we should not go on nr_running here, PJT's per-entity load tracking stuff gives us much better measures -- also, repost that series already Paul! :-)
>
> Also, I'm not sure this is entirely correct, the thing you want to do for power aware stuff is to minimize the number of active power domains, this means you don't want idlest, you want least busy non-idle.
>
> >     } else { // the sd has enough capacity to hold all tasks.
> >         if (sg->nr_running > sg's capacity) {
> >             //imbalanced between groups
> >             if (schedule policy == performance) {
> >                 //when 2 busiest group at same busy
> >                 //degree, need to prefer the one has
> >                 // softest group??
> >                 move tasks from busiest group to
> >                 idlest group;
>
> So I'd leave the currently implemented scheme as performance, and I don't think the above describes the current state.
>
> >             } else if (schedule policy == power)
> >                 move tasks from busiest group to
> >                 idlest group until busiest is just full
> >                 of capacity.
> >                 //the busiest group can balance
> >                 //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a minimal amount of power domains. The old code (that got deleted) did something like that, you can revive some of that code if needed -- I just killed everything to be able to start with a clean slate.
>
> >         } else {
> >             //all groups have enough capacity for their tasks.
> >             if (schedule policy == performance)
> >                 //all tasks may have enough cpu
> >                 //resources to run,
> >                 //mv tasks from busiest to idlest group?
> >                 //no, at this time, it's better to keep
> >                 //the task on current cpu.
> >                 //so, it is maybe better to do balance
> >                 //in each of groups
> >                 for_each_imbalance_groups()
> >                     move tasks from busiest cpu to
> >                     idlest cpu in each of groups;
> >             else if (schedule policy == power) {
> >                 if (no hard pin in idlest group)
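
Peter's remark on the power policy (do not pick the idlest group, pick the least busy group that is already awake) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `struct group_stat`, `pick_target_group()` and the `util`/`capacity` fields are hypothetical names, not kernel structures, and utilization is assumed to come from some load-tracking measure rather than from nr_running.

```c
#include <assert.h>
#include <stddef.h>

enum balance_policy { POLICY_PERFORMANCE, POLICY_POWER };

/* Hypothetical per-group summary; not a kernel structure. */
struct group_stat {
    unsigned int util;      /* current utilization, arbitrary units */
    unsigned int capacity;  /* utilization the group can hold */
};

/*
 * Return the index of the group that should receive `task_util` more
 * utilization, or -1 if no group has room.
 *
 *  performance: the idlest group that has room.
 *  power:       the least busy *non-idle* group that has room, so idle
 *               groups stay asleep; an idle group is used only when no
 *               awake group can take the task.
 */
int pick_target_group(const struct group_stat *g, size_t n,
                      unsigned int task_util, enum balance_policy policy)
{
    int best = -1;      /* least busy non-idle group with room (power) */
    int idlest = -1;    /* idlest group with room (any policy) */

    assert(g != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (g[i].util + task_util > g[i].capacity)
            continue;   /* this group would overflow */

        if (idlest < 0 || g[i].util < g[idlest].util)
            idlest = (int)i;

        if (policy == POLICY_POWER && g[i].util > 0 &&
            (best < 0 || g[i].util < g[best].util))
            best = (int)i;
    }
    return best >= 0 ? best : idlest;
}
```

With three groups at utilization 0, 60 and 30 (capacity 100 each) and a task of size 20, the performance policy picks the idle group while the power policy picks the group at 30 and leaves the idle one alone.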
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 16 August 2012 07:03, Alex Shi alex@intel.com wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
> > On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
> > > power aware scheduling), this proposal will adopt the sched_balance_policy concept and use 2 kind of policy: performance, power.
> >
> > Are there workloads in which power might provide more performance than performance? If so, don't use these terms.
>
> By design, the power scheme should have no chance of better performance.

A side effect of packing small tasks on one core is that you always use the core with the lowest C-state, which will minimize the wake up latency, so you can sometimes get better results than performance mode, which will try to use another core in another cluster that will take more time to wake up than waiting for the end of the current task.

Vincent
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
One issue that is often forgotten is that there are users who want lowest latency and not highest performance. Our systems sit idle for most of the time but when a specific event occurs (typically a packet is received) they must react in the fastest way possible.

On every new generation of hardware and software we keep on running into various mechanisms that automatically power down when idle for a long time (to save power...). And it's pretty hard to figure these things out given the complexity of modern hardware. E.g. for the Sandybridges we found that the memory channel powers down after 2 milliseconds idle time and that was unaffected by any of the bios config options. Similar mechanisms exist in the kernel but those are easier to discover since there is source.

So please make sure that there are obvious and easy ways to switch this stuff off or provide a low latency knob that keeps the system from assuming that idle time means that full performance is not needed.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Mon, Aug 20, 2012 at 03:47:54PM +, Christoph Lameter wrote:
> So please make sure that there are obvious and easy ways to switch this stuff off or provide a low latency knob that keeps the system from assuming that idle time means that full performance is not needed.

That seems like an issue for cpuidle, not the scheduler. Does pm_qos not already do what you want?

-- 
Matthew Garrett | mj...@srcf.ucam.org
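
For reference, the user-space side of pm_qos is the character device /dev/cpu_dma_latency: a process writes a 32-bit latency bound in microseconds, and the request holds for as long as the file descriptor stays open. A minimal sketch in C follows; the helper name `hold_cpu_latency()` is made up for illustration, the device path is passed in by the caller, and opening the real device normally needs sufficient privileges.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask for a maximum CPU wakeup latency, in microseconds, by writing a
 * 32-bit value to a PM QoS device node such as /dev/cpu_dma_latency.
 * The request is dropped when the descriptor is closed, so the caller
 * must keep the returned descriptor open while low latency is needed.
 *
 * Returns the open descriptor, or -1 on failure.
 */
int hold_cpu_latency(const char *dev_path, int32_t max_latency_us)
{
    assert(dev_path != NULL);

    int fd = open(dev_path, O_WRONLY);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, &max_latency_us, sizeof(max_latency_us));
    if (n != (ssize_t)sizeof(max_latency_us)) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A latency-sensitive daemon would call `hold_cpu_latency("/dev/cpu_dma_latency", 0)` at startup and close the descriptor only on exit; whether that also covers non-CPU mechanisms such as the memory channel power-down mentioned above is not something this interface promises.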
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 17 August 2012 10:43, Paul Turner p...@google.com wrote:
> On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra a.p.zijls...@chello.nl wrote:
> > On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
> > > Since there is no power saving consideration in the CFS scheduler, I have a very rough idea for enabling a new power saving scheme in CFS.
> >
> > Adding Thomas, he always delights poking holes in power schemes.
> >
> > > It is based on the following assumptions:
> > > 1, If there are many tasks crowding the system, letting only a few domains' cpus run while the other cpus idle cannot save power. Letting all cpus take the load, finish the tasks early, and then go idle will save more power and give a better user experience.
> >
> > I'm not sure this is a valid assumption. I've had it explained to me by various people that race-to-idle isn't always the best thing. It has to do with the cost of switching power states and the duration of execution and other such things.
> >
> > > 2, The schedule domains and schedule groups perfectly match the hardware and the power consumption units. So pulling tasks out of a domain potentially means that this power consumption unit goes idle.
> >
> > I'm not sure I understand what you're saying, sorry.
> >
> > > So, following what Peter mentioned in commit 8e7fbcbc22c (sched: Remove stale power aware scheduling), this proposal will adopt the sched_balance_policy concept and use 2 kinds of policy: performance, power.
> >
> > Yay, ideally we'd also provide a 3rd option: auto, which simply switches between the two based on AC/BAT, UPS status and simple things like that. But this seems like a later concern, you have to have something to pick between before you can pick :-)
> >
> > > And in scheduling, 2 places will care about the policy: load_balance() and, on task fork/exec, select_task_rq_fair().
> >
> > ack
> >
> > > Here is some pseudo code that tries to explain the proposed behaviour in load_balance() and select_task_rq_fair();
> >
> > Oh man.. A few words outlining the general idea would've been nice.
> >
> > > load_balance() {
> > >     update_sd_lb_stats(); //get busiest group, idlest group data.
> > >
> > >     if (sd->nr_running > sd's capacity) {
> > >         //power saving policy is not suitable for
> > >         //this scenario, it runs like performance policy
> > >         mv tasks from busiest cpu in busiest group to
> > >         idlest cpu in idlest group;
> >
> > Once upon a time we talked about adding a factor to the capacity for this. So say you'd allow 2*capacity before overflowing and waking another power group.
> >
> > But I think we should not go on nr_running here, PJT's per-entity load tracking stuff gives us much better measures -- also, repost that series already Paul! :-)
>
> Yes -- I just got back from Africa this week. It's updated for almost all the previous comments but I ran out of time before I left to re-post. I'm just about caught up enough that I should be able to get this done over the upcoming weekend. Monday at the latest.
>
> > Also, I'm not sure this is entirely correct, the thing you want to do for power aware stuff is to minimize the number of active power domains, this means you don't want idlest, you want least busy non-idle.
> >
> > >     } else { // the sd has enough capacity to hold all tasks.
> > >         if (sg->nr_running > sg's capacity) {
> > >             //imbalanced between groups
> > >             if (schedule policy == performance) {
> > >                 //when 2 busiest group at same busy
> > >                 //degree, need to prefer the one has
> > >                 // softest group??
> > >                 move tasks from busiest group to
> > >                 idlest group;
> >
> > So I'd leave the currently implemented scheme as performance, and I don't think the above describes the current state.
> >
> > >             } else if (schedule policy == power)
> > >                 move tasks from busiest group to
> > >                 idlest group until busiest is just full
> > >                 of capacity.
> > >                 //the busiest group can balance
> > >                 //internally after next time LB,
> >
> > There's another thing we need to do, and that is collect tasks in a minimal amount of power domains. The old code (that got deleted) did something like that, you can revive some of that code if needed -- I just killed everything to be able to start with a clean slate.
> >
> > >         } else {
> > >             //all groups have enough capacity for their tasks.
> > >             if (schedule policy == performance)
> > >                 //all tasks may have enough cpu
> > >                 //resources to run,
> > >                 //mv tasks from busiest to idlest group?
> > >                 //no, at this time, it's better to keep
> > >                 //the task on current cpu.
> > >                 //so, it is maybe better to do balance
> > >                 //in each of groups
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:
> If the answer is 'yes' then there's clear cases where the kernel (should) automatically know the events where we switch from balancing for performance to balancing for power:

No. We can't identify all of these cases and we can't identify corner cases. Putting this kind of policy in the kernel is an awful idea. It should never be altering policy itself, because it'll get it wrong and people will file bugs complaining that it got it wrong and the biggest case where you *need* to be able to handle switching between performance and power optimisations (your rack management unit just told you that you're going to have to drop power consumption by 20W) is one where the kernel doesn't have all the information it needs to do this. So why bother at all?

-- 
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Mon, 20 Aug 2012, Matthew Garrett wrote:
> On Mon, Aug 20, 2012 at 03:47:54PM +, Christoph Lameter wrote:
> > So please make sure that there are obvious and easy ways to switch this stuff off or provide a low latency knob that keeps the system from assuming that idle time means that full performance is not needed.
>
> That seems like an issue for cpuidle, not the scheduler. Does pm_qos not already do what you want?

Don't know. A simple solution is not to compile power management into the kernel.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 08/20/2012 11:36 PM, Vincent Guittot wrote:
> > What you want is to keep track of a per-cpu utilization level (inverse of idle-time) and using PJT's per-task runnable avg see if placing the new task on will exceed the utilization limit.
> >
> > I think some of the Linaro people actually played around with this, Vincent?
>
> Sorry for the late reply but I had almost no network access during last weeks.
>
> So Linaro also works on a power aware scheduler, as Peter mentioned.
>
> Based on previous tests, we have concluded that the main drawback of the (now removed) old power scheduler was that we had no way to distinguish between short and long running tasks, whereas it's a key input (at least for phones) for deciding to pack tasks and for selecting the core on an asymmetric system.

It is hard to estimate the future in general, but as a hack, maybe you can add something to task_struct to hint at this. :)

> One additional key piece of information is the power distribution in the system, which can have a finer granularity than the current sched_domain description. Peter's proposal was to use a SHARE_POWERLINE flag, similar to the flags that already describe whether a sched_domain shares resources or cpu capacity.

Seems I missed this. What's the difference from the current SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES?

> With these 2 new pieces of information, we can have a first power saving scheduler which spreads or packs tasks across cores and packages.

Fine, I'd like to test them on x86, plus SMT and NUMA :)

> Vincent
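
The placement test Peter describes in the quoted text (per-cpu utilization as the inverse of idle time, plus the task's runnable average, compared against a limit) could look roughly like the C sketch below. Everything here is an assumption for illustration: the 1024 fixed-point scale, the function names, and the idea that the task average is expressed on the same scale are not taken from any posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed fixed-point scale: 1024 means 100% of one CPU. */
#define UTIL_SCALE 1024u

/*
 * Utilization of a CPU over an observation window, computed as the
 * inverse of its idle time and scaled to 0..UTIL_SCALE.
 */
unsigned int cpu_utilization(uint64_t idle_ns, uint64_t window_ns)
{
    assert(window_ns > 0 && idle_ns <= window_ns);
    return UTIL_SCALE - (unsigned int)(idle_ns * UTIL_SCALE / window_ns);
}

/*
 * Would placing a task with the given runnable average on this CPU keep
 * the CPU at or below the utilization limit?  Note that the number of
 * runnable tasks does not appear anywhere in the test.
 */
bool task_fits(unsigned int cpu_util, unsigned int task_avg,
               unsigned int util_limit)
{
    assert(util_limit <= UTIL_SCALE);
    return cpu_util + task_avg <= util_limit;
}
```

Two tasks that each run a quarter of the time fit on one CPU under this test even though nr_running would report the CPU as having two tasks, which is the case Peter uses to argue against balancing on nr_running.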
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 08/20/2012 11:47 PM, Vincent Guittot wrote:
> On 16 August 2012 07:03, Alex Shi alex@intel.com wrote:
> > On 08/16/2012 12:19 AM, Matthew Garrett wrote:
> > > On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
> > > > power aware scheduling), this proposal will adopt the sched_balance_policy concept and use 2 kind of policy: performance, power.
> > >
> > > Are there workloads in which power might provide more performance than performance? If so, don't use these terms.
> >
> > By design, the power scheme should have no chance of better performance.
>
> A side effect of packing small tasks on one core is that you always use the core with the lowest C-state, which will minimize the wake up latency, so you can sometimes get better results than performance mode, which will try to use another core in another cluster that will take more time to wake up than waiting for the end of the current task.

Sure. In some scenarios packing tasks into a smaller domain will bring a performance benefit.
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
Hi all,

I can probably add some bits to the discussion, after all I'm preparing a talk for Plumbers that is strictly related :-). My points are not CFS related (so feel free to ignore me), but they would probably be interesting if we talk about power aware scheduling in Linux in general.

On 08/16/2012 04:31 PM, Morten Rasmussen wrote:
> Hi all,
>
> On Wed, Aug 15, 2012 at 12:05:38PM +0100, Peter Zijlstra wrote:
> > > sub proposal:
> > > 1, If it's possible to balance task on idlest cpu not appointed 'balance cpu'. If so, it may reduce one more time balancing. The idlest cpu can prefer the new idle cpu; and is the least load cpu;
> > > 2, se or task load is good for running time setting. but it should be the second basis in load balancing. The first basis of LB is running tasks' number in group/cpu. Since whatever the weight of the groups is, if the tasks number is less than cpu number, the group still has capacity to take more tasks. (will consider the SMT cpu power or other big/little cpu capacity on ARM.)
> >
> > Ah, no we shouldn't balance on nr_running, but on the amount of time consumed. Imagine two tasks being woken at the same time, both tasks will only run a fraction of the available time, you don't want this to exceed your capacity because ran back to back the one cpu will still be mostly idle.
> >
> > What you want is to keep track of a per-cpu utilization level (inverse of idle-time) and using PJT's per-task runnable avg see if placing the new task on will exceed the utilization limit.
> >
> > I think some of the Linaro people actually played around with this, Vincent?
>
> I agree. A better measure of cpu load and task weight than nr_running and the current task load weight is necessary to do proper task packing.
>
> I have used PJT's per-task load-tracking for scheduling experiments on heterogeneous systems and my experience is that it works quite well for determining the load of a specific task. Something like PJT's work would be a good starting point for power aware scheduling and better support for heterogeneous systems.

I haven't tried PJT's work myself (it's on my todo list), but with SCHED_DEADLINE you can see the picture from the other side and, instead of tracking per-task load, you can enforce a task not to exceed its allowed "load". This is done by reserving some fraction of CPU time (runtime or budget) every predefined interval of time (period). Then this allocated bandwidth is enforced with proper scheduling mechanisms (BTW, I have another talk at Plumbers explaining the SCHED_DEADLINE patchset in more detail).

> One of the biggest challenges here for load-balancing is translating task load from one cpu to another, as the task load is influenced by the total load of its cpu. So a task that appears to be heavy on an oversubscribed cpu might not be so heavy after all when it is moved to a cpu with plenty of cpu time to spare. This issue is likely to be more pronounced on heterogeneous systems and systems with aggressive frequency scaling. It might be possible to avoid having to translate load or that it doesn't really matter, but I haven't completely convinced myself yet.

This is probably a key point where deadline scheduling could be helpful. A task load in this case cannot be influenced by other tasks in the system and it is one of the known variables. Actually, this is only half true. Isolation is achieved only considering CPU time between concurrently executing tasks; other terms like cache interference etc. cannot be controlled. The nice fact is that a misbehaving task, one that attempts or experiences deviations from its allowed CPU fraction, is throttled and cannot influence other tasks' behavior. As I will show during my talk (power aware deadline scheduling), other techniques are required when a task's execution time is not strictly known beforehand, whether this is due to interference or intrinsic variability in the performed activity. They fall in the domain of adaptive/feedback scheduling.

> My point is that getting the task load right or at least better is a fundamental requirement for improving power aware scheduling.

Fully agree :-).

As I said, I just wanted to add something, sorry if I misinterpreted the purpose of this discussion.

Best Regards,

- Juri Lelli
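
The reservation idea described above (a runtime budget granted every period, with throttling once the budget is used up) can be modelled with a few lines of C. This is a toy accounting model under stated assumptions, not the SCHED_DEADLINE implementation: `struct reservation`, `charge_runtime()`, `replenish()` and `bandwidth_permille()` are hypothetical names, and real deadline scheduling also has to order tasks by deadline, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-task reservation: runtime_ns of CPU every period_ns. */
struct reservation {
    uint64_t runtime_ns;  /* budget granted per period */
    uint64_t period_ns;   /* replenishment interval */
    uint64_t used_ns;     /* budget consumed in the current period */
};

/* Reserved CPU fraction in parts per thousand (100 == 10% of a CPU). */
unsigned int bandwidth_permille(const struct reservation *r)
{
    assert(r != NULL && r->period_ns > 0);
    return (unsigned int)(r->runtime_ns * 1000 / r->period_ns);
}

/*
 * Account ran_ns of execution against the budget.  Returns true when the
 * budget is exhausted, i.e. the task must be throttled until the next
 * period so that it cannot eat into other tasks' CPU time.
 */
bool charge_runtime(struct reservation *r, uint64_t ran_ns)
{
    assert(r != NULL && r->runtime_ns <= r->period_ns);
    r->used_ns += ran_ns;
    return r->used_ns >= r->runtime_ns;
}

/* Start of a new period: the full budget is available again. */
void replenish(struct reservation *r)
{
    assert(r != NULL);
    r->used_ns = 0;
}
```

Because the reserved fraction is fixed up front, the task's "load" is a known input to placement rather than something that has to be estimated, which is the contrast with per-task load tracking drawn in the message.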
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/18/2012 7:33 AM, Luming Yu wrote:
> saving mode. But obviously, we need to spread as much as possible
> across all cores in another socket(to race to idle). So from the
> example above, we see a threshold that we need to reference before
> selecting one from two complete different policy: spread or not
> spread... As long as there is hardware limitation, we could always
> need knob like that referenced threshold to adapt on different
> hardware in one kernel

I think the physics are slightly simpler, if you abstract it one level.

every reasonable system out there has things that can be off if all cores are in the deep power state, that have to be on if even one of them is alive. On "big core" Intel, that's uncore and memory controller, on small core (atom/phone) Intel that is the chipset fabric only. On ARM it might be something else. On all of them it's some clocks, PLLs, voltage regulators etc etc.

not all chips are advanced enough to aggressively turn these things off when they could, but most are nowadays.

so in abstract, there's a power offset that gets you from 0 to 1. Let's call this P0.
there is also a power offset to go from 1 to 2, but that's smaller than 0->1. Let's call this Pc.

or rather, 0->1 has the same kind of offset as 1->2 plus some extra offset.. so P0 = Pbase + Pc

there's also an energy cost for waking a cpu up (and letting it go back to sleep afterwards)... call it Ewake

so the abstract question is:
you're running a task A on cpu 0
you want to also run a task B, which you estimate to run for time T

it's more energy efficient to wake a 2nd cpu if

    Ewake < T * Pbase

(this assumes all cores are the same, you get a more complex formula if that's not the case, where T is even core specific)

there is no hardware policy *switch* in such formula, only parameters.
If Pbase = 0 (e.g. your hardware has no extra power savings), then the formula very naturally leads to one extreme of the behavior;
if Ewake is very high, then it leads to the other extreme.

The only other variable is the user preference between power and performance balance.. but that's a pure preference, not hardware specific anymore.
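
The inequality above is small enough to write down directly. One way to read it, offered here as an interpretation rather than something the message spells out: running B in parallel on a second cpu costs Ewake + T * Pc, while running it on the already awake cpu keeps the shared parts powered for an extra T at P0 = Pbase + Pc, and the Pc terms cancel. The C sketch below assumes joules, seconds and watts as units and identical cores; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether waking a second, identical cpu for task B is the more
 * energy efficient choice, using the rule from the message above:
 *
 *     wake the 2nd cpu  if  Ewake < T * Pbase
 *
 * e_wake_joules : energy to wake a cpu and let it go back to sleep
 * t_seconds     : estimated run time of task B
 * p_base_watts  : extra power of the shared parts (P0 - Pc)
 */
bool worth_waking_second_cpu(double e_wake_joules, double t_seconds,
                             double p_base_watts)
{
    assert(e_wake_joules >= 0.0);
    assert(t_seconds >= 0.0);
    assert(p_base_watts >= 0.0);
    return e_wake_joules < t_seconds * p_base_watts;
}
```

As the message says, there is no policy switch in this test, only the three parameters; a platform would supply Ewake and Pbase, and the scheduler would supply its estimate of T.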
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Sat, Aug 18, 2012 at 4:16 AM, Chris Friesen wrote:
> On 08/17/2012 01:50 PM, Matthew Garrett wrote:
>>
>> On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
>>>
>>> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
>>
>>
>>> The datasheet for the Xeon E5 (my variant at least) says it doesn't
>>> do C7 so never powers down the LLC. However, as you said earlier
>>> once you can put the socket into C6 which saves about 30W compared
>>> to C1E.
>>>
>>> So as far as I can see with this CPU at least you would benefit from
>>> shutting down a whole socket when possible.
>>
>>
>> Having any active cores on the system prevents all packages from going
>> into PC6 or deeper. What I'm not clear on is whether less deep package C
>> states are also blocked.
>>
>
> Right, we need the memory controller.
>
> The E5 datasheet is a bit ambiguous, it reads:
>
>
> A processor enters the package C3 low power state when:
> -At least one core is in the C3 state.
> -The other cores are in a C3 or lower power state, and the processor has
> been granted permission by the platform.
>
>
> Unfortunately it doesn't specify whether that is the other cores in the
> package, or the other cores on the whole system.
>

Hardware limitations are just part of the problem. We could find them out from various white papers or data sheets, or by testing. To me, the key problem in terms of power and performance balancing still lies in the CPU and memory allocation method. For example, on a system where we can benefit from shutting down a whole socket when possible, if a workload allocates 50% of the CPU cycles and 50% of the memory bandwidth and space on a (modern) two socket system, an ideal allocation method (I assume it's the goal of our discussion) should leave the CPU, cache, memory controller and memory on one socket (node) completely idle and in the deepest power saving mode. But obviously, we need to spread as much as possible across all cores in the other socket (to race to idle). So from the example above, we see a threshold that we need to reference before selecting one of two completely different policies: spread or not spread... As long as there are hardware limitations, we will always need a knob like that referenced threshold to adapt to different hardware in one kernel.

/l
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/18/2012 7:33 AM, Luming Yu wrote:
> saving mode. But obviously, we need to spread as much as possible
> across all cores in another socket(to race to idle).
>
> So from the example above, we see a threshold that we need to
> reference before selecting one from two complete different policy:
> spread or not spread... As long as there is hardware limitation, we
> could always need knob like that referenced threshold to adapt on
> different hardware in one kernel

I think the physics are slightly simpler, if you abstract it one level.

every reasonable system out there has things that can be off if all cores are in the deep power state, that have to be on if even one of them is alive. On big core Intel, that's uncore and memory controller, on small core (atom/phone) Intel that is the chipset fabric only. On ARM it might be something else. On all of them it's some clocks, PLLs, voltage regulators etc etc.

not all chips are advanced enough to aggressively turn these things off when they could, but most are nowadays.

so in abstract, there's a power offset that gets you from 0 to 1. Let's call this P0
there is also a power offset to go from 1 to 2, but that's smaller than 0->1. Let's call this Pc
or rather, 0->1 has the same kind of offset as 1->2 plus some extra offset..
so P0 = Pbase + Pc

there's also an energy cost for waking a cpu up (and letting it go back to sleep afterwards)... call it Ewake

so the abstract question is
you're running a task A on cpu 0
you want to also run a task B, which you estimate to run for time T

it's more energy efficient to wake a 2nd cpu if
Ewake < T * Pbase

(this assumes all cores are the same, you get a more complex formula if that's not the case, where T is even core specific)

there is no hardware policy *switch* in such formula, only parameters. If Pbase = 0 (e.g. your hardware has no extra power savings), then the formula very naturally leads to one extreme of the behavior
if Ewake is very high, then it leads to the other extreme.

The only other variable is the user preference between power and performance balance.. but that's a pure preference, not hardware specific anymore.
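The break-even condition above can be checked with a toy energy model. This is a sketch under assumptions that are not in the message: task A keeps CPU 0 busy for at least T, all cores are identical, and the function names are made up for illustration. Queuing B behind A keeps the system out of deep idle for T longer and costs T * (Pbase + Pc); waking a second CPU costs Ewake + T * Pc because Pbase is already being paid. Comparing the two gives Ewake < T * Pbase.

```python
def extra_energy_if_queued(t_run, p_base, p_core):
    """Task B runs on the already-awake CPU after task A.

    The system stays awake for t_run longer, so it pays both the shared
    offset (Pbase) and the per-core offset (Pc) for that time.
    """
    return t_run * (p_base + p_core)


def extra_energy_if_woken(t_run, p_core, e_wake):
    """Task B runs in parallel on a second CPU that has to be woken.

    Pbase is already being paid because CPU 0 is running task A, so the
    extra cost is the wake/sleep round trip plus the second core's power.
    """
    return e_wake + t_run * p_core


def should_wake_second_cpu(t_run, p_base, e_wake):
    """The condition from the message: Ewake < T * Pbase."""
    return e_wake < t_run * p_base
```

With made-up values of Pbase = 2 W and Ewake = 300 microjoules, the break-even run time is Ewake / Pbase = 150 microseconds: a task expected to run for 100 microseconds should stay on the busy CPU, while one expected to run for 200 microseconds justifies waking a second CPU. Setting Pbase to 0 makes the condition false for every T, which matches the remark that hardware differences show up as parameters rather than as a policy switch.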
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 08/17/2012 01:50 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
>> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
>> The datasheet for the Xeon E5 (my variant at least) says it doesn't
>> do C7 so never powers down the LLC. However, as you said earlier
>> once you can put the socket into C6 which saves about 30W compared
>> to C1E.
>>
>> So as far as I can see with this CPU at least you would benefit from
>> shutting down a whole socket when possible.
>
> Having any active cores on the system prevents all packages from going
> into PC6 or deeper. What I'm not clear on is whether less deep package C
> states are also blocked.

Right, we need the memory controller.

The E5 datasheet is a bit ambiguous, it reads:

A processor enters the package C3 low power state when:
-At least one core is in the C3 state.
-The other cores are in a C3 or lower power state, and the processor has been granted permission by the platform.

Unfortunately it doesn't specify whether that is the other cores in the package, or the other cores on the whole system.

Chris
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
> The datasheet for the Xeon E5 (my variant at least) says it doesn't
> do C7 so never powers down the LLC. However, as you said earlier
> once you can put the socket into C6 which saves about 30W compared
> to C1E.
>
> So as far as I can see with this CPU at least you would benefit from
> shutting down a whole socket when possible.

Having any active cores on the system prevents all packages from going into PC6 or deeper. What I'm not clear on is whether less deep package C states are also blocked.

-- 
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 08/17/2012 12:47 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
>> On 8/17/2012 11:41 AM, Matthew Garrett wrote:
>>> On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
>>>> this is ... a dubiously general statement.
>>>>
>>>> for good power, at least on Intel cpus, you want to spread. Parallelism is
>>>> efficient.
>>>
>>> Is this really true? In a two-socket system I'd have thought the benefit
>>> of keeping socket 1 in package C3 outweighed the cost of keeping socket
>>> 0 awake for slightly longer.
>>
>> not on Intel
>>
>> you can't enter package c3 either until every one is down.
>> (e.g. memory controller must stay on etc etc)
>
> I thought that was only PC6 - is there any reason why the package cache
> can't be entirely powered down?

According to
http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf
once you're in package C6 then you can go to package C7.

The datasheet for the Xeon E5 (my variant at least) says it doesn't do C7 so never powers down the LLC. However, as you said earlier once you can put the socket into C6 which saves about 30W compared to C1E.

So as far as I can see with this CPU at least you would benefit from shutting down a whole socket when possible.

Chris
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
> On 8/17/2012 11:41 AM, Matthew Garrett wrote:
> > On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
> >> this is ... a dubiously general statement.
> >>
> >> for good power, at least on Intel cpus, you want to spread. Parallelism is
> >> efficient.
> >
> > Is this really true? In a two-socket system I'd have thought the benefit
> > of keeping socket 1 in package C3 outweighed the cost of keeping socket
> > 0 awake for slightly longer.
>
> not on Intel
>
> you can't enter package c3 either until every one is down.
> (e.g. memory controller must stay on etc etc)

I thought that was only PC6 - is there any reason why the package cache can't be entirely powered down?

-- 
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/17/2012 11:41 AM, Matthew Garrett wrote:
> On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
>>> *Power policy*:
>>>
>>> So how is power policy different? As Peter says,'pack more than spread
>>> more'.
>>
>> this is ... a dubiously general statement.
>>
>> for good power, at least on Intel cpus, you want to spread. Parallelism is
>> efficient.
>
> Is this really true? In a two-socket system I'd have thought the benefit
> of keeping socket 1 in package C3 outweighed the cost of keeping socket
> 0 awake for slightly longer.

not on Intel

you can't enter package c3 either until every one is down.
(e.g. memory controller must stay on etc etc)
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
> > *Power policy*:
> >
> > So how is power policy different? As Peter says,'pack more than spread
> > more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is
> efficient.

Is this really true? In a two-socket system I'd have thought the benefit of keeping socket 1 in package C3 outweighed the cost of keeping socket 0 awake for slightly longer.

-- 
Matthew Garrett | mj...@srcf.ucam.org
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Wed, Aug 15, 2012 at 11:02 AM, Arjan van de Ven wrote:
> On 8/15/2012 9:34 AM, Matthew Garrett wrote:
>> On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
>>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>>> It bases on the following assumption:
>>>> 1, If there are many task crowd in system, just let few domain cpus
>>>> running and let other cpus idle can not save power. Let all cpu take the
>>>> load, finish tasks early, and then get into idle. will save more power
>>>> and have better user experience.
>>>
>>> I'm not sure this is a valid assumption. I've had it explained to me by
>>> various people that race-to-idle isn't always the best thing. It has to
>>> do with the cost of switching power states and the duration of execution
>>> and other such things.
>>
>> This is affected by Intel's implementation - if there's a single active
>
> not just intel.. also AMD
> basically everyone who has the memory controller in the cpu package will end
> up with
> a restriction very similar to this.
>

I think this circles back to discussion previously held on this topic. This preference is arch specific; we need to reduce the set of inputs to a sensible, actionable set, and plumb that so that the architecture and not the scheduler can supply this preference.

That you believe 100-300us is actually the tipping point vs power migration cost is probably in itself one of the most useful replies I've seen on this topic in all of the last few rounds of discussion it's been through. It suggests we could actually parameterize this in a manner similar to wake-up migration cost, with a minimum usage average for which it's worth spilling to an idle sibling.

- Paul

> (this is because the exit-from-self-refresh latency is pretty high.. at least
> in DDR2/3)
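The parameterization suggested above can be sketched as an architecture-supplied tunable that the balancing code consults. This is a minimal illustration, not kernel code: `ArchPowerHints`, `min_spill_runtime_us` and `worth_spilling` are hypothetical names, and the 200 microsecond value in the usage note is only picked from inside the 100-300us range mentioned in the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchPowerHints:
    """Hypothetical per-architecture inputs supplied to the scheduler."""

    # Smallest average run time (microseconds) for which waking an idle
    # sibling is expected to pay for itself.
    min_spill_runtime_us: float


def worth_spilling(avg_runtime_us: float, hints: ArchPowerHints) -> bool:
    """Return True if a task's usage average justifies using an idle sibling."""
    return avg_runtime_us >= hints.min_spill_runtime_us
```

For example, with `ArchPowerHints(min_spill_runtime_us=200.0)` a task that averages 50 microseconds per run stays where it is, while a task that averages 1000 microseconds may spill to an idle sibling. The scheduler only sees one number, and each architecture decides how to derive it.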
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit. So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack
>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>>     update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>>     if (sd->nr_running > sd's capacity) {
>>         //power saving policy is not suitable for
>>         //this scenario, it runs like performance policy
>>         mv tasks from busiest cpu in busiest group to
>>         idlest cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)

Yes -- I just got back from Africa this week. It's updated for almost all the previous comments but I ran out of time before I left to re-post. I'm just about caught up enough that I should be able to get this done over the upcoming weekend. Monday at the latest.

> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>>     } else {// the sd has enough capacity to hold all tasks.
>>         if (sg->nr_running > sg's capacity) {
>>             //imbalanced between groups
>>             if (schedule policy == performance) {
>>                 //when 2 busiest group at same busy
>>                 //degree, need to prefer the one has
>>                 // softest group??
>>                 move tasks from busiest group to
>>                 idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>>             } else if (schedule policy == power)
>>                 move tasks from busiest group to
>>                 idlest group until busiest is just full
>>                 of capacity.
>>                 //the busiest group can balance
>>                 //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>>         } else {
>>             //all groups has enough capacity for its tasks.
>>             if (schedule policy == performance)
>>                 //all tasks may has enough cpu
>>                 //resources to run,
>>                 //mv tasks from busiest to idlest group?
>>                 //no, at this time, it's better to keep
>>                 //the task on current cpu.
>>                 //so, it is maybe better to do balance
>>                 //in each of groups
>>                 for_each_imbalance_groups()
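Peter's two remarks on the power policy (allow a multiple of capacity before overflowing, and prefer the least busy non-idle group over the idlest one) can be sketched as a toy target selection. This is an illustration under assumptions, not the proposed kernel code: `Group`, `pick_power_target` and `capacity_factor` are hypothetical names, and `load` is a plain task count standing in for the per-entity load tracking that the thread says would be a better measure.

```python
from typing import NamedTuple, Optional, Sequence


class Group(NamedTuple):
    name: str
    load: int       # stand-in for nr_running
    capacity: int


def pick_power_target(groups: Sequence[Group],
                      capacity_factor: int = 2) -> Optional[str]:
    """Pick a destination group while keeping as few groups awake as possible."""
    # Prefer groups that are already awake and still below the overflow limit.
    awake_with_room = [g for g in groups
                       if 0 < g.load < capacity_factor * g.capacity]
    if awake_with_room:
        # "Least busy non-idle" rather than "idlest".
        return min(awake_with_room, key=lambda g: g.load).name
    # Every awake group has reached the limit: wake an idle group, if any.
    idle = [g for g in groups if g.load == 0]
    return idle[0].name if idle else None
```

With groups `g0` (load 3), `g1` (idle) and `g2` (load 1), each of capacity 4, the sketch picks `g2` and leaves `g1` asleep, whereas an "idlest group" rule would have woken `g1`. A real balancer would also exclude the source group from the candidates, which this sketch leaves out.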
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 08/16/2012 10:01 PM, Arjan van de Ven wrote:
>> *Power policy*:
>>
>> So how is power policy different? As Peter says,'pack more than spread
>> more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is
> efficient.
>
> the only thing you do not want to do, is wake cpus up for
> tasks that only run extremely briefly (think "100 usec" or less).

This is very important and valuable info! I just want to know how you got this number. From CS cost or cache/TLB refill cost?

> so maybe the balance interval is slightly different, or more, you don't
> balance tasks that
> historically ran only for brief periods
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/16/2012 11:45 AM, Rik van Riel wrote:

> The c-state governor can call the scheduler code before
> putting a CPU to sleep, to indicate (1) the wakeup latency
> of the CPU, and (2) whether TLB and/or cache get invalidated.

I don't think (2) is useful really; that basically always happens ;-)
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 08/16/2012 10:01 AM, Arjan van de Ven wrote:

>> *Power policy*:
>>
>> So how is power policy different? As Peter says, 'pack more than spread
>> more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is
> efficient.
>
> the only thing you do not want to do, is wake cpus up for
> tasks that only run extremely briefly (think "100 usec" or less).
>
> so maybe the balance interval is slightly different, or more, you don't
> balance tasks that historically ran only for brief periods

This makes me think that maybe, in addition to tracking the idle residency time in the c-state governor, we may also want to track the average run times in the scheduler.

The c-state governor can call the scheduler code before putting a CPU to sleep, to indicate (1) the wakeup latency of the CPU, and (2) whether TLB and/or cache get invalidated.

At wakeup time, the scheduler can check whether the CPU the to-be-woken process ran on is in a deeper sleep state, and whether the typical run time for the process significantly exceeds the wakeup latency of the CPU it last ran on.

If the process typically runs for a short interval, and/or the process's CPU lost its cached state, it may be better to run the just-woken task on the CPU that is doing the waking up, instead of on the CPU where it used to run.

Does that make sense? Am I overlooking any factors?
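The wakeup-time check proposed in this message could look roughly like the sketch below. It is plain userspace C, not kernel code: `struct cpu_idle_info`, `select_wake_cpu()` and the factor of 4 used to decide what "significantly exceeds" means are all hypothetical names and values chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record the c-state governor would hand to the scheduler
 * before putting a CPU to sleep. */
struct cpu_idle_info {
    bool     idle;               /* CPU is currently in an idle state */
    uint64_t wakeup_latency_ns;  /* (1) exit latency of that idle state */
    bool     cache_lost;         /* (2) TLB and/or cache were invalidated */
};

/* Assumed margin: typical run time must be at least this many times the
 * wakeup latency before waking the sleeping CPU is considered worthwhile. */
#define RUNTIME_LATENCY_FACTOR 4

/*
 * Choose where a just-woken task should run: on prev_cpu (where it last
 * ran) or on waker_cpu (the CPU doing the wakeup).
 */
int select_wake_cpu(int prev_cpu, int waker_cpu,
                    const struct cpu_idle_info *prev,
                    uint64_t avg_runtime_ns)
{
    if (!prev->idle)
        return prev_cpu;   /* previous CPU is awake, no latency to pay */

    if (prev->cache_lost)
        return waker_cpu;  /* cached state is gone, no affinity benefit */

    if (avg_runtime_ns < RUNTIME_LATENCY_FACTOR * prev->wakeup_latency_ns)
        return waker_cpu;  /* run is too short to justify the wakeup */

    return prev_cpu;
}
```

The sketch takes the "and/or" in the proposal as "or": either a short typical run time or lost cache state is enough to keep the task on the waking CPU. The average run time is assumed to be tracked elsewhere by the scheduler.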
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
Hi all,

On Wed, Aug 15, 2012 at 12:05:38PM +0100, Peter Zijlstra wrote:
> > sub proposal:
> > 1, If it's possible to balance task on idlest cpu not appointed 'balance
> > cpu'. If so, it may can reduce one more time balancing.
> > The idlest cpu can prefer the new idle cpu; and is the least load cpu;
> > 2, se or task load is good for running time setting.
> > but it should the second basis in load balancing. The first basis of LB
> > is running tasks' number in group/cpu. Since whatever of the weight of
> > groups is, if the tasks number is less than cpu number, the group is
> > still has capacity to take more tasks. (will consider the SMT cpu power
> > or other big/little cpu capacity on ARM.)
>
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because ran back to back the one cpu will still be
> mostly idle.
>
> What you want is to keep track of a per-cpu utilization level (inverse
> of idle-time) and using PJT's per-task runnable avg see if placing the
> new task on will exceed the utilization limit.
>
> I think some of the Linaro people actually played around with this,
> Vincent?

I agree. A better measure of cpu load and task weight than nr_running and the current task load weight is necessary to do proper task packing.

I have used PJT's per-task load-tracking for scheduling experiments on heterogeneous systems, and my experience is that it works quite well for determining the load of a specific task. Something like PJT's work would be a good starting point for power aware scheduling and better support for heterogeneous systems.

One of the biggest challenges here for load-balancing is translating task load from one cpu to another, as the task load is influenced by the total load of its cpu. So a task that appears to be heavy on an oversubscribed cpu might not be so heavy after all when it is moved to a cpu with plenty of cpu time to spare. This issue is likely to be more pronounced on heterogeneous systems and systems with aggressive frequency scaling.

It might be possible to avoid having to translate load, or it might turn out not to matter much, but I haven't completely convinced myself yet.

My point is that getting the task load right, or at least better, is a fundamental requirement for improving power aware scheduling.

Best regards,
Morten
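As a rough illustration of the utilization-based check Peter describes, the sketch below tests whether a task's runnable average fits on a CPU without exceeding a utilization limit, and then packs the task onto the busiest CPU that still has room. The 0..1024 scale, `struct cpu_stat` and both function names are assumptions for this example; it deliberately ignores the load-translation problem raised above, so the task's utilization is treated as if it were the same on every CPU.

```c
#include <stdbool.h>

/* Assumed scale: utilization in 1/1024ths of one CPU's time. */
#define UTIL_SCALE 1024

struct cpu_stat {
    unsigned int util;  /* busy fraction of the CPU (inverse of idle time) */
};

/* True if a task with runnable average task_util can be placed on this
 * CPU without pushing it past the utilization limit. */
bool task_fits(const struct cpu_stat *cpu, unsigned int task_util,
               unsigned int limit)
{
    return cpu->util + task_util <= limit;
}

/*
 * Packing policy: among the CPUs where the task fits, pick the one that
 * is already the most utilized. Returns -1 if the task fits nowhere.
 */
int pick_cpu_packed(const struct cpu_stat *cpus, int n,
                    unsigned int task_util, unsigned int limit)
{
    int best = -1;

    for (int i = 0; i < n; i++) {
        if (!task_fits(&cpus[i], task_util, limit))
            continue;
        if (best < 0 || cpus[i].util > cpus[best].util)
            best = i;
    }
    return best;
}
```

Two tasks that each run about 10% of the time (roughly 102 on this scale) fit together on one CPU under an 80% limit, whereas a policy based on nr_running alone would see that CPU as already occupied after the first task.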
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
> *Power policy*:
>
> So how is power policy different? As Peter says, 'pack more than spread
> more'.

this is ... a dubiously general statement.

for good power, at least on Intel cpus, you want to spread. Parallelism is efficient.

the only thing you do not want to do, is wake cpus up for tasks that only run extremely briefly (think "100 usec" or less).

so maybe the balance interval is slightly different, or more, you don't balance tasks that historically ran only for brief periods
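One way to read "tasks that historically ran only for brief periods" is a decaying average of each task's run length, compared against the 100 usec figure from the message. The sketch below assumes that reading; the 7/8 weighting, `struct task_hist` and the function names are illustrative choices, not anything the scheduler provides.

```c
#include <stdbool.h>
#include <stdint.h>

/* The "100 usec" figure from the message, in nanoseconds. */
#define BRIEF_RUN_NS 100000ULL

struct task_hist {
    uint64_t avg_run_ns;  /* decaying average of recent run lengths */
};

/* Fold one completed run into the average.
 * Assumed weights: new = 7/8 * old + 1/8 * sample. */
void task_hist_update(struct task_hist *h, uint64_t run_ns)
{
    if (h->avg_run_ns == 0)
        h->avg_run_ns = run_ns;  /* first sample seeds the average */
    else
        h->avg_run_ns = (7 * h->avg_run_ns + run_ns) / 8;
}

/* A task whose runs are typically brief is not worth waking an idle CPU
 * for, and is skipped when balancing. */
bool worth_waking_idle_cpu(const struct task_hist *h)
{
    return h->avg_run_ns > BRIEF_RUN_NS;
}
```

A task that has only ever run for 50 usec stays below the threshold; after a single 2 ms run its average rises to about 294 usec and it becomes eligible for spreading again.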
Re: [discussion]sched: a rough proposal to enable power saving in scheduler
On 8/15/2012 10:03 PM, Alex Shi wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>
>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Are there workloads in which "power" might provide more performance than
>> "performance"? If so, don't use these terms.
>
> Power scheme should no chance has better performance in design.

ehm. so in reality, the very first thing that helps power, is to run software efficiently.

anything else is completely secondary.

if placement policy leads to a placement that's different from the most efficient placement, you're already burning extra power...