Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-23 Thread Alex Shi
On 08/22/2012 05:10 PM, Ingo Molnar wrote:

> 
> * Matthew Garrett  wrote:
> 
>> [...]
>>
>> Our power consumption being worse than under other operating 
>> systems is almost entirely because only one of our three GPU 
>> drivers implements any kind of useful power management. [...]
> 
> ... and because our CPU frequency and C state selection logic is 
> making pretty much the worst possible decisions (on x86 at 
> least).
> 
> Regardless, you cannot possibly seriously suggest that because 
> there's even greater suckage elsewhere for some workloads we 
> should not even bother with improving the situation here.
> 
> Anyway, I agree with Alan that actual numbers matter.


Sure. We'd better turn ideas into code, and then let benchmarks and data
speak.

> 
> Thanks,
> 
>   Ingo


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Arjan van de Ven
On 8/22/2012 6:21 AM, Matthew Garrett wrote:
> On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
>> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
>>> For my dinky dual core laptop, I suspect you're right, but for a more
>>> powerful laptop, I'd expect spread/don't to be noticeable.
>>
>> yeah if you don't spread, you will waste some power.
>> but.. current linux behavior is to spread.
>> so we can only make it worse.
> 
> Right. For a single socket system the only thing you can do is use two 
> threads in preference to using two cores. That'll keep an extra core in 
> a deep C state for longer, at the cost of keeping the package out of a 
> deep C state for longer. There might be a win if the two processes 
> benefit from improved L1 cache locality, or if you're talking about 

basically "if HT sharing would be good for performance" ;-)

(btw this is good news, it means this is not an actual power/performance 
tradeoff, but a "get it right" tradeoff)



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Matthew Garrett
On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > For my dinky dual core laptop, I suspect you're right, but for a more
> > powerful laptop, I'd expect spread/don't to be noticeable.
> 
> yeah if you don't spread, you will waste some power.
> but.. current linux behavior is to spread.
> so we can only make it worse.

Right. For a single socket system the only thing you can do is use two 
threads in preference to using two cores. That'll keep an extra core in 
a deep C state for longer, at the cost of keeping the package out of a 
deep C state for longer. There might be a win if the two processes 
benefit from improved L1 cache locality, or if you're talking about 
short periodic work, but for the majority of cases I'd expect Arjan to 
be completely correct here. Things get more interesting with 
multi-socket systems, but that's beyond the laptop use case.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Mike Galbraith
On Wed, 2012-08-22 at 06:02 -0700, Arjan van de Ven wrote: 
> On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> > On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
> > 
> >> I'd like to see actual numbers and evidence on a wide range of workloads
> >> that the spread/don't spread thing is even measurable, given that you've also
> >> got to factor in effects like completing faster and turning everything
> >> off. I'd *really* like to see such evidence on a laptop, which is your
> >> one cited case where it might work.
> > 
> > For my dinky dual core laptop, I suspect you're right, but for a more
> > powerful laptop, I'd expect spread/don't to be noticeable.
> 
> yeah if you don't spread, you will waste some power.
> but.. current linux behavior is to spread.
> so we can only make it worse.

Hm, so I can stop fretting about select_idle_sibling().  Good. 

> > Yeah, hard numbers would be nice to see.
> > 
> > If I had a powerful laptop, I'd kill irq balancing, and all but periodic
> > load balancing, and expect to see a positive result. 
> 
> I'd expect to see a negative result ;-)

Ok, so I have my head on backward.  Gives a different perspective :)

-Mike



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Arjan van de Ven
On 8/21/2012 10:41 PM, Mike Galbraith wrote:
> On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
> 
>> I'd like to see actual numbers and evidence on a wide range of workloads
>> that the spread/don't spread thing is even measurable, given that you've also
>> got to factor in effects like completing faster and turning everything
>> off. I'd *really* like to see such evidence on a laptop, which is your
>> one cited case where it might work.
> 
> For my dinky dual core laptop, I suspect you're right, but for a more
> powerful laptop, I'd expect spread/don't to be noticeable.

yeah if you don't spread, you will waste some power.
but.. current linux behavior is to spread.
so we can only make it worse.


> 
> Yeah, hard numbers would be nice to see.
> 
> If I had a powerful laptop, I'd kill irq balancing, and all but periodic
> load balancing, and expect to see a positive result. 

I'd expect to see a negative result ;-)



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Alan Cox
> It can be more than an irrelevance if the CPU is saturated - say 
> a game running on a mobile device very commonly saturates the 
> CPU. A third of the energy is spent in the CPU, sometimes more.

If the CPU is saturated you've already lost. What are you going to do? The CPU
is saturated - slow it down and it'll use more power.

> > You *can't* fix PM in one place. [...]
> 
> Preferably one project, not one place - but at least don't go 
> down the false path of:
> 
>  " Policy always belongs into user-space so the kernel can 
>continue to do a shitty job even for pieces it could 
>understand better ..."
> 
> My opinion is that it depends, and I also think that we are so 
> bad currently (on x86) that we can do little harm by trying to 
> do things better.

All the evidence I've seen says we are doing the kernel side stuff right.

> 
> > [...] Power management is a top to bottom thing. It starts in 
> > the hardware and propagates right to the top of the user space 
> > stack.
> 
> Partly because it's misdesigned: in practice there's very little 
> true user policy about power saving:

It's not about policy, it's about code behaviour. You have to fix every
single piece of code.

> - On mobile devices I almost never tweak policy as a user - 
>   sometimes I override screen brightness but that's all (and 
>   it's trivial compared to all the many other things that go 
>   on).

Put a single badly broken app on an Android device and your battery life
will plough. That's despite Android having some highly active management
policies to minimise the effect. It works out of the box because someone
spent a huge amount of time with a power meter and monitoring tools
beating up whoever was top of the wakeup lists.

> it should all work. There aren't millions of people out there 
> wanting to tweak the heck out of PM.

Don't confuse policy managed by the userspace and buttons for users to
tweak. Userspace understands things like "would it be better to drop
video quality or burn more power" and has access to info the kernel can't
even begin to evaluate.

> People prefer no knobs at all - they want good defaults and they 
> want at most a single, intuitive, actionable control to override 
> the automation in 1% of the usecases, such as screen brightness.

That's a different discussion.

> > A single stupid behaviour in a desktop app is all it needs to 
> > knock the odd hour or two off your battery life. Something as 
> > mundane as refreshing a bit of the display all the time 
> > keeps the GPU and CPU from sleeping well.
> 
> Even with highly powertop-optimized systems that have no such 
> app and have very low wakeup rates we still lag behind the 
> competition.

Actually we don't. Well not if your distro is put together properly,
and has the relevant SATA patches and the like merged. Stock Fedora may
be pants but if so that's a distro problem.

> So why not move most pieces into one well-informed code domain 
> (the kernel) and only expose high level controls, instead of 
> expecting user-space to get it all right.

Because the kernel doesn't have the information needed. You'd have to add
megabytes of code to the kernel - including things like video playback
engines.

> Then the 'only' job of user-space would be to not be silly when 
> implementing their functionality. (and there's nothing 
> intimately PM about that.)

That sounds like ignorance

> Kernel design decisions *matter*:

Of course they do, but it's a tiny part of the story. The power management
function mathematically has a large number of important inputs for which
the kernel cannot deduce the values without massive layering violations.

Also inconveniently for your worldview but as demonstrated in every case
and by everyone who has dug into it, you also have to fix all the wakeup
sources on each level. That's the reality. From the moment you wake for
an event that was not strictly needed you are essentially attempting to
mitigate a failure not trying to deal with the actual problem.

> Look for example how moving X lowlevel drivers from user-space 
> into kernel-space enabled GPU level power management to begin 
> with. With the old X method it was essentially impossible. Now 
> it's at least possible.

Actually it was perfectly possible before for what the cards of the time
could do. The kernel GPU stuff is for DMA and IRQ handling. It happens to
be a good place to do PM.

> Or look at how Android adding a high-level interface like 
> suspend blockers materially improved the power saving situation 
> for them.

Blockers are not policy. The blocking *policy* is managed elsewhere. They
are a tool for freezing stuff that is being rude.

> This learned helplessness that "the kernel can do nothing about 
> PM" is somewhat annoying :-)

Sorry was that a different thread I didn't read ?

The inability to learn from both the past and basic systems theory is
what I find rather more irritating. Plus your mistaken belief that we are
worse than the other 

Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Matthew Garrett
On Wed, Aug 22, 2012 at 11:10:13AM +0200, Ingo Molnar wrote:
> 
> * Matthew Garrett  wrote:
> 
> > [...]
> >
> > Our power consumption being worse than under other operating 
> > systems is almost entirely because only one of our three GPU 
> > drivers implements any kind of useful power management. [...]
> 
> ... and because our CPU frequency and C state selection logic is 
> making pretty much the worst possible decisions (on x86 at 
> least).

You have figures showing that our C state residence is worse than, say, 
Windows? Because my own testing says that we're way better at that. 
Could we be better? Sure. Is it why we're worse? No.
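
For anyone wanting Linux-side residency figures to compare, one possible starting point is the cpuidle sysfs interface. The sketch below assumes the usual layout under /sys/devices/system/cpu/cpuN/cpuidle/stateM/, with `name`, `time` (microseconds spent in the state) and `usage` (number of entries) files. The helper names and the `base` parameter are made up for illustration, and nothing here covers how the Windows side would be measured.

```python
from pathlib import Path


def read_cstate_residency(cpu=0, base="/sys/devices/system/cpu"):
    """Return {state_name: (time_us, usage)} for one CPU.

    Returns an empty dict if the cpuidle directory is not present.
    """
    cpuidle = Path(base) / f"cpu{cpu}" / "cpuidle"
    result = {}
    if not cpuidle.is_dir():
        return result
    for state in sorted(cpuidle.glob("state*")):
        try:
            name = (state / "name").read_text().strip()
            time_us = int((state / "time").read_text())
            usage = int((state / "usage").read_text())
        except (OSError, ValueError):
            # Skip states whose files are missing or unreadable.
            continue
        result[name] = (time_us, usage)
    return result


def residency_share(residency):
    """Fraction of the total recorded idle-state time spent in each state."""
    total = sum(time_us for time_us, _ in residency.values())
    if total == 0:
        return {}
    return {name: time_us / total for name, (time_us, _) in residency.items()}


if __name__ == "__main__":
    for name, share in residency_share(read_cstate_residency()).items():
        print(f"{name}: {share:.1%} of recorded idle-state time")
```

The shares are relative to the time recorded across idle states, not to wall-clock time, so two snapshots taken around a workload would be needed for a per-workload comparison.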

> Regardless, you cannot possibly seriously suggest that because 
> there's even greater suckage elsewhere for some workloads we 
> should not even bother with improving the situation here.

I'm enthusiastic about improving the scheduler's behaviour. I'm 
unenthusiastic about putting in automatic hacks related to AC state.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Ingo Molnar

* Alan Cox  wrote:

> > With deep enough C states it's rather relevant whether we 
> > continue to burn +50W for a couple of more milliseconds or 
> > not, and whether we have the right information from the 
> > scheduler and timer subsystem about how long the next idle 
> > period is expected to be and how bursty a given task is.
> 
> 50 W for 2 ms here and there is an irrelevance compared with 
> burning a continual half a watt due to the upstream tree lacking 
> some of the SATA power patches, for example.

It can be more than an irrelevance if the CPU is saturated - say 
a game running on a mobile device very commonly saturates the 
CPU. A third of the energy is spent in the CPU, sometimes more.

> It's the classic "standby mode" problem - energy efficiency 
> has time as a factor and there are a lot of milliseconds in 5 
> hours. That means anything continually on rapidly dominates 
> the problem space.
> 
> > > PM means fixing the stack top to bottom, and it's a whack-a-mole 
> > > game, each one you fix you find the next. You have to sort the 
> > > entire stack from desktop apps to kernel.
> > 
> > Moving 'policy' into user-space has been an utter failure, 
> > mostly because there's not a single project/subsystem 
> > responsible for getting a good result to users. This is why 
> > I resist the "policy should not be in the kernel" meme here.
> 
> You *can't* fix PM in one place. [...]

Preferably one project, not one place - but at least don't go 
down the false path of:

 " Policy always belongs into user-space so the kernel can 
   continue to do a shitty job even for pieces it could 
   understand better ..."

My opinion is that it depends, and I also think that we are so 
bad currently (on x86) that we can do little harm by trying to 
do things better.

> [...] Power management is a top to bottom thing. It starts in 
> the hardware and propagates right to the top of the user space 
> stack.

Partly because it's misdesigned: in practice there's very little 
true user policy about power saving:

- On mobile devices I almost never tweak policy as a user - 
  sometimes I override screen brightness but that's all (and 
  it's trivial compared to all the many other things that go 
  on).

- On a laptop I'd love to never have to tweak it either - 
  running fast when on AC and running efficiently when on battery 
  is a perfectly fine life-time default for me.

90% of the "policy" comes with the *form factor* - i.e. it's 
something the hardware and thus the kernel could intimately
know about.

Yes, there are exceptions and there are servers.

The mobile device user mostly *only cares about battery life*, 
for a given amount of real utility provided by the device. The 
"user policy" fetish here is a serious misunderstanding of how 
it should all work. There aren't millions of people out there 
wanting to tweak the heck out of PM.

People prefer no knobs at all - they want good defaults and they 
want at most a single, intuitive, actionable control to override 
the automation in 1% of the usecases, such as screen brightness.

> A single stupid behaviour in a desktop app is all it needs to 
> knock the odd hour or two off your battery life. Something as 
> mundane as refreshing a bit of the display all the time 
> keeps the GPU and CPU from sleeping well.

Even with highly powertop-optimized systems that have no such 
app and have very low wakeup rates we still lag behind the 
competition.

> Most distros haven't managed to do power management properly 
> because it is this entire integration problem. Every single 
> piece of the puzzle has to be in place before you get any 
> serious gain.

Most certainly.

So why not move most pieces into one well-informed code domain 
(the kernel) and only expose high level controls, instead of 
expecting user-space to get it all right.

Then the 'only' job of user-space would be to not be silly when 
implementing their functionality. (and there's nothing 
intimately PM about that.)

> It's not a kernel v user thing. The kernel can't fix it, 
> random bits of userspace can't fix it. This is effectively a 
> "product level" integration problem.

Of course the kernel can fix many parts by offering automation 
like automatically shutting down unused interfaces (and offering 
better ABIs if that is not possible due to some poor historic 
choice), choosing frequencies and C states wisely, etc.

Kernel design decisions *matter*:

Look for example how moving X lowlevel drivers from user-space 
into kernel-space enabled GPU level power management to begin 
with. With the old X method it was essentially impossible. Now 
it's at least possible.

Or look at how Android adding a high-level interface like 
suspend blockers materially improved the power saving situation 
for them.

This learned helplessness that "the kernel can do nothing about 
PM" is somewhat annoying :-)

Thanks,

Ingo

Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Alan Cox
> With deep enough C states it's rather relevant whether we 
> continue to burn +50W for a couple of more milliseconds or not, 
> and whether we have the right information from the scheduler and 
> timer subsystem about how long the next idle period is expected 
> to be and how bursty a given task is.

50 W for 2 ms here and there is an irrelevance compared with burning a
continual half a watt due to the upstream tree lacking some of the SATA
power patches, for example.

It's the classic "standby mode" problem - energy efficiency has time as a
factor and there are a lot of milliseconds in 5 hours. That means
anything continually on rapidly dominates the problem space.
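
A rough sketch of that arithmetic, in Python. The 50 W, 2 ms, half-watt and 5 hour figures are the ones quoted above; the function names are made up for illustration, and the model assumes nothing beyond energy = power x time.

```python
def burst_energy_joules(power_watts, duration_s):
    """Energy of one short burst: E = P * t."""
    return power_watts * duration_s


def continuous_energy_joules(power_watts, hours):
    """Energy of a constant draw held for the given number of hours."""
    return power_watts * hours * 3600.0


# One 2 ms burst at 50 W.
one_burst = burst_energy_joules(50.0, 0.002)

# A continual half a watt over a 5 hour battery run.
always_on = continuous_energy_joules(0.5, 5.0)

# How many such bursts it takes to match the continual draw.
bursts_to_match = always_on / one_burst

print(f"one burst:       {one_burst:.3f} J")
print(f"continual draw:  {always_on:.0f} J")
print(f"bursts to match: {bursts_to_match:.0f}")
```

Under these assumptions one burst costs about 0.1 J while the continual draw costs 9000 J, so it takes roughly 90,000 bursts over the 5 hours (five per second, every second) before the bursts cost as much as the half watt.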

> > PM means fixing the stack top to bottom, and it's a whack-a-mole 
> > game, each one you fix you find the next. You have to sort the 
> > entire stack from desktop apps to kernel.
> 
> Moving 'policy' into user-space has been an utter failure, 
> mostly because there's not a single project/subsystem 
> responsible for getting a good result to users. This is why
> I resist the "policy should not be in the kernel" meme here.

You *can't* fix PM in one place. Power management is a top to bottom
thing. It starts in the hardware and propagates right to the top of the
user space stack.

A single stupid behaviour in a desktop app is all it needs to knock the
odd hour or two off your battery life. Something as mundane as refreshing
a bit of the display all the time keeps the GPU and CPU from sleeping
well.

Most distros haven't managed to do power management properly because it
is this entire integration problem. Every single piece of the puzzle has
to be in place before you get any serious gain.

It's not a kernel v user thing. The kernel can't fix it, random bits of
userspace can't fix it. This is effectively a "product level" integration
problem.

Alan


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Ingo Molnar

* Matthew Garrett  wrote:

> [...]
>
> Our power consumption being worse than under other operating 
> systems is almost entirely because only one of our three GPU 
> drivers implements any kind of useful power management. [...]

... and because our CPU frequency and C state selection logic is 
making pretty much the worst possible decisions (on x86 at 
least).

Regardless, you cannot possibly seriously suggest that because 
there's even greater suckage elsewhere for some workloads we 
should not even bother with improving the situation here.

Anyway, I agree with Alan that actual numbers matter.

Thanks,

Ingo


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Ingo Molnar

* Alan Cox  wrote:

> > Why? Good scheduling is useful even in isolation.
> 
> For power - I suspect it's damn near irrelevant except on a 
> big big machine.

With deep enough C states it's rather relevant whether we 
continue to burn +50W for a couple of more milliseconds or not, 
and whether we have the right information from the scheduler and 
timer subsystem about how long the next idle period is expected 
to be and how bursty a given task is.

'Balance for energy efficiency' obviously ties into the C state 
and frequency selection logic, which is rather detached right 
now, running its own (imperfect) scheduling metrics logic and 
making pretty much the worst possible C state and frequency 
decisions in typical everyday desktop workloads.

> Unless you've sorted out your SATA, fixed your phy handling, 
> optimised your desktop for wakeups and worked down the big 
> wakeup causes one by one it's turd polishing.
> 
> PM means fixing the stack top to bottom, and it's a whack-a-mole 
> game, each one you fix you find the next. You have to sort the 
> entire stack from desktop apps to kernel.

Moving 'policy' into user-space has been an utter failure, 
mostly because there's not a single project/subsystem 
responsible for getting a good result to users. This is why
I resist the "policy should not be in the kernel" meme here.

> However benchmarks talk - so let's have some benchmarks ... on 
> a laptop.

Agreed.

Thanks,

Ingo


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Ingo Molnar

* Alan Cox a...@lxorguk.ukuu.org.uk wrote:

  With deep enough C states it's rather relevant whether we 
  continue to burn +50W for a couple of more milliseconds or 
  not, and whether we have the right information from the 
  scheduler and timer subsystem about how long the next idle 
  period is expected to be and how bursty a given task is.
 
 50W for 2mS here and there is an irrelevance compared with 
 burning a continual half a watt due to the upstream tree lack 
 some of the SATA power patches for example.

It can be more than an irrelevance if the CPU is saturated - say 
a game running on a mobile device very commonly saturates the 
CPU. A third of the energy is spent in the CPU, sometimes more.

 It's the classic standby mode problem - energy efficiency 
 has time as a factor and there are a lot of milliseconds in 5 
 hours. That means anything continually on rapidly dominates 
 the problem space.
 
   PM means fixing the stack top to bottom, and its a whackamole 
   game, each one you fix you find the next. You have to sort the 
   entire stack from desktop apps to kernel.
  
  Moving 'policy' into user-space has been an utter failure, 
  mostly because there's not a single project/subsystem 
  responsible for getting a good result to users. This is why 
  I resist policy should not be in the kernel meme here.
 
 You *can't* fix PM in one place. [...]

Preferably one project, not one place - but at least don't go 
down the false path of:

  Policy always belongs into user-space so the kernel can 
   continue to do a shitty job even for pieces it could 
   understand better ...

My opinion is that it depends, and I also think that we are so 
bad currently (on x86) that we can do little harm by trying to 
do things better.

 [...] Power management is a top to bottom thing. It starts in 
 the hardware and propogates right to the top of the user space 
 stack.

Partly because it's misdesigned: in practice there's very little 
true user policy about power saving:

- On mobile devices I almost never tweak policy as a user - 
  sometimes I override screen brightness but that's all (and 
  it's trivial compared to all the many other things that go 
  on).

- On a laptop I'd love to never have to tweak it either - 
  running fast when on AC and running efficient when on battery 
  is a perfectly fine life-time default for me.

90% of the policy comes with the *form factor* - i.e. it's 
something the hardware and thus the kernel could intimately
know about.

Yes, there are exceptions and there are servers.

The mobile device user mostly *only cares about battery life*, 
for a given amount of real utility provided by the device. The 
user policy fetish here is a serious misunderstanding of how 
it should all work. There aren't millions of people out there 
wanting to tweak the heck out of PM.

People prefer no knobs at all - they want good defaults and they 
want at most a single, intuitive, actionable control to override 
the automation in 1% of the usecases, such as screen brightness.

 A single stupid behaviour in a desktop app is all it needs to 
 knock the odd hour or two off your battery life. Something as 
 mundane as refreshing a bit of the display all the time 
 keeps the GPU and CPU from sleeping well.

Even with highly powertop-optimized systems that have no such 
app and have very low wakeup rates we still lag behind the 
competition.

 Most distros haven't managed to do power management properly 
 because it is this entire integration problem. Every single 
 piece of the puzzle has to be in place before you get any 
 serious gain.

Most certainly.

So why not move most pieces into one well-informed code domain 
(the kernel) and only expose high level controls, instead of 
expecting user-space to get it all right.

Then the 'only' job of user-space would be to not be silly when 
implementing their functionality. (and there's nothing 
intimately PM about that.)

 It's not a kernel v user thing. The kernel can't fix it, 
 random bits of userspace can't fix it. This is effectively a 
 product level integration problem.

Of course the kernel can fix many parts by offering automation 
like automatically shutting down unused interfaces (and offering 
better ABIs if that is not possible due to some poor historic 
choice), choosing frequencies and C states wisely, etc.

Kernel design decisions *matter*:

Look for example how moving X lowlevel drivers from user-space 
into kernel-space enabled GPU level power management to begin 
with. With the old X method it was essentially impossible. Now 
it's at least possible.

Or look at how Android adding a high-level interface like 
suspend blockers materially improved the power saving situation 
for them.

This learned helplessness that the kernel can do nothing about 
PM is somewhat annoying :-)

Thanks,

Ingo

Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Matthew Garrett
On Wed, Aug 22, 2012 at 11:10:13AM +0200, Ingo Molnar wrote:
 
 * Matthew Garrett mj...@srcf.ucam.org wrote:
 
  [...]
 
  That our power consumption is worse than under other operating 
  systems is almost entirely because only one of our three GPU 
  drivers implements any kind of useful power management. [...]
 
 ... and because our CPU frequency and C state selection logic is 
 doing pretty much the worst possible decisions (on x86 at 
 least).

You have figures showing that our C state residence is worse than, say, 
Windows? Because my own testing says that we're way better at that. 
Could we be better? Sure. Is it why we're worse? No.

 Regardless, you cannot possibly seriously suggest that because 
 there's even greater suckage elsewhere for some workloads we 
 should not even bother with improving the situation here.

I'm enthusiastic about improving the scheduler's behaviour. I'm 
unenthusiastic about putting in automatic hacks related to AC state.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Alan Cox
 It can be more than an irrelevance if the CPU is saturated - say 
 a game running on a mobile device very commonly saturates the 
 CPU. A third of the energy is spent in the CPU, sometimes more.

If the CPU is saturated you already lost. What are you going to do - the CPU
is saturated - slow it down? Then it'll use more power.

  You *can't* fix PM in one place. [...]
 
 Preferably one project, not one place - but at least don't go 
 down the false path of:
 
   Policy always belongs into user-space so the kernel can 
    continue to do a shitty job even for pieces it could 
    understand better ...
 
 My opinion is that it depends, and I also think that we are so 
 bad currently (on x86) that we can do little harm by trying to 
 do things better.

All the evidence I've seen says we are doing the kernel side stuff right.

 
  [...] Power management is a top to bottom thing. It starts in 
  the hardware and propagates right to the top of the user space 
  stack.
 
 Partly because it's misdesigned: in practice there's very little 
 true user policy about power saving:

It's not about policy, it's about code behaviour. You have to fix every
single piece of code.

 - On mobile devices I almost never tweak policy as a user - 
   sometimes I override screen brightness but that's all (and 
   it's trivial compared to all the many other things that go 
   on).

Put a single badly broken app on an Android device and your battery life
will plough. That's despite Android having some highly active management
policies to minimise the effect. It works out of the box because someone
spent a huge amount of time with a power meter and monitoring tools
beating up whoever was top of the wakeup lists.
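
The "top of the wakeup lists" approach can be approximated with a small script. The following is only a rough sketch: it ranks interrupt sources by how often they fired between two snapshots in the /proc/interrupts text format. Interrupts are just one class of wakeup source (timers and userspace polling need other tools such as powertop), and the sample data, function names and the 5-row cut-off are all made up for illustration.

```python
# Sketch: rank interrupt sources by how much their counters grew.
# SAMPLE_BEFORE / SAMPLE_AFTER are invented data in /proc/interrupts layout.

SAMPLE_BEFORE = """\
           CPU0       CPU1
  0:         22          0   IO-APIC   2-edge      timer
 16:        100         50   IO-APIC  16-fasteoi   ehci_hcd:usb1
LOC:       1000       2000   Local timer interrupts
ERR:          0
"""

SAMPLE_AFTER = """\
           CPU0       CPU1
  0:         22          0   IO-APIC   2-edge      timer
 16:        200        100   IO-APIC  16-fasteoi   ehci_hcd:usb1
LOC:       1500       2400   Local timer interrupts
ERR:          0
"""


def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (total, description)}."""
    lines = text.strip("\n").splitlines()
    ncpus = len(lines[0].split())           # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break                       # some rows have fewer count columns
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), description)
    return table


def top_wakeups(before, after, n=5):
    """Return up to n (delta, label, description) tuples, largest delta first."""
    deltas = []
    for label, (count, description) in after.items():
        delta = count - before.get(label, (0, ""))[0]
        if delta > 0:
            deltas.append((delta, label, description))
    return sorted(deltas, reverse=True)[:n]


def snapshot(path="/proc/interrupts"):
    """Read a live snapshot (Linux only); not called in this sketch."""
    with open(path) as handle:
        return parse_interrupts(handle.read())


for delta, label, description in top_wakeups(
        parse_interrupts(SAMPLE_BEFORE), parse_interrupts(SAMPLE_AFTER)):
    print(f"{delta:8d}  {label:>4}  {description}")
```

On a real machine one would call snapshot() twice with a sleep in between and feed both results to top_wakeups(); the sources at the top are the first candidates to chase.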

 it should all work. There aren't millions of people out there 
 wanting to tweak the heck out of PM.

Don't confuse policy managed by the userspace and buttons for users to
tweak. Userspace understands things like 'would it be better to drop
video quality or burn more power' and has access to info the kernel can't
even begin to evaluate.

 People prefer no knobs at all - they want good defaults and they 
 want at most a single, intuitive, actionable control to override 
 the automation in 1% of the usecases, such as screen brightness.

That's a different discussion.

  A single stupid behaviour in a desktop app is all it needs to 
  knock the odd hour or two off your battery life. Something as 
  mundane as refreshing a bit of the display all the time 
  keeps the GPU and CPU from sleeping well.
 
 Even with highly powertop-optimized systems that have no such 
 app and have very low wakeup rates we still lag behind the 
 competition.

Actually we don't. Well not if your distro is put together properly,
and has the relevant SATA patches and the like merged. Stock Fedora may
be pants but if so that's a distro problem.

 So why not move most pieces into one well-informed code domain 
 (the kernel) and only expose high level controls, instead of 
 expecting user-space to get it all right.

Because the kernel doesn't have the information needed. You'd have to add
megabytes of code to the kernel - including things like video playback
engines.

 Then the 'only' job of user-space would be to not be silly when 
 implementing their functionality. (and there's nothing 
 intimately PM about that.)

That sounds like ignorance

 Kernel design decisions *matter*:

Of course they do but it's a tiny part of the story. The power management
function mathematically has a large number of important inputs for which
the kernel cannot deduce the values without massive layering violations.

Also inconveniently for your worldview but as demonstrated in every case
and by everyone who has dug into it, you also have to fix all the wakeup
sources on each level. That's the reality. From the moment you wake for
an event that was not strictly needed you are essentially attempting to
mitigate a failure not trying to deal with the actual problem.

 Look for example how moving X lowlevel drivers from user-space 
 into kernel-space enabled GPU level power management to begin 
 with. With the old X method it was essentially impossible. Now 
 it's at least possible.

Actually it was perfectly possible before for what the cards of the time
could do. The kernel GPU stuff is for DMA and IRQ handling. It happens to
be a good place to do PM.

 Or look at how Android adding a high-level interface like 
 suspend blockers materially improved the power saving situation 
 for them.

Blockers are not policy. The blocking *policy* is managed elsewhere. They
are a tool for freezing stuff that is being rude.

 This learned helplessness that the kernel can do nothing about 
 PM is somewhat annoying :-)

Sorry, was that a different thread I didn't read?

The inability to learn from both the past and basic systems theory is
what I find rather more irritating. Plus your mistaken belief that we are
worse than the other OS's on this. We are not. If your system sucks then
instrument it, 

Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Arjan van de Ven
On 8/21/2012 10:41 PM, Mike Galbraith wrote:
 On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
 
 I'd like to see actual numbers and evidence on a wide range of workloads
 that the spread/don't spread thing is even measurable given that you've also
 got to factor in effects like completing faster and turning everything
 off. I'd *really* like to see such evidence on a laptop, which is your
 one cited case where it might work.
 
 For my dinky dual core laptop, I suspect you're right, but for a more
 powerful laptop, I'd expect spread/don't to be noticeable.

yeah if you don't spread, you will waste some power.
but.. current linux behavior is to spread.
so we can only make it worse.


 
 Yeah, hard numbers would be nice to see.
 
 If I had a powerful laptop, I'd kill irq balancing, and all but periodic
 load balancing, and expect to see a positive result. 

I'd expect to see a negative result ;-)



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Mike Galbraith
On Wed, 2012-08-22 at 06:02 -0700, Arjan van de Ven wrote: 
 On 8/21/2012 10:41 PM, Mike Galbraith wrote:
  On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:
  
  I'd like to see actual numbers and evidence on a wide range of workloads
  that the spread/don't spread thing is even measurable given that you've also
  got to factor in effects like completing faster and turning everything
  off. I'd *really* like to see such evidence on a laptop, which is your
  one cited case where it might work.
  
  For my dinky dual core laptop, I suspect you're right, but for a more
  powerful laptop, I'd expect spread/don't to be noticeable.
 
 yeah if you don't spread, you will waste some power.
 but.. current linux behavior is to spread.
 so we can only make it worse.

Hm, so I can stop fretting about select_idle_sibling().  Good. 

  Yeah, hard numbers would be nice to see.
  
  If I had a powerful laptop, I'd kill irq balancing, and all but periodic
  load balancing, and expect to see a positive result. 
 
 I'd expect to see a negative result ;-)

Ok, so I have my head on backward.  Gives a different perspective :)

-Mike



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Matthew Garrett
On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
 On 8/21/2012 10:41 PM, Mike Galbraith wrote:
  For my dinky dual core laptop, I suspect you're right, but for a more
  powerful laptop, I'd expect spread/don't to be noticeable.
 
 yeah if you don't spread, you will waste some power.
 but.. current linux behavior is to spread.
 so we can only make it worse.

Right. For a single socket system the only thing you can do is use two 
threads in preference to using two cores. That'll keep an extra core in 
a deep C state for longer, at the cost of keeping the package out of a 
deep C state for longer. There might be a win if the two processes 
benefit from improved L1 cache locality, or if you're talking about 
short periodic work, but for the majority of cases I'd expect Arjan to 
be completely correct here. Things get more interesting with 
multi-socket systems, but that's beyond the laptop use case.
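
As a rough illustration of that trade-off, here is a toy energy model for two tasks on a single-socket, two-core, two-threads-per-core part. Everything in it is an assumption made up for the sketch: the three power figures, the fixed accounting window, and the idea that a task sharing a core with its sibling runs at a fraction smt_speed of its solo speed. It is not based on measurements from any real CPU.

```python
from dataclasses import dataclass


@dataclass
class PowerModel:
    # All figures are invented placeholders, in watts.
    pkg_awake_w: float = 5.0     # uncore/package cost while any core is busy
    core_active_w: float = 4.0   # cost of one core executing
    pkg_idle_w: float = 0.5      # whole package in a deep C state


def energy_joules(model, awake_s, core_busy_s, window_s):
    """Energy over a window: awake package + busy cores + deep idle remainder."""
    idle_s = window_s - awake_s
    if idle_s < 0:
        raise ValueError("work does not fit in the accounting window")
    return (model.pkg_awake_w * awake_s
            + model.core_active_w * core_busy_s
            + model.pkg_idle_w * idle_s)


def spread(model, work_s, window_s):
    """One task per core: two cores busy for work_s, then the package idles."""
    return energy_joules(model, awake_s=work_s, core_busy_s=2 * work_s,
                         window_s=window_s)


def pack(model, work_s, window_s, smt_speed):
    """Both tasks on one core's two threads: one core busy, but for longer."""
    runtime_s = work_s / smt_speed
    return energy_joules(model, awake_s=runtime_s, core_busy_s=runtime_s,
                         window_s=window_s)


model = PowerModel()
for smt_speed in (0.6, 0.95):
    print(f"smt_speed={smt_speed}: "
          f"spread={spread(model, 1.0, 2.0):.2f} J, "
          f"pack={pack(model, 1.0, 2.0, smt_speed):.2f} J")
```

With these made-up numbers, packing only wins when the two tasks lose very little by sharing a core, because the package stays out of its deep state for the longer runtime.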

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-22 Thread Arjan van de Ven
On 8/22/2012 6:21 AM, Matthew Garrett wrote:
 On Wed, Aug 22, 2012 at 06:02:48AM -0700, Arjan van de Ven wrote:
 On 8/21/2012 10:41 PM, Mike Galbraith wrote:
 For my dinky dual core laptop, I suspect you're right, but for a more
 powerful laptop, I'd expect spread/don't to be noticeable.

 yeah if you don't spread, you will waste some power.
 but.. current linux behavior is to spread.
 so we can only make it worse.
 
 Right. For a single socket system the only thing you can do is use two 
 threads in preference to using two cores. That'll keep an extra core in 
 a deep C state for longer, at the cost of keeping the package out of a 
 deep C state for longer. There might be a win if the two processes 
 benefit from improved L1 cache locality, or if you're talking about 

basically if HT sharing would be good for performance ;-)

(btw this is good news, it means this is not an actual power/performance 
tradeoff, but a 'get it right' tradeoff)

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Mike Galbraith
On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:

> I'd like to see actual numbers and evidence on a wide range of workloads
> that the spread/don't spread thing is even measurable given that you've also
> got to factor in effects like completing faster and turning everything
> off. I'd *really* like to see such evidence on a laptop, which is your
> one cited case where it might work.

For my dinky dual core laptop, I suspect you're right, but for a more
powerful laptop, I'd expect spread/don't to be noticeable.

Yeah, hard numbers would be nice to see.

If I had a powerful laptop, I'd kill irq balancing, and all but periodic
load balancing, and expect to see a positive result.  Dunno what fickle
electron gods would _really_ do with those prayers though.

-Mike



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Alan Cox
> Why? Good scheduling is useful even in isolation.

For power - I suspect it's damn near irrelevant except on a big big
machine.

Unless you've sorted out your SATA, fixed your phy handling, optimised
your desktop for wakeups and worked down the big wakeup causes one by one
it's turd polishing.

PM means fixing the stack top to bottom, and it's a whackamole game, each
one you fix you find the next. You have to sort the entire stack from
desktop apps to kernel.

However benchmarks talk - so let's have some benchmarks ... on a laptop.

Alan


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Matthew Garrett
On Tue, Aug 21, 2012 at 08:23:46PM +0200, Ingo Molnar wrote:
> * Matthew Garrett  wrote:
> > The scheduler is unaware of whether I care about a process 
> > finishing quickly or whether I care about it consuming less 
> > power.
> 
> You are posing them as if the two were mutually exclusive, while 
> in reality they are not necessarily exclusive: it's quite 
> possible that the highest (non-turbo) CPU frequency happens to 
> be the most energy efficient one for a CPU with a particular 
> workload ...

You just put in a proviso that makes them mutually exclusive. If I want 
it done fast, I want it done in the highest turbo CPU frequency. If I 
don't want it done fast, I want it done in the most efficient CPU 
frequency. They're probably not the same thing.
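
To make the disagreement concrete, here is a toy calculation of energy per unit of work at different frequencies. It assumes the textbook approximation that dynamic power grows with the cube of frequency, plus a constant static term for everything that stays powered while the work runs. The frequency table, the constant k and both static figures are invented for illustration, and idle power after completion is ignored.

```python
# Hypothetical frequency steps in GHz; the last entry stands in for turbo.
FREQS_GHZ = [0.8, 1.2, 1.6, 2.0, 2.4, 3.2]


def energy_per_gcycle(freq_ghz, static_w, k=1.0):
    """Joules needed to retire 1e9 cycles at freq_ghz.

    Assumed model: power = static_w + k * freq_ghz**3 (watts),
    runtime = 1 / freq_ghz (seconds).
    """
    power_w = static_w + k * freq_ghz ** 3
    seconds = 1.0 / freq_ghz
    return power_w * seconds


def most_efficient(static_w, freqs=FREQS_GHZ):
    """Frequency step with the lowest energy per unit of work."""
    return min(freqs, key=lambda f: energy_per_gcycle(f, static_w))


for static_w in (2.0, 30.0):
    best = most_efficient(static_w)
    print(f"static={static_w} W: most efficient step is {best} GHz, "
          f"fastest step is {max(FREQS_GHZ)} GHz")
```

In this model a small static term puts the most efficient point at a low frequency, while a large static term (a lot of platform kept awake by the work) pushes it up to the highest non-turbo step. In neither case does it coincide with the fastest step, which is the distinction being argued here.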

> You also missed the bit of my mail where I suggested that such 
> user preferences and tolerances can be communicated to the 
> scheduler via a policy toggle - which the scheduler would take 
> into account.

Yes. And that toggle should be the thing that defines the policy under 
all circumstances.

> > Ok so what you're actually telling me here is that you don't 
> > understand anything about power management and where our 
> > problems are.
> 
> Huh? In practice we suck today in terms of energy efficiency. 
> That covers both scheduling and other areas.
> 
> Saying this out aloud does not tell anything about my 
> understanding of power management...
> 
> So please outline a technical point.

That our power consumption is worse than under other operating systems is 
almost entirely because only one of our three GPU drivers implements any 
kind of useful power management. The power saving functionality that we 
expose to userspace is already used when it's safe to do so. So blaming 
our userspace policy management for our higher power consumption means 
that you can't possibly understand where the problems actually are, 
which indicates that you probably shouldn't be trying to tell me about 
optimal approaches to power management.

> You mean the code is in drivers? Or the problem is in drivers? 

The problem is in the drivers.

> > sched_mt_powersave was inherently going to have a huge impact 
> > on performance, and with modern chips that would result in the 
> > platform consuming more power. It was a feature that was 
> > useful for a small number of generations of desktop CPUs - I 
> > don't think it would ever skew the power/performance ratio in 
> > a useful direction on mobile hardware. But feel free to blame 
> > userspace for hardware design.
> 
> FYI, sched_mt_powersave is *GONE* in recent kernels, because it 
> basically never worked. This thread is about designing and 
> implementing something that actually works.

Yes. You asked me whether userspace ever used the knobs that the kernel 
exposed. I said no, because the only knob relevant for laptops would 
never improve energy efficiency on laptops. It is therefore impossible 
to use this as an example of userspace policy management not doing the 
right thing.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Ingo Molnar

* Matthew Garrett  wrote:

> On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett  wrote:
> > > The scheduler's behaviour is going to have a minimal impact on 
> > > power consumption on laptops. Other things are much more 
> > > important - backlight level, ASPM state, that kind of thing. 
> > > So why special case the scheduler? [...]
> > 
> > I'm not special casing the scheduler - but we are talking about 
> > scheduler policies here, so *if* it makes sense to handle this 
> > dynamically then obviously the scheduler wants to use system 
> > state information when/if the kernel can get it.
> > 
> > Your argument is as if you said that the shape of a car's side 
> > view mirrors is not important to its top speed, because the 
> > overall shape of the chassis and engine power are much more 
> > important.
> > 
> > But we are designing side view mirrors here, so we might as well 
> > do a good job there.
> 
> If the kernel is going to make power choices automatically 
> then it should do it everywhere, not piecemeal.

Why? Good scheduling is useful even in isolation.

> The scheduler is unaware of whether I care about a process 
> finishing quickly or whether I care about it consuming less 
> power.

You are posing them as if the two were mutually exclusive, while 
in reality they are not necessarily exclusive: it's quite 
possible that the highest (non-turbo) CPU frequency happens to 
be the most energy efficient one for a CPU with a particular 
workload ...

You also missed the bit of my mail where I suggested that such 
user preferences and tolerances can be communicated to the 
scheduler via a policy toggle - which the scheduler would take 
into account.

I suggest to use sane defaults, such as being energy efficient 
on battery power (within a sane threshold) and maximizing 
throughput on AC power (within a sane threshold).
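
The decision logic of such a default, with a toggle that overrides it, might look roughly like the sketch below. It is written as a userspace script purely to illustrate the logic, not as a proposal for how the scheduler would implement it. It relies on the sysfs power_supply class (a "type" file reading "Mains" and an "online" file reading 0 or 1); the policy names "powersave" and "performance" and both function names are placeholders invented for the example.

```python
from pathlib import Path


def on_ac_power(root="/sys/class/power_supply"):
    """Return True/False if a Mains supply reports its state, None if unknown."""
    root_path = Path(root)
    if not root_path.is_dir():
        return None
    found = None
    for supply in sorted(root_path.iterdir()):
        try:
            if (supply / "type").read_text().strip() != "Mains":
                continue
            online = (supply / "online").read_text().strip() == "1"
        except OSError:
            continue                 # attribute missing or unreadable
        if online:
            return True
        found = False                # saw a Mains supply, but it is offline
    return found


def choose_policy(ac_online, override=None):
    """Pick a policy name; an explicit override always wins."""
    if override is not None:
        return override
    if ac_online is False:
        return "powersave"           # running on battery
    return "performance"             # on AC, or no information available


print(choose_policy(on_ac_power()))
```

Treating "unknown" the same as AC is a design choice made for the sketch: machines that report no mains supply at all are more likely desktops or servers than laptops on battery.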

That would go a *long* way toward improving the current mess. If Linux 
power efficiency was so good today then I'd not ask for kernel 
driven defaults - but the reality is that in terms of process 
scheduling we suck today (and have sucked for the last 10 years) 
so pretty much any approach will improve things.

> > > > The thing is, when I use Linux on a laptop then 
> > > > AC/battery is *the* main policy input.
> > > 
> > > And it's already well handled from userspace, as it has to 
> > > be.
> > 
> > Not according to the developers switching away from Linux 
> > desktop distros in droves, because MacOSX or Win7 has 30%+ 
> > better battery efficiency.
> 
> Ok so what you're actually telling me here is that you don't 
> understand anything about power management and where our 
> problems are.

Huh? In practice we suck today in terms of energy efficiency. 
That covers both scheduling and other areas.

Saying this out aloud does not tell anything about my 
understanding of power management...

So please outline a technical point.

> > The scheduler might be a small part of the picture, but it's 
> > certainly a part of it.
> 
> It's in the drivers, which is where it has been since we went 
> tickless.

You mean the code is in drivers? Or the problem is in drivers? 

Both are true currently - this discussion is about the future, to 
make the scheduler aware of power concerns, as the scheduler 
(and the timer subsystem) already calculates various interesting 
metrics that matter to energy efficient scheduling.

> > > No, because sched_mt_powersave usually crippled performance 
> > > more than it saved power and nobody makes multi-socket 
> > > laptops.
> > 
> > That's a user-space policy management fail right there: why 
> > wasn't this fixed? If the default policy is in the kernel we can 
> > at least fix it in one place for the most common cases. If it's 
> > spread out amongst multiple projects then progress only happens 
> > at glacial speed ...
> 
> sched_mt_powersave was inherently going to have a huge impact 
> on performance, and with modern chips that would result in the 
> platform consuming more power. It was a feature that was 
> useful for a small number of generations of desktop CPUs - I 
> don't think it would ever skew the power/performance ratio in 
> a useful direction on mobile hardware. But feel free to blame 
> userspace for hardware design.

FYI, sched_mt_powersave is *GONE* in recent kernels, because it 
basically never worked. This thread is about designing and 
implementing something that actually works.

Thanks,

Ingo


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Matthew Garrett
On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
> * Matthew Garrett  wrote:
> > The scheduler's behaviour is going to have a minimal impact on 
> > power consumption on laptops. Other things are much more 
> > important - backlight level, ASPM state, that kind of thing. 
> > So why special case the scheduler? [...]
> 
> I'm not special casing the scheduler - but we are talking about 
> scheduler policies here, so *if* it makes sense to handle this 
> dynamically then obviously the scheduler wants to use system 
> state information when/if the kernel can get it.
> 
> Your argument is as if you said that the shape of a car's side 
> view mirrors is not important to its top speed, because the 
> overall shape of the chassis and engine power are much more 
> important.
> 
> But we are designing side view mirrors here, so we might as well 
> do a good job there.

If the kernel is going to make power choices automatically then it 
should do it everywhere, not piecemeal.

> > [...] This is going to be hugely more important on 
> > multi-socket systems, where your policy is usually going to be 
> > dictated by the specific workload that you're running at the 
> > time. [...]
> 
> If only we had some kernel subsystem that is intimately familiar 
> with the workloads running on the system and could act 
> accordingly and with low latency.
> 
> We could name that subsystem in some intuitive fashion: it 
> switches and schedules workloads, so how about calling it the 
> 'scheduler'?

The scheduler is unaware of whether I care about a process finishing 
quickly or whether I care about it consuming less power.

> > [...] The exception is in cases where your rack is 
> > overcommitted for power and your rack management unit is 
> > telling you to reduce power consumption since otherwise it's 
> > going to have to cut the power to one of the machines in the 
> > rack in the next few seconds.
> 
> ( That must be some ACPI middleware driven crap, right? Not 
>   really the Linux kernel's problem. )

It's as much the Linux kernel's problem as AC/battery decisions are - 
ie, it's not.

> > > The thing is, when I use Linux on a laptop then AC/battery 
> > > is *the* main policy input.
> > 
> > And it's already well handled from userspace, as it has to be.
> 
> Not according to the developers switching away from Linux 
> desktop distros in droves, because MacOSX or Win7 has 30%+ 
> better battery efficiency.

Ok so what you're actually telling me here is that you don't understand 
anything about power management and where our problems are.

> The scheduler might be a small part of the picture, but it's 
> certainly a part of it.

It's in the drivers, which is where it has been since we went tickless. 

> > No, because sched_mt_powersave usually crippled performance 
> > more than it saved power and nobody makes multi-socket 
> > laptops.
> 
> That's a user-space policy management fail right there: why 
> wasn't this fixed? If the default policy is in the kernel we can 
> at least fix it in one place for the most common cases. If it's 
> spread out amongst multiple projects then progress only happens 
> at glacial speed ...

sched_mt_powersave was inherently going to have a huge impact on 
performance, and with modern chips that would result in the platform 
consuming more power. It was a feature that was useful for a small 
number of generations of desktop CPUs - I don't think it would ever skew 
the power/performance ratio in a useful direction on mobile hardware. 
But feel free to blame userspace for hardware design.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Ingo Molnar

* Matthew Garrett  wrote:

> On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett  wrote:
> > > [...] AC/battery is just not an important power management 
> > > policy input when compared to various other things.
> > 
> > Such as?
> 
> The scheduler's behaviour is going to have a minimal impact on 
> power consumption on laptops. Other things are much more 
> important - backlight level, ASPM state, that kind of thing. 
> So why special case the scheduler? [...]

I'm not special casing the scheduler - but we are talking about 
scheduler policies here, so *if* it makes sense to handle this 
dynamically then obviously the scheduler wants to use system 
state information when/if the kernel can get it.

Your argument is as if you said that the shape of a car's side 
view mirrors is not important to its top speed, because the 
overall shape of the chassis and engine power are much more 
important.

But we are designing side view mirrors here, so we might as well 
do a good job there.

> [...] This is going to be hugely more important on 
> multi-socket systems, where your policy is usually going to be 
> dictated by the specific workload that you're running at the 
> time. [...]

If only we had some kernel subsystem that is intimately familiar 
with the workloads running on the system and could act 
accordingly and with low latency.

We could name that subsystem in some intuitive fashion: it 
switches and schedules workloads, so how about calling it the 
'scheduler'?

> [...] The exception is in cases where your rack is 
> overcommitted for power and your rack management unit is 
> telling you to reduce power consumption since otherwise it's 
> going to have to cut the power to one of the machines in the 
> rack in the next few seconds.

( That must be some ACPI middleware driven crap, right? Not 
  really the Linux kernel's problem. )

> > The thing is, when I use Linux on a laptop then AC/battery 
> > is *the* main policy input.
> 
> And it's already well handled from userspace, as it has to be.

Not according to the developers switching away from Linux 
desktop distros in droves, because MacOSX or Win7 has 30%+ 
better battery efficiency.

The scheduler might be a small part of the picture, but it's 
certainly a part of it.

> > > Userspace has been doing a perfectly reasonable job of 
> > > determining policy here.
> > 
> > Has it properly switched the scheduler's balancing between 
> > power-efficient and performance-maximizing strategies when for 
> > example a laptop's AC got unplugged/replugged?
> 
> No, because sched_mt_powersave usually crippled performance 
> more than it saved power and nobody makes multi-socket 
> laptops.

That's a user-space policy management fail right there: why 
wasn't this fixed? If the default policy is in the kernel we can 
at least fix it in one place for the most common cases. If it's 
spread out amongst multiple projects then progress only happens 
at glacial speed ...

Thanks,

Ingo


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Alan Cox
> > That's a fundamentally uninteresting thing for the kernel to 
> > know about. [...]
> 
> I disagree.

The kernel has no idea of the power architecture leading up to the plug
socket. The kernel has no idea of the policy concerns of the user.

> > [...] AC/battery is just not an important power management 
> > policy input when compared to various other things.
> 
> Such as?
> 
> The thing is, when I use Linux on a laptop then AC/battery is 
> *the* main policy input.

Along with distance likely to be travelled without a socket being
available, whether you remembered the charger, and a pile of other things
('can I get this built before Linus wakes up').

The kernel isn't capable of computing these other factors. Userspace
can at least make an educated guess.

In the business space it's even more complicated because battery/mains may
well only be visible via SNMP queries to the power systems and the bigger
concern may well be heat efficiency. If you are running a cloud your
policy considerations also include things like your current spot
electricity price, outside temperature and the spot compute price you can
currently charge.

> > Userspace has been doing a perfectly reasonable job of 
> > determining policy here.
> 
> Has it properly switched the scheduler's balancing between 
> power-efficient and performance-maximizing strategies when for 
> example a laptop's AC got unplugged/replugged?

You work for Red Hat, maybe you should ask your distro people if they do.
While you are at it, perhaps some of the ATA power management, which
will probably be an order of magnitude more significant, could also get
included ;)

Seriously. On a typical laptop the things you can do about power are
dominated by the backlight, by disk power (eg idle SATA links), by USB
device power downs where possible, by turning off any unused phys and by
not having the CPU wake up, which means fixing the desktop apps to behave
sensibly.

I'd like to see actual numbers and evidence, on a wide range of workloads,
that the spread/don't spread thing is even measurable given that you've also
got to factor in effects like completing faster and turning everything
off. I'd *really* like to see such evidence on a laptop, which is your
one cited case where it might work.

> > Your suggestions aren't a working default mechanism.
> 
> In what way?

For one, if the default behaviour is that when I get on the train and am
on battery my video playback begins to stutter due to some kernel
magic then I shall be unamused and file it as a regression.

Policy is userspace - the desktop can figure out I'm watching movies and
what this means, the kernel can't.

I'd also note there have been repeated attempts on various OSes to put
the power management policy

- in the hardware
- in SMM handlers
- in the kernel

and every single one has been a failure because those parts of the system
never have enough information nor do they have enough variety of control
to manage the complexity of input state.

It's a single policy file for a distro to do scheduler configuration
based upon power events. One trivial 'drop it here' shell script. The
difference then being the desktop can be taught to do overrides and
policy properly.

It might be that the kernel has important knowledge about what "schedule
for efficiency" means, and it may even make sense to be able to ask the
kernel to do that - but it has no idea what the right policy is at any
given moment.

ie even if there is a /sys/mumble/schedule_for_efficiency

the echo "1" > and echo "0" > belong in a script
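
As a rough sketch of what such a userspace policy hook could look like
(Python here rather than shell, purely for illustration): the knob path is
the placeholder from above and does not exist in any kernel, and the
adapter name "AC" under /sys/class/power_supply is an assumption that
varies between machines. The override argument stands in for the desktop
deciding it knows better, e.g. during video playback.

```python
# Hypothetical knob: "/sys/mumble/schedule_for_efficiency" is the
# placeholder name used in this thread, not a real sysfs file.
KNOB = "/sys/mumble/schedule_for_efficiency"

# Assumed adapter name; real machines may call it AC, ACAD, ADP1, ...
AC_ONLINE = "/sys/class/power_supply/AC/online"


def on_ac(ac_online_path=AC_ONLINE):
    """Return True if the adapter's 'online' attribute reads 1."""
    with open(ac_online_path) as f:
        return f.read().strip() == "1"


def apply_policy(ac_online_path=AC_ONLINE, knob_path=KNOB, override=None):
    """Write "1" (schedule for efficiency) on battery and "0" on AC.

    'override' lets a higher-level policy force a value regardless of
    the power source, e.g. "0" while a movie is playing.
    """
    if override is not None:
        value = override
    else:
        value = "0" if on_ac(ac_online_path) else "1"
    with open(knob_path, "w") as f:
        f.write(value)
    return value
```

A power-event hook would call apply_policy() on each plug/unplug event; the
point is that the decision lives in a file a distro or desktop can replace.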

Alan



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Matthew Garrett
On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> * Matthew Garrett  wrote:
> > [...] AC/battery is just not an important power management 
> > policy input when compared to various other things.
> 
> Such as?

The scheduler's behaviour is going to have a minimal impact on power 
consumption on laptops. Other things are much more important - backlight 
level, ASPM state, that kind of thing. So why special case the 
scheduler? This is going to be hugely more important on multi-socket 
systems, where your policy is usually going to be dictated by the 
specific workload that you're running at the time. The exception is in 
cases where your rack is overcommitted for power and your rack 
management unit is telling you to reduce power consumption since 
otherwise it's going to have to cut the power to one of the machines in 
the rack in the next few seconds.

> The thing is, when I use Linux on a laptop then AC/battery is 
> *the* main policy input.

And it's already well handled from userspace, as it has to be.

> > Userspace has been doing a perfectly reasonable job of 
> > determining policy here.
> 
> Has it properly switched the scheduler's balancing between 
> power-efficient and performance-maximizing strategies when for 
> example a laptop's AC got unplugged/replugged?

No, because sched_mt_powersave usually crippled performance more than it 
saved power and nobody makes multi-socket laptops.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Arjan van de Ven

>>> A modern kernel better know what state the system is in: on 
>>> battery or on AC power.
>>
>> That's a fundamentally uninteresting thing for the kernel to 
>> know about. [...]
> 
> I disagree.

and I'll agree with Matthew and disagree with you ;-)

> 
>> [...] AC/battery is just not an important power management 
>> policy input when compared to various other things.
> 
> Such as?
> 
> The thing is, when I use Linux on a laptop then AC/battery is 
> *the* main policy input.

I think you're wrong there.
First of all, not the whole world is a laptop.
Phones and servers are very different than laptops in this sense.
In a phone, when you're charging, you want to be EXTRA power efficient in
many ways (since charging creates heat, and that heat will take away your
thermal budget).
In a datacenter, you're either on AC or DC all the time, and power
efficiency still matters.

And even on a laptop.. heat production matters even when on AC... laptops
are more and more like phones that way.



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Ingo Molnar

* Matthew Garrett  wrote:

> On Tue, Aug 21, 2012 at 11:42:04AM +0200, Ingo Molnar wrote:
> > * Matthew Garrett  wrote:
> > > [...] Putting this kind of policy in the kernel is an awful 
> > > idea. [...]
> > 
> > A modern kernel better know what state the system is in: on 
> > battery or on AC power.
> 
> That's a fundamentally uninteresting thing for the kernel to 
> know about. [...]

I disagree.

> [...] AC/battery is just not an important power management 
> policy input when compared to various other things.

Such as?

The thing is, when I use Linux on a laptop then AC/battery is 
*the* main policy input.

> > > [...] It should never be altering policy itself, [...]
> > 
> > The kernel/scheduler simply offers sensible defaults where 
> > it can. User-space can augment/modify/override that in any 
> > which way it wishes to.
> >
> > This stuff has not been properly sorted out in the last 10+ 
> > years since we have battery driven devices, so we might as 
> > well start with the kernel offering sane default behavior 
> > where it can ...
> 
> Userspace has been doing a perfectly reasonable job of 
> determining policy here.

Has it properly switched the scheduler's balancing between 
power-efficient and performance-maximizing strategies when for 
example a laptop's AC got unplugged/replugged?

> > > [...] because it'll get it wrong and people will file bugs 
> > > complaining that it got it wrong and the biggest case 
> > > where you *need* to be able to handle switching between 
> > > performance and power optimisations (your rack management 
> > > unit just told you that you're going to have to drop power 
> > > consumption by 20W) is one where the kernel doesn't have 
> > > all the information it needs to do this. So why bother at 
> > > all?
> > 
> > The point is to have a working default mechanism.
> 
> Your suggestions aren't a working default mechanism.

In what way?

Thanks,

Ingo


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Matthew Garrett
On Tue, Aug 21, 2012 at 11:42:04AM +0200, Ingo Molnar wrote:
> * Matthew Garrett  wrote:
> > [...] Putting this kind of policy in the kernel is an awful 
> > idea. [...]
> 
> A modern kernel better know what state the system is in: on 
> battery or on AC power.

That's a fundamentally uninteresting thing for the kernel to know about. 
AC/battery is just not an important power management policy input when 
compared to various other things.

> > [...] It should never be altering policy itself, [...]
> 
> The kernel/scheduler simply offers sensible defaults where it 
> can. User-space can augment/modify/override that in any which 
> way it wishes to.
>
> This stuff has not been properly sorted out in the last 10+ 
> years since we have battery driven devices, so we might as well 
> start with the kernel offering sane default behavior where it 
> can ...

Userspace has been doing a perfectly reasonable job of determining 
policy here.
 
> > [...] because it'll get it wrong and people will file bugs 
> > complaining that it got it wrong and the biggest case where 
> > you *need* to be able to handle switching between performance 
> > and power optimisations (your rack management unit just told 
> > you that you're going to have to drop power consumption by 
> > 20W) is one where the kernel doesn't have all the information 
> > it needs to do this. So why bother at all?
> 
> The point is to have a working default mechanism.

Your suggestions aren't a working default mechanism.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Vincent Guittot
On 21 August 2012 02:58, Alex Shi  wrote:
> On 08/20/2012 11:36 PM, Vincent Guittot wrote:
>
>>> > What you want it to keep track of a per-cpu utilization level (inverse
>>> > of idle-time) and using PJTs per-task runnable avg see if placing the
>>> > new task on will exceed the utilization limit.
>>> >
>>> > I think some of the Linaro people actually played around with this,
>>> > Vincent?
>> Sorry for the late reply but I had almost no network access during last 
>> weeks.
>>
>> So Linaro also works on a power aware scheduler as Peter mentioned.
>>
>> Based on previous tests, we have concluded that the main drawback of the
>> (now removed) old power scheduler was that we had no way to distinguish
>> between short and long running tasks, whereas that's a key
>> input (at least for phones) for deciding to pack tasks and for
>> selecting the core on an asymmetric system.
>
>
> It is hard to estimate the future from a general viewpoint, but as a
> hack, maybe you can add something to task_struct to hint at this. :)
>

The per-task load tracking patchsets give you a good view of the last dozen ms.
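
To make the utilization-limit idea quoted above concrete, here is a toy
sketch in Python. It is not the Linaro code or anything from the load
tracking patchsets; the function name, the 0.8 limit and the fallback to
the least-loaded CPU are all assumptions made for illustration.

```python
def pick_cpu(cpu_util, task_avg, limit=0.8):
    """Pick a CPU for a new task, preferring to pack.

    cpu_util: per-CPU utilization in [0, 1] (inverse of idle time)
    task_avg: the task's runnable average in [0, 1]
    limit:    assumed per-CPU utilization limit

    Pack onto the busiest CPU that stays at or under 'limit' once the
    task is added, so idle CPUs can remain idle. If no CPU fits,
    fall back to spreading onto the least-loaded CPU.
    """
    fits = [c for c, u in enumerate(cpu_util) if u + task_avg <= limit]
    if fits:
        return max(fits, key=lambda c: cpu_util[c])
    return min(range(len(cpu_util)), key=lambda c: cpu_util[c])
```

With utilizations [0.5, 0.1, 0.0] a task averaging 0.2 lands on CPU 0; with
[0.7, 0.1, 0.0] it no longer fits there and goes to CPU 1 instead of waking
the idle CPU 2. A short-running task has a small average and packs easily,
which is where the short/long distinction comes in.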

>> One additional key piece of information is the power distribution in the
>> system, which can have a finer granularity than the current sched_domain
>> description. Peter's proposal was to use a SHARE_POWERLINE flag,
>> similar to the flags that already describe whether a sched_domain shares
>> resources or cpu capacity.
>
>
> Seems I missed this. What's the difference from the current
> SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES?

SD_SHARE_CPUPOWER is set in a sched domain at SMT level (sharing some
part of the physical core)
SD_SHARE_PKG_RESOURCES is set at MC level (sharing some resources like
cache and memory access)

>
>>
>> With these 2 new pieces of information, we can have a first power saving
>> scheduler which spreads or packs tasks across cores and packages
>
>
> Fine, I'd like to test them on X86, plus SMT and NUMA :)
>
>>
>> Vincent
>
>


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Ingo Molnar

* Matthew Garrett  wrote:

> On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:
> 
> > If the answer is 'yes' then there's clear cases where the kernel 
> > (should) automatically know the events where we switch from 
> > balancing for performance to balancing for power:
> 
> No. We can't identify all of these cases and we can't identify 
> corner cases. [...]

There's no need to identify 'all' of these cases - but if the 
kernel knows then it can have intelligent default behavior.

> [...] Putting this kind of policy in the kernel is an awful 
> idea. [...]

A modern kernel better know what state the system is in: on 
battery or on AC power.

> [...] It should never be altering policy itself, [...]

The kernel/scheduler simply offers sensible defaults where it 
can. User-space can augment/modify/override that in any which 
way it wishes to.

This stuff has not been properly sorted out in the last 10+ 
years since we have battery driven devices, so we might as well 
start with the kernel offering sane default behavior where it 
can ...

> [...] because it'll get it wrong and people will file bugs 
> complaining that it got it wrong and the biggest case where 
> you *need* to be able to handle switching between performance 
> and power optimisations (your rack management unit just told 
> you that you're going to have to drop power consumption by 
> 20W) is one where the kernel doesn't have all the information 
> it needs to do this. So why bother at all?

The point is to have a working default mechanism.

Thanks,

Ingo


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Mike Galbraith
On Tue, 2012-08-21 at 17:02 +0100, Alan Cox wrote:

> I'd like to see actual numbers and evidence, on a wide range of workloads,
> that the spread/don't spread thing is even measurable given that you've also
> got to factor in effects like completing faster and turning everything
> off. I'd *really* like to see such evidence on a laptop, which is your
> one cited case where it might work.

For my dinky dual core laptop, I suspect you're right, but for a more
powerful laptop, I'd expect spread/don't to be noticeable.

Yeah, hard numbers would be nice to see.

If I had a powerful laptop, I'd kill irq balancing, and all but periodic
load balancing, and expect to see a positive result.  Dunno what fickle
electron gods would _really_ do with those prayers though.

-Mike



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Ingo Molnar

* Matthew Garrett m...@redhat.com wrote:

> On Tue, Aug 21, 2012 at 05:19:10PM +0200, Ingo Molnar wrote:
> > * Matthew Garrett mj...@srcf.ucam.org wrote:
> > > [...] AC/battery is just not an important power management
> > > policy input when compared to various other things.
> >
> > Such as?
>
> The scheduler's behaviour is going to have a minimal impact on
> power consumption on laptops. Other things are much more
> important - backlight level, ASPM state, that kind of thing.
> So why special case the scheduler? [...]

I'm not special casing the scheduler - but we are talking about 
scheduler policies here, so *if* it makes sense to handle this 
dynamically then obviously the scheduler wants to use system 
state information when/if the kernel can get it.

Your argument is as if you said that the shape of a car's side 
view mirrors is not important to its top speed, because the 
overall shape of the chassis and engine power are much more 
important.

But we are designing side view mirrors here, so we might as well 
do a good job there.

 [...] This is going to be hugely more important on 
 multi-socket systems, where your policy is usually going to be 
 dictated by the specific workload that you're running at the 
 time. [...]

If only we had some kernel subsystem that is intimately familiar 
with the workloads running on the system and could act 
accordingly and with low latency.

We could name that subsystem in some intuitive fashion: it 
switches and schedules workloads, so how about calling it the 
'scheduler'?

 [...] The exception is in cases where your rack is 
 overcommitted for power and your rack management unit is 
 telling you to reduce power consumption since otherwise it's 
 going to have to cut the power to one of the machines in the 
 rack in the next few seconds.

( That must be some ACPI middleware driven crap, right? Not 
  really the Linux kernel's problem. )

  The thing is, when I use Linux on a laptop then AC/battery 
  is *the* main policy input.
 
 And it's already well handled from userspace, as it has to be.

Not according to the developers switching away from Linux 
desktop distros in droves, because MacOSX or Win7 has 30%+ 
better battery efficiency.

The scheduler might be a small part of the picture, but it's 
certainly a part of it.

   Userspace has been doing a perfectly reasonable job of 
   determining policy here.
  
  Has it properly switched the scheduler's balancing between 
  power-efficient and performance-maximizing strategies when for 
  example a laptop's AC got unplugged/replugged?
 
 No, because sched_mt_powersave usually crippled performance 
 more than it saved power and nobody makes multi-socket 
 laptops.

That's a user-space policy management fail right there: why 
wasn't this fixed? If the default policy is in the kernel we can 
at least fix it in one place for the most common cases. If it's 
spread out amongst multiple projects then progress only happens 
at glacial speed ...

Thanks,

Ingo


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Matthew Garrett
On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
 * Matthew Garrett m...@redhat.com wrote:
  The scheduler's behaviour is going to have a minimal impact on 
  power consumption on laptops. Other things are much more 
  important - backlight level, ASPM state, that kind of thing. 
  So why special case the scheduler? [...]
 
 I'm not special casing the scheduler - but we are talking about 
 scheduler policies here, so *if* it makes sense to handle this 
 dynamically then obviously the scheduler wants to use system 
 state information when/if the kernel can get it.
 
 Your argument is as if you said that the shape of a car's side 
 view mirrors is not important to its top speed, because the 
 overall shape of the chassis and engine power are much more 
 important.
 
 But we are designing side view mirrors here, so we might as well 
 do a good job there.

If the kernel is going to make power choices automatically then it 
should do it everywhere, not piecemeal.

  [...] This is going to be hugely more important on 
  multi-socket systems, where your policy is usually going to be 
  dictated by the specific workload that you're running at the 
  time. [...]
 
 If only we had some kernel subsystem that is intimately familiar 
 with the workloads running on the system and could act 
 accordingly and with low latency.
 
 We could name that subsystem in some intuitive fashion: it 
 switches and schedules workloads, so how about calling it the 
 'scheduler'?

The scheduler is unaware of whether I care about a process finishing 
quickly or whether I care about it consuming less power.

  [...] The exception is in cases where your rack is 
  overcommitted for power and your rack management unit is 
  telling you to reduce power consumption since otherwise it's 
  going to have to cut the power to one of the machines in the 
  rack in the next few seconds.
 
 ( That must be some ACPI middleware driven crap, right? Not 
   really the Linux kernel's problem. )

It's as much the Linux kernel's problem as AC/battery decisions are - 
ie, it's not.

   The thing is, when I use Linux on a laptop then AC/battery 
   is *the* main policy input.
  
  And it's already well handled from userspace, as it has to be.
 
 Not according to the developers switching away from Linux 
 desktop distros in droves, because MacOSX or Win7 has 30%+ 
 better battery efficiency.

Ok so what you're actually telling me here is that you don't understand 
anything about power management and where our problems are.

 The scheduler might be a small part of the picture, but it's 
 certainly a part of it.

It's in the drivers, which is where it has been since we went tickless. 

  No, because sched_mt_powersave usually crippled performance 
  more than it saved power and nobody makes multi-socket 
  laptops.
 
 That's a user-space policy management fail right there: why 
 wasn't this fixed? If the default policy is in the kernel we can 
 at least fix it in one place for the most common cases. If it's 
 spread out amongst multiple projects then progress only happens 
 at glacial speed ...

sched_mt_powersave was inherently going to have a huge impact on 
performance, and with modern chips that would result in the platform 
consuming more power. It was a feature that was useful for a small 
number of generations of desktop CPUs - I don't think it would ever skew 
the power/performance ratio in a useful direction on mobile hardware. 
But feel free to blame userspace for hardware design.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Ingo Molnar

* Matthew Garrett mj...@srcf.ucam.org wrote:

 On Tue, Aug 21, 2012 at 05:59:08PM +0200, Ingo Molnar wrote:
  * Matthew Garrett m...@redhat.com wrote:
   The scheduler's behaviour is going to have a minimal impact on 
   power consumption on laptops. Other things are much more 
   important - backlight level, ASPM state, that kind of thing. 
   So why special case the scheduler? [...]
  
  I'm not special casing the scheduler - but we are talking about 
  scheduler policies here, so *if* it makes sense to handle this 
  dynamically then obviously the scheduler wants to use system 
  state information when/if the kernel can get it.
  
  Your argument is as if you said that the shape of a car's side 
  view mirrors is not important to its top speed, because the 
  overall shape of the chassis and engine power are much more 
  important.
  
  But we are designing side view mirrors here, so we might as well 
  do a good job there.
 
 If the kernel is going to make power choices automatically 
 then it should do it everywhere, not piecemeal.

Why? Good scheduling is useful even in isolation.

 The scheduler is unaware of whether I care about a process 
 finishing quickly or whether I care about it consuming less 
 power.

You are posing them as if the two were mutually exclusive, while 
in reality they are not necessarily exclusive: it's quite 
possible that the highest (non-turbo) CPU frequency happens to 
be the most energy efficient one for a CPU with a particular 
workload ...

You also missed the bit of my mail where I suggested that such 
user preferences and tolerances can be communicated to the 
scheduler via a policy toggle - which the scheduler would take 
into account.

I suggest using sane defaults, such as being energy efficient 
on battery power (within a sane threshold) and maximizing 
throughput on AC power (within a sane threshold).

That would go a *long* way towards improving the current mess. If Linux 
power efficiency was so good today then I'd not ask for kernel 
driven defaults - but the reality is that in terms of process 
scheduling we suck today (and have sucked for the last 10 years) 
so pretty much any approach will improve things.
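
The defaults suggested above (efficient on battery, throughput on AC,
with a user toggle that overrides both) can be sketched as follows. This
is only an illustration in Python: the policy names "performance" and
"power" come from Alex's proposal and "auto" from Peter's reply, while
the enum and function names are hypothetical.

```python
from enum import Enum


class BalancePolicy(Enum):
    PERFORMANCE = "performance"
    POWER = "power"
    AUTO = "auto"


def effective_policy(requested: BalancePolicy, on_battery: bool) -> BalancePolicy:
    """Resolve the policy the balancer would act on.

    An explicit user choice always wins; only AUTO looks at the power source.
    """
    if requested is not BalancePolicy.AUTO:
        return requested
    return BalancePolicy.POWER if on_battery else BalancePolicy.PERFORMANCE
```

With this shape the AC/battery state is never more than a default: a
user who wants full throughput on battery sets PERFORMANCE and the power
source is ignored.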

    The thing is, when I use Linux on a laptop then 
    AC/battery is *the* main policy input.
   
   And it's already well handled from userspace, as it has to 
   be.
  
  Not according to the developers switching away from Linux 
  desktop distros in droves, because MacOSX or Win7 has 30%+ 
  better battery efficiency.
 
 Ok so what you're actually telling me here is that you don't 
 understand anything about power management and where our 
 problems are.

Huh? In practice we suck today in terms of energy efficiency. 
That covers both scheduling and other areas.

Saying this out loud does not say anything about my 
understanding of power management...

So please outline a technical point.

  The scheduler might be a small part of the picture, but it's 
  certainly a part of it.
 
 It's in the drivers, which is where it has been since we went 
 tickless.

You mean the code is in drivers? Or the problem is in drivers? 

Both are true currently - this discussion is about the future, to 
make the scheduler aware of power concerns, as the scheduler 
(and the timer subsystem) already calculates various interesting 
metrics that matter to energy efficient scheduling.

   No, because sched_mt_powersave usually crippled performance 
   more than it saved power and nobody makes multi-socket 
   laptops.
  
  That's a user-space policy management fail right there: why 
  wasn't this fixed? If the default policy is in the kernel we can 
  at least fix it in one place for the most common cases. If it's 
  spread out amongst multiple projects then progress only happens 
  at glacial speed ...
 
 sched_mt_powersave was inherently going to have a huge impact 
 on performance, and with modern chips that would result in the 
 platform consuming more power. It was a feature that was 
 useful for a small number of generations of desktop CPUs - I 
 don't think it would ever skew the power/performance ratio in 
 a useful direction on mobile hardware. But feel free to blame 
 userspace for hardware design.

FYI, sched_mt_powersave is *GONE* in recent kernels, because it 
basically never worked. This thread is about designing and 
implementing something that actually works.

Thanks,

Ingo


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Matthew Garrett
On Tue, Aug 21, 2012 at 08:23:46PM +0200, Ingo Molnar wrote:
 * Matthew Garrett mj...@srcf.ucam.org wrote:
  The scheduler is unaware of whether I care about a process 
  finishing quickly or whether I care about it consuming less 
  power.
 
 You are posing them as if the two were mutually exclusive, while 
 in reality they are not necessarily exclusive: it's quite 
 possible that the highest (non-turbo) CPU frequency happens to 
 be the most energy efficient one for a CPU with a particular 
 workload ...

You just put in a proviso that makes them mutually exclusive. If I want 
it done fast, I want it done in the highest turbo CPU frequency. If I 
don't want it done fast, I want it done in the most efficient CPU 
frequency. They're probably not the same thing.
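
The disagreement here is at bottom an energy calculation: for a fixed
amount of work inside a fixed time window, energy is active power times
busy time plus idle power times the remainder. The sketch below works
that out in Python. The three operating points and all wattages are
invented numbers, chosen only to show that "fastest" and "cheapest" can
differ; they do not describe any real CPU.

```python
def energy_joules(work_cycles, freq_hz, active_watts, idle_watts, window_s):
    """Energy to finish the work at one frequency, then idle until the window ends."""
    busy_s = work_cycles / freq_hz
    if busy_s > window_s:
        raise ValueError("work does not fit in the window at this frequency")
    return active_watts * busy_s + idle_watts * (window_s - busy_s)


# Hypothetical operating points: (label, frequency in Hz, active power in W).
POINTS = [
    ("low", 1.0e9, 6.0),
    ("nominal", 2.0e9, 10.0),
    ("turbo", 3.0e9, 24.0),
]


def fastest_and_cheapest(work_cycles, idle_watts, window_s):
    """Return (label of fastest point, label of lowest-energy point)."""
    fastest = max(POINTS, key=lambda p: p[1])[0]
    cheapest = min(
        POINTS,
        key=lambda p: energy_joules(work_cycles, p[1], p[2], idle_watts, window_s),
    )[0]
    return fastest, cheapest
```

With 2e9 cycles of work, 1 W idle power and a 2 second window, these
made-up numbers give 12 J at "low", 11 J at "nominal" and about 17.3 J at
"turbo": the highest non-turbo point is the most efficient, and it is
still not the fastest.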

 You also missed the bit of my mail where I suggested that such 
 user preferences and tolerances can be communicated to the 
 scheduler via a policy toggle - which the scheduler would take 
 into account.

Yes. And that toggle should be the thing that defines the policy under 
all circumstances.

  Ok so what you're actually telling me here is that you don't 
  understand anything about power management and where our 
  problems are.
 
 Huh? In practice we suck today in terms of energy efficiency. 
 That covers both scheduling and other areas.
 
 Saying this out loud does not say anything about my 
 understanding of power management...
 
 So please outline a technical point.

Our power consumption is worse than under other operating systems 
almost entirely because only one of our three GPU drivers implements any 
kind of useful power management. The power saving functionality that we 
expose to userspace is already used when it's safe to do so. So blaming 
our userspace policy management for our higher power consumption means 
that you can't possibly understand where the problems actually are, 
which indicates that you probably shouldn't be trying to tell me about 
optimal approaches to power management.

 You mean the code is in drivers? Or the problem is in drivers? 

The problem is in the drivers.

  sched_mt_powersave was inherently going to have a huge impact 
  on performance, and with modern chips that would result in the 
  platform consuming more power. It was a feature that was 
  useful for a small number of generations of desktop CPUs - I 
  don't think it would ever skew the power/performance ratio in 
  a useful direction on mobile hardware. But feel free to blame 
  userspace for hardware design.
 
 FYI, sched_mt_powersave is *GONE* in recent kernels, because it 
 basically never worked. This thread is about designing and 
 implementing something that actually works.

Yes. You asked me whether userspace ever used the knobs that the kernel 
exposed. I said no, because the only knob relevant for laptops would 
never improve energy efficiency on laptops. It is therefore impossible 
to use this as an example of userspace policy management not doing the 
right thing.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-21 Thread Alan Cox
 Why? Good scheduling is useful even in isolation.

For power - I suspect it's damn near irrelevant except on a big big
machine.

Unless you've sorted out your SATA, fixed your phy handling, optimised
your desktop for wakeups and worked down the big wakeup causes one by one
it's turd polishing.

PM means fixing the stack top to bottom, and it's a whack-a-mole game:
each one you fix, you find the next. You have to sort the entire stack
from desktop apps to kernel.

However benchmarks talk - so let's have some benchmarks ... on a laptop.

Alan


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Alex Shi
On 08/20/2012 11:47 PM, Vincent Guittot wrote:

> On 16 August 2012 07:03, Alex Shi  wrote:
>> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>>
>>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>>
>>>> power aware scheduling), this proposal will adopt the
>>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>>
>>> Are there workloads in which "power" might provide more performance than
>>> "performance"? If so, don't use these terms.
>>>
>>
>>
>> By design, the power scheme should have no chance of better performance.
> 
> A side effect of packing small tasks on one core is that you always
> use the core with the lowest C-state, which will minimize the wake up
> latency, so you can sometimes get better results than performance mode,
> which will try to use another core in another cluster that will take
> more time to wake up than waiting for the end of the current task.
> 


Sure. In some scenarios, packing tasks into a smaller domain will bring
a performance benefit.


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Alex Shi
On 08/20/2012 11:36 PM, Vincent Guittot wrote:

>> > What you want it to keep track of a per-cpu utilization level (inverse
>> > of idle-time) and using PJTs per-task runnable avg see if placing the
>> > new task on will exceed the utilization limit.
>> >
>> > I think some of the Linaro people actually played around with this,
>> > Vincent?
> Sorry for the late reply but I had almost no network access during last weeks.
> 
> So Linaro also works on a power aware scheduler as Peter mentioned.
> 
> Based on previous tests, we have concluded that the main drawback of the
> (now removed) old power scheduler was that we had no way to distinguish
> between short and long running tasks, whereas that is a key
> input (at least for phones) for deciding to pack tasks and for
> selecting the core on an asymmetric system.


It is hard to estimate the future in general, but as a hack maybe you
can add something to task_struct to hint at this. :)

> One additional key piece of information is the power distribution in the
> system, which can have a finer granularity than the current sched_domain
> description. Peter's proposal was to use a SHARE_POWERLINE flag
> similarly to flags that already describe if a sched_domain shares
> resources or cpu capacity.


It seems I missed this. What's the difference from the current
SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES flags?

> 
> With these 2 new pieces of information, we can have a 1st power saving
> scheduler which spreads or packs tasks across cores and packages


Fine, I'd like to test them on x86, plus SMT and NUMA :)

> 
> Vincent
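
The utilization-limit idea quoted at the top of this message (track a
per-cpu utilization level and check whether adding the new task's
runnable average would exceed a limit) can be sketched like this in
Python. It is a toy model, not scheduler code: utilizations are plain
fractions in [0, 1], the 0.8 limit is an arbitrary assumption, and
preferring the least busy non-idle CPU follows Peter's remark elsewhere
in the thread.

```python
from typing import Optional, Sequence


def fits(cpu_util: float, task_avg: float, limit: float) -> bool:
    """True if placing the task keeps the CPU at or under the limit."""
    return cpu_util + task_avg <= limit


def pick_cpu(cpu_utils: Sequence[float], task_avg: float,
             limit: float = 0.8) -> Optional[int]:
    """Return the index of the CPU to place the task on, or None if nothing fits.

    Already-running CPUs are tried first so that idle ones can stay idle.
    """
    busy = [(u, i) for i, u in enumerate(cpu_utils)
            if u > 0.0 and fits(u, task_avg, limit)]
    if busy:
        return min(busy)[1]  # least busy non-idle CPU that still fits
    idle = [i for i, u in enumerate(cpu_utils)
            if u == 0.0 and fits(u, task_avg, limit)]
    return idle[0] if idle else None
```

Only when no running CPU has room does the sketch wake an idle one, which
is the packing behaviour a power policy would want.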




Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Christoph Lameter
On Mon, 20 Aug 2012, Matthew Garrett wrote:

> On Mon, Aug 20, 2012 at 03:47:54PM +, Christoph Lameter wrote:
>
> > So please make sure that there are obvious and easy ways to switch this
> > stuff off or provide a "low latency" knob that keeps the system from
> > assuming that idle time means that full performance is not needed.
>
> That seems like an issue for cpuidle, not the scheduler. Does pm_qos not
> already do what you want?

Don't know. A simple solution is not to compile power management into the
kernel.



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Matthew Garrett
On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:

> If the answer is 'yes' then there's clear cases where the kernel 
> (should) automatically know the events where we switch from 
> balancing for performance to balancing for power:

No. We can't identify all of these cases and we can't identify corner 
cases. Putting this kind of policy in the kernel is an awful idea. It 
should never be altering policy itself, because it'll get it wrong and 
people will file bugs complaining that it got it wrong and the biggest 
case where you *need* to be able to handle switching between performance 
and power optimisations (your rack management unit just told you that 
you're going to have to drop power consumption by 20W) is one where the 
kernel doesn't have all the information it needs to do this. So why 
bother at all?

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Vincent Guittot
On 17 August 2012 10:43, Paul Turner  wrote:
> On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra  
> wrote:
>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>> Since there is no power saving consideration in scheduler CFS, I has a
>>> very rough idea for enabling a new power saving schema in CFS.
>>
>> Adding Thomas, he always delights poking holes in power schemes.
>>
>>> It bases on the following assumption:
>>> 1, If there are many task crowd in system, just let few domain cpus
>>> running and let other cpus idle can not save power. Let all cpu take the
>>> load, finish tasks early, and then get into idle. will save more power
>>> and have better user experience.
>>
>> I'm not sure this is a valid assumption. I've had it explained to me by
>> various people that race-to-idle isn't always the best thing. It has to
>> do with the cost of switching power states and the duration of execution
>> and other such things.
>>
>>> 2, schedule domain, schedule group perfect match the hardware, and
>>> the power consumption unit. So, pull tasks out of a domain means
>>> potentially this power consumption unit idle.
>>
>> I'm not sure I understand what you're saying, sorry.
>>
>>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
>> between the two based on AC/BAT, UPS status and simple things like that.
>> But this seems like a later concern, you have to have something to pick
>> between before you can pick :-)
>>
>>> And in scheduling, 2 place will care the policy, load_balance() and in
>>> task fork/exec: select_task_rq_fair().
>>
>> ack
>>
>>> Here is some pseudo code try to explain the proposal behaviour in
>>> load_balance() and select_task_rq_fair();
>>
>> Oh man.. A few words outlining the general idea would've been nice.
>>
>>> load_balance() {
>>>   update_sd_lb_stats(); //get busiest group, idlest group data.
>>>
>>>   if (sd->nr_running > sd's capacity) {
>>>   //power saving policy is not suitable for
>>>   //this scenario, it runs like performance policy
>>>   mv tasks from busiest cpu in busiest group to
>>>   idlest  cpu in idlest group;
>>
>> Once upon a time we talked about adding a factor to the capacity for
>> this. So say you'd allow 2*capacity before overflowing and waking
>> another power group.
>>
>> But I think we should not go on nr_running here, PJTs per-entity load
>> tracking stuff gives us much better measures -- also, repost that series
>> already Paul! :-)
>
> Yes -- I just got back from Africa this week.  It's updated for almost
> all the previous comments but I ran out of time before I left to
> re-post.  I'm just about caught up enough that I should be able to get
> this done over the upcoming weekend.  Monday at the latest.
>
>>
>> Also, I'm not sure this is entirely correct, the thing you want to do
>> for power aware stuff is to minimize the number of active power domains,
>> this means you don't want idlest, you want least busy non-idle.
>>
>>>   } else {// the sd has enough capacity to hold all tasks.
>>>   if (sg->nr_running > sg's capacity) {
>>>   //imbalanced between groups
>>>   if (schedule policy == performance) {
>>>   //when 2 busiest group at same busy
>>>   //degree, need to prefer the one has
>>>   // softest group??
>>>   move tasks from busiest group to
>>>   idlest group;
>>
>> So I'd leave the currently implemented scheme as performance, and I
>> don't think the above describes the current state.
>>
>>>   } else if (schedule policy == power)
>>>   move tasks from busiest group to
>>>   idlest group until busiest is just full
>>>   of capacity.
>>>   //the busiest group can balance
>>>   //internally after next time LB,
>>
>> There's another thing we need to do, and that is collect tasks in a
>> minimal amount of power domains. The old code (that got deleted) did
>> something like that, you can revive some of the that code if needed -- I
>> just killed everything to be able to start with a clean slate.
>>
>>
>>>   } else {
>>>   //all groups has enough capacity for its tasks.
>>>   if (schedule policy == performance)
>>>   //all tasks may has enough cpu
>>>   //resources to run,
>>>   //mv tasks from busiest to idlest group?
>>>   //no, at this time, 

Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Matthew Garrett
On Mon, Aug 20, 2012 at 03:47:54PM +, Christoph Lameter wrote:

> So please make sure that there are obvious and easy ways to switch this
> stuff off or provide a "low latency" knob that keeps the system from
> assuming that idle time means that full performance is not needed.

That seems like an issue for cpuidle, not the scheduler. Does pm_qos not 
already do what you want?

-- 
Matthew Garrett | mj...@srcf.ucam.org
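
For reference, the userspace side of pm_qos mentioned above works by
holding a file open: a process writes a 32-bit latency bound in
microseconds to /dev/cpu_dma_latency, and the request stays in force
until the file is closed. A minimal Python sketch follows, assuming that
device is present and writable; the context-manager name is made up, and
whether this covers hardware-level mechanisms such as the memory channel
power-down Christoph describes is not settled in this thread.

```python
import os
import struct
from contextlib import contextmanager


@contextmanager
def cpu_latency_limit(max_latency_us: int, device: str = "/dev/cpu_dma_latency"):
    """Hold a PM QoS CPU latency request for the duration of the `with` block."""
    fd = os.open(device, os.O_WRONLY)
    try:
        # The interface takes a native 32-bit integer, in microseconds.
        os.write(fd, struct.pack("i", max_latency_us))
        yield
    finally:
        # Closing the file is what drops the request again.
        os.close(fd)
```

A latency-sensitive application would wrap its main loop in
`with cpu_latency_limit(0):` so that the bound is released automatically
when the process exits.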


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Christoph Lameter
One issue that is often forgotten is that there are users who want lowest
latency and not highest performance. Our systems sit idle for most of the
time but when a specific event occurs (typically a packet is received)
they must react in the fastest way possible.

On every new generation of hardware and software we keep on running into
various mechanisms that automatically power down when idle for a long time
(to save power...). And it's pretty hard to figure these things out given
the complexity of modern hardware. For example, on the Sandybridges we
found that the memory channel powers down after 2 milliseconds idle time
and that was unaffected by any of the BIOS config options. Similar
mechanisms exist in the kernel but those are easier to discover since
there is source.

So please make sure that there are obvious and easy ways to switch this
stuff off or provide a "low latency" knob that keeps the system from
assuming that idle time means that full performance is not needed.



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Vincent Guittot
On 16 August 2012 07:03, Alex Shi  wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>
>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Are there workloads in which "power" might provide more performance than
>> "performance"? If so, don't use these terms.
>>
>
>
> By design, the power scheme should have no chance of better performance.

A side effect of packing small tasks on one core is that you always
use the core with the lowest C-state, which will minimize the wake up
latency, so you can sometimes get better results than performance mode,
which will try to use another core in another cluster that will take
more time to wake up than waiting for the end of the current task.

Vincent
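
The trade-off described above comes down to comparing two waits: queueing
behind the task already running on an awake core, or paying the exit
latency of a core in a deep C-state. A small Python sketch of that
comparison, using hypothetical function names and times in microseconds:

```python
def finish_time_packed(remaining_current_us: float, task_run_us: float) -> float:
    """The task waits for the running task on an already-awake core."""
    return remaining_current_us + task_run_us


def finish_time_spread(wakeup_latency_us: float, task_run_us: float) -> float:
    """The task runs on an idle core that first has to leave its C-state."""
    return wakeup_latency_us + task_run_us


def packing_is_faster(remaining_current_us: float, wakeup_latency_us: float,
                      task_run_us: float) -> bool:
    """True when queueing on the awake core finishes the task sooner."""
    return (finish_time_packed(remaining_current_us, task_run_us)
            < finish_time_spread(wakeup_latency_us, task_run_us))
```

So packing wins whenever the current task has less time left than the
other core needs to wake up, which is why a "power" policy can
occasionally beat "performance" on latency.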


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Vincent Guittot
On 15 August 2012 13:05, Peter Zijlstra  wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit. So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack
>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>>   update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>>   if (sd->nr_running > sd's capacity) {
>>   //power saving policy is not suitable for
>>   //this scenario, it runs like performance policy
>>   mv tasks from busiest cpu in busiest group to
>>   idlest  cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)
>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>>   } else {// the sd has enough capacity to hold all tasks.
>>   if (sg->nr_running > sg's capacity) {
>>   //imbalanced between groups
>>   if (schedule policy == performance) {
>>   //when 2 busiest group at same busy
>>   //degree, need to prefer the one has
>>   // softest group??
>>   move tasks from busiest group to
>>   idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>>   } else if (schedule policy == power)
>>   move tasks from busiest group to
>>   idlest group until busiest is just full
>>   of capacity.
>>   //the busiest group can balance
>>   //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>
>>   } else {
>>   //all groups has enough capacity for its tasks.
>>   if (schedule policy == performance)
>>   //all tasks may has enough cpu
>>   //resources to run,
>>   //mv tasks from busiest to idlest group?
>>   //no, at this time, it's better to keep
>>   //the task on current cpu.
>>   //so, it is maybe better to do balance
>>   //in each of groups
>>   for_each_imbalance_groups()
>>   move tasks from busiest cpu to
>>   idlest cpu in each of groups;
>>  

Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Arjan van de Ven
On 8/20/2012 1:06 AM, Ingo Molnar wrote:
> 
> 
> There's also cases where the kernel has insufficient information 
> from the hardware and from the admin about the preferred 
> characteristics/policy of the system - a tweakable fallback knob 
> might be provided for that sad case.
> 
> The point is, that knob is not the policy setting and it's not 
> the main mechanism. It's a fallback.

if we call the knob "powersave", it better save power...
if we call it "group together" or "spread out".. no problem with that.






Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Peter Zijlstra
On Mon, 2012-08-20 at 10:06 +0200, Ingo Molnar wrote:
> > > I was really more thinking of something useful for the 
> > > laptops out there, when they pull the power cord it makes 
> > > sense to try and keep CPUs asleep until the one that's awake 
> > > is saturated.
> 
> s/CPU/core ? 

I was thinking logical cpus, but whatever really.


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Ingo Molnar

* Arjan van de Ven  wrote:

> On 8/15/2012 8:04 AM, Peter Zijlstra wrote:
>
> > This all sounds far too complicated.. we're talking about 
> > simple spreading and packing balancers without deep arch 
> > knowledge and knobs, we couldn't possibly evaluate anything 
> > like that.
> > 
> > I was really more thinking of something useful for the 
> > laptops out there, when they pull the power cord it makes 
> > sense to try and keep CPUs asleep until the one that's awake 
> > is saturated.

s/CPU/core ?

> as long as you don't do that on machines with an Intel CPU.. 
> since that'd be the worst case behavior for tasks that run for 
> more than 100 usec. (e.g. not interrupts, but almost 
> everything else)

The question is, do we need to balance for 'power saving', on 
systems that care more about power use than they care about peak 
performance/throughput, at all?

If the answer is 'no' then things get rather simple.

If the answer is 'yes' then there's clear cases where the kernel 
(should) automatically know the events where we switch from 
balancing for performance to balancing for power:

 - the system boots up on battery

 - the system was on AC but the cord has been pulled and the 
   system is now on battery

 - the administrator configures the system on AC to be
   power-conscious.

( and the opposite direction events wants the scheduler to 
  switch from 'balancing for power' to 'balancing for 
  performance'. )

There's also cases where the kernel has insufficient information 
from the hardware and from the admin about the preferred 
characteristics/policy of the system - a tweakable fallback knob 
might be provided for that sad case.

The point is, that knob is not the policy setting and it's not 
the main mechanism. It's a fallback.

Thanks,

Ingo
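
The decision order described above (known events first, the tweakable
knob only as a fallback) could be sketched in userspace Python roughly as
follows. This is only an illustration, not kernel code: BalancePolicy,
select_balance_policy() and their parameters are hypothetical names, and
letting the administrator's setting override the AC/battery state is an
assumption.

```python
from enum import Enum
from typing import Optional


class BalancePolicy(Enum):
    PERFORMANCE = "performance"
    POWER = "power"


def select_balance_policy(
    on_battery: Optional[bool],
    admin_policy: Optional[BalancePolicy] = None,
    fallback: BalancePolicy = BalancePolicy.PERFORMANCE,
) -> BalancePolicy:
    """Pick a balancing policy from what the system knows.

    on_battery is None when the hardware gives no usable information.
    """
    # Assumption: an explicit administrator choice wins over everything.
    if admin_policy is not None:
        return admin_policy
    # No information from hardware or admin: only now use the fallback knob.
    if on_battery is None:
        return fallback
    # Known power source: battery balances for power, AC for performance.
    return BalancePolicy.POWER if on_battery else BalancePolicy.PERFORMANCE
```

In this sketch the fallback value is consulted only when neither the
hardware nor the administrator says anything, which is the sense in which
the knob is not the main mechanism.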



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Vincent Guittot
On 16 August 2012 07:03, Alex Shi <alex@intel.com> wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>
>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Are there workloads in which power might provide more performance than
>> performance? If so, don't use these terms.
>
> Power scheme should no chance has better performance in design.

A side effect of packing small tasks on one core is that you always
use the core with the lowest C-state, which minimizes the wake-up
latency. So you can sometimes get better results than in performance
mode, which will try to use another core in another cluster that takes
more time to wake up than waiting for the end of the current task.

Vincent
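
The latency argument above can be illustrated with a toy model in Python.
Everything here is an assumption made for the example: the exit-latency
figures are invented, and start_delay_us() and lower_latency_placement()
are hypothetical helpers, not scheduler functions.

```python
# Hypothetical C-state exit latencies in microseconds; real values depend
# on the hardware and are not taken from the thread.
EXIT_LATENCY_US = {"C1": 2, "C3": 80, "C6": 300}


def start_delay_us(placement: str, remaining_us: int, idle_state: str) -> int:
    """Delay before a newly woken small task starts to run."""
    if placement == "pack":
        # Queue behind the task already running on the awake core.
        return remaining_us
    if placement == "spread":
        # Wake an idle core and pay its C-state exit latency.
        return EXIT_LATENCY_US[idle_state]
    raise ValueError(f"unknown placement: {placement!r}")


def lower_latency_placement(remaining_us: int, idle_state: str) -> str:
    """Return the placement with the smaller start delay in this model."""
    pack = start_delay_us("pack", remaining_us, idle_state)
    spread = start_delay_us("spread", remaining_us, idle_state)
    return "pack" if pack <= spread else "spread"
```

With these made-up numbers, a task that would wait 50 usec behind the
current task starts sooner when packed than when an idle core has to come
out of the deepest state, and later than when the idle core is only in C1.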


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Christoph Lameter
One issue that is often forgotten is that there are users who want lowest
latency and not highest performance. Our systems sit idle for most of the
time but when a specific event occurs (typically a packet is received)
they must react in the fastest way possible.

On every new generation of hardware and software we keep on running into
various mechanisms that automatically power down when idle for a long time
(to save power...). And it's pretty hard to figure these things out given
the complexity of modern hardware. F.e. for the Sandybridges we found that
the memory channel powers down after 2 milliseconds idle time and that was
unaffected by any of the BIOS config options. Similar mechanisms exist in
the kernel but those are easier to discover since there is source.

So please make sure that there are obvious and easy ways to switch this
stuff off or provide a low latency knob that keeps the system from
assuming that idle time means that full performance is not needed.



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Matthew Garrett
On Mon, Aug 20, 2012 at 03:47:54PM +, Christoph Lameter wrote:

> So please make sure that there are obvious and easy ways to switch this
> stuff off or provide low latency know that keeps the system from
> assuming that idle time means that full performance is not needed.

That seems like an issue for cpuidle, not the scheduler. Does pm_qos not 
already do what you want?
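
For reference, the userspace side of pm_qos can be sketched in Python as
below. It relies on the documented /dev/cpu_dma_latency interface: a
process writes a 32-bit latency value in microseconds and the request
stays active for as long as the file descriptor is held open. The helper
names encode_latency_us() and hold_cpu_latency() are hypothetical, opening
the real device normally needs root, and this only constrains CPU idle
state selection, so it may not cover hardware mechanisms such as the
memory channel case mentioned earlier.

```python
import os
import struct
from contextlib import contextmanager


def encode_latency_us(latency_us: int) -> bytes:
    """Encode a latency target as the native 32-bit integer pm_qos expects."""
    return struct.pack("=i", latency_us)


@contextmanager
def hold_cpu_latency(latency_us: int, path: str = "/dev/cpu_dma_latency"):
    """Keep a latency request registered for the duration of the block."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, encode_latency_us(latency_us))
        # The request is dropped by the kernel when the fd is closed.
        yield
    finally:
        os.close(fd)
```

A latency-critical service would wrap its main loop in
`with hold_cpu_latency(0): ...` to ask cpuidle to avoid deep idle states
while it runs.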

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Vincent Guittot
On 17 August 2012 10:43, Paul Turner <p...@google.com> wrote:
> On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra <a.p.zijls...@chello.nl> wrote:
>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>> Since there is no power saving consideration in scheduler CFS, I has a
>>> very rough idea for enabling a new power saving schema in CFS.
>>
>> Adding Thomas, he always delights poking holes in power schemes.
>>
>>> It bases on the following assumption:
>>> 1, If there are many task crowd in system, just let few domain cpus
>>> running and let other cpus idle can not save power. Let all cpu take the
>>> load, finish tasks early, and then get into idle. will save more power
>>> and have better user experience.
>>
>> I'm not sure this is a valid assumption. I've had it explained to me by
>> various people that race-to-idle isn't always the best thing. It has to
>> do with the cost of switching power states and the duration of execution
>> and other such things.
>>
>>> 2, schedule domain, schedule group perfect match the hardware, and
>>> the power consumption unit. So, pull tasks out of a domain means
>>> potentially this power consumption unit idle.
>>
>> I'm not sure I understand what you're saying, sorry.
>>
>>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
>> between the two based on AC/BAT, UPS status and simple things like that.
>> But this seems like a later concern, you have to have something to pick
>> between before you can pick :-)
>>
>>> And in scheduling, 2 place will care the policy, load_balance() and in
>>> task fork/exec: select_task_rq_fair().
>>
>> ack
>>
>>> Here is some pseudo code try to explain the proposal behaviour in
>>> load_balance() and select_task_rq_fair();
>>
>> Oh man.. A few words outlining the general idea would've been nice.
>>
>>> load_balance() {
>>>   update_sd_lb_stats(); //get busiest group, idlest group data.
>>>
>>>   if (sd->nr_running > sd's capacity) {
>>>   //power saving policy is not suitable for
>>>   //this scenario, it runs like performance policy
>>>   mv tasks from busiest cpu in busiest group to
>>>   idlest  cpu in idlest group;
>>
>> Once upon a time we talked about adding a factor to the capacity for
>> this. So say you'd allow 2*capacity before overflowing and waking
>> another power group.
>>
>> But I think we should not go on nr_running here, PJTs per-entity load
>> tracking stuff gives us much better measures -- also, repost that series
>> already Paul! :-)
>
> Yes -- I just got back from Africa this week.  It's updated for almost
> all the previous comments but I ran out of time before I left to
> re-post.  I'm just about caught up enough that I should be able to get
> this done over the upcoming weekend.  Monday at the latest.
>
>>
>> Also, I'm not sure this is entirely correct, the thing you want to do
>> for power aware stuff is to minimize the number of active power domains,
>> this means you don't want idlest, you want least busy non-idle.
>>
>>>   } else {// the sd has enough capacity to hold all tasks.
>>>   if (sg->nr_running > sg's capacity) {
>>>   //imbalanced between groups
>>>   if (schedule policy == performance) {
>>>   //when 2 busiest group at same busy
>>>   //degree, need to prefer the one has
>>>   // softest group??
>>>   move tasks from busiest group to
>>>   idletest group;
>>
>> So I'd leave the currently implemented scheme as performance, and I
>> don't think the above describes the current state.
>>
>>>   } else if (schedule policy == power)
>>>   move tasks from busiest group to
>>>   idlest group until busiest is just full
>>>   of capacity.
>>>   //the busiest group can balance
>>>   //internally after next time LB,
>>
>> There's another thing we need to do, and that is collect tasks in a
>> minimal amount of power domains. The old code (that got deleted) did
>> something like that, you can revive some of the that code if needed -- I
>> just killed everything to be able to start with a clean slate.
>>
>>
>>>   } else {
>>>   //all groups has enough capacity for its tasks.
>>>   if (schedule policy == performance)
>>>   //all tasks may has enough cpu
>>>   //resources to run,
>>>   //mv tasks from busiest to idlest group?
>>>   //no, at this time, it's better to keep
>>>   //the task on current cpu.
>>>   //so, it is maybe better to do balance
>>>   //in each of groups
   

Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Matthew Garrett
On Mon, Aug 20, 2012 at 10:06:06AM +0200, Ingo Molnar wrote:

> If the answer is 'yes' then there's clear cases where the kernel 
> (should) automatically know the events where we switch from 
> balancing for performance to balancing for power:

No. We can't identify all of these cases and we can't identify corner 
cases. Putting this kind of policy in the kernel is an awful idea. It 
should never be altering policy itself, because it'll get it wrong and 
people will file bugs complaining that it got it wrong and the biggest 
case where you *need* to be able to handle switching between performance 
and power optimisations (your rack management unit just told you that 
you're going to have to drop power consumption by 20W) is one where the 
kernel doesn't have all the information it needs to do this. So why 
bother at all?

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Christoph Lameter
On Mon, 20 Aug 2012, Matthew Garrett wrote:

> On Mon, Aug 20, 2012 at 03:47:54PM +, Christoph Lameter wrote:
>
>> So please make sure that there are obvious and easy ways to switch this
>> stuff off or provide low latency know that keeps the system from
>> assuming that idle time means that full performance is not needed.
>
> That seems like an issue for cpuidle, not the scheduler. Does pm_qos not
> already do what you want?

Don't know. A simple solution is not to compile power management into the
kernel.



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Alex Shi
On 08/20/2012 11:36 PM, Vincent Guittot wrote:

>> What you want it to keep track of a per-cpu utilization level (inverse
>> of idle-time) and using PJTs per-task runnable avg see if placing the
>> new task on will exceed the utilization limit.
>>
>> I think some of the Linaro people actually played around with this,
>> Vincent?
> Sorry for the late reply but I had almost no network access during last weeks.
>
> So Linaro also works on a power aware scheduler as Peter mentioned.
>
> Based on previous tests, we have concluded that main drawback of the
> (now removed) old power scheduler was that we had no way to make
> difference between short and long running tasks whereas it's a key
> input (at least for phone) for deciding to pack tasks and for
> selecting the core on an asymmetric system.


It is hard to predict the future in general, but as a hack, maybe you
could add something to task_struct to hint at this. :)

> One additional key information is the power distribution in the system
> which can have a finer granularity than current sched_domain
> description. Peter's proposal was to use a SHARE_POWERLINE flag
> similarly to flags that already describe if a sched_domain share
> resources or cpu capacity.


Seems I missed this. What's the difference from the current
SD_SHARE_CPUPOWER and SD_SHARE_PKG_RESOURCES?

>
> With these 2 new information, we can have a 1st power saving scheduler
> which spread or packed tasks across core and package


Fine, I'd like to test them on x86, plus SMT and NUMA :)

>
> Vincent
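
The utilization check quoted at the top of this message could look
roughly like the Python sketch below. It is a userspace illustration
under stated assumptions, not the scheduler's algorithm:
pick_cpu_for_packing() is a hypothetical helper, the 0.8 limit is an
arbitrary example value, and preferring the least busy CPU that is
already awake is one possible packing choice.

```python
from typing import Dict, Optional


def pick_cpu_for_packing(
    cpu_util: Dict[int, float], task_avg: float, limit: float = 0.8
) -> Optional[int]:
    """Choose a CPU for a new task without exceeding the utilization limit.

    cpu_util maps a CPU id to its utilization in [0, 1] (the inverse of
    idle time); task_avg is the task's tracked runnable average.
    """
    # Keep only CPUs where adding the task stays within the limit.
    fits = {cpu: util for cpu, util in cpu_util.items()
            if util + task_avg <= limit}
    # Prefer CPUs that are already awake, so idle ones can stay asleep.
    busy = {cpu: util for cpu, util in fits.items() if util > 0.0}
    if busy:
        return min(busy, key=lambda cpu: (busy[cpu], cpu))
    if fits:
        # Only idle CPUs have room: wake the lowest-numbered one.
        return min(fits)
    # Every CPU would exceed the limit.
    return None
```

Because the test is on tracked utilization rather than nr_running, two
short-running tasks can share one CPU as long as their combined runnable
average stays under the limit.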




Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-20 Thread Alex Shi
On 08/20/2012 11:47 PM, Vincent Guittot wrote:

> On 16 August 2012 07:03, Alex Shi <alex@intel.com> wrote:
>> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
>>
>>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>>
>>>> power aware scheduling), this proposal will adopt the
>>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>>
>>> Are there workloads in which power might provide more performance than
>>> performance? If so, don't use these terms.
>>
>> Power scheme should no chance has better performance in design.
>
> A side effect of packing small tasks on one core is that you always
> use the core with the lowest C-state which will minimize the wake up
> latency so you can sometime get better results than performance mode
> which will try to use a other core in another cluster which will take
> more time to wake up that waiting for the end of the current task.
>


Sure. In some scenarios, packing tasks into a smaller domain will bring
a performance benefit.


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-19 Thread Juri Lelli

Hi all,
I can probably add some bits to the discussion, after all I'm preparing 
a talk for Plumbers that is strictly related :-). My points are not CFS 
related (so feel free to ignore me), but they would probably be 
interesting if we talk about power aware scheduling in Linux in general.


On 08/16/2012 04:31 PM, Morten Rasmussen wrote:
> Hi all,
>
> On Wed, Aug 15, 2012 at 12:05:38PM +0100, Peter Zijlstra wrote:
>>> sub proposal:
>>> 1, If it's possible to balance task on idlest cpu not appointed 'balance
>>> cpu'. If so, it may can reduce one more time balancing.
>>> The idlest cpu can prefer the new idle cpu;  and is the least load cpu;
>>> 2, se or task load is good for running time setting.
>>> but it should the second basis in load balancing. The first basis of LB
>>> is running tasks' number in group/cpu. Since whatever of the weight of
>>> groups is, if the tasks number is less than cpu number, the group is
>>> still has capacity to take more tasks. (will consider the SMT cpu power
>>> or other big/little cpu capacity on ARM.)
>>
>> Ah, no we shouldn't balance on nr_running, but on the amount of time
>> consumed. Imagine two tasks being woken at the same time, both tasks
>> will only run a fraction of the available time, you don't want this to
>> exceed your capacity because ran back to back the one cpu will still be
>> mostly idle.
>>
>> What you want it to keep track of a per-cpu utilization level (inverse
>> of idle-time) and using PJTs per-task runnable avg see if placing the
>> new task on will exceed the utilization limit.
>>
>> I think some of the Linaro people actually played around with this,
>> Vincent?
>
> I agree. A better measure of cpu load and task weight than nr_running
> and the current task load weight are necessary to do proper task
> packing.
>
> I have used PJTs per-task load-tracking for scheduling experiments on
> heterogeneous systems and my experience is that it works quite well for
> determining the load of a specific task. Something like PJTs work
> would be a good starting point for power aware scheduling and better
> support for heterogeneous systems.



I haven't tried PJTs work myself (it's on my todo list), but with 
SCHED_DEADLINE you can see the picture from the other side and, instead 
of tracking per-task load, you can enforce a task not to exceed its 
allowed "load".
This is done by reserving some fraction of CPU time (runtime or budget) 
every predefined interval of time (period). Then this allocated 
bandwidth is enforced with proper scheduling mechanisms (BTW, I have 
another talk at Plumbers explaining the SCHED_DEADLINE patchset in more 
detail).
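
The reservation idea (a runtime budget replenished every period, with
throttling once the budget is spent) can be illustrated with a toy Python
model. This is not the SCHED_DEADLINE implementation:
simulate_reservation() is a hypothetical helper, and carrying unserved
demand over to the next period is a simplifying assumption.

```python
from typing import List, Tuple


def simulate_reservation(
    demands: List[float], runtime: float, period: float
) -> Tuple[List[float], float]:
    """Return the CPU time granted in each period and the leftover backlog.

    demands[i] is the CPU time the task asks for in period i. The task
    never receives more than `runtime` within one period.
    """
    if not 0 < runtime <= period:
        raise ValueError("runtime must be positive and at most the period")
    backlog = 0.0
    granted = []
    for demand in demands:
        backlog += demand
        # Throttle: serve at most the reserved budget in this period.
        served = min(backlog, runtime)
        granted.append(served)
        backlog -= served
    return granted, backlog
```

However much a task asks for in one period, the share of the CPU it
obtains never exceeds runtime/period, so its allowed load is known in
advance.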



> One of the biggest challenges here for load-balancing is translating
> task load from one cpu to another as the task load is influenced by the
> total load of its cpu. So a task that appears to be heavy on an
> oversubscribed cpu might not be so heavy after all when it is moved to a
> cpu with plenty cpu time to spare. This issue is likely to be more
> pronounced on heterogeneous systems and system with aggressive frequency
> scaling. It might be possible to avoid having to translate load or that
> it doesn't really matter, but I haven't completely convinced myself yet.



This is probably a key point where deadline scheduling could be helpful. 
A task's load in this case cannot be influenced by other tasks in the 
system and it is one of the known variables. Actually, this is only 
half true. Isolation is achieved only with respect to CPU time between 
concurrently executing tasks; other effects like cache interference etc. 
cannot be controlled. The nice fact is that a misbehaving task, one that 
attempts or experiences deviations from its allowed CPU fraction, is 
throttled and cannot influence other tasks' behavior.
As I will show during my talk (power aware deadline scheduling), other 
techniques are required when a task's execution time is not strictly 
known beforehand, whether due to interference or to intrinsic 
variability in the performed activity. They fall in the domain of 
adaptive/feedback scheduling.



> My point is that getting the task load right or at least better is a
> fundamental requirement for improving power aware scheduling.



Fully agree :-).

As I said, I just wanted to add something, sorry if I misinterpreted the 
purpose of this discussion.


Best Regards,

- Juri Lelli


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-18 Thread Arjan van de Ven
On 8/18/2012 7:33 AM, Luming Yu wrote:
> saving mode. But obviously, we need to spread as much as possible
> across all cores in another socket(to race to idle). So from the
> example above, we see a threshold that we need to reference before
> selecting one from two complete different policy: spread or not
> spread... As long as there is hardware limitation, we could always
> need knob like that referenced threshold to adapt on different
> hardware in one kernel

I think the physics are slightly simpler, if you abstract it one level.

every reasonable system out there has things that can be off if all cores are
in the deep power state, that have to be on if even one of them is alive.
On "big core" Intel, that's uncore and memory controller, on small core
(atom/phone) Intel that is the chipset fabric only. On ARM it might be
something else. On all of them it's some clocks, PLLs, voltage regulators
etc etc.

not all chips are advanced enough to aggressively turn these things off when
they could, but most are nowadays.

so in abstract, there's a power offset that gets you from 0 to 1 active
cores. Let's call this P0
there is also a power offset to go from 1 to 2, but that's smaller than 0->1.
Let's call this Pc

or rather, 0->1 has the same kind of offset as 1->2 plus some extra offset.. so
P0 = Pbase + Pc

there's also an energy cost for waking a cpu up (and letting it go back to
sleep afterwards)... call it Ewake

so the abstract question is
you're running a task A on cpu 0
you want to also run a task B, which you estimate to run for time T

it's more energy efficient to wake a 2nd cpu if

Ewake < T * Pbase

(this assumes all cores are the same, you get a more complex formula if that's 
not the case, where T is even core specific)


there is no hardware policy *switch* in such formula, only parameters.
If Pbase = 0 (e.g. your hardware has no extra power savings), then the formula 
very naturally leads to one extreme of the behavior
if Ewake is very high, then it leads to the other extreme.

The only other variable is the user preference between power and performance 
balance.. but that's a pure preference, not hardware
specific anymore.
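
To make the inequality above concrete, here is a minimal sketch in Python
(an illustration, not kernel code). It assumes that task A keeps cpu 0 busy
for at least T, so running B serially extends the "one core active" period
by T, while running B in parallel only adds Pc for T plus Ewake. The
function names and the numbers are made up for the example.

```python
def extra_energy_serial(t, p_base, p_c):
    """Energy (J) to run B after A on the cpu that is already awake.

    The system stays in the "one core active" state for an extra t seconds,
    which costs P0 = Pbase + Pc for that time.
    """
    return t * (p_base + p_c)


def extra_energy_parallel(t, p_c, e_wake):
    """Energy (J) to run B on a second cpu while A is still running.

    Pbase is already being paid because of A, so B only adds Pc for
    t seconds, plus the one-off wake/sleep cost Ewake.
    """
    return t * p_c + e_wake


def worth_waking_second_cpu(t, p_base, e_wake):
    """The condition from the mail: Ewake < T * Pbase."""
    return e_wake < t * p_base


# Hypothetical numbers, for illustration only (not measurements).
t, p_base, p_c, e_wake = 0.010, 2.0, 1.0, 0.005  # s, W, W, J
parallel_is_cheaper = (
    extra_energy_parallel(t, p_c, e_wake) < extra_energy_serial(t, p_base, p_c)
)
print(parallel_is_cheaper, worth_waking_second_cpu(t, p_base, e_wake))
```

Comparing the two energies directly and evaluating Ewake < T * Pbase give
the same answer, because the T * Pc term appears on both sides and cancels.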






Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-18 Thread Luming Yu
On Sat, Aug 18, 2012 at 4:16 AM, Chris Friesen wrote:
> On 08/17/2012 01:50 PM, Matthew Garrett wrote:
>>
>> On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
>>>
>>> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
>>
>>
>>> The datasheet for the Xeon E5 (my variant at least) says it doesn't
>>> do C7 so never powers down the LLC.  However, as you said earlier
>>> once you can put the socket into C6 which saves about 30W compared
>>> to C1E.
>>>
>>> So as far as I can see with this CPU at least you would benefit from
>>> shutting down a whole socket when possible.
>>
>>
>> Having any active cores on the system prevents all packages from going
>> into PC6 or deeper. What I'm not clear on is whether less deep package C
>> states are also blocked.
>>
>
> Right, we need the memory controller.
>
> The E5 datasheet is a bit ambiguous, it reads:
>
>
> A processor enters the package C3 low power state when:
>  -At least one core is in the C3 state.
>  -The other cores are in a C3 or lower power state, and the processor has
> been granted permission by the platform.
>
>
> Unfortunately it doesn't specify whether that is the other cores in the
> package, or the other cores on the whole system.
>

Hardware limitations are just part of the problem. We could find them
out from various white papers or data sheets, or by testing. To me, the
key problem in terms of power and performance balancing still lies in
the CPU and memory allocation method. For example, take a modern
two-socket system where we can benefit from shutting down a whole socket
when possible. If a workload uses 50% of the CPU cycles and 50% of the
memory bandwidth and space, an ideal allocation method (I assume that is
the goal of this discussion) should leave the CPU, cache, memory
controller and memory on one socket (node) completely idle and in the
deepest power saving mode. But obviously, we need to spread as much as
possible across all cores in the other socket (to race to idle). So from
the example above, we see a threshold that we need to reference before
selecting one of two completely different policies: spread or not
spread... As long as there are hardware limitations, we will always
need a knob like that threshold to adapt to different hardware in one
kernel.
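
The threshold idea can be sketched roughly as follows (Python, purely
illustrative). It assumes demand is expressed in units of "one fully busy
socket", and that the knob is a fraction of a socket's capacity that may be
used before spilling onto the next socket. `sockets_to_use` and
`spread_threshold` are hypothetical names, not existing kernel tunables.

```python
import math


def sockets_to_use(cpu_demand, n_sockets, spread_threshold=1.0):
    """Return how many sockets to keep awake for a given CPU demand.

    cpu_demand: load in units of one socket's full capacity (1.0 == one socket).
    spread_threshold: hypothetical knob, the usable fraction of each socket
    before work is spread onto another socket.
    """
    if spread_threshold <= 0:
        raise ValueError("spread_threshold must be positive")
    if cpu_demand <= 0:
        return 0  # nothing to run, every socket may idle
    needed = math.ceil(cpu_demand / spread_threshold)
    return min(n_sockets, max(1, needed))


# The example from the mail: 50% of a two-socket system is one socket's worth.
print(sockets_to_use(1.0, n_sockets=2))                        # pack on one socket
print(sockets_to_use(1.0, n_sockets=2, spread_threshold=0.8))  # spill to both
```

Within the sockets that stay awake, work would still be spread across all
cores; the knob only decides when a further socket is worth waking.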

/l


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Chris Friesen

On 08/17/2012 01:50 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
>> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
>
>> The datasheet for the Xeon E5 (my variant at least) says it doesn't
>> do C7 so never powers down the LLC.  However, as you said earlier
>> once you can put the socket into C6 which saves about 30W compared
>> to C1E.
>>
>> So as far as I can see with this CPU at least you would benefit from
>> shutting down a whole socket when possible.
>
> Having any active cores on the system prevents all packages from going
> into PC6 or deeper. What I'm not clear on is whether less deep package C
> states are also blocked.

Right, we need the memory controller.

The E5 datasheet is a bit ambiguous, it reads:

A processor enters the package C3 low power state when:
 -At least one core is in the C3 state.
 -The other cores are in a C3 or lower power state, and the processor
  has been granted permission by the platform.

Unfortunately it doesn't specify whether that is the other cores in the
package, or the other cores on the whole system.
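
The two possible readings of that rule can be written down side by side.
The sketch below (Python, illustrative only) encodes C-states as numeric
depths and takes a `scope` argument to choose between the readings. The
function name, the encoding and the `scope` parameter are assumptions made
for the example; it does not say which reading the hardware implements.

```python
from typing import Dict, List

C3 = 3  # numeric depth of core C3; larger numbers mean deeper (lower power) states


def package_can_enter_pc3(
    packages: Dict[int, List[int]],
    pkg: int,
    platform_permits: bool,
    scope: str = "package",
) -> bool:
    """Evaluate the quoted package C3 entry rule under one of two readings.

    packages maps a package id to the C-state depth of each of its cores
    (0 = running, 3 = C3, 6 = C6). scope selects what "the other cores"
    means: only this package ("package") or every core in the system
    ("system").
    """
    if scope not in ("package", "system"):
        raise ValueError("scope must be 'package' or 'system'")
    if not platform_permits:
        return False

    own = packages[pkg]
    # At least one core in C3, and no core of this package shallower than C3.
    if not any(c == C3 for c in own) or any(c < C3 for c in own):
        return False
    if scope == "package":
        return True
    # System-wide reading: cores in every other package must be in C3 or deeper too.
    return all(c >= C3 for p, cores in packages.items() if p != pkg for c in cores)


# Socket 0 is busy, socket 1 is idle: the two readings disagree.
state = {0: [0, 0], 1: [3, 6]}
print(package_can_enter_pc3(state, 1, True, scope="package"))
print(package_can_enter_pc3(state, 1, True, scope="system"))
```

Whether packing work onto one socket saves power depends on which of the two
results matches the real hardware.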


Chris


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Matthew Garrett
On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
> On 08/17/2012 12:47 PM, Matthew Garrett wrote:

> The datasheet for the Xeon E5 (my variant at least) says it doesn't
> do C7 so never powers down the LLC.  However, as you said earlier
> once you can put the socket into C6 which saves about 30W compared
> to C1E.
> 
> So as far as I can see with this CPU at least you would benefit from
> shutting down a whole socket when possible.

Having any active cores on the system prevents all packages from going 
into PC6 or deeper. What I'm not clear on is whether less deep package C 
states are also blocked.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Chris Friesen

On 08/17/2012 12:47 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
>> On 8/17/2012 11:41 AM, Matthew Garrett wrote:
>>> On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
>>>> this is ... a dubiously general statement.
>>>>
>>>> for good power, at least on Intel cpus, you want to spread. Parallelism is
>>>> efficient.
>>>
>>> Is this really true? In a two-socket system I'd have thought the benefit
>>> of keeping socket 1 in package C3 outweighed the cost of keeping socket
>>> 0 awake for slightly longer.
>>
>> not on Intel
>>
>> you can't enter package c3 either until every one is down.
>> (e.g. memory controller must stay on etc etc)
>
> I thought that was only PC6 - is there any reason why the package cache
> can't be entirely powered down?

According to
http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf
once you're in package C6 then you can go to package C7.

The datasheet for the Xeon E5 (my variant at least) says it doesn't do
C7 so never powers down the LLC.  However, as you said earlier, once you
can put the socket into C6, that saves about 30W compared to C1E.

So as far as I can see with this CPU at least you would benefit from
shutting down a whole socket when possible.


Chris


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Matthew Garrett
On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
> On 8/17/2012 11:41 AM, Matthew Garrett wrote:
> > On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
> >> this is ... a dubiously general statement.
> >>
> >> for good power, at least on Intel cpus, you want to spread. Parallelism is 
> >> efficient.
> > 
> > Is this really true? In a two-socket system I'd have thought the benefit 
> > of keeping socket 1 in package C3 outweighed the cost of keeping socket 
> > 0 awake for slightly longer.
> 
> not on Intel
> 
> you can't enter package c3 either until every one is down.
> (e.g. memory controller must stay on etc etc)

I thought that was only PC6 - is there any reason why the package cache 
can't be entirely powered down?

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Arjan van de Ven
On 8/17/2012 11:41 AM, Matthew Garrett wrote:
> On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
>>> *Power policy*:
>>>
>>> So how is power policy different? As Peter says,'pack more than spread
>>> more'.
>>
>> this is ... a dubiously general statement.
>>
>> for good power, at least on Intel cpus, you want to spread. Parallelism is 
>> efficient.
> 
> Is this really true? In a two-socket system I'd have thought the benefit 
> of keeping socket 1 in package C3 outweighed the cost of keeping socket 
> 0 awake for slightly longer.

not on Intel

you can't enter package c3 either until every one is down.
(e.g. memory controller must stay on etc etc)



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Matthew Garrett
On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
> > *Power policy*:
> > 
> > So how is power policy different? As Peter says,'pack more than spread
> > more'.
> 
> this is ... a dubiously general statement.
> 
> for good power, at least on Intel cpus, you want to spread. Parallelism is 
> efficient.

Is this really true? In a two-socket system I'd have thought the benefit 
of keeping socket 1 in package C3 outweighed the cost of keeping socket 
0 awake for slightly longer.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Paul Turner
On Wed, Aug 15, 2012 at 11:02 AM, Arjan van de Ven wrote:
> On 8/15/2012 9:34 AM, Matthew Garrett wrote:
>> On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
>>> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>>>> It bases on the following assumption:
>>>> 1, If there are many task crowd in system, just let few domain cpus
>>>> running and let other cpus idle can not save power. Let all cpu take the
>>>> load, finish tasks early, and then get into idle. will save more power
>>>> and have better user experience.
>>>
>>> I'm not sure this is a valid assumption. I've had it explained to me by
>>> various people that race-to-idle isn't always the best thing. It has to
>>> do with the cost of switching power states and the duration of execution
>>> and other such things.
>>
>> This is affected by Intel's implementation - if there's a single active
>
> not just intel.. also AMD
> basically everyone who has the memory controller in the cpu package will end 
> up with
> a restriction very similar to this.
>

I think this circles back to discussion previously held on this topic.
This preference is arch specific; we need to reduce the set of inputs
to a sensible, actionable set, and plumb that so that the architecture
and not the scheduler can supply this preference.

That you believe 100-300us is actually the tipping point vs power
migration cost is probably in itself one of the most useful replies
I've seen on this topic in all of the last few rounds of discussion
it's been through.  It suggests we could actually parameterize this in
a manner similar to wake-up migration cost, with a minimum usage
average for which it's worth spilling to an idle sibling.
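
As a rough illustration of that parameterization (Python, not kernel code):
the architecture supplies a single number and the scheduler only compares a
task's average usage against it. `ArchPowerHints`, `min_spill_usage_us` and
`should_spill_to_idle_sibling` are hypothetical names, and the 200us value
is simply picked from the 100-300us range mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchPowerHints:
    """Hypothetical per-architecture input; not an existing kernel interface."""

    # Minimum average usage (microseconds) for which waking an idle
    # sibling is considered worth its power cost.
    min_spill_usage_us: float


def should_spill_to_idle_sibling(avg_usage_us: float, hints: ArchPowerHints) -> bool:
    """Spill only when the task's average usage reaches the arch-supplied minimum."""
    return avg_usage_us >= hints.min_spill_usage_us


hints = ArchPowerHints(min_spill_usage_us=200.0)  # example value only
print(should_spill_to_idle_sibling(50.0, hints))   # short task: keep it packed
print(should_spill_to_idle_sibling(250.0, hints))  # long enough: wake a sibling
```

The scheduler side stays generic; only the number behind `hints` would
differ between architectures.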

- Paul

> (this is because the exit-from-self-refresh latency is pretty high.. at least 
> in DDR2/3)
>
>


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Paul Turner
On Wed, Aug 15, 2012 at 4:05 AM, Peter Zijlstra wrote:
> On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
>> Since there is no power saving consideration in scheduler CFS, I has a
>> very rough idea for enabling a new power saving schema in CFS.
>
> Adding Thomas, he always delights poking holes in power schemes.
>
>> It bases on the following assumption:
>> 1, If there are many task crowd in system, just let few domain cpus
>> running and let other cpus idle can not save power. Let all cpu take the
>> load, finish tasks early, and then get into idle. will save more power
>> and have better user experience.
>
> I'm not sure this is a valid assumption. I've had it explained to me by
> various people that race-to-idle isn't always the best thing. It has to
> do with the cost of switching power states and the duration of execution
> and other such things.
>
>> 2, schedule domain, schedule group perfect match the hardware, and
>> the power consumption unit. So, pull tasks out of a domain means
>> potentially this power consumption unit idle.
>
> I'm not sure I understand what you're saying, sorry.
>
>> So, according Peter mentioned in commit 8e7fbcbc22c(sched: Remove stale
>> power aware scheduling), this proposal will adopt the
>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>
> Yay, ideally we'd also provide a 3rd option: auto, which simply switches
> between the two based on AC/BAT, UPS status and simple things like that.
> But this seems like a later concern, you have to have something to pick
> between before you can pick :-)
>
>> And in scheduling, 2 place will care the policy, load_balance() and in
>> task fork/exec: select_task_rq_fair().
>
> ack
>
>> Here is some pseudo code try to explain the proposal behaviour in
>> load_balance() and select_task_rq_fair();
>
> Oh man.. A few words outlining the general idea would've been nice.
>
>> load_balance() {
>>   update_sd_lb_stats(); //get busiest group, idlest group data.
>>
>>   if (sd->nr_running > sd's capacity) {
>>   //power saving policy is not suitable for
>>   //this scenario, it runs like performance policy
>>   mv tasks from busiest cpu in busiest group to
>>   idlest  cpu in idlest group;
>
> Once upon a time we talked about adding a factor to the capacity for
> this. So say you'd allow 2*capacity before overflowing and waking
> another power group.
>
> But I think we should not go on nr_running here, PJTs per-entity load
> tracking stuff gives us much better measures -- also, repost that series
> already Paul! :-)

Yes -- I just got back from Africa this week.  It's updated for almost
all the previous comments but I ran out of time before I left to
re-post.  I'm just about caught up enough that I should be able to get
this done over the upcoming weekend.  Monday at the latest.

>
> Also, I'm not sure this is entirely correct, the thing you want to do
> for power aware stuff is to minimize the number of active power domains,
> this means you don't want idlest, you want least busy non-idle.
>
>>   } else {// the sd has enough capacity to hold all tasks.
>>   if (sg->nr_running > sg's capacity) {
>>   //imbalanced between groups
>>   if (schedule policy == performance) {
>>   //when 2 busiest group at same busy
>>   //degree, need to prefer the one has
>>   // softest group??
>>   move tasks from busiest group to
>>   idletest group;
>
> So I'd leave the currently implemented scheme as performance, and I
> don't think the above describes the current state.
>
>>   } else if (schedule policy == power)
>>   move tasks from busiest group to
>>   idlest group until busiest is just full
>>   of capacity.
>>   //the busiest group can balance
>>   //internally after next time LB,
>
> There's another thing we need to do, and that is collect tasks in a
> minimal amount of power domains. The old code (that got deleted) did
> something like that, you can revive some of the that code if needed -- I
> just killed everything to be able to start with a clean slate.
>
>
>>   } else {
>>   //all groups has enough capacity for its tasks.
>>   if (schedule policy == performance)
>>   //all tasks may has enough cpu
>>   //resources to run,
>>   //mv tasks from busiest to idlest group?
>>   //no, at this time, it's better to keep
>>   //the task on current cpu.
>>   //so, it is maybe better to do balance
>>   //in each of groups
>>   for_each_imbalance_groups()

Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Paul Turner
On Wed, Aug 15, 2012 at 11:02 AM, Arjan van de Ven
ar...@linux.intel.com wrote:
 On 8/15/2012 9:34 AM, Matthew Garrett wrote:
 On Wed, Aug 15, 2012 at 01:05:38PM +0200, Peter Zijlstra wrote:
 On Mon, 2012-08-13 at 20:21 +0800, Alex Shi wrote:
 It bases on the following assumption:
 1, If many tasks crowd the system, letting only a few domains' cpus run
 while the other cpus idle cannot save power. Letting all cpus take the
 load, finish the tasks early, and then go idle will save more power
 and give a better user experience.

 I'm not sure this is a valid assumption. I've had it explained to me by
 various people that race-to-idle isn't always the best thing. It has to
 do with the cost of switching power states and the duration of execution
 and other such things.

 This is affected by Intel's implementation - if there's a single active

 not just intel.. also AMD
 basically everyone who has the memory controller in the cpu package will
 end up with a restriction very similar to this.


I think this is circular to discussion previously held on this topic.
This preference is arch specific; we need to reduce the set of inputs
to a sensible, actionable set, and plumb that so that the architecture
and not the scheduler can supply this preference.

That you believe 100-300us is actually the tipping point vs power
migration cost is probably in itself one of the most useful replies
I've seen on this topic in all of the last few rounds of discussion
it's been through.  It suggests we could actually parameterize this in
a manner similar to wake-up migration cost; with a minimum usage
average for which it's worth spilling to an idle sibling.

- Paul

 (this is because the exit-from-self-refresh latency is pretty high.. at least 
 in DDR2/3)




Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Matthew Garrett
On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
  *Power policy*:
  
  So how is power policy different? As Peter says,'pack more than spread
  more'.
 
 this is ... a dubiously general statement.
 
 for good power, at least on Intel cpus, you want to spread. Parallelism is 
 efficient.

Is this really true? In a two-socket system I'd have thought the benefit 
of keeping socket 1 in package C3 outweighed the cost of keeping socket 
0 awake for slightly longer.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Arjan van de Ven
On 8/17/2012 11:41 AM, Matthew Garrett wrote:
 On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
 *Power policy*:

 So how is power policy different? As Peter says,'pack more than spread
 more'.

 this is ... a dubiously general statement.

 for good power, at least on Intel cpus, you want to spread. Parallelism is 
 efficient.
 
 Is this really true? In a two-socket system I'd have thought the benefit 
 of keeping socket 1 in package C3 outweighed the cost of keeping socket 
 0 awake for slightly longer.

not on Intel

you can't enter package c3 either until every one is down.
(e.g. memory controller must stay on etc etc)



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Matthew Garrett
On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
 On 8/17/2012 11:41 AM, Matthew Garrett wrote:
  On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
  this is ... a dubiously general statement.
 
  for good power, at least on Intel cpus, you want to spread. Parallelism is 
  efficient.
  
  Is this really true? In a two-socket system I'd have thought the benefit 
  of keeping socket 1 in package C3 outweighed the cost of keeping socket 
  0 awake for slightly longer.
 
 not on Intel
 
 you can't enter package c3 either until every one is down.
 (e.g. memory controller must stay on etc etc)

I thought that was only PC6 - is there any reason why the package cache 
can't be entirely powered down?

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Chris Friesen

On 08/17/2012 12:47 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 11:44:03AM -0700, Arjan van de Ven wrote:
>> On 8/17/2012 11:41 AM, Matthew Garrett wrote:
>>> On Thu, Aug 16, 2012 at 07:01:25AM -0700, Arjan van de Ven wrote:
>>>> this is ... a dubiously general statement.
>>>>
>>>> for good power, at least on Intel cpus, you want to spread.
>>>> Parallelism is efficient.
>>>
>>> Is this really true? In a two-socket system I'd have thought the benefit
>>> of keeping socket 1 in package C3 outweighed the cost of keeping socket
>>> 0 awake for slightly longer.
>>
>> not on Intel
>>
>> you can't enter package c3 either until every one is down.
>> (e.g. memory controller must stay on etc etc)
>
> I thought that was only PC6 - is there any reason why the package cache
> can't be entirely powered down?


According to 
http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf
once you're in package C6 then you can go to package C7.


The datasheet for the Xeon E5 (my variant at least) says it doesn't do 
C7 so never powers down the LLC.  However, as you said earlier, you 
can put the socket into C6, which saves about 30W compared to C1E.


So as far as I can see with this CPU at least you would benefit from 
shutting down a whole socket when possible.


Chris


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Matthew Garrett
On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
 On 08/17/2012 12:47 PM, Matthew Garrett wrote:

 The datasheet for the Xeon E5 (my variant at least) says it doesn't
 do C7 so never powers down the LLC.  However, as you said earlier,
 you can put the socket into C6, which saves about 30W compared
 to C1E.
 
 So as far as I can see with this CPU at least you would benefit from
 shutting down a whole socket when possible.

Having any active cores on the system prevents all packages from going 
into PC6 or deeper. What I'm not clear on is whether less deep package C 
states are also blocked.

-- 
Matthew Garrett | mj...@srcf.ucam.org


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-17 Thread Chris Friesen

On 08/17/2012 01:50 PM, Matthew Garrett wrote:
> On Fri, Aug 17, 2012 at 01:45:09PM -0600, Chris Friesen wrote:
>> On 08/17/2012 12:47 PM, Matthew Garrett wrote:
>>
>> The datasheet for the Xeon E5 (my variant at least) says it doesn't
>> do C7 so never powers down the LLC.  However, as you said earlier,
>> you can put the socket into C6, which saves about 30W compared
>> to C1E.
>>
>> So as far as I can see with this CPU at least you would benefit from
>> shutting down a whole socket when possible.
>
> Having any active cores on the system prevents all packages from going
> into PC6 or deeper. What I'm not clear on is whether less deep package C
> states are also blocked.



Right, we need the memory controller.

The E5 datasheet is a bit ambiguous, it reads:


A processor enters the package C3 low power state when:
 - At least one core is in the C3 state.
 - The other cores are in a C3 or lower power state, and the processor
   has been granted permission by the platform.



Unfortunately it doesn't specify whether that is the other cores in the 
package, or the other cores on the whole system.
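
The datasheet condition, and why the ambiguity matters, can be written as a small predicate. This is a sketch under assumptions: the function name, the integer encoding of core C-states, and the idea that the caller chooses the scope are all illustrative, not taken from the datasheet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * core_cstate[i] is the C-state depth of core i: 0 = C0 (running),
 * 3 = C3, 6 = C6. Which cores the caller passes in (one package or the
 * whole system) is exactly the point the datasheet leaves open.
 */
bool may_enter_package_c3(const int *core_cstate, size_t n_cores,
			  bool platform_permits)
{
	bool any_in_c3 = false;

	assert(core_cstate != NULL || n_cores == 0);
	if (!platform_permits)
		return false;
	for (size_t i = 0; i < n_cores; i++) {
		if (core_cstate[i] < 3)
			return false;	/* a core is shallower than C3 */
		if (core_cstate[i] == 3)
			any_in_c3 = true;
	}
	return any_in_c3;
}
```

Evaluated over one package whose cores are all in C3 or C6, the predicate holds; evaluated over the whole system while a core on the other socket is still running, it does not. The two readings of the datasheet therefore give different answers for the same machine state.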


Chris


Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-16 Thread Alex Shi
On 08/16/2012 10:01 PM, Arjan van de Ven wrote:

>> *Power policy*:
>>
>> So how is power policy different? As Peter says,'pack more than spread
>> more'.
> 
> this is ... a dubiously general statement.
> 
> for good power, at least on Intel cpus, you want to spread. Parallelism is 
> efficient.
> 
> the only thing you do not want to do, is wake cpus up for
> tasks that only run extremely briefly (think "100 usec" or less).


That's very important and valuable information!
How did you arrive at this figure? From context-switch cost or cache/TLB
refill cost?

> 
> so maybe the balance interval is slightly different, or more, you don't
> balance tasks that historically ran only for brief periods
> 
> 




Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-16 Thread Arjan van de Ven
On 8/16/2012 11:45 AM, Rik van Riel wrote:
> 
> The c-state governor can call the scheduler code before
> putting a CPU to sleep, to indicate (1) the wakeup latency
> of the CPU, and (2) whether TLB and/or cache get invalidated.

I don't think (2) is useful really; that basically always happens ;-)



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-16 Thread Rik van Riel

On 08/16/2012 10:01 AM, Arjan van de Ven wrote:
>> *Power policy*:
>>
>> So how is power policy different? As Peter says,'pack more than spread
>> more'.
>
> this is ... a dubiously general statement.
>
> for good power, at least on Intel cpus, you want to spread. Parallelism is
> efficient.
>
> the only thing you do not want to do, is wake cpus up for
> tasks that only run extremely briefly (think "100 usec" or less).
>
> so maybe the balance interval is slightly different, or more, you don't
> balance tasks that historically ran only for brief periods


This makes me think that maybe, in addition to tracking
the idle residency time in the c-state governor, we may
also want to track the average run times in the scheduler.

The c-state governor can call the scheduler code before
putting a CPU to sleep, to indicate (1) the wakeup latency
of the CPU, and (2) whether TLB and/or cache get invalidated.

At wakeup time, the scheduler can check whether the CPU
the to-be-woken process ran on is in a deeper sleep state,
and whether the typical run time for the process significantly
exceeds the wakeup latency of the CPU it last ran on.

If the process typically runs for a short interval, and/or
the process's CPU lost its cached state, it may be better
to run the just-woken task on the CPU that is doing the
waking up, instead of on the CPU where it used to run.

Does that make sense?

Am I overlooking any factors?



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-16 Thread Morten Rasmussen
Hi all,

On Wed, Aug 15, 2012 at 12:05:38PM +0100, Peter Zijlstra wrote:
> > 
> > sub proposal:
> > 1, Is it possible to balance tasks onto the idlest cpu rather than the
> > appointed 'balance cpu'? If so, it may save one more round of balancing.
> > The idlest cpu is preferably a newly idle cpu; otherwise it is the least
> > loaded cpu.
> > 2, se or task load is good for setting running time, but it should be
> > the second basis in load balancing. The first basis of LB is the number
> > of running tasks in the group/cpu, since whatever the weight of the
> > groups is, if the number of tasks is less than the number of cpus, the
> > group still has capacity to take more tasks. (This will consider the SMT
> > cpu power or other big/little cpu capacity on ARM.)
> 
> Ah, no we shouldn't balance on nr_running, but on the amount of time
> consumed. Imagine two tasks being woken at the same time, both tasks
> will only run a fraction of the available time, you don't want this to
> exceed your capacity because, run back to back, the one cpu will still be
> mostly idle.
> 
> What you want is to keep track of a per-cpu utilization level (inverse
> of idle-time) and, using PJT's per-task runnable avg, see if placing the
> new task on it will exceed the utilization limit.
> 
> I think some of the Linaro people actually played around with this,
> Vincent?
>

I agree. A better measure of cpu load and task weight than nr_running
and the current task load weight are necessary to do proper task
packing.

I have used PJTs per-task load-tracking for scheduling experiments on
heterogeneous systems and my experience is that it works quite well for
determining the load of a specific task. Something like PJTs work
would be a good starting point for power aware scheduling and better
support for heterogeneous systems.

One of the biggest challenges here for load-balancing is translating
task load from one cpu to another as the task load is influenced by the
total load of its cpu. So a task that appears to be heavy on an
oversubscribed cpu might not be so heavy after all when it is moved to a
cpu with plenty of cpu time to spare. This issue is likely to be more
pronounced on heterogeneous systems and systems with aggressive frequency
scaling. It might be possible to avoid having to translate load, or it
might not really matter, but I haven't completely convinced myself yet.
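
The simplest form of such a translation is scaling busy time by relative cpu capacity. The sketch below assumes that each cpu has a known relative capacity (for example 1024 for the fastest cpu at its highest frequency) and that the measured time is time actually spent running. Both are assumptions of this example, and the function name is hypothetical. It deliberately does not address the oversubscription effect described above, where time spent waiting inflates the apparent load.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Translate busy time measured on a source cpu into the busy time the
 * same work would need on a destination cpu, given relative capacities.
 */
uint64_t translate_busy_ns(uint64_t busy_ns, unsigned int cap_src,
			   unsigned int cap_dst)
{
	assert(cap_dst > 0);
	return busy_ns * cap_src / cap_dst;
}
```

Work that kept a half-capacity cpu busy for 8000ns would need about 4000ns on a full-capacity cpu, and the reverse move doubles it.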

My point is that getting the task load right or at least better is a
fundamental requirement for improving power aware scheduling.

Best regards,
Morten



Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-16 Thread Arjan van de Ven
> *Power policy*:
> 
> So how is power policy different? As Peter says,'pack more than spread
> more'.

this is ... a dubiously general statement.

for good power, at least on Intel cpus, you want to spread. Parallelism is 
efficient.

the only thing you do not want to do, is wake cpus up for
tasks that only run extremely briefly (think "100 usec" or less).

so maybe the balance interval is slightly different, or more, you don't
balance tasks that historically ran only for brief periods




Re: [discussion]sched: a rough proposal to enable power saving in scheduler

2012-08-16 Thread Arjan van de Ven
On 8/15/2012 10:03 PM, Alex Shi wrote:
> On 08/16/2012 12:19 AM, Matthew Garrett wrote:
> 
>> On Mon, Aug 13, 2012 at 08:21:00PM +0800, Alex Shi wrote:
>>
>>> power aware scheduling), this proposal will adopt the
>>> sched_balance_policy concept and use 2 kind of policy: performance, power.
>>
>> Are there workloads in which "power" might provide more performance than 
>> "performance"? If so, don't use these terms.
>>
> 
> 
> By design, the power scheme should have no chance of performing better.

ehm.

so in reality, the very first thing that helps power, is to run software 
efficiently.

anything else is completely secondary.

if placement policy leads to a placement that's different from the most
efficient placement, you're already burning extra power...


