Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-08-02 Thread Jeremy Eder
On 130729 12:59:47, Jeremy Eder wrote:
> On 130729 23:57:31, Youquan Song wrote:
> > Hi Jeremy,
> > 
> > I tried to reproduce your result and then fix the issue, but I have not
> > reproduced it yet.
> > 
> > I run netperf-2.6.0 with one machine as the server (netserver) and the
> > other machine running: netperf -t TCP_RR -H $SERVER_IP -l 60. The target
> > machine is used as both client and server. I do not reproduce the
> > performance drop issue. I also notice the result is not stable: sometimes
> > it is high, sometimes it is low. In summary, it is hard to get a
> > definitive result.
> > 
> > Can you tell me how to reproduce the issue? How do you get the C0 data?
> > 
> > What's your config for kernel?  Do you enable CONFIG_NO_HZ_FULL=y or
> > only CONFIG_NO_HZ=y?
> > 
> > 
> > Thanks
> > -Youquan 
> 
> Hi,
> 
> To answer both your and Daniel's question, those results used only
> CONFIG_NO_HZ=y.
> 
> These network latency benchmarks are fickle creatures, and need careful
> tuning to become reproducible.  Plus there are BIOS implications and tuning
> varies by vendor.
> 
> Anyway, for the most part it's probably not stable because, in order to get
> any sort of reproducibility between runs, you need to do at least these steps:
> 
> - ensure as little is running in userspace as possible
> - determine PCI affinity for the NIC
> - on both machines, isolate the socket connected to the NIC from userspace
>   tasks
> - Turn off irqbalance and bind all IRQs for that NIC to a single core on
>   the same socket as the NIC
> - run netperf with -TX,Y where X,Y are core numbers that you wish
>   netperf/netserver to run on, respectively.
> 
> For example, if your NIC is attached to socket 0 and socket 0 cores are
> enumerated 0-7, then:
> 
> - set /proc/irq/NNN/smp_affinity_list to, say, 6 for all vectors on that
>   NIC.
> - nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2
> 
> That should get you most of the way there.  The -s 2 connects and waits 2
> seconds; I found this to help with the first few seconds' worth of data.  Or
> you could just toss the first 2 seconds' worth, as it seems to take that long
> to stabilize.  What I mean is, if you're not using the -D1,1 option to
> netperf, you might not have seen that netperf tests seem to take a few
> seconds to stabilize even when properly tuned.
> 
> I got the C0 data by running turbostat in parallel with each benchmark run,
> then grabbing the C-state data for the cores relevant to the test.  In my
> case that was cores 4 and 6, where core 4 was where I put netperf/netserver
> and core 6 was where I put the NIC IRQs.  Then I parsed that output into a
> format that this could interpret:
> 
> https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py
> 
> I'm building a kernel from Rafael's tree and will try to confirm what Len
> already sent.  Thanks everyone for looking into it.


Hi, sorry for the delay.  In addition to the results I initially posted,
the below results confirm my initial data, plus what Len sent:

3.11-rc2 w/reverts
TCP_RR trans/s 54454.13

3.11-rc2 w/reverts + c0 lock
TCP_RR trans/s 55088.11
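
For reference, the "c0 lock" above is the same /dev/cpu_dma_latency=0 setting
as in the original report.  A minimal sketch of holding that lock for a single
run (assuming the PM QoS device takes a raw 4-byte s32 and keeps the request
active only while the file descriptor stays open):

  # hold a 0us cpu_dma_latency request for the duration of one benchmark run
  exec 3> /dev/cpu_dma_latency        # request lives only while fd 3 stays open
  printf '\x00\x00\x00\x00' >&3       # 4-byte binary zero = 0 usec target latency
  netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4
  exec 3>&-                           # closing the fd drops the request again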


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-29 Thread Jeremy Eder
On 130729 23:57:31, Youquan Song wrote:
> Hi Jeremy,
> 
> I tried to reproduce your result and then fix the issue, but I have not
> reproduced it yet.
> 
> I run netperf-2.6.0 with one machine as the server (netserver) and the other
> machine running: netperf -t TCP_RR -H $SERVER_IP -l 60. The target machine is
> used as both client and server. I do not reproduce the performance drop
> issue. I also notice the result is not stable: sometimes it is high,
> sometimes it is low. In summary, it is hard to get a definitive result.
> 
> Can you tell me how to reproduce the issue? How do you get the C0 data?
> 
> What's your config for kernel?  Do you enable CONFIG_NO_HZ_FULL=y or
> only CONFIG_NO_HZ=y?
> 
> 
> Thanks
> -Youquan 

Hi,

To answer both your and Daniel's question, those results used only
CONFIG_NO_HZ=y.
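
(A quick way to double-check the tick-related options on the running kernel --
a sketch, assuming the distro installs the config under /boot or exposes
/proc/config.gz:

  grep -E 'CONFIG_NO_HZ|CONFIG_HZ' /boot/config-$(uname -r)
  # or, if the kernel was built with CONFIG_IKCONFIG_PROC:
  zgrep -E 'CONFIG_NO_HZ|CONFIG_HZ' /proc/config.gz
)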

These network latency benchmarks are fickle creatures, and need careful
tuning to become reproducible.  Plus there are BIOS implications and tuning
varies by vendor.

Anyway, for the most part it's probably not stable because, in order to get
any sort of reproducibility between runs, you need to do at least these steps:

- ensure as little is running in userspace as possible
- determine PCI affinity for the NIC
- on both machines, isolate the socket connected to the NIC from userspace
  tasks
- Turn off irqbalance and bind all IRQs for that NIC to a single core on
  the same socket as the NIC
- run netperf with -TX,Y where X,Y are core numbers that you wish
  netperf/netserver to run on, respectively.

For example, if your NIC is attached to socket 0 and socket 0 cores are
enumerated 0-7, then:

- set /proc/irq/NNN/smp_affinity_list to, say, 6 for all vectors on that
  NIC.
- nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2

That should get you most of the way there.  The -s 2 connects and waits 2
seconds; I found this to help with the first few seconds' worth of data.  Or
you could just toss the first 2 seconds' worth, as it seems to take that long
to stabilize.  What I mean is, if you're not using the -D1,1 option to netperf,
you might not have seen that netperf tests seem to take a few seconds to
stabilize even when properly tuned.
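
Putting those steps together, one run ends up looking roughly like this (a
sketch only: the NIC name eth2, the IRQ core 6 and the netperf core 4 are
placeholders for whatever your topology actually gives you):

  # stop irqbalance so the manual IRQ placement sticks
  service irqbalance stop             # or however your distro stops it
  # pin every vector of the NIC (assumed to be eth2 here) to core 6
  for irq in $(awk '/eth2/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
      echo 6 > /proc/irq/$irq/smp_affinity_list
  done
  # server side
  netserver
  # client side: pin netperf to core 4, connect 2s early, report interim results
  nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2 -D1,1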

I got the C0 data by running turbostat in parallel with each benchmark run,
then grabbing the C-state data for the cores relevant to the test.  In my
case that was cores 4 and 6, where core 4 was where I put netperf/netserver
and core 6 was where I put the NIC IRQs.  Then I parsed that output into a
format that this could interpret:

https://github.com/bitly/data_hacks/blob/master/data_hacks/histogram.py
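
Roughly, the collection looked like this (again a sketch: the turbostat column
layout differs between versions, so the awk field numbers below are an
assumption to check against your own turbostat header, and histogram.py is the
script from the URL above):

  # sample C-state residency at 1-second intervals alongside the benchmark
  turbostat -i 1 > turbostat.log 2>&1 &
  TURBO_PID=$!
  nice -20 netperf -t TCP_RR -H $SERVER_IP -l 60 -T4,4 -s 2
  kill $TURBO_PID
  # keep only the rows for core 4 and feed the %c0 column to histogram.py
  awk '$2 == 4 { print $3 }' turbostat.log | python histogram.py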

I'm building a kernel from Rafael's tree and will try to confirm what Len
already sent.  Thanks everyone for looking into it.


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-29 Thread Youquan Song
Hi Jeremy,

I tried to reproduce your result and then fix the issue, but I have not
reproduced it yet.

I run netperf-2.6.0 with one machine as the server (netserver) and the other
machine running: netperf -t TCP_RR -H $SERVER_IP -l 60. The target machine is
used as both client and server. I do not reproduce the performance drop issue.
I also notice the result is not stable: sometimes it is high, sometimes it is
low. In summary, it is hard to get a definitive result.

Can you tell me how to reproduce the issue? How do you get the C0 data?

What's your kernel config?  Do you enable CONFIG_NO_HZ_FULL=y or
only CONFIG_NO_HZ=y?


Thanks
-Youquan 


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-29 Thread Arjan van de Ven

On 7/29/2013 9:01 AM, Lorenzo Pieralisi wrote:



Coupled C states on this level are a PAIN in many ways, and tend to totally 
suck for power
due to this and the general "too much is active" reasons.


I think the trend is moving towards core gating, which closely resembles what
the x86 world does today. Still, the interaction between the menu governor and
cluster states has to be characterized, and that's what we are doing at the
moment.


I suspect that we (Linux) need a different governor than menu for the coupled
case. Fundamentally you're going to need different algorithms that are much
more closely tied to the behavior of the specific hardware; the current menu
governor uses an abstract and simple hardware model (fully independent).
For specific hardware that deviates from the model, something different is
clearly needed.

It's an open question for me whether just the predictor is the part that gets
split out, or whether the whole governor needs to be split out.
(But frankly, if you take the predictor out of the menu governor, not a whole
lot is left.)



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-29 Thread Lorenzo Pieralisi
On Mon, Jul 29, 2013 at 03:29:20PM +0100, Arjan van de Ven wrote:
> On 7/29/2013 7:14 AM, Lorenzo Pieralisi wrote:
> >>
> >>
> >> btw this is largely a misunderstanding;
> >> tasks are not the issue; tasks use timers and those are perfectly 
> >> predictable.
> >> It's interrupts that are not and the heuristics are for that.
> >>
> >> Now, if your hardware does the really-bad-for-power wake-all on any 
> >> interrupt,
> >> then the menu governor logic is not good for you; rather than looking at 
> >> the next
> >> timer on the current cpu you need to look at the earliest timer on the set 
> >> of bundled
> >> cpus as the upper bound of the next wake event.
> >
> > Yes, that's true and we have to look into this properly, but certainly
> > a wake-up for a CPU in a package C-state is not beneficial to x86 CPUs 
> > either,
> > or I am missing something ?
> 
> a CPU core isn't in a package C state, the system is.
> (in a core C state the whole core is already powered down completely; a 
> package C state
> just also turns off the memory controller/etc)
> 
> package C states are global on x86 (not just per package); there's nothing one
> can do there in terms of grouping/etc.

So you are saying that package states are system states on x86, right?
Now things are a bit clearer. I was a bit baffled at first when you mentioned
that package C-states allow PM to turn off the DRAM controller; the concept of
a package C-state is a bit misleading and does not resemble much what cluster
states currently are on ARM. That's why I asked in the first place, thank you.

> > Even if the wake-up interrupts just power up one of the CPUs in a package
> > and leave other(s) alone, all HW state shared (ie caches) by those CPUs must
> > be turned on. What I am asking is: this bundled next event is a concept
> > that should apply to x86 CPUs too, or it is entirely managed in FW/HW
> > and the kernel just should not care ?
> 
> on Intel x86 cpus, there's not really bundled concept. or rather, there is 
> only 1 bundle
> (which amounts to the same thing).
> Yes in a multi-package setup there are some cache power effects... but there's
> not a lot one can do there.

On ARM there are some basic optimizations, like avoiding cleaning caches if
an IRQ is pending or the next event is due shortly, for instance. The
difference is that on ARM cache management is done in SW, under kernel or FW
control (which in a way is closer to what x86 does, even though I think on x86
most of the power control is offloaded to HW).

Given what you said above, I understand that even on a multi-package system,
package C-states are global, not per package. That's a pivotal point.

> The other cores don't wake up, so they still make their own correct decisions.
> 
> > I still do not understand how this "bundled" next event is managed on
> > x86 with the menu governor, or better why it is not managed at all, given
> > the importance of package C-states.
> 
> package C states on x86 are basically OS invisible. The OS manages core level 
> C states,
> the hardware manages the rest.
> The bundle part hurts you on a "one wakes all" system,
> not because of package level power effects, but because others wake up 
> prematurely
> (compared to what they expected), which causes them to think future wakeups
> will also be earlier. All because they get the "what is the next known event"
> wrong, and start correcting for known events instead of only for
> 'unpredictable' interrupts.

Well, reality is a bit more complex and probably less drastic: cores that are
woken up spuriously can be put back to sleep without going through the governor
again, but your point is taken.

> Things will go very wonky if you do that for sure.
> (I've seen various simulation data on that, and the menu governor indeed acts 
> quite poorly
> for that)

That's something we have been benchmarking, yes.

> >> And maybe even more special casing is needed... but I doubt it.
> >
> > I lost you here, can you elaborate pls ?
> 
> well.. just looking at the earliest timer might not be enough; that timer 
> might be on a different
> core that's still active, and may change after the current cpu has gone into 
> an idle state.

Yes, I have worked on this, and certainly next events must be tied to the
state a core is in; the bare next event does not help.

> Fun.

Big :D

> Coupled C states on this level are a PAIN in many ways, and tend to totally 
> suck for power
> due to this and the general "too much is active" reasons.

I think the trend is moving towards core gating, which closely resembles what
the x86 world does today. Still, the interaction between the menu governor and
cluster states has to be characterized, and that's what we are doing at the
moment.

Thank you very much,
Lorenzo



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-29 Thread Arjan van de Ven

On 7/29/2013 7:14 AM, Lorenzo Pieralisi wrote:



btw this is largely a misunderstanding;
tasks are not the issue; tasks use timers and those are perfectly predictable.
It's interrupts that are not and the heuristics are for that.

Now, if your hardware does the really-bad-for-power wake-all on any interrupt,
then the menu governor logic is not good for you; rather than looking at the 
next
timer on the current cpu you need to look at the earliest timer on the set of 
bundled
cpus as the upper bound of the next wake event.


Yes, that's true and we have to look into this properly, but certainly
a wake-up for a CPU in a package C-state is not beneficial to x86 CPUs either,
or I am missing something ?


a CPU core isn't in a package C state, the system is.
(in a core C state the whole core is already powered down completely; a package 
C state
just also turns off the memory controller/etc)

package C states are global on x86 (not just per package); there's nothing one
can do there in terms of grouping/etc.



Even if the wake-up interrupts just power up one of the CPUs in a package
and leave other(s) alone, all HW state shared (ie caches) by those CPUs must
be turned on. What I am asking is: this bundled next event is a concept
that should apply to x86 CPUs too, or it is entirely managed in FW/HW
and the kernel just should not care ?


on Intel x86 CPUs, there's not really a bundled concept; or rather, there is
only one bundle (which amounts to the same thing).
Yes, in a multi-package setup there are some cache power effects... but there's
not a lot one can do there.
The other cores don't wake up, so they still make their own correct decisions.


I still do not understand how this "bundled" next event is managed on
x86 with the menu governor, or better why it is not managed at all, given
the importance of package C-states.


package C states on x86 are basically OS invisible. The OS manages core level C 
states,
the hardware manages the rest.
The bundle part hurts you on a "one wakes all" system, not because of package
level power effects, but because the others wake up prematurely (compared to
what they expected), which causes them to think future wakeups will also be
earlier. All because they get the "what is the next known event" wrong, and
start correcting for known events instead of only for 'unpredictable'
interrupts.
Things will go very wonky if you do that for sure.
(I've seen various simulation data on that, and the menu governor indeed acts 
quite poorly
for that)
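
(The core-level states the OS does manage are visible through cpuidle's sysfs
interface; a quick sketch of inspecting them, assuming the standard cpuidle
sysfs layout:

  cd /sys/devices/system/cpu/cpu0/cpuidle
  for s in state*; do
      printf '%s: %s, exit latency %s us, entered %s times\n' \
          "$s" "$(cat $s/name)" "$(cat $s/latency)" "$(cat $s/usage)"
  done

Package C-states don't show up in that list; per the above, the hardware
handles those on its own.)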


And maybe even more special casing is needed... but I doubt it.


I lost you here, can you elaborate pls ?


well.. just looking at the earliest timer might not be enough; that timer might 
be on a different
core that's still active, and may change after the current cpu has gone into an 
idle state.
Fun.
Coupled C states on this level are a PAIN in many ways, and tend to totally 
suck for power
due to this and the general "too much is active" reasons.



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-29 Thread Lorenzo Pieralisi
On Mon, Jul 29, 2013 at 02:12:58PM +0100, Arjan van de Ven wrote:
> >
> > The menu governor tries to deduce the next wakeup but based on events
> > per cpu. That means if a task with a specific behavior is migrated
> > across cpus, the statistics will be wrong.
> 
> 
> btw this is largely a misunderstanding;
> tasks are not the issue; tasks use timers and those are perfectly predictable.
> It's interrupts that are not and the heuristics are for that.
> 
> Now, if your hardware does the really-bad-for-power wake-all on any interrupt,
> then the menu governor logic is not good for you; rather than looking at the 
> next
> timer on the current cpu you need to look at the earliest timer on the set of 
> bundled
> cpus as the upper bound of the next wake event.

Yes, that's true and we have to look into this properly, but certainly a
wake-up for a CPU in a package C-state is not beneficial to x86 CPUs either,
or am I missing something?

Even if the wake-up interrupts just power up one of the CPUs in a package and
leave the other(s) alone, all HW state shared by those CPUs (i.e. caches) must
be turned on. What I am asking is: is this bundled next event a concept that
should apply to x86 CPUs too, or is it entirely managed in FW/HW so that the
kernel just should not care?

I still do not understand how this "bundled" next event is managed on x86 with
the menu governor, or rather why it is not managed at all, given the importance
of package C-states.

> And maybe even more special casing is needed... but I doubt it.

I lost you here, can you elaborate pls ?

Thanks a lot,
Lorenzo



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-29 Thread Arjan van de Ven


The menu governor tries to deduce the next wakeup but based on events
per cpu. That means if a task with a specific behavior is migrated
across cpus, the statistics will be wrong.



btw this is largely a misunderstanding;
tasks are not the issue; tasks use timers and those are perfectly predictable.
It's interrupts that are not and the heuristics are for that.

Now, if your hardware does the really-bad-for-power wake-all on any interrupt,
then the menu governor logic is not good for you; rather than looking at the 
next
timer on the current cpu you need to look at the earliest timer on the set of 
bundled
cpus as the upper bound of the next wake event.
And maybe even more special casing is needed... but I doubt it.




Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-29 Thread Arjan van de Ven


Beside that, on ARM the cpus could be coupled and the timer to detect
the prediction will wake up the entire cluster, making the power saving
less efficient and impacting the statistics of the other cpu.


an ARM specific governor would be quite appropriate in that case.



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-27 Thread Rafael J. Wysocki
On Saturday, July 27, 2013 02:22:53 AM Len Brown wrote:
> >> OK, I'll queue up the reverts as fixes for 3.11-rc4.
> >
> > So, the reverts are on the fixes-next branch of the linux-pm.git tree that 
> > you
> > can access at
> >
> > http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next
> >
> > However, they are not simple reverts as we've had some non-trivial changes 
> > on
> > top of those commits already, so I'd appreciate it a lot if somebody could
> > double check if I didn't break anything in them.
> 
> I've verified that the reverts improve netperf TCP_RR performance.
> 
> Here I've got two Xeons, slightly different, so I run in both directions.
> Also I run two ways -- out of the box, plus with cpufreq set to max
> frequency.  The reverts improve all 4 cases:
> 
> JKT → IVT    IVT → JKT
> 
>  3.11.0-rc2    baseline w/o revert    Out of Box
> 
> 20420    19963
> 20658    19915
> 20298    20320
> 
>  3.11.0-rc2    baseline w/o revert    max freq
> 
> 59658    51427
> 59663    51503
> 59416    51343
> 
> 3.11.0-rc2-2-g74bce39    Linux-pm “fixes-next” branch
> 
> 3.11.0-rc2-2-g74bce39    Out of box
> 23227    22056
> 23306    22125
> 23387    40226    40k result saw some 2.2 GHz vs 1.2 GHz in other runs
>          21991
> 
> 3.11.0-rc2-2-g74bce39    Max-freq
> 67240    57645
> 64880    56764
> 65924    57435
> 
> Tested-by: Len Brown 

Many thanks Len!

Rafael



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-27 Thread Daniel Lezcano
On 07/26/2013 08:29 PM, Rik van Riel wrote:
> On 07/26/2013 02:27 PM, Arjan van de Ven wrote:
>> On 7/26/2013 11:13 AM, Rik van Riel wrote:
>>
>>>
>>> Could you try running the tests with just the repeat mode
>>> stuff from commit 69a37bea excluded, but leaving the common
>>> infrastructure and commit e11538?
>>>
>>
>> personally I think we should go the other way around.
>> revert the set entirely first, and now, and get our performance back
>> to what it should be
>>
>> and then see what we can add back without causing the regressions.
>> this may take longer, or be done in steps, and that's ok.
>>
>> the end point may well be the same... but we can then evaluate in the
>> right
>> direction.
> 
> Works for me. I have no objection to reverting both patches,
> if the people planning to fix the code prefer that :)

I am in favor of reverting these patches, also because they don't solve the
root problem of reducing prediction failures.

The menu governor tries to deduce the next wakeup, but based on per-cpu
events. That means if a task with a specific behavior is migrated across cpus,
the statistics will be wrong. IMHO, the menu governor has too much heuristic in
there, and maybe we should consider having the scheduler pass more information
to the governor and remove some of these heuristics.
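
(For reference, which cpuidle driver and governor a given box is actually
running can be read from sysfs -- a small sketch, assuming the standard cpuidle
sysfs layout:

  cat /sys/devices/system/cpu/cpuidle/current_driver
  cat /sys/devices/system/cpu/cpuidle/current_governor_ro
)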

Besides that, on ARM the cpus could be coupled, and the timer used to detect a
failed prediction will wake up the entire cluster, making the power saving
less efficient and impacting the statistics of the other cpus.


-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-27 Thread Daniel Lezcano
On 07/26/2013 07:33 PM, Jeremy Eder wrote:
> Hello,
> 
> We believe we've identified a particular commit to the cpuidle code that
> seems to be impacting performance of a variety of workloads.  The simplest
> way to reproduce is using the netperf TCP_RR test, so we're using that, on a
> pair of Sandy Bridge based servers.  We also have data from a large database
> setup where performance is also measurably/positively impacted, though that
> test data isn't easily share-able.
> 
> Included below are test results from 3 test kernels:

Is the system tickless or with a periodic tick ?



> kernel   reverts
> ---
> 1) vanilla   upstream (no reverts)
> 
> 2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c
> 
> 3) test  reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
>  e11538d1f03914eb92af5a1a378375c05ae8520c
> 
> In summary, netperf TCP_RR numbers improve by approximately 4% after
> reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.  When
> 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never
> seems to get above 40%.  Taking that patch out gets C0 near 100% quite
> often, and performance increases.
> 
> The below data are histograms representing the %c0 residency @ 1-second
> sample rates (using turbostat), while under netperf test.
> 
> - If you look at the first 4 histograms, you can see %c0 residency almost
>   entirely in the 30,40% bin.
> - The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
>   shows %c0 in the 80,90,100% bins.
> 
> Below each kernel name are netperf TCP_RR trans/s numbers for the
> particular kernel that can be disclosed publicly, comparing the 3 test
> kernels.  We ran a 4th test with the vanilla kernel where we've also set
> /dev/cpu_dma_latency=0 to show overall impact boosting single-threaded
> TCP_RR performance over 11% above baseline.
> 
> 3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):  
> TCP_RR trans/s 54323.78
> 
> ---
> 3.10-rc2 vanilla RX (no reverts)
> TCP_RR trans/s 48192.47
> 
> Receiver %c0
>    0 -  10   [ 1]
>   10 -  20   [ 0]
>   20 -  30   [ 0]
>   30 -  40   [59]
>   40 -  50   [ 1]
>   50 -  60   [ 0]
>   60 -  70   [ 0]
>   70 -  80   [ 0]
>   80 -  90   [ 0]
>   90 - 100   [ 0]
> 
> Sender %c0
>    0 -  10   [ 1]
>   10 -  20   [ 0]
>   20 -  30   [ 0]
>   30 -  40   [11]
>   40 -  50   [49]
>   50 -  60   [ 0]
>   60 -  70   [ 0]
>   70 -  80   [ 0]
>   80 -  90   [ 0]
>   90 - 100   [ 0]
> 
> ---
> 3.10-rc2 perfteam2 RX (reverts commit
> e11538d1f03914eb92af5a1a378375c05ae8520c)
> TCP_RR trans/s 49698.69
> 
> Receiver %c0
>    0 -  10   [ 1]
>   10 -  20   [ 1]
>   20 -  30   [ 0]
>   30 -  40   [59]
>   40 -  50   [ 0]
>   50 -  60   [ 0]
>   60 -  70   [ 0]
>   70 -  80   [ 0]
>   80 -  90   [ 0]
>   90 - 100   [ 0]
> 
> Sender %c0
>    0 -  10   [ 1]
>   10 -  20   [ 0]
>   20 -  30   [ 0]
>   30 -  40   [ 2]
>   40 -  50   [58]
>   50 -  60   [ 0]
>   60 -  70   [ 0]
>   70 -  80   [ 0]
>   80 -  90   [ 0]
>   90 - 100   [ 0]
> 
> ---
> 3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and
> e11538d1f03914eb92af5a1a378375c05ae8520c)
> TCP_RR trans/s 47766.95
> 
> Receiver %c0
>    0 -  10   [ 1]
>   10 -  20   [ 1]
>   20 -  30   [ 0]
>   30 -  40   [27]
>   40 -  50   [ 2]
>   50 -  60   [ 0]
>   60 -  70   [ 2]
>   70 -  80   [ 0]
>   80 -  90   [ 0]
>   90 - 100   [28]
> 
> Sender:
>    0 -  10   [ 1]
>   10 -  20   [ 0]
>   20 -  30   [ 0]
>   30 -  40   [11]
>   40 -  50   [ 0]
>   50 -  60   [ 1]
>   60 -  70   [ 0]
>   70 -  80   [ 3]
>   80 -  90   [

Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-27 Thread Len Brown
>> OK, I'll queue up the reverts as fixes for 3.11-rc4.
>
> So, the reverts are on the fixes-next branch of the linux-pm.git tree that you
> can access at
>
> http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next
>
> However, they are not simple reverts as we've had some non-trivial changes on
> top of those commits already, so I'd appreciate it a lot if somebody could
> double check if I didn't break anything in them.

I've verified that the reverts improve netperf TCP_RR performance.

Here I've got two Xeons, slightly different, so I run in both directions.
Also I run two ways -- out of the box, plus with cpufreq set to max frequency.
The reverts improve all 4 cases:

JKT → IVT    IVT → JKT

 3.11.0-rc2    baseline w/o revert    Out of Box

20420    19963
20658    19915
20298    20320

 3.11.0-rc2    baseline w/o revert    max freq

59658    51427
59663    51503
59416    51343

3.11.0-rc2-2-g74bce39    Linux-pm “fixes-next” branch

3.11.0-rc2-2-g74bce39    Out of box
23227    22056
23306    22125
23387    40226    40k result saw some 2.2 GHz vs 1.2 GHz in other runs
         21991

3.11.0-rc2-2-g74bce39    Max-freq
67240    57645
64880    56764
65924    57435
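
(One common way to pin cpufreq to the top frequency for runs like the max-freq
rows above -- a sketch, assuming the sysfs cpufreq interface is present; the
exact method used for these numbers isn't spelled out here:

  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done
  # or, with the cpupower tool:  cpupower frequency-set -g performance
)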

Tested-by: Len Brown 

thanks,
Len Brown, Intel Open Source Technology Center


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-27 Thread Len Brown
 OK, I'll queue up the reverts as fixes for 3.11-rc4.

 So, the reverts are on the fixes-next branch of the linux-pm.git tree that you
 can access at

 http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next

 However, they are not simple reverts as we've had some non-trivial changes on
 top of those commits already, so I'd appreciate it a lot if somebody could
 double check if I didn't break anything in them.

I've verified that the reverts improve netperf TCP_RR performance.

Here I've got two Xeon's, slightly different, so I run in both directions.
Also I run two ways -- Out-of-the-box, plus with cpufreq
set to max frequency.  The reverts improve all 4 cases:

JKT → IVTIVT → JKT

 3.11.0-rc2baseline w/o revertOut of Box

2042019963
2065819915
2029820320

 3.11.0-rc2baseline w/o revertmax freq

5965851427
5966351503
5941651343

3.11.0-rc2-2-g74bce39Linux-pm “fixes-next” branch

3.11.0-rc2-2-g74bce39Out of box
2322722056
2330622125
233874022640k result saw some 2.2 ghz
vs 1.2 ghz in other runs
21991

3.11.0-rc2-2-g74bce39Max-freq
6724057645
6488056764
6592457435

Tested-by: Len Brown len.br...@intel.com

thanks,
Len Brown, Intel Open Source Technology Center
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-27 Thread Daniel Lezcano
On 07/26/2013 07:33 PM, Jeremy Eder wrote:
 Hello,
 
 We believe we've identified a particular commit to the cpuidle code that
 seems to be impacting performance of variety of workloads.  The simplest way 
 to
 reproduce is using netperf TCP_RR test, so we're using that, on a pair of
 Sandy Bridge based servers.  We also have data from a large database setup
 where performance is also measurably/positively impacted, though that test
 data isn't easily share-able.
 
 Included below are test results from 3 test kernels:

Is the system tickless or with a periodic tick ?



 kernel   reverts
 ---
 1) vanilla   upstream (no reverts)
 
 2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c
 
 3) test  reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
  e11538d1f03914eb92af5a1a378375c05ae8520c
 
 In summary, netperf TCP_RR numbers improve by approximately 4% after
 reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.  When
 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never
 seems to get above 40%.  Taking that patch out gets C0 near 100% quite
 often, and performance increases.
 
 The below data are histograms representing the %c0 residency @ 1-second
 sample rates (using turbostat), while under netperf test.
 
 - If you look at the first 4 histograms, you can see %c0 residency almost
   entirely in the 30,40% bin.
 - The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
   shows %c0 in the 80,90,100% bins.
 
 Below each kernel name are netperf TCP_RR trans/s numbers for the
 particular kernel that can be disclosed publicly, comparing the 3 test
 kernels.  We ran a 4th test with the vanilla kernel where we've also set
 /dev/cpu_dma_latency=0 to show overall impact boosting single-threaded
 TCP_RR performance over 11% above baseline.
 
 3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):  
 TCP_RR trans/s 54323.78
 
 ---
 3.10-rc2 vanilla RX (no reverts)
 TCP_RR trans/s 48192.47
 
 Receiver %c0 
 0. -10. [ 1]: *
10. -20. [ 0]: 
20. -30. [ 0]: 
30. -40. [59]: 
 ***
40. -50. [ 1]: *
50. -60. [ 0]: 
60. -70. [ 0]: 
70. -80. [ 0]: 
80. -90. [ 0]: 
90. -   100. [ 0]: 
 
 Sender %c0
 0. -10. [ 1]: *
10. -20. [ 0]: 
20. -30. [ 0]: 
30. -40. [11]: ***
40. -50. [49]:
 *
50. -60. [ 0]: 
60. -70. [ 0]: 
70. -80. [ 0]: 
80. -90. [ 0]: 
90. -   100. [ 0]: 
 
 ---
 3.10-rc2 perfteam2 RX (reverts commit
 e11538d1f03914eb92af5a1a378375c05ae8520c)
 TCP_RR trans/s 49698.69
 
 Receiver %c0 
 0. -10. [ 1]: *
10. -20. [ 1]: *
20. -30. [ 0]: 
30. -40. [59]:
 ***
40. -50. [ 0]: 
50. -60. [ 0]: 
60. -70. [ 0]: 
70. -80. [ 0]: 
80. -90. [ 0]: 
90. -   100. [ 0]: 
 
 Sender %c0 
 0. -10. [ 1]: *
10. -20. [ 0]: 
20. -30. [ 0]: 
30. -40. [ 2]: **
40. -50. [58]:
 **
50. -60. [ 0]: 
60. -70. [ 0]: 
70. -80. [ 0]: 
80. -90. [ 0]: 
90. -   100. [ 0]: 
 
 ---
 3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and
 e11538d1f03914eb92af5a1a378375c05ae8520c)
 TCP_RR trans/s 47766.95
 
 Receiver %c0
 0. -10. [ 1]: *
10. -20. [ 1]: *
20. -30. [ 0]: 
30. -40. [27]: ***
40. -50. [ 2]: **
50. -60. [ 0]: 
60. -70. [ 2]: **
70. -80. [ 0]: 
80. -90. [ 0]: 
90. -   100. [28]: 
 
 Sender:
 0. -10. [ 1]: *
10. -20. [ 0]: 
20. -30. [ 0]: 
30. -40. [11]: ***
40. -50. [ 0]: 
50. -60. [ 1]: *
60. -70. [ 0]: 
70. -80. [ 3]: ***
80. -90. [  

Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-27 Thread Daniel Lezcano
On 07/26/2013 08:29 PM, Rik van Riel wrote:
 On 07/26/2013 02:27 PM, Arjan van de Ven wrote:
 On 7/26/2013 11:13 AM, Rik van Riel wrote:


 Could you try running the tests with just the repeat mode
 stuff from commit 69a37bea excluded, but leaving the common
 infrastructure and commit e11538?


 personally I think we should go the other way around.
 revert the set entirely first, and now, and get our performance back
 to what it should be

 and then see what we can add back without causing the regressions.
 this may take longer, or be done in steps, and that's ok.

 the end point may well be the same... but we can then evaluate in the
 right
 direction.
 
 Works for me. I have no objection to reverting both patches,
 if the people planning to fix the code prefer that :)

I am favor of reverting these patches also because they don't solve the
root problem of reducing the prediction failure.

The menu governor tries to deduce the next wakeup but based on events
per cpu. That means if a task with a specific behavior is migrated
across cpus, the statistics will be wrong. IMHO, the menu governor has
too much heuristic in there and may be we should consider the scheduler
to pass more informations to the governor and remove some of these
heuristics.

Beside that, on ARM the cpus could be coupled and the timer to detect
the prediction will wake up the entire cluster, making the power saving
less efficient and impacting the statistics of the other cpu.


-- 
 http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  http://www.facebook.com/pages/Linaro Facebook |
http://twitter.com/#!/linaroorg Twitter |
http://www.linaro.org/linaro-blog/ Blog



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-27 Thread Rafael J. Wysocki
On Saturday, July 27, 2013 02:22:53 AM Len Brown wrote:
> > OK, I'll queue up the reverts as fixes for 3.11-rc4.
> 
> > So, the reverts are on the fixes-next branch of the linux-pm.git tree
> > that you can access at
> > 
> > http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next
> > 
> > However, they are not simple reverts as we've had some non-trivial
> > changes on top of those commits already, so I'd appreciate it a lot if
> > somebody could double check if I didn't break anything in them.
> 
> I've verified that the reverts improve netperf TCP_RR performance.
> 
> Here I've got two Xeons, slightly different, so I run in both directions.
> Also I run two ways -- out of the box, plus with cpufreq
> set to max frequency.  The reverts improve all 4 cases:
> 
> JKT → IVT    IVT → JKT
> 
> 3.11.0-rc2   baseline w/o revert   Out of Box
> 
> 20420        19963
> 20658        19915
> 20298        20320
> 
> 3.11.0-rc2   baseline w/o revert   max freq
> 
> 59658        51427
> 59663        51503
> 59416        51343
> 
> 3.11.0-rc2-2-g74bce39   Linux-pm “fixes-next” branch
> 
> 3.11.0-rc2-2-g74bce39   Out of box
> 23227        22056
> 23306        22125
> 23387        40226    (40k result saw some 2.2 GHz vs 1.2 GHz in other runs)
>              21991
> 
> 3.11.0-rc2-2-g74bce39   Max-freq
> 67240        57645
> 64880        56764
> 65924        57435
> 
> Tested-by: Len Brown len.br...@intel.com
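
For anyone repeating the "max frequency" comparison above, one common way
to pin cpufreq at the hardware maximum is through the standard sysfs knobs,
as in the sketch below.  This is only an illustration of the general
technique; it is not known exactly how these particular runs were set up.

/* Pin every cpu's cpufreq policy at its maximum frequency via sysfs.
 * Illustrative sketch; assumes the usual cpufreq sysfs layout.  Run as root.
 */
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	char path[128], buf[64];
	FILE *f;
	int cpu;

	for (cpu = 0; ; cpu++) {
		/* Read this cpu's hardware maximum; stop when the cpu does not exist. */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
		f = fopen(path, "r");
		if (!f)
			break;
		if (!fgets(buf, sizeof(buf), f))
			buf[0] = '\0';
		fclose(f);

		/* Select the performance governor... */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
		write_str(path, "performance\n");

		/* ...and raise the frequency floor to the hardware ceiling. */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_min_freq", cpu);
		write_str(path, buf);

		printf("cpu%d pinned at %s", cpu, buf);
	}
	return 0;
}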

Many thanks Len!

Rafael



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-26 Thread Rafael J. Wysocki
On Friday, July 26, 2013 11:48:36 PM Rafael J. Wysocki wrote:
> On Friday, July 26, 2013 02:29:40 PM Rik van Riel wrote:
> > On 07/26/2013 02:27 PM, Arjan van de Ven wrote:
> > > On 7/26/2013 11:13 AM, Rik van Riel wrote:
> > >
> > >>
> > >> Could you try running the tests with just the repeat mode
> > >> stuff from commit 69a37bea excluded, but leaving the common
> > >> infrastructure and commit e11538?
> > >>
> > >
> > > personally I think we should go the other way around.
> > > revert the set entirely first, and now, and get our performance back
> > > to what it should be
> > >
> > > and then see what we can add back without causing the regressions.
> > > this may take longer, or be done in steps, and that's ok.
> > >
> > > the end point may well be the same... but we can then evaluate in the 
> > > right
> > > direction.
> > 
> > Works for me. I have no objection to reverting both patches,
> > if the people planning to fix the code prefer that :)
> 
> OK, I'll queue up the reverts as fixes for 3.11-rc4.

So, the reverts are on the fixes-next branch of the linux-pm.git tree that you
can access at

http://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git/log/?h=fixes-next

However, they are not simple reverts as we've had some non-trivial changes on
top of those commits already, so I'd appreciate it a lot if somebody could
double check if I didn't break anything in them.

They are based on top of my master branch for now, but I'll rebase them on
3.11-rc3 when it's out.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-26 Thread Rafael J. Wysocki
On Friday, July 26, 2013 02:29:40 PM Rik van Riel wrote:
> On 07/26/2013 02:27 PM, Arjan van de Ven wrote:
> > On 7/26/2013 11:13 AM, Rik van Riel wrote:
> >
> >>
> >> Could you try running the tests with just the repeat mode
> >> stuff from commit 69a37bea excluded, but leaving the common
> >> infrastructure and commit e11538?
> >>
> >
> > personally I think we should go the other way around.
> > revert the set entirely first, and now, and get our performance back
> > to what it should be
> >
> > and then see what we can add back without causing the regressions.
> > this may take longer, or be done in steps, and that's ok.
> >
> > the end point may well be the same... but we can then evaluate in the right
> > direction.
> 
> Works for me. I have no objection to reverting both patches,
> if the people planning to fix the code prefer that :)

OK, I'll queue up the reverts as fixes for 3.11-rc4.

Thanks,
Rafael



Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-26 Thread Rik van Riel

On 07/26/2013 02:27 PM, Arjan van de Ven wrote:

> On 7/26/2013 11:13 AM, Rik van Riel wrote:
>>
>> Could you try running the tests with just the repeat mode
>> stuff from commit 69a37bea excluded, but leaving the common
>> infrastructure and commit e11538?
>
> personally I think we should go the other way around.
> revert the set entirely first, and now, and get our performance back
> to what it should be
>
> and then see what we can add back without causing the regressions.
> this may take longer, or be done in steps, and that's ok.
>
> the end point may well be the same... but we can then evaluate in the right
> direction.


Works for me. I have no objection to reverting both patches,
if the people planning to fix the code prefer that :)

--
All rights reversed


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-26 Thread Arjan van de Ven

On 7/26/2013 11:13 AM, Rik van Riel wrote:



> Could you try running the tests with just the repeat mode
> stuff from commit 69a37bea excluded, but leaving the common
> infrastructure and commit e11538?



personally I think we should go the other way around.
revert the set entirely first, and now, and get our performance back
to what it should be

and then see what we can add back without causing the regressions.
this may take longer, or be done in steps, and that's ok.

the end point may well be the same... but we can then evaluate in the right
direction.


Re: RFC: revert request for cpuidle patches e11538d1 and 69a37bea

2013-07-26 Thread Rik van Riel

On 07/26/2013 01:33 PM, Jeremy Eder wrote:

> Hello,
>
> We believe we've identified a particular commit to the cpuidle code that
> seems to be impacting performance of a variety of workloads.  The simplest
> way to reproduce is using the netperf TCP_RR test, so we're using that, on
> a pair of Sandy Bridge based servers.  We also have data from a large
> database setup where performance is also measurably/positively impacted,
> though that test data isn't easily share-able.
>
> Included below are test results from 3 test kernels:
>
> kernel   reverts
> ---
> 1) vanilla   upstream (no reverts)
>
> 2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c
>
> 3) test  reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
>              e11538d1f03914eb92af5a1a378375c05ae8520c
>
> In summary, netperf TCP_RR numbers improve by approximately 4% after
> reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4.  When
> 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency never
> seems to get above 40%.  Taking that patch out gets C0 near 100% quite
> often, and performance increases.


Could you try running the tests with just the repeat mode
stuff from commit 69a37bea excluded, but leaving the common
infrastructure and commit e11538?

--
All rights reversed

