Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity

On 06/13/2010 06:03 PM, Avi Kivity wrote:

Currently fpu management is only lazy in one direction.  When we switch into
a task, we may avoid loading the fpu state in the hope that the task will
never use it.  If we guess right we save an fpu load/save cycle; if not,
a Device not Available exception will remind us to load the fpu.

However, in the other direction, fpu management is eager.  When we switch out
of an fpu-using task, we always save its fpu state.

This is wasteful if the task(s) that run until we switch back in all don't use
the fpu, since we could have kept the task's fpu on the cpu all this time
and saved an fpu save/load cycle.  This can be quite common with threaded
interrupts, but will also happen with normal kernel threads and even normal
user tasks.

This patch series converts task fpu management to be fully lazy.  When
switching out of a task, we keep its fpu state on the cpu, only flushing it
if some other task needs the fpu.
   


Ingo, Peter, any feedback on this?

--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread H. Peter Anvin
On 06/16/2010 12:24 AM, Avi Kivity wrote:
 
 Ingo, Peter, any feedback on this?
 

Conceptually, this makes sense to me.  However, I have a concern what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU.  That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity

On 06/16/2010 10:32 AM, H. Peter Anvin wrote:

On 06/16/2010 12:24 AM, Avi Kivity wrote:
   

Ingo, Peter, any feedback on this?
 

Conceptually, this makes sense to me.  However, I have a concern what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU.  That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.
   


Right, this optimization isn't free.

I think the tradeoff is favourable since task migrations are much less 
frequent than context switches within the same cpu, can the scheduler 
experts comment?


We can also mitigate some of the IPIs if we know that we're migrating on 
the cpu we're migrating from (i.e. we're pushing tasks to another cpu, 
not pulling them from their cpu).  Is that a common case, and if so, 
where can I hook a call to unlazy_fpu() (or its new equivalent)?


Note that kvm on intel has exactly the same issue (the VMPTR and VMCS 
are on-chip registers that are expensive to load and save, so we keep 
them loaded even while not scheduled, and IPI if we notice we've 
migrated; note that architecturally the cpu can cache multiple VMCSs 
simultaneously (though I doubt they cache multiple VMCSs 
microarchitecturally at this point)).


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Ingo Molnar

(Cc:-ed various performance/optimization folks)

* Avi Kivity a...@redhat.com wrote:

 On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
 On 06/16/2010 12:24 AM, Avi Kivity wrote:
 Ingo, Peter, any feedback on this?
  Conceptually, this makes sense to me.  However, I have a concern what
  happens when a task is scheduled on another CPU, while its FPU state is
  still in registers in the original CPU.  That would seem to require
  expensive IPIs to spill the state in order for the rescheduling to
  proceed, and this could really damage performance.
 
 Right, this optimization isn't free.
 
 I think the tradeoff is favourable since task migrations are much
 less frequent than context switches within the same cpu, can the
 scheduler experts comment?

This cannot be stated categorically without precise measurements of 
known-good, known-bad, average FPU usage and average CPU usage scenarios. All 
these workloads have different characteristics.

I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, 
various lmbench components, X benchmarks, tiobench - you name it. Combined 
with the fact that most micro-benchmarks wont be using the FPU, while in the 
long run most processes will be using the FPU due to SIMM instructions. So 
even a positive result might be skewed in practice. Has to be measured 
carefully IMO - and i havent seen a _single_ performance measurement in the 
submission mail. This is really essential.

So this does not look like a patch-set we could apply without gathering a 
_ton_ of hard data about advantages and disadvantages.

 We can also mitigate some of the IPIs if we know that we're migrating on the 
 cpu we're migrating from (i.e. we're pushing tasks to another cpu, not 
 pulling them from their cpu).  Is that a common case, and if so, where can I 
 hook a call to unlazy_fpu() (or its new equivalent)?

When the system goes from idle to less idle then most of the 'fast' migrations 
happen on a 'push' model - on a busy CPU we wake up a new task and push it out 
to a known-idle CPU. At that point we can indeed unlazy the FPU with probably 
little cost.

But on busy servers where most wakeups are IRQ based the chance of being on 
the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of 
CPUs.

If there's some sucky corner case in theory we could approach it statistically 
and measure the ratio of fast vs. slow migration vs. local context switches - 
but that looks a bit complex.

Dunno.

Ingo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Nick Piggin
On Wed, Jun 16, 2010 at 10:39:41AM +0200, Ingo Molnar wrote:
 
 (Cc:-ed various performance/optimization folks)
 
 * Avi Kivity a...@redhat.com wrote:
 
  On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
  On 06/16/2010 12:24 AM, Avi Kivity wrote:
  Ingo, Peter, any feedback on this?
   Conceptually, this makes sense to me.  However, I have a concern what
   happens when a task is scheduled on another CPU, while its FPU state is
   still in registers in the original CPU.  That would seem to require
   expensive IPIs to spill the state in order for the rescheduling to
   proceed, and this could really damage performance.
  
  Right, this optimization isn't free.
  
  I think the tradeoff is favourable since task migrations are much
  less frequent than context switches within the same cpu, can the
  scheduler experts comment?
 
 This cannot be stated categorically without precise measurements of 
 known-good, known-bad, average FPU usage and average CPU usage scenarios. All 
 these workloads have different characteristics.
 
 I can imagine bad effects across all sorts of workloads: tcpbench, AIM7, 
 various lmbench components, X benchmarks, tiobench - you name it. Combined 
 with the fact that most micro-benchmarks wont be using the FPU, while in the 
 long run most processes will be using the FPU due to SIMM instructions. So 
 even a positive result might be skewed in practice. Has to be measured 
 carefully IMO - and i havent seen a _single_ performance measurement in the 
 submission mail. This is really essential.

It can be nice to code an absolute worst-case microbenchmark too.

Task migration can actually be very important to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contented resource (IO or mutex or whatever).
I suspect the main issues in that case is the actual context switching
and contention, but it would be nice to see just how much slower it
could get.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Samuel Thibault
Ingo Molnar, le Wed 16 Jun 2010 10:39:41 +0200, a écrit :
 in the long run most processes will be using the FPU due to SIMM  
 instructions.

I believe glibc already uses SIMM instructions for e.g. memcpy and
friends, i.e. basically all applications...

Samuel
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity

On 06/16/2010 11:39 AM, Ingo Molnar wrote:

(Cc:-ed various performance/optimization folks)

* Avi Kivitya...@redhat.com  wrote:

   

On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
 

On 06/16/2010 12:24 AM, Avi Kivity wrote:
   

Ingo, Peter, any feedback on this?
 

Conceptually, this makes sense to me.  However, I have a concern what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU.  That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.
   

Right, this optimization isn't free.

I think the tradeoff is favourable since task migrations are much
less frequent than context switches within the same cpu, can the
scheduler experts comment?
 

This cannot be stated categorically without precise measurements of
known-good, known-bad, average FPU usage and average CPU usage scenarios. All
these workloads have different characteristics.

I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
various lmbench components, X benchmarks, tiobench - you name it. Combined
with the fact that most micro-benchmarks wont be using the FPU, while in the
long run most processes will be using the FPU due to SIMM instructions. So
even a positive result might be skewed in practice. Has to be measured
carefully IMO - and i havent seen a _single_ performance measurement in the
submission mail. This is really essential.
   


I have really no idea what to measure.  Which would you most like to see?


So this does not look like a patch-set we could apply without gathering a
_ton_ of hard data about advantages and disadvantages.
   


I agree (not to mention that I'm not really close to having an applyable 
patchset).


Note some of the advantages will not be in throughput but in latency 
(making kernel_fpu_begin() preemptible, and reducing context switch time 
for event threads).



We can also mitigate some of the IPIs if we know that we're migrating on the
cpu we're migrating from (i.e. we're pushing tasks to another cpu, not
pulling them from their cpu).  Is that a common case, and if so, where can I
hook a call to unlazy_fpu() (or its new equivalent)?
 

When the system goes from idle to less idle then most of the 'fast' migrations
happen on a 'push' model - on a busy CPU we wake up a new task and push it out
to a known-idle CPU. At that point we can indeed unlazy the FPU with probably
little cost.
   


Can you point me to the code which does this?


But on busy servers where most wakeups are IRQ based the chance of being on
the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
CPUs.
   


But don't we usually avoid pulls due to NUMA and cache considerations?


If there's some sucky corner case in theory we could approach it statistically
and measure the ratio of fast vs. slow migration vs. local context switches -
but that looks a bit complex.

   


I certainly wouldn't want to start with it.


Dunno.
   


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity

On 06/16/2010 12:10 PM, Nick Piggin wrote:



This cannot be stated categorically without precise measurements of
known-good, known-bad, average FPU usage and average CPU usage scenarios. All
these workloads have different characteristics.

I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
various lmbench components, X benchmarks, tiobench - you name it. Combined
with the fact that most micro-benchmarks wont be using the FPU, while in the
long run most processes will be using the FPU due to SIMM instructions. So
even a positive result might be skewed in practice. Has to be measured
carefully IMO - and i havent seen a _single_ performance measurement in the
submission mail. This is really essential.
 

It can be nice to code an absolute worst-case microbenchmark too.
   


Sure.


Task migration can actually be very important to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contented resource (IO or mutex or whatever).
I suspect the main issues in that case is the actual context switching
and contention, but it would be nice to see just how much slower it
could get.
   


If it's just cpu oversubscription then the IPIs will be limited by the 
rebalance rate and the time slice, so as you say it has to involve 
contention and frequent wakeups as well as heavy cpu usage.  That won't 
be easy to code.  Can you suggest an existing benchmark to run?


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Really lazy fpu

2010-06-16 Thread Avi Kivity

On 06/16/2010 12:01 PM, Samuel Thibault wrote:

Ingo Molnar, le Wed 16 Jun 2010 10:39:41 +0200, a écrit :
   

in the long run most processes will be using the FPU due to SIMM
instructions.
 

I believe glibc already uses SIMM instructions for e.g. memcpy and
friends, i.e. basically all applications...
   


I think they ought to be using 'rep movs' on newer processors, but yes 
you're right.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Really lazy fpu

2010-06-13 Thread Valdis . Kletnieks
On Sun, 13 Jun 2010 18:03:43 +0300, Avi Kivity said:
 Currently fpu management is only lazy in one direction.  When we switch into
 a task, we may avoid loading the fpu state in the hope that the task will
 never use it.  If we guess right we save an fpu load/save cycle; if not,
 a Device not Available exception will remind us to load the fpu.
 
 However, in the other direction, fpu management is eager.  When we switch out
 of an fpu-using task, we always save its fpu state.

Does anybody have numbers on how many clocks it takes a modern CPU design
to do a FPU state save or restore?  I know it must have been painful in the
days before cache memory, having to make added trips out to RAM for 128-bit
registers.  But what's the impact today? (Yes, I see there's the potential
for a painful IPI call - anything else?)

Do we have any numbers on how many saves/restores this will save us when
running the hypothetical standard Gnome desktop environment?  How common
is the we went all the way around to the original single FPU-using task case?


pgpslLeJ3IhFt.pgp
Description: PGP signature