Re: [PATCH 0/4] Really lazy fpu
On 06/13/2010 06:03 PM, Avi Kivity wrote:
> Currently fpu management is only lazy in one direction. When we switch
> into a task, we may avoid loading the fpu state in the hope that the
> task will never use it. If we guess right we save an fpu load/save
> cycle; if not, a Device Not Available exception will remind us to load
> the fpu.
>
> However, in the other direction, fpu management is eager. When we
> switch out of an fpu-using task, we always save its fpu state. This is
> wasteful if the task(s) that run until we switch back in all don't use
> the fpu, since we could have kept the task's fpu state on the cpu all
> this time and saved an fpu save/load cycle. This can be quite common
> with threaded interrupts, but will also happen with normal kernel
> threads and even normal user tasks.
>
> This patch series converts task fpu management to be fully lazy. When
> switching out of a task, we keep its fpu state on the cpu, only
> flushing it if some other task needs the fpu.

Ingo, Peter, any feedback on this?
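For concreteness, here is a minimal user-space model of the idea (all
names are invented stand-ins for illustration, not the actual kernel
code): each cpu remembers which task's fpu state is live in its
registers, switch-out does nothing, and the previous owner's state is
spilled only when another fpu-using task is switched in.

/* Toy user-space model of fully lazy fpu switching - NOT kernel code;
 * all names here are hypothetical. */
#include <stdio.h>
#include <string.h>

struct task {
    const char *name;
    char fpu_state[64];    /* stand-in for the real xstate save area */
    int uses_fpu;
};

struct cpu {
    struct task *fpu_owner;    /* whose state is live in the registers */
    char fpu_regs[64];         /* stand-in for the hardware registers */
};

/* Switch-out is free: the owner's state stays in the registers.
 * All the work happens lazily at switch-in. */
static void switch_in(struct cpu *cpu, struct task *next)
{
    if (!next->uses_fpu)
        return;                         /* lazy in the old direction */
    if (cpu->fpu_owner == next)
        return;                         /* save+restore cycle avoided */
    if (cpu->fpu_owner)                 /* flush the previous owner now */
        memcpy(cpu->fpu_owner->fpu_state, cpu->fpu_regs, 64);
    memcpy(cpu->fpu_regs, next->fpu_state, 64);
    cpu->fpu_owner = next;
}

int main(void)
{
    struct cpu cpu = { 0 };
    struct task a = { .name = "a", .fpu_state = "A-state", .uses_fpu = 1 };
    struct task irq = { .name = "irq", .uses_fpu = 0 };

    switch_in(&cpu, &a);      /* loads a's state */
    switch_in(&cpu, &irq);    /* no save: the irq thread never uses the fpu */
    switch_in(&cpu, &a);      /* no restore either: a still owns the fpu */
    printf("owner=%s, regs=\"%s\"\n", cpu.fpu_owner->name, cpu.fpu_regs);
    return 0;
}

With threaded interrupts the middle case is exactly the win being
claimed: the interrupt thread runs and exits without ever touching the
fpu, and the interrupted task's state never leaves the registers.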
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 12:24 AM, Avi Kivity wrote:
> Ingo, Peter, any feedback on this?

Conceptually, this makes sense to me.

However, I have a concern about what happens when a task is scheduled on
another CPU while its FPU state is still in the registers of the
original CPU. That would seem to require expensive IPIs to spill the
state in order for the rescheduling to proceed, and this could really
damage performance.

	-hpa
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
> However, I have a concern about what happens when a task is scheduled
> on another CPU while its FPU state is still in the registers of the
> original CPU. That would seem to require expensive IPIs to spill the
> state in order for the rescheduling to proceed, and this could really
> damage performance.

Right, this optimization isn't free. I think the tradeoff is favourable
since task migrations are much less frequent than context switches
within the same cpu; can the scheduler experts comment?

We can also mitigate some of the IPIs if we know that we're migrating on
the cpu we're migrating from (i.e. we're pushing tasks to another cpu,
not pulling them from their cpu). Is that a common case, and if so,
where can I hook a call to unlazy_fpu() (or its new equivalent)?

Note that kvm on Intel has exactly the same issue: the VMPTR and VMCS
are on-chip registers that are expensive to load and save, so we keep
them loaded even while not scheduled, and IPI if we notice we've
migrated. Architecturally the cpu can cache multiple VMCSs
simultaneously, though I doubt current implementations do so
microarchitecturally.
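To make the two migration cases concrete, a sketch on top of the toy
model above (same hypothetical struct cpu / struct task, still not the
real kernel paths): a pull has to interrupt the remote cpu to spill the
state, while a push can spill locally, which is where an
unlazy_fpu()-style hook would go.

/* Pull: the task's state is live on another cpu, which must be
 * interrupted. In the kernel this would be an expensive IPI; in the
 * toy model we just call it synchronously. */
static void spill_remote(struct cpu *remote)
{
    if (remote->fpu_owner) {
        memcpy(remote->fpu_owner->fpu_state, remote->fpu_regs, 64);
        remote->fpu_owner = NULL;
    }
}

/* Push: we are running on 'from', so we can spill locally for free -
 * this is where an unlazy_fpu()-style call would hook in. */
static void push_task(struct task *t, struct cpu *from)
{
    if (from->fpu_owner == t) {
        memcpy(t->fpu_state, from->fpu_regs, 64);
        from->fpu_owner = NULL;
    }
    /* ...then enqueue t on the destination cpu; its next switch_in()
     * reloads the state from t->fpu_state. */
}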
Re: [PATCH 0/4] Really lazy fpu
(Cc:-ed various performance/optimization folks)

* Avi Kivity <a...@redhat.com> wrote:

> Right, this optimization isn't free. I think the tradeoff is
> favourable since task migrations are much less frequent than context
> switches within the same cpu; can the scheduler experts comment?

This cannot be stated categorically without precise measurements of
known-good, known-bad, average-FPU-usage and average-CPU-usage
scenarios. All these workloads have different characteristics.

I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
various lmbench components, X benchmarks, tiobench - you name it.
Combined with the fact that most micro-benchmarks won't be using the
FPU, while in the long run most processes will be using the FPU due to
SIMD instructions, even a positive result might be skewed in practice.

This has to be measured carefully IMO - and I haven't seen a _single_
performance measurement in the submission mail. This is really
essential. So this does not look like a patch-set we could apply without
gathering a _ton_ of hard data about advantages and disadvantages.

> We can also mitigate some of the IPIs if we know that we're migrating
> on the cpu we're migrating from (i.e. we're pushing tasks to another
> cpu, not pulling them from their cpu). Is that a common case, and if
> so, where can I hook a call to unlazy_fpu() (or its new equivalent)?

When the system goes from idle to less idle then most of the 'fast'
migrations happen on a 'push' model - on a busy CPU we wake up a new
task and push it out to a known-idle CPU. At that point we can indeed
unlazy the FPU with probably little cost.

But on busy servers where most wakeups are IRQ based, the chance of
being on the right CPU is 1/nr_cpus - i.e. decreasing with every new
generation of CPUs.

If there's some sucky corner case, in theory we could approach it
statistically and measure the ratio of fast vs. slow migration vs. local
context switches - but that looks a bit complex.

Dunno.

	Ingo
Re: [PATCH 0/4] Really lazy fpu
On Wed, Jun 16, 2010 at 10:39:41AM +0200, Ingo Molnar wrote:
> This has to be measured carefully IMO - and I haven't seen a _single_
> performance measurement in the submission mail. This is really
> essential.

It can be nice to code an absolute worst-case microbenchmark too.

Task migration can actually be very important, to the point of being
almost a fastpath in some workloads where threads are oversubscribed to
CPUs and blocking on some contended resource (IO or mutex or whatever).
I suspect the main issue in that case is the actual context switching
and contention, but it would be nice to see just how much slower it
could get.
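A sketch of what such a worst-case microbenchmark might look like
(hypothetical and untuned, user-space only, run under time(1)): two
fpu-touching threads ping-pong over pipes while forcing a migration
every round, so under the proposed scheme nearly every wakeup would find
the fpu state live on the wrong cpu.

/* Worst-case sketch: migrate + touch fpu + wake partner, each round.
 * Build with: cc -O2 -pthread bounce.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static int ping[2], pong[2];

static void pin(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
    volatile double x = 1.0;
    char c;
    for (int i = 0; i < 100000; i++) {
        pin(i & 1);               /* force a migration each round */
        x *= 1.000001;            /* dirty the fpu state */
        write(pong[1], "x", 1);   /* wake the partner */
        read(ping[0], &c, 1);     /* block until woken back */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    char c;
    pipe(ping);
    pipe(pong);
    pthread_create(&t, NULL, worker, NULL);
    for (int i = 0; i < 100000; i++) {
        pin((i & 1) ^ 1);         /* always sit on the other cpu */
        read(pong[0], &c, 1);
        volatile double y = 2.0;  /* this thread dirties the fpu too */
        y /= 1.000001;
        write(ping[1], "y", 1);
    }
    pthread_join(t, NULL);
    printf("done\n");
    return 0;
}

Comparing wall-clock time with and without the patch set on this kind of
loop would bound the damage in the pathological case.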
Re: [PATCH 0/4] Really lazy fpu
Ingo Molnar, Wed 16 Jun 2010 10:39:41 +0200, wrote:
> in the long run most processes will be using the FPU due to SIMD
> instructions.

I believe glibc already uses SIMD instructions for e.g. memcpy and
friends, i.e. basically all applications...

Samuel
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 11:39 AM, Ingo Molnar wrote:
> This cannot be stated categorically without precise measurements of
> known-good, known-bad, average-FPU-usage and average-CPU-usage
> scenarios. All these workloads have different characteristics.
>
> I can imagine bad effects across all sorts of workloads: tcpbench,
> AIM7, various lmbench components, X benchmarks, tiobench - you name
> it. [...] This has to be measured carefully IMO - and I haven't seen
> a _single_ performance measurement in the submission mail. This is
> really essential.

I have really no idea what to measure. Which would you most like to see?

> So this does not look like a patch-set we could apply without
> gathering a _ton_ of hard data about advantages and disadvantages.

I agree (not to mention that I'm not really close to having a patchset
ready to apply). Note that some of the advantages will not show up in
throughput but in latency (making kernel_fpu_begin() preemptible, and
reducing context switch time for event threads).

> When the system goes from idle to less idle then most of the 'fast'
> migrations happen on a 'push' model - on a busy CPU we wake up a new
> task and push it out to a known-idle CPU. At that point we can indeed
> unlazy the FPU with probably little cost.

Can you point me to the code which does this?

> But on busy servers where most wakeups are IRQ based, the chance of
> being on the right CPU is 1/nr_cpus - i.e. decreasing with every new
> generation of CPUs.

But don't we usually avoid pulls due to NUMA and cache considerations?

> If there's some sucky corner case, in theory we could approach it
> statistically and measure the ratio of fast vs. slow migration vs.
> local context switches - but that looks a bit complex.

I certainly wouldn't want to start with it.
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 12:10 PM, Nick Piggin wrote:
> It can be nice to code an absolute worst-case microbenchmark too.

Sure.

> Task migration can actually be very important, to the point of being
> almost a fastpath in some workloads where threads are oversubscribed
> to CPUs and blocking on some contended resource (IO or mutex or
> whatever). I suspect the main issue in that case is the actual context
> switching and contention, but it would be nice to see just how much
> slower it could get.

If it's just cpu oversubscription then the IPIs will be limited by the
rebalance rate and the time slice, so as you say it has to involve
contention and frequent wakeups as well as heavy cpu usage. That won't
be easy to code. Can you suggest an existing benchmark to run?
Re: [PATCH 0/4] Really lazy fpu
On 06/16/2010 12:01 PM, Samuel Thibault wrote:
> I believe glibc already uses SIMD instructions for e.g. memcpy and
> friends, i.e. basically all applications...

I think they ought to be using 'rep movs' on newer processors, but yes,
you're right.
Re: [PATCH 0/4] Really lazy fpu
On Sun, 13 Jun 2010 18:03:43 +0300, Avi Kivity said:
> Currently fpu management is only lazy in one direction. When we switch
> into a task, we may avoid loading the fpu state in the hope that the
> task will never use it. If we guess right we save an fpu load/save
> cycle; if not, a Device Not Available exception will remind us to load
> the fpu. However, in the other direction, fpu management is eager.
> When we switch out of an fpu-using task, we always save its fpu state.

Does anybody have numbers on how many clocks it takes a modern CPU
design to do an FPU state save or restore? I know it must have been
painful in the days before cache memory, having to make added trips out
to RAM for 128-bit registers. But what's the impact today? (Yes, I see
there's the potential for a painful IPI call - anything else?)

Do we have any numbers on how many saves/restores this will save us when
running the hypothetical standard Gnome desktop environment? How common
is the "we went all the way around to the original single FPU-using
task" case?
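One way to get a rough per-machine answer to the clocks question is to
time fxsave/fxrstor directly from user space. A crude sketch (rdtsc is
not serializing, and xsave-era state is larger than the 512-byte fxsave
area, so treat the numbers as a ballpark):

/* Time fxsave/fxrstor on x86 - rough cycle counts only. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    static uint8_t buf[512] __attribute__((aligned(16)));
    const int iters = 1000000;
    uint64_t t0, t1;

    __asm__ __volatile__("fxsave %0" : "=m"(buf));   /* warm up */

    t0 = rdtsc();
    for (int i = 0; i < iters; i++)
        __asm__ __volatile__("fxsave %0" : "=m"(buf));
    t1 = rdtsc();
    printf("fxsave:  ~%.1f cycles\n", (double)(t1 - t0) / iters);

    t0 = rdtsc();
    for (int i = 0; i < iters; i++)
        __asm__ __volatile__("fxrstor %0" : : "m"(buf));
    t1 = rdtsc();
    printf("fxrstor: ~%.1f cycles\n", (double)(t1 - t0) / iters);
    return 0;
}

This measures only the hot-cache instruction cost; a real context
switch also pays for the cache misses on the save area, which is part
of what the proposed patches try to avoid.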