Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
On 08/29/2017 01:46 PM, Yang Zhang wrote:
> Some latency-intensive workloads see an obvious performance drop when running inside a VM. The main reason is that the overhead is amplified when running inside a VM. The biggest cost I have seen is in the idle path. This patch introduces a new mechanism to poll for a while before entering the idle state. If a reschedule is needed during the poll, we don't have to go through the heavy overhead path.
>
> Here is the data we get when running the benchmark contextswitch to measure the latency (lower is better):
>
> 1. w/o patch:
>    2493.14 ns/ctxsw -- 200.3 %CPU
> 2. w/ patch:
>    halt_poll_threshold=1  -- 1485.96 ns/ctxsw -- 201.0 %CPU
>    halt_poll_threshold=2  -- 1391.26 ns/ctxsw -- 200.7 %CPU
>    halt_poll_threshold=3  -- 1488.55 ns/ctxsw -- 200.1 %CPU
>    halt_poll_threshold=50 -- 1159.14 ns/ctxsw -- 201.5 %CPU
> 3. kvm dynamic poll:
>    halt_poll_ns=1  -- 2296.11 ns/ctxsw -- 201.2 %CPU
>    halt_poll_ns=2  -- 2599.7 ns/ctxsw  -- 201.7 %CPU
>    halt_poll_ns=3  -- 2588.68 ns/ctxsw -- 211.6 %CPU
>    halt_poll_ns=50 -- 2423.20 ns/ctxsw -- 229.2 %CPU
> 4. idle=poll:
>    2050.1 ns/ctxsw -- 1003 %CPU
> 5. idle=mwait:
>    2188.06 ns/ctxsw -- 206.3 %CPU

Could you please try to create another metric for guest-initiated, host-aborted mwait?

For a quick benchmark, reserve 4 registers for a magic value and set them to the magic value before you enter MWAIT in the guest. Then allow native MWAIT execution on the host. If you see that the guest wants to enter with the 4 registers containing the magic contents and no events are pending, go directly into the vcpu block function on the host.

That way, any time a guest gets naturally aborted while in mwait, it will only reenter mwait when an event actually occurred. While the guest is normally running (and nobody else wants to run on the host), we just stay in guest context, but with a sleeping CPU.

Overall, that might give us even better performance, as it allows turbo boost and HT to work properly.
Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
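The magic-register trick above can be sketched as host-side decision logic. This is only an illustration of the idea, not KVM code: the magic value, the `vcpu_regs` layout, and `should_block_vcpu()` are all made-up names, and which four registers the guest reserves is an open choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HALT_POLL_MAGIC 0x4d57414954504f4cULL  /* arbitrary magic ("MWAITPOL") */

/* Minimal stand-in for the guest register state the host can inspect. */
struct vcpu_regs {
    uint64_t rbx, r12, r13, r14;   /* the four reserved "magic" registers */
};

/*
 * Host-side check: if the guest entered MWAIT with all four magic
 * registers set and no event is pending, put the vCPU to sleep (the
 * vcpu block path) instead of letting it spin or re-enter native MWAIT.
 */
static bool should_block_vcpu(const struct vcpu_regs *regs, bool event_pending)
{
    bool magic_set = regs->rbx == HALT_POLL_MAGIC &&
                     regs->r12 == HALT_POLL_MAGIC &&
                     regs->r13 == HALT_POLL_MAGIC &&
                     regs->r14 == HALT_POLL_MAGIC;

    return magic_set && !event_pending;
}
```

A guest that was preempted out of MWAIT for any other reason (registers clobbered, event pending) would simply be resumed, so only genuinely idle vCPUs take the blocking path.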
Re: [PATCH 2/2] x86/idle: use dynamic halt poll
On 17.07.17 11:26, Yang Zhang wrote:
> On 2017/7/14 17:37, Alexander Graf wrote:
>> On 13.07.17 13:49, Yang Zhang wrote:
>>> On 2017/7/4 22:13, Radim Krčmář wrote:
>>>> 2017-07-03 17:28+0800, Yang Zhang:
>>>>> The background is that we (Alibaba Cloud) get more and more complaints from our customers in both KVM and Xen compared to bare metal. After investigation, the root cause is known to us: a big cost in message-passing workloads (David showed it at KVM Forum 2015). A typical message-passing workload looks like this:
>>>>>
>>>>>    vcpu 0                       vcpu 1
>>>>>    1. send ipi                  2. doing hlt
>>>>>    3. go into idle              4. receive ipi and wake up from hlt
>>>>>    5. write APIC timer twice    6. write APIC timer twice to
>>>>>       to stop sched timer          reprogram sched timer
>>>> One write is enough to disable/re-enable the APIC timer -- why does Linux use two?
>>> One is to remove the timer and the other one is to reprogram the timer. Normally, only one write is needed, to remove the timer; but in some cases it will reprogram it.
>>>>>    7. doing hlt                 8. handle task and send ipi to vcpu 0
>>>>>    9. same as 4                 10. same as 3
>>>>>
>>>>> One transaction will introduce about 12 vmexits (2 hlt and 10 MSR writes). The cost of such vmexits degrades performance severely.
>>>> Yeah, sounds like too much ... I understood that there are: IPI from 1 to 2, 4 * APIC timer, IPI from 2 to 1, which adds up to 6 MSR writes -- what are the other 4?
>>> In the worst case, each timer will touch the APIC timer twice, so it will add an additional 4 MSR writes. But this is not always true.
>>>>> The Linux kernel already provides idle=poll to mitigate the trend. But it only eliminates the IPI and hlt vmexits; it has nothing to do with starting/stopping the sched timer. A compromise would be to turn off the NOHZ kernel, but that is not the default config for new distributions. The same goes for halt-poll in KVM: it only solves the cost of schedule in/out on the host and cannot help such workloads much. The purpose of this patch is to improve the current idle=poll mechanism to
>>>> Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow down the sibling hyperthread.
>>>> MWAIT solves the IPI problem, but doesn't get rid of the timer one.
>>> Yes, i can try it. But MWAIT will not yield the CPU; it only helps the sibling hyperthread, as you mentioned.
>> If you implement proper MWAIT emulation that conditionally gets enabled or disabled depending on the same halt-poll dynamics that we already have for in-host HLT handling, it will also yield the CPU.
> It is hard to do. If we do not intercept the MWAIT instruction, there is no chance to wake up the CPU unless an interrupt arrives or a store hits the address armed by MONITOR, which is the same as idle=poll.

Yes, but you can reconfigure the VMCS/VMCB to trap on MWAIT or not trap on it. That's something that idle=poll does not give you at all - a guest vcpu will always use 100% CPU.

The only really tricky part is how to limit the effect of MONITOR on nested page table maintenance. But if we just set the MONITOR cache size to 4k, well-behaved guests should ideally always give us the one same page for wakeup - which we can then leave marked as trapping.

>> As for the timer - are you sure the problem is really the overhead of the timer configuration, not the latency that it takes to actually fire the guest timer?
> No, the main cost is introduced by vmexits, which include IPIs, timer programming and HLT. David detailed it at KVM Forum; search for "Message Passing Workloads in KVM" on Google and the first link gives the whole analysis of the problem.

During time-critical message passing you want to keep both vCPUs inside the guest, yes. That again is something that guest-exposed MWAIT would buy you.

The problem is that overcommitting CPU is very expensive with anything that does not set the guest idle at all. And not everyone can afford to throw more CPUs at problems :).

Alex
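The VMCS/VMCB reconfiguration Alex describes amounts to flipping the MWAIT intercept based on the same dynamics halt polling already tracks. A minimal sketch of that decision, with entirely hypothetical field and function names (not real KVM symbols):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU state; names are illustrative only. */
struct vcpu_mwait_state {
    bool     trap_mwait;    /* desired VMCS/VMCB MWAIT-exiting setting */
    uint64_t avg_wait_ns;   /* observed average idle wait before wakeup */
    uint64_t halt_poll_ns;  /* same threshold in-host halt polling uses */
};

/*
 * Mirror of the halt-poll heuristic: short waits favor native MWAIT
 * (no exit, the CPU naps in guest context); long waits favor trapping,
 * so the host can yield the pCPU to someone else.
 */
static bool update_mwait_intercept(struct vcpu_mwait_state *s)
{
    s->trap_mwait = s->avg_wait_ns > s->halt_poll_ns;
    return s->trap_mwait;
}
```

The real work would be writing the resulting bit into the VM execution controls on the next entry; the heuristic itself is this simple comparison.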
Re: [PATCH 2/2] x86/idle: use dynamic halt poll
On 13.07.17 13:49, Yang Zhang wrote:
> On 2017/7/4 22:13, Radim Krčmář wrote:
>> 2017-07-03 17:28+0800, Yang Zhang:
>>> The background is that we (Alibaba Cloud) get more and more complaints from our customers in both KVM and Xen compared to bare metal. After investigation, the root cause is known to us: a big cost in message-passing workloads (David showed it at KVM Forum 2015). A typical message-passing workload looks like this:
>>>
>>>    vcpu 0                       vcpu 1
>>>    1. send ipi                  2. doing hlt
>>>    3. go into idle              4. receive ipi and wake up from hlt
>>>    5. write APIC timer twice    6. write APIC timer twice to
>>>       to stop sched timer          reprogram sched timer
>> One write is enough to disable/re-enable the APIC timer -- why does Linux use two?
> One is to remove the timer and the other one is to reprogram the timer. Normally, only one write is needed, to remove the timer; but in some cases it will reprogram it.
>>>    7. doing hlt                 8. handle task and send ipi to vcpu 0
>>>    9. same as 4                 10. same as 3
>>>
>>> One transaction will introduce about 12 vmexits (2 hlt and 10 MSR writes). The cost of such vmexits degrades performance severely.
>> Yeah, sounds like too much ... I understood that there are: IPI from 1 to 2, 4 * APIC timer, IPI from 2 to 1, which adds up to 6 MSR writes -- what are the other 4?
> In the worst case, each timer will touch the APIC timer twice, so it will add an additional 4 MSR writes. But this is not always true.
>>> The Linux kernel already provides idle=poll to mitigate the trend. But it only eliminates the IPI and hlt vmexits; it has nothing to do with starting/stopping the sched timer. A compromise would be to turn off the NOHZ kernel, but that is not the default config for new distributions. The same goes for halt-poll in KVM: it only solves the cost of schedule in/out on the host and cannot help such workloads much. The purpose of this patch is to improve the current idle=poll mechanism to
>> Please aim to allow MWAIT instead of idle=poll -- MWAIT doesn't slow down the sibling hyperthread. MWAIT solves the IPI problem, but doesn't get rid of the timer one.
> Yes, i can try it.
> But MWAIT will not yield the CPU; it only helps the sibling hyperthread, as you mentioned.

If you implement proper MWAIT emulation that conditionally gets enabled or disabled depending on the same halt-poll dynamics that we already have for in-host HLT handling, it will also yield the CPU.

As for the timer - are you sure the problem is really the overhead of the timer configuration, not the latency that it takes to actually fire the guest timer?

One major problem I see is that we configure the host hrtimer to fire at the point in time when the guest wants to see a timer event. But in a virtual environment, the point in time when we have to start switching to the VM really should be a bit *before* the guest wants to be woken up, as it takes quite some time to switch back into the VM context.

Alex
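The hrtimer observation above reduces to a small arithmetic adjustment: fire the host timer early by the measured VM-entry latency so the world switch completes roughly when the guest expected its timer. A sketch under that assumption; `entry_latency_ns` would come from measurement, and the names are illustrative rather than kernel symbols:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Program the host timer a bit before the guest's deadline, so the
 * switch back into VM context completes at (approximately) the time
 * the guest asked to be woken up.
 */
static uint64_t host_timer_expiry(uint64_t guest_deadline_ns,
                                  uint64_t entry_latency_ns)
{
    if (entry_latency_ns >= guest_deadline_ns)
        return 0;                        /* deadline already too close: fire now */
    return guest_deadline_ns - entry_latency_ns;
}
```

For example, with a 3 µs entry latency, a guest deadline 1 ms out would be armed at 997 µs on the host clock.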
Re: [PATCH v6] kvm: better MWAIT emulation for guests
On 21.04.17 12:02, Paolo Bonzini wrote:
> On 12/04/2017 18:29, Michael S. Tsirkin wrote:
>> I don't really agree we do not need the PV flag. mwait on kvm is different from mwait on bare metal in that you are heavily penalized by the scheduler for polling unless you configure the host just so. HLT lets you give up the host CPU if you know you won't need it for a long time. So while many people can get by with the monitor cpuid (those that isolate host CPUs), and it's a valuable option to have, I think a PV flag is also a valuable option and can be set for more configurations. The guest has an idle driver calling mwait on short waits and halt on longer ones. I'm in fact testing an idle driver using such a PV flag and will post when ready (after vacation, ~3 weeks from now probably).
> For now I think I'm removing the PV flag, making this just an optimization of commit 87c00572ba05aa8c ("kvm: x86: emulate monitor and mwait instructions as nop"). We can add it for 4.13 together with the idle driver.

I think that's a perfectly reasonable approach, yes. We can always add the PV flag with the driver.

Thanks a lot!

Alex
Re: [PATCH v6] kvm: better MWAIT emulation for guests
> Am 11.04.2017 um 19:10 schrieb Jim Mattson:
>
> This might be more useful if it could be dynamically toggled on and off, depending on system load.

What would trapping mwait (currently) buy you? As it stands today, before this patch, mwait is simply implemented as a nop, so enabling the trap just means you're wasting as much CPU time, but you never send the pCPU idle. With this patch, the CPU at least has the chance to go idle.

Keep in mind that this patch does *not* advertise the mwait CPUID feature bit to the guest. What you're referring to, I guess, is actual mwait emulation. That is indeed more useful, but it's a bigger patch than this one and needs some more thought on how to properly cache the monitor'ed pages.

Alex
Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests
On 04/04/2017 03:13 PM, Radim Krčmář wrote:
> 2017-04-04 14:51+0200, Alexander Graf:
>> On 04/04/2017 02:39 PM, Radim Krčmář wrote:
>>> 2017-04-03 12:04+0200, Alexander Graf:
>>>> So coming back to the original patch, is there anything that should keep us from exposing MWAIT straight into the guest at all times?
>>> Just minor issues:
>>>
>>>  * OS X on Core 2 fails for an unknown reason if we disable the instruction trapping, which is an argument against doing it by default
>> So for that we should try and see if changing the exposed CPUID MWAIT leaf helps. Currently we return 0/0, which is pretty bogus and might be the reason OS X fails.
> We have tried to pass the host's CPUID MWAIT leaf and it still failed: https://www.spinics.net/lists/kvm/msg146686.html
> I wouldn't mind breaking that particular combination of OS X and hardware, but I'm worried to do it because we don't understand why it broke, so there could be more ...
>>>  * idling guests would consume host CPU, which is a significant change in behavior and shouldn't be done without userspace's involvement
>> That's the same as today, as idling guests with MWAIT would also today end up in a NOP emulated loop. Please bear in mind that I do not advocate exposing the MWAIT CPUID flag. This is only for the instruction trap.
> Ah, makes sense. I think the best compromise is to add a capability for the MWAIT VM-exit controls and let userspace expose MWAIT if it wishes to. Will send a patch.
>> Please see my patch to force enable CPUID bits ;).
> Nice. MWAIT could also use setting of arbitrary values for its leaf, but a generic interface for that would probably look clunky on the command line ...

I think we should have an interface similar to smbios for that eventually. Something where you can explicitly set arbitrary CPUID leaf information using leaf-specific syntax. There are more leaves where it would make sense - cache topology, for example.
Alex
Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests
On 04/04/2017 02:39 PM, Radim Krčmář wrote:
> 2017-04-03 12:04+0200, Alexander Graf:
>> On 03/29/2017 02:11 PM, Radim Krčmář wrote:
>>> 2017-03-28 13:35-0700, Jim Mattson:
>>>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrc...@redhat.com> wrote:
>>>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem: unless explicitly provided with the kernel command line argument "idlehalt=0", they implicitly assume MONITOR and MWAIT availability without checking CPUID. We currently emulate that as a NOP, but on VMX we can do better: let the guest stop the CPU until a timer, IPI or memory change. The CPU will be busy, but that isn't any worse than a NOP emulation. Note that mwait within guests is not the same as on real hardware, because halt causes an exit while mwait doesn't. For this reason it might not be a good idea to use the regular MWAIT flag in CPUID to signal this capability. Add a flag in the hypervisor leaf instead.
>>>>>> So imagine we had proper MWAIT emulation capabilities based on page faults. In that case, we could do something as fancy as:
>>>>>>
>>>>>>  - Treat MWAIT as pass-through by default
>>>>>>  - Have a per-vcpu monitor timer 10 times a second in the background that checks which instruction we're in
>>>>>>  - If we're in mwait for the last - say - 1 second, switch to emulated MWAIT; if $IP was in non-mwait within that time, reset the counter.
>>>>> Or we could reuse external interrupts for sampling. Exits triggered by them would check the current instruction (probably best limited to the timer tick), and a sufficient ratio (> 0?) of other exits would imply that MWAIT is not used.
>>>>>> Or instead maybe just reuse the adaptive hlt logic?
>>> Emulated MWAIT is very similar to emulated HLT, so reusing the logic makes sense. We would just add new wakeup methods.
>>>>>> Either way, with that we should be able to get super low latency IPIs running while still maintaining some sanity on systems which don't have dedicated CPUs for workloads.
>>>>>> And we wouldn't need guest modifications, which is a great plus. So older guests (and Windows?) could benefit from mwait as well.
>>> There is no need for guest modifications -- it could be exposed as a standard MWAIT feature to the guest, with responsibility for the guest/host impact on the user. I think that page-fault based MWAIT would require paravirt if it should be enabled by default, because of performance concerns: enabling write protection on a page needs a VM exit on all other VCPUs when beginning monitoring (to reload page permissions and prevent missed writes). We'd want to keep trapping writes to the page all the time, because toggling is slow, but this could regress performance for an OS that has other data accessed by other VCPUs in that page. No current interface can tell the guest that it should reserve the whole page instead of what CPUID[5] says, and that writes to the monitored page are not "cheap" but can trigger a VM exit ...
>>>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC, VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX when running Mac OS X guests. Per Intel's SDM volume 3, section 8.10.5: "To avoid false wake-ups; use the largest monitor line size to pad the data structure used to monitor writes. Software must make sure that beyond the data structure, no unrelated data variable exists in the triggering area for MWAIT. A pad may be needed to avoid this situation." Unfortunately, most operating systems do not follow this advice.
>>> Right, EBX provides what we need to expose that the whole page is monitored, thanks!
>> So coming back to the original patch, is there anything that should keep us from exposing MWAIT straight into the guest at all times?
> Just minor issues:
>
>  * OS X on Core 2 fails for an unknown reason if we disable the instruction trapping, which is an argument against doing it by default

So for that we should try and see if changing the exposed CPUID MWAIT leaf helps.
Currently we return 0/0, which is pretty bogus and might be the reason OS X fails.

>  * idling guests would consume host CPU, which is a significant change in behavior and shouldn't be done without userspace's involvement

That's the same as today, as idling guests with MWAIT would also today end up in a NOP emulated loop. Please bear in mind that I do not advocate exposing the MWAIT CPUID flag. This is only for the instruction trap.

> I think the best compromise is to add a capability for the MWAIT VM-exit controls and let userspace expose MWAIT if it wishes to. Will send a patch.

Please see my patch to force enable CPUID bits ;).

Alex
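The sampling heuristic proposed earlier in the thread (a per-vcpu monitor timer 10 times a second; switch to emulated MWAIT after roughly a second of continuous MWAIT, reset on any non-MWAIT sample) can be sketched as a small state machine. All names and constants below are illustrative, taken from the numbers in the proposal rather than from any real implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative constants from the proposal: sample 10x/sec, flip after ~1s. */
#define SAMPLE_PERIOD_MS   100
#define MWAIT_SWITCH_MS   1000

struct mwait_sampler {
    unsigned int consecutive_mwait_samples;
    bool         emulate_mwait;     /* false = pass-through MWAIT */
};

/*
 * Called from the periodic per-vCPU sampling timer with whether $IP was
 * inside MWAIT. Ten consecutive MWAIT samples (~1 second) switch the
 * vCPU to emulated MWAIT; any non-MWAIT sample resets the counter and
 * returns to pass-through.
 */
static void mwait_sample(struct mwait_sampler *s, bool ip_in_mwait)
{
    if (!ip_in_mwait) {
        s->consecutive_mwait_samples = 0;
        s->emulate_mwait = false;
        return;
    }
    if (++s->consecutive_mwait_samples >= MWAIT_SWITCH_MS / SAMPLE_PERIOD_MS)
        s->emulate_mwait = true;
}
```

Radim's alternative (sampling on external-interrupt exits instead of a dedicated timer) would feed the same state machine from a different event source.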
Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests
On 03/29/2017 02:11 PM, Radim Krčmář wrote:
> 2017-03-28 13:35-0700, Jim Mattson:
>> On Tue, Mar 28, 2017 at 7:28 AM, Radim Krčmář <rkrc...@redhat.com> wrote:
>>> 2017-03-27 15:34+0200, Alexander Graf:
>>>> On 15/03/2017 22:22, Michael S. Tsirkin wrote:
>>>>> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem: unless explicitly provided with the kernel command line argument "idlehalt=0", they implicitly assume MONITOR and MWAIT availability without checking CPUID. We currently emulate that as a NOP, but on VMX we can do better: let the guest stop the CPU until a timer, IPI or memory change. The CPU will be busy, but that isn't any worse than a NOP emulation. Note that mwait within guests is not the same as on real hardware, because halt causes an exit while mwait doesn't. For this reason it might not be a good idea to use the regular MWAIT flag in CPUID to signal this capability. Add a flag in the hypervisor leaf instead.
>>>> So imagine we had proper MWAIT emulation capabilities based on page faults. In that case, we could do something as fancy as:
>>>>
>>>>  - Treat MWAIT as pass-through by default
>>>>  - Have a per-vcpu monitor timer 10 times a second in the background that checks which instruction we're in
>>>>  - If we're in mwait for the last - say - 1 second, switch to emulated MWAIT; if $IP was in non-mwait within that time, reset the counter.
>>> Or we could reuse external interrupts for sampling. Exits triggered by them would check the current instruction (probably best limited to the timer tick), and a sufficient ratio (> 0?) of other exits would imply that MWAIT is not used.
>>>> Or instead maybe just reuse the adaptive hlt logic?
> Emulated MWAIT is very similar to emulated HLT, so reusing the logic makes sense. We would just add new wakeup methods.
>>>> Either way, with that we should be able to get super low latency IPIs running while still maintaining some sanity on systems which don't have dedicated CPUs for workloads. And we wouldn't need guest modifications, which is a great plus.
>>>> So older guests (and Windows?) could benefit from mwait as well.
> There is no need for guest modifications -- it could be exposed as a standard MWAIT feature to the guest, with responsibility for the guest/host impact on the user. I think that page-fault based MWAIT would require paravirt if it should be enabled by default, because of performance concerns: enabling write protection on a page needs a VM exit on all other VCPUs when beginning monitoring (to reload page permissions and prevent missed writes). We'd want to keep trapping writes to the page all the time, because toggling is slow, but this could regress performance for an OS that has other data accessed by other VCPUs in that page. No current interface can tell the guest that it should reserve the whole page instead of what CPUID[5] says, and that writes to the monitored page are not "cheap" but can trigger a VM exit ...
>> CPUID.05H:EBX is supposed to address the false sharing issue. IIRC, VMware Fusion reports 64 in CPUID.05H:EAX and 4096 in CPUID.05H:EBX when running Mac OS X guests. Per Intel's SDM volume 3, section 8.10.5: "To avoid false wake-ups; use the largest monitor line size to pad the data structure used to monitor writes. Software must make sure that beyond the data structure, no unrelated data variable exists in the triggering area for MWAIT. A pad may be needed to avoid this situation." Unfortunately, most operating systems do not follow this advice.
> Right, EBX provides what we need to expose that the whole page is monitored, thanks!

So coming back to the original patch, is there anything that should keep us from exposing MWAIT straight into the guest at all times?

Alex
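A guest that actually follows the SDM advice quoted above would pad its monitored wakeup flag to the largest monitor line size from CPUID.05H:EBX. A sketch of such a structure, assuming the 4096-byte value the thread attributes to VMware Fusion; the type and macro names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Largest monitor line size as a well-behaved guest would read it from
 * CPUID.05H:EBX. 4096 here is the value discussed in the thread, not
 * something any hypervisor guarantees.
 */
#define MONITOR_LINE_SIZE_MAX 4096

/*
 * Pad the wakeup flag to a full monitor line so no unrelated variable
 * shares the triggering area -- the false-sharing advice from Intel
 * SDM vol. 3, section 8.10.5.
 */
struct mwait_wakeup_flag {
    volatile uint32_t need_wakeup;
    uint8_t pad[MONITOR_LINE_SIZE_MAX - sizeof(uint32_t)];
} __attribute__((aligned(MONITOR_LINE_SIZE_MAX)));
```

With the flag alone on its page, the host can leave exactly that one page write-protected and know that any trapped write is a genuine wakeup, which is what makes the page-fault based emulation tractable.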
Re: [PATCH v5 untested] kvm: better MWAIT emulation for guests
On 15/03/2017 22:22, Michael S. Tsirkin wrote:
> Guests running Mac OS 5, 6, and 7 (Leopard through Lion) have a problem: unless explicitly provided with the kernel command line argument "idlehalt=0", they implicitly assume MONITOR and MWAIT availability without checking CPUID. We currently emulate that as a NOP, but on VMX we can do better: let the guest stop the CPU until a timer, IPI or memory change. The CPU will be busy, but that isn't any worse than a NOP emulation. Note that mwait within guests is not the same as on real hardware, because halt causes an exit while mwait doesn't. For this reason it might not be a good idea to use the regular MWAIT flag in CPUID to signal this capability. Add a flag in the hypervisor leaf instead.

So imagine we had proper MWAIT emulation capabilities based on page faults. In that case, we could do something as fancy as:

 - Treat MWAIT as pass-through by default
 - Have a per-vcpu monitor timer 10 times a second in the background that checks which instruction we're in
 - If we're in mwait for the last - say - 1 second, switch to emulated MWAIT; if $IP was in non-mwait within that time, reset the counter.

Or instead maybe just reuse the adaptive hlt logic?

Either way, with that we should be able to get super low latency IPIs running while still maintaining some sanity on systems which don't have dedicated CPUs for workloads. And we wouldn't need guest modifications, which is a great plus. So older guests (and Windows?) could benefit from mwait as well.

Alex
Re: [RFC2 nowrap: PATCH v7 00/18] ILP32 for ARM64
> On 17 Aug 2016, at 13:46, Yury Norov wrote:
>
> This series enables aarch64 with ilp32 mode, and as supporting work, introduces the ARCH_32BIT_OFF_T configuration option that is enabled for existing 32-bit architectures but disabled for new arches (so 64-bit off_t is used by new userspace).
>
> This version is based on kernel v4.8-rc2. It works with glibc-2.23, and is tested with LTP.
>
> This is RFC because there is still no solid understanding of what type of register top-halves delousing we prefer. In this patchset, w0-w7 are cleared for each syscall in the assembler entry. The alternative approach is to introduce compat wrappers, which is a little faster for natively routed syscalls (~2.6% for a syscall with no payload) but much more complicated.

So you're saying there are 2 options:

1) easy to get right, slightly slower, same ABI to user space as 2
2) harder to get right, minor performance benefit

That's an obvious pick, no? Mark it non-RFC and stay with the clearing in the assembler entry. If anyone cares about those last few percent, they can still push the harder path upstream later if they want to, but at least we'll have the ABI stable, so that you can start using and developing for ilp32 on aarch64.

Alex
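The "clearing in the assembler entry" option amounts to zero-extending each syscall argument register to 32 bits before the generic syscall code sees it, so stale upper halves from the ILP32 task cannot leak into 64-bit kernel code. A C model of what the aarch64 entry stub would do for w0-w7; the function name is illustrative, not a kernel symbol:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the ILP32 syscall entry fixup: keep only the low 32 bits
 * (w0-w7) of each argument register x0-x7, discarding whatever the
 * 32-bit task left in the upper halves.
 */
static void ilp32_clear_top_halves(uint64_t regs[8])
{
    for (int i = 0; i < 8; i++)
        regs[i] = (uint32_t)regs[i];   /* zero-extend w<i> into x<i> */
}
```

The compat-wrapper alternative would instead do the equivalent truncation per syscall with typed 32-bit arguments, which is why it is faster for natively routed syscalls but far more invasive.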