Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
On 2017/9/14 17:19, Wanpeng Li wrote: 2017-09-14 16:36 GMT+08:00 Quan Xu <quan@gmail.com>: on 2017/9/13 19:56, Yang Zhang wrote: On 2017/8/29 22:56, Michael S. Tsirkin wrote: On Tue, Aug 29, 2017 at 11:46:34AM +, Yang Zhang wrote:

Some latency-intensive workloads see an obvious performance drop when running inside a VM.

But are we trading a lot of CPU for a bit of lower latency?

The main reason is that the overhead is amplified when running inside a VM. The largest cost I have seen is in the idle path. This patch introduces a new mechanism to poll for a while before entering the idle state. If a reschedule is needed during the poll, we do not need to go through the heavy-overhead path.

Isn't it the job of an idle driver to find the best way to halt the CPU? It looks like just by adding a cstate we can make it halt at higher latencies only. And at lower latencies, if it's doing a good job we can hopefully use mwait to stop the CPU. In fact I have been experimenting with exactly that. Some initial results are encouraging but I could use help with testing and especially tuning. If you can help pls let me know!

Quan, can you help to test it and give results? Thanks.

Hi, MST

I have tested the patch "intel_idle: add pv cstates when running on kvm" on a recent host that allows guests to execute mwait without an exit. I have also tested our patch "[RFC PATCH v2 0/7] x86/idle: add halt poll support", upstream Linux, and idle=poll. The following are the results (which look better than ever before, as I ran the test cases on a more powerful machine).

For __netperf__, the first column is the transaction rate per second, the second column is CPU utilization.

1. upstream linux

   [Wanpeng Li] This "upstream linux" means that the kvm adaptive halt-polling is disabled, after confirming with Xu Quan.

   [Quan Xu] upstream linux -- the source code is just from upstream Linux, without our patch or MST's patch.. yes, we disable kvm halt-polling (halt_poll_ns=0) for _all_ of the following cases.

   Quan

   Regards,
   Wanpeng Li

   28371.7 bits/s -- 76.6 %CPU

2. idle=poll
   34372 bit/s -- 999.3 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different values of the parameter 'halt_poll_threshold':
   28362.7 bits/s -- 74.7 %CPU   (halt_poll_threshold=1)
   32949.5 bits/s -- 82.5 %CPU   (halt_poll_threshold=2)
   39717.9 bits/s -- 104.1 %CPU  (halt_poll_threshold=3)
   40137.9 bits/s -- 104.4 %CPU  (halt_poll_threshold=4)
   40079.8 bits/s -- 105.6 %CPU  (halt_poll_threshold=5)

4. "intel_idle: add pv cstates when running on kvm"
   33041.8 bits/s -- 999.4 %CPU

For __ctxsw__, the first column is the time per process context switch, the second column is CPU utilization.

1. upstream linux
   3624.19 ns/ctxsw -- 191.9 %CPU

2. idle=poll
   3419.66 ns/ctxsw -- 999.2 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different values of the parameter 'halt_poll_threshold':
   1123.40 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=1)
   1127.38 ns/ctxsw -- 199.7 %CPU  (halt_poll_threshold=2)
   1113.58 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=3)
   1117.12 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=4)
   1121.62 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=5)

4. "intel_idle: add pv cstates when running on kvm"
   3427.59 ns/ctxsw -- 999.4 %CPU

-Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/9/1 13:57, Quan Xu wrote: on 2017/8/29 20:45, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote:

Add poll in do_idle. For a UP VM, if there is a running task it will not go into the idle path, so we only enable poll in SMP VMs.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>

Broken SoB chain.

Peter, I can't follow 'Broken SoB chain'.. could you explain more about it?

Peter, ping..

Quan

-Quan

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6c23e30..b374744 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }

And not a word on why we need a new arch hook. What's wrong with arch_cpu_idle_enter() for instance?
Re: [RFC PATCH v2 0/7] x86/idle: add halt poll support
on 2017/9/13 19:56, Yang Zhang wrote: On 2017/8/29 22:56, Michael S. Tsirkin wrote: On Tue, Aug 29, 2017 at 11:46:34AM +, Yang Zhang wrote:

Some latency-intensive workloads see an obvious performance drop when running inside a VM.

But are we trading a lot of CPU for a bit of lower latency?

The main reason is that the overhead is amplified when running inside a VM. The largest cost I have seen is in the idle path. This patch introduces a new mechanism to poll for a while before entering the idle state. If a reschedule is needed during the poll, we do not need to go through the heavy-overhead path.

Isn't it the job of an idle driver to find the best way to halt the CPU? It looks like just by adding a cstate we can make it halt at higher latencies only. And at lower latencies, if it's doing a good job we can hopefully use mwait to stop the CPU. In fact I have been experimenting with exactly that. Some initial results are encouraging but I could use help with testing and especially tuning. If you can help pls let me know!

Quan, can you help to test it and give results? Thanks.

Hi, MST

I have tested the patch "intel_idle: add pv cstates when running on kvm" on a recent host that allows guests to execute mwait without an exit. I have also tested our patch "[RFC PATCH v2 0/7] x86/idle: add halt poll support", upstream Linux, and idle=poll. The following are the results (which look better than ever before, as I ran the test cases on a more powerful machine).

For __netperf__, the first column is the transaction rate per second, the second column is CPU utilization.

1. upstream linux
   28371.7 bits/s -- 76.6 %CPU

2. idle=poll
   34372 bit/s -- 999.3 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different values of the parameter 'halt_poll_threshold':
   28362.7 bits/s -- 74.7 %CPU   (halt_poll_threshold=1)
   32949.5 bits/s -- 82.5 %CPU   (halt_poll_threshold=2)
   39717.9 bits/s -- 104.1 %CPU  (halt_poll_threshold=3)
   40137.9 bits/s -- 104.4 %CPU  (halt_poll_threshold=4)
   40079.8 bits/s -- 105.6 %CPU  (halt_poll_threshold=5)

4. "intel_idle: add pv cstates when running on kvm"
   33041.8 bits/s -- 999.4 %CPU

For __ctxsw__, the first column is the time per process context switch, the second column is CPU utilization.

1. upstream linux
   3624.19 ns/ctxsw -- 191.9 %CPU

2. idle=poll
   3419.66 ns/ctxsw -- 999.2 %CPU

3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different values of the parameter 'halt_poll_threshold':
   1123.40 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=1)
   1127.38 ns/ctxsw -- 199.7 %CPU  (halt_poll_threshold=2)
   1113.58 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=3)
   1117.12 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=4)
   1121.62 ns/ctxsw -- 199.6 %CPU  (halt_poll_threshold=5)

4. "intel_idle: add pv cstates when running on kvm"
   3427.59 ns/ctxsw -- 999.4 %CPU

-Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
On 2017/9/1 14:49, Quan Xu wrote: on 2017/8/29 22:39, Borislav Petkov wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote:

Add poll in do_idle. For a UP VM, if there is a running task it will not go into the idle path, so we only enable poll in SMP VMs.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Kyle Huey <m...@kylehuey.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Len Brown <len.br...@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/process.c | 7 +++
 kernel/sched/idle.c       | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3ca1980..def4113 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -332,6 +332,13 @@ void arch_cpu_idle(void)
 	x86_idle();
 }
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void arch_cpu_idle_poll(void)
+{
+	paravirt_idle_poll();
+}
+#endif

So this will get called on any system which has CONFIG_PARAVIRT enabled *even* if they're not running any guests. Huh?

Borislav, on second thought, it would be better to run this under an IF condition, to make sure the system is actually running guests.

Quan

Borislav, yes, this will get called on any system which has CONFIG_PARAVIRT enabled. But if it is not running any guests, the callback is paravirt_nop(), which IIUC is similar to the other paravirt_* defaults, such as paravirt_pgd_free()..

- Quan
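For readers skimming this thread, the behaviour described above can be sketched roughly as follows. This is an illustration of the idea only, not the actual patch; the pv_poll_nop and pv_idle_poll_cb names are invented for the example.

/*
 * Illustrative sketch, not the actual patch: with CONFIG_PARAVIRT the
 * idle-poll hook is always called, but unless a hypervisor guest driver
 * installs a real poll routine the callback stays a nop, so bare metal
 * only pays for one indirect call.
 */
static void pv_poll_nop(void)
{
	/* default: do nothing, mirrors the paravirt_nop() convention */
}

static void (*pv_idle_poll_cb)(void) = pv_poll_nop;

void arch_cpu_idle_poll(void)
{
	pv_idle_poll_cb();	/* stays pv_poll_nop() when no guest driver hooked it */
}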
Re: [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle
On 2017/8/29 20:46, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:41AM +, Yang Zhang wrote:

In ttwu_do_wakeup, it will update avg_idle when waking up from idle. Here we just reuse this logic to update the poll time. It may be a little late to update the poll in ttwu_do_wakeup, but the test results show no obvious performance gap compared with updating the poll in the irq handler.

One problem is that idle_stamp is only used by the CFS scheduler. But that is OK since CFS is the default scheduling policy, and considering only it should be enough.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>

Same broken SoB chain, and not a useful word on why you need to adjust this crap to begin with.

What you want that poll duration to be related to is the cost of a VMEXIT/VMENTER cycle, not however long we happened to be idle. So no.

Peter, I think you are right.. IIUC, the time we happened to be idle may contain a chain of VMEXIT/VMENTER cycles, which would mainly (except for the last VMEXIT/VMENTER cycle) be just idle loops. Right?

As you mentioned, the poll duration should be related to the cost of __a__ VMEXIT/VMENTER cycle. However, it is very difficult to measure a VMEXIT/VMENTER cycle accurately from a kvm guest; we could find an approximate value by dropping the idle loops from the time we happened to be idle.. Does that make sense?

Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/8/29 22:39, Borislav Petkov wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote:

Add poll in do_idle. For a UP VM, if there is a running task it will not go into the idle path, so we only enable poll in SMP VMs.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Kyle Huey <m...@kylehuey.com>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Len Brown <len.br...@intel.com>
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/process.c | 7 +++
 kernel/sched/idle.c       | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3ca1980..def4113 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -332,6 +332,13 @@ void arch_cpu_idle(void)
 	x86_idle();
 }
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void arch_cpu_idle_poll(void)
+{
+	paravirt_idle_poll();
+}
+#endif

So this will get called on any system which has CONFIG_PARAVIRT enabled *even* if they're not running any guests. Huh?

Borislav, yes, this will get called on any system which has CONFIG_PARAVIRT enabled. But if it is not running any guests, the callback is paravirt_nop(), which IIUC is similar to the other paravirt_* defaults, such as paravirt_pgd_free()..

- Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/8/29 20:45, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote:

Add poll in do_idle. For a UP VM, if there is a running task it will not go into the idle path, so we only enable poll in SMP VMs.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>

Broken SoB chain.

Peter, I can't follow 'Broken SoB chain'.. could you explain more about it?

-Quan

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6c23e30..b374744 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
 }
 /* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
 void __weak arch_cpu_idle_prepare(void) { }
 void __weak arch_cpu_idle_enter(void) { }

And not a word on why we need a new arch hook. What's wrong with arch_cpu_idle_enter() for instance?
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-15 22:43, Rik van Riel wrote:

Can you explain why you believe that?

For example, a vcpu thread is running in kvm mode and has to stop under a critical condition. QEMU sends an IPI to cause a VM-exit to happen immediately, but this IPI does not make the vcpu return to QEMU. IIUC this vcpu thread will still continue to run in kvm mode when it is woken up on the target machine. With your patch, I don't see a chance to load the guest FPU or XSTATE until it returns to QEMU and enters kvm mode again. Then the FPU or XSTATE status is inconsistent for a small window; what's even worse is that the vcpu is running. Did I misunderstand?

Quan
Alibaba Cloud

Getting the guest FPU or XSTATE is done under the vcpu->mutex. This patch switches out guest and userspace FPU/XSTATE under the vcpu->mutex, and switches it back before releasing the vcpu->mutex. By the time a KVM_GET_FPU has obtained the vcpu->mutex, the guest FPU state will be in vcpu->arch.guest_fpu.state, where you expect it to be.

What am I missing?
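A simplified sketch of the flow Rik describes, moving the guest FPU load/put out to the KVM_RUN path under vcpu->mutex, might look like the following. The surrounding ioctl structure is condensed and the wrapper name is assumed; only kvm_load_guest_fpu/kvm_put_guest_fpu and vcpu_run come from the discussion.

/*
 * Simplified sketch, not the real kvm code: the guest FPU is loaded
 * once after the vcpu ioctl has taken vcpu->mutex and put back before
 * the mutex is released.  A later KVM_GET_FPU also runs under
 * vcpu->mutex, so it sees the guest state saved back in
 * vcpu->arch.guest_fpu.state.
 */
static int vcpu_run_with_guest_fpu(struct kvm_vcpu *vcpu)
{
	int r;

	/* caller (the KVM_RUN ioctl path) already holds vcpu->mutex */
	kvm_load_guest_fpu(vcpu);	/* guest FPU now lives in the registers;
					 * a context switch saves it into
					 * current->thread.fpu and restores it
					 * when the thread runs again */
	r = vcpu_run(vcpu);

	kvm_put_guest_fpu(vcpu);	/* guest FPU written back to
					 * vcpu->arch.guest_fpu.state */
	return r;
}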
Re: [PATCH RFC 0/7] kvm pvtimer
On 2017/12/14 19:56, Paolo Bonzini wrote: On 13/12/2017 17:28, Konrad Rzeszutek Wilk wrote:

1) VM idle path and network req/resp services:

Does this go away if you don't hit the idle path? Meaning if you loop without hitting HLT/MWAIT? I am assuming the issue you are facing is the latency - that is, the first time the guest comes out of HLT and responds to the packet, the latency is much higher than without? And the arming of the timer?

2) process context switches.

Is that related to 1)? That is, the 'schedule' call and the process going to sleep waiting for an interrupt or timer?

This all sounds like issues with low-CPU-usage workloads where you need low latency responses?

Even high CPU usage, as long as there is a small idle time. The cost of setting the TSC deadline timer twice is about 3000 cycles. However, I think Amazon's approach of not intercepting HLT/MWAIT/PAUSE can recover most of the performance and it's way less intrusive.

Paolo, could you share Amazon's patch or the LKML link? Thanks.

Quan

Thanks,

Paolo
Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten
On 2017/12/14 21:15, Gonglei (Arei) wrote:

-----Original Message-----
From: Liran Alon [mailto:liran.a...@oracle.com]
Sent: Thursday, December 14, 2017 9:02 PM
To: Gonglei (Arei); pbonz...@redhat.com; rkrc...@redhat.com
Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org; Huangweidong (C)
Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten

On 14/12/17 14:23, Gonglei wrote:

We hit a bug in our test while running PCMark 10 in a Windows 7 VM. The VM got stuck and the wallclock hung after several minutes of running PCMark 10 in it. It is quite easy to reproduce the bug with upstream KVM and QEMU.

We found that KVM cannot inject any RTC irq to the VM after it hangs; it fails to deliver the irq in ioapic_set_irq() because the RTC irq is still pending in ioapic->irr.

static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
			  int irq_level, bool line_status)
{
	...
	if (!irq_level) {
		ioapic->irr &= ~mask;
		ret = 1;
		goto out;
	}
	...
	if ((edge && old_irr == ioapic->irr) ||
	    (!edge && entry.fields.remote_irr)) {
		ret = 0;
		goto out;
	}

According to the RTC spec, after the RTC injects a high-level irq, the OS will read CMOS register C to clear the irq flag and pull down the irq electric pin. For QEMU, we emulate the read operation in cmos_ioport_read(), but the guest OS will issue a write operation first to tell which register will be read by the following access, and we use s->cmos_index to record that register.

But in our test, we found there is a possible situation in which a vCPU fails to read RTC_REG_C to clear the irq. This can happen when two vCPUs are writing/reading registers at the same time. For example, vcpu0 is trying to read RTC_REG_C, so it writes RTC_REG_C first, making s->cmos_index RTC_REG_C; but before it reads register C, vcpu1 is going to read RTC_YEAR and changes s->cmos_index to RTC_YEAR with its write. The next operation of vcpu0 therefore reads RTC_YEAR. In this case, we miss calling qemu_irq_lower(s->irq) to clear the irq. After this, KVM will never inject the RTC irq, and the Windows VM hangs.

If I understood correctly, this looks to me like a race-condition bug in the Windows guest kernel. On real hardware this race condition will also cause RTC_YEAR to be read instead of RTC_REG_C. The guest kernel should make sure that 2 CPUs do not attempt to read a CMOS register in parallel, as they can override each other's cmos_index. See for example how the Linux kernel makes sure to avoid this kind of issue in rtc_cmos_read() (arch/x86/kernel/rtc.c) by grabbing a cmos_lock.

Yes, I knew that.

Let's clear the IRR of the rtc when the corresponding EOI is received, to avoid the issue.

Can you elaborate a bit more on why it makes sense to put such a workaround in KVM code instead of declaring this a guest kernel bug?

I agree it's a Windows bug. The odd thing is that there is no such problem on the Xen platform.

Have you analyzed why it is not a problem on Xen?

Quan

Thanks,
-Gonglei

Regards,
-Liran

Suggested-by: Paolo Bonzini
Signed-off-by: Gonglei
---
Thanks to Paolo for providing a good solution. :)

 arch/x86/kvm/ioapic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index 4e822ad..5022d63 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -160,6 +160,7 @@ static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu)
 {
 	if (test_and_clear_bit(vcpu->vcpu_id,
 			       ioapic->rtc_status.dest_map.map)) {
+		ioapic->irr &= ~(1 << RTC_GSI);
 		--ioapic->rtc_status.pending_eoi;
 		rtc_status_pending_eoi_check_valid(ioapic);
 	}
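As a side note on the guest-side serialization Liran refers to, the idea behind grabbing a cmos_lock around the index/data access can be sketched as below. This paraphrases the locking discipline rather than quoting the actual arch/x86/kernel/rtc.c code; the helper name is invented, and 0x70/0x71 are the standard RTC index/data ports.

/*
 * Illustrative sketch of the guest-side discipline being referenced:
 * selecting the CMOS index and reading the data must happen under one
 * lock, so another CPU cannot retarget the index register in between.
 * Simplified names; not the actual kernel implementation.
 */
static DEFINE_SPINLOCK(cmos_lock);

static unsigned char cmos_read_serialized(unsigned char reg)
{
	unsigned long flags;
	unsigned char val;

	spin_lock_irqsave(&cmos_lock, flags);
	outb(reg, 0x70);	/* write the index (what QEMU tracks as s->cmos_index) */
	val = inb(0x71);	/* read the data before anyone can reselect the index */
	spin_unlock_irqrestore(&cmos_lock, flags);

	return val;
}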
Re: [PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest
On 2017/12/08 23:06, Konrad Rzeszutek Wilk wrote: On Fri, Dec 08, 2017 at 04:39:46PM +0800, Quan Xu wrote: From: Ben Luo <bn0...@gmail.com>

In general, a KVM guest programs the tsc-deadline timestamp into the MSR_IA32_TSC_DEADLINE MSR. This causes a VM-exit, and then KVM handles the timer for the guest. With this series, the tsc-deadline timestamp is mostly recorded in a shared page instead, with fewer VM-exits. We introduce a periodically working kthread to scan the shared page and synchronize the timer settings for the guest on a dedicated CPU.

That sounds like a race. Meaning the guest may ask for too small a window and this 'working thread to scan' may not get to it fast enough?

Yes, you are right. So ..

.. Meaning we miss the deadline to inject the timer in the guest. Or is this part of the PV MSR semantics - that it will only work for a certain range of values, and anything less than, say, 1ms should not use the PV MSR?

.. for these timers, we have to program the tsc-deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which causes a VM-exit, and KVM will signal the working thread through an IPI to program the timer, instead of registering it on the current CPU (patch 0004). More detail in patch 0007.

Quan
Alibaba Cloud
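To make the mechanism under discussion easier to follow, here is a hedged sketch of what the shared-page contract and the scanning kthread could look like. The record layout, field names, the pvtimer_arm_for_vcpu() helper and the 50us scan period are illustrative assumptions, not the actual patches.

/*
 * Illustrative sketch (not the actual patches): the guest writes its
 * next tsc-deadline into a per-vCPU record in a shared page instead of
 * the MSR; a host kthread pinned to a dedicated CPU periodically scans
 * the records and arms a timer for each deadline it finds.
 */
struct pvtimer_vcpu_event {		/* per-vCPU record; layout assumed */
	u64 expire_tsc;			/* guest-written tsc deadline, 0 = none pending */
};

static int pvtimer_scan_thread(void *data)
{
	struct pvtimer_vcpu_event *ev = data;	/* mapped guest/host shared page */

	while (!kthread_should_stop()) {
		int i;

		for (i = 0; i < KVM_MAX_VCPUS; i++) {
			/* consume the record atomically so the guest can rearm it */
			u64 deadline = xchg(&ev[i].expire_tsc, 0);

			if (deadline)
				pvtimer_arm_for_vcpu(i, deadline);	/* assumed helper */
		}
		/* the scan period trades injection latency against host CPU use */
		schedule_timeout_interruptible(usecs_to_jiffies(50));
	}
	return 0;
}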
Re: [PATCH] KVM: x86: avoid unnecessary XSETBV on guest entry
On 2017/12/13 20:51, Paolo Bonzini wrote:

xsetbv can be expensive when running on nested virtualization, try to avoid it.

Signed-off-by: Paolo Bonzini <pbonz...@redhat.com>
---
 arch/x86/kvm/x86.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0f82e2cbf64c..daa1918031df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -702,7 +702,8 @@ static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
 	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) && !vcpu->guest_xcr0_loaded) {
 		/* kvm_set_xcr() also depends on this */
-		xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
+		if (vcpu->arch.xcr0 != host_xcr0)
+			xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
 		vcpu->guest_xcr0_loaded = 1;
 	}
 }

Reviewed-by: Quan Xu <quan@gmail.com>
Re: [PATCH RFC 0/7] kvm pvtimer
On 2017/12/14 00:28, Konrad Rzeszutek Wilk wrote: On Wed, Dec 13, 2017 at 11:25:13PM +0800, Quan Xu wrote: On Fri, Dec 8, 2017 at 11:10 PM, Konrad Rzeszutek Wilk <konrad.w...@oracle.com> wrote: On Fri, Dec 08, 2017 at 04:39:43PM +0800, Quan Xu wrote: From: Ben Luo <bn0...@gmail.com>

This patchset introduces a new paravirtualized mechanism to reduce the VM-exits caused by guest timer accesses.

And how bad is this blip in arming the timer? And how often do you get this timer to be armed? Or better yet - what are the workloads in which you found this VMExit to be painful? Thanks.

Or better yet - what are the workloads in which you found this VMExit to be painful?

One painful point is the VM idle path.. for some network req/resp services, or benchmarks of process context switches..

So:

1) VM idle path and network req/resp services:

Does this go away if you don't hit the idle path? Meaning if you loop without hitting HLT/MWAIT?

We still hit HLT.. we can use it together with https://lkml.org/lkml/2017/8/29/279 ..

I am assuming the issue you are facing is the latency - that is, the first time the guest comes out of HLT and responds to the packet, the latency is much higher than without?

Yes.

And the arming of the timer?

2) process context switches.

Is that related to 1)? That is, the 'schedule' call and the process going to sleep waiting for an interrupt or timer?

This all sounds like issues with low-CPU-usage workloads where you need low latency responses?

Yes, it is also helpful for some timer-intensive services.

Quan
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:53, Thomas Gleixner wrote: On Thu, 16 Nov 2017, Quan Xu wrote: On 2017-11-16 06:03, Thomas Gleixner wrote:

--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv,
 		target_state = &drv->states[index];
 	}
+#ifdef CONFIG_PARAVIRT
+	paravirt_idle_poll();
+
+	if (need_resched())
+		return -EBUSY;
+#endif

That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it.

The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions.

How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that.

Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions.

I would be very surprised if this data would not overlap by at least 90%.

Peter, tglx, thanks for your comments..

Rethinking this patch set:

1. which data needs to be considered to make decisions about probabilistic polling

I really need to write up which data needs to be considered to make decisions about probabilistic polling.

Over the last several months, I always focused on the data _from idle to reschedule_, in order to bypass the idle loops. Unfortunately, this makes me touch scheduler/idle/nohz code inevitably.

With tglx's suggestion, the data which is necessary to make power/idle state decisions is the last idle state's residency time. IIUC this data is the duration from idle to wakeup, which may be caused by a reschedule irq or another irq.

I also tested that the reschedule irq overlaps by more than 90% (tracing the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute.

Given that overlap, I think I can feed the last idle state's residency time into the decisions about probabilistic polling, as @dev->last_residency does. It is much easier to get that data.

2. do a HV specific idle driver (function)

So far, power management is not exposed to the guest.. idle is simple for a KVM guest: calling "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call())..

Thanks to the Xen guys, who have implemented the paravirt framework, I can implement it as easily as the following:

--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
 	update_intr_gate(X86_TRAP_PF, async_page_fault);
 }
+static __cpuidle void kvm_safe_halt(void)
+{
+	/* 1. POLL, if need_resched() --> return */
+
+	asm volatile("sti; hlt": : :"memory");	/* 2. halt */
+
+	/* 3. get the last idle state's residency time */
+
+	/* 4. update poll duration based on last idle state's residency time */
+}
+
 void __init kvm_guest_init(void)
 {
 	int i;
@@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
 	if (kvmclock_vsyscall)
 		kvm_setup_vsyscall_timeinfo();
+	pv_irq_ops.safe_halt = kvm_safe_halt;
+
 #ifdef CONFIG_SMP

Then I have no need to introduce a new pvops, and never need to modify scheduler/idle/nohz code again. Also I can narrow all of the code down into arch/x86/kernel/kvm.c.

If this is the right direction, I will send a new patch set next week..

Thanks,

Quan
Alibaba Cloud
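Filling in the four numbered steps of the skeleton above, one possible shape of such a kvm_safe_halt() is sketched below. The per-CPU poll_duration_ns bookkeeping and the update_poll_duration() helper are assumptions based on the earlier patches in this series, not settled code.

/*
 * Sketch of the idea outlined in steps 1-4 above; illustrative only.
 */
static DEFINE_PER_CPU(u64, poll_duration_ns);

static __cpuidle void kvm_safe_halt(void)
{
	u64 poll_ns = this_cpu_read(poll_duration_ns);
	u64 start = ktime_get_ns(), halt_start, residency;

	/* 1. poll: a pending reschedule lets us skip hlt (and its VM-exit) */
	while (ktime_get_ns() - start < poll_ns) {
		if (need_resched()) {
			local_irq_enable();	/* safe_halt is expected to leave irqs on */
			return;
		}
		cpu_relax();
	}

	/* 2. halt */
	halt_start = ktime_get_ns();
	asm volatile("sti; hlt" : : : "memory");

	/* 3. last idle state's residency time */
	residency = ktime_get_ns() - halt_start;

	/* 4. feed the residency into a grow/shrink rule for the next poll
	 *    window; update_poll_duration() is an assumed helper here */
	this_cpu_write(poll_duration_ns, update_poll_duration(poll_ns, residency));
}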
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-17 19:36, Thomas Gleixner wrote: On Fri, 17 Nov 2017, Quan Xu wrote: On 2017-11-16 17:53, Thomas Gleixner wrote:

That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it.

The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions.

How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that.

Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions.

I would be very surprised if this data would not overlap by at least 90%.

1. which data needs to be considered to make decisions about probabilistic polling

I really need to write up which data needs to be considered to make decisions about probabilistic polling.

Over the last several months, I always focused on the data _from idle to reschedule_, in order to bypass the idle loops. Unfortunately, this makes me touch scheduler/idle/nohz code inevitably.

With tglx's suggestion, the data which is necessary to make power/idle state decisions is the last idle state's residency time. IIUC this data is the duration from idle to wakeup, which may be caused by a reschedule irq or another irq.

That's part of the picture, but not complete.

tglx, could you share more? I am very curious about it..

I also tested that the reschedule irq overlaps by more than 90% (tracing the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute.

Given that overlap, I think I can feed the last idle state's residency time into the decisions about probabilistic polling, as @dev->last_residency does. It is much easier to get that data.

That's only true for your particular use case.

2. do a HV specific idle driver (function)

So far, power management is not exposed to the guest.. idle is simple for a KVM guest: calling "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call())..

Thanks to the Xen guys, who have implemented the paravirt framework, I can implement it as easily as the following:

--- a/arch/x86/kernel/kvm.c

Your email client is using a very strange formatting.

My bad, I inserted spaces to highlight the code.

This is definitely better than what you proposed so far and implementing it as a proof of concept seems to be worthwhile. But I doubt that this is the final solution. It's not generic and not necessarily suitable for all use case scenarios.

Yes, I am exhausted :):) Could you tell me what the gap is to being generic and suitable for all use case scenarios? Is it the lack of irq/idle predictors?

I really want to upstream it for all public cloud users/providers.. As the kvm host side has a similar mechanism, is it possible to upstream it with the following conditions?:

1) add a QEMU configuration option to enable/disable it, disabled by default.
2) add some "TODO" comments near the code.
3) ...

Anyway, thanks for your help..

Quan
Alibaba Cloud
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 18:21, Paolo Bonzini wrote: On 16/11/2017 06:06, Quan Xu wrote:

when the vcpu thread is scheduled out, the pkru value in current->thread.fpu.state may be the host pkru value, instead of the guest pkru value (of course, this _assumes_ that the pkru is in current->thread.fpu.state as well). In this way, the pkru may be a corner case.

Rik may correct me, but I think this is not possible. Preemption is disabled all the time while PKRU = guest_pkru (which is only during vmx_vcpu_run).

Refer to the following code:

vcpu_enter_guest()
{
	...
	preempt_disable();
	...
	kvm_x86_ops->run(vcpu);	/* actually it is vmx_vcpu_run() */
	...
	preempt_enable();
}

preempt_disable() is called before kvm_x86_ops->run.. In kvm_x86_ops->run, the pkru is restored with host_pkru (IF guest_pkru != host_pkru). All of this happens under preempt_disable(). So it is not true that preemption is disabled all the time while PKRU = guest_pkru..

However, it seems there is still some gap.. As Rik said, "at context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out."

After preempt_enable() in vcpu_enter_guest(), the vcpu thread is scheduled out. In kvm_x86_ops->run, the PKRU has been restored with host_pkru (IF guest_pkru != host_pkru).. instead of guest_pkru.. Then the PKRU is host_pkru; how do we save guest_pkru to current->thread.fpu? As mentioned, all this _assumes_ that the pkru is in current->thread.fpu.state as well.

Thanks,

Quan
Alibaba Cloud

Context switching will only happen in vcpu_enter_guest() after preempt_enable() for a preemptible kernel, or in vcpu_run via cond_resched() for a non-preemptible kernel.

Thanks,

Paolo

VM migration again: in case source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent..
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 20:18, Paolo Bonzini wrote: On 16/11/2017 13:12, Quan Xu wrote:

However, it seems there is still some gap.. As Rik said, "at context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out."

By "guest FPU state" Rik means "guest FPU with host PKRU".

:( actually it is host_pkru, just with different names..

Guest PKRU is always stored in vcpu->arch.pkru rather than in the guest FPU state, so guest PKRU will never be in current->thread.fpu.state either. KVM_GET_XSAVE will return the guest FPU state together with vcpu->arch.pkru, and migration will work properly.

Agreed, I'll fix it.. but that's why I am concerned.. there are so many ways to write PKRU with host_pkru/guest_pkru.. After migration, KVM_GET|SET_XSAVE restores the right pkru.. but we introduce another path:

-- When the VCPU thread is scheduled back in, the context switch code will restore current->thread.fpu to the FPU registers.

There is still a window in which current->thread.fpu is restored to the FPU registers before entering guest mode and before preempt_disable(). On the target machine, after migration, the pkru value in current->thread.fpu is source_host_pkru. In case source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value.. source_host_pkru_value may be restored to PKRU.. making the pkru status inconsistent..

Thanks

Quan
Alibaba Cloud

Thanks,

Paolo

After preempt_enable() in vcpu_enter_guest(), the vcpu thread is scheduled out. In kvm_x86_ops->run, the PKRU has been restored with host_pkru (IF guest_pkru != host_pkru).. instead of guest_pkru.. Then the PKRU is host_pkru; how do we save guest_pkru to current->thread.fpu? As mentioned, all this _assumes_ that the pkru is in current->thread.fpu.state as well.

Thanks,

Quan
Alibaba Cloud

Context switching will only happen in vcpu_enter_guest() after preempt_enable() for a preemptible kernel, or in vcpu_run via cond_resched() for a non-preemptible kernel.

Thanks,

Paolo

VM migration again: in case source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent..
Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll
On 2017/11/13 23:08, Ingo Molnar wrote: * Quan Xu <quan.x...@gmail.com> wrote: From: Quan Xu <quan@gmail.com>

To reduce the cost of the poll, we introduce three sysctls to control the poll time when running as a virtual machine with paravirt.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
---
 Documentation/sysctl/kernel.txt | 35 +++
 arch/x86/kernel/paravirt.c      |  4
 include/linux/kernel.h          |  6 ++
 kernel/sysctl.c                 | 34 ++
 4 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 694968c..30c25fb 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this one.

==

paravirt_poll_grow: (X86 only)

Multiplier used to increase the poll time. This is expected to take effect only when running as a virtual machine with CONFIG_PARAVIRT enabled. This can't bring any benefit on bare metal even with CONFIG_PARAVIRT enabled.

By default this value is 2. Possible values to set are in range {2..16}.

==

paravirt_poll_shrink: (X86 only)

Divisor used to reduce the poll time. This is expected to take effect only when running as a virtual machine with CONFIG_PARAVIRT enabled. This can't bring any benefit on bare metal even with CONFIG_PARAVIRT enabled.

By default this value is 2. Possible values to set are in range {2..16}.

==

paravirt_poll_threshold_ns: (X86 only)

Controls the maximum poll time before entering the real idle path. This is expected to take effect only when running as a virtual machine with CONFIG_PARAVIRT enabled. This can't bring any benefit on bare metal even with CONFIG_PARAVIRT enabled.

By default this value is 0, which means do not poll. Possible values to set are in range {0..50}. Change the value to non-zero if running latency-bound workloads in a virtual machine.

I absolutely hate it how this hybrid idle loop polling mechanism is not self-tuning!

Ingo, actually it is self-tuning..

Please make it all work fine by default, and automatically so, instead of adding three random parameters...

.. I will make it all fine by default. However, cloud environments are diverse; could I keep only the paravirt_poll_threshold_ns parameter (the maximum poll time), which is similar to the "adaptive halt-polling" Wanpeng mentioned? Then the user can turn it off, or find an appropriate threshold for some odd scenario..

Thanks for your comments!!

Quan
Alibaba Cloud

And if it cannot be done automatically then we should rather not do it at all. Maybe the next submitter of a similar feature can think of a better approach.

Thanks,

Ingo
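As a reading aid, the way the three knobs are meant to interact can be expressed as a small update rule. This is an interpretation of the documentation above, not the exact code in paravirt.c; block_ns stands for how long the CPU actually stayed idle last time, and the helper name is invented.

/*
 * Sketch of the intended self-tuning rule for the three sysctls
 * documented above; an interpretation, not the patch itself.
 */
unsigned int paravirt_poll_grow = 2;		/* sysctl */
unsigned int paravirt_poll_shrink = 2;		/* sysctl */
unsigned int paravirt_poll_threshold_ns;	/* sysctl, 0 = no poll */

static unsigned int update_poll_ns(unsigned int poll_ns, unsigned int block_ns)
{
	if (block_ns <= paravirt_poll_threshold_ns) {
		/* wakeup came quickly: polling longer may avoid the halt */
		poll_ns = poll_ns ? poll_ns * paravirt_poll_grow : block_ns;
		if (poll_ns > paravirt_poll_threshold_ns)
			poll_ns = paravirt_poll_threshold_ns;
	} else {
		/* long idle period: shrink the poll window to save CPU */
		poll_ns /= paravirt_poll_shrink;
	}

	return poll_ns;
}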
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:12, Wanpeng Li wrote: 2017-11-14 15:02 GMT+08:00 Quan Xu <quan@gmail.com>: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

Actually we can reduce the CPU utilization by sleeping for a period of time, as has already been done in the poll logic of the IO subsystem; then we can improve the algorithm in kvm instead of introducing another, duplicate one in the kvm guest.

We really appreciate upstream's kvm dynamic poll mechanism, which is really helpful for a lot of scenarios..

However, as the description said, in virtualization the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

For upstream's kvm dynamic poll mechanism, even if you could provide a better algorithm, how could you bypass the timer access (LAPIC timer or TSC deadline timer), or the hardware context switch between virtual machine and hypervisor? I know this is a tradeoff.

Furthermore, here is the data we get when running the benchmark contextswitch to measure the latency (lower is better):

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   3402.9 ns/ctxsw -- 199.8 %CPU

2. w/ patch and disable kvm dynamic poll:
   1163.5 ns/ctxsw -- 205.5 %CPU

3. w/ kvm dynamic poll:
   2280.6 ns/ctxsw -- 199.5 %CPU

So these two solutions are quite similar, but not duplicates.. That's also why we add a generic idle poll before entering the real idle path: when a reschedule event is pending, we can bypass the real idle path.

Quan
Alibaba Cloud

Regards,
Wanpeng Li

4. w/ patch and w/ kvm dynamic poll:
   42225.3 bit/s -- 198.7 %CPU

5. idle=poll:
   37081.7 bit/s -- 998.1 %CPU

w/ this patch, we improve performance by 23%.. and we could even improve performance by 45.4% if we use the patch together with kvm dynamic poll. Also the cost of CPU is much lower than in the 'idle=poll' case..

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs.. a new pvops makes the code clean and easy to maintain. Also I tried to add it into an existing pvops, but it doesn't match.

Quan
Alibaba Cloud

Juergen
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

4. w/ patch and w/ kvm dynamic poll:
   42225.3 bit/s -- 198.7 %CPU

5. idle=poll:
   37081.7 bit/s -- 998.1 %CPU

w/ this patch, we improve performance by 23%.. and we could even improve performance by 45.4% if we use the patch together with kvm dynamic poll. Also the cost of CPU is much lower than in the 'idle=poll' case..

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs.. a new pvops makes the code clean and easy to maintain. Also I tried to add it into an existing pvops, but it doesn't match.

Quan
Alibaba Cloud

Juergen
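For readers who have not seen the rest of the series, the pvops plumbing being debated is roughly the following. It is a sketch based on the description above, with simplified declarations; kvm_idle_poll() is an assumed name for the KVM guest poll routine.

/*
 * Rough shape of the pvops plumbing discussed in this thread
 * (illustrative, not the exact patch): a tiny ops structure whose
 * .poll member is called from the idle path, with KVM installing its
 * own poll routine at guest init time.
 */
struct pv_idle_ops {
	void (*poll)(void);
};

struct pv_idle_ops pv_idle_ops = {
	.poll = paravirt_nop,		/* harmless default on bare metal */
};

static inline void paravirt_idle_poll(void)
{
	pv_idle_ops.poll();
}

/* KVM guest init (sketch): switch in the real poll routine */
static void __init kvm_guest_idle_init(void)
{
	if (kvm_para_available())
		pv_idle_ops.poll = kvm_idle_poll;	/* assumed guest poll routine */
}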
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 18:27, Juergen Gross wrote: On 14/11/17 10:38, Quan Xu wrote: On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

4. w/ patch and w/ kvm dynamic poll:
   42225.3 bit/s -- 198.7 %CPU

5. idle=poll:
   37081.7 bit/s -- 998.1 %CPU

w/ this patch, we improve performance by 23%.. and we could even improve performance by 45.4% if we use the patch together with kvm dynamic poll. Also the cost of CPU is much lower than in the 'idle=poll' case..

I don't question the general idea. I just think pvops isn't the best way to implement it.

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs..

(sorry, a typo.. s/other archs/other hypervisors/ -- it refers to hypervisors like Xen, HyperV and VMware..)

A new pvops makes the code clean and easy to maintain. Also I tried to add it into an existing pvops, but it doesn't match.

You are aware that pvops is x86 only?

Yes, I'm aware..

I really don't see the big difference in maintainability compared to the static key / function pointer variant:

void (*guest_idle_poll_func)(void);
struct static_key guest_idle_poll_key __read_mostly;

static inline void guest_idle_poll(void)
{
	if (static_key_false(&guest_idle_poll_key))
		guest_idle_poll_func();
}

Thank you for your sample code :) I agree there is no big difference.. I think we are discussing two things:

1) x86 VMs on different hypervisors
2) VMs of different archs on the kvm hypervisor

What I want to do is x86 VMs on different hypervisors, such as kvm / xen / hyperv ..

Why limit the solution to x86 if the more general solution isn't harder? As you didn't give any reason why the pvops approach is better, other than that you don't care about non-x86 platforms, you won't get an "Ack" from me for this patch.

It just looks a little odder to me. I understand you care about non-x86 archs.

Are you aware of 'pv_time_ops' for the arm64/arm/x86 archs, defined in:
- arch/arm64/include/asm/paravirt.h
- arch/x86/include/asm/paravirt_types.h
- arch/arm/include/asm/paravirt.h

I am unfamiliar with arm code. IIUC, if you'd implement pv_idle_ops for the arm/arm64 arch, you'd define the same structure in
- arch/arm64/include/asm/paravirt.h or
- arch/arm/include/asm/paravirt.h
.. instead of a static key / function, then implement a real function in
- arch/arm/kernel/paravirt.c.

Also I wonder HOW/WHERE to define a static key/function that benefits both x86 and non-x86 archs?

Quan
Alibaba Cloud

And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too.

.. Referring to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same for pv_idle_ops.

With the above explanation, do you still think I need to define the static key/function pointer variant?

btw, any interest in porting it to Xen HVM guests? :)

Maybe. But this should work for Xen on ARM, too.

Juergen
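Building on Juergen's snippet quoted above, the KVM-guest-side registration he describes could look like the sketch below; kvm_idle_poll() is an assumed poll routine, and the declarations are the ones from his example.

/*
 * Illustrative only: what "KVM would just need to set
 * guest_idle_poll_func and enable the static key" could look like,
 * reusing the guest_idle_poll_func / guest_idle_poll_key declarations
 * from the snippet above.
 */
static void __init kvm_register_idle_poll(void)
{
	guest_idle_poll_func = kvm_idle_poll;		/* assumed KVM poll routine */
	static_key_slow_inc(&guest_idle_poll_key);	/* flip the fast-path check on */
}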
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

4. w/ patch and w/ kvm dynamic poll:
   42225.3 bit/s -- 198.7 %CPU

5. idle=poll:
   37081.7 bit/s -- 998.1 %CPU

w/ this patch, we improve performance by 23%.. and we could even improve performance by 45.4% if we use the patch together with kvm dynamic poll. Also the cost of CPU is much lower than in the 'idle=poll' case..

I don't question the general idea. I just think pvops isn't the best way to implement it.

Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too.

I assume this feature will be ported to other archs..

(sorry, a typo.. s/other archs/other hypervisors/ -- it refers to hypervisors like Xen, HyperV and VMware..)

A new pvops makes the code clean and easy to maintain. Also I tried to add it into an existing pvops, but it doesn't match.

You are aware that pvops is x86 only?

Yes, I'm aware..

I really don't see the big difference in maintainability compared to the static key / function pointer variant:

void (*guest_idle_poll_func)(void);
struct static_key guest_idle_poll_key __read_mostly;

static inline void guest_idle_poll(void)
{
	if (static_key_false(&guest_idle_poll_key))
		guest_idle_poll_func();
}

Thank you for your sample code :) I agree there is no big difference.. I think we are discussing two things:

1) x86 VMs on different hypervisors
2) VMs of different archs on the kvm hypervisor

What I want to do is x86 VMs on different hypervisors, such as kvm / xen / hyperv ..

And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too.

.. Referring to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same for pv_idle_ops.

With the above explanation, do you still think I need to define the static key/function pointer variant?

btw, any interest in porting it to Xen HVM guests? :)

Quan
Alibaba Cloud
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 16:22, Wanpeng Li wrote: 2017-11-14 16:15 GMT+08:00 Quan Xu <quan@gmail.com>: On 2017/11/14 15:12, Wanpeng Li wrote: 2017-11-14 15:02 GMT+08:00 Quan Xu <quan@gmail.com>: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu <quan@gmail.com>

So far, pv_idle_ops.poll is the only op for pv_idle. .poll is called in the idle path, which will poll for a while before we enter the real idle state.

In virtualization, the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which will hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

Our solution is to poll for a while and not enter the real idle path if we get a schedule event during polling. Polling may waste CPU, so we adopt a smart polling mechanism to reduce useless polling.

Signed-off-by: Yang Zhang <yang.zhang...@gmail.com>
Signed-off-by: Quan Xu <quan@gmail.com>
Cc: Juergen Gross <jgr...@suse.com>
Cc: Alok Kataria <akata...@vmware.com>
Cc: Rusty Russell <ru...@rustcorp.com.au>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: "H. Peter Anvin" <h...@zytor.com>
Cc: x...@kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-de...@lists.xenproject.org

Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary?

Juergen, here is the data we get when running benchmark netperf:

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   29031.6 bit/s -- 76.1 %CPU

2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0):
   35787.7 bit/s -- 129.4 %CPU

3. w/ kvm dynamic poll:
   35735.6 bit/s -- 200.0 %CPU

Actually we can reduce the CPU utilization by sleeping for a period of time, as has already been done in the poll logic of the IO subsystem; then we can improve the algorithm in kvm instead of introducing another, duplicate one in the kvm guest.

We really appreciate upstream's kvm dynamic poll mechanism, which is really helpful for a lot of scenarios..

However, as the description said, in virtualization the idle path includes several heavy operations, including timer access (LAPIC timer or TSC deadline timer), which hurt performance especially for latency-intensive workloads like message-passing tasks. The cost is mainly from the vmexit, which is a hardware context switch between the virtual machine and the hypervisor.

For upstream's kvm dynamic poll mechanism, even if you could provide a better algorithm, how could you bypass the timer access (LAPIC timer or TSC deadline timer), or the hardware context switch between virtual machine and hypervisor? I know this is a tradeoff.

Furthermore, here is the data we get when running the benchmark contextswitch to measure the latency (lower is better):

1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0):
   3402.9 ns/ctxsw -- 199.8 %CPU

2. w/ patch and disable kvm dynamic poll:
   1163.5 ns/ctxsw -- 205.5 %CPU

3. w/ kvm dynamic poll:
   2280.6 ns/ctxsw -- 199.5 %CPU

So these two solutions are quite similar, but not duplicates.. That's also why we add a generic idle poll before entering the real idle path: when a reschedule event is pending, we can bypass the real idle path.

There is a similar logic in the idle governor/driver, so how does this patchset influence the decision in the idle governor/driver when running on bare metal (power management is not exposed to the guest, so we will not enter the idle driver in the guest)?

This is expected to take effect only when running as a virtual machine with the proper CONFIG_* enabled. This cannot work on bare metal even with the proper CONFIG_* enabled.

Quan
Alibaba Cloud
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:45, Daniel Lezcano wrote: On 16/11/2017 10:12, Quan Xu wrote: On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang <yang.zhang...@gmail.com>

Implement a generic idle poll which resembles the functionality found in arch/. Provide a weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed.

No, we want less of those magic hooks, not more.

Interrupts arrive which may not cause a reschedule in idle loops. In a KVM guest, this costs several VM-exit/VM-entry cycles: a VM-entry for the interrupt and a VM-exit immediately afterwards. This also becomes more expensive than on bare metal. Add a generic idle poll before entering the real idle path. When a reschedule event is pending, we can bypass the real idle path.

Why not do a HV specific idle driver?

If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But let's not proliferate that. I'd rather see that go away.

Agreed. Even though we can get more benefit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in a kvm guest, I won't proliferate that..

But the irq_timings stuff is heading in the same direction, with a more complex prediction logic which should tell you pretty well how long that idle period is going to be, and in case of an interrupt-heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision.

Interesting. I have tested with the IRQ_TIMINGS related code, which seems not to be working so far.

I don't know how you tested it, can you elaborate what you meant by "seems not working so far"?

Daniel, I tried to enable IRQ_TIMINGS* manually and used irq_timings_next_event() to return an estimation of the earliest interrupt. However I got a constant.

There is still some work to do to be more efficient. The prediction based on the irq timings is all right if the interrupts have a simple periodicity. But as soon as there is a pattern, the current code can't handle it properly and makes bad predictions.

I'm working on a self-learning pattern detection which is too heavy for the kernel, and with it we should be able to detect the patterns properly and re-adjust the period if it changes. I'm in the process of making it suitable for kernel code (both math and perf).

One improvement which can be done right now, and which can help you, is the interrupt rate on the CPU. It is possible to compute it, and that will give accurate information for the polling decision.

As tglx said, talk to each other / work together to make it usable for all use cases.

Could you share how to enable it to get the interrupt rate on the CPU? I can try it in a cloud scenario. Of course, I'd like to work with you to improve it.

Quan
Alibaba Cloud
Re: [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
On 2017-11-16 05:31, Konrad Rzeszutek Wilk wrote: On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote: From: Yang Zhang <yang.zhang...@gmail.com> Some latency-intensive workloads have seen an obvious performance drop when running inside a VM. The main reason is that the overhead is amplified when running inside a VM. The largest cost I have seen is inside the idle path. Meaning a VMEXIT b/c it is a 'halt' operation? And then going back in the guest (VMRESUME) takes time. And hence your latency gets all whacked b/c of this? Konrad, I can't follow 'b/c' here.. sorry. So if I understand - you want to use your _full_ timeslice (of the guest) without ever (or as much as possible) going into the hypervisor? as much as possible. Which means in effect you don't care about power-saving or CPUfreq savings, you just want to eat the full CPU for snack? actually, we care about power-saving. The poll duration is self-tuning, otherwise it would be almost the same as 'halt=poll'. Also, we always send out the CPU usage of the netperf/ctxsw benchmarks along with the results. We got much better performance with only a limited increase in CPU usage. This patch introduces a new mechanism to poll for a while before entering the idle state. If a reschedule is needed during the poll, then we don't need to go through the heavy overhead path. Schedule of what? The guest or the host? a reschedule by the guest scheduler.. it is the guest. Quan Alibaba Cloud
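To make the 'self-tuning' point concrete, the poll step in front of halt is roughly shaped as below; the names and the grow/shrink policy are simplified for illustration, not copied from the series.

static u64 poll_duration_ns;			/* per-CPU and tunable in the real series */
static u64 poll_threshold_ns = 200000;		/* illustrative cap on the poll window */

static void guest_idle_poll(void)
{
	u64 start = ktime_get_ns();
	bool found_work = false;

	while (ktime_get_ns() - start < poll_duration_ns) {
		if (need_resched()) {
			found_work = true;
			break;
		}
		cpu_relax();
	}

	/* self-tuning: grow the window when polling paid off, shrink it when it was wasted */
	if (found_work) {
		poll_duration_ns *= 2;
		if (poll_duration_ns > poll_threshold_ns)
			poll_duration_ns = poll_threshold_ns;
	} else {
		poll_duration_ns /= 2;
	}
}

When no work shows up, the window shrinks before we fall through to halt, which is how the CPU usage stays bounded compared to idle=poll.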
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 12:21, Rik van Riel wrote: On Thu, 2017-11-16 at 10:50 +0800, Quan Xu wrote: On 2017-11-15 22:43, Rik van Riel wrote: Can you explain why you believe that? for example, a vcpu thread is running in kvm mode and is stopped under a critical condition. QEMU sends an IPI to cause a VM-exit to happen immediately, and this IPI doesn't make the vcpu return to QEMU. IIUC this vcpu thread will still continue to run in kvm mode when it is woken up on the target machine. with your patch, I don't see a chance to load guest FPU or XSTATE until it returns to QEMU and runs kvm mode again. then the FPU or XSTATE status is inconsistent for a small window, and what's even worse is that the vcpu is running. Did I misunderstand? At context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out. When the VCPU thread is scheduled back in, the context switch code will restore current->thread.fpu to the FPU registers. good catch! Also, as your comment says, PKRU is switched out separately at VMENTER and VMEXIT time, but with a lot of IF conditions.. the pkru may be restored with the host pkru after VMEXIT. when the vcpu thread is scheduled out, the pkru value in current->thread.fpu.state may be the host pkru value instead of the guest pkru value (of course, this _assumes_ that the pkru is in current->thread.fpu.state as well). in this way, the pkru may be a corner case. Consider VM migration again: in case source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value, the pkru status would be inconsistent.. Quan Alibaba Cloud The VCPU thread will never run with anything else than the guest FPU state, while inside the KVM_RUN code.
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-17 01:50, Paolo Bonzini wrote: On 16/11/2017 15:28, Quan Xu wrote: vcpu->srcu_idx = srcu_read_lock(&vcpu->srcu); + kvm_load_guest_fpu(vcpu); + for (;;) { if (kvm_vcpu_running(vcpu)) { r = vcpu_enter_guest(vcpu); << as Rik dropped kvm_load_guest_fpu(vcpu) in vcpu_enter_guest() .. in case we are still in kvm mode, how do we make sure the pkru is always the right one before entering guest mode, and that we are not preempted before preempt_disable() after migration? :( As you know: 1) kvm_load_guest_fpu doesn't load the guest PKRU, it keeps the userspace PKRU. 2) the guest PKRU is only ever set in a preemption-disabled area Thus, context switch always sees the userspace PKRU. The guest PKRU is only set the next time vmx_vcpu_run executes. Paolo Paolo, thanks for your explanation!! :-) Rik, could you cc me in v2? Quan
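For readers following the PKRU part: the switch Paolo refers to lives in the preemption-disabled VM entry/exit path, roughly like the fragment below (simplified from vmx_vcpu_run(); the conditions in the real code are more involved).

	/* just before VMLAUNCH/VMRESUME, preemption already disabled */
	if (static_cpu_has(X86_FEATURE_PKU) &&
	    vcpu->arch.pkru != vmx->host_pkru)
		__write_pkru(vcpu->arch.pkru);		/* install the guest PKRU */

	/* ... guest runs, VM exit ... */

	if (static_cpu_has(X86_FEATURE_PKU)) {
		vcpu->arch.pkru = __read_pkru();	/* capture what the guest left behind */
		if (vcpu->arch.pkru != vmx->host_pkru)
			__write_pkru(vmx->host_pkru);	/* back to the userspace PKRU before preemption is possible */
	}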
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang <yang.zhang...@gmail.com> Implement a generic idle poll which resembles the functionality found in arch/. Provide a weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed. No, we want less of those magic hooks, not more. Interrupts arrive which may not cause a reschedule in idle loops. In a KVM guest, this costs several VM-exit/VM-entry cycles: VM-entry for the interrupt and a VM-exit immediately afterwards. This also becomes more expensive than bare metal. Add a generic idle poll before entering the real idle path. When a reschedule event is pending, we can bypass the real idle path. Why not do a HV specific idle driver? If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But let's not proliferate that. I'd rather see that go away. agreed. Even though we can get more benefit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in a kvm guest, I won't proliferate that.. But the irq_timings stuff is heading into the same direction, with a more complex prediction logic which should tell you pretty well how long that idle period is going to be, and in case of an interrupt heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision. interesting. I have tested with the IRQ_TIMINGS related code, which seems not working so far. Also I'd like to help as much as I can. This can be handled either in a HV specific idle driver or even in the generic core code. If the interrupt does not arrive within the predicted time, then you can assume that the flood stopped and invoke halt or whatever. That avoids all of that 'tunable and tweakable' x86 specific hackery and utilizes common functionality which is mostly there already. here is some sample code: poll for a while before entering halt in cpuidle_enter_state(). If I get a reschedule event, then don't try to enter halt. (I hope this is the right direction as Peter mentioned in another email) --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, target_state = &drv->states[index]; } +#ifdef CONFIG_PARAVIRT + paravirt_idle_poll(); + + if (need_resched()) + return -EBUSY; +#endif + /* Take note of the planned idle state. */ sched_idle_set_state(target_state); thanks, Quan Alibaba Cloud
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 16:45, Peter Zijlstra wrote: On Wed, Nov 15, 2017 at 11:03:08PM +0100, Thomas Gleixner wrote: If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. Sure. And there are people who want to do the same for native. Adding more ugly and special cases just isn't the way to go about doing that. I'm fairly sure I've told the various groups that want to tinker with this to work together on this. I've also in fairly significant detail sketched how to rework the idle code and idle predictors. At this point I'm too tired to dig any of that up, so I'll just keep saying no to patches that don't even attempt to go in the right direction. Peter, point taken. I have really considered this factor and tried my best not to interfere with the scheduler/idle code. If the irq_timings code is ready, I can use it directly. I think irq_timings is not an easy task; I'd like to help as much as I can. Also, I won't try to touch the tick_nohz* code again. As tglx suggested, this can be handled either in a HV specific idle driver or even in the generic core code. I hope this is in the right direction. Quan Alibaba Cloud
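For the record, the HV-specific idle driver direction would look roughly like the skeleton below. This is only meant to show the shape, not a working driver: kvm_poll_enter()/kvm_halt_enter() are hypothetical enter callbacks (a bounded poll on need_resched(), and safe_halt() respectively), and the latency numbers are made up.

static int kvm_poll_enter(struct cpuidle_device *dev,
			  struct cpuidle_driver *drv, int index);	/* hypothetical: bounded poll */
static int kvm_halt_enter(struct cpuidle_device *dev,
			  struct cpuidle_driver *drv, int index);	/* hypothetical: safe_halt() */

static struct cpuidle_driver kvm_guest_idle_driver = {
	.name		= "kvm_guest_idle",
	.owner		= THIS_MODULE,
	.states = {
		{
			.name			= "POLL",
			.desc			= "bounded poll on need_resched()",
			.exit_latency		= 0,
			.target_residency	= 0,
			.enter			= kvm_poll_enter,
		},
		{
			.name			= "HLT",
			.desc			= "halt (causes a VM-exit)",
			.exit_latency		= 10,		/* illustrative, in us */
			.target_residency	= 20,
			.enter			= kvm_halt_enter,
		},
	},
	.state_count	= 2,
};

Registration would go through cpuidle_register(&kvm_guest_idle_driver, NULL) from the guest init code, and the existing governor would then pick between polling and halting based on the predicted idle length.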
[PATCH] KVM: VMX: drop I/O permission bitmaps
From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); -- 1.7.1
[PATCH RFC 0/7] kvm pvtimer
From: Ben Luo
This patchset introduces a new paravirtualized mechanism to reduce the VM-exits caused by guest timer accesses. In general, a KVM guest programs the tsc-deadline timestamp into the MSR_IA32_TSC_DEADLINE MSR. This causes a VM-exit, and then KVM handles this timer for the guest. Also, KVM always registers the timer on the CPU the vCPU was running on. When the vCPU thread is rescheduled to another CPU, the timer is migrated to the target CPU as well. When the timer expires, the timer interrupt can make a guest-mode vCPU VM-exit on this CPU. When pvtimer is enabled: - The tsc-deadline timestamp is mostly recorded in a share page, with fewer VM-exits. We introduce a periodically working kthread to scan the share page and synchronize the timer setting for the guest on a dedicated CPU. - Since the working kthread scans periodically, some of the timer events may be lost or delayed. We have to program these tsc-deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause a VM-exit, and KVM will signal the working thread through an IPI to program the timer, instead of registering it on the current CPU. - The timer interrupt will be delivered by the posted interrupt mechanism to vCPUs, with fewer VM-exits. Ben Luo (7): kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR kvm: x86: add a function to exchange value KVM: timer: synchronize tsc-deadline timestamp for guest KVM: timer: program timer to a dedicated CPU KVM: timer: ignore timer migration if pvtimer is enabled Doc/KVM: introduce a new cpuid bit for kvm pvtimer kvm: guest: reprogram guest timer Documentation/virtual/kvm/cpuid.txt |4 + arch/x86/include/asm/kvm_host.h |5 + arch/x86/include/asm/kvm_para.h |9 ++ arch/x86/include/uapi/asm/kvm_para.h |7 ++ arch/x86/kernel/apic/apic.c |9 +- arch/x86/kernel/kvm.c | 38 arch/x86/kvm/cpuid.c |1 + arch/x86/kvm/lapic.c | 170 +- arch/x86/kvm/lapic.h | 11 ++ arch/x86/kvm/x86.c | 15 +++- include/linux/kvm_host.h |3 + virt/kvm/kvm_main.c | 42 + 12 files changed, 308 insertions(+), 6 deletions(-)
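To condense the guest-side decision described above into one place (patch 7 below has the real code; guest_set_tsc_deadline() is an illustrative wrapper for what lapic_next_deadline() -> kvm_pv_timer_next_event() does):

static void guest_set_tsc_deadline(u64 deadline)
{
	struct pvtimer_vcpu_event_info *sp = this_cpu_ptr(&pvtimer_shared_buf);
	u64 now = rdtsc();

	/* always publish the new deadline in the share page */
	xchg(&sp->expire_tsc, deadline);

	/*
	 * Fall back to the ordinary (exiting) MSR write when the host's
	 * polling kthread will not scan the share page in time, or when
	 * the deadline is already (almost) due.
	 */
	if (deadline < sp->next_sync_tsc ||
	    deadline < now + PVTIMER_PADDING)
		wrmsrl(MSR_IA32_TSC_DEADLINE, deadline);
	/* otherwise: no MSR write, hence no VM-exit */
}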
[PATCH RFC 2/7] kvm: x86: add a function to exchange value
From: Ben Luo <bn0...@gmail.com> Introduce kvm_xchg_guest_cached to exchange value with guest page atomically. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- include/linux/kvm_host.h |3 +++ virt/kvm/kvm_main.c | 42 ++ 2 files changed, 45 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 6882538..32949ed 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -688,6 +688,9 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, void *data, int offset, unsigned long len); int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc, gpa_t gpa, unsigned long len); +unsigned long kvm_xchg_guest_cached(struct kvm *kvm, + struct gfn_to_hva_cache *ghc, unsigned long offset, + unsigned long new, int size); int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len); int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len); struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 9deb5a2..3149e17 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2010,6 +2010,48 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, } EXPORT_SYMBOL_GPL(kvm_read_guest_cached); +unsigned long kvm_xchg_guest_cached(struct kvm *kvm, + struct gfn_to_hva_cache *ghc, unsigned long offset, + unsigned long new, int size) +{ + unsigned long r; + void *kva; + struct page *page; + kvm_pfn_t pfn; + + WARN_ON(offset > ghc->len); + + pfn = gfn_to_pfn_atomic(kvm, ghc->gpa >> PAGE_SHIFT); + page = kvm_pfn_to_page(pfn); + + if (is_error_page(page)) + return -EFAULT; + + kva = kmap_atomic(page) + offset_in_page(ghc->gpa) + offset; + switch (size) { + case 1: + r = xchg((char *)kva, new); + break; + case 2: + r = xchg((short *)kva, new); + break; + case 4: + r = xchg((int *)kva, new); + break; + case 8: + r = xchg((long *)kva, new); + break; + default: + kunmap_atomic(kva); + return -EFAULT; + } + + kunmap_atomic(kva); + mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT); + + return r; +} + int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len) { const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0))); -- 1.7.1
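For context, the way patch 3 uses this helper is an atomic fetch-and-clear of the deadline the guest published, along these lines:

	/* read the guest's published deadline and clear it in one atomic step */
	expire_tsc = kvm_xchg_guest_cached(vcpu->kvm, &vcpu->arch.pv_timer.data,
			offsetof(struct pvtimer_vcpu_event_info, expire_tsc),
			0UL, 8);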
[PATCH RFC 1/7] kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR
From: Ben Luo <bn0...@gmail.com> Guest enables pv timer functionality using this MSR Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/include/asm/kvm_host.h |5 + arch/x86/include/uapi/asm/kvm_para.h |6 ++ arch/x86/kvm/lapic.c | 22 ++ arch/x86/kvm/lapic.h |6 ++ arch/x86/kvm/x86.c |8 5 files changed, 47 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c73e493..641b4aa 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -684,6 +684,11 @@ struct kvm_vcpu_arch { bool pv_unhalted; } pv; + struct { + u64 msr_val; + struct gfn_to_hva_cache data; + } pv_timer; + int pending_ioapic_eoi; int pending_external_vector; diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 554aa8f..3dd6116 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -41,6 +41,7 @@ #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 #define MSR_KVM_STEAL_TIME 0x4b564d03 #define MSR_KVM_PV_EOI_EN 0x4b564d04 +#define MSR_KVM_PV_TIMER_EN0x4b564d05 struct kvm_steal_time { __u64 steal; @@ -64,6 +65,11 @@ struct kvm_clock_pairing { #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1))) #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1) +struct pvtimer_vcpu_event_info { + __u64 expire_tsc; + __u64 next_sync_tsc; +} __attribute__((__packed__)); + #define KVM_MAX_MMU_OP_BATCH 32 #define KVM_ASYNC_PF_ENABLED (1 << 0) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 36c90d6..55c9ba3 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1991,6 +1991,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) kvm_lapic_set_base(vcpu, vcpu->arch.apic_base | MSR_IA32_APICBASE_BSP); vcpu->arch.pv_eoi.msr_val = 0; + vcpu->arch.pv_timer.msr_val = 0; apic_update_ppr(apic); if (vcpu->arch.apicv_active) { kvm_x86_ops->apicv_post_state_restore(vcpu); @@ -2478,6 +2479,27 @@ int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data) addr, sizeof(u8)); } +int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data) +{ + u64 addr = data & ~KVM_MSR_ENABLED; + int ret; + + if (!lapic_in_kernel(vcpu)) + return 1; + + if (!IS_ALIGNED(addr, 4)) + return 1; + + vcpu->arch.pv_timer.msr_val = data; + if (!pv_timer_enabled(vcpu)) + return 0; + + ret = kvm_gfn_to_hva_cache_init(vcpu->kvm, >arch.pv_timer.data, + addr, sizeof(struct pvtimer_vcpu_event_info)); + + return ret; +} + void kvm_apic_accept_events(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 4b9935a..539a738 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -113,6 +113,7 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu) } int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data); +int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data); void kvm_lapic_init(void); void kvm_lapic_exit(void); @@ -207,6 +208,11 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu) return lapic_in_kernel(vcpu) && test_bit(KVM_APIC_INIT, >arch.apic->pending_events); } +static inline bool pv_timer_enabled(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.pv_timer.msr_val & KVM_MSR_ENABLED; +} + bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector); void wait_lapic_expire(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c 
b/arch/x86/kvm/x86.c index 03869eb..5668774 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1025,6 +1025,7 @@ bool kvm_rdpmc(struct kvm_vcpu *vcpu) HV_X64_MSR_STIMER0_CONFIG, HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_KVM_PV_EOI_EN, + MSR_KVM_PV_TIMER_EN, MSR_IA32_TSC_ADJUST, MSR_IA32_TSCDEADLINE, @@ -2279,6 +2280,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (kvm_lapic_enable_pv_eoi(vcpu, data)) return 1; break; + case MSR_KVM_PV_TIMER_EN: + if (kvm_lapic_enable_pv_timer(vcpu, data)) + return 1; + break; case MSR
[PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest
From: Ben Luo <bn0...@gmail.com> In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and then KVM handles this timer for guest. The tsc-deadline timestamp is mostly recorded in share page with less VM-exit. We Introduce a periodically working kthread to scan share page and synchronize timer setting for guest on a dedicated CPU. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/kvm/lapic.c | 138 ++ arch/x86/kvm/lapic.h |5 ++ 2 files changed, 143 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 55c9ba3..20a23bb 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -36,6 +36,10 @@ #include #include #include +#include +#include +#include +#include #include "kvm_cache_regs.h" #include "irq.h" #include "trace.h" @@ -70,6 +74,12 @@ #define APIC_BROADCAST 0xFF #define X2APIC_BROADCAST 0xul +static struct hrtimer pv_sync_timer; +static long pv_timer_period_ns = PVTIMER_PERIOD_NS; +static struct task_struct *pv_timer_polling_worker; + +module_param(pv_timer_period_ns, long, 0644); + static inline int apic_test_vector(int vec, void *bitmap) { return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); @@ -2542,8 +2552,130 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu) } } +static enum hrtimer_restart pv_sync_timer_callback(struct hrtimer *timer) +{ + hrtimer_forward_now(timer, ns_to_ktime(pv_timer_period_ns)); + wake_up_process(pv_timer_polling_worker); + + return HRTIMER_RESTART; +} + +void kvm_apic_sync_pv_timer(void *data) +{ + struct kvm_vcpu *vcpu = data; + struct kvm_lapic *apic = vcpu->arch.apic; + unsigned long flags, this_tsc_khz = vcpu->arch.virtual_tsc_khz; + u64 guest_tsc, expire_tsc; + long rem_tsc; + + if (!lapic_in_kernel(vcpu) || !pv_timer_enabled(vcpu)) + return; + + local_irq_save(flags); + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); + rem_tsc = ktime_to_ns(hrtimer_get_remaining(_sync_timer)) + * this_tsc_khz; + if (rem_tsc <= 0) + rem_tsc += pv_timer_period_ns * this_tsc_khz; + do_div(rem_tsc, 100L); + + /* +* make sure guest_tsc and rem_tsc are assigned before to update +* next_sync_tsc. +*/ + smp_wmb(); + kvm_xchg_guest_cached(vcpu->kvm, >arch.pv_timer.data, + offsetof(struct pvtimer_vcpu_event_info, next_sync_tsc), + guest_tsc + rem_tsc, 8); + + /* make sure next_sync_tsc is visible */ + smp_wmb(); + + expire_tsc = kvm_xchg_guest_cached(vcpu->kvm, >arch.pv_timer.data, + offsetof(struct pvtimer_vcpu_event_info, expire_tsc), + 0UL, 8); + + /* make sure expire_tsc is visible */ + smp_wmb(); + + if (expire_tsc) { + if (expire_tsc > guest_tsc) + /* +* As we bind this thread to a dedicated CPU through +* IPI, the timer is registered on that dedicated +* CPU here. 
+*/ + kvm_set_lapic_tscdeadline_msr(apic->vcpu, expire_tsc); + else + /* deliver immediately if expired */ + kvm_apic_local_deliver(apic, APIC_LVTT); + } + local_irq_restore(flags); +} + +static int pv_timer_polling(void *arg) +{ + struct kvm *kvm; + struct kvm_vcpu *vcpu; + int i; + mm_segment_t oldfs = get_fs(); + + while (1) { + set_current_state(TASK_INTERRUPTIBLE); + + if (kthread_should_stop()) { + __set_current_state(TASK_RUNNING); + break; + } + + spin_lock(_lock); + __set_current_state(TASK_RUNNING); + list_for_each_entry(kvm, _list, vm_list) { + set_fs(USER_DS); + use_mm(kvm->mm); + kvm_for_each_vcpu(i, vcpu, kvm) { + kvm_apic_sync_pv_timer(vcpu); + } + unuse_mm(kvm->mm); + set_fs(oldfs); + } + + spin_unlock(_lock); + + schedule(); + } + + return 0; +} + +static void kvm_pv_timer_init(void) +{ + ktime_t ktime = ktime_set(0, pv_timer_period_ns); + + hrtimer_init(_sync_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED); + pv_sync_timer.function = _sync_timer_callback; + + /* kthread for pv_timer
[PATCH RFC 7/7] kvm: guest: reprogram guest timer
From: Ben Luo <bn0...@gmail.com> In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. When pvtimer is enabled, we introduce a new mechanism to reprogram KVM guest timer. A periodically working kthread scans share page and synchronize timer setting for guest on a dedicated CPU. The next time event of the periodically working kthread is a threshold to decide whether to program tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR, or to share page. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/include/asm/kvm_para.h |9 + arch/x86/kernel/apic/apic.c |9 ++--- arch/x86/kernel/kvm.c | 38 ++ 3 files changed, 53 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index c373e44..109e706 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -5,6 +5,7 @@ #include #include #include +#include extern void kvmclock_init(void); extern int kvm_register_clock(char *txt); @@ -92,6 +93,8 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, void kvm_async_pf_task_wait(u32 token, int interrupt_kernel); void kvm_async_pf_task_wake(u32 token); u32 kvm_read_and_reset_pf_reason(void); +int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt); extern void kvm_disable_steal_time(void); #ifdef CONFIG_PARAVIRT_SPINLOCKS @@ -126,6 +129,12 @@ static inline void kvm_disable_steal_time(void) { return; } + +static inline int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt) +{ + return 0; +} #endif #endif /* _ASM_X86_KVM_PARA_H */ diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index ff89177..286c1b3 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -471,10 +471,13 @@ static int lapic_next_event(unsigned long delta, static int lapic_next_deadline(unsigned long delta, struct clock_event_device *evt) { - u64 tsc; + u64 tsc = rdtsc() + (((u64) delta) * TSC_DIVISOR); - tsc = rdtsc(); - wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR)); + /* TODO: undisciplined function call */ + if (kvm_pv_timer_next_event(tsc, evt)) + return 0; + + wrmsrl(MSR_IA32_TSC_DEADLINE, tsc); return 0; } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8bb9594..ec7aff1 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -328,6 +328,35 @@ static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val) apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK); } +static DEFINE_PER_CPU(int, pvtimer_enabled); +static DEFINE_PER_CPU(struct pvtimer_vcpu_event_info, + pvtimer_shared_buf) = {0}; +#define PVTIMER_PADDING25000 +int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt) +{ + struct pvtimer_vcpu_event_info *src; + u64 now; + + if (!this_cpu_read(pvtimer_enabled)) + return false; + + src = this_cpu_ptr(_shared_buf); + xchg((u64 *)>expire_tsc, tsc); + + barrier(); + + if (tsc < src->next_sync_tsc) + return false; + + rdtscll(now); + if (tsc < now || tsc - now < PVTIMER_PADDING) + return false; + + return true; +} +EXPORT_SYMBOL_GPL(kvm_pv_timer_next_event); + static void kvm_guest_cpu_init(void) { if (!kvm_para_available()) @@ -362,6 +391,15 @@ static void kvm_guest_cpu_init(void) if (has_steal_clock) kvm_register_steal_time(); + + if (kvm_para_has_feature(KVM_FEATURE_PV_TIMER)) { + unsigned long data; + + data = slow_virt_to_phys(this_cpu_ptr(_shared_buf)) + 
| KVM_MSR_ENABLED; + wrmsrl(MSR_KVM_PV_TIMER_EN, data); + this_cpu_write(pvtimer_enabled, 1); + } } static void kvm_pv_disable_apf(void) -- 1.7.1
[PATCH RFC 6/7] Doc/KVM: introduce a new cpuid bit for kvm pvtimer
From: Ben Luo <bn0...@gmail.com> KVM_FEATURE_PV_TIMER enables guest to check whether pvtimer can be enabled in guest. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- Documentation/virtual/kvm/cpuid.txt |4 arch/x86/include/uapi/asm/kvm_para.h |1 + arch/x86/kvm/cpuid.c |1 + 3 files changed, 6 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt index 3c65feb..b26b31c 100644 --- a/Documentation/virtual/kvm/cpuid.txt +++ b/Documentation/virtual/kvm/cpuid.txt @@ -54,6 +54,10 @@ KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit || || before enabling paravirtualized || || spinlock support. -- +KVM_FEATURE_PV_TIMER || 8 || guest checks this feature bit + || || before enabling paravirtualized + || || timer support. +-- KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side || || per-cpu warps are expected in || || kvmclock. diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 3dd6116..46734a8 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -25,6 +25,7 @@ #define KVM_FEATURE_STEAL_TIME 5 #define KVM_FEATURE_PV_EOI 6 #define KVM_FEATURE_PV_UNHALT 7 +#define KVM_FEATURE_PV_TIMER 8 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0099e10..e02fd23 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -593,6 +593,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, (1 << KVM_FEATURE_CLOCKSOURCE2) | (1 << KVM_FEATURE_ASYNC_PF) | (1 << KVM_FEATURE_PV_EOI) | +(1 << KVM_FEATURE_PV_TIMER) | (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) | (1 << KVM_FEATURE_PV_UNHALT); -- 1.7.1
[PATCH RFC 5/7] KVM: timer: ignore timer migration if pvtimer is enabled
From: Ben Luo <bn0...@gmail.com> When pvtimer is enabled, KVM programs timer to a dedicated CPU through IPI. Whether the vCPU is on the dedicated CPU or any other CPU, the timer interrupt will be delivered properly. No need to migrate timer. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/kvm/lapic.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 5835a27..265efe6 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2282,7 +2282,7 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) { struct hrtimer *timer; - if (!lapic_in_kernel(vcpu)) + if (!lapic_in_kernel(vcpu) || pv_timer_enabled(vcpu)) return; timer = >arch.apic->lapic_timer.timer; -- 1.7.1
[PATCH RFC 4/7] KVM: timer: program timer to a dedicated CPU
From: Ben Luo <bn0...@gmail.com> KVM always registers timer on the CPU which vCPU was running on. Even though vCPU thread is rescheduled to another CPU, the timer will be migrated to the target CPU as well. When timer expired, timer interrupt could make guest-mode vCPU VM-exit on this CPU. Since the working kthread scans periodically, some of the timer events may be lost or delayed. We have to program these tsc- deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause VM-exit and KVM will signal the working thread through IPI to program timer, instread of registering on current CPU. Signed-off-by: Yang Zhang <yang.zhang...@gmail.com> Signed-off-by: Quan Xu <quan@gmail.com> Signed-off-by: Ben Luo <bn0...@gmail.com> --- arch/x86/kvm/lapic.c |8 +++- arch/x86/kvm/x86.c |7 ++- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 20a23bb..5835a27 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2072,7 +2072,13 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data) struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer); struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer); - apic_timer_expired(apic); + + if (pv_timer_enabled(apic->vcpu)) { + kvm_apic_local_deliver(apic, APIC_LVTT); + if (apic_lvtt_tscdeadline(apic)) + apic->lapic_timer.tscdeadline = 0; + } else + apic_timer_expired(apic); if (lapic_is_periodic(apic)) { advance_periodic_target_expiration(apic); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5668774..3cbb223 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -26,6 +26,7 @@ #include "tss.h" #include "kvm_cache_regs.h" #include "x86.h" +#include "lapic.h" #include "cpuid.h" #include "pmu.h" #include "hyperv.h" @@ -2196,7 +2197,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff: return kvm_x2apic_msr_write(vcpu, msr, data); case MSR_IA32_TSCDEADLINE: - kvm_set_lapic_tscdeadline_msr(vcpu, data); + if (pv_timer_enabled(vcpu)) + smp_call_function_single(PVTIMER_SYNC_CPU, + kvm_apic_sync_pv_timer, vcpu, 0); + else + kvm_set_lapic_tscdeadline_msr(vcpu, data); break; case MSR_IA32_TSC_ADJUST: if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) { -- 1.7.1
Re: [PATCH v3] KVM: X86: Fix load RFLAGS w/o the fixed bit
On 2017/12/07 16:30, Wanpeng Li wrote: From: Wanpeng Li <wanpeng...@hotmail.com> *** Guest State *** CR0: actual=0x0030, shadow=0x6010, gh_mask=fff7 CR4: actual=0x2050, shadow=0x, gh_mask=e871 CR3 = 0xfffbc000 RSP = 0x RIP = 0x RFLAGS=0x DR7 = 0x0400 ^^ The failed vmentry is triggered by the following testcase when ept=Y: #include #include #include #include #include #include #include long r[5]; int main() { r[2] = open("/dev/kvm", O_RDONLY); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7); struct kvm_regs regs = { .rflags = 0, }; ioctl(r[4], KVM_SET_REGS, ); ioctl(r[4], KVM_RUN, 0); } X86 RFLAGS bit 1 is fixed set, userspace can simply clearing bit 1 of RFLAGS with KVM_SET_REGS ioctl which results in vmentry fails. This patch fixes it by oring X86_EFLAGS_FIXED during ioctl. Suggested-by: Jim Mattson <jmatt...@google.com> Reviewed-by: David Hildenbrand <da...@redhat.com> Cc: Paolo Bonzini <pbonz...@redhat.com> Cc: Radim Krčmář <rkrc...@redhat.com> Cc: Jim Mattson <jmatt...@google.com> Signed-off-by: Wanpeng Li <wanpeng...@hotmail.com> Reviewed-by: Quan Xu <quan@gmail.com>
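The hunk itself is not quoted above; the fix described boils down to a one-liner in the KVM_SET_REGS path, roughly (the exact surrounding context may differ from this sketch):

	/* in kvm_arch_vcpu_ioctl_set_regs(): never let userspace clear the fixed bit */
	kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);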
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 00:18, David Hildenbrand wrote: On 08.12.2017 11:22, Quan Xu wrote: From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); Looks good to me. David, thanks for your review. We could now also drop diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a5d2856eb28b..732e93332f4c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -10639,7 +10639,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * Rather, exit every time. */ exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); As this will be implicitly set by vmx_exec_control() good catch. yes, L0 merge vmcs01 with vmcsl2 to create vmcs02 and run L2. 
however, as this code is related to nested virtualization, I'd better introduce it in another new patch with a "KVM: nVMX" prefix subject after this patch is applied. furthermore, I think I could drop these two lines in another new patch: - exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; as CPU_BASED_USE_IO_BITMAPS is already cleared after this patch. Quan
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 01:31, Jim Mattson wrote: On Fri, Dec 8, 2017 at 2:22 AM, Quan Xu <quan@gmail.com> wrote: From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), These two lines should stay. Jim, could you explain why these two lines should stay? IIUC, the main concern is from nested virtualization, which still uses io_bitmap_a/io_bitmap_b.. if so, we really need to further clean up these code, as CPU_BASED_USE_IO_BITMAPS is clear, and CPU_BASED_UNCOND_IO_EXITING is set for both L0/L2. after new patches which I mentioned in this thread. right? Alibaba Cloud Quan
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/12 02:08, Jim Mattson wrote: Removing these two lines from the initialization of field_to_offset_table[] means that vmcs_field_to_offset() will return -ENOENT for IO_BITMAP_A or IO_BITMAP_B. Hence, handle_vmread and handle_vmwrite will incorrectly report these fields as unsupported VMCS components if an L1 hypervisor tries to access them. I will fix in v2. Quan Alibaba Cloud On Sun, Dec 10, 2017 at 9:37 PM, Quan Xu <quan@gmail.com> wrote: On 2017/12/09 01:31, Jim Mattson wrote: On Fri, Dec 8, 2017 at 2:22 AM, Quan Xu <quan@gmail.com> wrote: From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), These two lines should stay. Jim, could you explain why these two lines should stay? IIUC, the main concern is from nested virtualization, which still uses io_bitmap_a/io_bitmap_b.. if so, we really need to further clean up these code, as CPU_BASED_USE_IO_BITMAPS is clear, and CPU_BASED_UNCOND_IO_EXITING is set for both L0/L2. after new patches which I mentioned in this thread. right? Alibaba Cloud Quan
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 00:18, David Hildenbrand wrote: On 08.12.2017 11:22, Quan Xu wrote: From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); Looks good to me. We could now also drop diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a5d2856eb28b..732e93332f4c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -10639,7 +10639,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * Rather, exit every time. */ exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); As this will be implicitly set by vmx_exec_control() David, I find some nVMX code is still using vmcs12 (.e.g. nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) dropping it would make these logic error handling.. we better leave it as is.. Quan Alibaba Cloud
[PATCH v2] KVM: VMX: drop I/O permission bitmaps
From: Quan Xu <quan@gmail.com> Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson <jmatt...@google.com> Signed-off-by: Radim Krčmář <rkrc...@redhat.com> Signed-off-by: Quan Xu <quan@gmail.com> --- arch/x86/kvm/vmx.c | 15 +-- 1 files changed, 1 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..41aaf5e 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -943,8 +943,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +956,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3628,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5441,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6743,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); -- 1.7.1
Re: [PATCH] vfio: mdev: make a couple of functions and structure vfio_mdev_driver static
On 2017/12/22 07:12, Xiongwei Song wrote: The functions vfio_mdev_probe, vfio_mdev_remove and the structure vfio_mdev_driver are only used in this file, so make them static. Clean up sparse warnings: drivers/vfio/mdev/vfio_mdev.c:114:5: warning: no previous prototype for 'vfio_mdev_probe' [-Wmissing-prototypes] drivers/vfio/mdev/vfio_mdev.c:121:6: warning: no previous prototype for 'vfio_mdev_remove' [-Wmissing-prototypes] Signed-off-by: Xiongwei Song <sxwj...@gmail.com> --- drivers/vfio/mdev/vfio_mdev.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c index fa848a701b8b..d230620fe02d 100644 --- a/drivers/vfio/mdev/vfio_mdev.c +++ b/drivers/vfio/mdev/vfio_mdev.c @@ -111,19 +111,19 @@ static const struct vfio_device_ops vfio_mdev_dev_ops = { .mmap = vfio_mdev_mmap, }; -int vfio_mdev_probe(struct device *dev) +static int vfio_mdev_probe(struct device *dev) { struct mdev_device *mdev = to_mdev_device(dev); return vfio_add_group_dev(dev, _mdev_dev_ops, mdev); } -void vfio_mdev_remove(struct device *dev) +static void vfio_mdev_remove(struct device *dev) { vfio_del_group_dev(dev); } -struct mdev_driver vfio_mdev_driver = { +static struct mdev_driver vfio_mdev_driver = { .name = "vfio_mdev", .probe = vfio_mdev_probe, .remove = vfio_mdev_remove, Reviewed-by: Quan Xu <quan@gmail.com>
Re: [PATCH] KVM: nVMX: remove unnecessary vmwrite from L2->L1 vmexit
On 2018/01/02 17:47, Liran Alon wrote: On 02/01/18 00:58, Paolo Bonzini wrote: The POSTED_INTR_NV field is constant (though it differs between the vmcs01 and vmcs02), there is no need to reload it on vmexit to L1. Signed-off-by: Paolo Bonzini <pbonz...@redhat.com> --- arch/x86/kvm/vmx.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e6223fe8faa1..1e184830a295 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11610,9 +11610,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, */ vmx_flush_tlb(vcpu, true); } - /* Restore posted intr vector. */ - if (nested_cpu_has_posted_intr(vmcs12)) - vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR); vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs); vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp); Reviewed-by: Liran Alon <liran.a...@oracle.com> I would also add to commit message: Fixes: 06a5524f091b ("KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode") Reviewed-by: Quan Xu <quan@gmail.com>
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/8/29 20:45, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote: Add poll in do_idle. For UP VM, if there are running task, it will not goes into idle path, so we only enable poll in SMP VM. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Broken SoB chain. Peter, I can't follow 'Broken SoB chain'.. could you explain more about it? -Quan diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 6c23e30..b374744 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void) } /* Weak implementations for optional arch specific functions */ +void __weak arch_cpu_idle_poll(void) { } void __weak arch_cpu_idle_prepare(void) { } void __weak arch_cpu_idle_enter(void) { } And not a word on why we need a new arch hook. What's wrong with arch_cpu_idle_enter() for instance?
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
on 2017/8/29 22:39, Borislav Petkov wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote: Add poll in do_idle. For UP VM, if there are running task, it will not goes into idle path, so we only enable poll in SMP VM. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Kyle Huey Cc: Andy Lutomirski Cc: Len Brown Cc: linux-kernel@vger.kernel.org --- arch/x86/kernel/process.c | 7 +++ kernel/sched/idle.c | 2 ++ 2 files changed, 9 insertions(+) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 3ca1980..def4113 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -332,6 +332,13 @@ void arch_cpu_idle(void) x86_idle(); } +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT) +void arch_cpu_idle_poll(void) +{ + paravirt_idle_poll(); +} +#endif So this will get called on any system which has CONFIG_PARAVIRT enabled *even* if they're not running any guests. Huh? Borislav , yes, this will get called on any system which has CONFIG_PARAVIRT enabled. but if they are not running any guests, the callback is paravirt_nop() , IIUC which is as similar as the other paravirt_*, such as paravirt_pgd_free().. - Quan
Re: [RFC PATCH v2 7/7] sched/idle: update poll time when wakeup from idle
On 2017/8/29 20:46, Peter Zijlstra wrote: On Tue, Aug 29, 2017 at 11:46:41AM +, Yang Zhang wrote: In ttwu_do_wakeup, it will update avg_idle when waking up from idle. Here we just reuse this logic to update the poll time. It may be a little late to update the poll in ttwu_do_wakeup, but the test result shows no obvious performance gap compared with updating the poll in the irq handler. one problem is that idle_stamp is only used with the CFS scheduler. But it is ok since that is the default scheduler policy, and considering only it should be enough. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Same broken SoB chain, and not a useful word on why you need to adjust this crap to begin with. What you want that poll duration to be related to is the cost of a VMEXIT/VMENTER cycle, not however long we happened to be idle. So no. Peter, I think you are right.. IIUC, the time we happened to be idle may contain a chain of VMEXIT/VMENTER cycles, which would be mainly (except the last VMEXIT/VMENTER cycle) just idle loops. right? as you mentioned, the poll duration should be related to the cost of __a__ VMEXIT/VMENTER cycle. however, it is very difficult to measure a VMEXIT/VMENTER cycle accurately from a kvm guest; we could find an approximate one -- dropping the idle loops from the time we happened to be idle.. does that make sense? Quan
Re: [RFC PATCH v2 3/7] sched/idle: Add poll before enter real idle path
On 2017/9/1 14:49, Quan Xu wrote: on 2017/8/29 22:39, Borislav Petkov wrote: On Tue, Aug 29, 2017 at 11:46:37AM +, Yang Zhang wrote: Add poll in do_idle. For UP VM, if there are running task, it will not goes into idle path, so we only enable poll in SMP VM. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: Peter Zijlstra Cc: Borislav Petkov Cc: Kyle Huey Cc: Andy Lutomirski Cc: Len Brown Cc: linux-kernel@vger.kernel.org --- arch/x86/kernel/process.c | 7 +++ kernel/sched/idle.c | 2 ++ 2 files changed, 9 insertions(+) diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 3ca1980..def4113 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -332,6 +332,13 @@ void arch_cpu_idle(void) x86_idle(); } +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT) +void arch_cpu_idle_poll(void) +{ + paravirt_idle_poll(); +} +#endif So this will get called on any system which has CONFIG_PARAVIRT enabled *even* if they're not running any guests. Huh? Borislav, on second thought, it is better to run this under an IF condition, to make sure we are actually running as a guest (see the sketch below). Quan Borislav , yes, this will get called on any system which has CONFIG_PARAVIRT enabled. but if they are not running any guests, the callback is paravirt_nop() , IIUC which is as similar as the other paravirt_*, such as paravirt_pgd_free().. - Quan
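One way to express that IF condition, as a sketch (X86_FEATURE_HYPERVISOR is set whenever we run under a hypervisor; whether the series ends up using exactly this check is still open):

void arch_cpu_idle_poll(void)
{
	/* only poll when we are actually a guest; bare metal skips the pv hook entirely */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		paravirt_idle_poll();
}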
Re: [PATCH RFC 0/7] kvm pvtimer
On 2017/12/14 19:56, Paolo Bonzini wrote: On 13/12/2017 17:28, Konrad Rzeszutek Wilk wrote: 1) VM idle path and network req/resp services: Does this go away if you don't hit the idle path? Meaning if you loop without hitting HLT/MWAIT? I am assuming the issue you are facing is the latency - that is, the first time the guest comes out of HLT and responds to the packet, the latency is much higher than without? And the arming of the timer? 2) process context switches. Is that related to 1)? That is the 'schedule' call and the process going to sleep waiting for an interrupt or timer? This all sounds like issues with low-CPU usage workloads where you need low latency responses? Even high-CPU usage, as long as there is a small idle time. The cost of setting the TSC deadline timer twice is about 3000 cycles. However, I think Amazon's approach of not intercepting HLT/MWAIT/PAUSE can recover most of the performance and it's way less intrusive. Paolo, could you share Amazon's patch or the LKML link? thanks. Quan Thanks, Paolo
Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten
On 2017/12/14 21:15, Gonglei (Arei) wrote: -Original Message- From: Liran Alon [mailto:liran.a...@oracle.com] Sent: Thursday, December 14, 2017 9:02 PM To: Gonglei (Arei); pbonz...@redhat.com; rkrc...@redhat.com Cc: k...@vger.kernel.org; linux-kernel@vger.kernel.org; Huangweidong (C) Subject: Re: [PATCH] KVM: x86: ioapic: Clear IRR for rtc bit when rtc EOI gotten On 14/12/17 14:23, Gonglei wrote: We hit a bug in our test while run PCMark 10 in a windows 7 VM, The VM got stuck and the wallclock was hang after several minutes running PCMark 10 in it. It is quite easily to reproduce the bug with the upstream KVM and Qemu. We found that KVM can not inject any RTC irq to VM after it was hang, it fails to Deliver irq in ioapic_set_irq() because RTC irq is still pending in ioapic->irr. static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq, int irq_level, bool line_status) { ... if (!irq_level) { ioapic->irr &= ~mask; ret = 1; goto out; } ... if ((edge && old_irr == ioapic->irr) || (!edge && entry.fields.remote_irr)) { ret = 0; goto out; } According to RTC spec, after RTC injects a High level irq, OS will read CMOS's register C to to clear the irq flag, and pull down the irq electric pin. For Qemu, we will emulate the reading operation in cmos_ioport_read(), but Guest OS will fire a write operation before to tell which register will be read after this write, where we use s->cmos_index to record the following register to read. But in our test, we found that there is a possible situation that Vcpu fails to read RTC_REG_C to clear irq, This could happens while two VCpus are writing/reading registers at the same time, for example, vcpu 0 is trying to read RTC_REG_C, so it write RTC_REG_C first, where the s->cmos_index will be RTC_REG_C, but before it tries to read register C, another vcpu1 is going to read RTC_YEAR, it changes s->cmos_index to RTC_YEAR by a writing action. The next operation of vcpu0 will be lead to read RTC_YEAR, In this case, we will miss calling qemu_irq_lower(s->irq) to clear the irq. After this, kvm will never inject RTC irq, and Windows VM will hang. If I understood correctly, this looks to me like a race-condition bug in the Windows guest kernel. In real-hardware this race-condition will also cause the RTC_YEAR to be read instead of RTC_REG_C. Guest kernel should make sure that 2 CPUs does not attempt to read a CMOS register in parallel as they can override each other's cmos_index. See for example how Linux kernel makes sure to avoid such kind of issues in rtc_cmos_read() (arch/x86/kernel/rtc.c) by grabbing a cmos_lock. Yes, I knew that. Let's clear IRR of rtc when corresponding EOI is gotten to avoid the issue. Can you elaborate a bit more why it makes sense to put such workaround in KVM code instead of declaring this as guest kernel bug? I agree it's a Windows bug. The big problem is there is not problem on Xen platform. have you analyzed why it is not a problem on Xen? Quan Thanks, -Gonglei Regards, -Liran Suggested-by: Paolo Bonzini Signed-off-by: Gonglei --- Thanks to Paolo provides a good solution. 
:) arch/x86/kvm/ioapic.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c index 4e822ad..5022d63 100644 --- a/arch/x86/kvm/ioapic.c +++ b/arch/x86/kvm/ioapic.c @@ -160,6 +160,7 @@ static void rtc_irq_eoi(struct kvm_ioapic *ioapic, struct kvm_vcpu *vcpu) { if (test_and_clear_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map.map)) { + ioapic->irr &= ~(1 << RTC_GSI); --ioapic->rtc_status.pending_eoi; rtc_status_pending_eoi_check_valid(ioapic); }
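For reference, the guest-side discipline Liran describes looks roughly like this (paraphrased from the Linux guest's rtc_cmos_read() in arch/x86/kernel/rtc.c; the lock name and helper are simplified here):

	static DEFINE_SPINLOCK(cmos_lock);

	static unsigned char cmos_read_locked(unsigned char addr)
	{
		unsigned long flags;
		unsigned char val;

		/* The index write (port 0x70) and the data read (port 0x71)
		 * are done under one lock, so another CPU cannot change the
		 * CMOS index in between -- exactly the step the Windows race
		 * above skips. */
		spin_lock_irqsave(&cmos_lock, flags);
		outb(addr, 0x70);
		val = inb(0x71);
		spin_unlock_irqrestore(&cmos_lock, flags);

		return val;
	}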
Re: [PATCH] vfio: mdev: make a couple of functions and structure vfio_mdev_driver static
On 2017/12/22 07:12, Xiongwei Song wrote: The functions vfio_mdev_probe, vfio_mdev_remove and the structure vfio_mdev_driver are only used in this file, so make them static. Clean up sparse warnings: drivers/vfio/mdev/vfio_mdev.c:114:5: warning: no previous prototype for 'vfio_mdev_probe' [-Wmissing-prototypes] drivers/vfio/mdev/vfio_mdev.c:121:6: warning: no previous prototype for 'vfio_mdev_remove' [-Wmissing-prototypes] Signed-off-by: Xiongwei Song --- drivers/vfio/mdev/vfio_mdev.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c index fa848a701b8b..d230620fe02d 100644 --- a/drivers/vfio/mdev/vfio_mdev.c +++ b/drivers/vfio/mdev/vfio_mdev.c @@ -111,19 +111,19 @@ static const struct vfio_device_ops vfio_mdev_dev_ops = { .mmap = vfio_mdev_mmap, }; -int vfio_mdev_probe(struct device *dev) +static int vfio_mdev_probe(struct device *dev) { struct mdev_device *mdev = to_mdev_device(dev); return vfio_add_group_dev(dev, _mdev_dev_ops, mdev); } -void vfio_mdev_remove(struct device *dev) +static void vfio_mdev_remove(struct device *dev) { vfio_del_group_dev(dev); } -struct mdev_driver vfio_mdev_driver = { +static struct mdev_driver vfio_mdev_driver = { .name = "vfio_mdev", .probe = vfio_mdev_probe, .remove = vfio_mdev_remove, Reviewed-by: Quan Xu
Re: [PATCH RFC v3 4/6] Documentation: Add three sysctls for smart idle poll
On 2017/11/13 23:08, Ingo Molnar wrote: * Quan Xu wrote: From: Quan Xu To reduce the cost of poll, we introduce three sysctl to control the poll time when running as a virtual machine with paravirt. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu --- Documentation/sysctl/kernel.txt | 35 +++ arch/x86/kernel/paravirt.c |4 include/linux/kernel.h |6 ++ kernel/sysctl.c | 34 ++ 4 files changed, 79 insertions(+), 0 deletions(-) diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 694968c..30c25fb 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -714,6 +714,41 @@ kernel tries to allocate a number starting from this one. == +paravirt_poll_grow: (X86 only) + +Multiplied value to increase the poll time. This is expected to take +effect only when running as a virtual machine with CONFIG_PARAVIRT +enabled. This can't bring any benifit on bare mental even with +CONFIG_PARAVIRT enabled. + +By default this value is 2. Possible values to set are in range {2..16}. + +== + +paravirt_poll_shrink: (X86 only) + +Divided value to reduce the poll time. This is expected to take effect +only when running as a virtual machine with CONFIG_PARAVIRT enabled. +This can't bring any benifit on bare mental even with CONFIG_PARAVIRT +enabled. + +By default this value is 2. Possible values to set are in range {2..16}. + +== + +paravirt_poll_threshold_ns: (X86 only) + +Controls the maximum poll time before entering real idle path. This is +expected to take effect only when running as a virtual machine with +CONFIG_PARAVIRT enabled. This can't bring any benifit on bare mental +even with CONFIG_PARAVIRT enabled. + +By default, this value is 0 means not to poll. Possible values to set +are in range {0..50}. Change the value to non-zero if running +latency-bound workloads in a virtual machine. I absolutely hate it how this hybrid idle loop polling mechanism is not self-tuning! Ingo, actually it is self-tuning.. Please make it all work fine by default, and automatically so, instead of adding three random parameters... .. I will make it all fine by default. howerver cloud environment is of diversity, could I only leave paravirt_poll_threshold_ns parameter (the maximum poll time), which is as similar as "adaptive halt-polling" Wanpeng mentioned.. then user can turn it off, or find an appropriate threshold for some odd scenario.. thanks for your comments!! Quan Alibaba Cloud And if it cannot be done automatically then we should rather not do it at all. Maybe the next submitter of a similar feature can think of a better approach. Thanks, Ingo
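To make the "self-tuning" claim concrete: with only paravirt_poll_threshold_ns exposed, the grow/shrink update can stay internal, roughly as below (a sketch, not the series itself; the helper name and the seed value are made up, the sysctl names are the ones from the patch):

	static DEFINE_PER_CPU(unsigned long, poll_duration_ns);

	static void update_poll_duration(bool poll_found_work)
	{
		unsigned long val = this_cpu_read(poll_duration_ns);

		if (poll_found_work) {
			/* Polling paid off: grow the window, capped by the sysctl. */
			if (!val)
				val = 1000;	/* arbitrary seed for the sketch */
			val *= paravirt_poll_grow;
			if (val > paravirt_poll_threshold_ns)
				val = paravirt_poll_threshold_ns;
		} else {
			/* The window expired without a reschedule event: shrink it. */
			val /= paravirt_poll_shrink;
		}

		this_cpu_write(poll_duration_ns, val);
	}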
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the cost of CPU is much lower than 'idle=poll' case.. Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too. I assume this feature will be ported to other archs.. a new pvops makes code clean and easy to maintain. also I tried to add it into existed pvops, but it doesn't match. Quan Alibaba Cloud Juergen
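For context, the new pvops hook being debated boils down to something like the following (the struct and field name are taken from the commit message; the wrapper and the plain indirect call are assumptions, not copied from the series):

	struct pv_idle_ops {
		void (*poll)(void);	/* spin briefly before the real idle path */
	};

	extern struct pv_idle_ops pv_idle_ops;

	static inline void paravirt_idle_poll(void)
	{
		/* The real series presumably routes this through the usual
		 * PVOP_* call macros; a plain indirect call shows the idea. */
		pv_idle_ops.poll();
	}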
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:12, Wanpeng Li wrote: 2017-11-14 15:02 GMT+08:00 Quan Xu : On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU Actually we can reduce the CPU utilization by sleeping a period of time as what has already been done in the poll logic of IO subsystem, then we can improve the algorithm in kvm instead of introduing another duplicate one in the kvm guest. We really appreciate upstream's kvm dynamic poll mechanism, which is really helpful for a lot of scenario.. However, as description said, in virtualization, idle path includes several heavy operations includes timer access (LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. for upstream's kvm dynamic poll mechanism, even you could provide a better algorism, how could you bypass timer access (LAPIC timer or TSC deadline timer), or a hardware context switch between virtual machine and hypervisor. I know these is a tradeoff. Furthermore, here is the data we get when running benchmark contextswitch to measure the latency(lower is better): 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 3402.9 ns/ctxsw -- 199.8 %CPU 2. w/ patch and disable kvm dynamic poll: 1163.5 ns/ctxsw -- 205.5 %CPU 3. w/ kvm dynamic poll: 2280.6 ns/ctxsw -- 199.5 %CPU so, these tow solution are quite similar, but not duplicate.. that's also why to add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. Quan Alibaba Cloud Regards, Wanpeng Li 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the cost of CPU is much lower than 'idle=poll' case.. Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too. 
I assume this feature will be ported to other archs.. a new pvops makes code clean and easy to maintain. also I tried to add it into existed pvops, but it doesn't match. Quan Alibaba Cloud Juergen
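To make the comparison with KVM's host-side halt-polling concrete, the guest-side poll is conceptually just a short spin on need_resched() before the normal idle entry, roughly as below (the helper name and poll_duration_tsc are illustrative, not names from the series):

	static void guest_poll_before_idle(void)
	{
		u64 start = rdtsc();

		/* A wakeup that lands inside this window never pays for the
		 * HLT VM-exit, the LAPIC/TSC-deadline reprogramming in the
		 * idle path, or the later VM-entry. */
		while (!need_resched() && rdtsc() - start < poll_duration_tsc)
			cpu_relax();
	}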
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the cost of CPU is much lower than 'idle=poll' case.. I don't question the general idea. I just think pvops isn't the best way to implement it. Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too. I assume this feature will be ported to other archs.. a new pvops makes sorry, a typo.. /other archs/other hypervisors/ it refers hypervisor like Xen, HyperV and VMware).. code clean and easy to maintain. also I tried to add it into existed pvops, but it doesn't match. You are aware that pvops is x86 only? yes, I'm aware.. I really don't see the big difference in maintainability compared to the static key / function pointer variant: void (*guest_idle_poll_func)(void); struct static_key guest_idle_poll_key __read_mostly; static inline void guest_idle_poll(void) { if (static_key_false(_idle_poll_key)) guest_idle_poll_func(); } thank you for your sample code :) I agree there is no big difference.. I think we are discussion for two things: 1) x86 VM on different hypervisors 2) different archs VM on kvm hypervisor What I want to do is x86 VM on different hypervisors, such as kvm / xen / hyperv .. And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too. .. referred to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same to pv_idle_ops. with above explaination, do you still think I need to define the static key/function pointer variant? btw, any interest to port it to Xen HVM guest? :) Quan Alibaba Cloud
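Written out in full, the static key / function pointer variant Juergen is proposing reads:

	void (*guest_idle_poll_func)(void);
	struct static_key guest_idle_poll_key __read_mostly;

	static inline void guest_idle_poll(void)
	{
		if (static_key_false(&guest_idle_poll_key))
			guest_idle_poll_func();
	}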
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 16:22, Wanpeng Li wrote: 2017-11-14 16:15 GMT+08:00 Quan Xu : On 2017/11/14 15:12, Wanpeng Li wrote: 2017-11-14 15:02 GMT+08:00 Quan Xu : On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU Actually we can reduce the CPU utilization by sleeping a period of time as what has already been done in the poll logic of IO subsystem, then we can improve the algorithm in kvm instead of introduing another duplicate one in the kvm guest. We really appreciate upstream's kvm dynamic poll mechanism, which is really helpful for a lot of scenario.. However, as description said, in virtualization, idle path includes several heavy operations includes timer access (LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. for upstream's kvm dynamic poll mechanism, even you could provide a better algorism, how could you bypass timer access (LAPIC timer or TSC deadline timer), or a hardware context switch between virtual machine and hypervisor. I know these is a tradeoff. Furthermore, here is the data we get when running benchmark contextswitch to measure the latency(lower is better): 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 3402.9 ns/ctxsw -- 199.8 %CPU 2. w/ patch and disable kvm dynamic poll: 1163.5 ns/ctxsw -- 205.5 %CPU 3. w/ kvm dynamic poll: 2280.6 ns/ctxsw -- 199.5 %CPU so, these tow solution are quite similar, but not duplicate.. that's also why to add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. There is a similar logic in the idle governor/driver, so how this patchset influence the decision in the idle governor/driver when running on bare-metal(power managment is not exposed to the guest so we will not enter into idle driver in the guest)? This is expected to take effect only when running as a virtual machine with proper CONFIG_* enabled. This can not work on bare mental even with proper CONFIG_* enabled. Quan Alibaba Cloud
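A small sketch of the "only takes effect in a VM" point (assumed wiring on top of the pv_idle_ops sketch above; kvm_idle_poll and the init helper are hypothetical names): the hook defaults to a nop and only the KVM guest init code replaces it, so bare metal never polls.

	struct pv_idle_ops pv_idle_ops = {
		.poll = paravirt_nop,	/* bare metal: idle entry is unchanged */
	};

	static void __init kvm_guest_idle_poll_init(void)
	{
		/* Hypothetical guest-side hookup: only when running on KVM. */
		if (kvm_para_available())
			pv_idle_ops.poll = kvm_idle_poll;
	}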
Re: [PATCH RFC v3 1/6] x86/paravirt: Add pv_idle_ops to paravirt ops
On 2017/11/14 18:27, Juergen Gross wrote: On 14/11/17 10:38, Quan Xu wrote: On 2017/11/14 15:30, Juergen Gross wrote: On 14/11/17 08:02, Quan Xu wrote: On 2017/11/13 18:53, Juergen Gross wrote: On 13/11/17 11:06, Quan Xu wrote: From: Quan Xu So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in idle path which will poll for a while before we enter the real idle state. In virtualization, idle path includes several heavy operations includes timer access(LAPIC timer or TSC deadline timer) which will hurt performance especially for latency intensive workload like message passing task. The cost is mainly from the vmexit which is a hardware context switch between virtual machine and hypervisor. Our solution is to poll for a while and do not enter real idle path if we can get the schedule event during polling. Poll may cause the CPU waste so we adopt a smart polling mechanism to reduce the useless poll. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Cc: Juergen Gross Cc: Alok Kataria Cc: Rusty Russell Cc: Thomas Gleixner Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: x...@kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: linux-kernel@vger.kernel.org Cc: xen-de...@lists.xenproject.org Hmm, is the idle entry path really so critical to performance that a new pvops function is necessary? Juergen, Here is the data we get when running benchmark netperf: 1. w/o patch and disable kvm dynamic poll (halt_poll_ns=0): 29031.6 bit/s -- 76.1 %CPU 2. w/ patch and disable kvm dynamic poll (halt_poll_ns=0): 35787.7 bit/s -- 129.4 %CPU 3. w/ kvm dynamic poll: 35735.6 bit/s -- 200.0 %CPU 4. w/patch and w/ kvm dynamic poll: 42225.3 bit/s -- 198.7 %CPU 5. idle=poll 37081.7 bit/s -- 998.1 %CPU w/ this patch, we will improve performance by 23%.. even we could improve performance by 45.4%, if we use w/patch and w/ kvm dynamic poll. also the cost of CPU is much lower than 'idle=poll' case.. I don't question the general idea. I just think pvops isn't the best way to implement it. Wouldn't a function pointer, maybe guarded by a static key, be enough? A further advantage would be that this would work on other architectures, too. I assume this feature will be ported to other archs.. a new pvops makes sorry, a typo.. /other archs/other hypervisors/ it refers hypervisor like Xen, HyperV and VMware).. code clean and easy to maintain. also I tried to add it into existed pvops, but it doesn't match. You are aware that pvops is x86 only? yes, I'm aware.. I really don't see the big difference in maintainability compared to the static key / function pointer variant: void (*guest_idle_poll_func)(void); struct static_key guest_idle_poll_key __read_mostly; static inline void guest_idle_poll(void) { if (static_key_false(_idle_poll_key)) guest_idle_poll_func(); } thank you for your sample code :) I agree there is no big difference.. I think we are discussion for two things: 1) x86 VM on different hypervisors 2) different archs VM on kvm hypervisor What I want to do is x86 VM on different hypervisors, such as kvm / xen / hyperv .. Why limit the solution to x86 if the more general solution isn't harder? As you didn't give any reason why the pvops approach is better other than you don't care for non-x86 platforms you won't get an "Ack" from me for this patch. It just looks a little odder to me. I understand you care about no-x86 arch. 
Are you aware of 'pv_time_ops' for the arm64/arm/x86 archs, defined in - arch/arm64/include/asm/paravirt.h - arch/x86/include/asm/paravirt_types.h - arch/arm/include/asm/paravirt.h I am unfamiliar with the arm code. IIUC, if you'd implement pv_idle_ops for the arm/arm64 archs, you'd define the same structure in - arch/arm64/include/asm/paravirt.h or - arch/arm/include/asm/paravirt.h .. instead of a static key / function. then implement a real function in - arch/arm/kernel/paravirt.c. Also I wonder HOW/WHERE to define a static key/function so that it benefits both x86 and non-x86 archs? Quan Alibaba Cloud And KVM would just need to set guest_idle_poll_func and enable the static key. Works on non-x86 architectures, too. .. referring to 'pv_mmu_ops', HyperV and Xen can implement their own functions for 'pv_mmu_ops'. I think it is the same for pv_idle_ops. With the above explanation, do you still think I need to define the static key/function pointer variant? btw, any interest to port it to Xen HVM guest? :) Maybe. But this should work for Xen on ARM, too. Juergen
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:45, Daniel Lezcano wrote: On 16/11/2017 10:12, Quan Xu wrote: On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang Implement a generic idle poll which resembles the functionality found in arch/. Provide weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed. No, we want less of those magic hooks, not more. Interrupts arrive which may not cause a reschedule in idle loops. In KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry for interrupts and VM-exit immediately. Also this becomes more expensive than bare metal. Add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. Why not do a HV specific idle driver? If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But lets not proliferate that. I'd rather see that go away. agreed. Even we can get more benifit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in kvm guest. I won't proliferate that.. But the irq_timings stuff is heading into the same direction, with a more complex prediction logic which should tell you pretty good how long that idle period is going to be and in case of an interrupt heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision. interesting. I have tested with IRQ_TIMINGS related code, which seems not working so far. I don't know how you tested it, can you elaborate what you meant by "seems not working so far" ? Daniel, I tried to enable IRQ_TIMINGS* manually. used irq_timings_next_event() to return estimation of the earliest interrupt. However I got a constant. There are still some work to do to be more efficient. The prediction based on the irq timings is all right if the interrupts have a simple periodicity. But as soon as there is a pattern, the current code can't handle it properly and does bad predictions. I'm working on a self-learning pattern detection which is too heavy for the kernel, and with it we should be able to detect properly the patterns and re-ajust the period if it changes. I'm in the process of making it suitable for kernel code (both math and perf). One improvement which can be done right now and which can help you is the interrupts rate on the CPU. It is possible to compute it and that will give an accurate information for the polling decision. As tglx said, talk to each other / work together to make it usable for all use cases. could you share how to enable it to get the interrupts rate on the CPU? I can try it in cloud scenario. of course, I'd like to work with you to improve it. Quan Alibaba Cloud
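One self-contained way to estimate a per-CPU interrupt rate for the polling decision Daniel mentions (illustrative only; this is not his irq_timings work, and the helper name is made up) is to sample the aggregated per-CPU interrupt count and divide by the elapsed time:

	static u64 cpu_irq_rate(unsigned int cpu, u64 *last_count, u64 *last_ns)
	{
		u64 now = ktime_get_ns();
		u64 count = kstat_cpu_irqs_sum(cpu);
		u64 rate = 0;

		/* Interrupts per second since the previous sample. */
		if (*last_ns && now > *last_ns)
			rate = div64_u64((count - *last_count) * NSEC_PER_SEC,
					 now - *last_ns);

		*last_count = count;
		*last_ns = now;

		return rate;
	}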
Re: [Xen-devel] [PATCH RFC v3 0/6] x86/idle: add halt poll support
On 2017-11-16 05:31, Konrad Rzeszutek Wilk wrote: On Mon, Nov 13, 2017 at 06:05:59PM +0800, Quan Xu wrote: From: Yang Zhang Some latency-intensive workload have seen obviously performance drop when running inside VM. The main reason is that the overhead is amplified when running inside VM. The most cost I have seen is inside idle path. Meaning a VMEXIT b/c it is a 'halt' operation? And then going back into the guest (VMRESUME) takes time. And hence your latency gets all whacked b/c of this? Konrad, I can't follow 'b/c' here.. sorry. So if I understand - you want to use your _full_ timeslice (of the guest) without ever (or as much as possible) going into the hypervisor? as much as possible. Which means in effect you don't care about power-saving or CPUfreq savings, you just want to eat the full CPU for snack? actually, we care about power-saving. The poll duration is self-tuning, otherwise it is almost the same as 'idle=poll'. Also, we always report CPU usage along with the netperf/ctxsw benchmark results. We got much more performance with only a limited increase in CPU usage. This patch introduces a new mechanism to poll for a while before entering idle state. If schedule is needed during poll, then we don't need to goes through the heavy overhead path. Schedule of what? The guest or the host? a reschedule in the guest scheduler.. it is the guest. Quan Alibaba Cloud
Re: [PATCH] KVM: nVMX: remove unnecessary vmwrite from L2->L1 vmexit
On 2018/01/02 17:47, Liran Alon wrote: On 02/01/18 00:58, Paolo Bonzini wrote: The POSTED_INTR_NV field is constant (though it differs between the vmcs01 and vmcs02), there is no need to reload it on vmexit to L1. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/vmx.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e6223fe8faa1..1e184830a295 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -11610,9 +11610,6 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu, */ vmx_flush_tlb(vcpu, true); } - /* Restore posted intr vector. */ - if (nested_cpu_has_posted_intr(vmcs12)) - vmcs_write16(POSTED_INTR_NV, POSTED_INTR_VECTOR); vmcs_write32(GUEST_SYSENTER_CS, vmcs12->host_ia32_sysenter_cs); vmcs_writel(GUEST_SYSENTER_ESP, vmcs12->host_ia32_sysenter_esp); Reviewed-by: Liran Alon I would also add to commit message: Fixes: 06a5524f091b ("KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode") Reviewed-by: Quan Xu
Re: [PATCH v3] KVM: X86: Fix load RFLAGS w/o the fixed bit
On 2017/12/07 16:30, Wanpeng Li wrote: From: Wanpeng Li *** Guest State *** CR0: actual=0x0030, shadow=0x6010, gh_mask=fff7 CR4: actual=0x2050, shadow=0x, gh_mask=e871 CR3 = 0xfffbc000 RSP = 0x RIP = 0x RFLAGS=0x DR7 = 0x0400 ^^ The failed vmentry is triggered by the following testcase when ept=Y: #include #include #include #include #include #include #include long r[5]; int main() { r[2] = open("/dev/kvm", O_RDONLY); r[3] = ioctl(r[2], KVM_CREATE_VM, 0); r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7); struct kvm_regs regs = { .rflags = 0, }; ioctl(r[4], KVM_SET_REGS, ); ioctl(r[4], KVM_RUN, 0); } X86 RFLAGS bit 1 is fixed set, userspace can simply clearing bit 1 of RFLAGS with KVM_SET_REGS ioctl which results in vmentry fails. This patch fixes it by oring X86_EFLAGS_FIXED during ioctl. Suggested-by: Jim Mattson Reviewed-by: David Hildenbrand Cc: Paolo Bonzini Cc: Radim Krčmář Cc: Jim Mattson Signed-off-by: Wanpeng Li Reviewed-by: Quan Xu
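The essence of the fix is a one-liner in the KVM_SET_REGS handler in arch/x86/kvm/x86.c; presumably along the lines of:

	/* RFLAGS bit 1 is architecturally fixed to 1, so OR it back in no
	 * matter what userspace passed; a zero rflags can then no longer
	 * produce an invalid guest state and a failed vmentry. */
	kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);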
[PATCH RFC 1/7] kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR
From: Ben Luo Guest enables pv timer functionality using this MSR Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/include/asm/kvm_host.h |5 + arch/x86/include/uapi/asm/kvm_para.h |6 ++ arch/x86/kvm/lapic.c | 22 ++ arch/x86/kvm/lapic.h |6 ++ arch/x86/kvm/x86.c |8 5 files changed, 47 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index c73e493..641b4aa 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -684,6 +684,11 @@ struct kvm_vcpu_arch { bool pv_unhalted; } pv; + struct { + u64 msr_val; + struct gfn_to_hva_cache data; + } pv_timer; + int pending_ioapic_eoi; int pending_external_vector; diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 554aa8f..3dd6116 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -41,6 +41,7 @@ #define MSR_KVM_ASYNC_PF_EN 0x4b564d02 #define MSR_KVM_STEAL_TIME 0x4b564d03 #define MSR_KVM_PV_EOI_EN 0x4b564d04 +#define MSR_KVM_PV_TIMER_EN0x4b564d05 struct kvm_steal_time { __u64 steal; @@ -64,6 +65,11 @@ struct kvm_clock_pairing { #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1))) #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1) +struct pvtimer_vcpu_event_info { + __u64 expire_tsc; + __u64 next_sync_tsc; +} __attribute__((__packed__)); + #define KVM_MAX_MMU_OP_BATCH 32 #define KVM_ASYNC_PF_ENABLED (1 << 0) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 36c90d6..55c9ba3 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1991,6 +1991,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) kvm_lapic_set_base(vcpu, vcpu->arch.apic_base | MSR_IA32_APICBASE_BSP); vcpu->arch.pv_eoi.msr_val = 0; + vcpu->arch.pv_timer.msr_val = 0; apic_update_ppr(apic); if (vcpu->arch.apicv_active) { kvm_x86_ops->apicv_post_state_restore(vcpu); @@ -2478,6 +2479,27 @@ int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data) addr, sizeof(u8)); } +int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data) +{ + u64 addr = data & ~KVM_MSR_ENABLED; + int ret; + + if (!lapic_in_kernel(vcpu)) + return 1; + + if (!IS_ALIGNED(addr, 4)) + return 1; + + vcpu->arch.pv_timer.msr_val = data; + if (!pv_timer_enabled(vcpu)) + return 0; + + ret = kvm_gfn_to_hva_cache_init(vcpu->kvm, >arch.pv_timer.data, + addr, sizeof(struct pvtimer_vcpu_event_info)); + + return ret; +} + void kvm_apic_accept_events(struct kvm_vcpu *vcpu) { struct kvm_lapic *apic = vcpu->arch.apic; diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 4b9935a..539a738 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -113,6 +113,7 @@ static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu) } int kvm_lapic_enable_pv_eoi(struct kvm_vcpu *vcpu, u64 data); +int kvm_lapic_enable_pv_timer(struct kvm_vcpu *vcpu, u64 data); void kvm_lapic_init(void); void kvm_lapic_exit(void); @@ -207,6 +208,11 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu) return lapic_in_kernel(vcpu) && test_bit(KVM_APIC_INIT, >arch.apic->pending_events); } +static inline bool pv_timer_enabled(struct kvm_vcpu *vcpu) +{ + return vcpu->arch.pv_timer.msr_val & KVM_MSR_ENABLED; +} + bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector); void wait_lapic_expire(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 03869eb..5668774 100644 --- a/arch/x86/kvm/x86.c +++ 
b/arch/x86/kvm/x86.c @@ -1025,6 +1025,7 @@ bool kvm_rdpmc(struct kvm_vcpu *vcpu) HV_X64_MSR_STIMER0_CONFIG, HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME, MSR_KVM_PV_EOI_EN, + MSR_KVM_PV_TIMER_EN, MSR_IA32_TSC_ADJUST, MSR_IA32_TSCDEADLINE, @@ -2279,6 +2280,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (kvm_lapic_enable_pv_eoi(vcpu, data)) return 1; break; + case MSR_KVM_PV_TIMER_EN: + if (kvm_lapic_enable_pv_timer(vcpu, data)) + return 1; + break; case MSR_IA32_MCG_CTL: case MSR_IA32_MCG_STATUS: @@ -2510,6 +2515,9 @@ int kvm_get_msr_common(struc
[PATCH RFC 0/7] kvm pvtimer
From: Ben Luo This patchset introduces a new paravirtualized mechanism to reduce VM-exit caused by guest timer accessing. In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and then KVM handles this timer for guest. Also kvm always registers timer on the CPU which vCPU was running on. Even though vCPU thread is rescheduled to another CPU, the timer will be migrated to the target CPU as well. When timer expired, timer interrupt could make guest-mode vCPU VM-exit on this CPU. When pvtimer is enabled: - The tsc-deadline timestamp is mostly recorded in share page with less VM-exit. We Introduce a periodically working kthread to scan share page and synchronize timer setting for guest on a dedicated CPU. - Since the working kthread scans periodically, some of the timer events may be lost or delayed. We have to program these tsc-deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause VM-exit and KVM will signal the working thread through IPI to program timer, instread of registering on current CPU. - Timer interrupt will be delivered by posted interrupt mechanism to vCPUs with less VM-exit. Ben Luo (7): kvm: x86: emulate MSR_KVM_PV_TIMER_EN MSR kvm: x86: add a function to exchange value KVM: timer: synchronize tsc-deadline timestamp for guest KVM: timer: program timer to a dedicated CPU KVM: timer: ignore timer migration if pvtimer is enabled Doc/KVM: introduce a new cpuid bit for kvm pvtimer kvm: guest: reprogram guest timer Documentation/virtual/kvm/cpuid.txt |4 + arch/x86/include/asm/kvm_host.h |5 + arch/x86/include/asm/kvm_para.h |9 ++ arch/x86/include/uapi/asm/kvm_para.h |7 ++ arch/x86/kernel/apic/apic.c |9 +- arch/x86/kernel/kvm.c| 38 arch/x86/kvm/cpuid.c |1 + arch/x86/kvm/lapic.c | 170 +- arch/x86/kvm/lapic.h | 11 ++ arch/x86/kvm/x86.c | 15 +++- include/linux/kvm_host.h |3 + virt/kvm/kvm_main.c | 42 + 12 files changed, 308 insertions(+), 6 deletions(-)
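Condensed into one helper, the guest-side decision described above (using struct pvtimer_vcpu_event_info from patch 1/7; this restatement is a sketch for readability, not part of the series):

	static bool pv_timer_can_skip_msr_write(struct pvtimer_vcpu_event_info *buf,
						u64 deadline_tsc, u64 now_tsc,
						u64 padding_tsc)
	{
		/* Always publish the deadline for the host-side sync kthread. */
		xchg(&buf->expire_tsc, deadline_tsc);

		/* A deadline due before the kthread's next scan, already in
		 * the past, or about to fire must still be programmed through
		 * MSR_IA32_TSC_DEADLINE (the VM-exit path), or it would be
		 * lost or delayed. */
		if (deadline_tsc < buf->next_sync_tsc)
			return false;
		if (deadline_tsc < now_tsc || deadline_tsc - now_tsc < padding_tsc)
			return false;

		/* Otherwise the sync kthread will pick it up: no VM-exit needed. */
		return true;
	}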
[PATCH RFC 2/7] kvm: x86: add a function to exchange value
From: Ben Luo Introduce kvm_xchg_guest_cached to exchange value with guest page atomically. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- include/linux/kvm_host.h |3 +++ virt/kvm/kvm_main.c | 42 ++ 2 files changed, 45 insertions(+), 0 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 6882538..32949ed 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -688,6 +688,9 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, void *data, int offset, unsigned long len); int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc, gpa_t gpa, unsigned long len); +unsigned long kvm_xchg_guest_cached(struct kvm *kvm, + struct gfn_to_hva_cache *ghc, unsigned long offset, + unsigned long new, int size); int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len); int kvm_clear_guest(struct kvm *kvm, gpa_t gpa, unsigned long len); struct kvm_memory_slot *gfn_to_memslot(struct kvm *kvm, gfn_t gfn); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 9deb5a2..3149e17 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2010,6 +2010,48 @@ int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, } EXPORT_SYMBOL_GPL(kvm_read_guest_cached); +unsigned long kvm_xchg_guest_cached(struct kvm *kvm, + struct gfn_to_hva_cache *ghc, unsigned long offset, + unsigned long new, int size) +{ + unsigned long r; + void *kva; + struct page *page; + kvm_pfn_t pfn; + + WARN_ON(offset > ghc->len); + + pfn = gfn_to_pfn_atomic(kvm, ghc->gpa >> PAGE_SHIFT); + page = kvm_pfn_to_page(pfn); + + if (is_error_page(page)) + return -EFAULT; + + kva = kmap_atomic(page) + offset_in_page(ghc->gpa) + offset; + switch (size) { + case 1: + r = xchg((char *)kva, new); + break; + case 2: + r = xchg((short *)kva, new); + break; + case 4: + r = xchg((int *)kva, new); + break; + case 8: + r = xchg((long *)kva, new); + break; + default: + kunmap_atomic(kva); + return -EFAULT; + } + + kunmap_atomic(kva); + mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT); + + return r; +} + int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len) { const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0))); -- 1.7.1
[PATCH RFC 5/7] KVM: timer: ignore timer migration if pvtimer is enabled
From: Ben Luo When pvtimer is enabled, KVM programs timer to a dedicated CPU through IPI. Whether the vCPU is on the dedicated CPU or any other CPU, the timer interrupt will be delivered properly. No need to migrate timer. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/kvm/lapic.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 5835a27..265efe6 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2282,7 +2282,7 @@ void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) { struct hrtimer *timer; - if (!lapic_in_kernel(vcpu)) + if (!lapic_in_kernel(vcpu) || pv_timer_enabled(vcpu)) return; timer = >arch.apic->lapic_timer.timer; -- 1.7.1
[PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest
From: Ben Luo In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and then KVM handles this timer for guest. The tsc-deadline timestamp is mostly recorded in share page with less VM-exit. We Introduce a periodically working kthread to scan share page and synchronize timer setting for guest on a dedicated CPU. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/kvm/lapic.c | 138 ++ arch/x86/kvm/lapic.h |5 ++ 2 files changed, 143 insertions(+), 0 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 55c9ba3..20a23bb 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -36,6 +36,10 @@ #include #include #include +#include +#include +#include +#include #include "kvm_cache_regs.h" #include "irq.h" #include "trace.h" @@ -70,6 +74,12 @@ #define APIC_BROADCAST 0xFF #define X2APIC_BROADCAST 0xul +static struct hrtimer pv_sync_timer; +static long pv_timer_period_ns = PVTIMER_PERIOD_NS; +static struct task_struct *pv_timer_polling_worker; + +module_param(pv_timer_period_ns, long, 0644); + static inline int apic_test_vector(int vec, void *bitmap) { return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec)); @@ -2542,8 +2552,130 @@ void kvm_apic_accept_events(struct kvm_vcpu *vcpu) } } +static enum hrtimer_restart pv_sync_timer_callback(struct hrtimer *timer) +{ + hrtimer_forward_now(timer, ns_to_ktime(pv_timer_period_ns)); + wake_up_process(pv_timer_polling_worker); + + return HRTIMER_RESTART; +} + +void kvm_apic_sync_pv_timer(void *data) +{ + struct kvm_vcpu *vcpu = data; + struct kvm_lapic *apic = vcpu->arch.apic; + unsigned long flags, this_tsc_khz = vcpu->arch.virtual_tsc_khz; + u64 guest_tsc, expire_tsc; + long rem_tsc; + + if (!lapic_in_kernel(vcpu) || !pv_timer_enabled(vcpu)) + return; + + local_irq_save(flags); + guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc()); + rem_tsc = ktime_to_ns(hrtimer_get_remaining(_sync_timer)) + * this_tsc_khz; + if (rem_tsc <= 0) + rem_tsc += pv_timer_period_ns * this_tsc_khz; + do_div(rem_tsc, 100L); + + /* +* make sure guest_tsc and rem_tsc are assigned before to update +* next_sync_tsc. +*/ + smp_wmb(); + kvm_xchg_guest_cached(vcpu->kvm, >arch.pv_timer.data, + offsetof(struct pvtimer_vcpu_event_info, next_sync_tsc), + guest_tsc + rem_tsc, 8); + + /* make sure next_sync_tsc is visible */ + smp_wmb(); + + expire_tsc = kvm_xchg_guest_cached(vcpu->kvm, >arch.pv_timer.data, + offsetof(struct pvtimer_vcpu_event_info, expire_tsc), + 0UL, 8); + + /* make sure expire_tsc is visible */ + smp_wmb(); + + if (expire_tsc) { + if (expire_tsc > guest_tsc) + /* +* As we bind this thread to a dedicated CPU through +* IPI, the timer is registered on that dedicated +* CPU here. 
+*/ + kvm_set_lapic_tscdeadline_msr(apic->vcpu, expire_tsc); + else + /* deliver immediately if expired */ + kvm_apic_local_deliver(apic, APIC_LVTT); + } + local_irq_restore(flags); +} + +static int pv_timer_polling(void *arg) +{ + struct kvm *kvm; + struct kvm_vcpu *vcpu; + int i; + mm_segment_t oldfs = get_fs(); + + while (1) { + set_current_state(TASK_INTERRUPTIBLE); + + if (kthread_should_stop()) { + __set_current_state(TASK_RUNNING); + break; + } + + spin_lock(_lock); + __set_current_state(TASK_RUNNING); + list_for_each_entry(kvm, _list, vm_list) { + set_fs(USER_DS); + use_mm(kvm->mm); + kvm_for_each_vcpu(i, vcpu, kvm) { + kvm_apic_sync_pv_timer(vcpu); + } + unuse_mm(kvm->mm); + set_fs(oldfs); + } + + spin_unlock(_lock); + + schedule(); + } + + return 0; +} + +static void kvm_pv_timer_init(void) +{ + ktime_t ktime = ktime_set(0, pv_timer_period_ns); + + hrtimer_init(_sync_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED); + pv_sync_timer.function = _sync_timer_callback; + + /* kthread for pv_timer sync buffer */ + pv_timer_polling_worker = kthread_create(pv_timer_polling, NULL, +
[PATCH RFC 7/7] kvm: guest: reprogram guest timer
From: Ben Luo In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. When pvtimer is enabled, we introduce a new mechanism to reprogram KVM guest timer. A periodically working kthread scans share page and synchronize timer setting for guest on a dedicated CPU. The next time event of the periodically working kthread is a threshold to decide whether to program tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR, or to share page. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/include/asm/kvm_para.h |9 + arch/x86/kernel/apic/apic.c |9 ++--- arch/x86/kernel/kvm.c | 38 ++ 3 files changed, 53 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h index c373e44..109e706 100644 --- a/arch/x86/include/asm/kvm_para.h +++ b/arch/x86/include/asm/kvm_para.h @@ -5,6 +5,7 @@ #include #include #include +#include extern void kvmclock_init(void); extern int kvm_register_clock(char *txt); @@ -92,6 +93,8 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1, void kvm_async_pf_task_wait(u32 token, int interrupt_kernel); void kvm_async_pf_task_wake(u32 token); u32 kvm_read_and_reset_pf_reason(void); +int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt); extern void kvm_disable_steal_time(void); #ifdef CONFIG_PARAVIRT_SPINLOCKS @@ -126,6 +129,12 @@ static inline void kvm_disable_steal_time(void) { return; } + +static inline int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt) +{ + return 0; +} #endif #endif /* _ASM_X86_KVM_PARA_H */ diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index ff89177..286c1b3 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -471,10 +471,13 @@ static int lapic_next_event(unsigned long delta, static int lapic_next_deadline(unsigned long delta, struct clock_event_device *evt) { - u64 tsc; + u64 tsc = rdtsc() + (((u64) delta) * TSC_DIVISOR); - tsc = rdtsc(); - wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + (((u64) delta) * TSC_DIVISOR)); + /* TODO: undisciplined function call */ + if (kvm_pv_timer_next_event(tsc, evt)) + return 0; + + wrmsrl(MSR_IA32_TSC_DEADLINE, tsc); return 0; } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8bb9594..ec7aff1 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -328,6 +328,35 @@ static notrace void kvm_guest_apic_eoi_write(u32 reg, u32 val) apic->native_eoi_write(APIC_EOI, APIC_EOI_ACK); } +static DEFINE_PER_CPU(int, pvtimer_enabled); +static DEFINE_PER_CPU(struct pvtimer_vcpu_event_info, + pvtimer_shared_buf) = {0}; +#define PVTIMER_PADDING25000 +int kvm_pv_timer_next_event(unsigned long tsc, + struct clock_event_device *evt) +{ + struct pvtimer_vcpu_event_info *src; + u64 now; + + if (!this_cpu_read(pvtimer_enabled)) + return false; + + src = this_cpu_ptr(_shared_buf); + xchg((u64 *)>expire_tsc, tsc); + + barrier(); + + if (tsc < src->next_sync_tsc) + return false; + + rdtscll(now); + if (tsc < now || tsc - now < PVTIMER_PADDING) + return false; + + return true; +} +EXPORT_SYMBOL_GPL(kvm_pv_timer_next_event); + static void kvm_guest_cpu_init(void) { if (!kvm_para_available()) @@ -362,6 +391,15 @@ static void kvm_guest_cpu_init(void) if (has_steal_clock) kvm_register_steal_time(); + + if (kvm_para_has_feature(KVM_FEATURE_PV_TIMER)) { + unsigned long data; + + data = slow_virt_to_phys(this_cpu_ptr(_shared_buf)) + | KVM_MSR_ENABLED; + wrmsrl(MSR_KVM_PV_TIMER_EN, data); + 
this_cpu_write(pvtimer_enabled, 1); + } } static void kvm_pv_disable_apf(void) -- 1.7.1
[PATCH RFC 6/7] Doc/KVM: introduce a new cpuid bit for kvm pvtimer
From: Ben Luo KVM_FEATURE_PV_TIMER enables guest to check whether pvtimer can be enabled in guest. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- Documentation/virtual/kvm/cpuid.txt |4 arch/x86/include/uapi/asm/kvm_para.h |1 + arch/x86/kvm/cpuid.c |1 + 3 files changed, 6 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/cpuid.txt b/Documentation/virtual/kvm/cpuid.txt index 3c65feb..b26b31c 100644 --- a/Documentation/virtual/kvm/cpuid.txt +++ b/Documentation/virtual/kvm/cpuid.txt @@ -54,6 +54,10 @@ KVM_FEATURE_PV_UNHALT || 7 || guest checks this feature bit || || before enabling paravirtualized || || spinlock support. -- +KVM_FEATURE_PV_TIMER || 8 || guest checks this feature bit + || || before enabling paravirtualized + || || timer support. +-- KVM_FEATURE_CLOCKSOURCE_STABLE_BIT ||24 || host will warn if no guest-side || || per-cpu warps are expected in || || kvmclock. diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h index 3dd6116..46734a8 100644 --- a/arch/x86/include/uapi/asm/kvm_para.h +++ b/arch/x86/include/uapi/asm/kvm_para.h @@ -25,6 +25,7 @@ #define KVM_FEATURE_STEAL_TIME 5 #define KVM_FEATURE_PV_EOI 6 #define KVM_FEATURE_PV_UNHALT 7 +#define KVM_FEATURE_PV_TIMER 8 /* The last 8 bits are used to indicate how to interpret the flags field * in pvclock structure. If no bits are set, all flags are ignored. diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 0099e10..e02fd23 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -593,6 +593,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, (1 << KVM_FEATURE_CLOCKSOURCE2) | (1 << KVM_FEATURE_ASYNC_PF) | (1 << KVM_FEATURE_PV_EOI) | +(1 << KVM_FEATURE_PV_TIMER) | (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) | (1 << KVM_FEATURE_PV_UNHALT); -- 1.7.1
[PATCH RFC 4/7] KVM: timer: program timer to a dedicated CPU
From: Ben Luo KVM always registers timer on the CPU which vCPU was running on. Even though vCPU thread is rescheduled to another CPU, the timer will be migrated to the target CPU as well. When timer expired, timer interrupt could make guest-mode vCPU VM-exit on this CPU. Since the working kthread scans periodically, some of the timer events may be lost or delayed. We have to program these tsc- deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause VM-exit and KVM will signal the working thread through IPI to program timer, instread of registering on current CPU. Signed-off-by: Yang Zhang Signed-off-by: Quan Xu Signed-off-by: Ben Luo --- arch/x86/kvm/lapic.c |8 +++- arch/x86/kvm/x86.c |7 ++- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 20a23bb..5835a27 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2072,7 +2072,13 @@ static enum hrtimer_restart apic_timer_fn(struct hrtimer *data) struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer); struct kvm_lapic *apic = container_of(ktimer, struct kvm_lapic, lapic_timer); - apic_timer_expired(apic); + + if (pv_timer_enabled(apic->vcpu)) { + kvm_apic_local_deliver(apic, APIC_LVTT); + if (apic_lvtt_tscdeadline(apic)) + apic->lapic_timer.tscdeadline = 0; + } else + apic_timer_expired(apic); if (lapic_is_periodic(apic)) { advance_periodic_target_expiration(apic); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5668774..3cbb223 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -26,6 +26,7 @@ #include "tss.h" #include "kvm_cache_regs.h" #include "x86.h" +#include "lapic.h" #include "cpuid.h" #include "pmu.h" #include "hyperv.h" @@ -2196,7 +2197,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff: return kvm_x2apic_msr_write(vcpu, msr, data); case MSR_IA32_TSCDEADLINE: - kvm_set_lapic_tscdeadline_msr(vcpu, data); + if (pv_timer_enabled(vcpu)) + smp_call_function_single(PVTIMER_SYNC_CPU, + kvm_apic_sync_pv_timer, vcpu, 0); + else + kvm_set_lapic_tscdeadline_msr(vcpu, data); break; case MSR_IA32_TSC_ADJUST: if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) { -- 1.7.1
[PATCH] KVM: VMX: drop I/O permission bitmaps
From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); -- 1.7.1
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 00:18, David Hildenbrand wrote: On 08.12.2017 11:22, Quan Xu wrote: From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); Looks good to me. David, thanks for your review. We could now also drop diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a5d2856eb28b..732e93332f4c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -10639,7 +10639,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * Rather, exit every time. */ exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); As this will be implicitly set by vmx_exec_control() good catch. yes, L0 merge vmcs01 with vmcsl2 to create vmcs02 and run L2. however as this code is related to nested virtualization, I better introduce it in another new patch with "KVM: NVMX" prefix subject after this patch is applied. 
furthermore, I think I could drop these two lines in another new patch: - exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; as CPU_BASED_USE_IO_BITMAPS is cleared after this patch. Quan
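For completeness, that follow-up nVMX cleanup would then presumably be just this, combining David's hunk with the line above (untested sketch on top of the current patch):

--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	 * Rather, exit every time.
 	 */
-	exec_control &= ~CPU_BASED_USE_IO_BITMAPS;
-	exec_control |= CPU_BASED_UNCOND_IO_EXITING;
 	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control);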
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 01:31, Jim Mattson wrote: On Fri, Dec 8, 2017 at 2:22 AM, Quan Xu wrote: From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), These two lines should stay. Jim, could you explain why these two lines should stay? IIUC, the main concern is from nested virtualization, which still uses io_bitmap_a/io_bitmap_b.. if so, we really need to further clean up these code, as CPU_BASED_USE_IO_BITMAPS is clear, and CPU_BASED_UNCOND_IO_EXITING is set for both L0/L2. after new patches which I mentioned in this thread. right? Alibaba Cloud Quan
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/09 00:18, David Hildenbrand wrote: On 08.12.2017 11:22, Quan Xu wrote: From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), FIELD64(MSR_BITMAP, msr_bitmap), FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr), FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr), @@ -943,8 +941,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +954,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3626,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5439,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6741,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); Looks good to me. We could now also drop diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a5d2856eb28b..732e93332f4c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -10639,7 +10639,6 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12, * Rather, exit every time. */ exec_control &= ~CPU_BASED_USE_IO_BITMAPS; - exec_control |= CPU_BASED_UNCOND_IO_EXITING; vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_control); As this will be implicitly set by vmx_exec_control() David, I find some nVMX code is still using vmcs12 (.e.g. nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) dropping it would make these logic error handling.. we better leave it as is.. Quan Alibaba Cloud
Re: [PATCH] KVM: VMX: drop I/O permission bitmaps
On 2017/12/12 02:08, Jim Mattson wrote: Removing these two lines from the initialization of field_to_offset_table[] means that vmcs_field_to_offset() will return -ENOENT for IO_BITMAP_A or IO_BITMAP_B. Hence, handle_vmread and handle_vmwrite will incorrectly report these fields as unsupported VMCS components if an L1 hypervisor tries to access them. I will fix in v2. Quan Alibaba Cloud On Sun, Dec 10, 2017 at 9:37 PM, Quan Xu wrote: On 2017/12/09 01:31, Jim Mattson wrote: On Fri, Dec 8, 2017 at 2:22 AM, Quan Xu wrote: From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 17 + 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..3e4f760 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -771,8 +771,6 @@ enum segment_cache_field { FIELD(HOST_FS_SELECTOR, host_fs_selector), FIELD(HOST_GS_SELECTOR, host_gs_selector), FIELD(HOST_TR_SELECTOR, host_tr_selector), - FIELD64(IO_BITMAP_A, io_bitmap_a), - FIELD64(IO_BITMAP_B, io_bitmap_b), These two lines should stay. Jim, could you explain why these two lines should stay? IIUC, the main concern is from nested virtualization, which still uses io_bitmap_a/io_bitmap_b.. if so, we really need to further clean up these code, as CPU_BASED_USE_IO_BITMAPS is clear, and CPU_BASED_UNCOND_IO_EXITING is set for both L0/L2. after new patches which I mentioned in this thread. right? Alibaba Cloud Quan
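For readers following along, the lookup Jim refers to is roughly the following (simplified; the real table is built from the FIELD()/FIELD64() initializers), and handle_vmread()/handle_vmwrite() turn the -ENOENT into VMXERR_UNSUPPORTED_VMCS_COMPONENT for L1:

static inline short vmcs_field_to_offset(unsigned long field)
{
	/* entries removed from the initializer stay 0 and read as "unsupported" */
	if (field >= ARRAY_SIZE(vmcs_field_to_offset_table) ||
	    vmcs_field_to_offset_table[field] == 0)
		return -ENOENT;

	return vmcs_field_to_offset_table[field];
}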
[PATCH v2] KVM: VMX: drop I/O permission bitmaps
From: Quan Xu Since KVM removes the only I/O port 0x80 bypass on Intel hosts, clear CPU_BASED_USE_IO_BITMAPS and set CPU_BASED_UNCOND_IO_EXITING bit. Then these I/O permission bitmaps are not used at all, so drop I/O permission bitmaps. Signed-off-by: Jim Mattson Signed-off-by: Radim Krčmář Signed-off-by: Quan Xu --- arch/x86/kvm/vmx.c | 15 +-- 1 files changed, 1 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2fd9a8c..41aaf5e 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -943,8 +943,6 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12, static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock); enum { - VMX_IO_BITMAP_A, - VMX_IO_BITMAP_B, VMX_MSR_BITMAP_LEGACY, VMX_MSR_BITMAP_LONGMODE, VMX_MSR_BITMAP_LEGACY_X2APIC_APICV, @@ -958,8 +956,6 @@ enum { static unsigned long *vmx_bitmap[VMX_BITMAP_NR]; -#define vmx_io_bitmap_a (vmx_bitmap[VMX_IO_BITMAP_A]) -#define vmx_io_bitmap_b (vmx_bitmap[VMX_IO_BITMAP_B]) #define vmx_msr_bitmap_legacy (vmx_bitmap[VMX_MSR_BITMAP_LEGACY]) #define vmx_msr_bitmap_longmode (vmx_bitmap[VMX_MSR_BITMAP_LONGMODE]) #define vmx_msr_bitmap_legacy_x2apic_apicv (vmx_bitmap[VMX_MSR_BITMAP_LEGACY_X2APIC_APICV]) @@ -3632,7 +3628,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) #endif CPU_BASED_CR3_LOAD_EXITING | CPU_BASED_CR3_STORE_EXITING | - CPU_BASED_USE_IO_BITMAPS | + CPU_BASED_UNCOND_IO_EXITING | CPU_BASED_MOV_DR_EXITING | CPU_BASED_USE_TSC_OFFSETING | CPU_BASED_INVLPG_EXITING | @@ -5445,10 +5441,6 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx) #endif int i; - /* I/O */ - vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a)); - vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b)); - if (enable_shadow_vmcs) { vmcs_write64(VMREAD_BITMAP, __pa(vmx_vmread_bitmap)); vmcs_write64(VMWRITE_BITMAP, __pa(vmx_vmwrite_bitmap)); @@ -6751,14 +6743,9 @@ static __init int hardware_setup(void) goto out; } - vmx_io_bitmap_b = (unsigned long *)__get_free_page(GFP_KERNEL); memset(vmx_vmread_bitmap, 0xff, PAGE_SIZE); memset(vmx_vmwrite_bitmap, 0xff, PAGE_SIZE); - memset(vmx_io_bitmap_a, 0xff, PAGE_SIZE); - - memset(vmx_io_bitmap_b, 0xff, PAGE_SIZE); - memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode, 0xff, PAGE_SIZE); -- 1.7.1
Re: [PATCH RFC 3/7] KVM: timer: synchronize tsc-deadline timestamp for guest
On 2017/12/08 23:06, Konrad Rzeszutek Wilk wrote: On Fri, Dec 08, 2017 at 04:39:46PM +0800, Quan Xu wrote: From: Ben Luo In general, KVM guest programs tsc-deadline timestamp to MSR_IA32_TSC_DEADLINE MSR. This will cause a VM-exit, and then KVM handles this timer for guest. The tsc-deadline timestamp is mostly recorded in share page with less VM-exit. We Introduce a periodically working kthread to scan share page and synchronize timer setting for guest on a dedicated CPU. That sounds like a race. Meaning the guest may put too small window and this 'working thread to scan' may not get to it fast enough? yes, you are right. So .. . Meaning we miss the deadline to inject the timer in the guest. Or is this part of this PV MSR semantics - that it will only work for certain amount of values and anything less than say 1ms should not use the PV MSR? .. for these timers, We have to program these tsc-deadline timestamps to MSR_IA32_TSC_DEADLINE as normal, which will cause VM-exit and KVM will signal the working thread through IPI to program timer, instead of registering on current CPU (patch 0004). more detail in patch 0007. Quan Alibaba Cloud
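So the guest-side write path has to choose between the two mechanisms based on how far away the deadline is. A rough sketch of that decision — PV_TIMER_MIN_DELTA_TSC and the pvtimer_shared per-cpu page are assumptions, not the posted code:

static void pv_timer_set_tsc_deadline(u64 deadline_tsc)
{
	/*
	 * Too close for the scanning kthread to catch in time:
	 * write the MSR as usual and take the VM-exit.
	 */
	if (deadline_tsc <= rdtsc() + PV_TIMER_MIN_DELTA_TSC) {
		wrmsrl(MSR_IA32_TSC_DEADLINE, deadline_tsc);
		return;
	}

	/* Far enough out: publish it in the shared page, no VM-exit. */
	WRITE_ONCE(this_cpu_ptr(&pvtimer_shared)->expire_tsc, deadline_tsc);
}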
Re: [PATCH RFC 0/7] kvm pvtimer
On 2017/12/14 00:28, Konrad Rzeszutek Wilk wrote: On Wed, Dec 13, 2017 at 11:25:13PM +0800, Quan Xu wrote: On Fri, Dec 8, 2017 at 11:10 PM, Konrad Rzeszutek Wilk < konrad.w...@oracle.com> wrote: On Fri, Dec 08, 2017 at 04:39:43PM +0800, Quan Xu wrote: From: Ben Luo This patchset introduces a new paravirtualized mechanism to reduce VM-exit caused by guest timer accessing. And how bad is this blib in arming the timer? And how often do you get this timer to be armed? OR better yet - what are the workloads in which you found this VMExit to be painful? Thanks. Or better yet - what are the workloads in which you found this VMExit to be painful? one painful point is from VM idle path.. for some network req/resp services, or benchmark of process context switches.. So: 1) VM idle path and network req/resp services: Does this go away if you don't hit the idle path? Meaning if you loop without hitting HLT/MWAIT? we still hit HLT.. we can use it with https://lkml.org/lkml/2017/8/29/279 .. I am assuming the issue you are facing is the latency - that is first time the guest comes from HLT and responds to the packet the latency is much higher than without? yes, And the arming of the timer? 2) process context switches. Is that related to the 1)? That is the 'schedule' call and the process going to sleep waiting for an interrupt or timer? This all sounds like issues with low-CPU usage workloads where you need low latency responses? yes, it is also helpful to some timer-intensive services. Quan
Re: [PATCH] KVM: x86: avoid unnecessary XSETBV on guest entry
On 2017/12/13 20:51, Paolo Bonzini wrote: xsetbv can be expensive when running on nested virtualization, try to avoid it. Signed-off-by: Paolo Bonzini --- arch/x86/kvm/x86.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0f82e2cbf64c..daa1918031df 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -702,7 +702,8 @@ static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu) if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) && !vcpu->guest_xcr0_loaded) { /* kvm_set_xcr() also depends on this */ - xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0); + if (vcpu->arch.xcr0 != host_xcr0) + xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0); vcpu->guest_xcr0_loaded = 1; } } Reviewed-by: Quan Xu
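Presumably the unload side can get the same guard, since it is the mirror of the hunk above; a sketch only, not part of the posted patch:

static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
{
	if (vcpu->guest_xcr0_loaded) {
		if (vcpu->arch.xcr0 != host_xcr0)
			xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
		vcpu->guest_xcr0_loaded = 0;
	}
}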
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-15 22:43, Rik van Riel wrote: Can you explain why you believe that? for example, a vcpu thread is running in kvm mode when it is stopped under a critical condition, e.g. for live migration. QEMU sends an IPI to cause a VM-exit to happen immediately, and this IPI doesn't make the vcpu return to QEMU. IIUC this vcpu thread will still continue to run in kvm mode when it is woken up on the target machine. with your patch, I don't see a chance to load guest FPU or XSTATE until it returns to QEMU and runs in kvm mode again. then the FPU or XSTATE status is inconsistent for a small window, and what's even worse is that the vcpu is running. Did I misunderstand? Quan Alibaba Cloud Getting the guest FPU or XSTATE is done under the vcpu->mutex. This patch switches out guest and userspace FPU/XSTATE under the vcpu->mutex, and switches it back before releasing the vcpu->mutex. By the time a KVM_GET_FPU has obtained the vcpu->mutex, the guest FPU state will be in vcpu->arch.guest_fpu.state, where you expect it to be. What am I missing?
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 12:21, Rik van Riel wrote: On Thu, 2017-11-16 at 10:50 +0800, Quan Xu wrote: On 2017-11-15 22:43, Rik van Riel wrote: Can you explain why you believe that? for example, a vcpu thread is running in kvm mode when it is stopped under a critical condition, e.g. for live migration. QEMU sends an IPI to cause a VM-exit to happen immediately, and this IPI doesn't make the vcpu return to QEMU. IIUC this vcpu thread will still continue to run in kvm mode when it is woken up on the target machine. with your patch, I don't see a chance to load guest FPU or XSTATE until it returns to QEMU and runs in kvm mode again. then the FPU or XSTATE status is inconsistent for a small window, and what's even worse is that the vcpu is running. Did I misunderstand? At context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out. When the VCPU thread is scheduled back in, the context switch code will restore current->thread.fpu to the FPU registers. good catch! Also, as you commented, PKRU is switched out separately at VMENTER and VMEXIT time, but with a lot of IF conditions.. the pkru may be restored with the host pkru after VMEXIT. when the vcpu thread is scheduled out, the pkru value in current->thread.fpu.state may be the host pkru value, instead of the guest pkru value (of course, this _assumes_ that the pkru is in current->thread.fpu.state as well). in this way, the pkru may be a corner case. VM migration again: in the case where source_host_pkru_value != guest_pkru_value and target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent.. Quan Alibaba Cloud The VCPU thread will never run with anything else than the guest FPU state, while inside the KVM_RUN code.
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 06:03, Thomas Gleixner wrote: On Wed, 15 Nov 2017, Peter Zijlstra wrote: On Mon, Nov 13, 2017 at 06:06:02PM +0800, Quan Xu wrote: From: Yang Zhang Implement a generic idle poll which resembles the functionality found in arch/. Provide weak arch_cpu_idle_poll function which can be overridden by the architecture code if needed. No, we want less of those magic hooks, not more. Interrupts arrive which may not cause a reschedule in idle loops. In a KVM guest, this costs several VM-exit/VM-entry cycles, VM-entry for interrupts and VM-exit immediately. Also this becomes more expensive than bare metal. Add a generic idle poll before enter real idle path. When a reschedule event is pending, we can bypass the real idle path. Why not do a HV specific idle driver? If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. See commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu"). The reason for this commit looks similar. But let's not proliferate that. I'd rather see that go away. agreed. Even though we could get more benefit than commit 3c5d92a0cfb5 ("nohz: Introduce arch_needs_cpu") in a kvm guest, I won't proliferate that.. But the irq_timings stuff is heading into the same direction, with a more complex prediction logic which should tell you pretty well how long that idle period is going to be, and in case of an interrupt heavy workload this would skip the extra work of stopping and restarting the tick and provide a very good input into a polling decision. interesting. I have tested with the IRQ_TIMINGS related code, which does not seem to work so far. Also I'd like to help as much as I can. This can be handled either in a HV specific idle driver or even in the generic core code. If the interrupt does not arrive within the predicted time, then you can assume that the flood stopped and invoke halt or whatever. That avoids all of that 'tunable and tweakable' x86 specific hackery and utilizes common functionality which is mostly there already. here is some sample code. Poll for a while before entering halt in cpuidle_enter_state(). If I get a reschedule event, then don't try to enter halt. (I hope this is the right direction as Peter mentioned in another email) --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, target_state = &drv->states[index]; } +#ifdef CONFIG_PARAVIRT + paravirt_idle_poll(); + + if (need_resched()) + return -EBUSY; +#endif + /* Take note of the planned idle state. */ sched_idle_set_state(target_state); thanks, Quan Alibaba Cloud
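For reference, the paravirt_idle_poll() used in the hunk above is meant to be nothing more than a bounded spin on need_resched(). A minimal sketch, with poll_ns as a made-up tunable:

static unsigned long poll_ns = 50000;	/* hypothetical tunable: 50us */

static void kvm_idle_poll(void)
{
	u64 end = ktime_get_ns() + poll_ns;

	/* spin until either a reschedule is pending or the window expires */
	do {
		if (need_resched())
			break;
		cpu_relax();
	} while (ktime_get_ns() < end);
}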
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 16:45, Peter Zijlstra wrote: On Wed, Nov 15, 2017 at 11:03:08PM +0100, Thomas Gleixner wrote: If I understand the problem correctly then he wants to avoid the heavy lifting in tick_nohz_idle_enter() in the first place, but there is already an interesting quirk there which makes it exit early. Sure. And there are people who want to do the same for native. Adding more ugly and special cases just isn't the way to go about doing that. I'm fairly sure I've told the various groups that want to tinker with this to work together on this. I've also in fairly significant detail sketched how to rework the idle code and idle predictors. At this point I'm too tired to dig any of that up, so I'll just keep saying no to patches that don't even attempt to go in the right direction. Peter, take care. I really have considered this factor, and try my best not to interfere with scheduler/idle code. if irq_timings code is ready, I can use it directly. I think irq_timings is not an easy task, I'd like to help as much as I can. Also don't try to touch tick_nohz* code again. as tglx suggested, this can be handled either in a HV specific idle driver or even in the generic core code. I hope this is in the right direction. Quan Alibaba Cloud
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 18:21, Paolo Bonzini wrote: On 16/11/2017 06:06, Quan Xu wrote: when vcpu thread is scheduled out, the pkru value in current->thread.fpu.state may be the host pkru value, instead of guest pkru value (of course, this _assumes_ that the pkru is in current->thread.fpu.state as well). in this way, the pkru may be a coner case. Rik may correct me, but I think this is not possible. Preemption is disabled all the time while PKRU = guest_pkru (which is only during vmx_vcpu_run). refer to the following code, vcpu_enter_guest() { ... preempt_disable() ... kvm_x86_ops->run(vcpu); (actually it is vmx_vcpu_run()) ... ... preempt_enable(); } when preempt_disable before to run kvm_x86_ops->run.. in kvm_x86_ops->run, the pkru is restored with host_pkru (IF guest_pkru != host_pkru). all this happened under preempt_disable(). it's not true that preemtion is disable all the time while PKRU = guest_pkru.. However it seems there is still some gap.. as Rik said, "at context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out." after preempt_enable() in vcpu_enter_guest(), the vcpu thread is scheduled out, in kvm_x86_ops->run, the PKRU has been restored with host_pkru (IF guest_pkru != host_pkru).. instead of guest_pkru.. then the PKRU is host_pkru, how to save guest_pkru current->thread.fpu? as mentioned, all this _assumes_ that the pkru is in current->thread.fpu.state as well. thanks, Quan Alibaba Cloud Context switching will only happen in vcpu_enter_guest() after preempt_enable() for a preemptible kernel, or in vcpu_run via cond_resched() for a non-preemptible kernel. Thanks, Paolo VM migration again, in case, source_host_pkru_value != guest_pkru_value, target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent..
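For reference, the VMENTER/VMEXIT PKRU switch being discussed looks roughly like this in vmx_vcpu_run() for this kernel generation (quoted from memory, so treat as approximate); note that all of it runs inside the preempt_disable()/preempt_enable() window shown above:

	/* before VM entry: install the guest PKRU if it differs */
	if (static_cpu_has(X86_FEATURE_PKU) &&
	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE) &&
	    vcpu->arch.pkru != vmx->host_pkru)
		__write_pkru(vcpu->arch.pkru);

	/* ... VMLAUNCH/VMRESUME ... */

	/* after VM exit: read back the guest PKRU and restore the host value */
	if (static_cpu_has(X86_FEATURE_PKU) &&
	    kvm_read_cr4_bits(vcpu, X86_CR4_PKE)) {
		vcpu->arch.pkru = __read_pkru();
		if (vcpu->arch.pkru != vmx->host_pkru)
			__write_pkru(vmx->host_pkru);
	}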
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-16 20:18, Paolo Bonzini wrote: On 16/11/2017 13:12, Quan Xu wrote: However it seems there is still some gap.. as Rik said, "at context switch time, the context switch code will save the guest FPU state to current->thread.fpu when the VCPU thread is scheduled out." By "guest FPU state" Rik means "guest FPU with host PKRU". :( actually it is host_pkru, just with different names.. Guest PKRU is always stored in vcpu->arch.pkru rather than in the guest FPU state, so guest PKRU will never be in current->thread.fpu.state either. KVM_GET_XSAVE will the guest FPU state with vcpu->arch.pkru and migration will work properly. agreed, I fix it.. that's why I concern.. there are so much methods to write PKRU with host_pkru/guest_pkru.. after migration, KVM_GET|SET_XSAVE restore the right pkru.. but we introduce another method: -- When the VCPU thread is scheduled back in, the context switch code will restore current->thread.fpu to the FPU registers. there is still a window to restore current->thread.fpu to the FPU registers before enter guest mode and preempt_disable(). on target machine, after migration, the pkru value is source_host_pkru in current->thread.fpu. in case, source_host_pkru_value != guest_pkru_value, target_host_pkru_value == guest_pkru_value.. source_host_pkru_value may be restored to PKRU.. make pkru status inconsistent.. thanks Quan Alibaba Cloud Thanks, Paolo after preempt_enable() in vcpu_enter_guest(), the vcpu thread is scheduled out, in kvm_x86_ops->run, the PKRU has been restored with host_pkru (IF guest_pkru != host_pkru).. instead of guest_pkru.. then the PKRU is host_pkru, how to save guest_pkru current->thread.fpu? as mentioned, all this _assumes_ that the pkru is in current->thread.fpu.state as well. thanks, Quan Alibaba Cloud Context switching will only happen in vcpu_enter_guest() after preempt_enable() for a preemptible kernel, or in vcpu_run via cond_resched() for a non-preemptible kernel. Thanks, Paolo VM migration again, in case, source_host_pkru_value != guest_pkru_value, target_host_pkru_value == guest_pkru_value.. the pkru status would be inconsistent..
Re: [PATCH 1/2] x86,kvm: move qemu/guest FPU switching out to vcpu_run
On 2017-11-17 01:50, Paolo Bonzini wrote: On 16/11/2017 15:28, Quan Xu wrote: vcpu->srcu_idx = srcu_read_lock(&kvm->srcu); + kvm_load_guest_fpu(vcpu); + for (;;) { if (kvm_vcpu_running(vcpu)) { r = vcpu_enter_guest(vcpu); << as Rik dropped kvm_load_guest_fpu(vcpu) in vcpu_enter_guest() .. in case it is still in kvm mode, how do we make sure the pkru is always the right one before entering guest mode, if the thread can be preempted before preempt_disable() after migration? :( As you know: 1) kvm_load_guest_fpu doesn't load the guest PKRU, it keeps the userspace PKRU. 2) the guest PKRU is only ever set in a preemption-disabled area. Thus, context switch always sees the userspace PKRU. The guest PKRU is only set the next time vmx_vcpu_run executes. Paolo Paolo, thanks for your explanation!!:-) Rik, could you cc me in v2? Quan
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-16 17:53, Thomas Gleixner wrote: On Thu, 16 Nov 2017, Quan Xu wrote: On 2017-11-16 06:03, Thomas Gleixner wrote: --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -210,6 +210,13 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, target_state = &drv->states[index]; } +#ifdef CONFIG_PARAVIRT + paravirt_idle_poll(); + + if (need_resched()) + return -EBUSY; +#endif That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it. The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions. How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that. Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions. I would be very surprised if this data would not overlap by at least 90%. Peter, tglx Thanks for your comments.. rethinking this patch set, 1. which data needs to be considered to make decisions about probabilistic polling I really need to write up which data needs to be considered to make decisions about probabilistic polling. Over the last several months, I have always focused on the data _from idle to reschedule_, then to bypass the idle loops. unfortunately, this makes me touch scheduler/idle/nohz code inevitably. with tglx's suggestion, the data which is necessary to make power/idle state decisions is the last idle state's residency time. IIUC this data is the duration from idle to wakeup, which may be ended by a reschedule irq or another irq. I also tested that the reschedule irq overlaps by more than 90% (tracing the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute. given that overlap, I think I can use the last idle state's residency time to make decisions about probabilistic polling, as @dev->last_residency does. this data is much easier to get. 2. do a HV specific idle driver (function) so far, power management is not exposed to the guest.. idle is simple for a KVM guest, calling "sti" / "hlt" (cpuidle_idle_call() --> default_idle_call()).. thanks to the Xen guys, who have implemented the paravirt framework. I can implement it as easily as the following: --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void) update_intr_gate(X86_TRAP_PF, async_page_fault); } +static __cpuidle void kvm_safe_halt(void) +{ + /* 1. POLL, if need_resched() --> return */ + + asm volatile("sti; hlt": : :"memory"); /* 2. halt */ + + /* 3. get the last idle state's residency time */ + + /* 4. update poll duration based on last idle state's residency time */ +} + void __init kvm_guest_init(void) { int i; @@ -490,6 +496,8 @@ void __init kvm_guest_init(void) if (kvmclock_vsyscall) kvm_setup_vsyscall_timeinfo(); + pv_irq_ops.safe_halt = kvm_safe_halt; + #ifdef CONFIG_SMP then I no longer need to introduce a new pvops, and never have to modify scheduler/idle/nohz code again. also I can keep all of the code within arch/x86/kernel/kvm.c. If this is in the right direction, I will send a new patch set next week.. thanks, Quan Alibaba Cloud
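To make steps 1-4 of the skeleton concrete, a slightly fuller (still untested) sketch — the grow/shrink policy and the 50us/500us/10us numbers are placeholders, not measured values:

static u64 kvm_halt_poll_ns = 50000;		/* placeholder start value */
#define KVM_HALT_POLL_NS_MAX	500000		/* placeholder cap */
#define KVM_HALT_POLL_NS_MIN	10000		/* placeholder floor */

static __cpuidle void kvm_safe_halt(void)
{
	u64 start = ktime_get_ns();
	u64 poll_end = start + kvm_halt_poll_ns;
	u64 residency;

	/* 1. poll: if a reschedule shows up, skip halt entirely */
	do {
		if (need_resched()) {
			local_irq_enable();
			return;
		}
		cpu_relax();
	} while (ktime_get_ns() < poll_end);

	/* 2. halt */
	asm volatile("sti; hlt" : : : "memory");

	/* 3. last idle residency, measured from the start of the poll */
	residency = ktime_get_ns() - start;

	/*
	 * 4. short sleeps: polling would have caught the wakeup, grow the window;
	 *    long sleeps: polling is wasted cycles, shrink it
	 */
	if (residency < kvm_halt_poll_ns)
		kvm_halt_poll_ns = min_t(u64, kvm_halt_poll_ns * 2,
					 KVM_HALT_POLL_NS_MAX);
	else
		kvm_halt_poll_ns = max_t(u64, kvm_halt_poll_ns / 2,
					 KVM_HALT_POLL_NS_MIN);
}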
Re: [PATCH RFC v3 3/6] sched/idle: Add a generic poll before enter real idle path
On 2017-11-17 19:36, Thomas Gleixner wrote: On Fri, 17 Nov 2017, Quan Xu wrote: On 2017-11-16 17:53, Thomas Gleixner wrote: That's just plain wrong. We don't want to see any of this PARAVIRT crap in anything outside the architecture/hypervisor interfacing code which really needs it. The problem can and must be solved at the generic level in the first place to gather the data which can be used to make such decisions. How that information is used might be either completely generic or requires system specific variants. But as long as we don't have any information at all we cannot discuss that. Please sit down and write up which data needs to be considered to make decisions about probabilistic polling. Then we need to compare and contrast that with the data which is necessary to make power/idle state decisions. I would be very surprised if this data would not overlap by at least 90%. 1. which data needs to considerd to make decisions about probabilistic polling I really need to write up which data needs to considerd to make decisions about probabilistic polling. At last several months, I always focused on the data _from idle to reschedule_, then to bypass the idle loops. unfortunately, this makes me touch scheduler/idle/nohz code inevitably. with tglx's suggestion, the data which is necessary to make power/idle state decisions, is the last idle state's residency time. IIUC this data is duration from idle to wakeup, which maybe by reschedule irq or other irq. That's part of the picture, but not complete. tglx, could you share more? I am very curious about it.. I also test that the reschedule irq overlap by more than 90% (trace the need_resched status after cpuidle_idle_call), when I run ctxsw/netperf for one minute. as the overlap, I think I can input the last idle state's residency time to make decisions about probabilistic polling, as @dev->last_residency does. it is much easier to get data. That's only true for your particular use case. 2. do a HV specific idle driver (function) so far, power management is not exposed to guest.. idle is simple for KVM guest, calling "sti" / "hlt"(cpuidle_idle_call() --> default_idle_call()).. thanks Xen guys, who has implemented the paravirt framework. I can implement it as easy as following: --- a/arch/x86/kernel/kvm.c Your email client is using a very strange formatting. my bad, I insert space to highlight these code. This is definitely better than what you proposed so far and implementing it as a prove of concept seems to be worthwhile. But I doubt that this is the final solution. It's not generic and not necessarily suitable for all use case scenarios. yes, I am exhausted :):) could you tell me the gap to be generic and necessarily suitable for all use case scenarios? as lack of irq/idle predictors? I really want to upstream it for all of public cloud users/providers.. as kvm host has a similar one, is it possible to upstream with following conditions? : 1). add a QEMU configuration, whether enable or not, by default disable. 2). add some "TODO" comments near the code. 3). ... anyway, thanks for your help.. Quan Alibaba Cloud