[Qemu-devel] [PATCH] i386: turn off l3-cache property by default
Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus" introduced the l3-cache property and turned it on by default, exposing an L3 cache to the guest.

The motivation behind it was that in the Linux scheduler, when waking up a task on a sibling CPU, the task was put onto the target CPU's runqueue directly, without sending a reschedule IPI. The reduction in the IPI count led to a performance gain.

However, this isn't the whole story. Once the task is on the target CPU's runqueue, it may have to preempt the current task on that CPU, be it the idle task putting the CPU to sleep or just another running task. For that, a reschedule IPI has to be issued, too. Only if the other CPU has been running a normal task for too short a time do the fairness constraints prevent the preemption (and thus the IPI).

This boils down to the improvement being achievable only in workloads with many actively switching tasks. We had no access to the (proprietary?) SAP HANA benchmark the commit referred to, but the pattern is also reproduced with "perf bench sched messaging -g 1" on a 1-socket, 8-core vCPU topology. There we see indeed:

    l3-cache    #res IPI /s    #time / 1 loops
    off         560K           1.8 sec
    on          40K            0.9 sec

Now there's a downside: with L3 cache the Linux scheduler is more eager to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU interactions and therefore excessive halts and IPIs. E.g. "perf bench sched pipe -i 10" gives

    l3-cache    #res IPI /s    #HLT /s    #time / 10 loops
    off         200 (no K)     230        0.2 sec
    on          400K           330K       0.5 sec

In a more realistic test, we observe a 15% degradation in VM density (measured as the number of VMs, each running Drupal CMS serving 2 http requests per second to its main page, with 95th-percentile response latency under 100 ms) with l3-cache=on.

We think the mostly-idle scenario is more common in cloud and personal usage and should be optimized for by default; users of highly loaded VMs should be able to tune them up themselves.
So switch l3-cache off by default, and add a compat clause for the range of machine types where it was on.

Signed-off-by: Denis Plotnikov <dplotni...@virtuozzo.com>
Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
---
 include/hw/i386/pc.h | 7 ++++++-
 target/i386/cpu.c    | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 087d184..1d2dcae 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -375,7 +375,12 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
         .driver   = TYPE_X86_CPU,\
         .property = "x-hv-max-vps",\
         .value    = "0x40",\
-    },
+    },\
+    {\
+        .driver   = TYPE_X86_CPU,\
+        .property = "l3-cache",\
+        .value    = "on",\
+    },\
 
 #define PC_COMPAT_2_9 \
     HW_COMPAT_2_9 \
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 1edcf29..95a51bd 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -4154,7 +4154,7 @@ static Property x86_cpu_properties[] = {
     DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
     DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
     DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
-    DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, true),
+    DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, false),
     DEFINE_PROP_BOOL("kvm-no-smi-migration", X86CPU, kvm_no_smi_migration, false),
     DEFINE_PROP_BOOL("vmware-cpuid-freq", X86CPU, vmware_cpuid_freq, true),
-- 
2.7.4
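For VMs that do run many actively switching tasks, the property can simply be turned back on at the command line. A hypothetical invocation matching the 1-socket, 8-core topology from the benchmark above (machine type, memory size and image path are placeholders, not taken from the patch):

```shell
qemu-system-x86_64 \
    -machine pc-i440fx-2.11 -accel kvm \
    -smp 8,sockets=1,cores=8,threads=1 \
    -cpu host,l3-cache=on \
    -m 4G disk.img
```

Assuming the compat clause lands in PC_COMPAT_2_10 as the hunk above suggests, machine types 2.10 and older keep l3-cache=on automatically; the explicit property is only needed on newer machine types once this patch is applied.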
[Qemu-devel] [PATCH v3] kvmclock: update system_time_msr address forcibly
Do an update of the system_time_msr address every time before reading the value of tsc_timestamp from the guest's kvmclock page. There are no other code paths which ensure that qemu has an up-to-date value of system_time_msr, so force the update when the guest's tsc_timestamp is read.

This bug affects nested setups which turn off TPR access interception for L2 guests: the access is then intercepted by L0 and never shows up in L1. Linux initializes kvmclock during bootstrap before the APIC, which causes a TPR access; that is why the bug is not revealed on L1 guests that keep TPR interception turned on for L2.

This patch fixes the problem by making sure qemu knows the correct system_time_msr address every time it is needed.

Signed-off-by: Denis Plotnikov <dplotni...@virtuozzo.com>
---
 hw/i386/kvm/clock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index 13eca37..363d1b5 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -19,6 +19,7 @@
 #include "qemu/host-utils.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/kvm.h"
+#include "sysemu/hw_accel.h"
 #include "kvm_i386.h"
 #include "hw/sysbus.h"
 #include "hw/kvm/clock.h"
@@ -69,6 +70,8 @@ static uint64_t kvmclock_current_nsec(KVMClockState *s)
     uint64_t nsec_hi;
     uint64_t nsec;
 
+    cpu_synchronize_state(cpu);
+
     if (!(env->system_time_msr & 1ULL)) {
         /* KVM clock not active */
         return 0;
-- 
2.7.4
[Qemu-devel] [PATCH v2] kvmclock: update system_time_msr address forcibly
Do an update of the system_time_msr address every time before reading the value of tsc_timestamp from the guest's kvmclock page. There are no other code paths which ensure that qemu has an up-to-date value of system_time_msr, so force the update when the guest's tsc_timestamp is read.

This bug affects nested setups which turn off TPR access interception for L2 guests: the access is then intercepted by L0 and never shows up in L1. Linux initializes kvmclock during bootstrap before the APIC, which causes a TPR access; that is why the bug is not revealed on L1 guests that keep TPR interception turned on for L2.

This patch fixes the problem by making sure qemu knows the correct system_time_msr address every time it is needed.

Signed-off-by: Denis Plotnikov <dplotni...@virtuozzo.com>
---
 hw/i386/kvm/clock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index 0f75dd3..875d85f 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -61,6 +61,8 @@ static uint64_t kvmclock_current_nsec(KVMClockState *s)
     uint64_t nsec_hi;
     uint64_t nsec;
 
+    cpu_synchronize_state(cpu);
+
     if (!(env->system_time_msr & 1ULL)) {
         /* KVM clock not active */
         return 0;
-- 
2.7.4
[Qemu-devel] [PATCH] kvmclock: update system_time_msr address forcibly
Do an update of the system_time_msr address every time before reading the value of tsc_timestamp from the guest's kvmclock page. It has to be done forcibly because there is a situation where system_time_msr has been set by kvm but qemu isn't aware of it. This leads to kvmclock_offset being updated without taking the guest's kvmclock values into account.

The situation appears when an L2 linux guest runs over an L1 linux guest and the action inducing the system_time_msr update is TPR access reporting. Some L1 linux guests turn off TPR access processing, so when L0 gets an L2 exit induced by a TPR MSR access it doesn't enter L1 but processes it by itself. Thus, L1 kvm doesn't know about that TPR access happening and doesn't exit to qemu, which in turn doesn't set the system_time_msr address.

This patch fixes this by making sure qemu knows the correct address every time it is needed.

Signed-off-by: Denis Plotnikov <dplotni...@virtuozzo.com>
---
 hw/i386/kvm/clock.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index e713162..035196a 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -48,11 +48,38 @@ struct pvclock_vcpu_time_info {
     uint8_t    pad[2];
 } __attribute__((__packed__)); /* 32 bytes */
 
+static void update_all_system_time_msr(void)
+{
+    CPUState *cpu;
+    CPUX86State *env;
+    struct {
+        struct kvm_msrs info;
+        struct kvm_msr_entry entries[1];
+    } msr_data;
+    int ret;
+
+    msr_data.info.nmsrs = 1;
+    msr_data.entries[0].index = MSR_KVM_SYSTEM_TIME;
+
+    CPU_FOREACH(cpu) {
+        ret = kvm_vcpu_ioctl(cpu, KVM_GET_MSRS, &msr_data);
+
+        if (ret < 0) {
+            fprintf(stderr, "KVM_GET_MSRS failed: %s\n", strerror(ret));
+            abort();
+        }
+
+        assert(ret == 1);
+        env = cpu->env_ptr;
+        env->system_time_msr = msr_data.entries[0].data;
+    }
+}
+
 static uint64_t kvmclock_current_nsec(KVMClockState *s)
 {
     CPUState *cpu = first_cpu;
     CPUX86State *env = cpu->env_ptr;
-    hwaddr kvmclock_struct_pa = env->system_time_msr & ~1ULL;
+    hwaddr kvmclock_struct_pa;
     uint64_t migration_tsc = env->tsc;
     struct pvclock_vcpu_time_info time;
     uint64_t delta;
@@ -60,6 +87,9 @@ static uint64_t kvmclock_current_nsec(KVMClockState *s)
     uint64_t nsec_hi;
     uint64_t nsec;
 
+    update_all_system_time_msr();
+    kvmclock_struct_pa = env->system_time_msr & ~1ULL;
+
     if (!(env->system_time_msr & 1ULL)) {
         /* KVM clock not active */
         return 0;
-- 
2.7.4
Re: [Qemu-devel] [PATCH] kvmclock: update system_time_msr address forcibly
On 24.05.2017 17:09, Denis V. Lunev wrote:
> On 05/24/2017 05:07 PM, Denis Plotnikov wrote:
>> Do an update of the system_time_msr address every time before reading
>> the value of tsc_timestamp from the guest's kvmclock page. It has to
>> be done forcibly because there is a situation where system_time_msr
>> has been set by kvm but qemu isn't aware of it. This leads to
>> kvmclock_offset being updated without taking the guest's kvmclock
>> values into account.
>>
>> The situation appears when an L2 linux guest runs over an L1 linux
>> guest and the action inducing the system_time_msr update is TPR
>> access reporting. Some L1 linux guests turn off TPR access
>> processing, so when L0 gets an L2 exit induced by a TPR MSR access it
>> doesn't enter L1 but processes it by itself. Thus, L1 kvm doesn't
>> know about that TPR access happening and doesn't exit to qemu, which
>> in turn doesn't set the system_time_msr address.
>>
>> This patch fixes this by making sure qemu knows the correct address
>> every time it is needed.
>>
>> Signed-off-by: Denis Plotnikov <dplotni...@virtuozzo.com>
>> ---
>>  hw/i386/kvm/clock.c | 32 +++++++++++++++++++++++++++++++-
>>  1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
>> index e713162..035196a 100644
>> --- a/hw/i386/kvm/clock.c
>> +++ b/hw/i386/kvm/clock.c
>> @@ -48,11 +48,38 @@ struct pvclock_vcpu_time_info {
>>      uint8_t    pad[2];
>>  } __attribute__((__packed__)); /* 32 bytes */
>>
>> +static void update_all_system_time_msr(void)
>> +{
>> +    CPUState *cpu;
>> +    CPUX86State *env;
>> +    struct {
>> +        struct kvm_msrs info;
>> +        struct kvm_msr_entry entries[1];
>> +    } msr_data;
>> +    int ret;
>> +
>> +    msr_data.info.nmsrs = 1;
>> +    msr_data.entries[0].index = MSR_KVM_SYSTEM_TIME;
>> +
>> +    CPU_FOREACH(cpu) {
>> +        ret = kvm_vcpu_ioctl(cpu, KVM_GET_MSRS, &msr_data);
>> +
>> +        if (ret < 0) {
>> +            fprintf(stderr, "KVM_GET_MSRS failed: %s\n", strerror(ret));
>> +            abort();
>> +        }
>> +
>> +        assert(ret == 1);
>> +        env = cpu->env_ptr;
>> +        env->system_time_msr = msr_data.entries[0].data;
>> +    }
>> +}
>> +
>>  static uint64_t kvmclock_current_nsec(KVMClockState *s)
>>  {
>>      CPUState *cpu = first_cpu;
>>      CPUX86State *env = cpu->env_ptr;
>> -    hwaddr kvmclock_struct_pa = env->system_time_msr & ~1ULL;
>> +    hwaddr kvmclock_struct_pa;
>>      uint64_t migration_tsc = env->tsc;
>>      struct pvclock_vcpu_time_info time;
>>      uint64_t delta;
>> @@ -60,6 +87,9 @@ static uint64_t kvmclock_current_nsec(KVMClockState *s)
>>      uint64_t nsec_hi;
>>      uint64_t nsec;
>>
>> +    update_all_system_time_msr();
>> +    kvmclock_struct_pa = env->system_time_msr & ~1ULL;
>> +
> should we do this once/per guest boot?

practically - yes. I can barely imagine the pv_clock page address being changed after it is set once. But we don't know the exact moment when the guest is going to write it, so, not to depend on any other event, I decided to check it every time before use. It won't cause any performance issues because this invocation happens on VM state changes only.

> Den

>>      if (!(env->system_time_msr & 1ULL)) {
>>          /* KVM clock not active */
>>          return 0;

-- 
Best,
Denis