[Qemu-devel] [PATCH] i386: turn off l3-cache property by default

2017-11-24 Thread Denis Plotnikov
Commit 14c985cffa "target-i386: present virtual L3 cache info for vcpus"
introduced exposing the L3 cache to the guest and enabled it by default.

The motivation behind it was that in the Linux scheduler, when waking up
a task on a sibling CPU, the task was put onto the target CPU's runqueue
directly, without sending a reschedule IPI.  Reduction in the IPI count
led to performance gain.

However, this isn't the whole story.  Once the task is on the target
CPU's runqueue, it may have to preempt the current task on that CPU, be
it the idle task putting the CPU to sleep or just another running task.
For that a reschedule IPI will have to be issued, too.  Only when the
current task on that CPU has been running for too short a time will the
fairness constraints prevent the preemption, and thus the IPI.
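
The reasoning above can be sketched as a toy decision function.  This is
not the kernel's code: struct target_cpu, MIN_RUN_NS and
wakeup_needs_resched_ipi() are hypothetical names, and the threshold value
is an arbitrary placeholder.  It only models the cases described in this
message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the CPU that receives the woken task. */
struct target_cpu {
    bool idle;            /* running the idle task (CPU may be halted) */
    uint64_t curr_ran_ns; /* how long the current task has been running */
};

/* Placeholder fairness threshold: a task that has run for less than this
 * is protected from preemption in this model. */
#define MIN_RUN_NS 1000000ULL

/* Returns true when queuing a task on the target CPU still requires a
 * reschedule IPI under this model. */
static inline bool wakeup_needs_resched_ipi(const struct target_cpu *t)
{
    if (t->idle) {
        /* The idle task has to be preempted to run the new task. */
        return true;
    }
    /* A task that has run long enough gets preempted, which needs an IPI;
     * a task that has only just started is protected, so no IPI is sent. */
    return t->curr_ran_ns >= MIN_RUN_NS;
}
```

In this model only the "just started" case avoids the IPI, which is why
the gain shows up mainly with many actively switching tasks.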

This boils down to the improvement being only achievable in workloads
with many actively switching tasks.  We had no access to the
(proprietary?) SAP HANA benchmark the commit referred to, but the
pattern is also reproduced with "perf bench sched messaging -g 1".
On a 1-socket, 8-core vCPU topology we indeed see:

l3-cache    #res IPI /s    #time / 1 loops
off         560K           1.8 sec
on          40K            0.9 sec

Now there's a downside: with L3 cache the Linux scheduler is more eager
to wake up tasks on sibling CPUs, resulting in unnecessary cross-vCPU
interactions and therefore excessive halts and IPIs.  E.g. "perf bench
sched pipe -i 10" gives

l3-cache    #res IPI /s    #HLT /s    #time /10 loops
off         200 (no K)     230        0.2 sec
on          400K           330K       0.5 sec

In a more realistic test, we observe 15% degradation in VM density
(measured as the number of VMs, each running Drupal CMS serving 2 http
requests per second to its main page, with 95%-percentile response
latency under 100 ms) with l3-cache=on.

We think that the mostly-idle scenario is more common in cloud and
personal usage, and should be optimized for by default; users of highly
loaded VMs should be able to tune them themselves.

So switch l3-cache off by default, and add a compat clause for the range
of machine types where it was on.
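
How the new default and the compat clause combine can be sketched as
follows.  This is an illustration only: struct l3_on_range and
l3_cache_default() are hypothetical, and the version bounds are
placeholders to be supplied by the caller.  The real range is whatever the
compat macros in include/hw/i386/pc.h encode.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical range of machine-type minor versions in which l3-cache
 * defaulted to on. */
struct l3_on_range {
    int first_minor; /* first machine-type version with l3-cache=on */
    int last_minor;  /* last machine-type version with l3-cache=on */
};

/* Effective default of l3-cache for a versioned machine type. */
static inline bool l3_cache_default(int machine_minor, struct l3_on_range r)
{
    /* New built-in default after this patch. */
    bool value = false;

    /* Compat clause: machine types in the range keep the old value, so
     * existing guests see the same CPUID after the change. */
    if (machine_minor >= r.first_minor && machine_minor <= r.last_minor) {
        value = true;
    }
    return value;
}
```

An explicit l3-cache=on or l3-cache=off in the -cpu option is expected to
take precedence over either default.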

Signed-off-by: Denis Plotnikov <dplotni...@virtuozzo.com>
Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
---
 include/hw/i386/pc.h | 7 ++++++-
 target/i386/cpu.c    | 2 +-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index 087d184..1d2dcae 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -375,7 +375,12 @@ bool e820_get_entry(int, uint32_t, uint64_t *, uint64_t *);
 .driver   = TYPE_X86_CPU,\
 .property = "x-hv-max-vps",\
 .value= "0x40",\
-},
+},\
+{\
+.driver   = TYPE_X86_CPU,\
+.property = "l3-cache",\
+.value= "on",\
+},\
 
 #define PC_COMPAT_2_9 \
 HW_COMPAT_2_9 \
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 1edcf29..95a51bd 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -4154,7 +4154,7 @@ static Property x86_cpu_properties[] = {
 DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
 DEFINE_PROP_BOOL("cpuid-0xb", X86CPU, enable_cpuid_0xb, true),
 DEFINE_PROP_BOOL("lmce", X86CPU, enable_lmce, false),
-DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, true),
+DEFINE_PROP_BOOL("l3-cache", X86CPU, enable_l3_cache, false),
 DEFINE_PROP_BOOL("kvm-no-smi-migration", X86CPU, kvm_no_smi_migration,
  false),
 DEFINE_PROP_BOOL("vmware-cpuid-freq", X86CPU, vmware_cpuid_freq, true),
-- 
2.7.4




[Qemu-devel] [PATCH v3] kvmclock: update system_time_msr address forcibly

2017-05-29 Thread Denis Plotnikov
Update the system_time_msr address every time before reading
the value of tsc_timestamp from the guest's kvmclock page.

There are no other code paths that ensure qemu has an up-to-date
value of system_time_msr.  So force this update whenever the guest's
tsc_timestamp is read.

The bug affects nested setups that turn off TPR access interception
for L2 guests: the access is then intercepted by L0 and doesn't show up
in L1.
The Linux boot sequence sets up kvmclock before initializing the APIC,
and the APIC initialization causes a TPR access.
That's why on L1 guests that have TPR interception turned on for L2 the
effect of the bug is not revealed.

This patch fixes the problem by making sure qemu knows the correct
system_time_msr address every time it is needed.
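
The stale-copy problem and the fix can be modelled in a few lines.  This
is a simplified sketch, not QEMU code: struct vcpu_model and both helper
functions are hypothetical stand-ins for the hypervisor-held MSR value,
QEMU's cached copy, and the synchronization step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two copies of the MSR value. */
struct vcpu_model {
    uint64_t kernel_system_time_msr; /* authoritative, written by the guest */
    uint64_t cached_system_time_msr; /* userspace copy, may be stale */
};

/* Stand-in for the synchronization step: refresh the cached copy. */
static inline void model_synchronize_state(struct vcpu_model *v)
{
    v->cached_system_time_msr = v->kernel_system_time_msr;
}

/* Reports whether kvmclock looks active, optionally syncing first. */
static inline bool model_kvmclock_active(struct vcpu_model *v, bool sync_first)
{
    if (sync_first) {
        model_synchronize_state(v);
    }
    /* Bit 0 is the enable flag, as tested in the hunk below. */
    return (v->cached_system_time_msr & 1ULL) != 0;
}
```

Without the sync, a guest that has already enabled kvmclock still looks
inactive to the reader of the cached copy.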

Signed-off-by: Denis Plotnikov <dplotni...@virtuozzo.com>
---
 hw/i386/kvm/clock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index 13eca37..363d1b5 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -19,6 +19,7 @@
 #include "qemu/host-utils.h"
 #include "sysemu/sysemu.h"
 #include "sysemu/kvm.h"
+#include "sysemu/hw_accel.h"
 #include "kvm_i386.h"
 #include "hw/sysbus.h"
 #include "hw/kvm/clock.h"
@@ -69,6 +70,8 @@ static uint64_t kvmclock_current_nsec(KVMClockState *s)
 uint64_t nsec_hi;
 uint64_t nsec;
 
+cpu_synchronize_state(cpu);
+
 if (!(env->system_time_msr & 1ULL)) {
 /* KVM clock not active */
 return 0;
-- 
2.7.4




[Qemu-devel] [PATCH v2] kvmclock: update system_time_msr address forcibly

2017-05-26 Thread Denis Plotnikov
Update the system_time_msr address every time before reading
the value of tsc_timestamp from the guest's kvmclock page.

There are no other code paths that ensure qemu has an up-to-date
value of system_time_msr.  So force this update whenever the guest's
tsc_timestamp is read.

The bug affects nested setups that turn off TPR access interception
for L2 guests: the access is then intercepted by L0 and doesn't show up
in L1.
The Linux boot sequence sets up kvmclock before initializing the APIC,
and the APIC initialization causes a TPR access.
That's why on L1 guests that have TPR interception turned on for L2 the
effect of the bug is not revealed.

This patch fixes the problem by making sure qemu knows the correct
system_time_msr address every time it is needed.

Signed-off-by: Denis Plotnikov <dplotni...@virtuozzo.com>
---
 hw/i386/kvm/clock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index 0f75dd3..875d85f 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -61,6 +61,8 @@ static uint64_t kvmclock_current_nsec(KVMClockState *s)
 uint64_t nsec_hi;
 uint64_t nsec;
 
+cpu_synchronize_state(cpu);
+
 if (!(env->system_time_msr & 1ULL)) {
 /* KVM clock not active */
 return 0;
-- 
2.7.4




[Qemu-devel] [PATCH] kvmclock: update system_time_msr address forcibly

2017-05-24 Thread Denis Plotnikov
Update the system_time_msr address every time before reading
the value of tsc_timestamp from the guest's kvmclock page.

It has to be done forcibly because there is a situation
when system_time_msr has been set by kvm but qemu isn't aware of it.
This leads to updates of kvmclock_offset that disregard the guest's
kvmclock values.

The situation appears when an L2 Linux guest runs on top of an L1 Linux
guest and the action inducing the system_time_msr update is TPR access
reporting.
Some L1 Linux guests turn off TPR access processing, and when L0
gets an L2 exit induced by a TPR MSR access it doesn't enter L1 but
processes the access itself.
Thus, L1 kvm doesn't know about the TPR access and doesn't
exit to qemu, which in turn doesn't set the system_time_msr address.

This patch fixes this by making sure qemu knows the correct address every
time it is needed.
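
For reference, the two uses of system_time_msr in the hunks below can be
written out as helpers.  The helper names are hypothetical; the masks are
the ones the patch itself uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the MSR value is the enable flag. */
static inline bool kvmclock_msr_enabled(uint64_t system_time_msr)
{
    return (system_time_msr & 1ULL) != 0;
}

/* With bit 0 masked off, the value is the guest-physical address of the
 * kvmclock page that holds tsc_timestamp. */
static inline uint64_t kvmclock_msr_page_pa(uint64_t system_time_msr)
{
    return system_time_msr & ~1ULL;
}
```

A stale value of 0 therefore reads as "KVM clock not active", even when
the guest has already registered its page.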

Signed-off-by: Denis Plotnikov <dplotni...@virtuozzo.com>
---
 hw/i386/kvm/clock.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/hw/i386/kvm/clock.c b/hw/i386/kvm/clock.c
index e713162..035196a 100644
--- a/hw/i386/kvm/clock.c
+++ b/hw/i386/kvm/clock.c
@@ -48,11 +48,38 @@ struct pvclock_vcpu_time_info {
 uint8_t    pad[2];
 } __attribute__((__packed__)); /* 32 bytes */
 
+static void update_all_system_time_msr(void)
+{
+CPUState *cpu;
+CPUX86State *env;
+struct {
+struct kvm_msrs info;
+struct kvm_msr_entry entries[1];
+} msr_data;
+int ret;
+
+msr_data.info.nmsrs = 1;
+msr_data.entries[0].index = MSR_KVM_SYSTEM_TIME;
+
+CPU_FOREACH(cpu) {
+ret = kvm_vcpu_ioctl(cpu, KVM_GET_MSRS, &msr_data);
+
+if (ret < 0) {
+fprintf(stderr, "KVM_GET_MSRS failed: %s\n", strerror(ret));
+abort();
+}
+
+assert(ret == 1);
+env = cpu->env_ptr;
+env->system_time_msr = msr_data.entries[0].data;
+}
+}
+
 static uint64_t kvmclock_current_nsec(KVMClockState *s)
 {
 CPUState *cpu = first_cpu;
 CPUX86State *env = cpu->env_ptr;
-hwaddr kvmclock_struct_pa = env->system_time_msr & ~1ULL;
+hwaddr kvmclock_struct_pa;
 uint64_t migration_tsc = env->tsc;
 struct pvclock_vcpu_time_info time;
 uint64_t delta;
@@ -60,6 +87,9 @@ static uint64_t kvmclock_current_nsec(KVMClockState *s)
 uint64_t nsec_hi;
 uint64_t nsec;
 
+update_all_system_time_msr();
+kvmclock_struct_pa = env->system_time_msr & ~1ULL;
+
 if (!(env->system_time_msr & 1ULL)) {
 /* KVM clock not active */
 return 0;
-- 
2.7.4




Re: [Qemu-devel] [PATCH] kvmclock: update system_time_msr address forcibly

2017-05-24 Thread Denis Plotnikov



On 24.05.2017 17:09, Denis V. Lunev wrote:

On 05/24/2017 05:07 PM, Denis Plotnikov wrote:

>> +update_all_system_time_msr();
>> +kvmclock_struct_pa = env->system_time_msr & ~1ULL;
>> +
>
> should we do this once/per guest boot?

Practically, yes. I can barely imagine that the pv_clock page address
would change after being set once.

But we don't know the exact moment when the guest is going to write it.
To avoid depending on any other event, I decided to check it every time
before use. This shouldn't cause any performance issues, because the
invocation happens on VM state changes only.

> Den
>
>>  if (!(env->system_time_msr & 1ULL)) {
>>  /* KVM clock not active */
>>  return 0;




--
Best,
Denis


