[PATCH 1/2] kvm: free resources after canceling async_pf
When we cancel 'async_pf_execute()', we should behave as if the work was
never scheduled in 'kvm_setup_async_pf()'.

Fixes a bug where we could not unload a module, because the VM was never
destroyed.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 virt/kvm/async_pf.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index b44cea0..f30aa1c 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -102,8 +102,11 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 			     typeof(*work), queue);
 		cancel_work_sync(&work->work);
 		list_del(&work->queue);
-		if (!work->done) /* work was canceled */
+		if (!work->done) { /* work was canceled */
+			mmdrop(work->mm);
+			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
 			kmem_cache_free(async_pf_cache, work);
+		}
 	}

 	spin_lock(&vcpu->async_pf.lock);
--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] kvm: fix a bug and remove a redundancy in async_pf
I did not reproduce the bug fixed in [1/2], but there are not that many
reasons why we could not unload a module, so the spot is quite obvious.

Radim Krčmář (2):
  kvm: free resources after canceling async_pf
  kvm: remove .done from struct kvm_async_pf

 include/linux/kvm_host.h | 1 -
 virt/kvm/async_pf.c      | 8 ++++----
 2 files changed, 4 insertions(+), 5 deletions(-)

--
1.8.3.1
[PATCH 2/2] kvm: remove .done from struct kvm_async_pf
'.done' is used to mark the completion of 'async_pf_execute()', but
'cancel_work_sync()' returns true when the work was canceled, so we can
use that instead.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 include/linux/kvm_host.h | 1 -
 virt/kvm/async_pf.c      | 5 +----
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ca645a0..c7a5e08 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -190,7 +190,6 @@ struct kvm_async_pf {
 	unsigned long addr;
 	struct kvm_arch_async_pf arch;
 	struct page *page;
-	bool done;
 };

 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index f30aa1c..89acf41 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -76,7 +76,6 @@ static void async_pf_execute(struct work_struct *work)
 	spin_lock(&vcpu->async_pf.lock);
 	list_add_tail(&apf->link, &vcpu->async_pf.done);
 	apf->page = page;
-	apf->done = true;
 	spin_unlock(&vcpu->async_pf.lock);

 	/*
@@ -100,9 +99,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 		struct kvm_async_pf *work =
 			list_entry(vcpu->async_pf.queue.next,
 				   typeof(*work), queue);
-		cancel_work_sync(&work->work);
 		list_del(&work->queue);
-		if (!work->done) { /* work was canceled */
+		if (cancel_work_sync(&work->work)) {
 			mmdrop(work->mm);
 			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
 			kmem_cache_free(async_pf_cache, work);
@@ -167,7 +165,6 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
 		return 0;

 	work->page = NULL;
-	work->done = false;
 	work->vcpu = vcpu;
 	work->gva = gva;
 	work->addr = gfn_to_hva(vcpu->kvm, gfn);
--
1.8.3.1
[PATCH 4/7] static_key: use static_key_slow_inc_deferred()
Simple replacement where possible.  Saves us problematic access to the
structure and allows optimizations and bug fixes to take place.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/lapic.c | 7 ++++---
 kernel/events/core.c | 6 +++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5439117..eff85f6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -121,7 +121,7 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
 		if (val & APIC_SPIV_APIC_ENABLED)
 			static_key_slow_dec_deferred(&apic_sw_disabled);
 		else
-			static_key_slow_inc(&apic_sw_disabled.key);
+			static_key_slow_inc_deferred(&apic_sw_disabled);
 	}
 	apic_set_reg(apic, APIC_SPIV, val);
 }
@@ -1351,7 +1351,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 		if (value & MSR_IA32_APICBASE_ENABLE)
 			static_key_slow_dec_deferred(&apic_hw_disabled);
 		else
-			static_key_slow_inc(&apic_hw_disabled.key);
+			static_key_slow_inc_deferred(&apic_hw_disabled);
 		recalculate_apic_map(vcpu->kvm);
 	}
@@ -1546,7 +1546,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 	kvm_lapic_set_base(vcpu,
 			APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE);
-	static_key_slow_inc(&apic_sw_disabled.key); /* sw disabled at reset */
+	/* sw disabled at reset */
+	static_key_slow_inc_deferred(&apic_sw_disabled);
 	kvm_lapic_reset(vcpu);
 	kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d49a9d2..ade89a1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6565,7 +6565,7 @@ static void account_event(struct perf_event *event)
 		return;

 	if (event->attach_state & PERF_ATTACH_TASK)
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);
 	if (event->attr.mmap || event->attr.mmap_data)
 		atomic_inc(&nr_mmap_events);
 	if (event->attr.comm)
@@ -6577,9 +6577,9 @@ static void account_event(struct perf_event *event)
 			tick_nohz_full_kick_all();
 	}
 	if (has_branch_stack(event))
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);
 	if (is_cgroup_event(event))
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);

 	account_event_cpu(event, event->cpu);
 }
--
1.8.3.1
Re: [PATCH 4/7] static_key: use static_key_slow_inc_deferred()
2013-10-17 12:39+0200, Paolo Bonzini:
> On 17/10/2013 12:10, Radim Krčmář wrote:
> > Simple replacement where possible.  Saves us problematic access to
> > the structure and allows optimizations and bug fixes to take place.
> 
> I think you should introduce this first as a simple wrapper around
> static_key_slow_inc, and then add the bells and whistles you have in
> patches 2 and 3.

Ok, thanks.
[PATCH v2 5/5] static_key: improve deferred inc behavior
We can cancel a deferred static_key_slow_dec() instead of increasing
.enabled.counter.  The timer now won't fire before 'timeout' has passed
since the last increase, so this patch further stabilizes the case of
frequent switching.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 kernel/jump_label.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index bd7ad31..9b57261 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -79,7 +79,8 @@ void static_key_slow_inc_deferred(struct static_key_deferred *key)
 	STATIC_KEY_CHECK_USE();
 	if (atomic_dec_if_positive(&key->enabled_debt) >= 0)
 		return;
-	static_key_slow_inc(&key->key);
+	if (!cancel_delayed_work(&key->work))
+		static_key_slow_inc(&key->key);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc_deferred);
--
1.8.4.2
[PATCH v2 0/5] static_key: deferred key fixes and improvements
This series is made out of three logical parts:
 - patches 1 and 2 fix a panic caused by accessing freed module memory
 - patches 3 and 4 fix miscounting by static_key_slow_dec_deferred()
 - patch 5 introduces a minor optimization

More jump_label/static_key patches are prepared, but I returned to them
yesterday and implemented a variable jump length on amd64, which
requires some refactoring, porting to remaining architectures, and
retesting, so I'm posting this independent part before it gets
overtaken by higher priority work again
= This series was tested with additional patches

^ I wrote this on Tuesday and then moved to higher priority work, but
returned with enough courage to post a different first part.
The first part was tested on amd64, s390x and ppc64, the rest also on
armv7.

Applies to next-20131206 and v3.13-rc3.

Radim Krčmář (5):
  static_key: add a section for deferred keys
  static_key: cancel rate limit timer on rmmod
  static_key: add static_key_slow_inc_deferred()
  static_key: keep deferred enabled counter debt
  static_key: improve deferred inc behavior

 arch/x86/kvm/lapic.c                 | 11 +
 arch/x86/kvm/lapic.h                 |  4 +--
 include/asm-generic/vmlinux.lds.h    |  1 +
 include/linux/jump_label_ratelimit.h | 10
 include/linux/module.h               |  3 +++
 include/linux/perf_event.h           |  2 +-
 kernel/events/core.c                 |  8 +++---
 kernel/jump_label.c                  | 47 +++-
 kernel/module.c                      |  4 +++
 9 files changed, 72 insertions(+), 18 deletions(-)

--
1.8.4.2
[PATCH v2 1/5] static_key: add a section for deferred keys
We need to know about all deferred keys if we want to correctly
 - initialize timers on kernel init/module load
 - destroy pending timers when unloading a module

We depend on a section attribute, so direct definitions of
struct static_key_deferred should be avoided, which is suboptimal.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
A more general solution would use compile-time magic to generate an
array of pointers to deferred structures, but I am not sure if it is
acceptable and possible.

A worse approach added an unload_callback_list to the struct module.
Callbacks of type void (*)(void *) were registered on
static_key_rate_limit(), to cancel the deferred key work from
module_unload_free().  (I have patches for this)

Jump labels are already notified for module changes, so we could keep
track of deferred keys in modules there, using a static global tree.
(in the worst case)

 arch/x86/kvm/lapic.c                 | 4 ++--
 arch/x86/kvm/lapic.h                 | 4 ++--
 include/asm-generic/vmlinux.lds.h    | 1 +
 include/linux/jump_label_ratelimit.h | 4 ++++
 include/linux/module.h               | 3 +++
 include/linux/perf_event.h           | 2 +-
 kernel/events/core.c                 | 2 +-
 kernel/module.c                      | 4 ++++
 8 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5439117..5f01547 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -112,8 +112,8 @@ static inline int __apic_test_and_clear_vector(int vec, void *bitmap)
 	return __test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 }

-struct static_key_deferred apic_hw_disabled __read_mostly;
-struct static_key_deferred apic_sw_disabled __read_mostly;
+static_key_deferred(apic_hw_disabled);
+static_key_deferred(apic_sw_disabled);

 static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
 {
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index c730ac9..4ae9a7a 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -108,7 +108,7 @@ static inline bool kvm_vcpu_has_lapic(struct kvm_vcpu *vcpu)
 	return true;
 }

-extern struct static_key_deferred apic_hw_disabled;
+extern static_key_deferred(apic_hw_disabled);

 static inline int kvm_apic_hw_enabled(struct kvm_lapic *apic)
 {
@@ -117,7 +117,7 @@ static inline int kvm_apic_hw_enabled(struct kvm_lapic *apic)
 	return MSR_IA32_APICBASE_ENABLE;
 }

-extern struct static_key_deferred apic_sw_disabled;
+extern static_key_deferred(apic_sw_disabled);

 static inline int kvm_apic_sw_enabled(struct kvm_lapic *apic)
 {
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index bc2121f..572e583 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -214,6 +214,7 @@
 #define READ_MOSTLY_DATA(align)					\
 	. = ALIGN(align);					\
 	*(.data..read_mostly)					\
+	*(__deferred_keys)					\
 	. = ALIGN(align);

 #define CACHELINE_ALIGNED_DATA(align)				\
diff --git a/include/linux/jump_label_ratelimit.h b/include/linux/jump_label_ratelimit.h
index 089f70f..1216db0 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -3,6 +3,10 @@

 #include <linux/jump_label.h>
 #include <linux/workqueue.h>
+#include <linux/compiler.h>
+
+#define static_key_deferred(name) \
+	struct static_key_deferred name __section(__deferred_keys)

 #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
 struct static_key_deferred {
diff --git a/include/linux/module.h b/include/linux/module.h
index 46e548f..4cc7269 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -347,6 +347,9 @@ struct module
 #ifdef HAVE_JUMP_LABEL
 	struct jump_entry *jump_entries;
 	unsigned int num_jump_entries;
+
+	struct static_key_deferred *deferred_keys;
+	unsigned int num_deferred_keys;
 #endif
 #ifdef CONFIG_TRACING
 	unsigned int num_trace_bprintk_fmt;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8f4a70f..8baabca 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -672,7 +672,7 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 	}
 }

-extern struct static_key_deferred perf_sched_events;
+extern static_key_deferred(perf_sched_events);

 static inline void perf_event_task_sched_in(struct task_struct *prev,
 					    struct task_struct *task)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 403b781..ee64d26 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -138,7 +138,7 @@ enum event_type_t {
  * perf_sched_events : >0 events exist
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
[PATCH v2 2/5] static_key: cancel rate limit timer on rmmod
Fix a bug where we freed a module's memory while a timer was pending,
by canceling all deferred timers of the unloaded module.

static_key_rate_limit() still can't be called more than once.

Reproducer: (host crasher)
  modprobe kvm_intel
  (sleep 1; echo quit) \
    | qemu-kvm -kernel /dev/null -monitor stdio
  sleep 0.5
  until modprobe -rv kvm_intel 2>/dev/null; do :; done

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
I decided not to post a patch that uses __deferred_key in kernel/module
init, so these three functions might seem like an overkill.

 kernel/jump_label.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 9019f15..02d610a 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -125,6 +125,27 @@ void jump_label_rate_limit(struct static_key_deferred *key,
 }
 EXPORT_SYMBOL_GPL(jump_label_rate_limit);

+/* could (should?) be abstracted more */
+static void with_deferred_keys(struct static_key_deferred *entry,
+			       struct static_key_deferred *stop,
+			       void (*op)(struct static_key_deferred *))
+{
+	struct static_key_deferred *iter;
+
+	for (iter = entry; iter < stop; iter++)
+		op(iter);
+}
+
+#define with_module_deferred_keys(mod, op) \
+	with_deferred_keys(mod->deferred_keys, \
+			   mod->deferred_keys + mod->num_deferred_keys, \
+			   op)
+
+static void deferred_key_cancel_work(struct static_key_deferred *dkey)
+{
+	cancel_delayed_work_sync(&dkey->work);
+}
+
 static int addr_conflict(struct jump_entry *entry, void *start, void *end)
 {
 	if (entry->code <= (unsigned long)end &&
@@ -385,6 +406,7 @@ jump_label_module_notify(struct notifier_block *self, unsigned long val,
 		jump_label_unlock();
 		break;
 	case MODULE_STATE_GOING:
+		with_module_deferred_keys(mod, deferred_key_cancel_work);
 		jump_label_lock();
 		jump_label_del_module(mod);
 		jump_label_unlock();
--
1.8.4.2
[PATCH v2 3/5] static_key: add static_key_slow_inc_deferred()
Complement to static_key_slow_dec_deferred().  This avoids an
asymmetrical API and prepares us for future optimizations and bug
fixes.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/lapic.c                 | 7 ++++---
 include/linux/jump_label_ratelimit.h | 5 +++++
 kernel/events/core.c                 | 6 +++---
 kernel/jump_label.c                  | 7 +++++++
 4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5f01547..86973ac 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -121,7 +121,7 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
 		if (val & APIC_SPIV_APIC_ENABLED)
 			static_key_slow_dec_deferred(&apic_sw_disabled);
 		else
-			static_key_slow_inc(&apic_sw_disabled.key);
+			static_key_slow_inc_deferred(&apic_sw_disabled);
 	}
 	apic_set_reg(apic, APIC_SPIV, val);
 }
@@ -1351,7 +1351,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 		if (value & MSR_IA32_APICBASE_ENABLE)
 			static_key_slow_dec_deferred(&apic_hw_disabled);
 		else
-			static_key_slow_inc(&apic_hw_disabled.key);
+			static_key_slow_inc_deferred(&apic_hw_disabled);
 		recalculate_apic_map(vcpu->kvm);
 	}
@@ -1546,7 +1546,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 	kvm_lapic_set_base(vcpu,
 			APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE);
-	static_key_slow_inc(&apic_sw_disabled.key); /* sw disabled at reset */
+	/* sw disabled at reset */
+	static_key_slow_inc_deferred(&apic_sw_disabled);
 	kvm_lapic_reset(vcpu);
 	kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
diff --git a/include/linux/jump_label_ratelimit.h b/include/linux/jump_label_ratelimit.h
index 112ba5f..a18aded 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -18,6 +18,7 @@ struct static_key_deferred {
 #endif

 #ifdef HAVE_JUMP_LABEL
+extern void static_key_slow_inc_deferred(struct static_key_deferred *key);
 extern void static_key_slow_dec_deferred(struct static_key_deferred *key);
 extern void jump_label_rate_limit(struct static_key_deferred *key,
 		unsigned long rl);
@@ -26,6 +27,10 @@ jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl);
 struct static_key_deferred {
 	struct static_key key;
 };
+static inline void static_key_slow_inc_deferred(struct static_key_deferred *key)
+{
+	static_key_slow_inc(&key->key);
+}
 static inline void static_key_slow_dec_deferred(struct static_key_deferred *key)
 {
 	STATIC_KEY_CHECK_USE();
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ee64d26..b3fb4c2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6597,7 +6597,7 @@ static void account_event(struct perf_event *event)
 		return;

 	if (event->attach_state & PERF_ATTACH_TASK)
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);
 	if (event->attr.mmap || event->attr.mmap_data)
 		atomic_inc(&nr_mmap_events);
 	if (event->attr.comm)
@@ -6609,9 +6609,9 @@ static void account_event(struct perf_event *event)
 			tick_nohz_full_kick_all();
 	}
 	if (has_branch_stack(event))
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);
 	if (is_cgroup_event(event))
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);

 	account_event_cpu(event, event->cpu);
 }
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 02d610a..41592ba 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -74,6 +74,13 @@ void static_key_slow_inc(struct static_key *key)
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc);

+void static_key_slow_inc_deferred(struct static_key_deferred *key)
+{
+	STATIC_KEY_CHECK_USE();
+	static_key_slow_inc(&key->key);
+}
+EXPORT_SYMBOL_GPL(static_key_slow_inc_deferred);
+
 static void __static_key_slow_dec(struct static_key *key,
 		unsigned long rate_limit, struct delayed_work *work)
 {
--
1.8.4.2
[PATCH v2 4/5] static_key: keep deferred enabled counter debt
When '.enabled.counter == 1', static_key_slow_dec_deferred() gets
silently dropped if a decrease is already pending.  We print a warning
if this happens, and because .enabled.counter cannot go below 1 before
the pending decrease has finished, we keep a count of the ignored
static_key_slow_dec_deferred() calls and skip an equal number of
static_key_slow_inc_deferred() calls.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 include/linux/jump_label_ratelimit.h |  1 +
 kernel/jump_label.c                  | 17 +++++++++++------
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/jump_label_ratelimit.h b/include/linux/jump_label_ratelimit.h
index a18aded..2d5fa1a 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -14,6 +14,7 @@ struct static_key_deferred {
 	struct static_key key;
 	unsigned long timeout;
 	struct delayed_work work;
+	atomic_t enabled_debt;
 };
 #endif

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 41592ba..bd7ad31 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -77,12 +77,14 @@ EXPORT_SYMBOL_GPL(static_key_slow_inc);
 void static_key_slow_inc_deferred(struct static_key_deferred *key)
 {
 	STATIC_KEY_CHECK_USE();
+	if (atomic_dec_if_positive(&key->enabled_debt) >= 0)
+		return;
 	static_key_slow_inc(&key->key);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc_deferred);

 static void __static_key_slow_dec(struct static_key *key,
-		unsigned long rate_limit, struct delayed_work *work)
+		struct static_key_deferred *dkey)
 {
 	if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
 		WARN(atomic_read(&key->enabled) < 0,
@@ -90,9 +92,12 @@ static void __static_key_slow_dec(struct static_key *key,
 		return;
 	}

-	if (rate_limit) {
+	if (dkey && dkey->timeout) {
 		atomic_inc(&key->enabled);
-		schedule_delayed_work(work, rate_limit);
+		if (!schedule_delayed_work(&dkey->work, dkey->timeout)) {
+			atomic_inc(&dkey->enabled_debt);
+			WARN(1, "jump label: negative deferred count!\n");
+		}
 	} else {
 		if (!jump_label_get_branch_default(key))
 			jump_label_update(key, JUMP_LABEL_DISABLE);
@@ -106,20 +111,20 @@ static void jump_label_update_timeout(struct work_struct *work)
 {
 	struct static_key_deferred *key =
 		container_of(work, struct static_key_deferred, work.work);
-	__static_key_slow_dec(&key->key, 0, NULL);
+	__static_key_slow_dec(&key->key, NULL);
 }

 void static_key_slow_dec(struct static_key *key)
 {
 	STATIC_KEY_CHECK_USE();
-	__static_key_slow_dec(key, 0, NULL);
+	__static_key_slow_dec(key, NULL);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_dec);

 void static_key_slow_dec_deferred(struct static_key_deferred *key)
 {
 	STATIC_KEY_CHECK_USE();
-	__static_key_slow_dec(&key->key, key->timeout, &key->work);
+	__static_key_slow_dec(&key->key, key);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_dec_deferred);
--
1.8.4.2
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-12 21:36+0100, Paolo Bonzini:
> From: Gleb Natapov g...@redhat.com
> 
> A guest can cause a BUG_ON() leading to a host kernel crash.  When the
> guest writes to the ICR to request an IPI, while in x2apic mode the
> following things happen: the destination is read from ICR2, which is a
> register that the guest can control.
> 
> kvm_irq_delivery_to_apic_fast uses the high 16 bits of ICR2 as the
> cluster id.  A BUG_ON is triggered, which is a protection against
> accessing map->logical_map with an out-of-bounds access and manages
> to avoid that anything really unsafe occurs.
> 
> The logic in the code is correct from real HW point of view.  The
> problem is that KVM supports only one cluster with ID 0 in clustered
> mode, but the code that has the bug does not take this into account.

The more I read about x2apic, the more confused I am ...

 - How was the cluster x2apic enabled?
   Linux won't enable cluster x2apic without interrupt remapping and I
   had no idea we were allowed to do it.

 - A hardware test-suite found this?
   This bug can only be hit when the destination cpu is >= 256, so the
   request itself is buggy -- we don't support that many in kvm and it
   would crash when initializing the vcpus if we did.
   => It looks like we should just ignore the ipi, because we have no
      vcpus in that cluster.

 - Where does the 'only one supported cluster' come from?
   I only see we use 'struct kvm_lapic *logical_map[16][16];', which
   supports 16 clusters of 16 apics => first 256 vcpus, so if we map
   everything to logical_map[0][0:15], we would not work correctly in
   the cluster x2apic with > 16 vcpus.

Thanks.

> Reported-by: Lars Bull larsb...@google.com
> Cc: sta...@vger.kernel.org
> Signed-off-by: Gleb Natapov g...@redhat.com
> Signed-off-by: Paolo Bonzini pbonz...@redhat.com
> ---
>  arch/x86/kvm/lapic.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8bec45c1610..801dc3fd66e1 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -143,6 +143,8 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
>  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
>  }
> 
> +#define KMV_X2APIC_CID_BITS 0
> +
>  static void recalculate_apic_map(struct kvm *kvm)
>  {
>  	struct kvm_apic_map *new, *old = NULL;
> @@ -180,7 +182,8 @@ static void recalculate_apic_map(struct kvm *kvm)
>  		if (apic_x2apic_mode(apic)) {
>  			new->ldr_bits = 32;
>  			new->cid_shift = 16;
> -			new->cid_mask = new->lid_mask = 0xffff;
> +			new->cid_mask = (1 << KMV_X2APIC_CID_BITS) - 1;
> +			new->lid_mask = 0xffff;
>  		} else if (kvm_apic_sw_enabled(apic) &&
> 				!new->cid_mask /* flat mode */ &&
> 				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER) {
> --
> 1.8.3.1
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-13 18:25+0100, Paolo Bonzini:
> On 13/12/2013 17:07, Radim Krčmář wrote:
> > This bug can only be hit when the destination cpu is >= 256, so the
> > request itself is buggy -- we don't support that many in kvm and it
> > would crash when initializing the vcpus if we did.
> > => It looks like we should just ignore the ipi, because we have no
> >    vcpus in that cluster.
> 
> That's what should happen in physical mode.  Something like this patch:
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 5439117d5c4c..1f8e9e1abd3b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -613,9 +615,10 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
>  	if (irq->dest_mode == 0) { /* physical mode */
>  		if (irq->delivery_mode == APIC_DM_LOWEST ||
> -		    irq->dest_id == 0xff)
> +		    irq->dest_id == 0xff ||

The 0xff is xapic broadcast, we did not care about the x2apic one and
no-one noticed that it does not work :)
0xff isn't special in x2apic mode.

> +		    (apic_x2apic_mode(src) && irq->dest_id > 0xff))
>  			goto out;
> -		dst = &map->phys_map[irq->dest_id & 0xff];
> +		dst = &map->phys_map[irq->dest_id];
>  	} else {
>  		u32 mda = irq->dest_id << (32 - map->ldr_bits);
> 
> On top of this, the x2apic spec documents a broadcast destination ID
> that could be implemented as follows.  But I have no idea if this is
> correct and how it differs from the usual broadcast shorthand:
> 
> @@ -815,9 +818,11 @@ static void apic_send_ipi(struct kvm_lapic *apic)
>  	irq.level = icr_low & APIC_INT_ASSERT;
>  	irq.trig_mode = icr_low & APIC_INT_LEVELTRIG;
>  	irq.shorthand = icr_low & APIC_SHORT_MASK;
> -	if (apic_x2apic_mode(apic))
> +	if (apic_x2apic_mode(apic)) {
>  		irq.dest_id = icr_high;
> +		if (icr_high == 0xffffffff)
> +			irq.shorthand = APIC_DEST_ALLINC;

The shorthand takes priority, so we shouldn't overwrite it.
This is better solved after we 'goto out' from the _fast().
(could be nicer still ...)

> +	} else
>  		irq.dest_id = GET_APIC_DEST_FIELD(icr_high);
> 
>  	trace_kvm_apic_ipi(icr_low, irq.dest_id);

And another possibility occurred to me now: Google was the first one to
encounter a broadcast; we don't handle it at all in the fast path and
x2apic has 0xffffffff as the target in both modes ...
(I think that xapic has 0xff in both of them too, but I'll check)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5439117..e4618c4 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -512,19 +512,24 @@ static void apic_set_tpr(struct kvm_lapic *apic, u32 tpr)
 		apic_update_ppr(apic);
 }

-int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest)
+int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u32 dest)
 {
-	return dest == 0xff || kvm_apic_id(apic) == dest;
+	return dest == (apic_x2apic_mode(apic) ? 0xffffffff : 0xff)
+	       || kvm_apic_id(apic) == dest;
 }

-int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda)
+int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda)
 {
 	int result = 0;
 	u32 logical_id;

+	if (mda == (apic_x2apic_mode(apic) ? 0xffffffff : 0xff))
+		return 1;
+
 	if (apic_x2apic_mode(apic)) {
 		logical_id = kvm_apic_get_reg(apic, APIC_LDR);
-		return logical_id & mda;
+		return mda >> 16 == logical_id >> 16 &&
+		       logical_id & mda & 0xffff;
 	}

 	logical_id = GET_APIC_LOGICAL_ID(kvm_apic_get_reg(apic, APIC_LDR));
@@ -605,6 +610,10 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 	if (irq->shorthand)
 		return false;

+	/* broadcast */
+	if (irq->dest_id == (apic_x2apic_mode(src) ? 0xffffffff : 0xff))
+		return false;
+
 	rcu_read_lock();
 	map = rcu_dereference(kvm->arch.apic_map);

@@ -612,10 +621,13 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 		goto out;

 	if (irq->dest_mode == 0) { /* physical mode */
-		if (irq->delivery_mode == APIC_DM_LOWEST ||
-		    irq->dest_id == 0xff)
+		if (irq->dest_id > 0xff) {
+			ret = true;
+			goto out;
+		}
+		if (irq->delivery_mode == APIC_DM_LOWEST)
 			goto out;
-		dst = &map->phys_map[irq->dest_id & 0xff];
+		dst = &map->phys_map[irq->dest_id];
 	} else {
 		u32 mda = irq->dest_id << (32 - map->ldr_bits);

The logical x2apic slowpath was completely broken, maybe still is, I
have no way to test it ...

The broadcast checking should be abstracted, we call it a lot and
giving it a name would be much better than commenting.

Also
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-14 11:46+0200, Gleb Natapov: On Fri, Dec 13, 2013 at 05:07:54PM +0100, Radim Krčmář wrote: 2013-12-12 21:36+0100, Paolo Bonzini: From: Gleb Natapov g...@redhat.com A guest can cause a BUG_ON() leading to a host kernel crash. When the guest writes to the ICR to request an IPI, while in x2apic mode the following things happen, the destination is read from ICR2, which is a register that the guest can control. kvm_irq_delivery_to_apic_fast uses the high 16 bits of ICR2 as the cluster id. A BUG_ON is triggered, which is a protection against accessing map-logical_map with an out-of-bounds access and manages to avoid that anything really unsafe occurs. The logic in the code is correct from real HW point of view. The problem is that KVM supports only one cluster with ID 0 in clustered mode, but the code that has the bug does not take this into account. The more I read about x2apic, the more confused I am ... - How was the cluster x2apic enabled? Linux won't enable cluster x2apic without interrupt remapping and I had no idea we were allowed to do it. Malicious code can do it. - A hardware test-suite found this? This bug can only be hit when the destination cpu is 256, so the request itself is buggy -- we don't support that many in kvm and it would crash when initializing the vcpus if we did. = It looks like we should just ignore the ipi, because we have no vcpus in that cluster. That's the nature of malicious code: it does what your code does not expects it to do :) I was wondering if there wasn't malicious linux on the other side too :) - Where does the 'only one supported cluster' come from? only one supported cluster comes from 8 bit cpuid limitation of KVM's x2apic implementation. With 8 bit cpuid you can only address cluster 0 in logical mode. One x2apic cluster has 16 cpus and we generate the x2apic LDR correctly, so 8 bit cpuid can address first 16 clusters as well. 
u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));

I only see we use 'struct kvm_lapic *logical_map[16][16];', which supports 16 clusters of 16 apics = the first 256 vcpus, so if we map everything to logical_map[0][0:15], we would not work correctly in cluster x2apic with > 16 vcpus.

Such a config cannot work today because of the 8 bit cpuid limitation. When the limitation is removed, KVM_X2APIC_CID_BITS will be set to the actual number of bits we want to support.

Even with KVM_X2APIC_CID_BITS = 4, which would allow us to support 8 bit cpuid, we would still deliver interrupts destined for cpuid >= 256 to potentially plugged cpus. It will likely be smaller than 16 bits though, since full 18 bit support would require huge tables.

Yeah, we'll have to do dynamic allocation if we are ever going to need the full potential of x2apic. (2^20-16 cpus in cluster mode and 2^32-1 in flat mode)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org  More majordomo info at http://vger.kernel.org/majordomo-info.html
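The LDR derivation quoted in the thread can be checked with a short userspace sketch. The formula is the one from the discussion; the helper names and the cluster-extraction function are illustrative stand-ins for KVM's internal code, not its actual API:

```c
#include <assert.h>
#include <stdint.h>

/* x2apic logical destination: high 16 bits = cluster id (apic_id >> 4),
 * low 16 bits = one-hot position of the cpu within its 16-cpu cluster. */
static uint32_t x2apic_ldr(uint32_t id)
{
    return ((id >> 4) << 16) | (1u << (id & 0xf));
}

/* Cluster id as a dispatcher would derive it from a logical destination
 * (hypothetical helper; mirrors the "high 16 bits as cluster id" point). */
static uint32_t x2apic_cluster(uint32_t ldr)
{
    return ldr >> 16;
}
```

With 8-bit apic ids, the highest reachable cluster is 15, which is why the thread notes that 8-bit cpuid can address the first 16 clusters.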
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-16 14:16+0200, Gleb Natapov: On Mon, Dec 16, 2013 at 01:01:10PM +0100, Radim Krčmář wrote:

- Where does the 'only one supported cluster' come from?

'only one supported cluster' comes from the 8 bit cpuid limitation of KVM's x2apic implementation. With 8 bit cpuid you can only address cluster 0 in logical mode.

One x2apic cluster has 16 cpus and we generate the x2apic LDR correctly, so 8 bit cpuid can address the first 16 clusters as well.
u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));

An interrupt from a device cannot generate such an ldr, only an IPI can. Only 4 cpus in cluster zero are addressable in clustering mode by a device. Without irq remapping, x2apic is a PV interface between host and guest, where the guest needs to know the KVM implementation's limitations to use it.

Thanks, I'll read more about devices ... still no idea how they could address cluster 15.

I do not see a point in fixing problems in x2apic logical mode emulation right now, since it will not make it usable, as long as there are no security problems there.

Agreed; I wanted to know why this patch was correct, if we cared.

I only see we use 'struct kvm_lapic *logical_map[16][16];', which supports 16 clusters of 16 apics = the first 256 vcpus, so if we map everything to logical_map[0][0:15], we would not work correctly in cluster x2apic with > 16 vcpus.

Such a config cannot work today because of the 8 bit cpuid limitation. When the limitation is removed, KVM_X2APIC_CID_BITS will be set to the actual number of bits we want to support. Even with KVM_X2APIC_CID_BITS = 4, which would allow us to support 8 bit cpuid, we would still deliver interrupts destined for cpuid >= 256 to potentially plugged cpus.

Again, KVM_X2APIC_CID_BITS = 4 will not allow us to support 8 bit cpuids, unfortunately; not sure what you mean by the second part of the sentence.
Sorry, I meant that with this change, we map all clusters to cluster 0, which has two flaws:
- in kvm_lapic_set_base(), the order of vcpu creation determines those assigned to cluster 0, and the rest is unaddressable (overwritten)
- we can send an IPI to an unplugged high cpuid and it arrives in cluster 0
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-16 13:55+0100, Radim Krčmář: 2013-12-16 14:16+0200, Gleb Natapov: On Mon, Dec 16, 2013 at 01:01:10PM +0100, Radim Krčmář wrote:

- Where does the 'only one supported cluster' come from?

'only one supported cluster' comes from the 8 bit cpuid limitation of KVM's x2apic implementation. With 8 bit cpuid you can only address cluster 0 in logical mode.

One x2apic cluster has 16 cpus and we generate the x2apic LDR correctly, so 8 bit cpuid can address the first 16 clusters as well.
u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));

An interrupt from a device cannot generate such an ldr, only an IPI can. Only 4 cpus in cluster zero are addressable in clustering mode by a device. Without irq remapping, x2apic is a PV interface between host and guest, where the guest needs to know the KVM implementation's limitations to use it.

Thanks, I'll read more about devices ... still no idea how they could address cluster 15.

I do not see a point in fixing problems in x2apic logical mode emulation right now, since it will not make it usable, as long as there are no security problems there.

Agreed; I wanted to know why this patch was correct, if we cared.

I only see we use 'struct kvm_lapic *logical_map[16][16];', which supports 16 clusters of 16 apics = the first 256 vcpus, so if we map everything to logical_map[0][0:15], we would not work correctly in cluster x2apic with > 16 vcpus.

Such a config cannot work today because of the 8 bit cpuid limitation. When the limitation is removed, KVM_X2APIC_CID_BITS will be set to the actual number of bits we want to support. Even with KVM_X2APIC_CID_BITS = 4, which would allow us to support 8 bit cpuid, we would still deliver interrupts destined for cpuid >= 256 to potentially plugged cpus.

Again, KVM_X2APIC_CID_BITS = 4 will not allow us to support 8 bit cpuids, unfortunately; not sure what you mean by the second part of the sentence.
Sorry, I meant that with this change, we map all clusters to cluster 0, which has two flaws:
- in kvm_lapic_set_base(), the order of vcpu creation determines those

Gah, should have been 'recalculate_apic_map()' ... The patch would be especially surprising with dynamic adding of vcpus.

assigned to cluster 0, and the rest is unaddressable (overwritten)
- we can send an IPI to an unplugged high cpuid and it arrives in cluster 0
[PATCH] KVM: SVM: fix NMI window after iret
We should open the NMI window right after an iret, but SVM exits before it. We wanted to single-step using the trap flag and then open it. (Or we could emulate the iret instead.) We haven't done so since commit 3842d135ff2 (likely), because the iret exit handler does not request an event, so the NMI window remains closed until the next exit. Fix this by making a KVM_REQ_EVENT request in the iret handler.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
(btw. kvm-unit-tests weren't executed on SVM since Nov 2010, at least)

 arch/x86/kvm/svm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c7168a5..b5a735b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2829,6 +2829,7 @@ static int iret_interception(struct vcpu_svm *svm)
 	clr_intercept(svm, INTERCEPT_IRET);
 	svm->vcpu.arch.hflags |= HF_IRET_MASK;
 	svm->nmi_iret_rip = kvm_rip_read(&svm->vcpu);
+	kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
 	return 1;
 }
--
1.8.5.2
Re: [PATCH] KVM: SVM: fix NMI window after iret
2014-01-17 12:18-0800, Greg KH: On Fri, Jan 17, 2014 at 08:52:42PM +0100, Radim Krčmář wrote: We should open the NMI window right after an iret, but SVM exits before it. We wanted to single-step using the trap flag and then open it. (Or we could emulate the iret instead.) We haven't done so since commit 3842d135ff2 (likely), because the iret exit handler does not request an event, so the NMI window remains closed until the next exit. Fix this by making a KVM_REQ_EVENT request in the iret handler. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- (btw. kvm-unit-tests weren't executed on SVM since Nov 2010, at least) arch/x86/kvm/svm.c | 1 + 1 file changed, 1 insertion(+)

<formletter>
This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read Documentation/stable_kernel_rules.txt for how to do this properly.
</formletter>

Welp, at the last second I decided it is not that critical to have it in stable, and forgot to clean the git-send-email command line too. Please ignore this patch in stable.
[PATCH] kvm: print suberror on all internal errors
KVM introduced the internal error exit reason and suberror at the same time, and later extended it with internal error data. QEMU does not report the suberror on hosts between these two events, because we check for the extension. (Half a year in 2009, but it is misleading.) Fix by removing the KVM_CAP_INTERNAL_ERROR_DATA condition on the printf.

(partially improved by bb44e0d12df70 and ba4047cf848a3 in the past)

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 kvm-all.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 0bfb060..0a91d8e 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1539,17 +1539,16 @@ static void kvm_handle_io(uint16_t port, void *data, int direction, int size,
 static int kvm_handle_internal_error(CPUState *cpu, struct kvm_run *run)
 {
-    fprintf(stderr, "KVM internal error.");
+    fprintf(stderr, "KVM internal error. Suberror: %d\n",
+            run->internal.suberror);
+
     if (kvm_check_extension(kvm_state, KVM_CAP_INTERNAL_ERROR_DATA)) {
         int i;

-        fprintf(stderr, " Suberror: %d\n", run->internal.suberror);
         for (i = 0; i < run->internal.ndata; ++i) {
             fprintf(stderr, "extra data[%d]: %"PRIx64"\n", i,
                     (uint64_t)run->internal.data[i]);
         }
-    } else {
-        fprintf(stderr, "\n");
     }
     if (run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
         fprintf(stderr, "emulation failure\n");
--
1.8.5.2
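The behavior change above can be modeled in plain C: the suberror line is printed unconditionally, and only the extra data stays behind the capability check. The struct, the capability flag, and formatting into a buffer (so the output can be inspected) are stand-ins for QEMU's real code, shown here only to illustrate the control flow:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the relevant part of struct kvm_run. */
struct fake_internal { int suberror; int ndata; uint64_t data[4]; };

/* Mirrors the patched kvm_handle_internal_error(): suberror always
 * reported, extra data only when the capability is present. */
static void report_internal_error(char *buf, size_t len, int has_data_cap,
                                  const struct fake_internal *in)
{
    size_t off = 0;
    off += snprintf(buf + off, len - off,
                    "KVM internal error. Suberror: %d\n", in->suberror);
    if (has_data_cap) {
        for (int i = 0; i < in->ndata; ++i)
            off += snprintf(buf + off, len - off,
                            "extra data[%d]: %" PRIx64 "\n", i, in->data[i]);
    }
}
```

On a host without KVM_CAP_INTERNAL_ERROR_DATA (the half-year window the commit message mentions), the suberror is now still reported.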
Re: host_fx_image, guest_fx_image ; what are they ?
2014-03-04 23:35+0530, ratheesh kannoth: Could you please help me understand the variables below? I found them in struct kvm_vcpu:

char *host_fx_image;
char *guest_fx_image;

Commit b114b0804df7131cb6764b948c1c530c834fa3c0 explains them. (How does the existence depend on lost KVM code, dear time traveller?)
Re: [PATCH 4/7] KVM: vmx: Allow the guest to run with dirty debug registers
2014-03-07 12:42+0100, Paolo Bonzini: When not running in guest-debug mode (i.e. the guest controls the debug registers), having to take an exit for each DR access is a waste of time. If the guest gets into a state where each context switch causes DR to be saved and restored, this can take away as much as 40% of the execution time from the guest.

If the guest is running with vcpu->arch.db == vcpu->arch.eff_db, we can let it write freely to the debug registers and reload them on the next exit. We still need to exit on the first access, so that the KVM_DEBUGREG_WONT_EXIT flag is set in switch_db_regs; after that, further accesses to the debug registers will not cause a vmexit.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/vmx.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 06e4ec877a1c..e377522408b1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2843,7 +2843,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 		vmx_capability.ept, vmx_capability.vpid);
 	}

-	min = 0;
+	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif
@@ -5113,6 +5113,22 @@ static int handle_dr(struct kvm_vcpu *vcpu)
 		}
 	}

+	if (vcpu->guest_debug == 0) {
+		u32 cpu_based_vm_exec_control;
+
+		cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+		cpu_based_vm_exec_control &= ~CPU_BASED_MOV_DR_EXITING;
+		vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);

vmcs_clear_bits() covers exactly this use-case. (Barring the explicit bit-width.)

+
+		/*
+		 * No more DR vmexits; force a reload of the debug registers
+		 * and reenter on this instruction. The next vmexit will
+		 * retrieve the full state of the debug registers.
+		 */
+		vcpu->arch.switch_db_regs |= KVM_DEBUGREG_WONT_EXIT;
+		return 1;
+	}
+

We could make the code slightly uglier and move the functional part of this block before the previous one, so it would do both things in one exit. (Exception handler will likely access DR too.)

 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 	dr = exit_qualification & DEBUG_REG_ACCESS_NUM;
 	reg = DEBUG_REG_ACCESS_REG(exit_qualification);
@@ -5139,6 +5155,24 @@ static void vmx_set_dr6(struct kvm_vcpu *vcpu, unsigned long val)
 {
 }

+static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+	u32 cpu_based_vm_exec_control;
+
+	get_debugreg(vcpu->arch.db[0], 0);
+	get_debugreg(vcpu->arch.db[1], 1);
+	get_debugreg(vcpu->arch.db[2], 2);
+	get_debugreg(vcpu->arch.db[3], 3);
+	get_debugreg(vcpu->arch.dr6, 6);
+	vcpu->arch.dr7 = vmcs_readl(GUEST_DR7);
+
+	vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_WONT_EXIT;
+
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control |= CPU_BASED_MOV_DR_EXITING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);

(I'll be calling this sneaky later on.)

+}
+
 static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	vmcs_writel(GUEST_DR7, val);
@@ -8590,6 +8624,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.get_dr6 = vmx_get_dr6,
 	.set_dr6 = vmx_set_dr6,
 	.set_dr7 = vmx_set_dr7,
+	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
 	.cache_reg = vmx_cache_reg,
 	.get_rflags = vmx_get_rflags,
 	.set_rflags = vmx_set_rflags,
--
1.8.3.1
Re: [PATCH 3/7] KVM: x86: Allow the guest to run with dirty debug registers
2014-03-07 12:42+0100, Paolo Bonzini: When not running in guest-debug mode, the guest controls the debug registers and having to take an exit for each DR access is a waste of time. If the guest gets into a state where each context switch causes DR to be saved and restored, this can take away as much as 40% of the execution time from the guest. After this patch, VMX- and SVM-specific code can set a flag in switch_db_regs, telling vcpu_enter_guest that on the next exit the debug registers might be dirty and need to be reloaded (syncing will be taken care of by a new callback in kvm_x86_ops). This flag can be set on the first access to a debug registers, so that multiple accesses to the debug registers only cause one vmexit. Note that since the guest will be able to read debug registers and enable breakpoints in DR7, we need to ensure that they are synchronized on entry to the guest---including DR6 that was not synced before. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/x86.c | 16 2 files changed, 18 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5ef59d3b6c63..74eb361eaa8f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -339,6 +339,7 @@ struct kvm_pmu { enum { KVM_DEBUGREG_BP_ENABLED = 1, + KVM_DEBUGREG_WONT_EXIT = 2, }; struct kvm_vcpu_arch { @@ -707,6 +708,7 @@ struct kvm_x86_ops { void (*set_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt); u64 (*get_dr6)(struct kvm_vcpu *vcpu); void (*set_dr6)(struct kvm_vcpu *vcpu, unsigned long value); + void (*sync_dirty_debug_regs)(struct kvm_vcpu *vcpu); void (*set_dr7)(struct kvm_vcpu *vcpu, unsigned long value); void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg); unsigned long (*get_rflags)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 252b47e85c69..c48818aa04c0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6033,12 
+6033,28 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(vcpu->arch.eff_db[1], 1);
 		set_debugreg(vcpu->arch.eff_db[2], 2);
 		set_debugreg(vcpu->arch.eff_db[3], 3);
+		set_debugreg(vcpu->arch.dr6, 6);
 	}

 	trace_kvm_entry(vcpu->vcpu_id);
 	kvm_x86_ops->run(vcpu);

 	/*
+	 * Do this here before restoring debug registers on the host. And
+	 * since we do this before handling the vmexit, a DR access vmexit
+	 * can (a) read the correct value of the debug registers, (b) set
+	 * KVM_DEBUGREG_WONT_EXIT again.

We re-enable intercepts on the next exit for the sake of simplicity? (Batched accesses make perfect sense, but it seems we don't have to care about DRs at all, without guest-debug.)

+	 */
+	if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
+		int i;
+
+		WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);

Is this possible? (I presumed that we vmexit before handling ioctls.)

+		kvm_x86_ops->sync_dirty_debug_regs(vcpu);

Sneaky functionality ... it would be nicer to split 'enable DR intercepts' into a new kvm op. I think we want to disable them whenever we are not in guest-debug mode anyway, so it would be a pair. And we don't have to modify DR intercepts then, which is probably the main reason why sync_dirty_debug_regs() does two things.

+		for (i = 0; i < KVM_NR_DB_REGS; i++)
+			vcpu->arch.eff_db[i] = vcpu->arch.db[i];
+	}
+
+	/*
 	 * If the guest has used debug registers, at least dr7
 	 * will be disabled while returning to the host.
 	 * If we don't have active breakpoints in the host, we don't
--
1.8.3.1
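The flag dance the review is probing at can be summarized as a tiny state machine: the first MOV DR exit sets KVM_DEBUGREG_WONT_EXIT and disables intercepts; the next exit of any kind syncs the registers back and re-arms the intercepts. A toy model (flag values match the patch; the transition helpers are illustrative, not kernel API):

```c
#include <assert.h>

/* Flag values from the patch's enum in kvm_host.h. */
enum { DEBUGREG_BP_ENABLED = 1, DEBUGREG_WONT_EXIT = 2 };

/* First MOV DR exit with guest_debug == 0 (handle_dr path). */
static int on_first_dr_access(int flags)
{
    return flags | DEBUGREG_WONT_EXIT;  /* stop intercepting MOV DR */
}

/* Back in vcpu_enter_guest after kvm_x86_ops->run(). */
static int on_vmexit(int flags)
{
    if (flags & DEBUGREG_WONT_EXIT) {
        /* sync_dirty_debug_regs(): read DRs back, re-enable intercepts */
        flags &= ~DEBUGREG_WONT_EXIT;
    }
    return flags;
}
```

So batched DR accesses between two exits cost exactly one vmexit, which matches the "restored performance despite tens of thousands of MOV DR exits/second" observation later in the thread.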
Re: [PATCH 3/7] KVM: x86: Allow the guest to run with dirty debug registers
2014-03-09 21:07+0100, Paolo Bonzini: On 09/03/2014 19:28, Radim Krčmář wrote:

/*
+	 * Do this here before restoring debug registers on the host. And
+	 * since we do this before handling the vmexit, a DR access vmexit
+	 * can (a) read the correct value of the debug registers, (b) set
+	 * KVM_DEBUGREG_WONT_EXIT again.

We re-enable intercepts on the next exit for the sake of simplicity? (Batched accesses make perfect sense, but it seems we don't have to care about DRs at all, without guest-debug.)

It's not just for the sake of simplicity. Synchronizing debug registers on entry does have some cost, and it's required for running without debug register intercepts. You would incur this cost always, since no-guest-debug is actually the common case.

Good point, it's just this case that accesses them often; most guests won't use them at all ...

We're well into diminishing returns; Alex timed this patch vs. one that completely ignores MOV DR exits and (to the surprise of both of us) this patch completely restored performance, despite still having a few tens of thousands of MOV DR exits/second.

In the problematic case, each context switch will do a save and restore of debug registers; that's in total 13 debug register accesses (read 6 registers, write DR7 to disable all registers, write 6 registers including DR7 which enables breakpoints), all very close in time. It's quite likely that the final DR7 write is very expensive (e.g. it might have to flush the pipeline). Also, this test case must be very heavy on context switches, and each context switch already incurs a few exits (for example to inject the interrupt that causes it, or to read the current time). We would save just one exit, and there are probably enough non-DR exits even in this case to balance it out.

I've been carried too far away for design. (Thanks for the details.)

A different optimization could be to skip the LOAD_DEBUG_CONTROLS vm-entry control if DR7 == host DR7 == 0x400 && MOV DR exits are enabled.
Not sure it's worthwhile though, and there's also the DEBUGCTL MSR to take into account.

Better do these kinds of tweaks if they actually show up in the profile: Alex's testcase shows that when they do, the impact is massive.

Yeah, we'd have to implement TRUE-VMX-MSR handling just for two load/stores ...

+	kvm_x86_ops->sync_dirty_debug_regs(vcpu);

Sneaky functionality ... it would be nicer to split 'enable DR intercepts' into a new kvm op.

Why? Disable DR intercepts is already folded into the handler for MOV DR exits.

It's not obvious from the name that we also enable intercepts again, which is the more important part, imho. (I don't like the folded code much, but considering the alternatives, it's a good solution.)

And we don't have to modify DR intercepts then, which is probably the main reason why sync_dirty_debug_regs() does two things.

Another is that indirect calls are relatively expensive and add complexity; in this case they would always be used back-to-back.

True. (I could not come up with a good name for this function; intercept_debug_registers() does not reveal that we also save them, and a longer name would be hard to read without 'and', which looks weird.)
Re: [PATCH 4/7] KVM: vmx: Allow the guest to run with dirty debug registers
2014-03-09 21:12+0100, Paolo Bonzini: On 09/03/2014 19:26, Radim Krčmář wrote:

+
+		/*
+		 * No more DR vmexits; force a reload of the debug registers
+		 * and reenter on this instruction. The next vmexit will
+		 * retrieve the full state of the debug registers.
+		 */
+		vcpu->arch.switch_db_regs |= KVM_DEBUGREG_WONT_EXIT;
+		return 1;
+	}
+

We could make the code slightly uglier and move the functional part of this block before the previous one, so it would do both things in one exit.

I considered this, but decided that it's unlikely for emulation to be faster than hardware, especially on those AMD CPUs that lack decode assists (and it's good for VMX and SVM code to look as similar as possible).

I meant to move it before the block that exits if there is the 'exception on access' bit set in dr7, so we wouldn't exit again right away on the actual access, which is quite likely. (Exiting without emulation was a great move.)

(Exception handler will likely access DR too.)

Which exception handler?

For #DB. (Pure guesswork, I haven't seen any of them.)
Re: [PATCH 0/7] KVM: x86: Let the guest write to multiple debug registers with one vmexit
2014-03-07 12:42+0100, Paolo Bonzini: Alex Williamson reported that a Windows game does something weird that makes the guest save and restore debug registers on each context switch. This causes several hundred thousand vmexits per second, and basically cuts performance in half when running under KVM.

However, when not running in guest-debug mode, the guest controls the debug registers, and having to take an exit for each DR access is a waste of time. We just need one vmexit to load any stale values of DR0-DR6, and then we can let the guest run freely. On the next vmexit (whatever the reason) we will read out whatever changes the guest made to the debug registers.

Tested with x86/debug.flat on both Intel and AMD, both direct and nested virtualization.

Changes from RFC: changed get_dr7 callback to sync_dirty_debug_regs, new patches 5-7.

Paolo Bonzini (7):
  KVM: vmx: we do rely on loading DR7 on entry
  KVM: x86: change vcpu->arch.switch_db_regs to a bit mask
  KVM: x86: Allow the guest to run with dirty debug registers
  KVM: vmx: Allow the guest to run with dirty debug registers
  KVM: nVMX: Allow nested guests to run with dirty debug registers
  KVM: svm: set/clear all DR intercepts in one swoop
  KVM: svm: Allow the guest to run with dirty debug registers

All patches, Reviewed-by: Radim Krčmář rkrc...@redhat.com

This series is good even without vmcs_{set,clr}_bits(). (There is enough of them already to warrant a cleanup patch.)
 arch/x86/include/asm/kvm_host.h |  8 -
 arch/x86/kvm/svm.c              | 68 -
 arch/x86/kvm/vmx.c              | 43 -
 arch/x86/kvm/x86.c              | 20 +++-
 4 files changed, 114 insertions(+), 25 deletions(-)
--
1.8.3.1
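The review above mentions vmcs_{set,clr}_bits() as a cleaner way to express the read-modify-write on VMCS execution controls that both patches open-code. A minimal userspace sketch of such helpers; the `fake_vmcs` array stands in for the real vmcs_read32/vmcs_write32 hardware accessors, and only the helper idea itself comes from the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in backing store so the helpers can be exercised in userspace. */
static uint32_t fake_vmcs[8];

static uint32_t vmcs_read32(unsigned field)          { return fake_vmcs[field]; }
static void vmcs_write32(unsigned field, uint32_t v) { fake_vmcs[field] = v; }

/* The read-modify-write pair the review wants instead of open-coding. */
static void vmcs_set_bits(unsigned field, uint32_t mask)
{
    vmcs_write32(field, vmcs_read32(field) | mask);
}

static void vmcs_clear_bits(unsigned field, uint32_t mask)
{
    vmcs_write32(field, vmcs_read32(field) & ~mask);
}

/* Bit 23 of the primary processor-based controls (value from the SDM). */
#define CPU_BASED_MOV_DR_EXITING 0x00800000u
```

With these, the three-line blocks in handle_dr() and vmx_sync_dirty_debug_regs() each collapse to a single call.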
[PATCH] KVM: SVM: fix cr8 intercept window
We always disable the cr8 intercept in its handler, but only re-enable it if handling KVM_REQ_EVENT, so there can be a window where we do not intercept cr8 writes, which allows an interrupt to disrupt a higher priority task.

Fix this by disabling intercepts in the same function that re-enables them when needed. This fixes a BSOD in Windows 2008.

Cc: sta...@vger.kernel.org
Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/svm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 64d9bb9..f676c18 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3003,10 +3003,8 @@ static int cr8_write_interception(struct vcpu_svm *svm)
 	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
 	/* instruction emulation calls kvm_set_cr8() */
 	r = cr_interception(svm);
-	if (irqchip_in_kernel(svm->vcpu.kvm)) {
-		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+	if (irqchip_in_kernel(svm->vcpu.kvm))
 		return r;
-	}
 	if (cr8_prev <= kvm_get_cr8(&svm->vcpu))
 		return r;
 	kvm_run->exit_reason = KVM_EXIT_SET_TPR;
@@ -3568,6 +3566,8 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 	if (is_guest_mode(vcpu) && (vcpu->arch.hflags & HF_VINTR_MASK))
 		return;

+	clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+
 	if (irr == -1)
 		return;
--
1.8.5.3
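The intercept policy that update_cr8_intercept() ends up owning after this patch can be modeled in a few lines. This is a toy model of the decision only, not the kernel's API; the boolean global stands in for set_cr_intercept/clr_cr_intercept on the VMCB:

```c
#include <assert.h>

static int cr8_intercepted;

/* After the fix: this function both clears and (when needed) sets the
 * intercept, so there is no window where the state is left stale.
 * tpr/irr are the priority values KVM compares; -1 means no pending irq. */
static void update_cr8_intercept(int tpr, int irr)
{
    cr8_intercepted = 0;        /* clr_cr_intercept(INTERCEPT_CR8_WRITE) */
    if (irr == -1)
        return;                 /* nothing pending: no need to watch TPR */
    if (tpr >= irr)
        cr8_intercepted = 1;    /* pending irq masked by TPR: trap CR8
                                   writes so we notice when TPR drops */
}
```

Before the fix, the clearing lived in cr8_write_interception() and the setting only happened on KVM_REQ_EVENT, which is exactly the window the commit message describes.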
Re: [PATCH] KVM: SVM: fix cr8 intercept window
2014-03-11 22:05-0300, Marcelo Tosatti: On Tue, Mar 11, 2014 at 07:11:18PM +0100, Radim Krčmář wrote:

We always disable the cr8 intercept in its handler, but only re-enable it if handling KVM_REQ_EVENT, so there can be a window where we do not intercept cr8 writes, which allows an interrupt to disrupt a higher priority task. Fix this by disabling intercepts in the same function that re-enables them when needed. This fixes a BSOD in Windows 2008.

Cc: sta...@vger.kernel.org
Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/svm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

@@ -3003,10 +3003,8 @@ static int cr8_write_interception(struct vcpu_svm *svm)
 	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
 	/* instruction emulation calls kvm_set_cr8() */
 	r = cr_interception(svm);
-	if (irqchip_in_kernel(svm->vcpu.kvm)) {
-		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+	if (irqchip_in_kernel(svm->vcpu.kvm))
 		return r;
-	}
 	if (cr8_prev <= kvm_get_cr8(&svm->vcpu))
 		return r;
 	kvm_run->exit_reason = KVM_EXIT_SET_TPR;
@@ -3568,6 +3566,8 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 	if (is_guest_mode(vcpu) && (vcpu->arch.hflags & HF_VINTR_MASK))
 		return;

+	clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+
 	if (irr == -1)
 		return;

Shouldn't IRR be injected if TPR < IRR? (via KVM_REQ_EVENT)
1) IRR has interrupt 10.
2) TPR now 9 due to CR8 write.
3) 10 should be injected.

Definitely should, and we will set KVM_REQ_EVENT through kvm_set_cr8() if we lower the TPR. (I checked that the bug isn't in apic_update_ppr().)

Also, can not clearing the intercept cause continuous CR8 writes to exit until KVM_REQ_EVENT?

It is intended, I suppose; this is because we run with V_INTR_MASKING, so writes to CR8 only affect the V_TPR register; the guest then raises it once more and the APIC incorrectly gives us a low priority interrupt.
Re: [PATCH] KVM: SVM: fix cr8 intercept window
2014-03-13 15:52+0200, Gleb Natapov: On Wed, Mar 12, 2014 at 06:20:01PM +0100, Paolo Bonzini wrote: On 12/03/2014 11:40, Radim Krčmář wrote: 2014-03-11 22:05-0300, Marcelo Tosatti: On Tue, Mar 11, 2014 at 07:11:18PM +0100, Radim Krčmář wrote:

We always disable the cr8 intercept in its handler, but only re-enable it if handling KVM_REQ_EVENT, so there can be a window where we do not intercept cr8 writes, which allows an interrupt to disrupt a higher priority task. Fix this by disabling intercepts in the same function that re-enables them when needed. This fixes a BSOD in Windows 2008.

Cc: sta...@vger.kernel.org
Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/svm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

@@ -3003,10 +3003,8 @@ static int cr8_write_interception(struct vcpu_svm *svm)
 	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
 	/* instruction emulation calls kvm_set_cr8() */
 	r = cr_interception(svm);
-	if (irqchip_in_kernel(svm->vcpu.kvm)) {
-		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+	if (irqchip_in_kernel(svm->vcpu.kvm))
 		return r;
-	}

I think that the old code here makes little sense, and for two reasons:

I agree that the old code is wrong and the patch looks correct, but I only see how the bug may cause pending IRR to not be delivered in time, not how an interrupt can disrupt a higher priority task.

True, the commit message is bad. We BSOD only because IRR is not injected right after lowering the TPR, and code depends on values that had to be computed in it. Paolo, can you change the last sentence to ", which means we don't inject pending IRR immediately."? (Or do we just forget it?) Thanks.

---

My model of what is happening had misconceptions about Windows, KVM and SVM ... it explained the BSOD in IRQL 0xc more directly though, so it lasted till Paolo's review -- the list of things to (re)read is long.
Re: [PATCH v2 3/4] KVM: ioapic: extract body of kvm_ioapic_set_irq
2014-03-21 10:28+0100, Paolo Bonzini: We will reuse it to process a nonzero IRR that is passed to KVM_SET_IRQCHIP.

Reviewed-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
	v1->v2: more comments; change argument name from level to irq_level

 virt/kvm/ioapic.c | 74 +--
 1 file changed, 50 insertions(+), 24 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 25e16a6898ed..270f7fe73f39 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -163,6 +163,55 @@ static bool rtc_irq_check_coalesced(struct kvm_ioapic *ioapic)
 	return false;
 }

+static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
+			  int irq_level, bool line_status)
+{
+	union kvm_ioapic_redirect_entry entry;
+	u32 mask = 1 << irq;
+	u32 old_irr;
+	int edge, ret;
+
+	entry = ioapic->redirtbl[irq];
+	edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG);
+
+	if (!irq_level) {
+		ioapic->irr &= ~mask;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * Return 0 for coalesced interrupts; for edge-triggered interrupts,
+	 * this only happens if a previous edge has not been delivered due
+	 * do masking. For level interrupts, the remote_irr field tells

(^ to)

+	 * us if the interrupt is waiting for an EOI.

How do we know we haven't delivered an edge-triggered interrupt when ioapic_service() always clears this IRR bit?

+	 *
+	 * RTC is special: it is edge-triggered, but userspace likes to know
+	 * if it has been already ack-ed via EOI because coalesced RTC
+	 * interrupts lead to time drift in Windows guests. So we track
+	 * EOI manually for the RTC interrupt.
+	 */

The comment makes me think we should not be coalescing more, but we do.

+	if (irq == RTC_GSI && line_status &&
+	    rtc_irq_check_coalesced(ioapic)) {
+		ret = 0;
+		goto out;
+	}
+
+	old_irr = ioapic->irr;
+	ioapic->irr |= mask;
+	if ((edge && old_irr == ioapic->irr) ||
+	    (!edge && entry.fields.remote_irr)) {

(Fun fact: GCC 4.8.2 doesn't optimize it as well as edge ?
old_irr == ioapic-irr : entry.fields.remote_irr) + ret = 0; + goto out; + } + + ret = ioapic_service(ioapic, irq, line_status); + +out: + trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0); + return ret; +} + static void update_handled_vectors(struct kvm_ioapic *ioapic) { DECLARE_BITMAP(handled_vectors, 256); @@ -308,38 +357,15 @@ static int ioapic_service(struct kvm_ioapic *ioapic, int irq, bool line_status) int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int irq_source_id, int level, bool line_status) { - u32 old_irr; - u32 mask = 1 irq; - union kvm_ioapic_redirect_entry entry; int ret, irq_level; BUG_ON(irq 0 || irq = IOAPIC_NUM_PINS); spin_lock(ioapic-lock); - old_irr = ioapic-irr; irq_level = __kvm_irq_line_state(ioapic-irq_states[irq], irq_source_id, level); - entry = ioapic-redirtbl[irq]; - if (!irq_level) { - ioapic-irr = ~mask; - ret = 1; - } else { - int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG); + ret = ioapic_set_irq(ioapic, irq, irq_level, line_status); - if (irq == RTC_GSI line_status - rtc_irq_check_coalesced(ioapic)) { - ret = 0; /* coalesced */ - goto out; - } - ioapic-irr |= mask; - if ((edge old_irr != ioapic-irr) || - (!edge !entry.fields.remote_irr)) - ret = ioapic_service(ioapic, irq, line_status); - else - ret = 0; /* report coalesced interrupt */ - } -out: - trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0); spin_unlock(ioapic-lock); return ret; -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 3/4] KVM: ioapic: extract body of kvm_ioapic_set_irq
2014-03-23 09:44+0100, Paolo Bonzini:
> Il 21/03/2014 19:58, Radim Krčmář ha scritto:
>>> +	/*
>>> +	 * Return 0 for coalesced interrupts; for edge-triggered interrupts,
>>> +	 * this only happens if a previous edge has not been delivered due
>>> +	 * do masking.  For level interrupts, the remote_irr field tells
>>
>> (^ to)
>>
>>> +	 * us if the interrupt is waiting for an EOI.
>>
>> How do we know we haven't delivered an edge-triggered interrupt when
>> ioapic_service() always clears this IRR bit?
>
> We know we have at least delivered it to the local APIC(s).  If it is
> masked in the ioredirtbl, ioapic_service doesn't clear the IRR.

I see, LAPIC can't refuse an interrupt, so kvm_irq_delivery_to_apic()
won't (mustn't) fail.
(The code looks fragile if it returned -1.)

Thanks.
Re: [PATCH v2 4/4] KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
2014-03-21 10:28+0100, Paolo Bonzini:
> After the previous patches, an interrupt whose bit is set in the IRR
> register will never be in the LAPIC's IRR and has never been injected
> on the migration source.  So inject it on the destination.
>
> This fixes migration of Windows guests without HPET (they use the RTC
> to trigger the scheduler tick, and lose it after migration).
>
> Signed-off-by: Paolo Bonzini pbonz...@redhat.com
> ---
> 	v1->v2: use IOAPIC_NUM_PINS as a limit to for_each_set_bit
> 		remove debug printk
>
>  virt/kvm/ioapic.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index 270f7fe73f39..d4b601547f1f 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -212,6 +212,18 @@ out:
>  	return ret;
>  }
>  
> +static void kvm_ioapic_inject_all(struct kvm_ioapic *ioapic, unsigned long irr)
> +{
> +	u32 idx;
> +
> +	rtc_irq_eoi_tracking_reset(ioapic);
> +	for_each_set_bit(idx, &irr, IOAPIC_NUM_PINS)
> +		ioapic_set_irq(ioapic, idx, 1, true);
> +
> +	kvm_rtc_eoi_tracking_restore_all(ioapic);

(We shouldn't have RTC interrupt with pending EOI in irr, so the
 function could be independent.  I'd prefer 'ioapic->irr = 0' here ...)

> +}
> +
>  static void update_handled_vectors(struct kvm_ioapic *ioapic)
>  {
>  	DECLARE_BITMAP(handled_vectors, 256);
> @@ -612,9 +624,10 @@ int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state)
>  
>  	spin_lock(&ioapic->lock);
>  	memcpy(ioapic, state, sizeof(struct kvm_ioapic_state));
> +	ioapic->irr = 0;
>  	update_handled_vectors(ioapic);
>  	kvm_vcpu_request_scan_ioapic(kvm);
> -	kvm_rtc_eoi_tracking_restore_all(ioapic);
> +	kvm_ioapic_inject_all(ioapic, state->irr);
>  	spin_unlock(&ioapic->lock);
>  	return 0;
>  }
> -- 
> 1.8.3.1
Re: [PATCH v2 0/4] KVM: cleanup ioapic and fix KVM_SET_IRQCHIP with irr != 0
2014-03-21 10:27+0100, Paolo Bonzini:
> Unlike the old qemu-kvm, which really never did that, with new QEMU it
> is for some reason somewhat likely to migrate a VM with a nonzero IRR
> in the ioapic.  In the case of ISA edge-triggered interrupts, this
> represents an interrupt that has not left the IOAPIC, which would be
> okay but it is not handled right by KVM_SET_IRQCHIP.  Because the
> interrupt is never injected, the guest never acknowledges it, the
> host never deasserts the pin and new interrupts are dropped.
>
> There are two problems to solve.  The obvious one is that interrupts
> are not reinjected upon KVM_SET_IRQCHIP, which is taken care of by
> patches 3-4.  The second is that right now the IRR value depends on
> the falling edge of the interrupt (as passed by the userspace via
> kvm_ioapic_set_irq).  This is unnecessary, and may lead to spurious
> reinjection in the destination of migration; instead, we can clear
> the (internal-only) IRR bit as soon as the interrupt leaves the
> IOAPIC.  This is done by patch 2, which patch 1 prepares for.
>
> This fixes migration of Windows guests without HPET.  Please review.
>
> Paolo
>
> 	v1->v2: more comments in patch 3
> 		change argument name in patch 3 from level to irq_level
> 		use IOAPIC_NUM_PINS in patch 4 as a limit to for_each_set_bit
> 		remove debug printk in patch 4

Nice solution to a tricky problem,

Reviewed-by: Radim Krčmář rkrc...@redhat.com

> Paolo Bonzini (4):
>   KVM: ioapic: merge ioapic_deliver into ioapic_service
>   KVM: ioapic: clear IRR for edge-triggered interrupts at delivery
>   KVM: ioapic: extract body of kvm_ioapic_set_irq
>   KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
>
>  virt/kvm/ioapic.c | 107 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 69 insertions(+), 38 deletions(-)
>
> -- 
> 1.8.3.1
Re: [PATCH v2 4/4] KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
2014-03-24 19:14+0100, Paolo Bonzini:
> Il 24/03/2014 18:58, Radim Krčmář ha scritto:
>> I'd prefer 'ioapic->irr = 0' here ...)
>
> The point is that ioapic->irr = 0 is overriding the previous memcpy,
> because state->irr is used as argument to kvm_ioapic_inject_all
> instead.  So I think ioapic->irr = 0 should stay close to the memcpy.

Yeah, I was just spouting ... my reasoning was that we clear irr only
because it's going to be recomputed, so that code is more related.
(The function name would need to change though.)
Re: [PATCH v10 03/19] qspinlock: Add pending bit
2014-05-07 11:01-0400, Waiman Long:
> From: Peter Zijlstra pet...@infradead.org
>
> Because the qspinlock needs to touch a second cacheline; add a pending
> bit and allow a single in-word spinner before we punt to the second
> cacheline.

I think there is an unwanted scenario on virtual machines:
1) VCPU sets the pending bit and start spinning.
2) Pending VCPU gets descheduled.
    - we have PLE and lock holder isn't running [1]
    - the hypervisor randomly preempts us
3) Lock holder unlocks while pending VCPU is waiting in queue.
4) Subsequent lockers will see free lock with set pending bit and will
   loop in trylock's 'for (;;)'
    - the worst-case is lock starving [2]
    - PLE can save us from wasting whole timeslice

Retry threshold is the easiest solution, regardless of its ugliness [4].

Another minor design flaw is that formerly first VCPU gets appended to
the tail when it decides to queue [3]; is the performance gain worth it?

Thanks.

---
1: Pause Loop Exiting is almost certain to vmexit in that case: we
   default to 4096 TSC cycles on KVM, and pending loop is longer than 4
   (4096/PSPIN_THRESHOLD).
   We would also vmexit if critical section was longer than 4k.

2: In this example, vcpus 1 and 2 use the lock while 3 never gets there.

   VCPU:  1          2          3
          lock()                          // we are the holder
                     pend()               // we have pending bit
                     vmexit               // while in PSPIN_THRESHOLD loop
          unlock()
                     vmentry
                                SPINNING  // for {;;} loop
                                vmexit
                                vmentry
                     lock()
          pend()
          vmexit
                     unlock()
          vmentry
                                SPINNING
                                vmexit
                                vmentry
          ------------ loop ------------

   The window is (should be) too small to happen in bare-metal.

3: Pending VCPU was first in line, but when it decides to queue, it
   must go to the tail.

4: The idea is to prevent unfairness by queueing after a while of
   useless looping.  Magic value should be set a bit above the time it
   takes an active pending bit holder to go through the loop.  4 looks
   enough.  We can use either pv_qspinlock_enabled() or
   cpu_has_hypervisor.
   I presume that we never want this to happen in a VM and that we
   won't have pv_qspinlock_enabled() without cpu_has_hypervisor.

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 37b5c7f..cd45c27 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -573,7 +573,7 @@ static __always_inline int get_qlock(struct qspinlock *lock)
 static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 {
 	u32 old, new, val = *pval;
-	int retry = 1;
+	int retry = 0;
 
 	/*
 	 * trylock || pending
@@ -595,9 +595,9 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 			 * a while to see if that either bit will be cleared.
 			 * If that is no change, we return and be queued.
 			 */
-			if (!retry)
+			if (retry)
 				return 0;
-			retry--;
+			retry++;
 			cpu_relax();
 			cpu_relax();
 			*pval = val = atomic_read(&lock->val);
@@ -608,7 +608,11 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 			 * Assuming that the pending bit holder is going to
 			 * set the lock bit and clear the pending bit soon,
 			 * it is better to wait than to exit at this point.
+			 * Our assumption does not hold on hypervisors, where
+			 * the pending bit holder doesn't have to be running.
 			 */
+			if (cpu_has_hypervisor && ++retry > MAGIC)
+				return 0;
 			cpu_relax();
 			*pval = val = atomic_read(&lock->val);
 			continue;
Re: [PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest
(tl;dr: paravirtualization could be better than unfair qspinlock)

2014-05-07 11:01-0400, Waiman Long:
> Locking is always an issue in a virtualized environment because of 2
> different types of problems:
>  1) Lock holder preemption
>  2) Lock waiter preemption

Paravirtualized ticketlocks have a shortcoming; we don't know which
VCPU the ticket belongs to, so the hypervisor can only blindly yield to
runnable VCPUs after waiters halt in slowpath.
There aren't enough free bits in ticket struct to improve, thus we have
resorted to unfairness.

Qspinlock is different.  Most queued VCPUs already know the VCPU before
it, so we have what it takes to mitigate lock waiter preemption: we can
include the preempted CPU id in the hypercall, the hypervisor will
schedule it, and we'll be woken up from unlock slowpath [1].
This still isn't perfect: we can wake up a VCPU that got preempted
before it could hypercall, and these hypercalls will propagate one by
one through our queue to the preempted lock holder.
(We'd have to share the whole waiter-list to avoid this.  We could also
 try to send the holder's id instead and unconditionally kick
 next-in-line on unlock, I think it would be slower.)

The lock holder problem is tougher, because we don't always share who
it is.  The tail bits can be used for it, as we don't really use them
before a queue has formed.  This would cost us one bit to differentiate
between holder/tail CPU id [2] and complicate operations a little, but
only for the paravirt case, where benefits are expected to be far
greater.  A hypercall from lock slowpath could schedule the preempted
VCPU right away.

I think this could obsolete unfair locks and will prepare RFC patches
soon-ish [3].  (If the idea isn't proved infeasible before.)

---
1: It is possible that we could avoid O(N) traversal and hypercall in
   unlock slowpath by scheduling VCPUs in the right order often.
2: Or even less.  idx=3 is a bug: if we are spinning in NMI, we are
   almost deadlocked, so we should WARN/BUG if it were to happen; which
   leaves the combination free to mean that the CPU id is a sole
   holder, not a tail.  (I prefer clean code though.)
3: I already tried and got quickly fed up by refactoring, so it might
   get postponed till the series gets merged.
Re: [PATCH v10 03/19] qspinlock: Add pending bit
2014-05-13 15:47-0400, Waiman Long:
> On 05/12/2014 11:22 AM, Radim Krčmář wrote:
>> I think there is an unwanted scenario on virtual machines:
>> 1) VCPU sets the pending bit and start spinning.
>> 2) Pending VCPU gets descheduled.
>>     - we have PLE and lock holder isn't running [1]
>>     - the hypervisor randomly preempts us
>> 3) Lock holder unlocks while pending VCPU is waiting in queue.
>> 4) Subsequent lockers will see free lock with set pending bit and
>>    will loop in trylock's 'for (;;)'
>>     - the worst-case is lock starving [2]
>>     - PLE can save us from wasting whole timeslice
>>
>> Retry threshold is the easiest solution, regardless of its ugliness [4].
>
> Yes, that can be a real issue.  Some sort of retry threshold, as you
> said, should be able to handle it.
>
> BTW, the relevant patch should be 16/19 where the PV spinlock stuff
> should be discussed.  This patch is perfectly fine.

Ouch, my apology to Peter didn't make it ...
Agreed, I should have split the comment under patches [06/19] (part
quoted above; does not depend on PV), [16/19] (part quoted below) and
[17/19] (general doubts).

>> Another minor design flaw is that formerly first VCPU gets appended
>> to the tail when it decides to queue; is the performance gain worth
>> it?
>
> Yes, the performance gain is worth it.  The primary goal is to be not
> worse than ticket spinlock in light load situation which is the most
> common case.  This feature is need to achieve that.

Ok.
I've seen merit in pvqspinlock even with slightly slower first-waiter,
so I would have happily sacrificed those horrible branches.
(I prefer elegant to optimized code, but I can see why we want to be
 strictly better than ticketlock.)
Peter mentioned that we are focusing on bare-metal patches, so I'll
withhold my other paravirt rants until they are polished.

And to forcefully bring this thread a little bit on-topic:

Pending-bit is effectively a lock in a lock, so I was wondering why
don't we use more pending bits; advantages are the same, just
diminished by the probability of having an ideally contended lock:
 - waiter won't be blocked on RAM access if critical section (or more)
   ends sooner
 - some unlucky cacheline is not forgotten
 - faster unlock (no need for tail operations)
 (- ?)
disadvantages are magnified:
 - increased complexity
 - intense cacheline sharing
   (I thought that this is the main disadvantage of ticketlock.)
 (- ?)

One bit still improved performance, is it the best we got?

Thanks.
Re: [PATCH v10 03/19] qspinlock: Add pending bit
2014-05-14 19:00+0200, Peter Zijlstra:
> On Wed, May 14, 2014 at 06:51:24PM +0200, Radim Krčmář wrote:
>> Ok.
>> I've seen merit in pvqspinlock even with slightly slower
>> first-waiter, so I would have happily sacrificed those horrible
>> branches.
>> (I prefer elegant to optimized code, but I can see why we want to be
>>  strictly better than ticketlock.)
>> Peter mentioned that we are focusing on bare-metal patches, so I'll
>> withhold my other paravirt rants until they are polished.

(It was an ambiguous sentence, I have comments for later patches.)

> Well, paravirt must happen too, but comes later in this series, patch
> 3 which we're replying to is still very much in the bare metal part
> of the series.

(I think that bare metal spans the first 7 patches.)

> I've not had time yet to decode all that Waiman has done to make
> paravirt work.
>
> But as a general rule I like patches that start with something simple
> and working and then optimize it, this series doesn't seem to quite
> grasp that.
>
>> And to forcefully bring this thread a little bit on-topic:
>> Pending-bit is effectively a lock in a lock, so I was wondering why
>> don't we use more pending bits; advantages are the same, just
>> diminished by the probability of having an ideally contended lock:
>>  - waiter won't be blocked on RAM access if critical section (or
>>    more) ends sooner
>>  - some unlucky cacheline is not forgotten
>>  - faster unlock (no need for tail operations)
>>  (- ?)
>> disadvantages are magnified:
>>  - increased complexity
>>  - intense cacheline sharing
>>    (I thought that this is the main disadvantage of ticketlock.)
>>  (- ?)
>>
>> One bit still improved performance, is it the best we got?
>
> So, the advantage of one bit is that if we use a whole byte for 1 bit
> we can avoid some atomic ops.
>
> The entire reason for this in-word spinner is to amortize the cost of
> hitting the external node cacheline.

Every pending CPU removes one length of the critical section from the
delay caused by cacheline propagation and really cold cache is
hundreds(?) of cycles, so we could burn some to ensure correctness and
still be waiting when the first pending CPU unlocks.

> So traditional locks like test-and-test and the ticket lock only ever
> access the spinlock word itself, this MCS style queueing lock has a
> second (and, see my other rants in this thread, when done wrong more
> than 2) cacheline to touch.
>
> That said, all our benchmarking is pretty much for the cache-hot
> case, so I'm not entirely convinced yet that the one pending bit
> makes up for it, it does in the cache-hot case.

Yeah, we probably use the faster pre-lock quite a lot.
Cover letter states that queue depth 1-3 is a bit slower than ticket
spinlock, so we might not be losing if we implemented a faster
in-word-lock of this capacity.  (Not that I'm a fan of the hybrid
lock.)

> But... writing cache-cold benchmarks is _hard_ :/

Wouldn't clflush of the second cacheline before trying for the lock
give us a rough estimate?
Re: [PATCH 1/1] kvm: increase max vcpu count
2014-05-15 17:50+0800, Li, Zhen-Hua:
> This patch is trying to increase the maximum supported vcpu number.
> There has been big system supporting more than 256 logical CPUs, and
> vmware can also support guest system with more than 256 logical CPUs.
> So kvm should also increase the maximum supported cpu number.

How did it work?
KVM lapic does not handle more than 256 VCPUs [1] at the moment, so
additional VCPUs had to wraparound ...

---
1: struct kvm_apic_map has 'struct kvm_lapic *phys_map[256]' and we
   are using 0xff when dealing with it, too.
Re: [RFC 08/07] qspinlock: integrate pending bit into queue
2014-05-21 18:49+0200, Radim Krčmář:
> 2014-05-19 16:17-0400, Waiman Long:
>> As for now, I will focus on just having one pending bit.
>
> I'll throw some ideas at it,

One of the ideas follows; it seems sound, but I haven't benchmarked it
thoroughly.  (Wasted a lot of time by writing/playing with various
tools and loads.)

Dbench on ext4 ramdisk, hackbench and ebizzy have shown a small
improvement in performance, but my main drive was the weird design of
Pending Bit.  Does your setup yield improvements too?
(A minor code swap noted in the patch might help things.)

It is meant to be applied on top of first 7 patches, because the virt
stuff would just get in the way.
I have preserved a lot of dead code and made some questionable
decisions just to keep the diff short and in one patch, sorry about
that.  (It is work in progress, double slashed lines mark points of
interest.)

---8<---
Pending Bit wasn't used if we already had a node queue with one cpu,
which meant that we suffered from these drawbacks again:
 - unlock path was more complicated
   (last queued CPU had to clear the tail)
 - cold node cacheline was just one critical section away

With this patch, Pending Bit is used as an additional step in the
queue.  Waiting for lock is the same: we try Pending Bit and if it is
taken, we append to Node Queue.  Unlock is different: pending CPU moves
into critical section and first CPU from Node Queue takes Pending Bit
and notifies next in line or clears the tail.

This allows the pending CPU to take the lock as fast as possible,
because all bookkeeping was done when entering Pending Queue.
Node Queue operations can also be slower without affecting the
performance, because we have an additional buffer of one critical
section.
---
 kernel/locking/qspinlock.c | 180 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 135 insertions(+), 45 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0ee1a23..76cafb0 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -98,7 +98,10 @@ struct __qspinlock {
 	union {
 		atomic_t val;
 #ifdef __LITTLE_ENDIAN
-		u8	 locked;
+		struct {
+			u8	locked;
+			u8	pending;
+		};
 		struct {
 			u16	locked_pending;
 			u16	tail;
@@ -109,7 +112,8 @@ struct __qspinlock {
 			u16	locked_pending;
 		};
 		struct {
-			u8	reserved[3];
+			u8	reserved[2];
+			u8	pending;
 			u8	locked;
 		};
 #endif
@@ -314,6 +318,59 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 	return 1;
 }
 
+// nice comment here
+static inline bool trylock(struct qspinlock *lock, u32 *val) {
+	if (!(*val = atomic_read(&lock->val)) &&
+	    (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0)) {
+		*val = _Q_LOCKED_VAL;
+		return 1;
+	}
+	return 0;
+}
+
+// here
+static inline bool trypending(struct qspinlock *lock, u32 *pval) {
+	u32 old, val = *pval;
+	// optimizer might produce the same code if we use *pval directly
+
+	// we could use 'if' and a xchg that touches only the pending bit to
+	// save some cycles at the price of a longer line cutting window
+	// (and I think it would bug without changing the rest)
+	while (!(val & (_Q_PENDING_MASK | _Q_TAIL_MASK))) {
+		old = atomic_cmpxchg(&lock->val, val, val | _Q_PENDING_MASK);
+		if (old == val) {
+			*pval = val | _Q_PENDING_MASK;
+			return 1;
+		}
+		val = old;
+	}
+	*pval = val;
+	return 0;
+}
+
+// here
+static inline void set_pending(struct qspinlock *lock, u8 pending)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	// take a look if this is necessary, and if we don't have an
+	// abstraction already
+	barrier();
+	ACCESS_ONCE(l->pending) = pending;
+	barrier();
+}
+
+// and here
+static inline u32 cmpxchg_tail(struct qspinlock *lock, u32 tail, u32 newtail)
+// API-incompatible with set_pending and the shifting is ugly, so I'd rather
+// refactor this one, xchg_tail() and encode_tail() ... another day
+{
+	struct __qspinlock *l = (void *)lock;
+
+	return (u32)cmpxchg(&l->tail, tail >> _Q_TAIL_OFFSET,
+			    newtail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
@@ -324,21 +381,21 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
  *   fast     :    slow                                  :    unlock
Re: [PATCH v7 08/14] Introduce chr-testdev
2014-07-16 11:33+0200, Andrew Jones:
> On Wed, Jul 16, 2014 at 05:31:33AM -0400, Levente Kurusa wrote:
>> ----- Original Message -----
>> [...]
>>> +void chr_testdev_exit(int code)
>>> +{
>>> +	char buf[8];
>>> +	int len;
>>> +
>>> +	snprintf(buf, sizeof(buf), "%dq", code);
>>> +	len = strlen(buf);
>>
>> AFAIK, snprintf returns the number of characters written, so these
>> two statements can be merged into one.
>
> You are correct. I'll do this for v8 (if a v8 is needed).

snprintf returns the number of characters that would be written without
truncation, so it wouldn't handle a very long code ;)
I'd prefer the obvious v7, to finding a minimum with 'sizeof(buf) - 1'.
[PATCH 1/9] KVM: add kvm_arch_sched_in
Introduce preempt notifiers for architecture specific code.
Advantage over creating a new notifier in every arch is slightly
simpler code and guaranteed call order with respect to kvm_sched_in.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/arm/kvm/arm.c         | 4 ++++
 arch/mips/kvm/mips.c       | 4 ++++
 arch/powerpc/kvm/powerpc.c | 4 ++++
 arch/s390/kvm/kvm-s390.c   | 4 ++++
 arch/x86/kvm/x86.c         | 4 ++++
 include/linux/kvm_host.h   | 2 ++
 virt/kvm/kvm_main.c        | 2 ++
 7 files changed, 24 insertions(+)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index a99e0cd..9f788eb 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -288,6 +288,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	vcpu->cpu = cpu;
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index cd71141..2362df2 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1002,6 +1002,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
 				  struct kvm_translation *tr)
 {
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 4c79284..cbc432f 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -720,6 +720,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	kvmppc_subarch_vcpu_uninit(vcpu);
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 #ifdef CONFIG_BOOKE
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ce81eb2..a3c324e 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -555,6 +555,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	/* Nothing todo */
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	save_fp_ctl(&vcpu->arch.host_fpregs.fpc);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f1e22d..d7c214f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7146,6 +7146,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	static_key_slow_dec(&kvm_no_apic_vcpu);
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	if (type)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a4c33b3..ebd7236 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -624,6 +624,8 @@ void kvm_arch_exit(void);
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..d3c3ed0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	if (vcpu->preempted)
 		vcpu->preempted = false;
 
+	kvm_arch_sched_in(vcpu, cpu);
+
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
-- 
2.0.4
[PATCH 5/9] KVM: VMX: clamp PLE window
Modifications could get unwanted values of PLE window.  (low or
negative)
Use ple_window and the maximal value that cannot overflow as bounds.

ple_window_max defaults to a very high value, but it would make sense
to set it to some fraction of the scheduler tick.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/vmx.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 66259fd..e1192fb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -144,6 +144,10 @@ module_param(ple_window_grow, int, S_IRUGO);
 static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
 module_param(ple_window_shrink, int, S_IRUGO);
 
+/* Default is to compute the maximum so we can never overflow. */
+static int ple_window_max = INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_max, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5704,7 +5708,7 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
 	else
 		new = old + ple_window_grow;
 
-	vmx->ple_window = new;
+	vmx->ple_window = min(new, ple_window_max);
 }
 
 static void shrink_ple_window(struct kvm_vcpu *vcpu)
@@ -5720,7 +5724,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 	else
 		new = old - ple_window_shrink;
 
-	vmx->ple_window = new;
+	vmx->ple_window = max(new, ple_window);
 }
 
 /*
-- 
2.0.4
[PATCH 6/9] KVM: trace kvm_ple_window grow/shrink
Tracepoint for dynamic PLE window, fired on every potential change.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/trace.h | 29 +++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c   |  4 ++++
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index e850a7d..e4682f5 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -848,6 +848,35 @@ TRACE_EVENT(kvm_track_tsc,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 
+TRACE_EVENT(kvm_ple_window,
+	TP_PROTO(int grow, unsigned int vcpu_id, int new, int old),
+	TP_ARGS(grow, vcpu_id, new, old),
+
+	TP_STRUCT__entry(
+		__field(         int,   grow     )
+		__field(unsigned int,   vcpu_id  )
+		__field(         int,   new      )
+		__field(         int,   old      )
+	),
+
+	TP_fast_assign(
+		__entry->grow    = grow;
+		__entry->vcpu_id = vcpu_id;
+		__entry->new     = new;
+		__entry->old     = old;
+	),
+
+	TP_printk("vcpu %u: ple_window %d %s %d",
+		  __entry->vcpu_id,
+		  __entry->new,
+		  __entry->grow ? "+" : "-",
+		  __entry->old)
+);
+#define trace_kvm_ple_window_grow(vcpu_id, new, old) \
+	trace_kvm_ple_window(1, vcpu_id, new, old)
+#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \
+	trace_kvm_ple_window(0, vcpu_id, new, old)
+
 #endif /* CONFIG_X86_64 */
 
 #endif /* _TRACE_KVM_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e1192fb..a236a9f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5709,6 +5709,8 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
 		new = old + ple_window_grow;
 
 	vmx->ple_window = min(new, ple_window_max);
+
+	trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
 static void shrink_ple_window(struct kvm_vcpu *vcpu)
@@ -5725,6 +5727,8 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 		new = old - ple_window_shrink;
 
 	vmx->ple_window = max(new, ple_window);
+
+	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
 /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5696ee7..814b20c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7648,3 +7648,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
-- 
2.0.4
[PATCH 9/9] KVM: VMX: automatic PLE window maximum
Every increase of ple_window_grow creates potential overflows.
They are not serious, because we clamp ple_window and userspace is
expected to fix ple_window_max within a second.
---
 arch/x86/kvm/vmx.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d7f58e8..6873a0b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -138,7 +138,9 @@ module_param(ple_window, int, S_IRUGO | S_IWUSR);
 
 /* Default doubles per-vcpu window every exit. */
 static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
-module_param(ple_window_grow, int, S_IRUGO | S_IWUSR);
+static struct kernel_param_ops ple_window_grow_ops;
+module_param_cb(ple_window_grow, &ple_window_grow_ops,
+                &ple_window_grow, S_IRUGO | S_IWUSR);
 
 /* Default resets per-vcpu window every exit to ple_window. */
 static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
@@ -5717,6 +5719,36 @@ static void type##_ple_window(struct kvm_vcpu *vcpu) \
 make_ple_window_modifier(grow, *, +)    /* grow_ple_window */
 make_ple_window_modifier(shrink, /, -)  /* shrink_ple_window */
 
+static void clamp_ple_window_max(void)
+{
+	int maximum;
+
+	if (ple_window_grow < 1)
+		return;
+
+	if (ple_window_grow < ple_window)
+		maximum = INT_MAX / ple_window_grow;
+	else
+		maximum = INT_MAX - ple_window_grow;
+
+	ple_window_max = clamp(ple_window_max, ple_window, maximum);
+}
+
+static int set_ple_window_grow(const char *arg, const struct kernel_param *kp)
+{
+	int ret;
+
+	clamp_ple_window_max();
+	ret = param_set_int(arg, kp);
+
+	return ret;
+}
+
+static struct kernel_param_ops ple_window_grow_ops = {
+	.set = set_ple_window_grow,
+	.get = param_get_int,
+};
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
-- 
2.0.4
[PATCH 7/9] KVM: VMX: abstract ple_window modifiers
They were almost identical and thus merged with a loathable macro.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
This solution is hopefully more acceptable than function pointers.

 arch/x86/kvm/vmx.c | 53 +++++++++++++++++------------------------------------
 1 file changed, 19 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a236a9f..c6cfb71 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5694,42 +5694,27 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 out:
 	return ret;
 }
-
-static void grow_ple_window(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	int old = vmx->ple_window;
-	int new;
-
-	if (ple_window_grow < 1)
-		new = ple_window;
-	else if (ple_window_grow < ple_window)
-		new = old * ple_window_grow;
-	else
-		new = old + ple_window_grow;
-
-	vmx->ple_window = min(new, ple_window_max);
-
-	trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
+#define make_ple_window_modifier(type, oplt, opge, cmp, bound) \
+static void type##_ple_window(struct kvm_vcpu *vcpu) \
+{ \
+	struct vcpu_vmx *vmx = to_vmx(vcpu); \
+	int old = vmx->ple_window; \
+	int new; \
+\
+	if (ple_window_##type < 1) \
+		new = ple_window; \
+	else if (ple_window_##type < ple_window) \
+		new = old oplt ple_window_##type; \
+	else \
+		new = old opge ple_window_##type; \
+\
+	vmx->ple_window = cmp(new, bound); \
+\
+	trace_kvm_ple_window_##type(vcpu->vcpu_id, vmx->ple_window, old); \
 }
 
-static void shrink_ple_window(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	int old = vmx->ple_window;
-	int new;
-
-	if (ple_window_shrink < 1)
-		new = ple_window;
-	else if (ple_window_shrink < ple_window)
-		new = old / ple_window_shrink;
-	else
-		new = old - ple_window_shrink;
-
-	vmx->ple_window = max(new, ple_window);
-
-	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
-}
+make_ple_window_modifier(grow, *, +, min, ple_window_max)
+make_ple_window_modifier(shrink, /, -, max, ple_window)
 
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
-- 
2.0.4
[PATCH 8/9] KVM: VMX: runtime knobs for dynamic PLE window
ple_window is updated on every vmentry, so there is no reason to have
it read-only anymore.
ple_window_* weren't writable to prevent runtime overflow races; they
are mitigated by clamping the value of ple_window.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
If we decide to ignore insane overflows, the last two hunks can be dropped.

 arch/x86/kvm/vmx.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c6cfb71..d7f58e8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -134,19 +134,19 @@ static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
-module_param(ple_window, int, S_IRUGO);
+module_param(ple_window, int, S_IRUGO | S_IWUSR);
 
 /* Default doubles per-vcpu window every exit. */
 static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
-module_param(ple_window_grow, int, S_IRUGO);
+module_param(ple_window_grow, int, S_IRUGO | S_IWUSR);
 
 /* Default resets per-vcpu window every exit to ple_window. */
 static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
-module_param(ple_window_shrink, int, S_IRUGO);
+module_param(ple_window_shrink, int, S_IRUGO | S_IWUSR);
 
 /* Default is to compute the maximum so we can never overflow. */
 static int ple_window_max = INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
-module_param(ple_window_max, int, S_IRUGO);
+module_param(ple_window_max, int, S_IRUGO | S_IWUSR);
 
 extern const ulong vmx_return;
 
@@ -5694,7 +5694,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 out:
 	return ret;
 }
-#define make_ple_window_modifier(type, oplt, opge, cmp, bound) \
+
+#define make_ple_window_modifier(type, oplt, opge) \
 static void type##_ple_window(struct kvm_vcpu *vcpu) \
 { \
 	struct vcpu_vmx *vmx = to_vmx(vcpu); \
@@ -5708,13 +5709,13 @@ static void type##_ple_window(struct kvm_vcpu *vcpu) \
 	else \
 		new = old opge ple_window_##type; \
 \
-	vmx->ple_window = cmp(new, bound); \
+	vmx->ple_window = clamp(new, ple_window, ple_window_max); \
 \
 	trace_kvm_ple_window_##type(vcpu->vcpu_id, vmx->ple_window, old); \
 }
 
-make_ple_window_modifier(grow, *, +, min, ple_window_max)
-make_ple_window_modifier(shrink, /, -, max, ple_window)
+make_ple_window_modifier(grow, *, +)    /* grow_ple_window */
+make_ple_window_modifier(shrink, /, -)  /* shrink_ple_window */
 
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
-- 
2.0.4
[PATCH 4/9] KVM: VMX: dynamise PLE window
Window is increased on every PLE exit and decreased on every sched_in.
The idea is that we don't want to PLE exit if there is no preemption
going on.
We do this with sched_in() because it does not hold rq lock.

There are two new kernel parameters for changing the window:
 ple_window_grow and ple_window_shrink
ple_window_grow affects the window on PLE exit and ple_window_shrink
does it on sched_in; depending on their value, the window is modified
like this: (ple_window is kvm_intel's global)

  ple_window_shrink/ |
  ple_window_grow    | PLE exit           | sched_in
  -------------------+--------------------+---------------------
  < 1                |  = ple_window      |  = ple_window
  < ple_window       | *= ple_window_grow | /= ple_window_shrink
  otherwise          | += ple_window_grow | -= ple_window_shrink

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/vmx.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index eaa5574..66259fd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -125,14 +125,25 @@ module_param(nested, bool, S_IRUGO);
  * Time is measured based on a counter that runs at the same rate as the TSC,
  * refer SDM volume 3b section 21.6.13 and 22.1.3.
  */
-#define KVM_VMX_DEFAULT_PLE_GAP    128
-#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+#define KVM_VMX_DEFAULT_PLE_GAP           128
+#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
+#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
+#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
+
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
 module_param(ple_window, int, S_IRUGO);
 
+/* Default doubles per-vcpu window every exit. */
+static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_grow, int, S_IRUGO);
+
+/* Default resets per-vcpu window every exit to ple_window. */
+static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
+module_param(ple_window_shrink, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5680,12 +5691,47 @@ out:
 	return ret;
 }
 
+static void grow_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int old = vmx->ple_window;
+	int new;
+
+	if (ple_window_grow < 1)
+		new = ple_window;
+	else if (ple_window_grow < ple_window)
+		new = old * ple_window_grow;
+	else
+		new = old + ple_window_grow;
+
+	vmx->ple_window = new;
+}
+
+static void shrink_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int old = vmx->ple_window;
+	int new;
+
+	if (ple_window_shrink < 1)
+		new = ple_window;
+	else if (ple_window_shrink < ple_window)
+		new = old / ple_window_shrink;
+	else
+		new = old - ple_window_shrink;
+
+	vmx->ple_window = new;
+}
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	if (ple_gap)
+		grow_ple_window(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	kvm_vcpu_on_spin(vcpu);
 
@@ -8855,6 +8901,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 
 void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (ple_gap)
+		shrink_ple_window(vcpu);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {
-- 
2.0.4
[PATCH 2/9] KVM: x86: introduce sched_in to kvm_x86_ops
sched_in preempt notifier is available for x86, allow its use in
specific virtualization technologies as well.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/svm.c              | 6 ++++++
 arch/x86/kvm/vmx.c              | 6 ++++++
 arch/x86/kvm/x86.c              | 1 +
 4 files changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5724601..358e2f3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -772,6 +772,8 @@ struct kvm_x86_ops {
 	bool (*mpx_supported)(void);
 
 	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
+
+	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ddf7427..4baf1bc 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4305,6 +4305,10 @@ static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
 	local_irq_enable();
 }
 
+static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 static struct kvm_x86_ops svm_x86_ops = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -4406,6 +4410,8 @@ static struct kvm_x86_ops svm_x86_ops = {
 
 	.check_intercept = svm_check_intercept,
 	.handle_external_intr = svm_handle_external_intr,
+
+	.sched_in = svm_sched_in,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bfe11cf..2b306f9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8846,6 +8846,10 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 	return X86EMUL_CONTINUE;
 }
 
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -8951,6 +8955,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
 
 	.mpx_supported = vmx_mpx_supported,
 	.check_nested_events = vmx_check_nested_events,
+
+	.sched_in = vmx_sched_in,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7c214f..5696ee7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7148,6 +7148,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
-- 
2.0.4
[PATCH 3/9] KVM: VMX: make PLE window per-vcpu
Change PLE window into per-vcpu variable, seeded from module parameter,
to allow greater flexibility.

Brings in a small overhead on every vmentry.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
I've been thinking about a general hierarchical per-vcpu variable model,
but it's hard to have current performance and sane code.

 arch/x86/kvm/vmx.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2b306f9..eaa5574 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -484,6 +484,9 @@ struct vcpu_vmx {
 
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
+
+	/* Dynamic PLE window. */
+	int ple_window;
 };
 
 enum segment_cache_field {
@@ -4403,6 +4406,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmcs_write32(PLE_WINDOW, ple_window);
+		vmx->ple_window = ple_window;
 	}
 
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
@@ -7387,6 +7391,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vmx->emulation_required)
 		return;
 
+	if (ple_gap)
+		vmcs_write32(PLE_WINDOW, vmx->ple_window);
+
 	if (vmx->nested.sync_shadow_vmcs) {
 		copy_vmcs12_to_shadow(vmx);
 		vmx->nested.sync_shadow_vmcs = false;
-- 
2.0.4
[PATCH 0/9] Dynamic Pause Loop Exiting window.
PLE does not scale in its current form.  When increasing VCPU count
above 150, one can hit soft lockups because of runqueue lock contention.
(Which says a lot about performance.)

The main reason is that kvm_ple_loop cycles through all VCPUs.
Replacing it with a scalable solution would be ideal, but it has already
been well optimized for various workloads, so this series tries to
alleviate one different major problem while minimizing a chance of
regressions: we have too many useless PLE exits.

Just increasing the PLE window would help some cases, but it still
spirals out of control.  By increasing the window after every PLE exit,
we can limit the amount of useless ones, so we don't reach the state
where CPUs spend 99% of the time waiting for a lock.

HP confirmed that this series avoids soft lockups and TSC sync errors
on large guests.

---
Design notes and questions:

Alternative to the first two patches could be a new notifier.

All values are made changeable because defaults weren't selected after
weeks of benchmarking -- we can get improved performance by hardcoding
if someone is willing to do it.
(Or by presuming that no one is ever going to.)

Then, we can quite safely drop overflow checks: they are impossible to
hit with small increases and I don't think that anyone wants large ones.

Also, I'd argue against the last patch: it should be done in userspace,
but I'm not sure about Linux's policy.

Radim Krčmář (9):
  KVM: add kvm_arch_sched_in
  KVM: x86: introduce sched_in to kvm_x86_ops
  KVM: VMX: make PLE window per-vcpu
  KVM: VMX: dynamise PLE window
  KVM: VMX: clamp PLE window
  KVM: trace kvm_ple_window grow/shrink
  KVM: VMX: abstract ple_window modifiers
  KVM: VMX: runtime knobs for dynamic PLE window
  KVM: VMX: automatic PLE window maximum

 arch/arm/kvm/arm.c              |  4 ++
 arch/mips/kvm/mips.c            |  4 ++
 arch/powerpc/kvm/powerpc.c      |  4 ++
 arch/s390/kvm/kvm-s390.c        |  4 ++
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/svm.c              |  6 +++
 arch/x86/kvm/trace.h            | 29 ++++++++++++
 arch/x86/kvm/vmx.c              | 93 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |  6 +++
 include/linux/kvm_host.h        |  2 +
 virt/kvm/kvm_main.c             |  2 +
 11 files changed, 153 insertions(+), 3 deletions(-)
-- 
2.0.4
Re: [PATCH 7/9] KVM: VMX: abstract ple_window modifiers
2014-08-20 09:02+0200, Paolo Bonzini:
> On 19/08/2014 22:35, Radim Krčmář wrote:
> > They were almost identical and thus merged with a loathable macro.
> > 
> > Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
> > ---
> > This solution is hopefully more acceptable than function pointers.
> 
> I think a little amount of duplication is not a problem.

Ok, I'll drop this patch from v2.
Re: [PATCH 9/9] KVM: VMX: automatic PLE window maximum
2014-08-20 09:16+0200, Paolo Bonzini:
> On 19/08/2014 22:35, Radim Krčmář wrote:
> > Every increase of ple_window_grow creates potential overflows.
> > They are not serious, because we clamp ple_window and userspace is
> > expected to fix ple_window_max within a second.
> 
> I think avoiding overflows is better.  In fact, I think you should call
> this function for ple_window_max too.

(Ack, I just wanted to avoid the worst userspace error, which is why
 PW_max hasn't changed when PW_grow got smaller and we could overflow.)

> You could keep the ple_window_max variable to the user-set value.
> Whenever ple_window_grow or ple_window_max are changed, you can set an
> internal variable (let's call it ple_window_actual_max, but I'm not wed
> to this name) to the computed value, and then do:
> 
> 	if (ple_window_grow < 1 || ple_window_actual_max < ple_window)
> 		new = ple_window;
> 	else if (ple_window_grow < ple_window)
> 		new = max(ple_window_actual_max, old) * ple_window_grow;
> 	else
> 		new = max(ple_window_actual_max, old) + ple_window_grow;

Oh, I like that this can get rid of all overflows,
ple_window_actual_max (PW_effective_max?) is going to be set to
ple_window_max [/-] ple_window_grow in v2.
(I think the '||' in the first 'if' can be eliminated with some
 creativity in clamp_ple_window_max.)

To do it, we'll want to intercept changes to ple_window as well.
(I disliked this patch a lot even before :)
Re: [PATCH 5/9] KVM: VMX: clamp PLE window
2014-08-20 09:18+0200, Paolo Bonzini:
> On 19/08/2014 22:35, Radim Krčmář wrote:
> > Modifications could get unwanted values of PLE window.
> > (low or negative)
> > Use ple_window and the maximal value that cannot overflow as bounds.
> 
> ple_window_max defaults to a very high value, but it would make sense
> to set it to some fraction of the scheduler tick.
> 
> > Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
> > ---
> 
> Please introduce a dynamic overflow-avoiding ple_window_max (like what
> you have in patch 9) already in patch 4...
> 
> >  static void shrink_ple_window(struct kvm_vcpu *vcpu)
> > @@ -5720,7 +5724,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
> >  	else
> >  		new = old - ple_window_shrink;
> >  
> > -	vmx->ple_window = new;
> > +	vmx->ple_window = max(new, ple_window);
> 
> ... and also squash this in patch 4.  This patch can then introduce the
> ple_window_max module parameter (using module_param_cb to avoid
> overflows).

Will do.

---
It is going to make the patches slightly harder to review; are we doing
it because git doesn't bisect on series boundaries?
Re: [PATCH 3/9] KVM: VMX: make PLE window per-vcpu
2014-08-20 09:13+0200, Paolo Bonzini:
> On 19/08/2014 22:35, Radim Krčmář wrote:
> >  enum segment_cache_field {
> > @@ -4403,6 +4406,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >  	if (ple_gap) {
> >  		vmcs_write32(PLE_GAP, ple_gap);
> >  		vmcs_write32(PLE_WINDOW, ple_window);
> 
> Is this necessary?

V2, thanks.
Re: [PATCH 1/9] KVM: add kvm_arch_sched_in
2014-08-20 09:47+0200, Christian Borntraeger:
> On 19/08/14 22:35, Radim Krčmář wrote:
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
> >  	if (vcpu->preempted)
> >  		vcpu->preempted = false;
> >  
> > +	kvm_arch_sched_in(vcpu, cpu);
> > +
> >  	kvm_arch_vcpu_load(vcpu, cpu);
> >  }
> 
> Why can't you reuse kvm_arch_vcpu_load?  It's also called on each
> sched_in and is architecture specific.

kvm_arch_vcpu_load is also called from kvm_vcpu_ioctl, so we'd be
shrinking unnecessarily.
(sched_in gives us a bit of useful information about the state of the
 system, kvm_vcpu_ioctl not that much.)
Re: [PATCH 9/9] KVM: VMX: automatic PLE window maximum
2014-08-20 15:15+0200, Paolo Bonzini:
> On 20/08/2014 14:41, Radim Krčmář wrote:
> > > 	if (ple_window_grow < 1 || ple_window_actual_max < ple_window)
> > > 		new = ple_window;
> > > 	else if (ple_window_grow < ple_window)
> > > 		new = max(ple_window_actual_max, old) * ple_window_grow;
> > > 	else
> > > 		new = max(ple_window_actual_max, old) + ple_window_grow;
> > 
> > Oh, I like that this can get rid of all overflows,
> > ple_window_actual_max (PW_effective_max?) is going to be set to
> > ple_window_max [/-] ple_window_grow in v2.
> > (I think the '||' in the first 'if' can be eliminated with some
> >  creativity in clamp_ple_window_max.)
> > 
> > To do it, we'll want to intercept changes to ple_window as well.
> > (I disliked this patch a lot even before :)
> 
> What about setting ple_window_actual_max to 0 if ple_window_grow is 0
> (instead of just returning)?  Then the
> "if (ple_window_actual_max < ple_window)" will always fail and you'll
> go through "new = ple_window".  But perhaps it's more gross and
> worthless than creative. :)

That code can't use PW directly, because PW_actual_max needs to be one
PW_grow below PW_max, so I'd rather enforce a minimal PW_actual_max.

Btw. without extra code, we are still going to overflow on races when
changing PW_grow; should they be covered as well?

(+ There is a bug in this patch -- clamp_ple_window_max() should be
 after param_set_int() ... damned unreviewed last-second changes.)
Re: [PATCH 9/9] KVM: VMX: automatic PLE window maximum
2014-08-20 17:34+0200, Paolo Bonzini:
> On 20/08/2014 17:31, Radim Krčmář wrote:
> > Btw. without extra code, we are still going to overflow on races when
> > changing PW_grow; should they be covered as well?
> 
> You mean because there is no spinlock or similar protecting the
> changes?  I guess you could use a seqlock.

Yes, for example between a modification of ple_window

  new = min(old, PW_actual_max) * PW_grow

which gets compiled into something like this:

  1) tmp = min(old, PW_actual_max)
  2) new = tmp * PW_grow

and a write to increase PW_grow

  3) PW_actual_max = min(PW_max / new_PW_grow, PW_actual_max)
  4) PW_grow = new_PW_grow
  5) PW_actual_max = PW_max / new_PW_grow

Steps 3 and 4 can execute between 1 and 2, which could overflow.

I don't think they are important enough to warrant a significant
performance hit of locking.  Or even more checks that would prevent it
in a lockless way.
(I'd just see that the result is set to something legal and also drop
 line 3, because it does not help things that much.)
Re: [PATCH 9/9] KVM: VMX: automatic PLE window maximum
2014-08-20 18:03+0200, Paolo Bonzini:
> On 20/08/2014 18:01, Radim Krčmář wrote:
> > 2014-08-20 17:34+0200, Paolo Bonzini:
> > > On 20/08/2014 17:31, Radim Krčmář wrote:
> > > > Btw. without extra code, we are still going to overflow on races
> > > > when changing PW_grow; should they be covered as well?
> > > 
> > > You mean because there is no spinlock or similar protecting the
> > > changes?  I guess you could use a seqlock.
> > 
> > Yes, for example between a modification of ple_window
> > 
> >   new = min(old, PW_actual_max) * PW_grow
> > 
> > which gets compiled into something like this:
> > 
> >   1) tmp = min(old, PW_actual_max)
> >   2) new = tmp * PW_grow
> > 
> > and a write to increase PW_grow
> > 
> >   3) PW_actual_max = min(PW_max / new_PW_grow, PW_actual_max)
> >   4) PW_grow = new_PW_grow
> >   5) PW_actual_max = PW_max / new_PW_grow
> > 
> > Steps 3 and 4 can execute between 1 and 2, which could overflow.
> > 
> > I don't think they are important enough to warrant a significant
> > performance hit of locking.
> 
> A seqlock just costs two memory accesses to the same (shared) cache
> line as the PW data, and a non-taken branch.

Oh, seqlock readers do not have to write to shared memory, so it is
acceptable ...

> I don't like code that is unsafe by design...

I wouldn't say it is unsafe, because the VCPU's PW is always greater
than the module's PW.  We are just going to PLE exit sooner than
expected.
[PATCH v2 3/6] KVM: VMX: make PLE window per-VCPU
Change PLE window into per-VCPU variable, seeded from module parameter,
to allow greater flexibility.

Brings in a small overhead on every vmentry.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/vmx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2b306f9..18e0e52 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -484,6 +484,9 @@ struct vcpu_vmx {
 
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
+
+	/* Dynamic PLE window. */
+	int ple_window;
 };
 
 enum segment_cache_field {
@@ -4402,7 +4405,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
-		vmcs_write32(PLE_WINDOW, ple_window);
+		vmx->ple_window = ple_window;
 	}
 
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
@@ -7387,6 +7390,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vmx->emulation_required)
 		return;
 
+	if (ple_gap)
+		vmcs_write32(PLE_WINDOW, vmx->ple_window);
+
 	if (vmx->nested.sync_shadow_vmcs) {
 		copy_vmcs12_to_shadow(vmx);
 		vmx->nested.sync_shadow_vmcs = false;
-- 
2.0.4
[PATCH v2 6/6] KVM: VMX: runtime knobs for dynamic PLE window
ple_window is updated on every vmentry, so there is no reason to have
it read-only anymore.
ple_window* weren't writable to prevent runtime overflow races; they
are prevented by a seqlock.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/vmx.c | 48 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f63ac5d..bd73fa1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -132,24 +132,29 @@ module_param(nested, bool, S_IRUGO);
 #define KVM_VMX_DEFAULT_PLE_WINDOW_MAX \
 		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
 
+static struct kernel_param_ops param_ops_ple_t;
+#define param_check_ple_t(name, p) __param_check(name, p, int)
+
+static DEFINE_SEQLOCK(ple_window_seqlock);
+
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
-module_param(ple_window, int, S_IRUGO);
+module_param(ple_window, ple_t, S_IRUGO | S_IWUSR);
 
 /* Default doubles per-vcpu window every exit. */
 static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
-module_param(ple_window_grow, int, S_IRUGO);
+module_param(ple_window_grow, ple_t, S_IRUGO | S_IWUSR);
 
 /* Default resets per-vcpu window every exit to ple_window. */
 static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
-module_param(ple_window_shrink, int, S_IRUGO);
+module_param(ple_window_shrink, int, S_IRUGO | S_IWUSR);
 
 /* Default is to compute the maximum so we can never overflow. */
 static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
-module_param(ple_window_max, int, S_IRUGO);
+module_param(ple_window_max, ple_t, S_IRUGO | S_IWUSR);
 
 extern const ulong vmx_return;
 
@@ -5730,13 +5735,19 @@ static void modify_ple_window(struct kvm_vcpu *vcpu, int grow)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int old = vmx->ple_window;
 	int new;
+	unsigned seq;
 
-	if (grow)
-		new = __grow_ple_window(old);
-	else
-		new = __shrink_ple_window(old, ple_window_shrink, ple_window);
+	do {
+		seq = read_seqbegin(&ple_window_seqlock);
 
-	vmx->ple_window = max(new, ple_window);
+		if (grow)
+			new = __grow_ple_window(old);
+		else
+			new = __shrink_ple_window(old, ple_window_shrink,
+			                          ple_window);
+
+		vmx->ple_window = max(new, ple_window);
+	} while (read_seqretry(&ple_window_seqlock, seq));
 
 	trace_kvm_ple_window(grow, vcpu->vcpu_id, vmx->ple_window, old);
 }
@@ -5750,6 +5761,23 @@ static void update_ple_window_actual_max(void)
 					    ple_window_grow, INT_MIN);
 }
 
+static int param_set_ple_t(const char *arg, const struct kernel_param *kp)
+{
+	int ret;
+
+	write_seqlock(&ple_window_seqlock);
+	ret = param_set_int(arg, kp);
+	update_ple_window_actual_max();
+	write_sequnlock(&ple_window_seqlock);
+
+	return ret;
+}
+
+static struct kernel_param_ops param_ops_ple_t = {
+	.set = param_set_ple_t,
+	.get = param_get_int,
+};
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
@@ -9153,8 +9181,6 @@ static int __init vmx_init(void)
 	} else
 		kvm_disable_tdp();
 
-	update_ple_window_actual_max();
-
 	return 0;
 
 out7:
-- 
2.0.4
[PATCH v2 4/6] KVM: VMX: dynamise PLE window
The window is increased on every PLE exit and decreased on every sched_in.
The idea is that we don't want to PLE exit if there is no preemption going on.
We do this with sched_in() because it does not hold the rq lock.

There are two new kernel parameters for changing the window:
ple_window_grow and ple_window_shrink.  ple_window_grow affects the window on
PLE exit and ple_window_shrink does it on sched_in; depending on their value,
the window is modified like this: (ple_window is kvm_intel's global)

        ple_window_shrink/ |
ple_window_grow            | PLE exit           | sched_in
---------------------------+--------------------+---------------------
< 1                        |  = ple_window      |  = ple_window
< ple_window               | *= ple_window_grow | /= ple_window_shrink
otherwise                  | += ple_window_grow | -= ple_window_shrink

A third new parameter, ple_window_max, controls the maximal ple_window.
The minimum equals ple_window.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/vmx.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 18e0e52..e63d7ac 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -125,14 +125,32 @@ module_param(nested, bool, S_IRUGO);
  * Time is measured based on a counter that runs at the same rate as the TSC,
  * refer SDM volume 3b section 21.6.13 & 22.1.3.
  */
-#define KVM_VMX_DEFAULT_PLE_GAP    128
-#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+#define KVM_VMX_DEFAULT_PLE_GAP           128
+#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
+#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
+#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
+#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
+		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
+
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
 module_param(ple_window, int, S_IRUGO);
 
+/* Default doubles per-vcpu window every exit. */
+static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_grow, int, S_IRUGO);
+
+/* Default resets per-vcpu window every exit to ple_window. */
+static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
+module_param(ple_window_shrink, int, S_IRUGO);
+
+/* Default is to compute the maximum so we can never overflow. */
+static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+module_param(ple_window_max, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5679,12 +5697,66 @@ out:
 	return ret;
 }
 
+static int __grow_ple_window(int val)
+{
+	if (ple_window_grow < 1)
+		return ple_window;
+
+	val = min(val, ple_window_actual_max);
+
+	if (ple_window_grow < ple_window)
+		val *= ple_window_grow;
+	else
+		val += ple_window_grow;
+
+	return val;
+}
+
+static int __shrink_ple_window(int val, int shrinker, int minimum)
+{
+	if (shrinker < 1)
+		return ple_window;
+
+	if (shrinker < ple_window)
+		val /= shrinker;
+	else
+		val -= shrinker;
+
+	return max(val, minimum);
+}
+
+static void modify_ple_window(struct kvm_vcpu *vcpu, int grow)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int new;
+
+	if (grow)
+		new = __grow_ple_window(vmx->ple_window);
+	else
+		new = __shrink_ple_window(vmx->ple_window, ple_window_shrink,
+		                          ple_window);
+
+	vmx->ple_window = max(new, ple_window);
+}
+#define grow_ple_window(vcpu)   modify_ple_window(vcpu, 1)
+#define shrink_ple_window(vcpu) modify_ple_window(vcpu, 0)
+
+static void update_ple_window_actual_max(void)
+{
+	ple_window_actual_max =
+			__shrink_ple_window(max(ple_window_max, ple_window),
+			                    ple_window_grow, INT_MIN);
+}
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	if (ple_gap)
+		grow_ple_window(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	kvm_vcpu_on_spin(vcpu);
 
@@ -8854,6 +8926,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 
 void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (ple_gap)
+		shrink_ple_window(vcpu);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {
@@ -9077,6 +9151,8 @@ static int __init vmx_init(void)
 	} else
 		kvm_disable_tdp();
 
+	update_ple_window_actual_max();
+
 	return 0;
 
 out7:
-- 
2.0.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v2 5/6] KVM: trace kvm_ple_window
Tracepoint for dynamic PLE window, fired on every potential change. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/trace.h | 25 + arch/x86/kvm/vmx.c | 8 +--- arch/x86/kvm/x86.c | 1 + 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index e850a7d..4b8e6cb 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -848,6 +848,31 @@ TRACE_EVENT(kvm_track_tsc, __print_symbolic(__entry-host_clock, host_clocks)) ); +TRACE_EVENT(kvm_ple_window, + TP_PROTO(int grow, unsigned int vcpu_id, int new, int old), + TP_ARGS(grow, vcpu_id, new, old), + + TP_STRUCT__entry( + __field( int, grow ) + __field(unsigned int, vcpu_id ) + __field( int, new ) + __field( int, old ) + ), + + TP_fast_assign( + __entry-grow = grow; + __entry-vcpu_id= vcpu_id; + __entry-new= new; + __entry-old= old; + ), + + TP_printk(vcpu %u: ple_window %d %s %d, + __entry-vcpu_id, + __entry-new, + __entry-grow ? + : -, + __entry-old) +); + #endif /* CONFIG_X86_64 */ #endif /* _TRACE_KVM_H */ diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e63d7ac..f63ac5d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5728,15 +5728,17 @@ static int __shrink_ple_window(int val, int shrinker, int minimum) static void modify_ple_window(struct kvm_vcpu *vcpu, int grow) { struct vcpu_vmx *vmx = to_vmx(vcpu); + int old = vmx-ple_window; int new; if (grow) - new = __grow_ple_window(vmx-ple_window); + new = __grow_ple_window(old) else - new = __shrink_ple_window(vmx-ple_window, ple_window_shrink, - ple_window); + new = __shrink_ple_window(old, ple_window_shrink, ple_window); vmx-ple_window = max(new, ple_window); + + trace_kvm_ple_window(grow, vcpu-vcpu_id, vmx-ple_window, old); } #define grow_ple_window(vcpu) modify_ple_window(vcpu, 1) #define shrink_ple_window(vcpu) modify_ple_window(vcpu, 0) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5696ee7..814b20c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ 
-7648,3 +7648,4 @@
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
-- 
2.0.4
[PATCH v2 2/6] KVM: x86: introduce sched_in to kvm_x86_ops
sched_in preempt notifier is available for x86, allow its use in specific virtualization technlogies as well. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/svm.c | 6 ++ arch/x86/kvm/vmx.c | 6 ++ arch/x86/kvm/x86.c | 1 + 4 files changed, 15 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5724601..358e2f3 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -772,6 +772,8 @@ struct kvm_x86_ops { bool (*mpx_supported)(void); int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr); + + void (*sched_in)(struct kvm_vcpu *kvm, int cpu); }; struct kvm_arch_async_pf { diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ddf7427..4baf1bc 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -4305,6 +4305,10 @@ static void svm_handle_external_intr(struct kvm_vcpu *vcpu) local_irq_enable(); } +static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + static struct kvm_x86_ops svm_x86_ops = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, @@ -4406,6 +4410,8 @@ static struct kvm_x86_ops svm_x86_ops = { .check_intercept = svm_check_intercept, .handle_external_intr = svm_handle_external_intr, + + .sched_in = svm_sched_in, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index bfe11cf..2b306f9 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8846,6 +8846,10 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu, return X86EMUL_CONTINUE; } +void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -8951,6 +8955,8 @@ static struct kvm_x86_ops vmx_x86_ops = { .mpx_supported = vmx_mpx_supported, .check_nested_events = vmx_check_nested_events, + + .sched_in = vmx_sched_in, }; static int __init vmx_init(void) diff --git 
a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7c214f..5696ee7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7148,6 +7148,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
-- 
2.0.4
[PATCH v2 0/6] Dynamic Pause Loop Exiting window.
v1 - v2: * squashed [v1 4/9] and [v1 5/9] (clamping) * dropped [v1 7/9] (CPP abstractions) * merged core of [v1 9/9] into [v1 4/9] (automatic maximum) * reworked kernel_param_ops: closer to pure int [v2 6/6] * introduced ple_window_actual_max reworked clamping [v2 4/6] * added seqlock for parameter modifications [v2 6/6] --- PLE does not scale in its current form. When increasing VCPU count above 150, one can hit soft lockups because of runqueue lock contention. (Which says a lot about performance.) The main reason is that kvm_ple_loop cycles through all VCPUs. Replacing it with a scalable solution would be ideal, but it has already been well optimized for various workloads, so this series tries to alleviate one different major problem while minimizing a chance of regressions: we have too many useless PLE exits. Just increasing PLE window would help some cases, but it still spirals out of control. By increasing the window after every PLE exit, we can limit the amount of useless ones, so we don't reach the state where CPUs spend 99% of the time waiting for a lock. HP confirmed that this series prevents soft lockups and TSC sync errors on large guests. Radim Krčmář (6): KVM: add kvm_arch_sched_in KVM: x86: introduce sched_in to kvm_x86_ops KVM: VMX: make PLE window per-VCPU KVM: VMX: dynamise PLE window KVM: trace kvm_ple_window KVM: VMX: runtime knobs for dynamic PLE window arch/arm/kvm/arm.c | 4 ++ arch/mips/kvm/mips.c| 4 ++ arch/powerpc/kvm/powerpc.c | 4 ++ arch/s390/kvm/kvm-s390.c| 4 ++ arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/svm.c | 6 ++ arch/x86/kvm/trace.h| 25 arch/x86/kvm/vmx.c | 124 ++-- arch/x86/kvm/x86.c | 6 ++ include/linux/kvm_host.h| 2 + virt/kvm/kvm_main.c | 2 + 11 files changed, 179 insertions(+), 4 deletions(-) -- 2.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/6] KVM: add kvm_arch_sched_in
Introduce preempt notifiers for architecture specific code. Advantage over creating a new notifier in every arch is slightly simpler code and guaranteed call order with respect to kvm_sched_in. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/arm/kvm/arm.c | 4 arch/mips/kvm/mips.c | 4 arch/powerpc/kvm/powerpc.c | 4 arch/s390/kvm/kvm-s390.c | 4 arch/x86/kvm/x86.c | 4 include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c| 2 ++ 7 files changed, 24 insertions(+) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index a99e0cd..9f788eb 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -288,6 +288,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { vcpu-cpu = cpu; diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index cd71141..2362df2 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -1002,6 +1002,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, struct kvm_translation *tr) { diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 4c79284..cbc432f 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -720,6 +720,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) kvmppc_subarch_vcpu_uninit(vcpu); } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { #ifdef CONFIG_BOOKE diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index ce81eb2..a3c324e 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -555,6 +555,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) /* Nothing todo */ } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { 
save_fp_ctl(&vcpu->arch.host_fpregs.fpc);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f1e22d..d7c214f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7146,6 +7146,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	static_key_slow_dec(&kvm_no_apic_vcpu);
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	if (type)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a4c33b3..ebd7236 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -624,6 +624,8 @@ void kvm_arch_exit(void);
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..d3c3ed0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	if (vcpu->preempted)
 		vcpu->preempted = false;
 
+	kvm_arch_sched_in(vcpu, cpu);
+
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
-- 
2.0.4
Re: [PATCH v2 1/6] KVM: add kvm_arch_sched_in
2014-08-21 10:29+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > Introduce preempt notifiers for architecture specific code.
> > Advantage over creating a new notifier in every arch is slightly
> > simpler code and guaranteed call order with respect to kvm_sched_in.
> >
> > Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
> > ---
> >  arch/arm/kvm/arm.c         | 4 ++++
> >  arch/mips/kvm/mips.c       | 4 ++++
> >  arch/powerpc/kvm/powerpc.c | 4 ++++
> >  arch/s390/kvm/kvm-s390.c   | 4 ++++
> >  arch/x86/kvm/x86.c         | 4 ++++
> 
> What about adding them as static inlines in
> arch/*/include/asm/kvm_host.h (except for arch/x86 of course)?

All empty arch functions are in '.c' files, so it seems better to follow
the same path.
(And have one refactoring patch if GCC does not optimize this.)
Re: [PATCH v2 3/6] KVM: VMX: make PLE window per-VCPU
2014-08-21 10:25+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > +	if (ple_gap)
> > +		vmcs_write32(PLE_WINDOW, vmx->ple_window);
> 
> Maybe we can add a ple_window_dirty field?  It can be tested instead of
> ple_gap to avoid the vmwrite in the common case?

Sure, v3!
Re: [PATCH v2 4/6] KVM: VMX: dynamise PLE window
2014-08-21 10:24+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > +static void update_ple_window_actual_max(void)
> > +{
> > +	ple_window_actual_max =
> > +			__shrink_ple_window(max(ple_window_max, ple_window),
> 
> Why the max() here?

To have ple_window act as a ple_window_min as well.
(When we are already preventing most stupid choices.)
And it's cheap ... I can add a comment to this function :)

> > +			                    ple_window_grow, INT_MIN);
> 
> This should be 0 (in fact you can probably make everything unsigned now
> that you've sorted out the overflows).

Not in v2:

	val = min(vmx->ple_window, ple_window_actual_max);
	val += ple_window_grow;
	vmx->ple_window = val;

so we need to dip below zero to allow all possible grows.
(I'm not sure if anyone is ever going to use the additive option, so
getting rid of it is also a valid choice -- the code would be nicer.)
Re: [PATCH v2 4/6] KVM: VMX: dynamise PLE window
2014-08-21 10:26+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > +static int __shrink_ple_window(int val, int shrinker, int minimum)
> 
> s/shrinker/factor/ or s/shrinker/param/ (shrinker has another meaning
> in the kernel).

True, thanks.

> > +{
> > +	if (shrinker < 1)
> > +		return ple_window;
> > +
> > +	if (shrinker < ple_window)
> > +		val /= shrinker;
> > +	else
> > +		val -= shrinker;
> > +
> > +	return max(val, minimum);
> 
> Any reason to use anything but ple_window as the minimum, even in
> update_ple_window_actual_max?

ple_window_actual_max needs to be one grow below ple_window_max, so it
can be lower than ple_window.

> > +}
> > +
> > +static void modify_ple_window(struct kvm_vcpu *vcpu, int grow)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	int new;
> > +
> > +	if (grow)
> > +		new = __grow_ple_window(vmx->ple_window);
> > +	else
> > +		new = __shrink_ple_window(vmx->ple_window, ple_window_shrink,
> > +		                          ple_window);
> > +
> > +	vmx->ple_window = max(new, ple_window);
> > +}
> > +#define grow_ple_window(vcpu)   modify_ple_window(vcpu, 1)
> > +#define shrink_ple_window(vcpu) modify_ple_window(vcpu, 0)
> 
> No macros please. :)

Guilty as charged.
Using 0/1 or true/false in this context directly would be pretty bad ...
is an enum fine?  (SHRINK_PLE_WINDOW, GROW_PLE_WINDOW?)
(I can always make it into function pointers ;)
Re: [PATCH v2 5/6] KVM: trace kvm_ple_window
2014-08-21 10:29+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > Tracepoint for dynamic PLE window, fired on every potential change.
> >
> > Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
> > ---
> >  arch/x86/kvm/trace.h | 25 +++++++++++++++++++++++++
> >  arch/x86/kvm/vmx.c   |  8 +++++---
> >  arch/x86/kvm/x86.c   |  1 +
> >  3 files changed, 31 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index e850a7d..4b8e6cb 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -848,6 +848,31 @@ TRACE_EVENT(kvm_track_tsc,
> >  		  __print_symbolic(__entry->host_clock, host_clocks))
> >  );
> >
> > +TRACE_EVENT(kvm_ple_window,
> > +	TP_PROTO(int grow, unsigned int vcpu_id, int new, int old),
> 
> s/int grow/bool grow/ (and similarly in TP_STRUCT__entry).

Ok.
(We are using 'int' for this in other tracepoints, so I guessed there is
some hate against bool.)

> > +	TP_ARGS(grow, vcpu_id, new, old),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(         int, grow    )
> > +		__field(unsigned int, vcpu_id )
> > +		__field(         int, new     )
> > +		__field(         int, old     )
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->grow    = grow;
> > +		__entry->vcpu_id = vcpu_id;
> > +		__entry->new     = new;
> > +		__entry->old     = old;
> > +	),
> > +
> > +	TP_printk("vcpu %u: ple_window %d %s %d",
> > +		  __entry->vcpu_id,
> > +		  __entry->new,
> > +		  __entry->grow ? "+" : "-",
> > +		  __entry->old)
> 
> "? grow : shrink"

Can't say I haven't expected a comment about that :)
Re: [PATCH v2 4/6] KVM: VMX: dynamise PLE window
2014-08-21 14:29+0200, Paolo Bonzini:
> Il 21/08/2014 13:54, Radim Krčmář ha scritto:
> > Guilty as charged.
> > Using 0/1 or true/false in this context directly would be pretty
> > bad ... is an enum fine?  (SHRINK_PLE_WINDOW, GROW_PLE_WINDOW?)
> 
> I prefer good old Ctrl-C Ctrl-V (adjusted for your favorite editor). :)

I, I should be able to do it, but the nightmares are going to last a
while.

(And to lower the chances of a v4: are tracepoint macros, like the ones
for kvm_apic, frowned upon now?)
Re: [PATCH v2 1/6] KVM: add kvm_arch_sched_in
2014-08-21 14:27+0200, Paolo Bonzini:
> Il 21/08/2014 13:38, Radim Krčmář ha scritto:
> > 2014-08-21 10:29+0200, Paolo Bonzini:
> > > Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > > > Introduce preempt notifiers for architecture specific code.
> > > > Advantage over creating a new notifier in every arch is slightly
> > > > simpler code and guaranteed call order with respect to
> > > > kvm_sched_in.
> > > 
> > > What about adding them as static inlines in
> > > arch/*/include/asm/kvm_host.h (except for arch/x86 of course)?
> > 
> > All empty arch functions are in '.c' files, so it seems better to
> > follow the same path.
> > (And have one refactoring patch if GCC does not optimize this.)
> 
> GCC certainly does not optimize this (unless you use LTO).

I see LTO patches in -next ... do we want to move every empty arch
function into headers?
(It is probably going to take LTO a few years to be enabled by default.)
Re: [PATCH v2 4/6] KVM: VMX: dynamise PLE window
2014-08-21 15:18+0200, Paolo Bonzini:
> Il 21/08/2014 14:42, Radim Krčmář ha scritto:
> > 2014-08-21 14:29+0200, Paolo Bonzini:
> > > Il 21/08/2014 13:54, Radim Krčmář ha scritto:
> > > > Guilty as charged.
> > > > Using 0/1 or true/false in this context directly would be pretty
> > > > bad ... is an enum fine?  (SHRINK_PLE_WINDOW, GROW_PLE_WINDOW?)
> > > 
> > > I prefer good old Ctrl-C Ctrl-V (adjusted for your favorite
> > > editor). :)
> > 
> > I, I should be able to do it, but the nightmares are going to last a
> > while.
> > (And to lower the chances of a v4: are tracepoint macros, like the
> > ones for kvm_apic, frowned upon now?)
> 
> Nah, go ahead.  But is there any example of a #define like what you
> were doing?  I tried grepping for "#define.*[a-z]\(" and didn't get
> anything similar.

No, at least not in KVM code; it was the same idea as the tracepoints:
fewer magical constants at the cost of (syntactic) indirection.
Re: [PATCH v2 5/6] KVM: trace kvm_ple_window
2014-08-21 15:22+0200, Paolo Bonzini:
> Il 21/08/2014 13:56, Radim Krčmář ha scritto:
> > 2014-08-21 10:29+0200, Paolo Bonzini:
> > > s/int grow/bool grow/ (and similarly in TP_STRUCT__entry).
> > 
> > Ok.  (We are using 'int' for this in other tracepoints, so I guessed
> > there is some hate against bool.)
> 
> Yeah, there are many ints, but there are also some bools. :)
> 
> Of course bool is only a no-brainer if there is a clear direction to
> use (e.g. "bool has_error"), and indeed this case is less obvious.
> But there are some like this one (e.g. "bool gpa_match").  Keep the
> int if you prefer it that way.

I actually like bool much better, but the first few I looked at were
only int, and then I stopped looking ...
[PATCH v3 0/7] Dynamic Pause Loop Exiting window.
v2 - v3: * copypaste frenzy [v3 4/7] (split modify_ple_window) * commented update_ple_window_actual_max [v3 4/7] * renamed shrinker to modifier [v3 4/7] * removed an extraneous max(new, ple_window) [v3 4/7] (should have been in v2) * changed tracepoint argument type, printing and macro abstractions [v3 5/7] * renamed ple_t to ple_int [v3 6/7] (visible in modinfo) * intelligent updates of ple_window [v3 7/7] --- v1 - v2: * squashed [v1 4/9] and [v1 5/9] (clamping) * dropped [v1 7/9] (CPP abstractions) * merged core of [v1 9/9] into [v1 4/9] (automatic maximum) * reworked kernel_param_ops: closer to pure int [v2 6/6] * introduced ple_window_actual_max reworked clamping [v2 4/6] * added seqlock for parameter modifications [v2 6/6] --- PLE does not scale in its current form. When increasing VCPU count above 150, one can hit soft lockups because of runqueue lock contention. (Which says a lot about performance.) The main reason is that kvm_ple_loop cycles through all VCPUs. Replacing it with a scalable solution would be ideal, but it has already been well optimized for various workloads, so this series tries to alleviate one different major problem while minimizing a chance of regressions: we have too many useless PLE exits. Just increasing PLE window would help some cases, but it still spirals out of control. By increasing the window after every PLE exit, we can limit the amount of useless ones, so we don't reach the state where CPUs spend 99% of the time waiting for a lock. HP confirmed that this series prevents soft lockups and TSC sync errors on large guests. 
Radim Krčmář (7):
  KVM: add kvm_arch_sched_in
  KVM: x86: introduce sched_in to kvm_x86_ops
  KVM: VMX: make PLE window per-VCPU
  KVM: VMX: dynamise PLE window
  KVM: trace kvm_ple_window grow/shrink
  KVM: VMX: runtime knobs for dynamic PLE window
  KVM: VMX: optimize ple_window updates to VMCS

 arch/arm/kvm/arm.c              |   4 ++
 arch/mips/kvm/mips.c            |   4 ++
 arch/powerpc/kvm/powerpc.c      |   4 ++
 arch/s390/kvm/kvm-s390.c        |   4 ++
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/svm.c              |   6 ++
 arch/x86/kvm/trace.h            |  30 ++++++
 arch/x86/kvm/vmx.c              | 147 ++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |   6 ++
 include/linux/kvm_host.h        |   2 +
 virt/kvm/kvm_main.c             |   2 +
 11 files changed, 207 insertions(+), 4 deletions(-)

-- 
2.1.0
[PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink
Tracepoint for dynamic PLE window, fired on every potential change. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/trace.h | 30 ++ arch/x86/kvm/vmx.c | 10 -- arch/x86/kvm/x86.c | 1 + 3 files changed, 39 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index e850a7d..1742dfb 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -848,6 +848,36 @@ TRACE_EVENT(kvm_track_tsc, __print_symbolic(__entry-host_clock, host_clocks)) ); +TRACE_EVENT(kvm_ple_window, + TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old), + TP_ARGS(grow, vcpu_id, new, old), + + TP_STRUCT__entry( + __field(bool, grow ) + __field(unsigned int, vcpu_id ) + __field( int, new ) + __field( int, old ) + ), + + TP_fast_assign( + __entry-grow = grow; + __entry-vcpu_id= vcpu_id; + __entry-new= new; + __entry-old= old; + ), + + TP_printk(vcpu %u: ple_window %d (%s %d), + __entry-vcpu_id, + __entry-new, + __entry-grow ? grow : shrink, + __entry-old) +); + +#define trace_kvm_ple_window_grow(vcpu_id, new, old) \ + trace_kvm_ple_window(true, vcpu_id, new, old) +#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \ + trace_kvm_ple_window(false, vcpu_id, new, old) + #endif /* CONFIG_X86_64 */ #endif /* _TRACE_KVM_H */ diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a2fa6ba..273cbd5 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5728,16 +5728,22 @@ static int __shrink_ple_window(int val, int modifier, int minimum) static void grow_ple_window(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); + int old = vmx-ple_window; - vmx-ple_window = __grow_ple_window(vmx-ple_window); + vmx-ple_window = __grow_ple_window(old); + + trace_kvm_ple_window_grow(vcpu-vcpu_id, vmx-ple_window, old); } static void shrink_ple_window(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); + int old = vmx-ple_window; - vmx-ple_window = __shrink_ple_window(vmx-ple_window, + vmx-ple_window = __shrink_ple_window(old, 
					      ple_window_shrink,
					      ple_window);
+
+	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
 /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5696ee7..814b20c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7648,3 +7648,4 @@
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
-- 
2.1.0
[PATCH v3 4/7] KVM: VMX: dynamise PLE window
Window is increased on every PLE exit and decreased on every sched_in. The idea is that we don't want to PLE exit if there is no preemption going on. We do this with sched_in() because it does not hold rq lock. There are two new kernel parameters for changing the window: ple_window_grow and ple_window_shrink ple_window_grow affects the window on PLE exit and ple_window_shrink does it on sched_in; depending on their value, the window is modifier like this: (ple_window is kvm_intel's global) ple_window_shrink/ | ple_window_grow| PLE exit | sched_in ---++- 1| = ple_window | = ple_window ple_window | *= ple_window_grow | /= ple_window_shrink otherwise | += ple_window_grow | -= ple_window_shrink A third new parameter, ple_window_max, controls the maximal ple_window; it is internally rounded down to a closest multiple of ple_window_grow. VCPU's PLE window is never allowed below ple_window. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/vmx.c | 87 -- 1 file changed, 85 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 18e0e52..a2fa6ba 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -125,14 +125,32 @@ module_param(nested, bool, S_IRUGO); * Time is measured based on a counter that runs at the same rate as the TSC, * refer SDM volume 3b section 21.6.13 22.1.3. */ -#define KVM_VMX_DEFAULT_PLE_GAP128 -#define KVM_VMX_DEFAULT_PLE_WINDOW 4096 +#define KVM_VMX_DEFAULT_PLE_GAP 128 +#define KVM_VMX_DEFAULT_PLE_WINDOW4096 +#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW 2 +#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0 +#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX\ + INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW + static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP; module_param(ple_gap, int, S_IRUGO); static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW; module_param(ple_window, int, S_IRUGO); +/* Default doubles per-vcpu window every exit. 
 */
+static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_grow, int, S_IRUGO);
+
+/* Default resets per-vcpu window every exit to ple_window. */
+static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
+module_param(ple_window_shrink, int, S_IRUGO);
+
+/* Default is to compute the maximum so we can never overflow. */
+static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+module_param(ple_window_max, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5679,12 +5697,73 @@ out:
 	return ret;
 }
 
+static int __grow_ple_window(int val)
+{
+	if (ple_window_grow < 1)
+		return ple_window;
+
+	val = min(val, ple_window_actual_max);
+
+	if (ple_window_grow < ple_window)
+		val *= ple_window_grow;
+	else
+		val += ple_window_grow;
+
+	return val;
+}
+
+static int __shrink_ple_window(int val, int modifier, int minimum)
+{
+	if (modifier < 1)
+		return ple_window;
+
+	if (modifier < ple_window)
+		val /= modifier;
+	else
+		val -= modifier;
+
+	return max(val, minimum);
+}
+
+static void grow_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->ple_window = __grow_ple_window(vmx->ple_window);
+}
+
+static void shrink_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->ple_window = __shrink_ple_window(vmx->ple_window,
+	                                      ple_window_shrink, ple_window);
+}
+
+/*
+ * ple_window_actual_max is computed to be one grow_ple_window() below
+ * ple_window_max. (See __grow_ple_window for the reason.)
+ * This prevents overflows, because ple_window_max is int.
+ * ple_window_max is effectively rounded down to a multiple of
+ * ple_window_grow in this process.
+ * ple_window_max is also prevented from setting vmx->ple_window below
+ * ple_window.
+ */
+static void update_ple_window_actual_max(void)
+{
+	ple_window_actual_max =
+			__shrink_ple_window(max(ple_window_max, ple_window),
+			                    ple_window_grow, INT_MIN);
+}
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	if (ple_gap)
+		grow_ple_window(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	kvm_vcpu_on_spin(vcpu);
 
@@ -8854,6 +8933,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 
 void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (ple_gap)
+		shrink_ple_window(vcpu);
 }
 
 static struct kvm_x86_ops
[PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window
ple_window is updated on every vmentry, so there is no reason to have it read-only anymore. ple_window* weren't writable to prevent runtime overflow races; they are prevented by a seqlock. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/vmx.c | 46 +- 1 file changed, 37 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 273cbd5..1318232 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -132,24 +132,29 @@ module_param(nested, bool, S_IRUGO); #define KVM_VMX_DEFAULT_PLE_WINDOW_MAX\ INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW +static struct kernel_param_ops param_ops_ple_int; +#define param_check_ple_int(name, p) __param_check(name, p, int) + +static DEFINE_SEQLOCK(ple_window_seqlock); + static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP; module_param(ple_gap, int, S_IRUGO); static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW; -module_param(ple_window, int, S_IRUGO); +module_param(ple_window, ple_int, S_IRUGO | S_IWUSR); /* Default doubles per-vcpu window every exit. */ static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW; -module_param(ple_window_grow, int, S_IRUGO); +module_param(ple_window_grow, ple_int, S_IRUGO | S_IWUSR); /* Default resets per-vcpu window every exit to ple_window. */ static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK; -module_param(ple_window_shrink, int, S_IRUGO); +module_param(ple_window_shrink, int, S_IRUGO | S_IWUSR); /* Default is to compute the maximum so we can never overflow. 
*/ static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX; static int ple_window_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX; -module_param(ple_window_max, int, S_IRUGO); +module_param(ple_window_max, ple_int, S_IRUGO | S_IWUSR); extern const ulong vmx_return; @@ -5729,8 +5734,12 @@ static void grow_ple_window(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); int old = vmx->ple_window; + unsigned seq; - vmx->ple_window = __grow_ple_window(old); + do { + seq = read_seqbegin(&ple_window_seqlock); + vmx->ple_window = __grow_ple_window(old); + } while (read_seqretry(&ple_window_seqlock, seq)); trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old); } @@ -5739,9 +5748,13 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); int old = vmx->ple_window; + unsigned seq; - vmx->ple_window = __shrink_ple_window(old, - ple_window_shrink, ple_window); + do { + seq = read_seqbegin(&ple_window_seqlock); + vmx->ple_window = __shrink_ple_window(old, ple_window_shrink, + ple_window); + } while (read_seqretry(&ple_window_seqlock, seq)); trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old); } @@ -5761,6 +5774,23 @@ static void update_ple_window_actual_max(void) ple_window_grow, INT_MIN); } +static int param_set_ple_int(const char *arg, const struct kernel_param *kp) +{ + int ret; + + write_seqlock(&ple_window_seqlock); + ret = param_set_int(arg, kp); + update_ple_window_actual_max(); + write_sequnlock(&ple_window_seqlock); + + return ret; +} + +static struct kernel_param_ops param_ops_ple_int = { + .set = param_set_ple_int, + .get = param_get_int, +}; + /* * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE * exiting, so only get here on cpu with PAUSE-Loop-Exiting. 
@@ -9164,8 +9194,6 @@ static int __init vmx_init(void) } else kvm_disable_tdp(); - update_ple_window_actual_max(); - return 0; out7: -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 7/7] KVM: VMX: optimize ple_window updates to VMCS
ple_window is preserved in the VMCS, so we only need to write it after a change. Do this by keeping a dirty bit. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/vmx.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1318232..f32415b 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -510,6 +510,7 @@ struct vcpu_vmx { /* Dynamic PLE window. */ int ple_window; + bool ple_window_dirty; }; enum segment_cache_field { @@ -4426,6 +4427,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc))); } + vmx->ple_window_dirty = ple_gap; if (ple_gap) { vmcs_write32(PLE_GAP, ple_gap); vmx->ple_window = ple_window; @@ -5741,6 +5743,9 @@ static void grow_ple_window(struct kvm_vcpu *vcpu) vmx->ple_window = __grow_ple_window(old); } while (read_seqretry(&ple_window_seqlock, seq)); + if (vmx->ple_window != old) + vmx->ple_window_dirty = true; + trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old); } @@ -5756,6 +5761,9 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu) ple_window); } while (read_seqretry(&ple_window_seqlock, seq)); + if (vmx->ple_window != old) + vmx->ple_window_dirty = true; + trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old); } @@ -7505,8 +7513,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) if (vmx->emulation_required) return; - if (ple_gap) + if (vmx->ple_window_dirty) { + vmx->ple_window_dirty = false; vmcs_write32(PLE_WINDOW, vmx->ple_window); + } if (vmx->nested.sync_shadow_vmcs) { copy_vmcs12_to_shadow(vmx); -- 2.1.0
[PATCH v3 3/7] KVM: VMX: make PLE window per-VCPU
Change the PLE window into a per-VCPU variable, seeded from the module parameter, to allow greater flexibility. Brings in a small overhead on every vmentry. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/vmx.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2b306f9..18e0e52 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -484,6 +484,9 @@ struct vcpu_vmx { /* Support for a guest hypervisor (nested VMX) */ struct nested_vmx nested; + + /* Dynamic PLE window. */ + int ple_window; }; enum segment_cache_field { @@ -4402,7 +4405,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) if (ple_gap) { vmcs_write32(PLE_GAP, ple_gap); - vmcs_write32(PLE_WINDOW, ple_window); + vmx->ple_window = ple_window; } vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0); @@ -7387,6 +7390,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) if (vmx->emulation_required) return; + if (ple_gap) + vmcs_write32(PLE_WINDOW, vmx->ple_window); + if (vmx->nested.sync_shadow_vmcs) { copy_vmcs12_to_shadow(vmx); vmx->nested.sync_shadow_vmcs = false; -- 2.1.0
[PATCH v3 1/7] KVM: add kvm_arch_sched_in
Introduce preempt notifiers for architecture specific code. Advantage over creating a new notifier in every arch is slightly simpler code and guaranteed call order with respect to kvm_sched_in. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/arm/kvm/arm.c | 4 ++++ arch/mips/kvm/mips.c | 4 ++++ arch/powerpc/kvm/powerpc.c | 4 ++++ arch/s390/kvm/kvm-s390.c | 4 ++++ arch/x86/kvm/x86.c | 4 ++++ include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 2 ++ 7 files changed, 24 insertions(+) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index a99e0cd..9f788eb 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -288,6 +288,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { vcpu->cpu = cpu; diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index cd71141..2362df2 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -1002,6 +1002,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, struct kvm_translation *tr) { diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 4c79284..cbc432f 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -720,6 +720,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) kvmppc_subarch_vcpu_uninit(vcpu); } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { #ifdef CONFIG_BOOKE diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index ce81eb2..a3c324e 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -555,6 +555,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) /* Nothing todo */ } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { 
save_fp_ctl(&vcpu->arch.host_fpregs.fpc); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8f1e22d..d7c214f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7146,6 +7146,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) static_key_slow_dec(kvm_no_apic_vcpu); } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) { if (type) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index a4c33b3..ebd7236 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -624,6 +624,8 @@ void kvm_arch_exit(void); int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu); void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu); +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu); + void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu); void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu); void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 33712fb..d3c3ed0 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu) if (vcpu->preempted) vcpu->preempted = false; + kvm_arch_sched_in(vcpu, cpu); + kvm_arch_vcpu_load(vcpu, cpu); } -- 2.1.0
[PATCH v3 2/7] KVM: x86: introduce sched_in to kvm_x86_ops
sched_in preempt notifier is available for x86, allow its use in specific virtualization technologies as well. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/svm.c | 6 ++++++ arch/x86/kvm/vmx.c | 6 ++++++ arch/x86/kvm/x86.c | 1 + 4 files changed, 15 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5724601..358e2f3 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -772,6 +772,8 @@ struct kvm_x86_ops { bool (*mpx_supported)(void); int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr); + + void (*sched_in)(struct kvm_vcpu *kvm, int cpu); }; struct kvm_arch_async_pf { diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ddf7427..4baf1bc 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -4305,6 +4305,10 @@ static void svm_handle_external_intr(struct kvm_vcpu *vcpu) local_irq_enable(); } +static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + static struct kvm_x86_ops svm_x86_ops = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, @@ -4406,6 +4410,8 @@ static struct kvm_x86_ops svm_x86_ops = { .check_intercept = svm_check_intercept, .handle_external_intr = svm_handle_external_intr, + + .sched_in = svm_sched_in, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index bfe11cf..2b306f9 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8846,6 +8846,10 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu, return X86EMUL_CONTINUE; } +void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -8951,6 +8955,8 @@ static struct kvm_x86_ops vmx_x86_ops = { .mpx_supported = vmx_mpx_supported, .check_nested_events = vmx_check_nested_events, + + .sched_in = vmx_sched_in, }; static int __init vmx_init(void) diff --git 
a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d7c214f..5696ee7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7148,6 +7148,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) { + kvm_x86_ops-sched_in(vcpu, cpu); } int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
2014-08-21 18:30+0200, Paolo Bonzini: On 21/08/2014 18:08, Radim Krčmář wrote: I'm not sure of the usefulness of patch 6, so I'm going to drop it. I'll keep it in my local junkyard branch in case it's going to be useful in some scenario I didn't think of. I've been using it to benchmark different values, because it is more convenient than reloading the module after shutting down guests. (And easier to sell than writing to kernel memory.) I don't think the additional code is worth it though. Patch 7 can be easily rebased, so no need to repost (and I might even squash it into patch 3, what do you think?). Yeah, the core is already a huge patch, so it does look weird without squashing. (No-one wants to rebase to that point anyway.) Thanks.
Re: [PATCH v3 1/7] KVM: add kvm_arch_sched_in
2014-08-22 00:19+0530, Raghavendra K T: On 08/21/2014 09:38 PM, Radim Krčmář wrote: Introduce preempt notifiers for architecture specific code. Advantage over creating a new notifier in every arch is slightly simpler code and guaranteed call order with respect to kvm_sched_in. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- Reviewed-by: Raghavendra KT raghavendra...@linux.vnet.ibm.com No surprise that ia64 doesn't show here :). Oh, totally forgot about its predicted comeback ... Paolo, do you want to fix this up? (It removes a double newline.) diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 0729ba6..1630624 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -1468,6 +1468,9 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) kfree(vcpu->arch.apic); } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu) +{ +} long kvm_arch_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) and I also would have liked static inlines (as indicated by Paolo). I'll send a patch that converts empty functions into static inlines, maybe return-0s too, soon(ish).
Re: [PATCH v3 4/7] KVM: VMX: dynamise PLE window
2014-08-22 00:40+0530, Raghavendra K T: On 08/21/2014 09:38 PM, Radim Krčmář wrote: Thanks for the nice patch. default of grow = 2 and shrink = 0 is very good, which aids fast clamping in overcommit and less ple_exits in undercommit. with a small concern over modifier (shrinker) value in patch 6, Reviewed-by: Raghavendra KT raghavendra...@linux.vnet.ibm.com Thank you for the review and testing. -#define KVM_VMX_DEFAULT_PLE_GAP 128 -#define KVM_VMX_DEFAULT_PLE_WINDOW 4096 +#define KVM_VMX_DEFAULT_PLE_GAP 128 +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096 +#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW 2 +#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0 +#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX \ +INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW Some observations: 1. I think a large ple_window, for e.g. 16MB, itself would be a very large value after which the ple_window change would not have much effect. Agreed, 16M is around 4ms@4GHz, holding a spinlock for that long is closer to a bug. So could KVM_VMX_DEFAULT_PLE_WINDOW_MAX be just KVM_VMX_DEFAULT_PLE_WINDOW^2 or something ? or otherwise.. I'd be perfectly content with a high default maximum like this, yet there isn't much point in having that default at all if we can't reach it in practice, so with the current default, we are at least ready for THz+ x86 intels :) I thought about defaulting it to some guessed fraction of the scheduler tick, but then, I wanted to merge at least something :) How about .. define KVM_VMX_DEFAULT_PLE_WINDOW_MAX = INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW = 524288 (2^19), too low IMO, no-overcommit scenarios seem to go higher quite often. (considering we don't allow default grow to be greater than default ple window). 2. can we consider min_ple_window value of 2k. tracing showed that in overcommit there were several occasions of 4k - 4k situations. Low values are more affected by KVM's overhead on every PLE exit, but benchmarks really decide this ... 
Having a separate ple_window_min would allow more tuning, so I can do that if there are benefits of lower windows. On the other hand, I thought that increasing the default window would be a good default, which would make a separate minimum even better. 3. Do you think using (<<, >>) instead of (*, /) saves some cycles, considering we could potentially have grow/shrink numbers != power of 2 ? * takes one or two cycles more than <<, so I wouldn't replace it, because it limits available grows a lot. (I don't expect we would set values above 5.) / is slow (around ten times slower than *), which is the reason why we avoid it by default, but I still prefer more options. +static int __grow_ple_window(int val) +{ + if (ple_window_grow < 1) why not ple_window_grow <= 1 ? Emergent mini-feature: have different static levels of ple_window, in combination with dynamic knobs. (set grow and shrink to 1, choose ple_window before starting a guest) And because you have to willingly set 1, I'm not aware of any advantage of '<= 1'. +static int __shrink_ple_window(int val, int modifier, int minimum) +{ + if (modifier < 1) why not modifier <= 1. Maybe this would address a concern in patch 6. '/= 1' gives no shrinking, which I considered as a feature -- you can easily search for the maximal achievable ple_window that way.
Re: [PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window
2014-08-22 00:47+0530, Raghavendra K T: A positive thing about being able to change the default grow/shrink values is the flexibility of tuning the ple window to different workloads and different numbers of cpus. But is it that a value of shrink = 1 and grow > 1 is problematic ? (running an undercommit workload and then an overcommit workload) Yeah, could have mentioned it in the commit message, but I hope that my previous email explained it. (1 = don't shrink, design decision :)
Re: [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink
2014-08-25 15:53+0200, Sabrina Dubroca: Hello, 2014-08-21, 18:08:09 +0200, Radim Krčmář wrote: Tracepoint for dynamic PLE window, fired on every potential change. +#define trace_kvm_ple_window_grow(vcpu_id, new, old) \ + trace_kvm_ple_window(true, vcpu_id, new, old) +#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \ + trace_kvm_ple_window(false, vcpu_id, new, old) + #endif /* CONFIG_X86_64 */ Looks like these are needed on 32-bit as well. Today's linux-next doesn't build: [...] I moved the line #endif /* CONFIG_X86_64 */ above TRACE_EVENT(kvm_ple_window, and it builds. Thanks! Paolo, can you still fix this just by rebasing? --- diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 1742dfb..4c2868f 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -848,6 +848,8 @@ TRACE_EVENT(kvm_track_tsc, __print_symbolic(__entry->host_clock, host_clocks)) ); +#endif /* CONFIG_X86_64 */ + TRACE_EVENT(kvm_ple_window, TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old), TP_ARGS(grow, vcpu_id, new, old), @@ -878,8 +880,6 @@ TRACE_EVENT(kvm_ple_window, #define trace_kvm_ple_window_shrink(vcpu_id, new, old) \ trace_kvm_ple_window(false, vcpu_id, new, old) -#endif /* CONFIG_X86_64 */ - #endif /* _TRACE_KVM_H */ #undef TRACE_INCLUDE_PATH
Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
2014-08-22 12:45+0800, Wanpeng Li: Hi Radim, On Thu, Aug 21, 2014 at 06:50:03PM +0200, Radim Krčmář wrote: 2014-08-21 18:30+0200, Paolo Bonzini: On 21/08/2014 18:08, Radim Krčmář wrote: I'm not sure of the usefulness of patch 6, so I'm going to drop it. I'll keep it in my local junkyard branch in case it's going to be useful in some scenario I didn't think of. I've been using it to benchmark different values, because it is more Is there any benchmark data for this patchset? Sorry, I already returned the testing machine and it wasn't quality benchmarking, so I haven't kept the results ... I used ebizzy and dbench, because ebizzy had a large difference between PLE on/off and dbench a minimal one (without overcommit), so one was looking for improvements while the other was checking regressions. (And they are easy to set up.) From what I remember, this patch had roughly 5x better performance with ebizzy on 60 VCPU guests and no obvious difference for dbench. (And improvement under overcommit was visible for both.) There was a significant reduction in %sys, which never rose much above 30%, as opposed to the original 90%+.
[PATCH 1/2] KVM: static inline empty kvm_arch functions
Using static inline is going to save a few bytes and cycles. For example on powerpc, the difference is 700 B after stripping. (5 kB before) This patch also deals with two overlooked empty functions: kvm_arch_flush_shadow was not removed from arch/mips/kvm/mips.c by 2df72e9bc (KVM: split kvm_arch_flush_shadow), and kvm_arch_sched_in never made it into arch/ia64/kvm/kvm-ia64.c in e790d9ef6 (KVM: add kvm_arch_sched_in). Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/arm/include/asm/kvm_host.h | 6 ++ arch/arm/kvm/arm.c | 19 arch/arm64/include/asm/kvm_host.h | 6 ++ arch/ia64/include/asm/kvm_host.h | 12 +++ arch/ia64/kvm/kvm-ia64.c | 30 -- arch/mips/include/asm/kvm_host.h | 11 ++ arch/mips/kvm/mips.c | 42 arch/powerpc/include/asm/kvm_host.h | 8 +++ arch/powerpc/kvm/powerpc.c | 29 - arch/s390/include/asm/kvm_host.h | 14 arch/s390/kvm/kvm-s390.c | 43 - 11 files changed, 57 insertions(+), 163 deletions(-) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 6dfb404..84291fe 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -233,4 +233,10 @@ static inline void vgic_arch_setup(const struct vgic_params *vgic) int kvm_perf_init(void); int kvm_perf_teardown(void); +static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_unsetup(void) {} +static inline void kvm_arch_sync_events(struct kvm *kvm) {} +static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} +static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} + #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 9f788eb..132bb0d 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -97,27 +97,16 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu) return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE; } -void kvm_arch_hardware_disable(void *garbage) -{ -} - int kvm_arch_hardware_setup(void) { return 0; } -void kvm_arch_hardware_unsetup(void) -{ -} - 
void kvm_arch_check_processor_compat(void *rtn) { *(int *)rtn = 0; } -void kvm_arch_sync_events(struct kvm *kvm) -{ -} /** * kvm_arch_init_vm - initializes a VM data structure @@ -284,14 +273,6 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) return 0; } -void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) -{ -} - -void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) -{ -} - void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { vcpu->cpu = cpu; diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index e10c45a..94d8a3c 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -244,4 +244,10 @@ static inline void vgic_arch_setup(const struct vgic_params *vgic) } } +static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_unsetup(void) {} +static inline void kvm_arch_sync_events(struct kvm *kvm) {} +static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} +static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} + #endif /* __ARM64_KVM_HOST_H__ */ diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h index db95f57..353167d 100644 --- a/arch/ia64/include/asm/kvm_host.h +++ b/arch/ia64/include/asm/kvm_host.h @@ -595,6 +595,18 @@ void kvm_sal_emul(struct kvm_vcpu *vcpu); struct kvm *kvm_arch_alloc_vm(void); void kvm_arch_free_vm(struct kvm *kvm); +static inline void kvm_arch_sync_events(struct kvm *kvm) {} +static inline void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) {} +static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu) {} +static inline void kvm_arch_free_memslot(struct kvm *kvm, + struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {} +static inline void kvm_arch_memslots_updated(struct kvm *kvm) {} +static inline void kvm_arch_commit_memory_region(struct kvm *kvm, + struct kvm_userspace_memory_region *mem, + const struct kvm_memory_slot *old, + enum kvm_mr_change change) {} +static inline void 
kvm_arch_hardware_unsetup(void) {} + #endif /* __ASSEMBLY__*/ #endif diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 0729ba6..5e14dca 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -1364,10 +1364,6 @@ static void kvm_release_vm_pages(struct kvm *kvm) } } -void kvm_arch_sync_events(struct kvm *kvm) -{ -} - void kvm_arch_destroy_vm(struct kvm *kvm) { kvm_iommu_unmap_guest(kvm); @@ -1376,10 +1372,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm) kvm_release_vm_pages(kvm); } -void
[PATCH 0/2] KVM: minor cleanup and optimizations
The first patch answers a demand for inline arch functions. (There are a lot of constant functions that could be inlined as well.) The second patch digs a bit into the history of KVM and removes a useless argument that seemed suspicious when preparing the first patch. Radim Krčmář (2): KVM: static inline empty kvm_arch functions KVM: remove garbage arg to *hardware_{en,dis}able arch/arm/include/asm/kvm_host.h | 6 + arch/arm/kvm/arm.c | 21 + arch/arm64/include/asm/kvm_host.h | 6 + arch/ia64/include/asm/kvm_host.h | 12 ++ arch/ia64/kvm/kvm-ia64.c | 34 ++-- arch/mips/include/asm/kvm_host.h | 11 + arch/mips/kvm/mips.c | 44 +--- arch/powerpc/include/asm/kvm_host.h | 8 +++ arch/powerpc/kvm/powerpc.c | 31 + arch/s390/include/asm/kvm_host.h | 14 arch/s390/kvm/kvm-s390.c | 45 + arch/x86/include/asm/kvm_host.h | 4 ++-- arch/x86/kvm/svm.c | 4 ++-- arch/x86/kvm/vmx.c | 4 ++-- arch/x86/kvm/x86.c | 12 +- include/linux/kvm_host.h | 4 ++-- virt/kvm/kvm_main.c | 4 ++-- 17 files changed, 79 insertions(+), 185 deletions(-) -- 2.1.0
[PATCH 2/2] KVM: remove garbage arg to *hardware_{en,dis}able
In the beginning was on_each_cpu(), which required an unused argument to kvm_arch_ops.hardware_{en,dis}able, but this was soon forgotten. Remove unnecessary arguments that stem from this. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/arm/include/asm/kvm_host.h | 2 +- arch/arm/kvm/arm.c | 2 +- arch/arm64/include/asm/kvm_host.h | 2 +- arch/ia64/kvm/kvm-ia64.c | 4 ++-- arch/mips/include/asm/kvm_host.h | 2 +- arch/mips/kvm/mips.c | 2 +- arch/powerpc/include/asm/kvm_host.h | 2 +- arch/powerpc/kvm/powerpc.c | 2 +- arch/s390/include/asm/kvm_host.h | 2 +- arch/s390/kvm/kvm-s390.c | 2 +- arch/x86/include/asm/kvm_host.h | 4 ++-- arch/x86/kvm/svm.c | 4 ++-- arch/x86/kvm/vmx.c | 4 ++-- arch/x86/kvm/x86.c | 12 ++-- include/linux/kvm_host.h | 4 ++-- virt/kvm/kvm_main.c | 4 ++-- 16 files changed, 27 insertions(+), 27 deletions(-) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 84291fe..278ecfa 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -233,7 +233,7 @@ static inline void vgic_arch_setup(const struct vgic_params *vgic) int kvm_perf_init(void); int kvm_perf_teardown(void); -static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_disable(void) {} static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 132bb0d..005a7b5 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -87,7 +87,7 @@ struct kvm_vcpu __percpu **kvm_get_running_vcpus(void) return kvm_arm_running_vcpu; } -int kvm_arch_hardware_enable(void *garbage) +int kvm_arch_hardware_enable(void) { return 0; } diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 94d8a3c..3816c52 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -244,7 +244,7 
@@ static inline void vgic_arch_setup(const struct vgic_params *vgic) } } -static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_disable(void) {} static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 5e14dca..ec6b9ac 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -125,7 +125,7 @@ long ia64_pal_vp_create(u64 *vpd, u64 *host_iva, u64 *opt_handler) static DEFINE_SPINLOCK(vp_lock); -int kvm_arch_hardware_enable(void *garbage) +int kvm_arch_hardware_enable(void) { long status; long tmp_base; @@ -160,7 +160,7 @@ int kvm_arch_hardware_enable(void *garbage) return 0; } -void kvm_arch_hardware_disable(void *garbage) +void kvm_arch_hardware_disable(void) { long status; diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index b4d900a..2f63ca2 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -767,7 +767,7 @@ extern int kvm_mips_trans_mtc0(uint32_t inst, uint32_t *opc, extern void kvm_mips_dump_stats(struct kvm_vcpu *vcpu); extern unsigned long kvm_mips_get_ramsize(struct kvm *kvm); -static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_disable(void) {} static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_free_memslot(struct kvm *kvm, diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index 0ec7490..e3b21e5 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -77,7 +77,7 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu) return 1; } -int kvm_arch_hardware_enable(void *garbage) +int kvm_arch_hardware_enable(void) { return 0; } diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h index 0e646c1..b00113b 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -687,7 +687,7 @@ struct kvm_vcpu_arch { #define __KVM_HAVE_ARCH_WQP #define __KVM_HAVE_CREATE_DEVICE -static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_disable(void) {} static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_memslots_updated(struct kvm *kvm) {} diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 72c3fc0..da50523 100644 --- a/arch/powerpc/kvm
Re: [PATCH] KVM: forward declare structs in kvm_types.h
2014-08-29 14:01+0200, Paolo Bonzini:
> Opaque KVM structs are useful for prototypes in asm/kvm_host.h, to
> avoid 'struct foo' declared inside parameter list warnings (and
> consequent breakage due to conflicting types).
>
> Move them from individual files to a generic place in
> linux/kvm_types.h.
>
> Signed-off-by: Paolo Bonzini pbonz...@redhat.com
> ---

Reviewed-by: Radim Krčmář rkrc...@redhat.com

(Inclusion of linux/types.h seems to be piggybacking, but is a nice
 thing to do :)

And I'd apply these changes: (Definitely the first hunk.)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 33d8d0a..e098dce 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -140,8 +140,6 @@ static inline bool is_error_page(struct page *page)
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
 
-struct kvm;
-struct kvm_vcpu;
 extern struct kmem_cache *kvm_vcpu_cache;
 
 extern spinlock_t kvm_lock;
@@ -325,8 +323,6 @@ struct kvm_kernel_irq_routing_entry {
 	struct hlist_node link;
 };
 
-struct kvm_irq_routing_table;
-
 #ifndef KVM_PRIVATE_MEM_SLOTS
 #define KVM_PRIVATE_MEM_SLOTS 0
 #endif
@@ -1036,8 +1032,6 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
 
 extern bool kvm_rebooting;
 
-struct kvm_device_ops;
-
 struct kvm_device {
 	struct kvm_device_ops *ops;
 	struct kvm *kvm;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1d6daca..53c4f20 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -19,7 +19,9 @@
 struct kvm;
 struct kvm_async_pf;
+struct kvm_device_ops;
 struct kvm_interrupt;
+struct kvm_irq_routing_table;
 struct kvm_memory_slot;
 struct kvm_one_reg;
 struct kvm_run;
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] virtio-rng: fix stuck in catting hwrng attributes
2014-09-14 10:25+0800, Amos Kong:
> On Sun, Sep 14, 2014 at 09:12:08AM +0800, Amos Kong wrote:
> ...
> > diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
> > index c591d7e..b5d1b6f 100644
> > --- a/drivers/char/hw_random/core.c
> > +++ b/drivers/char/hw_random/core.c
> > @@ -195,8 +195,7 @@ static ssize_t rng_dev_read(struct file *filp, char __user *buf,
> >  		mutex_unlock(&rng_mutex);
> >  
> > -		if (need_resched())
> > -			schedule_timeout_interruptible(1);
> > +		schedule_timeout_interruptible(10);

If cond_resched() does not work, it is a bug elsewhere.

> Problem only occurred in non-smp guest, we can improve it to:
>
> 	if (!is_smp())
> 		schedule_timeout_interruptible(10);
>
> is_smp() is only available for arm arch, we need a general one.

(It is num_online_cpus() > 1.)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Using the tlb flush util function where applicable
2014-09-12 17:06-0400, Liang Chen:
> Using kvm_mmu_flush_tlb() as the other places to make sure vcpu stat
> is incremented
>
> Signed-off-by: Liang Chen liangchen.li...@gmail.com
> ---

Good catch.

>  arch/x86/kvm/vmx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index bfe11cf..439682e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1810,7 +1810,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> 		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
> 		unsigned long sysenter_esp;
> 
> -		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> +		kvm_mmu_flush_tlb(vcpu);
> 
> 		local_irq_disable();
> 		crash_disable_local_vmclear(cpu);

And to hijack this thread ... I noticed three other deficits in
stat.tlb_flush, patch below.  Do you prefer the current behavior?

--- 8< ---
KVM: x86: count actual tlb flushes

 - we count KVM_REQ_TLB_FLUSH requests, not actual flushes
   (KVM can have multiple requests for one flush)
 - flushes from kvm_flush_remote_tlbs() aren't counted
 - it's easy to make a direct request by mistake

Solve these by postponing the counting to kvm_check_request().

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
(And what about a possible followup patch that replaces
 kvm_mmu_flush_tlb() with kvm_make_request() again?  It would free the
 namespace a bit and we could call something similarly named from
 vcpu_enter_guest() to do the job.)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..b41fd97 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3452,7 +3452,6 @@ static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 
 void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu)
 {
-	++vcpu->stat.tlb_flush;
 	kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_flush_tlb);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 916e895..0f0ad08 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6041,8 +6041,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		}
 		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
 			kvm_mmu_sync_roots(vcpu);
-		if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
+		if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) {
+			++vcpu->stat.tlb_flush;
 			kvm_x86_ops->tlb_flush(vcpu);
+		}
 		if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
 			vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
 			r = 0;
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
2014-09-15 13:11-0700, Andres Lagar-Cavilla:
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,

The suffix '_retry' is not best suited for this.

On first reading, I imagined we will be retrying something from
before, possibly calling it in a loop, but we are actually doing the
first and last try in one call.

Hard to find something that conveys our lock-dropping mechanic,
'_polite' is my best candidate at the moment.

> +	int flags = FOLL_TOUCH | FOLL_HWPOISON |

(FOLL_HWPOISON wasn't used before, but it's harmless.)

2014-09-16 15:51+0200, Paolo Bonzini:
> Il 15/09/2014 22:11, Andres Lagar-Cavilla ha scritto:
> > @@ -1177,9 +1210,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> >  		npages = get_user_page_nowait(current, current->mm,
> >  					      addr, write_fault, page);
> >  		up_read(&current->mm->mmap_sem);
> > -	} else
> > -		npages = get_user_pages_fast(addr, 1, write_fault,
> > -					     page);
> > +	} else {
> > +		/*
> > +		 * By now we have tried gup_fast, and possible async_pf, and we
		                           ^
(If we really tried get_user_pages_fast, we wouldn't be here, so I'd
 prepend two underscores here as well.)

> > +		 * are certainly not atomic. Time to retry the gup, allowing
> > +		 * mmap semaphore to be relinquished in the case of IO.
> > +		 */
> > +		npages = kvm_get_user_page_retry(current, current->mm, addr,
> > +						 write_fault, page);
>
> This is a separate logical change.  Was this:
>
> 	down_read(&mm->mmap_sem);
> 	npages = get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> 	up_read(&mm->mmap_sem);
>
> the intention rather than get_user_pages_fast?

I believe so as well.
(Looking at get_user_pages_fast and __get_user_pages_fast made my
 abstraction detector very sad.)

I think a first patch should introduce kvm_get_user_page_retry (Retry
a fault after a gup with FOLL_NOWAIT.) and the second would add
FOLL_TRIED (This properly relinquishes mmap semaphore if the
filemap/swap has to wait on page lock (and retries the gup to
completion after that).

Not sure if that would help to understand the goal ...
> Apart from this, the patch looks good.  The mm/ parts are minimal, so
> I think it's best to merge it through the KVM tree with someone's
> Acked-by.

I would prefer to have the last hunk in a separate patch, but still,

Acked-by: Radim Krčmář rkrc...@redhat.com
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
[Emergency posting to fix the tag and couldn't find unmangled Cc list,
 so some recipients were dropped, sorry.  (I guess you are glad
 though).]

2014-09-16 14:01-0700, Andres Lagar-Cavilla:
> On Tue, Sep 16, 2014 at 1:51 PM, Radim Krčmář rkrc...@redhat.com wrote:
> > 2014-09-15 13:11-0700, Andres Lagar-Cavilla:
> > > +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> >
> > The suffix '_retry' is not best suited for this.
> >
> > On first reading, I imagined we will be retrying something from
> > before, possibly calling it in a loop, but we are actually doing the
> > first and last try in one call.
>
> We are doing ... the second and third in most scenarios.  async_pf
> did the first with _NOWAIT.  We call this from the async pf retrier,
> or if async pf couldn't be notified to the guest.

I was thinking more about what the function does, not how we currently
use it -- nothing prevents us from using it as first somewhere -- but
yeah, even comments would be off then.

> > Apart from this, the patch looks good.  The mm/ parts are minimal,
> > so I think it's best to merge it through the KVM tree with
> > someone's Acked-by.
> >
> > I would prefer to have the last hunk in a separate patch, but still,
> >
> > Acked-by: Radim Krčmář rkrc...@redhat.com
>
> Awesome, thanks much.  I'll recut with the VM_BUG_ON from Paolo and
> your Ack.  LMK if anything else from this email should go into the
> recut.

Ah, sorry, I'm not maintaining mm ... what I meant was
Reviewed-by: Radim Krčmář rkrc...@redhat.com

and I had to leave before I could find a good apology for
VM_WARN_ON_ONCE(), so if you are replacing BUG_ON, you might want to
look at that one as well.
Re: [PATCH] Using the tlb flush util function where applicable
2014-09-17 08:15+0800, Wanpeng Li:
> Hi Radim,
>
> On Mon, Sep 15, 2014 at 09:33:52PM +0200, Radim Krčmář wrote:
> > Do you prefer the current behavior?
> >
> > --- 8< ---
> > KVM: x86: count actual tlb flushes
> >
> >  - we count KVM_REQ_TLB_FLUSH requests, not actual flushes
>
> So there maybe multiple requests accumulated at the point of
> kvm_check_request, if your patch account these accumulations
> correctly?

It will ignore request accumulations and count it as one TLB flush;
we have to decide what is correct (the value is just statistics):

 a) count local KVM_REQ_TLB_FLUSH requests
 b) count all TLB flushes
 c) both (a) and (b)

I was thinking that when you look at /sys/kernel/debug/kvm/tlb_flush,
you are interested in the number of TLB flushes that VMs did, not
requests, so you won't have to add remote_tlb_flushes multiplied by
maximal vcpu count and guess their amount from this upper bound.
And that we don't even care about requests, so (c) is just a
complication.

---
I tried to get an idea about the number of coalesced requests and
added a counter called tlb_flush_real, making option (c).
After a night of reading virtio-rng (different experiment) on 1 VCPU VM:

  # cat /sys/kernel/debug/kvm/{tlb_flush{,_real},remote_tlb_flush}
  5927
  5206
  0
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
2014-09-17 13:26+0300, Gleb Natapov:
> For async_pf_execute() you do not need to even retry.  Next guest's
> page fault will retry it for you.

Wouldn't that be a waste of vmentries?

The guest might be able to handle interrupts while we are waiting, so
if we used async-io-done notifier, this could work without CPU
spinning.  (Probably with added latency.)
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
[Repost for lists, the last mail was eaten by a security troll.]

2014-09-16 14:01-0700, Andres Lagar-Cavilla:
> On Tue, Sep 16, 2014 at 1:51 PM, Radim Krčmář rkrc...@redhat.com wrote:
> > 2014-09-15 13:11-0700, Andres Lagar-Cavilla:
> > > +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> >
> > The suffix '_retry' is not best suited for this.
> >
> > On first reading, I imagined we will be retrying something from
> > before, possibly calling it in a loop, but we are actually doing the
> > first and last try in one call.
>
> We are doing ... the second and third in most scenarios.  async_pf
> did the first with _NOWAIT.  We call this from the async pf retrier,
> or if async pf couldn't be notified to the guest.

I was thinking more about what the function does, not how we currently
use it -- nothing prevents us from using it as first somewhere -- but
yeah, even comments would be off then.

> > Apart from this, the patch looks good.  The mm/ parts are minimal,
> > so I think it's best to merge it through the KVM tree with
> > someone's Acked-by.
> >
> > I would prefer to have the last hunk in a separate patch, but still,
> >
> > Acked-by: Radim Krčmář rkrc...@redhat.com
>
> Awesome, thanks much.  I'll recut with the VM_BUG_ON from Paolo and
> your Ack.  LMK if anything else from this email should go into the
> recut.

Ah, sorry, I'm not maintaining mm ... what I meant was
Reviewed-by: Radim Krčmář rkrc...@redhat.com

and I had to leave before I could find a good apology for
VM_WARN_ON_ONCE(), so if you are replacing BUG_ON, you might want to
look at that one as well.
Re: [RESEND PATCH 1/3] x86: Adding structs to reflect cpuid fields
2014-09-17 16:06+0200, Borislav Petkov:
> On Wed, Sep 17, 2014 at 04:53:39PM +0300, Nadav Amit wrote:
> > AFAIK backward compatibility is usually maintained in x86.  I did
> > not see in Intel SDM anything that says this CPUID field means
> > something for CPU X and something else for CPU Y.  Anyhow, it is
> > not different than bitmasks in this respect.
>
> You still don't get my point: what are you going to do when
> min_monitor_line_size needs to be 17 bits all of a sudden?
>
> Currently, you simply do an if-else check before using the respective
> mask and with your defined structs, you need to keep two versions:
>
> union cpuid5_ebx_before_family_X {
> 	struct {
> 		unsigned int max_monitor_line_size:16;
> 		unsigned int reserved:16;
> 	} split;
> 	unsigned int full;
> };
>
> union cpuid5_ebx_after_family_X {
> 	struct {
> 		unsigned int max_monitor_line_size:17;
> 		unsigned int reserved:15;
> 	} split;
> 	unsigned int full;
> };

New union wouldn't be very convenient if the change touched just a
small part of the register ... probably the best choice is using
anonymous elements like this,

union cpuid5_ebx {
	union {
		struct {
			unsigned int max_monitor_line_size:16;
			unsigned int reserved:16;
		};
		struct {
			unsigned int max_monitor_line_size_after_family_X:17;
			unsigned int reserved_after_family_X:15;
		};
	} split;
	unsigned int full;
};

which would result in a similar if-else hack

	if (family X)
		ebx.split.max_monitor_line_size_after_family_X = 0;
	else
		ebx.split.max_monitor_line_size = 0;

other options are

	ebx.split.after_family_X.max_monitor_line_size

or even

	ebx.split.max_monitor_line_size.after_family_X

Flat namespace is more flexible wrt. code.