[PATCH 1/2] kvm: free resources after canceling async_pf
When we cancel 'async_pf_execute()', we should behave as if the work was
never scheduled in 'kvm_setup_async_pf()'.

Fixes a bug where we could not unload a module, because the VM was never
destroyed.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 virt/kvm/async_pf.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index b44cea0..f30aa1c 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -102,8 +102,11 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 			     typeof(*work), queue);
 		cancel_work_sync(&work->work);
 		list_del(&work->queue);
-		if (!work->done) /* work was canceled */
+		if (!work->done) { /* work was canceled */
+			mmdrop(work->mm);
+			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
 			kmem_cache_free(async_pf_cache, work);
+		}
 	}

 	spin_lock(&vcpu->async_pf.lock);
--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] kvm: fix a bug and remove a redundancy in async_pf
I did not reproduce the bug fixed in [1/2], but there are not that many
reasons why we could not unload a module, so the spot is quite obvious.

Radim Krčmář (2):
  kvm: free resources after canceling async_pf
  kvm: remove .done from struct kvm_async_pf

 include/linux/kvm_host.h | 1 -
 virt/kvm/async_pf.c      | 8 ++++----
 2 files changed, 4 insertions(+), 5 deletions(-)

--
1.8.3.1
[PATCH 2/2] kvm: remove .done from struct kvm_async_pf
'.done' is used to mark the completion of 'async_pf_execute()', but
'cancel_work_sync()' returns true when the work was canceled, so we can
use that instead.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 include/linux/kvm_host.h | 1 -
 virt/kvm/async_pf.c      | 5 +----
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ca645a0..c7a5e08 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -190,7 +190,6 @@ struct kvm_async_pf {
 	unsigned long addr;
 	struct kvm_arch_async_pf arch;
 	struct page *page;
-	bool done;
 };

 void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index f30aa1c..89acf41 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -76,7 +76,6 @@ static void async_pf_execute(struct work_struct *work)
 	spin_lock(&vcpu->async_pf.lock);
 	list_add_tail(&apf->link, &vcpu->async_pf.done);
 	apf->page = page;
-	apf->done = true;
 	spin_unlock(&vcpu->async_pf.lock);

 	/*
@@ -100,9 +99,8 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
 		struct kvm_async_pf *work =
 			list_entry(vcpu->async_pf.queue.next,
 				   typeof(*work), queue);
-		cancel_work_sync(&work->work);
 		list_del(&work->queue);
-		if (!work->done) { /* work was canceled */
+		if (cancel_work_sync(&work->work)) {
 			mmdrop(work->mm);
 			kvm_put_kvm(vcpu->kvm); /* == work->vcpu->kvm */
 			kmem_cache_free(async_pf_cache, work);
@@ -167,7 +165,6 @@ int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
 		return 0;

 	work->page = NULL;
-	work->done = false;
 	work->vcpu = vcpu;
 	work->gva = gva;
 	work->addr = gfn_to_hva(vcpu->kvm, gfn);
--
1.8.3.1
[PATCH 4/7] static_key: use static_key_slow_inc_deferred()
Simple replacement where possible.  Saves us problematic access to the
structure and allows optimizations and bug fixes to take place.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/lapic.c | 7 ++++---
 kernel/events/core.c | 6 +++---
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5439117..eff85f6 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -121,7 +121,7 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
 		if (val & APIC_SPIV_APIC_ENABLED)
 			static_key_slow_dec_deferred(&apic_sw_disabled);
 		else
-			static_key_slow_inc(&apic_sw_disabled.key);
+			static_key_slow_inc_deferred(&apic_sw_disabled);
 	}
 	apic_set_reg(apic, APIC_SPIV, val);
 }
@@ -1351,7 +1351,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 		if (value & MSR_IA32_APICBASE_ENABLE)
 			static_key_slow_dec_deferred(&apic_hw_disabled);
 		else
-			static_key_slow_inc(&apic_hw_disabled.key);
+			static_key_slow_inc_deferred(&apic_hw_disabled);
 		recalculate_apic_map(vcpu->kvm);
 	}
@@ -1546,7 +1546,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 	kvm_lapic_set_base(vcpu,
 			APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE);
-	static_key_slow_inc(&apic_sw_disabled.key); /* sw disabled at reset */
+	/* sw disabled at reset */
+	static_key_slow_inc_deferred(&apic_sw_disabled);
 	kvm_lapic_reset(vcpu);
 	kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d49a9d2..ade89a1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6565,7 +6565,7 @@ static void account_event(struct perf_event *event)
 		return;

 	if (event->attach_state & PERF_ATTACH_TASK)
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);
 	if (event->attr.mmap || event->attr.mmap_data)
 		atomic_inc(&nr_mmap_events);
 	if (event->attr.comm)
@@ -6577,9 +6577,9 @@ static void account_event(struct perf_event *event)
 			tick_nohz_full_kick_all();
 	}
 	if (has_branch_stack(event))
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);
 	if (is_cgroup_event(event))
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);

 	account_event_cpu(event, event->cpu);
 }
--
1.8.3.1
Re: [PATCH 4/7] static_key: use static_key_slow_inc_deferred()
2013-10-17 12:39+0200, Paolo Bonzini:
> On 17/10/2013 12:10, Radim Krčmář wrote:
> > Simple replacement where possible.  Saves us problematic access to
> > the structure and allows optimizations and bug fixes to take place.
> 
> I think you should introduce this first as a simple wrapper around
> static_key_slow_inc, and then add the bells and whistles you have in
> patches 2 and 3.

Ok, thanks.
[PATCH v2 5/5] static_key: improve deferred inc behavior
We can cancel a deferred static_key_slow_dec() instead of increasing
.enabled.counter.  The timer now won't fire before 'timeout' has passed
since the last increase, so this patch further stabilizes the case of
frequent switching.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 kernel/jump_label.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index bd7ad31..9b57261 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -79,7 +79,8 @@ void static_key_slow_inc_deferred(struct static_key_deferred *key)
 	STATIC_KEY_CHECK_USE();
 	if (atomic_dec_if_positive(&key->enabled_debt) >= 0)
 		return;
-	static_key_slow_inc(&key->key);
+	if (!cancel_delayed_work(&key->work))
+		static_key_slow_inc(&key->key);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc_deferred);
--
1.8.4.2
[PATCH v2 0/5] static_key: deferred key fixes and improvements
This series is made out of three logical parts:
 - patches 1 and 2 fix a panic caused by accessing freed module memory
 - patches 3 and 4 fix miscounting by static_key_slow_dec_deferred()
 - patch 5 introduces a minor optimization

More jump_label/static_key patches are prepared, but I returned to them
yesterday and implemented a variable jump length on amd64, which
requires some refactoring, porting to remaining architectures, and
retesting, so I'm posting this independent part before it gets
overtaken by higher priority work again
= This series was tested with additional patches

^ I wrote this on Tuesday and then moved to higher priority work, but
returned with enough courage to post a different first part.
The first part was tested on amd64, s390x and ppc64, the rest also on
armv7.

Applies to next-20131206 and v3.13-rc3.

Radim Krčmář (5):
  static_key: add a section for deferred keys
  static_key: cancel rate limit timer on rmmod
  static_key: add static_key_slow_inc_deferred()
  static_key: keep deferred enabled counter debt
  static_key: improve deferred inc behavior

 arch/x86/kvm/lapic.c                 | 11 +
 arch/x86/kvm/lapic.h                 |  4 +--
 include/asm-generic/vmlinux.lds.h    |  1 +
 include/linux/jump_label_ratelimit.h | 10
 include/linux/module.h               |  3 +++
 include/linux/perf_event.h           |  2 +-
 kernel/events/core.c                 |  8 +++---
 kernel/jump_label.c                  | 47 +++-
 kernel/module.c                      |  4 +++
 9 files changed, 72 insertions(+), 18 deletions(-)

--
1.8.4.2
[PATCH v2 1/5] static_key: add a section for deferred keys
We need to know about all deferred keys if we want to correctly
 - initialize timers on kernel init/module load
 - destroy pending timers when unloading a module

We depend on a section attribute, so direct definitions of
struct static_key_deferred should be avoided, which is suboptimal.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
A more general solution would use compile-time magic to generate an
array of pointers to deferred structures, but I am not sure if it is
acceptable and possible.

A worse approach added an unload_callback_list to the struct module.
Callbacks of type void (*)(void *) were registered on
static_key_rate_limit(), to cancel the deferred key work from
module_unload_free().  (I have patches for this)

Jump labels are already notified for module changes, so we could keep
track of deferred keys in modules there, using a static global tree.
(in the worst case)

 arch/x86/kvm/lapic.c                 | 4 ++--
 arch/x86/kvm/lapic.h                 | 4 ++--
 include/asm-generic/vmlinux.lds.h    | 1 +
 include/linux/jump_label_ratelimit.h | 4 ++++
 include/linux/module.h               | 3 +++
 include/linux/perf_event.h           | 2 +-
 kernel/events/core.c                 | 2 +-
 kernel/module.c                      | 4 ++++
 8 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5439117..5f01547 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -112,8 +112,8 @@ static inline int __apic_test_and_clear_vector(int vec, void *bitmap)
 	return __test_and_clear_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 }

-struct static_key_deferred apic_hw_disabled __read_mostly;
-struct static_key_deferred apic_sw_disabled __read_mostly;
+static_key_deferred(apic_hw_disabled);
+static_key_deferred(apic_sw_disabled);

 static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
 {
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index c730ac9..4ae9a7a 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -108,7 +108,7 @@ static inline bool kvm_vcpu_has_lapic(struct kvm_vcpu *vcpu)
 	return true;
 }

-extern struct static_key_deferred apic_hw_disabled;
+extern static_key_deferred(apic_hw_disabled);

 static inline int kvm_apic_hw_enabled(struct kvm_lapic *apic)
 {
@@ -117,7 +117,7 @@ static inline int kvm_apic_hw_enabled(struct kvm_lapic *apic)
 	return MSR_IA32_APICBASE_ENABLE;
 }

-extern struct static_key_deferred apic_sw_disabled;
+extern static_key_deferred(apic_sw_disabled);

 static inline int kvm_apic_sw_enabled(struct kvm_lapic *apic)
 {
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index bc2121f..572e583 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -214,6 +214,7 @@
 #define READ_MOSTLY_DATA(align)					\
 	. = ALIGN(align);					\
 	*(.data..read_mostly)					\
+	*(__deferred_keys)					\
 	. = ALIGN(align);

 #define CACHELINE_ALIGNED_DATA(align)				\
diff --git a/include/linux/jump_label_ratelimit.h b/include/linux/jump_label_ratelimit.h
index 089f70f..1216db0 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -3,6 +3,10 @@

 #include <linux/jump_label.h>
 #include <linux/workqueue.h>
+#include <linux/compiler.h>
+
+#define static_key_deferred(name) \
+	struct static_key_deferred name __section(__deferred_keys)

 #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
 struct static_key_deferred {
diff --git a/include/linux/module.h b/include/linux/module.h
index 46e548f..4cc7269 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -347,6 +347,9 @@ struct module
 #ifdef HAVE_JUMP_LABEL
 	struct jump_entry *jump_entries;
 	unsigned int num_jump_entries;
+
+	struct static_key_deferred *deferred_keys;
+	unsigned int num_deferred_keys;
 #endif
 #ifdef CONFIG_TRACING
 	unsigned int num_trace_bprintk_fmt;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8f4a70f..8baabca 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -672,7 +672,7 @@ perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 	}
 }

-extern struct static_key_deferred perf_sched_events;
+extern static_key_deferred(perf_sched_events);

 static inline void perf_event_task_sched_in(struct task_struct *prev,
 					    struct task_struct *task)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 403b781..ee64d26 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -138,7 +138,7 @@ enum event_type_t {
  * perf_sched_events : >0 events exist
  * perf_cgroup_events: >0 per-cpu cgroup events exist on this cpu
[PATCH v2 2/5] static_key: cancel rate limit timer on rmmod
Fix a bug where we freed a module's memory while a timer was pending,
by canceling all deferred timers of the unloaded module.

static_key_rate_limit() still can't be called more than once.

Reproducer: (host crasher)
  modprobe kvm_intel
  (sleep 1; echo quit) \
    | qemu-kvm -kernel /dev/null -monitor stdio
  sleep 0.5
  until modprobe -rv kvm_intel 2>/dev/null; do :; done

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
I decided not to post a patch that uses __deferred_key in kernel/module
init, so these three functions might seem like an overkill.

 kernel/jump_label.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 9019f15..02d610a 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -125,6 +125,27 @@ void jump_label_rate_limit(struct static_key_deferred *key,
 }
 EXPORT_SYMBOL_GPL(jump_label_rate_limit);

+/* could (should?) be abstracted more */
+static void with_deferred_keys(struct static_key_deferred *entry,
+			       struct static_key_deferred *stop,
+			       void (*op)(struct static_key_deferred *))
+{
+	struct static_key_deferred *iter;
+
+	for (iter = entry; iter < stop; iter++)
+		op(iter);
+}
+
+#define with_module_deferred_keys(mod, op) \
+	with_deferred_keys(mod->deferred_keys, \
+			   mod->deferred_keys + mod->num_deferred_keys, \
+			   op)
+
+static void deferred_key_cancel_work(struct static_key_deferred *dkey)
+{
+	cancel_delayed_work_sync(&dkey->work);
+}
+
 static int addr_conflict(struct jump_entry *entry, void *start, void *end)
 {
 	if (entry->code <= (unsigned long)end &&
@@ -385,6 +406,7 @@ jump_label_module_notify(struct notifier_block *self, unsigned long val,
 		jump_label_unlock();
 		break;
 	case MODULE_STATE_GOING:
+		with_module_deferred_keys(mod, deferred_key_cancel_work);
 		jump_label_lock();
 		jump_label_del_module(mod);
 		jump_label_unlock();
--
1.8.4.2
[PATCH v2 3/5] static_key: add static_key_slow_inc_deferred()
Complement to static_key_slow_dec_deferred().  This avoids an
asymmetrical API and prepares us for future optimizations and bug
fixes.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/lapic.c                 | 7 ++++---
 include/linux/jump_label_ratelimit.h | 5 +++++
 kernel/events/core.c                 | 6 +++---
 kernel/jump_label.c                  | 7 +++++++
 4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5f01547..86973ac 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -121,7 +121,7 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
 		if (val & APIC_SPIV_APIC_ENABLED)
 			static_key_slow_dec_deferred(&apic_sw_disabled);
 		else
-			static_key_slow_inc(&apic_sw_disabled.key);
+			static_key_slow_inc_deferred(&apic_sw_disabled);
 	}
 	apic_set_reg(apic, APIC_SPIV, val);
 }
@@ -1351,7 +1351,7 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 		if (value & MSR_IA32_APICBASE_ENABLE)
 			static_key_slow_dec_deferred(&apic_hw_disabled);
 		else
-			static_key_slow_inc(&apic_hw_disabled.key);
+			static_key_slow_inc_deferred(&apic_hw_disabled);
 		recalculate_apic_map(vcpu->kvm);
 	}
@@ -1546,7 +1546,8 @@ int kvm_create_lapic(struct kvm_vcpu *vcpu)
 	kvm_lapic_set_base(vcpu,
 			APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE);
-	static_key_slow_inc(&apic_sw_disabled.key); /* sw disabled at reset */
+	/* sw disabled at reset */
+	static_key_slow_inc_deferred(&apic_sw_disabled);
 	kvm_lapic_reset(vcpu);
 	kvm_iodevice_init(&apic->dev, &apic_mmio_ops);
diff --git a/include/linux/jump_label_ratelimit.h b/include/linux/jump_label_ratelimit.h
index 112ba5f..a18aded 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -18,6 +18,7 @@ struct static_key_deferred {
 #endif

 #ifdef HAVE_JUMP_LABEL
+extern void static_key_slow_inc_deferred(struct static_key_deferred *key);
 extern void static_key_slow_dec_deferred(struct static_key_deferred *key);
 extern void jump_label_rate_limit(struct static_key_deferred *key,
 		unsigned long rl);
@@ -26,6 +27,10 @@ jump_label_rate_limit(struct static_key_deferred *key, unsigned long rl);
 struct static_key_deferred {
 	struct static_key key;
 };
+static inline void static_key_slow_inc_deferred(struct static_key_deferred *key)
+{
+	static_key_slow_inc(&key->key);
+}
 static inline void static_key_slow_dec_deferred(struct static_key_deferred *key)
 {
 	STATIC_KEY_CHECK_USE();
diff --git a/kernel/events/core.c b/kernel/events/core.c
index ee64d26..b3fb4c2 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6597,7 +6597,7 @@ static void account_event(struct perf_event *event)
 		return;

 	if (event->attach_state & PERF_ATTACH_TASK)
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);
 	if (event->attr.mmap || event->attr.mmap_data)
 		atomic_inc(&nr_mmap_events);
 	if (event->attr.comm)
@@ -6609,9 +6609,9 @@ static void account_event(struct perf_event *event)
 			tick_nohz_full_kick_all();
 	}
 	if (has_branch_stack(event))
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);
 	if (is_cgroup_event(event))
-		static_key_slow_inc(&perf_sched_events.key);
+		static_key_slow_inc_deferred(&perf_sched_events);

 	account_event_cpu(event, event->cpu);
 }
diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 02d610a..41592ba 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -74,6 +74,13 @@ void static_key_slow_inc(struct static_key *key)
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc);

+void static_key_slow_inc_deferred(struct static_key_deferred *key)
+{
+	STATIC_KEY_CHECK_USE();
+	static_key_slow_inc(&key->key);
+}
+EXPORT_SYMBOL_GPL(static_key_slow_inc_deferred);
+
 static void __static_key_slow_dec(struct static_key *key,
 		unsigned long rate_limit, struct delayed_work *work)
 {
--
1.8.4.2
[PATCH v2 4/5] static_key: keep deferred enabled counter debt
When '.enabled.counter == 1', static_key_slow_dec_deferred() gets
silently dropped if a decrease is already pending.  We print a warning
if this happens, and because .enabled.counter cannot go below 1 before
the pending decrease has finished, we keep a count of the ignored
static_key_slow_dec_deferred() calls and skip an equal number of
static_key_slow_inc_deferred() calls.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 include/linux/jump_label_ratelimit.h |  1 +
 kernel/jump_label.c                  | 17 +++++++++++------
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/jump_label_ratelimit.h b/include/linux/jump_label_ratelimit.h
index a18aded..2d5fa1a 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -14,6 +14,7 @@ struct static_key_deferred {
 	struct static_key key;
 	unsigned long timeout;
 	struct delayed_work work;
+	atomic_t enabled_debt;
 };
 #endif

diff --git a/kernel/jump_label.c b/kernel/jump_label.c
index 41592ba..bd7ad31 100644
--- a/kernel/jump_label.c
+++ b/kernel/jump_label.c
@@ -77,12 +77,14 @@ EXPORT_SYMBOL_GPL(static_key_slow_inc);
 void static_key_slow_inc_deferred(struct static_key_deferred *key)
 {
 	STATIC_KEY_CHECK_USE();
+	if (atomic_dec_if_positive(&key->enabled_debt) >= 0)
+		return;
 	static_key_slow_inc(&key->key);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_inc_deferred);

 static void __static_key_slow_dec(struct static_key *key,
-		unsigned long rate_limit, struct delayed_work *work)
+		struct static_key_deferred *dkey)
 {
 	if (!atomic_dec_and_mutex_lock(&key->enabled, &jump_label_mutex)) {
 		WARN(atomic_read(&key->enabled) < 0,
@@ -90,9 +92,12 @@ static void __static_key_slow_dec(struct static_key *key,
 		return;
 	}

-	if (rate_limit) {
+	if (dkey && dkey->timeout) {
 		atomic_inc(&key->enabled);
-		schedule_delayed_work(work, rate_limit);
+		if (!schedule_delayed_work(&dkey->work, dkey->timeout)) {
+			atomic_inc(&dkey->enabled_debt);
+			WARN(1, "jump label: negative deferred count!\n");
+		}
 	} else {
 		if (!jump_label_get_branch_default(key))
 			jump_label_update(key, JUMP_LABEL_DISABLE);
@@ -106,20 +111,20 @@ static void jump_label_update_timeout(struct work_struct *work)
 {
 	struct static_key_deferred *key =
 		container_of(work, struct static_key_deferred, work.work);
-	__static_key_slow_dec(&key->key, 0, NULL);
+	__static_key_slow_dec(&key->key, NULL);
 }

 void static_key_slow_dec(struct static_key *key)
 {
 	STATIC_KEY_CHECK_USE();
-	__static_key_slow_dec(key, 0, NULL);
+	__static_key_slow_dec(key, NULL);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_dec);

 void static_key_slow_dec_deferred(struct static_key_deferred *key)
 {
 	STATIC_KEY_CHECK_USE();
-	__static_key_slow_dec(&key->key, key->timeout, &key->work);
+	__static_key_slow_dec(&key->key, key);
 }
 EXPORT_SYMBOL_GPL(static_key_slow_dec_deferred);
--
1.8.4.2
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-12 21:36+0100, Paolo Bonzini:
> From: Gleb Natapov g...@redhat.com
> 
> A guest can cause a BUG_ON() leading to a host kernel crash.  When the
> guest writes to the ICR to request an IPI, while in x2apic mode the
> following things happen: the destination is read from ICR2, which is a
> register that the guest can control.
> 
> kvm_irq_delivery_to_apic_fast uses the high 16 bits of ICR2 as the
> cluster id.  A BUG_ON is triggered, which is a protection against
> accessing map->logical_map with an out-of-bounds access and manages
> to avoid that anything really unsafe occurs.
> 
> The logic in the code is correct from real HW point of view.  The
> problem is that KVM supports only one cluster with ID 0 in clustered
> mode, but the code that has the bug does not take this into account.

The more I read about x2apic, the more confused I am ...

 - How was the cluster x2apic enabled?
   Linux won't enable cluster x2apic without interrupt remapping and I
   had no idea we were allowed to do it.

 - A hardware test-suite found this?
   This bug can only be hit when the destination cpu is >= 256, so the
   request itself is buggy -- we don't support that many in kvm and it
   would crash when initializing the vcpus if we did.
   => It looks like we should just ignore the ipi, because we have no
      vcpus in that cluster.

 - Where does the 'only one supported cluster' come from?
   I only see we use 'struct kvm_lapic *logical_map[16][16];', which
   supports 16 clusters of 16 apics => first 256 vcpus, so if we map
   everything to logical_map[0][0:15], we would not work correctly in
   the cluster x2apic with > 16 vcpus.

Thanks.

> Reported-by: Lars Bull larsb...@google.com
> Cc: sta...@vger.kernel.org
> Signed-off-by: Gleb Natapov g...@redhat.com
> Signed-off-by: Paolo Bonzini pbonz...@redhat.com
> ---
>  arch/x86/kvm/lapic.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index b8bec45c1610..801dc3fd66e1 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -143,6 +143,8 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
>  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
>  }
> 
> +#define KMV_X2APIC_CID_BITS 0
> +
>  static void recalculate_apic_map(struct kvm *kvm)
>  {
>  	struct kvm_apic_map *new, *old = NULL;
> @@ -180,7 +182,8 @@ static void recalculate_apic_map(struct kvm *kvm)
>  		if (apic_x2apic_mode(apic)) {
>  			new->ldr_bits = 32;
>  			new->cid_shift = 16;
> -			new->cid_mask = new->lid_mask = 0xffff;
> +			new->cid_mask = (1 << KMV_X2APIC_CID_BITS) - 1;
> +			new->lid_mask = 0xffff;
>  		} else if (kvm_apic_sw_enabled(apic) &&
> 				!new->cid_mask /* flat mode */ &&
> 				kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER) {
> --
> 1.8.3.1
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-13 18:25+0100, Paolo Bonzini:
> On 13/12/2013 17:07, Radim Krčmář wrote:
> > This bug can only be hit when the destination cpu is >= 256, so the
> > request itself is buggy -- we don't support that many in kvm and it
> > would crash when initializing the vcpus if we did.
> > => It looks like we should just ignore the ipi, because we have no
> >    vcpus in that cluster.
> 
> That's what should happen in physical mode.  Something like this patch:
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 5439117d5c4c..1f8e9e1abd3b 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -613,9 +615,10 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
>  	if (irq->dest_mode == 0) { /* physical mode */
>  		if (irq->delivery_mode == APIC_DM_LOWEST ||
> -		    irq->dest_id == 0xff)
> +		    irq->dest_id == 0xff ||

The 0xff is xapic broadcast, we did not care about the x2apic one and
no-one noticed that it does not work :)
0xff isn't special in x2apic mode.

> +		    (apic_x2apic_mode(src) && irq->dest_id > 0xff))
>  			goto out;
> -		dst = &map->phys_map[irq->dest_id & 0xff];
> +		dst = &map->phys_map[irq->dest_id];
>  	} else {
>  		u32 mda = irq->dest_id << (32 - map->ldr_bits);
> 
> On top of this, the x2apic spec documents a broadcast destination ID
> that could be implemented as follows.  But I have no idea if this is
> correct and how it differs from the usual broadcast shorthand:
> 
> @@ -815,9 +818,11 @@ static void apic_send_ipi(struct kvm_lapic *apic)
>  	irq.level = icr_low & APIC_INT_ASSERT;
>  	irq.trig_mode = icr_low & APIC_INT_LEVELTRIG;
>  	irq.shorthand = icr_low & APIC_SHORT_MASK;
> -	if (apic_x2apic_mode(apic))
> +	if (apic_x2apic_mode(apic)) {
>  		irq.dest_id = icr_high;
> +		if (icr_high == 0xffffffff)
> +			irq.shorthand = APIC_DEST_ALLINC;

The shorthand takes priority, so we shouldn't overwrite it.
This is better solved after we 'goto out' from the _fast().
(could be nicer still ...)

> +	} else
>  		irq.dest_id = GET_APIC_DEST_FIELD(icr_high);
> 
>  	trace_kvm_apic_ipi(icr_low, irq.dest_id);

And another possibility occurred to me now: Google was the first one to
encounter a broadcast; we don't handle it at all in the fast path and
x2apic has 0xffffffff as the target in both modes ...
(I think that xapic has 0xff in both of them too, but I'll check)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5439117..e4618c4 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -512,19 +512,24 @@ static void apic_set_tpr(struct kvm_lapic *apic, u32 tpr)
 		apic_update_ppr(apic);
 }

-int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u16 dest)
+int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u32 dest)
 {
-	return dest == 0xff || kvm_apic_id(apic) == dest;
+	return dest == (apic_x2apic_mode(apic) ? 0xffffffff : 0xff)
+	       || kvm_apic_id(apic) == dest;
 }

-int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda)
+int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda)
 {
 	int result = 0;
 	u32 logical_id;

+	if (mda == (apic_x2apic_mode(apic) ? 0xffffffff : 0xff))
+		return 1;
+
 	if (apic_x2apic_mode(apic)) {
 		logical_id = kvm_apic_get_reg(apic, APIC_LDR);
-		return logical_id & mda;
+		return mda >> 16 == logical_id >> 16 &&
+		       logical_id & mda & 0xffff;
 	}

 	logical_id = GET_APIC_LOGICAL_ID(kvm_apic_get_reg(apic, APIC_LDR));
@@ -605,6 +610,10 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 	if (irq->shorthand)
 		return false;

+	/* broadcast */
+	if (irq->dest_id == (apic_x2apic_mode(src) ? 0xffffffff : 0xff))
+		return false;
+
 	rcu_read_lock();
 	map = rcu_dereference(kvm->arch.apic_map);

@@ -612,10 +621,13 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 		goto out;

 	if (irq->dest_mode == 0) { /* physical mode */
-		if (irq->delivery_mode == APIC_DM_LOWEST ||
-		    irq->dest_id == 0xff)
+		if (irq->dest_id > 0xff) {
+			ret = true;
+			goto out;
+		}
+		if (irq->delivery_mode == APIC_DM_LOWEST)
 			goto out;
-		dst = &map->phys_map[irq->dest_id & 0xff];
+		dst = &map->phys_map[irq->dest_id];
 	} else {
 		u32 mda = irq->dest_id << (32 - map->ldr_bits);

The logical x2apic slowpath was completely broken, maybe still is, I
have no way to test it ...

The broadcast checking should be abstracted, we call it a lot and
giving it a name would be much better than commenting.

Also
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-14 11:46+0200, Gleb Natapov: On Fri, Dec 13, 2013 at 05:07:54PM +0100, Radim Krčmář wrote: 2013-12-12 21:36+0100, Paolo Bonzini: From: Gleb Natapov g...@redhat.com A guest can cause a BUG_ON() leading to a host kernel crash. When the guest writes to the ICR to request an IPI, while in x2apic mode the following things happen, the destination is read from ICR2, which is a register that the guest can control. kvm_irq_delivery_to_apic_fast uses the high 16 bits of ICR2 as the cluster id. A BUG_ON is triggered, which is a protection against accessing map-logical_map with an out-of-bounds access and manages to avoid that anything really unsafe occurs. The logic in the code is correct from real HW point of view. The problem is that KVM supports only one cluster with ID 0 in clustered mode, but the code that has the bug does not take this into account. The more I read about x2apic, the more confused I am ... - How was the cluster x2apic enabled? Linux won't enable cluster x2apic without interrupt remapping and I had no idea we were allowed to do it. Malicious code can do it. - A hardware test-suite found this? This bug can only be hit when the destination cpu is 256, so the request itself is buggy -- we don't support that many in kvm and it would crash when initializing the vcpus if we did. = It looks like we should just ignore the ipi, because we have no vcpus in that cluster. That's the nature of malicious code: it does what your code does not expects it to do :) I was wondering if there wasn't malicious linux on the other side too :) - Where does the 'only one supported cluster' come from? only one supported cluster comes from 8 bit cpuid limitation of KVM's x2apic implementation. With 8 bit cpuid you can only address cluster 0 in logical mode. One x2apic cluster has 16 cpus and we generate the x2apic LDR correctly, so 8 bit cpuid can address first 16 clusters as well. 
u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));

I only see we use 'struct kvm_lapic *logical_map[16][16];', which supports 16 clusters of 16 apics = the first 256 vcpus, so if we map everything to logical_map[0][0:15], we would not work correctly in cluster x2apic with > 16 vcpus.

Such a config cannot work today because of the 8 bit cpuid limitation. When the limitation is removed, KVM_X2APIC_CID_BITS will be set to the actual number of bits we want to support.

Even with KVM_X2APIC_CID_BITS = 4, which would allow us to support 8 bit cpuid, we would still deliver interrupts destined for cpuid >= 256 to potentially plugged cpus. It will likely be smaller than 16 bits though, since full 18 bit support would require huge tables.

Yeah, we'll have to do dynamic allocation if we are ever going to need the full potential of x2apic. (2^20-16 cpus in cluster mode and 2^32-1 in flat mode)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org  More majordomo info at http://vger.kernel.org/majordomo-info.html
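The LDR derivation quoted in the thread can be checked with a short userspace sketch. The formula is the one from the discussion; the helper names and the cluster-extraction function are illustrative stand-ins for KVM's internal code, not its actual API:

```c
#include <assert.h>
#include <stdint.h>

/* x2apic logical destination: high 16 bits = cluster id (apic_id >> 4),
 * low 16 bits = one-hot position of the cpu within its 16-cpu cluster. */
static uint32_t x2apic_ldr(uint32_t id)
{
    return ((id >> 4) << 16) | (1u << (id & 0xf));
}

/* Cluster id as a dispatcher would derive it from a logical destination
 * (hypothetical helper; mirrors the "high 16 bits as cluster id" point). */
static uint32_t x2apic_cluster(uint32_t ldr)
{
    return ldr >> 16;
}
```

With 8-bit apic ids, the highest reachable cluster is 15, which is why the thread notes that 8-bit cpuid can address the first 16 clusters.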
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-16 14:16+0200, Gleb Natapov: On Mon, Dec 16, 2013 at 01:01:10PM +0100, Radim Krčmář wrote:

- Where does the 'only one supported cluster' come from?

'only one supported cluster' comes from the 8 bit cpuid limitation of KVM's x2apic implementation. With 8 bit cpuid you can only address cluster 0 in logical mode.

One x2apic cluster has 16 cpus and we generate the x2apic LDR correctly, so 8 bit cpuid can address the first 16 clusters as well.
u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));

An interrupt from a device cannot generate such an ldr, only an IPI can. Only 4 cpus in cluster zero are addressable in clustering mode by a device. Without irq remapping, x2apic is a PV interface between host and guest, where the guest needs to know the KVM implementation's limitations to use it.

Thanks, I'll read more about devices ... still no idea how they could address cluster 15.

I do not see a point in fixing problems in x2apic logical mode emulation right now, since it will not make it usable, as long as there are no security problems there.

Agreed; I wanted to know why this patch was correct, if we cared.

I only see we use 'struct kvm_lapic *logical_map[16][16];', which supports 16 clusters of 16 apics = the first 256 vcpus, so if we map everything to logical_map[0][0:15], we would not work correctly in cluster x2apic with > 16 vcpus.

Such a config cannot work today because of the 8 bit cpuid limitation. When the limitation is removed, KVM_X2APIC_CID_BITS will be set to the actual number of bits we want to support. Even with KVM_X2APIC_CID_BITS = 4, which would allow us to support 8 bit cpuid, we would still deliver interrupts destined for cpuid >= 256 to potentially plugged cpus.

Again, KVM_X2APIC_CID_BITS = 4 will not allow us to support 8 bit cpuids, unfortunately; not sure what you mean by the second part of the sentence.
Sorry, I meant that with this change, we map all clusters to cluster 0, which has two flaws:
- in kvm_lapic_set_base(), the order of vcpu creation determines those assigned to cluster 0, and the rest is unaddressable (overwritten)
- we can send an IPI to an unplugged high cpuid and it arrives in cluster 0
Re: [PATCH] KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
2013-12-16 13:55+0100, Radim Krčmář: 2013-12-16 14:16+0200, Gleb Natapov: On Mon, Dec 16, 2013 at 01:01:10PM +0100, Radim Krčmář wrote:

- Where does the 'only one supported cluster' come from?

'only one supported cluster' comes from the 8 bit cpuid limitation of KVM's x2apic implementation. With 8 bit cpuid you can only address cluster 0 in logical mode.

One x2apic cluster has 16 cpus and we generate the x2apic LDR correctly, so 8 bit cpuid can address the first 16 clusters as well.
u32 ldr = ((id >> 4) << 16) | (1 << (id & 0xf));

An interrupt from a device cannot generate such an ldr, only an IPI can. Only 4 cpus in cluster zero are addressable in clustering mode by a device. Without irq remapping, x2apic is a PV interface between host and guest, where the guest needs to know the KVM implementation's limitations to use it.

Thanks, I'll read more about devices ... still no idea how they could address cluster 15.

I do not see a point in fixing problems in x2apic logical mode emulation right now, since it will not make it usable, as long as there are no security problems there.

Agreed; I wanted to know why this patch was correct, if we cared.

I only see we use 'struct kvm_lapic *logical_map[16][16];', which supports 16 clusters of 16 apics = the first 256 vcpus, so if we map everything to logical_map[0][0:15], we would not work correctly in cluster x2apic with > 16 vcpus.

Such a config cannot work today because of the 8 bit cpuid limitation. When the limitation is removed, KVM_X2APIC_CID_BITS will be set to the actual number of bits we want to support. Even with KVM_X2APIC_CID_BITS = 4, which would allow us to support 8 bit cpuid, we would still deliver interrupts destined for cpuid >= 256 to potentially plugged cpus.

Again, KVM_X2APIC_CID_BITS = 4 will not allow us to support 8 bit cpuids, unfortunately; not sure what you mean by the second part of the sentence.
Sorry, I meant that with this change, we map all clusters to cluster 0, which has two flaws:
- in kvm_lapic_set_base(), the order of vcpu creation determines those

Gah, should have been 'recalculate_apic_map()' ... The patch would be especially surprising with dynamic adding of vcpus.

assigned to cluster 0, and the rest is unaddressable (overwritten)
- we can send an IPI to an unplugged high cpuid and it arrives in cluster 0
[PATCH] KVM: SVM: fix NMI window after iret
We should open the NMI window right after an iret, but SVM exits before it. We wanted to single-step using the trap flag and then open it. (Or we could emulate the iret instead.) We haven't done so since commit 3842d135ff2 (likely), because the iret exit handler does not request an event, so the NMI window remains closed until the next exit. Fix this by making a KVM_REQ_EVENT request in the iret handler.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
(btw. kvm-unit-tests weren't executed on SVM since Nov 2010, at least)

 arch/x86/kvm/svm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c7168a5..b5a735b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2829,6 +2829,7 @@ static int iret_interception(struct vcpu_svm *svm)
 	clr_intercept(svm, INTERCEPT_IRET);
 	svm->vcpu.arch.hflags |= HF_IRET_MASK;
 	svm->nmi_iret_rip = kvm_rip_read(&svm->vcpu);
+	kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
 	return 1;
 }
--
1.8.5.2
Re: [PATCH] KVM: SVM: fix NMI window after iret
2014-01-17 12:18-0800, Greg KH: On Fri, Jan 17, 2014 at 08:52:42PM +0100, Radim Krčmář wrote: We should open the NMI window right after an iret, but SVM exits before it. We wanted to single-step using the trap flag and then open it. (Or we could emulate the iret instead.) We haven't done so since commit 3842d135ff2 (likely), because the iret exit handler does not request an event, so the NMI window remains closed until the next exit. Fix this by making a KVM_REQ_EVENT request in the iret handler. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- (btw. kvm-unit-tests weren't executed on SVM since Nov 2010, at least) arch/x86/kvm/svm.c | 1 + 1 file changed, 1 insertion(+)

<formletter>
This is not the correct way to submit patches for inclusion in the stable kernel tree. Please read Documentation/stable_kernel_rules.txt for how to do this properly.
</formletter>

Welp, at the last second I decided it is not that critical to have it in stable, and forgot to clean the git-send-email command line too. Please ignore this patch in stable.
[PATCH] kvm: print suberror on all internal errors
KVM introduced the internal error exit reason and suberror at the same time, and later extended it with internal error data. QEMU does not report the suberror on hosts between these two events, because we check for the extension. (Half a year in 2009, but it is misleading.) Fix by removing the KVM_CAP_INTERNAL_ERROR_DATA condition on the printf.

(partially improved by bb44e0d12df70 and ba4047cf848a3 in the past)

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 kvm-all.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 0bfb060..0a91d8e 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1539,17 +1539,16 @@ static void kvm_handle_io(uint16_t port, void *data, int direction, int size,
 static int kvm_handle_internal_error(CPUState *cpu, struct kvm_run *run)
 {
-    fprintf(stderr, "KVM internal error.");
+    fprintf(stderr, "KVM internal error. Suberror: %d\n",
+            run->internal.suberror);
+
     if (kvm_check_extension(kvm_state, KVM_CAP_INTERNAL_ERROR_DATA)) {
         int i;

-        fprintf(stderr, " Suberror: %d\n", run->internal.suberror);
         for (i = 0; i < run->internal.ndata; ++i) {
             fprintf(stderr, "extra data[%d]: %"PRIx64"\n", i,
                     (uint64_t)run->internal.data[i]);
         }
-    } else {
-        fprintf(stderr, "\n");
     }
     if (run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
         fprintf(stderr, "emulation failure\n");
--
1.8.5.2
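The behavior change above can be modeled in plain C: the suberror line is printed unconditionally, and only the extra data stays behind the capability check. The struct, the capability flag, and formatting into a buffer (so the output can be inspected) are stand-ins for QEMU's real code, shown here only to illustrate the control flow:

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the relevant part of struct kvm_run. */
struct fake_internal { int suberror; int ndata; uint64_t data[4]; };

/* Mirrors the patched kvm_handle_internal_error(): suberror always
 * reported, extra data only when the capability is present. */
static void report_internal_error(char *buf, size_t len, int has_data_cap,
                                  const struct fake_internal *in)
{
    size_t off = 0;
    off += snprintf(buf + off, len - off,
                    "KVM internal error. Suberror: %d\n", in->suberror);
    if (has_data_cap) {
        for (int i = 0; i < in->ndata; ++i)
            off += snprintf(buf + off, len - off,
                            "extra data[%d]: %" PRIx64 "\n", i, in->data[i]);
    }
}
```

On a host without KVM_CAP_INTERNAL_ERROR_DATA (the half-year window the commit message mentions), the suberror is now still reported.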
Re: host_fx_image, guest_fx_image ; what are they ?
2014-03-04 23:35+0530, ratheesh kannoth: Could you please help me understand the variables below? I found them in struct kvm_vcpu:

char *host_fx_image;
char *guest_fx_image;

Commit b114b0804df7131cb6764b948c1c530c834fa3c0 explains them. (How does the existence depend on lost KVM code, dear time traveller?)
Re: [PATCH 4/7] KVM: vmx: Allow the guest to run with dirty debug registers
2014-03-07 12:42+0100, Paolo Bonzini: When not running in guest-debug mode (i.e. the guest controls the debug registers), having to take an exit for each DR access is a waste of time. If the guest gets into a state where each context switch causes DR to be saved and restored, this can take away as much as 40% of the execution time from the guest.

If the guest is running with vcpu->arch.db == vcpu->arch.eff_db, we can let it write freely to the debug registers and reload them on the next exit. We still need to exit on the first access, so that the KVM_DEBUGREG_WONT_EXIT flag is set in switch_db_regs; after that, further accesses to the debug registers will not cause a vmexit.

Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
 arch/x86/kvm/vmx.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 06e4ec877a1c..e377522408b1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2843,7 +2843,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 		vmx_capability.ept, vmx_capability.vpid);
 	}

-	min = 0;
+	min = VM_EXIT_SAVE_DEBUG_CONTROLS;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
 #endif
@@ -5113,6 +5113,22 @@ static int handle_dr(struct kvm_vcpu *vcpu)
 		}
 	}

+	if (vcpu->guest_debug == 0) {
+		u32 cpu_based_vm_exec_control;
+
+		cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+		cpu_based_vm_exec_control &= ~CPU_BASED_MOV_DR_EXITING;
+		vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);

vmcs_clear_bits() covers exactly this use-case. (Barring the explicit bit-width.)

+
+		/*
+		 * No more DR vmexits; force a reload of the debug registers
+		 * and reenter on this instruction. The next vmexit will
+		 * retrieve the full state of the debug registers.
+		 */
+		vcpu->arch.switch_db_regs |= KVM_DEBUGREG_WONT_EXIT;
+		return 1;
+	}
+

We could make the code slightly uglier and move the functional part of this block before the previous one, so it would do both things in one exit. (Exception handler will likely access DR too.)

 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 	dr = exit_qualification & DEBUG_REG_ACCESS_NUM;
 	reg = DEBUG_REG_ACCESS_REG(exit_qualification);
@@ -5139,6 +5155,24 @@ static void vmx_set_dr6(struct kvm_vcpu *vcpu, unsigned long val)
 {
 }

+static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+	u32 cpu_based_vm_exec_control;
+
+	get_debugreg(vcpu->arch.db[0], 0);
+	get_debugreg(vcpu->arch.db[1], 1);
+	get_debugreg(vcpu->arch.db[2], 2);
+	get_debugreg(vcpu->arch.db[3], 3);
+	get_debugreg(vcpu->arch.dr6, 6);
+	vcpu->arch.dr7 = vmcs_readl(GUEST_DR7);
+
+	vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_WONT_EXIT;
+
+	cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
+	cpu_based_vm_exec_control |= CPU_BASED_MOV_DR_EXITING;
+	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);

(I'll be calling this sneaky later on.)

+}
+
 static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	vmcs_writel(GUEST_DR7, val);
@@ -8590,6 +8624,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.get_dr6 = vmx_get_dr6,
 	.set_dr6 = vmx_set_dr6,
 	.set_dr7 = vmx_set_dr7,
+	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
 	.cache_reg = vmx_cache_reg,
 	.get_rflags = vmx_get_rflags,
 	.set_rflags = vmx_set_rflags,
--
1.8.3.1
Re: [PATCH 3/7] KVM: x86: Allow the guest to run with dirty debug registers
2014-03-07 12:42+0100, Paolo Bonzini: When not running in guest-debug mode, the guest controls the debug registers and having to take an exit for each DR access is a waste of time. If the guest gets into a state where each context switch causes DR to be saved and restored, this can take away as much as 40% of the execution time from the guest. After this patch, VMX- and SVM-specific code can set a flag in switch_db_regs, telling vcpu_enter_guest that on the next exit the debug registers might be dirty and need to be reloaded (syncing will be taken care of by a new callback in kvm_x86_ops). This flag can be set on the first access to a debug registers, so that multiple accesses to the debug registers only cause one vmexit. Note that since the guest will be able to read debug registers and enable breakpoints in DR7, we need to ensure that they are synchronized on entry to the guest---including DR6 that was not synced before. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/x86.c | 16 2 files changed, 18 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5ef59d3b6c63..74eb361eaa8f 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -339,6 +339,7 @@ struct kvm_pmu { enum { KVM_DEBUGREG_BP_ENABLED = 1, + KVM_DEBUGREG_WONT_EXIT = 2, }; struct kvm_vcpu_arch { @@ -707,6 +708,7 @@ struct kvm_x86_ops { void (*set_gdt)(struct kvm_vcpu *vcpu, struct desc_ptr *dt); u64 (*get_dr6)(struct kvm_vcpu *vcpu); void (*set_dr6)(struct kvm_vcpu *vcpu, unsigned long value); + void (*sync_dirty_debug_regs)(struct kvm_vcpu *vcpu); void (*set_dr7)(struct kvm_vcpu *vcpu, unsigned long value); void (*cache_reg)(struct kvm_vcpu *vcpu, enum kvm_reg reg); unsigned long (*get_rflags)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 252b47e85c69..c48818aa04c0 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6033,12 
+6033,28 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		set_debugreg(vcpu->arch.eff_db[1], 1);
 		set_debugreg(vcpu->arch.eff_db[2], 2);
 		set_debugreg(vcpu->arch.eff_db[3], 3);
+		set_debugreg(vcpu->arch.dr6, 6);
 	}

 	trace_kvm_entry(vcpu->vcpu_id);
 	kvm_x86_ops->run(vcpu);

 	/*
+	 * Do this here before restoring debug registers on the host. And
+	 * since we do this before handling the vmexit, a DR access vmexit
+	 * can (a) read the correct value of the debug registers, (b) set
+	 * KVM_DEBUGREG_WONT_EXIT again.

We re-enable intercepts on the next exit for the sake of simplicity? (Batched accesses make perfect sense, but it seems we don't have to care about DRs at all, without guest-debug.)

+	 */
+	if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
+		int i;
+
+		WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);

Is this possible? (I presumed that we vmexit before handling ioctls.)

+		kvm_x86_ops->sync_dirty_debug_regs(vcpu);

Sneaky functionality ... it would be nicer to split 'enable DR intercepts' into a new kvm op. I think we want to disable them whenever we are not in guest-debug mode anyway, so it would be a pair. And we don't have to modify DR intercepts then, which is probably the main reason why sync_dirty_debug_regs() does two things.

+		for (i = 0; i < KVM_NR_DB_REGS; i++)
+			vcpu->arch.eff_db[i] = vcpu->arch.db[i];
+	}
+
+	/*
 	 * If the guest has used debug registers, at least dr7
 	 * will be disabled while returning to the host.
 	 * If we don't have active breakpoints in the host, we don't
--
1.8.3.1
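The flag dance the review is probing at can be summarized as a tiny state machine: the first MOV DR exit sets KVM_DEBUGREG_WONT_EXIT and disables intercepts; the next exit of any kind syncs the registers back and re-arms the intercepts. A toy model (flag values match the patch; the transition helpers are illustrative, not kernel API):

```c
#include <assert.h>

/* Flag values from the patch's enum in kvm_host.h. */
enum { DEBUGREG_BP_ENABLED = 1, DEBUGREG_WONT_EXIT = 2 };

/* First MOV DR exit with guest_debug == 0 (handle_dr path). */
static int on_first_dr_access(int flags)
{
    return flags | DEBUGREG_WONT_EXIT;  /* stop intercepting MOV DR */
}

/* Back in vcpu_enter_guest after kvm_x86_ops->run(). */
static int on_vmexit(int flags)
{
    if (flags & DEBUGREG_WONT_EXIT) {
        /* sync_dirty_debug_regs(): read DRs back, re-enable intercepts */
        flags &= ~DEBUGREG_WONT_EXIT;
    }
    return flags;
}
```

So batched DR accesses between two exits cost exactly one vmexit, which matches the "restored performance despite tens of thousands of MOV DR exits/second" observation later in the thread.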
Re: [PATCH 3/7] KVM: x86: Allow the guest to run with dirty debug registers
2014-03-09 21:07+0100, Paolo Bonzini: On 09/03/2014 19:28, Radim Krčmář wrote:

/*
+	 * Do this here before restoring debug registers on the host. And
+	 * since we do this before handling the vmexit, a DR access vmexit
+	 * can (a) read the correct value of the debug registers, (b) set
+	 * KVM_DEBUGREG_WONT_EXIT again.

We re-enable intercepts on the next exit for the sake of simplicity? (Batched accesses make perfect sense, but it seems we don't have to care about DRs at all, without guest-debug.)

It's not just for the sake of simplicity. Synchronizing debug registers on entry does have some cost, and it's required for running without debug register intercepts. You would incur this cost always, since no-guest-debug is actually the common case.

Good point, it's just this case that accesses them often; most guests won't use them at all ...

We're well into diminishing returns; Alex timed this patch vs. one that completely ignores MOV DR exits and (to the surprise of both of us) this patch completely restored performance, despite still having a few tens of thousands of MOV DR exits/second.

In the problematic case, each context switch will do a save and restore of debug registers; that's in total 13 debug register accesses (read 6 registers, write DR7 to disable all registers, write 6 registers including DR7 which enables breakpoints), all very close in time. It's quite likely that the final DR7 write is very expensive (e.g. it might have to flush the pipeline). Also, this test case must be very heavy on context switches, and each context switch already incurs a few exits (for example to inject the interrupt that causes it, or to read the current time). We would save just one exit, and there are probably enough non-DR exits even in this case to balance it out.

I've been carried too far away for design. (Thanks for the details.)

A different optimization could be to skip the LOAD_DEBUG_CONTROLS vm-entry control if DR7 == host DR7 == 0x400 && MOV DR exits are enabled.
Not sure it's worthwhile though, and there's also the DEBUGCTL MSR to take into account.

Better do these kinds of tweaks if they actually show up in the profile: Alex's testcase shows that when they do, the impact is massive.

Yeah, we'd have to implement TRUE-VMX-MSR handling just for two load/stores ...

+	kvm_x86_ops->sync_dirty_debug_regs(vcpu);

Sneaky functionality ... it would be nicer to split 'enable DR intercepts' into a new kvm op.

Why? Disable DR intercepts is already folded into the handler for MOV DR exits.

It's not obvious from the name that we also enable intercepts again, which is the more important part, imho. (I don't like the folded code much, but considering the alternatives, it's a good solution.)

And we don't have to modify DR intercepts then, which is probably the main reason why sync_dirty_debug_regs() does two things.

Another is that indirect calls are relatively expensive and add complexity; in this case they would always be used back-to-back.

True. (I could not come up with a good name for this function; intercept_debug_registers() does not reveal that we also save them, and a longer name would be hard to read without 'and', which looks weird.)
Re: [PATCH 4/7] KVM: vmx: Allow the guest to run with dirty debug registers
2014-03-09 21:12+0100, Paolo Bonzini: On 09/03/2014 19:26, Radim Krčmář wrote:

+
+		/*
+		 * No more DR vmexits; force a reload of the debug registers
+		 * and reenter on this instruction. The next vmexit will
+		 * retrieve the full state of the debug registers.
+		 */
+		vcpu->arch.switch_db_regs |= KVM_DEBUGREG_WONT_EXIT;
+		return 1;
+	}
+

We could make the code slightly uglier and move the functional part of this block before the previous one, so it would do both things in one exit.

I considered this, but decided that it's unlikely for emulation to be faster than hardware, especially on those AMD CPUs that lack decode assists (and it's good for VMX and SVM code to look as similar as possible).

I meant to move it before the block that exits if there is the 'exception on access' bit set in dr7, so we wouldn't exit again right away on the actual access, which is quite likely. (Exiting without emulation was a great move.)

(Exception handler will likely access DR too.)

Which exception handler?

For #DB. (Pure guesswork, I haven't seen any of them.)
Re: [PATCH 0/7] KVM: x86: Let the guest write to multiple debug registers with one vmexit
2014-03-07 12:42+0100, Paolo Bonzini: Alex Williamson reported that a Windows game does something weird that makes the guest save and restore debug registers on each context switch. This causes several hundred thousand vmexits per second, and basically cuts performance in half when running under KVM.

However, when not running in guest-debug mode, the guest controls the debug registers, and having to take an exit for each DR access is a waste of time. We just need one vmexit to load any stale values of DR0-DR6, and then we can let the guest run freely. On the next vmexit (whatever the reason) we will read out whatever changes the guest made to the debug registers.

Tested with x86/debug.flat on both Intel and AMD, both direct and nested virtualization.

Changes from RFC: changed get_dr7 callback to sync_dirty_debug_regs, new patches 5-7.

Paolo Bonzini (7):
  KVM: vmx: we do rely on loading DR7 on entry
  KVM: x86: change vcpu->arch.switch_db_regs to a bit mask
  KVM: x86: Allow the guest to run with dirty debug registers
  KVM: vmx: Allow the guest to run with dirty debug registers
  KVM: nVMX: Allow nested guests to run with dirty debug registers
  KVM: svm: set/clear all DR intercepts in one swoop
  KVM: svm: Allow the guest to run with dirty debug registers

All patches, Reviewed-by: Radim Krčmář rkrc...@redhat.com

This series is good even without vmcs_{set,clr}_bits(). (There is enough of them already to warrant a cleanup patch.)
 arch/x86/include/asm/kvm_host.h |  8 -
 arch/x86/kvm/svm.c              | 68 -
 arch/x86/kvm/vmx.c              | 43 -
 arch/x86/kvm/x86.c              | 20 +++-
 4 files changed, 114 insertions(+), 25 deletions(-)
--
1.8.3.1
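The review above mentions vmcs_{set,clr}_bits() as a cleaner way to express the read-modify-write on VMCS execution controls that both patches open-code. A minimal userspace sketch of such helpers; the `fake_vmcs` array stands in for the real vmcs_read32/vmcs_write32 hardware accessors, and only the helper idea itself comes from the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in backing store so the helpers can be exercised in userspace. */
static uint32_t fake_vmcs[8];

static uint32_t vmcs_read32(unsigned field)          { return fake_vmcs[field]; }
static void vmcs_write32(unsigned field, uint32_t v) { fake_vmcs[field] = v; }

/* The read-modify-write pair the review wants instead of open-coding. */
static void vmcs_set_bits(unsigned field, uint32_t mask)
{
    vmcs_write32(field, vmcs_read32(field) | mask);
}

static void vmcs_clear_bits(unsigned field, uint32_t mask)
{
    vmcs_write32(field, vmcs_read32(field) & ~mask);
}

/* Bit 23 of the primary processor-based controls (value from the SDM). */
#define CPU_BASED_MOV_DR_EXITING 0x00800000u
```

With these, the three-line blocks in handle_dr() and vmx_sync_dirty_debug_regs() each collapse to a single call.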
[PATCH] KVM: SVM: fix cr8 intercept window
We always disable the cr8 intercept in its handler, but only re-enable it if handling KVM_REQ_EVENT, so there can be a window where we do not intercept cr8 writes, which allows an interrupt to disrupt a higher priority task.

Fix this by disabling intercepts in the same function that re-enables them when needed. This fixes a BSOD in Windows 2008.

Cc: sta...@vger.kernel.org
Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/svm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 64d9bb9..f676c18 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3003,10 +3003,8 @@ static int cr8_write_interception(struct vcpu_svm *svm)
 	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
 	/* instruction emulation calls kvm_set_cr8() */
 	r = cr_interception(svm);
-	if (irqchip_in_kernel(svm->vcpu.kvm)) {
-		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+	if (irqchip_in_kernel(svm->vcpu.kvm))
 		return r;
-	}
 	if (cr8_prev <= kvm_get_cr8(&svm->vcpu))
 		return r;
 	kvm_run->exit_reason = KVM_EXIT_SET_TPR;
@@ -3568,6 +3566,8 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 	if (is_guest_mode(vcpu) && (vcpu->arch.hflags & HF_VINTR_MASK))
 		return;

+	clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+
 	if (irr == -1)
 		return;
--
1.8.5.3
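The intercept policy that update_cr8_intercept() ends up owning after this patch can be modeled in a few lines. This is a toy model of the decision only, not the kernel's API; the boolean global stands in for set_cr_intercept/clr_cr_intercept on the VMCB:

```c
#include <assert.h>

static int cr8_intercepted;

/* After the fix: this function both clears and (when needed) sets the
 * intercept, so there is no window where the state is left stale.
 * tpr/irr are the priority values KVM compares; -1 means no pending irq. */
static void update_cr8_intercept(int tpr, int irr)
{
    cr8_intercepted = 0;        /* clr_cr_intercept(INTERCEPT_CR8_WRITE) */
    if (irr == -1)
        return;                 /* nothing pending: no need to watch TPR */
    if (tpr >= irr)
        cr8_intercepted = 1;    /* pending irq masked by TPR: trap CR8
                                   writes so we notice when TPR drops */
}
```

Before the fix, the clearing lived in cr8_write_interception() and the setting only happened on KVM_REQ_EVENT, which is exactly the window the commit message describes.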
Re: [PATCH] KVM: SVM: fix cr8 intercept window
2014-03-11 22:05-0300, Marcelo Tosatti: On Tue, Mar 11, 2014 at 07:11:18PM +0100, Radim Krčmář wrote:

We always disable the cr8 intercept in its handler, but only re-enable it if handling KVM_REQ_EVENT, so there can be a window where we do not intercept cr8 writes, which allows an interrupt to disrupt a higher priority task. Fix this by disabling intercepts in the same function that re-enables them when needed. This fixes a BSOD in Windows 2008.

Cc: sta...@vger.kernel.org
Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/svm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

@@ -3003,10 +3003,8 @@ static int cr8_write_interception(struct vcpu_svm *svm)
 	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
 	/* instruction emulation calls kvm_set_cr8() */
 	r = cr_interception(svm);
-	if (irqchip_in_kernel(svm->vcpu.kvm)) {
-		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+	if (irqchip_in_kernel(svm->vcpu.kvm))
 		return r;
-	}
 	if (cr8_prev <= kvm_get_cr8(&svm->vcpu))
 		return r;
 	kvm_run->exit_reason = KVM_EXIT_SET_TPR;
@@ -3568,6 +3566,8 @@ static void update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 	if (is_guest_mode(vcpu) && (vcpu->arch.hflags & HF_VINTR_MASK))
 		return;

+	clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+
 	if (irr == -1)
 		return;

Shouldn't IRR be injected if TPR < IRR? (via KVM_REQ_EVENT)
1) IRR has interrupt 10.
2) TPR now 9 due to CR8 write.
3) 10 should be injected.

Definitely should, and we will set KVM_REQ_EVENT through kvm_set_cr8() if we lower the TPR. (I checked that the bug isn't in apic_update_ppr().)

Also, can not clearing the intercept cause continuous CR8 writes to exit until KVM_REQ_EVENT?

It is intended, I suppose; this is because we run with V_INTR_MASKING, so writes to CR8 only affect the V_TPR register; the guest then raises it once more and the APIC incorrectly gives us a low priority interrupt.
Re: [PATCH] KVM: SVM: fix cr8 intercept window
2014-03-13 15:52+0200, Gleb Natapov: On Wed, Mar 12, 2014 at 06:20:01PM +0100, Paolo Bonzini wrote: On 12/03/2014 11:40, Radim Krčmář wrote: 2014-03-11 22:05-0300, Marcelo Tosatti: On Tue, Mar 11, 2014 at 07:11:18PM +0100, Radim Krčmář wrote:

We always disable the cr8 intercept in its handler, but only re-enable it if handling KVM_REQ_EVENT, so there can be a window where we do not intercept cr8 writes, which allows an interrupt to disrupt a higher priority task. Fix this by disabling intercepts in the same function that re-enables them when needed. This fixes a BSOD in Windows 2008.

Cc: sta...@vger.kernel.org
Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/svm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

@@ -3003,10 +3003,8 @@ static int cr8_write_interception(struct vcpu_svm *svm)
 	u8 cr8_prev = kvm_get_cr8(&svm->vcpu);
 	/* instruction emulation calls kvm_set_cr8() */
 	r = cr_interception(svm);
-	if (irqchip_in_kernel(svm->vcpu.kvm)) {
-		clr_cr_intercept(svm, INTERCEPT_CR8_WRITE);
+	if (irqchip_in_kernel(svm->vcpu.kvm))
 		return r;
-	}

I think that the old code here makes little sense, and for two reasons:

I agree that the old code is wrong and the patch looks correct, but I only see how the bug may cause pending IRR to not be delivered in time, not how an interrupt can disrupt a higher priority task.

True, the commit message is bad. We BSOD only because IRR is not injected right after lowering the TPR, and code depends on values that had to be computed in it. Paolo, can you change the last sentence to ", which means we don't inject pending IRR immediately."? (Or do we just forget it?) Thanks.

---

My model of what is happening had misconceptions about Windows, KVM and SVM ... it explained the BSOD in IRQL 0xc more directly though, so it lasted till Paolo's review -- the list of things to (re)read is long.
Re: [PATCH v2 3/4] KVM: ioapic: extract body of kvm_ioapic_set_irq
2014-03-21 10:28+0100, Paolo Bonzini: We will reuse it to process a nonzero IRR that is passed to KVM_SET_IRQCHIP.

Reviewed-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com
---
	v1->v2: more comments; change argument name from level to irq_level

 virt/kvm/ioapic.c | 74 +--
 1 file changed, 50 insertions(+), 24 deletions(-)

diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 25e16a6898ed..270f7fe73f39 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -163,6 +163,55 @@ static bool rtc_irq_check_coalesced(struct kvm_ioapic *ioapic)
 	return false;
 }

+static int ioapic_set_irq(struct kvm_ioapic *ioapic, unsigned int irq,
+			  int irq_level, bool line_status)
+{
+	union kvm_ioapic_redirect_entry entry;
+	u32 mask = 1 << irq;
+	u32 old_irr;
+	int edge, ret;
+
+	entry = ioapic->redirtbl[irq];
+	edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG);
+
+	if (!irq_level) {
+		ioapic->irr &= ~mask;
+		ret = 1;
+		goto out;
+	}
+
+	/*
+	 * Return 0 for coalesced interrupts; for edge-triggered interrupts,
+	 * this only happens if a previous edge has not been delivered due
+	 * do masking. For level interrupts, the remote_irr field tells

(^ to)

+	 * us if the interrupt is waiting for an EOI.

How do we know we haven't delivered an edge-triggered interrupt when ioapic_service() always clears this IRR bit?

+	 *
+	 * RTC is special: it is edge-triggered, but userspace likes to know
+	 * if it has been already ack-ed via EOI because coalesced RTC
+	 * interrupts lead to time drift in Windows guests. So we track
+	 * EOI manually for the RTC interrupt.
+	 */

The comment makes me think we should not be coalescing more, but we do.

+	if (irq == RTC_GSI && line_status &&
+	    rtc_irq_check_coalesced(ioapic)) {
+		ret = 0;
+		goto out;
+	}
+
+	old_irr = ioapic->irr;
+	ioapic->irr |= mask;
+	if ((edge && old_irr == ioapic->irr) ||
+	    (!edge && entry.fields.remote_irr)) {

(Fun fact: GCC 4.8.2 doesn't optimize it as well as edge ?
old_irr == ioapic-irr : entry.fields.remote_irr) + ret = 0; + goto out; + } + + ret = ioapic_service(ioapic, irq, line_status); + +out: + trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0); + return ret; +} + static void update_handled_vectors(struct kvm_ioapic *ioapic) { DECLARE_BITMAP(handled_vectors, 256); @@ -308,38 +357,15 @@ static int ioapic_service(struct kvm_ioapic *ioapic, int irq, bool line_status) int kvm_ioapic_set_irq(struct kvm_ioapic *ioapic, int irq, int irq_source_id, int level, bool line_status) { - u32 old_irr; - u32 mask = 1 irq; - union kvm_ioapic_redirect_entry entry; int ret, irq_level; BUG_ON(irq 0 || irq = IOAPIC_NUM_PINS); spin_lock(ioapic-lock); - old_irr = ioapic-irr; irq_level = __kvm_irq_line_state(ioapic-irq_states[irq], irq_source_id, level); - entry = ioapic-redirtbl[irq]; - if (!irq_level) { - ioapic-irr = ~mask; - ret = 1; - } else { - int edge = (entry.fields.trig_mode == IOAPIC_EDGE_TRIG); + ret = ioapic_set_irq(ioapic, irq, irq_level, line_status); - if (irq == RTC_GSI line_status - rtc_irq_check_coalesced(ioapic)) { - ret = 0; /* coalesced */ - goto out; - } - ioapic-irr |= mask; - if ((edge old_irr != ioapic-irr) || - (!edge !entry.fields.remote_irr)) - ret = ioapic_service(ioapic, irq, line_status); - else - ret = 0; /* report coalesced interrupt */ - } -out: - trace_kvm_ioapic_set_irq(entry.bits, irq, ret == 0); spin_unlock(ioapic-lock); return ret; -- 1.8.3.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 3/4] KVM: ioapic: extract body of kvm_ioapic_set_irq
2014-03-23 09:44+0100, Paolo Bonzini:
> Il 21/03/2014 19:58, Radim Krčmář ha scritto:
>>> +	/*
>>> +	 * Return 0 for coalesced interrupts; for edge-triggered interrupts,
>>> +	 * this only happens if a previous edge has not been delivered due
>>> +	 * do masking.  For level interrupts, the remote_irr field tells
>>
>> (^ to)
>>
>>> +	 * us if the interrupt is waiting for an EOI.
>>
>> How do we know we haven't delivered an edge-triggered interrupt when
>> ioapic_service() always clears this IRR bit?
>
> We know we have at least delivered it to the local APIC(s).  If it is
> masked in the ioredirtbl, ioapic_service doesn't clear the IRR.

I see, LAPIC can't refuse an interrupt, so kvm_irq_delivery_to_apic()
won't (mustn't) fail.
(The code looks fragile if it returned -1.)

Thanks.
Re: [PATCH v2 4/4] KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
2014-03-21 10:28+0100, Paolo Bonzini:
> After the previous patches, an interrupt whose bit is set in the IRR
> register will never be in the LAPIC's IRR and has never been injected
> on the migration source.  So inject it on the destination.
>
> This fixes migration of Windows guests without HPET (they use the RTC
> to trigger the scheduler tick, and lose it after migration).
>
> Signed-off-by: Paolo Bonzini pbonz...@redhat.com
> ---
> 	v1->v2: use IOAPIC_NUM_PINS as a limit to for_each_set_bit
> 		remove debug printk
>
>  virt/kvm/ioapic.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index 270f7fe73f39..d4b601547f1f 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -212,6 +212,18 @@ out:
>  	return ret;
>  }
>  
> +static void kvm_ioapic_inject_all(struct kvm_ioapic *ioapic, unsigned long irr)
> +{
> +	u32 idx;
> +
> +	rtc_irq_eoi_tracking_reset(ioapic);
> +	for_each_set_bit(idx, &irr, IOAPIC_NUM_PINS)
> +		ioapic_set_irq(ioapic, idx, 1, true);
> +
> +	kvm_rtc_eoi_tracking_restore_all(ioapic);

(We shouldn't have RTC interrupt with pending EOI in irr, so the
 function could be independent.  I'd prefer 'ioapic->irr = 0' here ...)

> +}
> +
>  static void update_handled_vectors(struct kvm_ioapic *ioapic)
>  {
>  	DECLARE_BITMAP(handled_vectors, 256);
> @@ -612,9 +624,10 @@ int kvm_set_ioapic(struct kvm *kvm, struct kvm_ioapic_state *state)
>  
>  	spin_lock(&ioapic->lock);
>  	memcpy(ioapic, state, sizeof(struct kvm_ioapic_state));
> +	ioapic->irr = 0;
>  	update_handled_vectors(ioapic);
>  	kvm_vcpu_request_scan_ioapic(kvm);
> -	kvm_rtc_eoi_tracking_restore_all(ioapic);
> +	kvm_ioapic_inject_all(ioapic, state->irr);
>  	spin_unlock(&ioapic->lock);
>  	return 0;
>  }
> -- 
> 1.8.3.1
Re: [PATCH v2 0/4] KVM: cleanup ioapic and fix KVM_SET_IRQCHIP with irr != 0
2014-03-21 10:27+0100, Paolo Bonzini:
> Unlike the old qemu-kvm, which really never did that, with new QEMU it
> is for some reason somewhat likely to migrate a VM with a nonzero IRR
> in the ioapic.  In the case of ISA edge-triggered interrupts, this
> represents an interrupt that has not left the IOAPIC, which would be
> okay but it is not handled right by KVM_SET_IRQCHIP.  Because the
> interrupt is never injected, the guest never acknowledges it, the
> host never deasserts the pin and new interrupts are dropped.
>
> There are two problems to solve.  The obvious one is that interrupts
> are not reinjected upon KVM_SET_IRQCHIP, which is taken care of by
> patches 3-4.  The second is that right now the IRR value depends on
> the falling edge of the interrupt (as passed by the userspace via
> kvm_ioapic_set_irq).  This is unnecessary, and may lead to spurious
> reinjection in the destination of migration; instead, we can clear
> the (internal-only) IRR bit as soon as the interrupt leaves the
> IOAPIC.  This is done by patch 2, which patch 1 prepares for.
>
> This fixes migration of Windows guests without HPET.  Please review.
>
> Paolo
>
> 	v1->v2: more comments in patch 3
> 		change argument name in patch 3 from level to irq_level
> 		use IOAPIC_NUM_PINS in patch 4 as a limit to for_each_set_bit
> 		remove debug printk in patch 4

Nice solution to a tricky problem,

Reviewed-by: Radim Krčmář rkrc...@redhat.com

> Paolo Bonzini (4):
>   KVM: ioapic: merge ioapic_deliver into ioapic_service
>   KVM: ioapic: clear IRR for edge-triggered interrupts at delivery
>   KVM: ioapic: extract body of kvm_ioapic_set_irq
>   KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
>
>  virt/kvm/ioapic.c | 107 +++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 69 insertions(+), 38 deletions(-)
>
> -- 
> 1.8.3.1
Re: [PATCH v2 4/4] KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
2014-03-24 19:14+0100, Paolo Bonzini:
> Il 24/03/2014 18:58, Radim Krčmář ha scritto:
>> I'd prefer 'ioapic->irr = 0' here ...)
>
> The point is that ioapic->irr = 0 is overriding the previous memcpy,
> because state->irr is used as argument to kvm_ioapic_inject_all
> instead.  So I think ioapic->irr = 0 should stay close to the memcpy.

Yeah, I was just spouting ... my reasoning was that we clear irr only
because it's going to be recomputed, so that code is more related.
(The function name would need to change though.)
Re: [PATCH v10 03/19] qspinlock: Add pending bit
2014-05-07 11:01-0400, Waiman Long:
> From: Peter Zijlstra pet...@infradead.org
>
> Because the qspinlock needs to touch a second cacheline; add a pending
> bit and allow a single in-word spinner before we punt to the second
> cacheline.

I think there is an unwanted scenario on virtual machines:
1) VCPU sets the pending bit and start spinning.
2) Pending VCPU gets descheduled.
    - we have PLE and lock holder isn't running [1]
    - the hypervisor randomly preempts us
3) Lock holder unlocks while pending VCPU is waiting in queue.
4) Subsequent lockers will see free lock with set pending bit and will
   loop in trylock's 'for (;;)'
    - the worst-case is lock starving [2]
    - PLE can save us from wasting whole timeslice

Retry threshold is the easiest solution, regardless of its ugliness [4].

Another minor design flaw is that formerly first VCPU gets appended to
the tail when it decides to queue [3]; is the performance gain worth it?

Thanks.

---
1: Pause Loop Exiting is almost certain to vmexit in that case: we
   default to 4096 TSC cycles on KVM, and pending loop is longer than 4
   (4096/PSPIN_THRESHOLD).
   We would also vmexit if critical section was longer than 4k.

2: In this example, vcpus 1 and 2 use the lock while 3 never gets there.

   VCPU:  1          2          3
          lock()                          // we are the holder
                     pend()               // we have pending bit
                     vmexit               // while in PSPIN_THRESHOLD loop
          unlock()
                     vmentry
                                SPINNING  // for {;;} loop
                                vmexit
                                vmentry
                     lock()
          pend()
          vmexit
                     unlock()
          vmentry
                                SPINNING
                                vmexit
                                vmentry
          ------------ loop ------------

   The window is (should be) too small to happen in bare-metal.

3: Pending VCPU was first in line, but when it decides to queue, it
   must go to the tail.

4: The idea is to prevent unfairness by queueing after a while of
   useless looping.  Magic value should be set a bit above the time it
   takes an active pending bit holder to go through the loop.  4 looks
   enough.  We can use either pv_qspinlock_enabled() or
   cpu_has_hypervisor.
   I presume that we never want this to happen in a VM and that we
   won't have pv_qspinlock_enabled() without cpu_has_hypervisor.

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 37b5c7f..cd45c27 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -573,7 +573,7 @@ static __always_inline int get_qlock(struct qspinlock *lock)
 static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 {
 	u32 old, new, val = *pval;
-	int retry = 1;
+	int retry = 0;
 
 	/*
 	 * trylock || pending
@@ -595,9 +595,9 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 			 * a while to see if that either bit will be cleared.
 			 * If that is no change, we return and be queued.
 			 */
-			if (!retry)
+			if (retry)
 				return 0;
-			retry--;
+			retry++;
 			cpu_relax();
 			cpu_relax();
 			*pval = val = atomic_read(&lock->val);
@@ -608,7 +608,11 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 			 * Assuming that the pending bit holder is going to
 			 * set the lock bit and clear the pending bit soon,
 			 * it is better to wait than to exit at this point.
+			 * Our assumption does not hold on hypervisors, where
+			 * the pending bit holder doesn't have to be running.
 			 */
+			if (cpu_has_hypervisor && ++retry > MAGIC)
+				return 0;
 			cpu_relax();
 			*pval = val = atomic_read(&lock->val);
 			continue;
Re: [PATCH v10 10/19] qspinlock, x86: Allow unfair spinlock in a virtual guest
(tl;dr: paravirtualization could be better than unfair qspinlock)

2014-05-07 11:01-0400, Waiman Long:
> Locking is always an issue in a virtualized environment because of 2
> different types of problems:
>  1) Lock holder preemption
>  2) Lock waiter preemption

Paravirtualized ticketlocks have a shortcoming; we don't know which
VCPU the ticket belongs to, so the hypervisor can only blindly yield to
runnable VCPUs after waiters halt in slowpath.
There aren't enough free bits in ticket struct to improve, thus we have
resorted to unfairness.

Qspinlock is different.  Most queued VCPUs already know the VCPU before
it, so we have what it takes to mitigate lock waiter preemption: we can
include the preempted CPU id in the hypercall, the hypervisor will
schedule it, and we'll be woken up from unlock slowpath [1].
This still isn't perfect: we can wake up a VCPU that got preempted
before it could hypercall, and these hypercalls will propagate one by
one through our queue to the preempted lock holder.
(We'd have to share the whole waiter-list to avoid this.  We could also
 try to send the holder's id instead and unconditionally kick
 next-in-line on unlock, I think it would be slower.)

The lock holder problem is tougher, because we don't always share who
it is.  The tail bits can be used for it, as we don't really use them
before a queue has formed.  This would cost us one bit to differentiate
between holder/tail CPU id [2] and complicate operations a little, but
only for the paravirt case, where benefits are expected to be far
greater.  A hypercall from lock slowpath could schedule the preempted
VCPU right away.

I think this could obsolete unfair locks and will prepare RFC patches
soon-ish [3].  (If the idea isn't proved infeasible before.)

---
1: It is possible that we could avoid O(N) traversal and hypercall in
   unlock slowpath by scheduling VCPUs in the right order often.
2: Or even less.  idx=3 is a bug: if we are spinning in NMI, we are
   almost deadlocked, so we should WARN/BUG if it were to happen; which
   leaves the combination free to mean that the CPU id is a sole
   holder, not a tail.  (I prefer clean code though.)
3: I already tried and got quickly fed up by refactoring, so it might
   get postponed till the series gets merged.
Re: [PATCH v10 03/19] qspinlock: Add pending bit
2014-05-13 15:47-0400, Waiman Long:
> On 05/12/2014 11:22 AM, Radim Krčmář wrote:
>> I think there is an unwanted scenario on virtual machines:
>> 1) VCPU sets the pending bit and start spinning.
>> 2) Pending VCPU gets descheduled.
>>     - we have PLE and lock holder isn't running [1]
>>     - the hypervisor randomly preempts us
>> 3) Lock holder unlocks while pending VCPU is waiting in queue.
>> 4) Subsequent lockers will see free lock with set pending bit and
>>    will loop in trylock's 'for (;;)'
>>     - the worst-case is lock starving [2]
>>     - PLE can save us from wasting whole timeslice
>>
>> Retry threshold is the easiest solution, regardless of its ugliness [4].
>
> Yes, that can be a real issue.  Some sort of retry threshold, as you
> said, should be able to handle it.
>
> BTW, the relevant patch should be 16/19 where the PV spinlock stuff
> should be discussed.  This patch is perfectly fine.

Ouch, my apology to Peter didn't make it ...
Agreed, I should have split the comment under patches [06/19] (part
quoted above; does not depend on PV), [16/19] (part quoted below) and
[17/19] (general doubts).

>> Another minor design flaw is that formerly first VCPU gets appended
>> to the tail when it decides to queue; is the performance gain worth
>> it?
>
> Yes, the performance gain is worth it.  The primary goal is to be not
> worse than ticket spinlock in light load situation which is the most
> common case.  This feature is need to achieve that.

Ok.
I've seen merit in pvqspinlock even with slightly slower first-waiter,
so I would have happily sacrificed those horrible branches.
(I prefer elegant to optimized code, but I can see why we want to be
 strictly better than ticketlock.)
Peter mentioned that we are focusing on bare-metal patches, so I'll
withhold my other paravirt rants until they are polished.

And to forcefully bring this thread a little bit on-topic:

Pending-bit is effectively a lock in a lock, so I was wondering why
don't we use more pending bits; advantages are the same, just
diminished by the probability of having an ideally contended lock:
 - waiter won't be blocked on RAM access if critical section (or more)
   ends sooner
 - some unlucky cacheline is not forgotten
 - faster unlock (no need for tail operations)
 (- ?)
disadvantages are magnified:
 - increased complexity
 - intense cacheline sharing
   (I thought that this is the main disadvantage of ticketlock.)
 (- ?)

One bit still improved performance, is it the best we got?

Thanks.
Re: [PATCH v10 03/19] qspinlock: Add pending bit
2014-05-14 19:00+0200, Peter Zijlstra:
> On Wed, May 14, 2014 at 06:51:24PM +0200, Radim Krčmář wrote:
>> Ok.
>> I've seen merit in pvqspinlock even with slightly slower
>> first-waiter, so I would have happily sacrificed those horrible
>> branches.
>> (I prefer elegant to optimized code, but I can see why we want to be
>>  strictly better than ticketlock.)
>> Peter mentioned that we are focusing on bare-metal patches, so I'll
>> withhold my other paravirt rants until they are polished.

(It was an ambiguous sentence, I have comments for later patches.)

> Well, paravirt must happen too, but comes later in this series, patch
> 3 which we're replying to is still very much in the bare metal part
> of the series.

(I think that bare metal spans the first 7 patches.)

> I've not had time yet to decode all that Waiman has done to make
> paravirt work.
>
> But as a general rule I like patches that start with something simple
> and working and then optimize it, this series doesn't seem to quite
> grasp that.
>
>> And to forcefully bring this thread a little bit on-topic:
>> Pending-bit is effectively a lock in a lock, so I was wondering why
>> don't we use more pending bits; advantages are the same, just
>> diminished by the probability of having an ideally contended lock:
>>  - waiter won't be blocked on RAM access if critical section (or
>>    more) ends sooner
>>  - some unlucky cacheline is not forgotten
>>  - faster unlock (no need for tail operations)
>>  (- ?)
>> disadvantages are magnified:
>>  - increased complexity
>>  - intense cacheline sharing
>>    (I thought that this is the main disadvantage of ticketlock.)
>>  (- ?)
>>
>> One bit still improved performance, is it the best we got?
>
> So, the advantage of one bit is that if we use a whole byte for 1 bit
> we can avoid some atomic ops.
>
> The entire reason for this in-word spinner is to amortize the cost of
> hitting the external node cacheline.

Every pending CPU removes one length of the critical section from the
delay caused by cacheline propagation and really cold cache is
hundreds(?) of cycles, so we could burn some to ensure correctness and
still be waiting when the first pending CPU unlocks.

> So traditional locks like test-and-test and the ticket lock only ever
> access the spinlock word itself, this MCS style queueing lock has a
> second (and, see my other rants in this thread, when done wrong more
> than 2) cacheline to touch.
>
> That said, all our benchmarking is pretty much for the cache-hot
> case, so I'm not entirely convinced yet that the one pending bit
> makes up for it, it does in the cache-hot case.

Yeah, we probably use the faster pre-lock quite a lot.
Cover letter states that queue depth 1-3 is a bit slower than ticket
spinlock, so we might not be losing if we implemented a faster
in-word-lock of this capacity.  (Not that I'm a fan of the hybrid
lock.)

> But... writing cache-cold benchmarks is _hard_ :/

Wouldn't clflush of the second cacheline before trying for the lock
give us a rough estimate?
Re: [PATCH 1/1] kvm: increase max vcpu count
2014-05-15 17:50+0800, Li, Zhen-Hua:
> This patch is trying to increase the maximum supported vcpu number.
> There has been big system supporting more than 256 logical CPUs, and
> vmware can also support guest system with more than 256 logical CPUs.
> So kvm should also increase the maximum supported cpu number.

How did it work?
KVM lapic does not handle more than 256 VCPUs [1] at the moment, so
additional VCPUs had to wraparound ...

---
1: struct kvm_apic_map has 'struct kvm_lapic *phys_map[256]' and we
   are using 0xff when dealing with it, too.
Re: [RFC 08/07] qspinlock: integrate pending bit into queue
2014-05-21 18:49+0200, Radim Krčmář:
> 2014-05-19 16:17-0400, Waiman Long:
>> As for now, I will focus on just having one pending bit.
>
> I'll throw some ideas at it,

One of the ideas follows; it seems sound, but I haven't benchmarked it
thoroughly.  (Wasted a lot of time by writing/playing with various
tools and loads.)

Dbench on ext4 ramdisk, hackbench and ebizzy have shown a small
improvement in performance, but my main drive was the weird design of
Pending Bit.  Does your setup yield improvements too?
(A minor code swap noted in the patch might help things.)

It is meant to be applied on top of first 7 patches, because the virt
stuff would just get in the way.
I have preserved a lot of dead code and made some questionable
decisions just to keep the diff short and in one patch, sorry about
that.  (It is work in progress, double slashed lines mark points of
interest.)

---8<---
Pending Bit wasn't used if we already had a node queue with one cpu,
which meant that we suffered from these drawbacks again:
 - unlock path was more complicated
   (last queued CPU had to clear the tail)
 - cold node cacheline was just one critical section away

With this patch, Pending Bit is used as an additional step in the
queue.  Waiting for lock is the same: we try Pending Bit and if it is
taken, we append to Node Queue.  Unlock is different: pending CPU moves
into critical section and first CPU from Node Queue takes Pending Bit
and notifies next in line or clears the tail.

This allows the pending CPU to take the lock as fast as possible,
because all bookkeeping was done when entering Pending Queue.
Node Queue operations can also be slower without affecting the
performance, because we have an additional buffer of one critical
section.
---
 kernel/locking/qspinlock.c | 180 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 135 insertions(+), 45 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0ee1a23..76cafb0 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -98,7 +98,10 @@ struct __qspinlock {
 	union {
 		atomic_t val;
 #ifdef __LITTLE_ENDIAN
-		u8	 locked;
+		struct {
+			u8	locked;
+			u8	pending;
+		};
 		struct {
 			u16	locked_pending;
 			u16	tail;
@@ -109,7 +112,8 @@ struct __qspinlock {
 			u16	locked_pending;
 		};
 		struct {
-			u8	reserved[3];
+			u8	reserved[2];
+			u8	pending;
 			u8	locked;
 		};
 #endif
@@ -314,6 +318,59 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
 	return 1;
 }
 
+// nice comment here
+static inline bool trylock(struct qspinlock *lock, u32 *val) {
+	if (!(*val = atomic_read(&lock->val)) &&
+	    (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0)) {
+		*val = _Q_LOCKED_VAL;
+		return 1;
+	}
+	return 0;
+}
+
+// here
+static inline bool trypending(struct qspinlock *lock, u32 *pval) {
+	u32 old, val = *pval;
+	// optimizer might produce the same code if we use *pval directly
+
+	// we could use 'if' and a xchg that touches only the pending bit to
+	// save some cycles at the price of a longer line cutting window
+	// (and I think it would bug without changing the rest)
+	while (!(val & (_Q_PENDING_MASK | _Q_TAIL_MASK))) {
+		old = atomic_cmpxchg(&lock->val, val, val | _Q_PENDING_MASK);
+		if (old == val) {
+			*pval = val | _Q_PENDING_MASK;
+			return 1;
+		}
+		val = old;
+	}
+	*pval = val;
+	return 0;
+}
+
+// here
+static inline void set_pending(struct qspinlock *lock, u8 pending)
+{
+	struct __qspinlock *l = (void *)lock;
+
+	// take a look if this is necessary, and if we don't have an
+	// abstraction already
+	barrier();
+	ACCESS_ONCE(l->pending) = pending;
+	barrier();
+}
+
+// and here
+static inline u32 cmpxchg_tail(struct qspinlock *lock, u32 tail, u32 newtail)
+// API-incompatible with set_pending and the shifting is ugly, so I'd rather
+// refactor this one, xchg_tail() and encode_tail() ... another day
+{
+	struct __qspinlock *l = (void *)lock;
+
+	return (u32)cmpxchg(&l->tail, tail >> _Q_TAIL_OFFSET,
+			    newtail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
@@ -324,21 +381,21 @@ static inline int trylock_pending(struct qspinlock *lock, u32 *pval)
  *   fast     :    slow                                  :    unlock
Re: [PATCH v7 08/14] Introduce chr-testdev
2014-07-16 11:33+0200, Andrew Jones:
> On Wed, Jul 16, 2014 at 05:31:33AM -0400, Levente Kurusa wrote:
>> ----- Original Message -----
>> [...]
>>> +void chr_testdev_exit(int code)
>>> +{
>>> +	char buf[8];
>>> +	int len;
>>> +
>>> +	snprintf(buf, sizeof(buf), "%dq", code);
>>> +	len = strlen(buf);
>>
>> AFAIK, snprintf returns the number of characters written, so these
>> two statements can be merged into one.
>
> You are correct. I'll do this for v8 (if a v8 is needed).

snprintf returns the number of characters that would be written without
truncation, so it wouldn't handle a very long code ;)
I'd prefer the obvious v7, to finding a minimum with 'sizeof(buf) - 1'.
[PATCH 1/9] KVM: add kvm_arch_sched_in
Introduce preempt notifiers for architecture specific code.
Advantage over creating a new notifier in every arch is slightly
simpler code and guaranteed call order with respect to kvm_sched_in.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/arm/kvm/arm.c         | 4 ++++
 arch/mips/kvm/mips.c       | 4 ++++
 arch/powerpc/kvm/powerpc.c | 4 ++++
 arch/s390/kvm/kvm-s390.c   | 4 ++++
 arch/x86/kvm/x86.c         | 4 ++++
 include/linux/kvm_host.h   | 2 ++
 virt/kvm/kvm_main.c        | 2 ++
 7 files changed, 24 insertions(+)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index a99e0cd..9f788eb 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -288,6 +288,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	vcpu->cpu = cpu;
diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index cd71141..2362df2 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -1002,6 +1002,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
 				  struct kvm_translation *tr)
 {
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 4c79284..cbc432f 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -720,6 +720,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	kvmppc_subarch_vcpu_uninit(vcpu);
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 #ifdef CONFIG_BOOKE
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ce81eb2..a3c324e 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -555,6 +555,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	/* Nothing todo */
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	save_fp_ctl(&vcpu->arch.host_fpregs.fpc);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f1e22d..d7c214f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7146,6 +7146,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	static_key_slow_dec(&kvm_no_apic_vcpu);
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	if (type)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a4c33b3..ebd7236 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -624,6 +624,8 @@ void kvm_arch_exit(void);
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..d3c3ed0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	if (vcpu->preempted)
 		vcpu->preempted = false;
 
+	kvm_arch_sched_in(vcpu, cpu);
+
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
-- 
2.0.4
[PATCH 5/9] KVM: VMX: clamp PLE window
Modifications could get unwanted values of PLE window.  (low or
negative)
Use ple_window and the maximal value that cannot overflow as bounds.

ple_window_max defaults to a very high value, but it would make sense
to set it to some fraction of the scheduler tick.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/vmx.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 66259fd..e1192fb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -144,6 +144,10 @@ module_param(ple_window_grow, int, S_IRUGO);
 static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
 module_param(ple_window_shrink, int, S_IRUGO);
 
+/* Default is to compute the maximum so we can never overflow. */
+static int ple_window_max = INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_max, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5704,7 +5708,7 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
 	else
 		new = old + ple_window_grow;
 
-	vmx->ple_window = new;
+	vmx->ple_window = min(new, ple_window_max);
 }
 
 static void shrink_ple_window(struct kvm_vcpu *vcpu)
@@ -5720,7 +5724,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 	else
 		new = old - ple_window_shrink;
 
-	vmx->ple_window = new;
+	vmx->ple_window = max(new, ple_window);
 }
 
 /*
-- 
2.0.4
[PATCH 6/9] KVM: trace kvm_ple_window grow/shrink
Tracepoint for dynamic PLE window, fired on every potential change.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/trace.h | 29 +++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c   |  4 ++++
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index e850a7d..e4682f5 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -848,6 +848,35 @@ TRACE_EVENT(kvm_track_tsc,
 		  __print_symbolic(__entry->host_clock, host_clocks))
 );
 
+TRACE_EVENT(kvm_ple_window,
+	TP_PROTO(int grow, unsigned int vcpu_id, int new, int old),
+	TP_ARGS(grow, vcpu_id, new, old),
+
+	TP_STRUCT__entry(
+		__field(         int,   grow     )
+		__field(unsigned int,   vcpu_id  )
+		__field(         int,   new      )
+		__field(         int,   old      )
+	),
+
+	TP_fast_assign(
+		__entry->grow    = grow;
+		__entry->vcpu_id = vcpu_id;
+		__entry->new     = new;
+		__entry->old     = old;
+	),
+
+	TP_printk("vcpu %u: ple_window %d %s %d",
+		  __entry->vcpu_id,
+		  __entry->new,
+		  __entry->grow ? "+" : "-",
+		  __entry->old)
+);
+#define trace_kvm_ple_window_grow(vcpu_id, new, old) \
+	trace_kvm_ple_window(1, vcpu_id, new, old)
+#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \
+	trace_kvm_ple_window(0, vcpu_id, new, old)
+
 #endif /* CONFIG_X86_64 */
 
 #endif /* _TRACE_KVM_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e1192fb..a236a9f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5709,6 +5709,8 @@ static void grow_ple_window(struct kvm_vcpu *vcpu)
 		new = old + ple_window_grow;
 
 	vmx->ple_window = min(new, ple_window_max);
+
+	trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
 static void shrink_ple_window(struct kvm_vcpu *vcpu)
@@ -5725,6 +5727,8 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 		new = old - ple_window_shrink;
 
 	vmx->ple_window = max(new, ple_window);
+
+	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
 /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5696ee7..814b20c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7648,3 +7648,4 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
-- 
2.0.4
[PATCH 9/9] KVM: VMX: automatic PLE window maximum
Every increase of ple_window_grow creates potential overflows.
They are not serious, because we clamp ple_window and userspace is
expected to fix ple_window_max within a second.
---
 arch/x86/kvm/vmx.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d7f58e8..6873a0b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -138,7 +138,9 @@ module_param(ple_window, int, S_IRUGO | S_IWUSR);
 
 /* Default doubles per-vcpu window every exit. */
 static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
-module_param(ple_window_grow, int, S_IRUGO | S_IWUSR);
+static struct kernel_param_ops ple_window_grow_ops;
+module_param_cb(ple_window_grow, &ple_window_grow_ops,
+                &ple_window_grow, S_IRUGO | S_IWUSR);
 
 /* Default resets per-vcpu window every exit to ple_window. */
 static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
@@ -5717,6 +5719,36 @@ static void type##_ple_window(struct kvm_vcpu *vcpu) \
 make_ple_window_modifier(grow, *, +)    /* grow_ple_window */
 make_ple_window_modifier(shrink, /, -)  /* shrink_ple_window */
 
+static void clamp_ple_window_max(void)
+{
+	int maximum;
+
+	if (ple_window_grow < 1)
+		return;
+
+	if (ple_window_grow < ple_window)
+		maximum = INT_MAX / ple_window_grow;
+	else
+		maximum = INT_MAX - ple_window_grow;
+
+	ple_window_max = clamp(ple_window_max, ple_window, maximum);
+}
+
+static int set_ple_window_grow(const char *arg, const struct kernel_param *kp)
+{
+	int ret;
+
+	clamp_ple_window_max();
+	ret = param_set_int(arg, kp);
+
+	return ret;
+}
+
+static struct kernel_param_ops ple_window_grow_ops = {
+	.set = set_ple_window_grow,
+	.get = param_get_int,
+};
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
-- 
2.0.4
[PATCH 7/9] KVM: VMX: abstract ple_window modifiers
They were almost identical and thus merged with a loathable macro.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
This solution is hopefully more acceptable than function pointers.

 arch/x86/kvm/vmx.c | 53 +++++++++++++++++------------------------------------
 1 file changed, 19 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a236a9f..c6cfb71 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5694,42 +5694,27 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 out:
 	return ret;
 }
-
-static void grow_ple_window(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	int old = vmx->ple_window;
-	int new;
-
-	if (ple_window_grow < 1)
-		new = ple_window;
-	else if (ple_window_grow < ple_window)
-		new = old * ple_window_grow;
-	else
-		new = old + ple_window_grow;
-
-	vmx->ple_window = min(new, ple_window_max);
-
-	trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old);
+#define make_ple_window_modifier(type, oplt, opge, cmp, bound) \
+static void type##_ple_window(struct kvm_vcpu *vcpu) \
+{ \
+	struct vcpu_vmx *vmx = to_vmx(vcpu); \
+	int old = vmx->ple_window; \
+	int new; \
+\
+	if (ple_window_##type < 1) \
+		new = ple_window; \
+	else if (ple_window_##type < ple_window) \
+		new = old oplt ple_window_##type; \
+	else \
+		new = old opge ple_window_##type; \
+\
+	vmx->ple_window = cmp(new, bound); \
+\
+	trace_kvm_ple_window_##type(vcpu->vcpu_id, vmx->ple_window, old); \
 }
 
-static void shrink_ple_window(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	int old = vmx->ple_window;
-	int new;
-
-	if (ple_window_shrink < 1)
-		new = ple_window;
-	else if (ple_window_shrink < ple_window)
-		new = old / ple_window_shrink;
-	else
-		new = old - ple_window_shrink;
-
-	vmx->ple_window = max(new, ple_window);
-
-	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
-}
+make_ple_window_modifier(grow, *, +, min, ple_window_max)
+make_ple_window_modifier(shrink, /, -, max, ple_window)
 
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
-- 
2.0.4
[PATCH 8/9] KVM: VMX: runtime knobs for dynamic PLE window
ple_window is updated on every vmentry, so there is no reason to have
it read-only anymore.
ple_window_* weren't writable to prevent runtime overflow races; they
are mitigated by clamping the value of ple_window.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
If we decide to ignore insane overflows, the last two hunks can be dropped.

 arch/x86/kvm/vmx.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index c6cfb71..d7f58e8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -134,19 +134,19 @@ static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
-module_param(ple_window, int, S_IRUGO);
+module_param(ple_window, int, S_IRUGO | S_IWUSR);
 
 /* Default doubles per-vcpu window every exit. */
 static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
-module_param(ple_window_grow, int, S_IRUGO);
+module_param(ple_window_grow, int, S_IRUGO | S_IWUSR);
 
 /* Default resets per-vcpu window every exit to ple_window. */
 static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
-module_param(ple_window_shrink, int, S_IRUGO);
+module_param(ple_window_shrink, int, S_IRUGO | S_IWUSR);
 
 /* Default is to compute the maximum so we can never overflow. */
 static int ple_window_max = INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
-module_param(ple_window_max, int, S_IRUGO);
+module_param(ple_window_max, int, S_IRUGO | S_IWUSR);
 
 extern const ulong vmx_return;
 
@@ -5694,7 +5694,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 out:
 	return ret;
 }
-#define make_ple_window_modifier(type, oplt, opge, cmp, bound) \
+
+#define make_ple_window_modifier(type, oplt, opge) \
 static void type##_ple_window(struct kvm_vcpu *vcpu) \
 { \
 	struct vcpu_vmx *vmx = to_vmx(vcpu); \
@@ -5708,13 +5709,13 @@ static void type##_ple_window(struct kvm_vcpu *vcpu) \
 	else \
 		new = old opge ple_window_##type; \
 \
-	vmx->ple_window = cmp(new, bound); \
+	vmx->ple_window = clamp(new, ple_window, ple_window_max); \
 \
 	trace_kvm_ple_window_##type(vcpu->vcpu_id, vmx->ple_window, old); \
 }
 
-make_ple_window_modifier(grow, *, +, min, ple_window_max)
-make_ple_window_modifier(shrink, /, -, max, ple_window)
+make_ple_window_modifier(grow, *, +)    /* grow_ple_window */
+make_ple_window_modifier(shrink, /, -)  /* shrink_ple_window */
 
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
-- 
2.0.4
[PATCH 4/9] KVM: VMX: dynamise PLE window
Window is increased on every PLE exit and decreased on every sched_in.
The idea is that we don't want to PLE exit if there is no preemption
going on.
We do this with sched_in() because it does not hold rq lock.

There are two new kernel parameters for changing the window:
 ple_window_grow and ple_window_shrink
ple_window_grow affects the window on PLE exit and ple_window_shrink
does it on sched_in; depending on their value, the window is modified
like this: (ple_window is kvm_intel's global)

  ple_window_shrink/ |
  ple_window_grow    | PLE exit           | sched_in
  -------------------+--------------------+---------------------
  < 1                |  = ple_window      |  = ple_window
  < ple_window       | *= ple_window_grow | /= ple_window_shrink
  otherwise          | += ple_window_grow | -= ple_window_shrink

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/vmx.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index eaa5574..66259fd 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -125,14 +125,25 @@ module_param(nested, bool, S_IRUGO);
  * Time is measured based on a counter that runs at the same rate as the TSC,
  * refer SDM volume 3b section 21.6.13 and 22.1.3.
  */
-#define KVM_VMX_DEFAULT_PLE_GAP    128
-#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+#define KVM_VMX_DEFAULT_PLE_GAP           128
+#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
+#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
+#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
+
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
 module_param(ple_window, int, S_IRUGO);
 
+/* Default doubles per-vcpu window every exit. */
+static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_grow, int, S_IRUGO);
+
+/* Default resets per-vcpu window every exit to ple_window. */
+static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
+module_param(ple_window_shrink, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5680,12 +5691,47 @@ out:
 	return ret;
 }
 
+static void grow_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int old = vmx->ple_window;
+	int new;
+
+	if (ple_window_grow < 1)
+		new = ple_window;
+	else if (ple_window_grow < ple_window)
+		new = old * ple_window_grow;
+	else
+		new = old + ple_window_grow;
+
+	vmx->ple_window = new;
+}
+
+static void shrink_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int old = vmx->ple_window;
+	int new;
+
+	if (ple_window_shrink < 1)
+		new = ple_window;
+	else if (ple_window_shrink < ple_window)
+		new = old / ple_window_shrink;
+	else
+		new = old - ple_window_shrink;
+
+	vmx->ple_window = new;
+}
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	if (ple_gap)
+		grow_ple_window(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	kvm_vcpu_on_spin(vcpu);
 
@@ -8855,6 +8901,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 
 void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (ple_gap)
+		shrink_ple_window(vcpu);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {
-- 
2.0.4
[PATCH 2/9] KVM: x86: introduce sched_in to kvm_x86_ops
sched_in preempt notifier is available for x86, allow its use in
specific virtualization technologies as well.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/include/asm/kvm_host.h | 2 ++
 arch/x86/kvm/svm.c              | 6 ++++++
 arch/x86/kvm/vmx.c              | 6 ++++++
 arch/x86/kvm/x86.c              | 1 +
 4 files changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5724601..358e2f3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -772,6 +772,8 @@ struct kvm_x86_ops {
 	bool (*mpx_supported)(void);
 
 	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
+
+	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
 };
 
 struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ddf7427..4baf1bc 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4305,6 +4305,10 @@ static void svm_handle_external_intr(struct kvm_vcpu *vcpu)
 	local_irq_enable();
 }
 
+static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 static struct kvm_x86_ops svm_x86_ops = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -4406,6 +4410,8 @@ static struct kvm_x86_ops svm_x86_ops = {
 
 	.check_intercept = svm_check_intercept,
 	.handle_external_intr = svm_handle_external_intr,
+
+	.sched_in = svm_sched_in,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index bfe11cf..2b306f9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -8846,6 +8846,10 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 	return X86EMUL_CONTINUE;
 }
 
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -8951,6 +8955,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
 
 	.mpx_supported = vmx_mpx_supported,
 	.check_nested_events = vmx_check_nested_events,
+
+	.sched_in = vmx_sched_in,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7c214f..5696ee7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7148,6 +7148,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
-- 
2.0.4
[PATCH 3/9] KVM: VMX: make PLE window per-vcpu
Change PLE window into per-vcpu variable, seeded from module parameter,
to allow greater flexibility.

Brings in a small overhead on every vmentry.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
I've been thinking about a general hierarchical per-vcpu variable model,
but it's hard to have current performance and sane code.

 arch/x86/kvm/vmx.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2b306f9..eaa5574 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -484,6 +484,9 @@ struct vcpu_vmx {
 
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
+
+	/* Dynamic PLE window. */
+	int ple_window;
 };
 
 enum segment_cache_field {
@@ -4403,6 +4406,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmcs_write32(PLE_WINDOW, ple_window);
+		vmx->ple_window = ple_window;
 	}
 
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
@@ -7387,6 +7391,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vmx->emulation_required)
 		return;
 
+	if (ple_gap)
+		vmcs_write32(PLE_WINDOW, vmx->ple_window);
+
 	if (vmx->nested.sync_shadow_vmcs) {
 		copy_vmcs12_to_shadow(vmx);
 		vmx->nested.sync_shadow_vmcs = false;
-- 
2.0.4
[PATCH 0/9] Dynamic Pause Loop Exiting window.
PLE does not scale in its current form.  When increasing VCPU count
above 150, one can hit soft lockups because of runqueue lock contention.
(Which says a lot about performance.)

The main reason is that kvm_ple_loop cycles through all VCPUs.
Replacing it with a scalable solution would be ideal, but it has already
been well optimized for various workloads, so this series tries to
alleviate one different major problem while minimizing a chance of
regressions: we have too many useless PLE exits.

Just increasing the PLE window would help some cases, but it still
spirals out of control.  By increasing the window after every PLE exit,
we can limit the amount of useless ones, so we don't reach the state
where CPUs spend 99% of the time waiting for a lock.

HP confirmed that this series avoids soft lockups and TSC sync errors
on large guests.

---
Design notes and questions:

Alternative to the first two patches could be a new notifier.

All values are made changeable because defaults weren't selected after
weeks of benchmarking -- we can get improved performance by hardcoding
if someone is willing to do it.
(Or by presuming that no one is ever going to.)

Then, we can quite safely drop overflow checks: they are impossible to
hit with small increases and I don't think that anyone wants large ones.

Also, I'd argue against the last patch: it should be done in userspace,
but I'm not sure about Linux's policy.

Radim Krčmář (9):
  KVM: add kvm_arch_sched_in
  KVM: x86: introduce sched_in to kvm_x86_ops
  KVM: VMX: make PLE window per-vcpu
  KVM: VMX: dynamise PLE window
  KVM: VMX: clamp PLE window
  KVM: trace kvm_ple_window grow/shrink
  KVM: VMX: abstract ple_window modifiers
  KVM: VMX: runtime knobs for dynamic PLE window
  KVM: VMX: automatic PLE window maximum

 arch/arm/kvm/arm.c              |  4 ++
 arch/mips/kvm/mips.c            |  4 ++
 arch/powerpc/kvm/powerpc.c      |  4 ++
 arch/s390/kvm/kvm-s390.c        |  4 ++
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/svm.c              |  6 +++
 arch/x86/kvm/trace.h            | 29 ++++++++++++
 arch/x86/kvm/vmx.c              | 93 +++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |  6 +++
 include/linux/kvm_host.h        |  2 +
 virt/kvm/kvm_main.c             |  2 +
 11 files changed, 153 insertions(+), 3 deletions(-)
-- 
2.0.4
Re: [PATCH 7/9] KVM: VMX: abstract ple_window modifiers
2014-08-20 09:02+0200, Paolo Bonzini:
> On 19/08/2014 22:35, Radim Krčmář wrote:
> > They were almost identical and thus merged with a loathable macro.
> > 
> > Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
> > ---
> > This solution is hopefully more acceptable than function pointers.
> 
> I think a little amount of duplication is not a problem.

Ok, I'll drop this patch from v2.
Re: [PATCH 9/9] KVM: VMX: automatic PLE window maximum
2014-08-20 09:16+0200, Paolo Bonzini:
> On 19/08/2014 22:35, Radim Krčmář wrote:
> > Every increase of ple_window_grow creates potential overflows.
> > They are not serious, because we clamp ple_window and userspace is
> > expected to fix ple_window_max within a second.
> 
> I think avoiding overflows is better.  In fact, I think you should call
> this function for ple_window_max too.

(Ack, I just wanted to avoid the worst userspace error, which is why
 PW_max hasn't changed when PW_grow got smaller and we could overflow.)

> You could keep the ple_window_max variable to the user-set value.
> Whenever ple_window_grow or ple_window_max are changed, you can set an
> internal variable (let's call it ple_window_actual_max, but I'm not wed
> to this name) to the computed value, and then do:
> 
> 	if (ple_window_grow < 1 || ple_window_actual_max < ple_window)
> 		new = ple_window;
> 	else if (ple_window_grow < ple_window)
> 		new = max(ple_window_actual_max, old) * ple_window_grow;
> 	else
> 		new = max(ple_window_actual_max, old) + ple_window_grow;

Oh, I like that this can get rid of all overflows,
ple_window_actual_max (PW_effective_max?) is going to be set to
ple_window_max [/-] ple_window_grow in v2.
(I think the '||' in the first 'if' can be eliminated with some
 creativity in clamp_ple_window_max.)

To do it, we'll want to intercept changes to ple_window as well.
(I disliked this patch a lot even before :)
Re: [PATCH 5/9] KVM: VMX: clamp PLE window
2014-08-20 09:18+0200, Paolo Bonzini:
> On 19/08/2014 22:35, Radim Krčmář wrote:
> > Modifications could get unwanted values of PLE window.
> > (low or negative)
> > Use ple_window and the maximal value that cannot overflow as bounds.
> 
> ple_window_max defaults to a very high value, but it would make sense
> to set it to some fraction of the scheduler tick.
> 
> > Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
> > ---
> 
> Please introduce a dynamic overflow-avoiding ple_window_max (like what
> you have in patch 9) already in patch 4...
> 
> >  static void shrink_ple_window(struct kvm_vcpu *vcpu)
> > @@ -5720,7 +5724,7 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
> >  	else
> >  		new = old - ple_window_shrink;
> >  
> > -	vmx->ple_window = new;
> > +	vmx->ple_window = max(new, ple_window);
> 
> ... and also squash this in patch 4.  This patch can then introduce the
> ple_window_max module parameter (using module_param_cb to avoid
> overflows).

Will do.

---
It is going to make the patches slightly harder to review; are we doing
it because git doesn't bisect on series boundaries?
Re: [PATCH 3/9] KVM: VMX: make PLE window per-vcpu
2014-08-20 09:13+0200, Paolo Bonzini:
> On 19/08/2014 22:35, Radim Krčmář wrote:
> >  enum segment_cache_field {
> > @@ -4403,6 +4406,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
> >  	if (ple_gap) {
> >  		vmcs_write32(PLE_GAP, ple_gap);
> >  		vmcs_write32(PLE_WINDOW, ple_window);
> 
> Is this necessary?

V2, thanks.
Re: [PATCH 1/9] KVM: add kvm_arch_sched_in
2014-08-20 09:47+0200, Christian Borntraeger:
> On 19/08/14 22:35, Radim Krčmář wrote:
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
> >  	if (vcpu->preempted)
> >  		vcpu->preempted = false;
> >  
> > +	kvm_arch_sched_in(vcpu, cpu);
> > +
> >  	kvm_arch_vcpu_load(vcpu, cpu);
> >  }
> 
> Why can't you reuse kvm_arch_vcpu_load?  It's also called on each
> sched_in and is architecture specific.

kvm_arch_vcpu_load is also called from kvm_vcpu_ioctl, so we'd be
shrinking unnecessarily.
(sched_in gives us a bit of useful information about the state of the
 system, kvm_vcpu_ioctl not that much.)
Re: [PATCH 9/9] KVM: VMX: automatic PLE window maximum
2014-08-20 15:15+0200, Paolo Bonzini:
> On 20/08/2014 14:41, Radim Krčmář wrote:
> > > 	if (ple_window_grow < 1 || ple_window_actual_max < ple_window)
> > > 		new = ple_window;
> > > 	else if (ple_window_grow < ple_window)
> > > 		new = max(ple_window_actual_max, old) * ple_window_grow;
> > > 	else
> > > 		new = max(ple_window_actual_max, old) + ple_window_grow;
> > 
> > Oh, I like that this can get rid of all overflows,
> > ple_window_actual_max (PW_effective_max?) is going to be set to
> > ple_window_max [/-] ple_window_grow in v2.
> > (I think the '||' in the first 'if' can be eliminated with some
> >  creativity in clamp_ple_window_max.)
> > 
> > To do it, we'll want to intercept changes to ple_window as well.
> > (I disliked this patch a lot even before :)
> 
> What about setting ple_window_actual_max to 0 if ple_window_grow is 0
> (instead of just returning)?  Then the
> "if (ple_window_actual_max < ple_window)" will always fail and you'll
> go through "new = ple_window".  But perhaps it's more gross and
> worthless than creative. :)

That code can't use PW directly, because PW_actual_max needs to be one
PW_grow below PW_max, so I'd rather enforce a minimal PW_actual_max.

Btw. without extra code, we are still going to overflow on races when
changing PW_grow; should they be covered as well?

(+ There is a bug in this patch -- clamp_ple_window_max() should be
 after param_set_int() ... damned unreviewed last-second changes.)
Re: [PATCH 9/9] KVM: VMX: automatic PLE window maximum
2014-08-20 17:34+0200, Paolo Bonzini:
> On 20/08/2014 17:31, Radim Krčmář wrote:
> > Btw. without extra code, we are still going to overflow on races when
> > changing PW_grow; should they be covered as well?
> 
> You mean because there is no spinlock or similar protecting the
> changes?  I guess you could use a seqlock.

Yes, for example between a modification of ple_window

  new = min(old, PW_actual_max) * PW_grow

which gets compiled into something like this:

  1) tmp = min(old, PW_actual_max)
  2) new = tmp * PW_grow

and a write to increase PW_grow

  3) PW_actual_max = min(PW_max / new_PW_grow, PW_actual_max)
  4) PW_grow = new_PW_grow
  5) PW_actual_max = PW_max / new_PW_grow

Steps 3 and 4 can execute between 1 and 2, which could overflow.

I don't think they are important enough to warrant a significant
performance hit of locking.  Or even more checks that would prevent it
in a lockless way.
(I'd just see that the result is set to something legal and also drop
 line 3, because it does not help things that much.)
Re: [PATCH 9/9] KVM: VMX: automatic PLE window maximum
2014-08-20 18:03+0200, Paolo Bonzini:
> On 20/08/2014 18:01, Radim Krčmář wrote:
> > 2014-08-20 17:34+0200, Paolo Bonzini:
> > > On 20/08/2014 17:31, Radim Krčmář wrote:
> > > > Btw. without extra code, we are still going to overflow on races
> > > > when changing PW_grow; should they be covered as well?
> > > 
> > > You mean because there is no spinlock or similar protecting the
> > > changes?  I guess you could use a seqlock.
> > 
> > Yes, for example between a modification of ple_window
> > 
> >   new = min(old, PW_actual_max) * PW_grow
> > 
> > which gets compiled into something like this:
> > 
> >   1) tmp = min(old, PW_actual_max)
> >   2) new = tmp * PW_grow
> > 
> > and a write to increase PW_grow
> > 
> >   3) PW_actual_max = min(PW_max / new_PW_grow, PW_actual_max)
> >   4) PW_grow = new_PW_grow
> >   5) PW_actual_max = PW_max / new_PW_grow
> > 
> > Steps 3 and 4 can execute between 1 and 2, which could overflow.
> > 
> > I don't think they are important enough to warrant a significant
> > performance hit of locking.
> 
> A seqlock just costs two memory accesses to the same (shared) cache
> line as the PW data, and a non-taken branch.

Oh, seqlock readers do not have to write to shared memory, so it is
acceptable ...

> I don't like code that is unsafe by design...

I wouldn't say it is unsafe, because the VCPU's PW is always greater
than the module's PW.  We are just going to PLE exit sooner than
expected.
[PATCH v2 3/6] KVM: VMX: make PLE window per-VCPU
Change PLE window into per-VCPU variable, seeded from module parameter,
to allow greater flexibility.

Brings in a small overhead on every vmentry.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/vmx.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2b306f9..18e0e52 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -484,6 +484,9 @@ struct vcpu_vmx {
 
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
+
+	/* Dynamic PLE window. */
+	int ple_window;
 };
 
 enum segment_cache_field {
@@ -4402,7 +4405,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 
 	if (ple_gap) {
 		vmcs_write32(PLE_GAP, ple_gap);
-		vmcs_write32(PLE_WINDOW, ple_window);
+		vmx->ple_window = ple_window;
 	}
 
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
@@ -7387,6 +7390,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (vmx->emulation_required)
 		return;
 
+	if (ple_gap)
+		vmcs_write32(PLE_WINDOW, vmx->ple_window);
+
 	if (vmx->nested.sync_shadow_vmcs) {
 		copy_vmcs12_to_shadow(vmx);
 		vmx->nested.sync_shadow_vmcs = false;
-- 
2.0.4
[PATCH v2 6/6] KVM: VMX: runtime knobs for dynamic PLE window
ple_window is updated on every vmentry, so there is no reason to have
it read-only anymore.
ple_window* weren't writable to prevent runtime overflow races; they
are prevented by a seqlock.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/vmx.c | 48 +++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f63ac5d..bd73fa1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -132,24 +132,29 @@ module_param(nested, bool, S_IRUGO);
 #define KVM_VMX_DEFAULT_PLE_WINDOW_MAX \
 		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
 
+static struct kernel_param_ops param_ops_ple_t;
+#define param_check_ple_t(name, p) __param_check(name, p, int)
+
+static DEFINE_SEQLOCK(ple_window_seqlock);
+
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
-module_param(ple_window, int, S_IRUGO);
+module_param(ple_window, ple_t, S_IRUGO | S_IWUSR);
 
 /* Default doubles per-vcpu window every exit. */
 static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
-module_param(ple_window_grow, int, S_IRUGO);
+module_param(ple_window_grow, ple_t, S_IRUGO | S_IWUSR);
 
 /* Default resets per-vcpu window every exit to ple_window. */
 static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
-module_param(ple_window_shrink, int, S_IRUGO);
+module_param(ple_window_shrink, int, S_IRUGO | S_IWUSR);
 
 /* Default is to compute the maximum so we can never overflow. */
 static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
 static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
-module_param(ple_window_max, int, S_IRUGO);
+module_param(ple_window_max, ple_t, S_IRUGO | S_IWUSR);
 
 extern const ulong vmx_return;
 
@@ -5730,13 +5735,19 @@ static void modify_ple_window(struct kvm_vcpu *vcpu, int grow)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int old = vmx->ple_window;
 	int new;
+	unsigned seq;
 
-	if (grow)
-		new = __grow_ple_window(old);
-	else
-		new = __shrink_ple_window(old, ple_window_shrink, ple_window);
+	do {
+		seq = read_seqbegin(&ple_window_seqlock);
 
-	vmx->ple_window = max(new, ple_window);
+		if (grow)
+			new = __grow_ple_window(old);
+		else
+			new = __shrink_ple_window(old, ple_window_shrink,
+			                          ple_window);
+
+		vmx->ple_window = max(new, ple_window);
+	} while (read_seqretry(&ple_window_seqlock, seq));
 
 	trace_kvm_ple_window(grow, vcpu->vcpu_id, vmx->ple_window, old);
 }
@@ -5750,6 +5761,23 @@ static void update_ple_window_actual_max(void)
 					    ple_window_grow, INT_MIN);
 }
 
+static int param_set_ple_t(const char *arg, const struct kernel_param *kp)
+{
+	int ret;
+
+	write_seqlock(&ple_window_seqlock);
+	ret = param_set_int(arg, kp);
+	update_ple_window_actual_max();
+	write_sequnlock(&ple_window_seqlock);
+
+	return ret;
+}
+
+static struct kernel_param_ops param_ops_ple_t = {
+	.set = param_set_ple_t,
+	.get = param_get_int,
+};
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
@@ -9153,8 +9181,6 @@ static int __init vmx_init(void)
 	} else
 		kvm_disable_tdp();
 
-	update_ple_window_actual_max();
-
 	return 0;
 
 out7:
-- 
2.0.4
[PATCH v2 4/6] KVM: VMX: dynamise PLE window
The window is increased on every PLE exit and decreased on every sched_in.
The idea is that we don't want to PLE exit if there is no preemption going on.
We do this with sched_in() because it does not hold the rq lock.

There are two new kernel parameters for changing the window:
ple_window_grow and ple_window_shrink.  ple_window_grow affects the window on
PLE exit and ple_window_shrink does it on sched_in; depending on their value,
the window is modified like this: (ple_window is kvm_intel's global)

        ple_window_shrink/ |
ple_window_grow            | PLE exit           | sched_in
---------------------------+--------------------+---------------------
< 1                        |  = ple_window      |  = ple_window
< ple_window               | *= ple_window_grow | /= ple_window_shrink
otherwise                  | += ple_window_grow | -= ple_window_shrink

A third new parameter, ple_window_max, controls the maximal ple_window.
The minimum equals ple_window.

Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
---
 arch/x86/kvm/vmx.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 18e0e52..e63d7ac 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -125,14 +125,32 @@ module_param(nested, bool, S_IRUGO);
  * Time is measured based on a counter that runs at the same rate as the TSC,
  * refer SDM volume 3b section 21.6.13 & 22.1.3.
  */
-#define KVM_VMX_DEFAULT_PLE_GAP    128
-#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+#define KVM_VMX_DEFAULT_PLE_GAP           128
+#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
+#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
+#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
+#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
+		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
+
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
 module_param(ple_window, int, S_IRUGO);
 
+/* Default doubles per-vcpu window every exit. */
+static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_grow, int, S_IRUGO);
+
+/* Default resets per-vcpu window every exit to ple_window. */
+static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
+module_param(ple_window_shrink, int, S_IRUGO);
+
+/* Default is to compute the maximum so we can never overflow. */
+static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+module_param(ple_window_max, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5679,12 +5697,66 @@ out:
 	return ret;
 }
 
+static int __grow_ple_window(int val)
+{
+	if (ple_window_grow < 1)
+		return ple_window;
+
+	val = min(val, ple_window_actual_max);
+
+	if (ple_window_grow < ple_window)
+		val *= ple_window_grow;
+	else
+		val += ple_window_grow;
+
+	return val;
+}
+
+static int __shrink_ple_window(int val, int shrinker, int minimum)
+{
+	if (shrinker < 1)
+		return ple_window;
+
+	if (shrinker < ple_window)
+		val /= shrinker;
+	else
+		val -= shrinker;
+
+	return max(val, minimum);
+}
+
+static void modify_ple_window(struct kvm_vcpu *vcpu, int grow)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int new;
+
+	if (grow)
+		new = __grow_ple_window(vmx->ple_window);
+	else
+		new = __shrink_ple_window(vmx->ple_window, ple_window_shrink,
+		                          ple_window);
+
+	vmx->ple_window = max(new, ple_window);
+}
+#define grow_ple_window(vcpu)   modify_ple_window(vcpu, 1)
+#define shrink_ple_window(vcpu) modify_ple_window(vcpu, 0)
+
+static void update_ple_window_actual_max(void)
+{
+	ple_window_actual_max =
+			__shrink_ple_window(max(ple_window_max, ple_window),
+			                    ple_window_grow, INT_MIN);
+}
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	if (ple_gap)
+		grow_ple_window(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	kvm_vcpu_on_spin(vcpu);
 
@@ -8854,6 +8926,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 
 void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (ple_gap)
+		shrink_ple_window(vcpu);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {
@@ -9077,6 +9151,8 @@ static int __init vmx_init(void)
 	} else
 		kvm_disable_tdp();
 
+	update_ple_window_actual_max();
+
 	return 0;
 
 out7:
-- 
2.0.4
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v2 5/6] KVM: trace kvm_ple_window
Tracepoint for dynamic PLE window, fired on every potential change. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/trace.h | 25 + arch/x86/kvm/vmx.c | 8 +--- arch/x86/kvm/x86.c | 1 + 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index e850a7d..4b8e6cb 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -848,6 +848,31 @@ TRACE_EVENT(kvm_track_tsc, __print_symbolic(__entry-host_clock, host_clocks)) ); +TRACE_EVENT(kvm_ple_window, + TP_PROTO(int grow, unsigned int vcpu_id, int new, int old), + TP_ARGS(grow, vcpu_id, new, old), + + TP_STRUCT__entry( + __field( int, grow ) + __field(unsigned int, vcpu_id ) + __field( int, new ) + __field( int, old ) + ), + + TP_fast_assign( + __entry-grow = grow; + __entry-vcpu_id= vcpu_id; + __entry-new= new; + __entry-old= old; + ), + + TP_printk(vcpu %u: ple_window %d %s %d, + __entry-vcpu_id, + __entry-new, + __entry-grow ? + : -, + __entry-old) +); + #endif /* CONFIG_X86_64 */ #endif /* _TRACE_KVM_H */ diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index e63d7ac..f63ac5d 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5728,15 +5728,17 @@ static int __shrink_ple_window(int val, int shrinker, int minimum) static void modify_ple_window(struct kvm_vcpu *vcpu, int grow) { struct vcpu_vmx *vmx = to_vmx(vcpu); + int old = vmx-ple_window; int new; if (grow) - new = __grow_ple_window(vmx-ple_window); + new = __grow_ple_window(old) else - new = __shrink_ple_window(vmx-ple_window, ple_window_shrink, - ple_window); + new = __shrink_ple_window(old, ple_window_shrink, ple_window); vmx-ple_window = max(new, ple_window); + + trace_kvm_ple_window(grow, vcpu-vcpu_id, vmx-ple_window, old); } #define grow_ple_window(vcpu) modify_ple_window(vcpu, 1) #define shrink_ple_window(vcpu) modify_ple_window(vcpu, 0) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 5696ee7..814b20c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ 
-7648,3 +7648,4 @@
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
-- 
2.0.4
[PATCH v2 2/6] KVM: x86: introduce sched_in to kvm_x86_ops
sched_in preempt notifier is available for x86, allow its use in specific virtualization technlogies as well. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/svm.c | 6 ++ arch/x86/kvm/vmx.c | 6 ++ arch/x86/kvm/x86.c | 1 + 4 files changed, 15 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5724601..358e2f3 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -772,6 +772,8 @@ struct kvm_x86_ops { bool (*mpx_supported)(void); int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr); + + void (*sched_in)(struct kvm_vcpu *kvm, int cpu); }; struct kvm_arch_async_pf { diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ddf7427..4baf1bc 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -4305,6 +4305,10 @@ static void svm_handle_external_intr(struct kvm_vcpu *vcpu) local_irq_enable(); } +static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + static struct kvm_x86_ops svm_x86_ops = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, @@ -4406,6 +4410,8 @@ static struct kvm_x86_ops svm_x86_ops = { .check_intercept = svm_check_intercept, .handle_external_intr = svm_handle_external_intr, + + .sched_in = svm_sched_in, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index bfe11cf..2b306f9 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8846,6 +8846,10 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu, return X86EMUL_CONTINUE; } +void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -8951,6 +8955,8 @@ static struct kvm_x86_ops vmx_x86_ops = { .mpx_supported = vmx_mpx_supported, .check_nested_events = vmx_check_nested_events, + + .sched_in = vmx_sched_in, }; static int __init vmx_init(void) diff --git 
a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d7c214f..5696ee7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7148,6 +7148,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 
 void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	kvm_x86_ops->sched_in(vcpu, cpu);
 }
 
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
-- 
2.0.4
[PATCH v2 0/6] Dynamic Pause Loop Exiting window.
v1 - v2: * squashed [v1 4/9] and [v1 5/9] (clamping) * dropped [v1 7/9] (CPP abstractions) * merged core of [v1 9/9] into [v1 4/9] (automatic maximum) * reworked kernel_param_ops: closer to pure int [v2 6/6] * introduced ple_window_actual_max reworked clamping [v2 4/6] * added seqlock for parameter modifications [v2 6/6] --- PLE does not scale in its current form. When increasing VCPU count above 150, one can hit soft lockups because of runqueue lock contention. (Which says a lot about performance.) The main reason is that kvm_ple_loop cycles through all VCPUs. Replacing it with a scalable solution would be ideal, but it has already been well optimized for various workloads, so this series tries to alleviate one different major problem while minimizing a chance of regressions: we have too many useless PLE exits. Just increasing PLE window would help some cases, but it still spirals out of control. By increasing the window after every PLE exit, we can limit the amount of useless ones, so we don't reach the state where CPUs spend 99% of the time waiting for a lock. HP confirmed that this series prevents soft lockups and TSC sync errors on large guests. Radim Krčmář (6): KVM: add kvm_arch_sched_in KVM: x86: introduce sched_in to kvm_x86_ops KVM: VMX: make PLE window per-VCPU KVM: VMX: dynamise PLE window KVM: trace kvm_ple_window KVM: VMX: runtime knobs for dynamic PLE window arch/arm/kvm/arm.c | 4 ++ arch/mips/kvm/mips.c| 4 ++ arch/powerpc/kvm/powerpc.c | 4 ++ arch/s390/kvm/kvm-s390.c| 4 ++ arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/svm.c | 6 ++ arch/x86/kvm/trace.h| 25 arch/x86/kvm/vmx.c | 124 ++-- arch/x86/kvm/x86.c | 6 ++ include/linux/kvm_host.h| 2 + virt/kvm/kvm_main.c | 2 + 11 files changed, 179 insertions(+), 4 deletions(-) -- 2.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/6] KVM: add kvm_arch_sched_in
Introduce preempt notifiers for architecture specific code. Advantage over creating a new notifier in every arch is slightly simpler code and guaranteed call order with respect to kvm_sched_in. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/arm/kvm/arm.c | 4 arch/mips/kvm/mips.c | 4 arch/powerpc/kvm/powerpc.c | 4 arch/s390/kvm/kvm-s390.c | 4 arch/x86/kvm/x86.c | 4 include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c| 2 ++ 7 files changed, 24 insertions(+) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index a99e0cd..9f788eb 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -288,6 +288,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { vcpu-cpu = cpu; diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index cd71141..2362df2 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -1002,6 +1002,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, struct kvm_translation *tr) { diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 4c79284..cbc432f 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -720,6 +720,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) kvmppc_subarch_vcpu_uninit(vcpu); } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { #ifdef CONFIG_BOOKE diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index ce81eb2..a3c324e 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -555,6 +555,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) /* Nothing todo */ } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { 
save_fp_ctl(&vcpu->arch.host_fpregs.fpc);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f1e22d..d7c214f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7146,6 +7146,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 	static_key_slow_dec(&kvm_no_apic_vcpu);
 }
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+}
+
 int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 {
 	if (type)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a4c33b3..ebd7236 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -624,6 +624,8 @@ void kvm_arch_exit(void);
 int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu);
 
+void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu);
+
 void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu);
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..d3c3ed0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 	if (vcpu->preempted)
 		vcpu->preempted = false;
 
+	kvm_arch_sched_in(vcpu, cpu);
+
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
-- 
2.0.4
Re: [PATCH v2 1/6] KVM: add kvm_arch_sched_in
2014-08-21 10:29+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > Introduce preempt notifiers for architecture specific code.
> > Advantage over creating a new notifier in every arch is slightly
> > simpler code and guaranteed call order with respect to kvm_sched_in.
> >
> > Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
> > ---
> >  arch/arm/kvm/arm.c         | 4 ++++
> >  arch/mips/kvm/mips.c       | 4 ++++
> >  arch/powerpc/kvm/powerpc.c | 4 ++++
> >  arch/s390/kvm/kvm-s390.c   | 4 ++++
> >  arch/x86/kvm/x86.c         | 4 ++++
> 
> What about adding them as static inlines in
> arch/*/include/asm/kvm_host.h (except for arch/x86 of course)?

All empty arch functions are in '.c' files, so it seems better to follow
the same path.
(And have one refactoring patch if GCC does not optimize this.)
Re: [PATCH v2 3/6] KVM: VMX: make PLE window per-VCPU
2014-08-21 10:25+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > +	if (ple_gap)
> > +		vmcs_write32(PLE_WINDOW, vmx->ple_window);
> 
> Maybe we can add a ple_window_dirty field?  It can be tested instead of
> ple_gap to avoid the vmwrite in the common case?

Sure, v3!
Re: [PATCH v2 4/6] KVM: VMX: dynamise PLE window
2014-08-21 10:24+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > +static void update_ple_window_actual_max(void)
> > +{
> > +	ple_window_actual_max =
> > +			__shrink_ple_window(max(ple_window_max, ple_window),
> 
> Why the max() here?

To have ple_window act as a ple_window_min as well.
(When we are already preventing most stupid choices.)
And it's cheap ... I can add a comment to this function :)

> > +			                    ple_window_grow, INT_MIN);
> 
> This should be 0 (in fact you can probably make everything unsigned now
> that you've sorted out the overflows).

Not in v2:

	val = min(vmx->ple_window, ple_window_actual_max);
	val += ple_window_grow;
	vmx->ple_window = val;

so we need to dip below zero to allow all possible grows.
(I'm not sure if anyone is ever going to use the additive option, so
getting rid of it is also a valid choice -- the code would be nicer.)
Re: [PATCH v2 4/6] KVM: VMX: dynamise PLE window
2014-08-21 10:26+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > +static int __shrink_ple_window(int val, int shrinker, int minimum)
> 
> s/shrinker/factor/ or s/shrinker/param/ (shrinker has another meaning
> in the kernel).

True, thanks.

> > +{
> > +	if (shrinker < 1)
> > +		return ple_window;
> > +
> > +	if (shrinker < ple_window)
> > +		val /= shrinker;
> > +	else
> > +		val -= shrinker;
> > +
> > +	return max(val, minimum);
> 
> Any reason to use anything but ple_window as the minimum, even in
> update_ple_window_actual_max?

ple_window_actual_max needs to be one grow below ple_window_max, so it
can be lower than ple_window.

> > +}
> > +
> > +static void modify_ple_window(struct kvm_vcpu *vcpu, int grow)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	int new;
> > +
> > +	if (grow)
> > +		new = __grow_ple_window(vmx->ple_window);
> > +	else
> > +		new = __shrink_ple_window(vmx->ple_window, ple_window_shrink,
> > +		                          ple_window);
> > +
> > +	vmx->ple_window = max(new, ple_window);
> > +}
> > +#define grow_ple_window(vcpu)   modify_ple_window(vcpu, 1)
> > +#define shrink_ple_window(vcpu) modify_ple_window(vcpu, 0)
> 
> No macros please. :)

Guilty as charged.
Using 0/1 or true/false in this context directly would be pretty bad ...
is an enum fine?  (SHRINK_PLE_WINDOW, GROW_PLE_WINDOW?)
(I can always make it into function pointers ;)
Re: [PATCH v2 5/6] KVM: trace kvm_ple_window
2014-08-21 10:29+0200, Paolo Bonzini:
> Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > Tracepoint for dynamic PLE window, fired on every potential change.
> >
> > Signed-off-by: Radim Krčmář <rkrc...@redhat.com>
> > ---
> >  arch/x86/kvm/trace.h | 25 +++++++++++++++++++++++++
> >  arch/x86/kvm/vmx.c   |  8 +++++---
> >  arch/x86/kvm/x86.c   |  1 +
> >  3 files changed, 31 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
> > index e850a7d..4b8e6cb 100644
> > --- a/arch/x86/kvm/trace.h
> > +++ b/arch/x86/kvm/trace.h
> > @@ -848,6 +848,31 @@ TRACE_EVENT(kvm_track_tsc,
> >  		  __print_symbolic(__entry->host_clock, host_clocks))
> >  );
> >
> > +TRACE_EVENT(kvm_ple_window,
> > +	TP_PROTO(int grow, unsigned int vcpu_id, int new, int old),
> 
> s/int grow/bool grow/ (and similarly in TP_STRUCT__entry).

Ok.
(We are using 'int' for this in other tracepoints, so I guessed there is
some hate against bool.)

> > +	TP_ARGS(grow, vcpu_id, new, old),
> > +
> > +	TP_STRUCT__entry(
> > +		__field(         int, grow    )
> > +		__field(unsigned int, vcpu_id )
> > +		__field(         int, new     )
> > +		__field(         int, old     )
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__entry->grow    = grow;
> > +		__entry->vcpu_id = vcpu_id;
> > +		__entry->new     = new;
> > +		__entry->old     = old;
> > +	),
> > +
> > +	TP_printk("vcpu %u: ple_window %d %s %d",
> > +		  __entry->vcpu_id,
> > +		  __entry->new,
> > +		  __entry->grow ? "+" : "-",
> > +		  __entry->old)
> 
> "? grow : shrink"

Can't say I haven't expected a comment about that :)
Re: [PATCH v2 4/6] KVM: VMX: dynamise PLE window
2014-08-21 14:29+0200, Paolo Bonzini:
> Il 21/08/2014 13:54, Radim Krčmář ha scritto:
> > Guilty as charged.
> > Using 0/1 or true/false in this context directly would be pretty
> > bad ... is an enum fine?  (SHRINK_PLE_WINDOW, GROW_PLE_WINDOW?)
> 
> I prefer good old Ctrl-C Ctrl-V (adjusted for your favorite editor). :)

I, I should be able to do it, but the nightmares are going to last a
while.

(And to lower the chances of a v4: are tracepoint macros, like the ones
for kvm_apic, frowned upon now?)
Re: [PATCH v2 1/6] KVM: add kvm_arch_sched_in
2014-08-21 14:27+0200, Paolo Bonzini:
> Il 21/08/2014 13:38, Radim Krčmář ha scritto:
> > 2014-08-21 10:29+0200, Paolo Bonzini:
> > > Il 20/08/2014 22:53, Radim Krčmář ha scritto:
> > > > Introduce preempt notifiers for architecture specific code.
> > > > Advantage over creating a new notifier in every arch is slightly
> > > > simpler code and guaranteed call order with respect to
> > > > kvm_sched_in.
> > > 
> > > What about adding them as static inlines in
> > > arch/*/include/asm/kvm_host.h (except for arch/x86 of course)?
> > 
> > All empty arch functions are in '.c' files, so it seems better to
> > follow the same path.
> > (And have one refactoring patch if GCC does not optimize this.)
> 
> GCC certainly does not optimize this (unless you use LTO).

I see LTO patches in -next ... do we want to move every empty arch
function into headers?
(It is probably going to take LTO a few years to be enabled by default.)
Re: [PATCH v2 4/6] KVM: VMX: dynamise PLE window
2014-08-21 15:18+0200, Paolo Bonzini:
> Il 21/08/2014 14:42, Radim Krčmář ha scritto:
> > 2014-08-21 14:29+0200, Paolo Bonzini:
> > > Il 21/08/2014 13:54, Radim Krčmář ha scritto:
> > > > Guilty as charged.
> > > > Using 0/1 or true/false in this context directly would be pretty
> > > > bad ... is an enum fine?  (SHRINK_PLE_WINDOW, GROW_PLE_WINDOW?)
> > > 
> > > I prefer good old Ctrl-C Ctrl-V (adjusted for your favorite
> > > editor). :)
> > 
> > I, I should be able to do it, but the nightmares are going to last a
> > while.
> > (And to lower the chances of a v4: are tracepoint macros, like the
> > ones for kvm_apic, frowned upon now?)
> 
> Nah, go ahead.  But is there any example of a #define like what you
> were doing?  I tried grepping for "#define.*[a-z]\(" and didn't get
> anything similar.

No, at least not in KVM code; it was the same idea as the tracepoints:
fewer magical constants at the cost of (syntactic) indirection.
Re: [PATCH v2 5/6] KVM: trace kvm_ple_window
2014-08-21 15:22+0200, Paolo Bonzini:
> Il 21/08/2014 13:56, Radim Krčmář ha scritto:
> > 2014-08-21 10:29+0200, Paolo Bonzini:
> > > s/int grow/bool grow/ (and similarly in TP_STRUCT__entry).
> > 
> > Ok.  (We are using 'int' for this in other tracepoints, so I guessed
> > there is some hate against bool.)
> 
> Yeah, there are many ints, but there are also some bools. :)
> 
> Of course bool is only a no-brainer if there is a clear direction to
> use (e.g. "bool has_error"), and indeed this case is less obvious.
> But there are some like this one (e.g. "bool gpa_match").  Keep the
> int if you prefer it that way.

I actually like bool much better, but the first few I looked at were
only int, and then I stopped looking ...
[PATCH v3 0/7] Dynamic Pause Loop Exiting window.
v2 - v3: * copypaste frenzy [v3 4/7] (split modify_ple_window) * commented update_ple_window_actual_max [v3 4/7] * renamed shrinker to modifier [v3 4/7] * removed an extraneous max(new, ple_window) [v3 4/7] (should have been in v2) * changed tracepoint argument type, printing and macro abstractions [v3 5/7] * renamed ple_t to ple_int [v3 6/7] (visible in modinfo) * intelligent updates of ple_window [v3 7/7] --- v1 - v2: * squashed [v1 4/9] and [v1 5/9] (clamping) * dropped [v1 7/9] (CPP abstractions) * merged core of [v1 9/9] into [v1 4/9] (automatic maximum) * reworked kernel_param_ops: closer to pure int [v2 6/6] * introduced ple_window_actual_max reworked clamping [v2 4/6] * added seqlock for parameter modifications [v2 6/6] --- PLE does not scale in its current form. When increasing VCPU count above 150, one can hit soft lockups because of runqueue lock contention. (Which says a lot about performance.) The main reason is that kvm_ple_loop cycles through all VCPUs. Replacing it with a scalable solution would be ideal, but it has already been well optimized for various workloads, so this series tries to alleviate one different major problem while minimizing a chance of regressions: we have too many useless PLE exits. Just increasing PLE window would help some cases, but it still spirals out of control. By increasing the window after every PLE exit, we can limit the amount of useless ones, so we don't reach the state where CPUs spend 99% of the time waiting for a lock. HP confirmed that this series prevents soft lockups and TSC sync errors on large guests. 
Radim Krčmář (7):
  KVM: add kvm_arch_sched_in
  KVM: x86: introduce sched_in to kvm_x86_ops
  KVM: VMX: make PLE window per-VCPU
  KVM: VMX: dynamise PLE window
  KVM: trace kvm_ple_window grow/shrink
  KVM: VMX: runtime knobs for dynamic PLE window
  KVM: VMX: optimize ple_window updates to VMCS

 arch/arm/kvm/arm.c              |   4 ++
 arch/mips/kvm/mips.c            |   4 ++
 arch/powerpc/kvm/powerpc.c      |   4 ++
 arch/s390/kvm/kvm-s390.c        |   4 ++
 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/svm.c              |   6 ++
 arch/x86/kvm/trace.h            |  30 ++++++
 arch/x86/kvm/vmx.c              | 147 ++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c              |   6 ++
 include/linux/kvm_host.h        |   2 +
 virt/kvm/kvm_main.c             |   2 +
 11 files changed, 207 insertions(+), 4 deletions(-)

-- 
2.1.0
[PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink
Tracepoint for dynamic PLE window, fired on every potential change. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/trace.h | 30 ++ arch/x86/kvm/vmx.c | 10 -- arch/x86/kvm/x86.c | 1 + 3 files changed, 39 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index e850a7d..1742dfb 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -848,6 +848,36 @@ TRACE_EVENT(kvm_track_tsc, __print_symbolic(__entry-host_clock, host_clocks)) ); +TRACE_EVENT(kvm_ple_window, + TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old), + TP_ARGS(grow, vcpu_id, new, old), + + TP_STRUCT__entry( + __field(bool, grow ) + __field(unsigned int, vcpu_id ) + __field( int, new ) + __field( int, old ) + ), + + TP_fast_assign( + __entry-grow = grow; + __entry-vcpu_id= vcpu_id; + __entry-new= new; + __entry-old= old; + ), + + TP_printk(vcpu %u: ple_window %d (%s %d), + __entry-vcpu_id, + __entry-new, + __entry-grow ? grow : shrink, + __entry-old) +); + +#define trace_kvm_ple_window_grow(vcpu_id, new, old) \ + trace_kvm_ple_window(true, vcpu_id, new, old) +#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \ + trace_kvm_ple_window(false, vcpu_id, new, old) + #endif /* CONFIG_X86_64 */ #endif /* _TRACE_KVM_H */ diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a2fa6ba..273cbd5 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5728,16 +5728,22 @@ static int __shrink_ple_window(int val, int modifier, int minimum) static void grow_ple_window(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); + int old = vmx-ple_window; - vmx-ple_window = __grow_ple_window(vmx-ple_window); + vmx-ple_window = __grow_ple_window(old); + + trace_kvm_ple_window_grow(vcpu-vcpu_id, vmx-ple_window, old); } static void shrink_ple_window(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); + int old = vmx-ple_window; - vmx-ple_window = __shrink_ple_window(vmx-ple_window, + vmx-ple_window = __shrink_ple_window(old, 
					      ple_window_shrink,
					      ple_window);
+
+	trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old);
 }
 
 /*
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5696ee7..814b20c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7648,3 +7648,4 @@
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
-- 
2.1.0
[PATCH v3 4/7] KVM: VMX: dynamise PLE window
Window is increased on every PLE exit and decreased on every sched_in. The idea is that we don't want to PLE exit if there is no preemption going on. We do this with sched_in() because it does not hold rq lock. There are two new kernel parameters for changing the window: ple_window_grow and ple_window_shrink ple_window_grow affects the window on PLE exit and ple_window_shrink does it on sched_in; depending on their value, the window is modifier like this: (ple_window is kvm_intel's global) ple_window_shrink/ | ple_window_grow| PLE exit | sched_in ---++- 1| = ple_window | = ple_window ple_window | *= ple_window_grow | /= ple_window_shrink otherwise | += ple_window_grow | -= ple_window_shrink A third new parameter, ple_window_max, controls the maximal ple_window; it is internally rounded down to a closest multiple of ple_window_grow. VCPU's PLE window is never allowed below ple_window. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/vmx.c | 87 -- 1 file changed, 85 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 18e0e52..a2fa6ba 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -125,14 +125,32 @@ module_param(nested, bool, S_IRUGO); * Time is measured based on a counter that runs at the same rate as the TSC, * refer SDM volume 3b section 21.6.13 22.1.3. */ -#define KVM_VMX_DEFAULT_PLE_GAP128 -#define KVM_VMX_DEFAULT_PLE_WINDOW 4096 +#define KVM_VMX_DEFAULT_PLE_GAP 128 +#define KVM_VMX_DEFAULT_PLE_WINDOW4096 +#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW 2 +#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0 +#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX\ + INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW + static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP; module_param(ple_gap, int, S_IRUGO); static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW; module_param(ple_window, int, S_IRUGO); +/* Default doubles per-vcpu window every exit. 
 */
+static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_grow, int, S_IRUGO);
+
+/* Default resets per-vcpu window every exit to ple_window. */
+static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
+module_param(ple_window_shrink, int, S_IRUGO);
+
+/* Default is to compute the maximum so we can never overflow. */
+static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+module_param(ple_window_max, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5679,12 +5697,73 @@ out:
 	return ret;
 }
 
+static int __grow_ple_window(int val)
+{
+	if (ple_window_grow < 1)
+		return ple_window;
+
+	val = min(val, ple_window_actual_max);
+
+	if (ple_window_grow < ple_window)
+		val *= ple_window_grow;
+	else
+		val += ple_window_grow;
+
+	return val;
+}
+
+static int __shrink_ple_window(int val, int modifier, int minimum)
+{
+	if (modifier < 1)
+		return ple_window;
+
+	if (modifier < ple_window)
+		val /= modifier;
+	else
+		val -= modifier;
+
+	return max(val, minimum);
+}
+
+static void grow_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->ple_window = __grow_ple_window(vmx->ple_window);
+}
+
+static void shrink_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->ple_window = __shrink_ple_window(vmx->ple_window,
+	                                      ple_window_shrink, ple_window);
+}
+
+/*
+ * ple_window_actual_max is computed to be one grow_ple_window() below
+ * ple_window_max. (See __grow_ple_window for the reason.)
+ * This prevents overflows, because ple_window_max is int.
+ * ple_window_max is effectively rounded down to a multiple of
+ * ple_window_grow in this process.
+ * ple_window_max is also prevented from setting vmx->ple_window below
+ * ple_window.
+ */
+static void update_ple_window_actual_max(void)
+{
+	ple_window_actual_max =
+			__shrink_ple_window(max(ple_window_max, ple_window),
+			                    ple_window_grow, INT_MIN);
+}
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	if (ple_gap)
+		grow_ple_window(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	kvm_vcpu_on_spin(vcpu);
 
@@ -8854,6 +8933,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 
 void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (ple_gap)
+		shrink_ple_window(vcpu);
 }
 
 static struct kvm_x86_ops
[PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window
ple_window is updated on every vmentry, so there is no reason to have it read-only anymore. ple_window* weren't writable to prevent runtime overflow races; they are prevented by a seqlock. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/vmx.c | 46 +- 1 file changed, 37 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 273cbd5..1318232 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -132,24 +132,29 @@ module_param(nested, bool, S_IRUGO); #define KVM_VMX_DEFAULT_PLE_WINDOW_MAX\ INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW +static struct kernel_param_ops param_ops_ple_int; +#define param_check_ple_int(name, p) __param_check(name, p, int) + +static DEFINE_SEQLOCK(ple_window_seqlock); + static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP; module_param(ple_gap, int, S_IRUGO); static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW; -module_param(ple_window, int, S_IRUGO); +module_param(ple_window, ple_int, S_IRUGO | S_IWUSR); /* Default doubles per-vcpu window every exit. */ static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW; -module_param(ple_window_grow, int, S_IRUGO); +module_param(ple_window_grow, ple_int, S_IRUGO | S_IWUSR); /* Default resets per-vcpu window every exit to ple_window. */ static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK; -module_param(ple_window_shrink, int, S_IRUGO); +module_param(ple_window_shrink, int, S_IRUGO | S_IWUSR); /* Default is to compute the maximum so we can never overflow. 
*/ static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX; static int ple_window_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX; -module_param(ple_window_max, int, S_IRUGO); +module_param(ple_window_max, ple_int, S_IRUGO | S_IWUSR); extern const ulong vmx_return; @@ -5729,8 +5734,12 @@ static void grow_ple_window(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); int old = vmx->ple_window; + unsigned seq; - vmx->ple_window = __grow_ple_window(old); + do { + seq = read_seqbegin(&ple_window_seqlock); + vmx->ple_window = __grow_ple_window(old); + } while (read_seqretry(&ple_window_seqlock, seq)); trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old); } @@ -5739,9 +5748,13 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); int old = vmx->ple_window; + unsigned seq; - vmx->ple_window = __shrink_ple_window(old, - ple_window_shrink, ple_window); + do { + seq = read_seqbegin(&ple_window_seqlock); + vmx->ple_window = __shrink_ple_window(old, ple_window_shrink, + ple_window); + } while (read_seqretry(&ple_window_seqlock, seq)); trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old); } @@ -5761,6 +5774,23 @@ static void update_ple_window_actual_max(void) ple_window_grow, INT_MIN); } +static int param_set_ple_int(const char *arg, const struct kernel_param *kp) +{ + int ret; + + write_seqlock(&ple_window_seqlock); + ret = param_set_int(arg, kp); + update_ple_window_actual_max(); + write_sequnlock(&ple_window_seqlock); + + return ret; +} + +static struct kernel_param_ops param_ops_ple_int = { + .set = param_set_ple_int, + .get = param_get_int, +}; + /* * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE * exiting, so only get here on cpu with PAUSE-Loop-Exiting. 
@@ -9164,8 +9194,6 @@ static int __init vmx_init(void) } else kvm_disable_tdp(); - update_ple_window_actual_max(); - return 0; out7: -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 7/7] KVM: VMX: optimize ple_window updates to VMCS
ple_window is preserved in the VMCS, so we only need to write it after a change. Do this by keeping a dirty bit. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/vmx.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1318232..f32415b 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -510,6 +510,7 @@ struct vcpu_vmx { /* Dynamic PLE window. */ int ple_window; + bool ple_window_dirty; }; enum segment_cache_field { @@ -4426,6 +4427,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc))); } + vmx->ple_window_dirty = ple_gap; if (ple_gap) { vmcs_write32(PLE_GAP, ple_gap); vmx->ple_window = ple_window; @@ -5741,6 +5743,9 @@ static void grow_ple_window(struct kvm_vcpu *vcpu) vmx->ple_window = __grow_ple_window(old); } while (read_seqretry(&ple_window_seqlock, seq)); + if (vmx->ple_window != old) + vmx->ple_window_dirty = true; + trace_kvm_ple_window_grow(vcpu->vcpu_id, vmx->ple_window, old); } @@ -5756,6 +5761,9 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu) ple_window); } while (read_seqretry(&ple_window_seqlock, seq)); + if (vmx->ple_window != old) + vmx->ple_window_dirty = true; + trace_kvm_ple_window_shrink(vcpu->vcpu_id, vmx->ple_window, old); } @@ -7505,8 +7513,10 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) if (vmx->emulation_required) return; - if (ple_gap) + if (vmx->ple_window_dirty) { + vmx->ple_window_dirty = false; vmcs_write32(PLE_WINDOW, vmx->ple_window); + } if (vmx->nested.sync_shadow_vmcs) { copy_vmcs12_to_shadow(vmx); -- 2.1.0
[PATCH v3 3/7] KVM: VMX: make PLE window per-VCPU
Change the PLE window into a per-VCPU variable, seeded from the module parameter, to allow greater flexibility. Brings in a small overhead on every vmentry. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/kvm/vmx.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2b306f9..18e0e52 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -484,6 +484,9 @@ struct vcpu_vmx { /* Support for a guest hypervisor (nested VMX) */ struct nested_vmx nested; + + /* Dynamic PLE window. */ + int ple_window; }; enum segment_cache_field { @@ -4402,7 +4405,7 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx) if (ple_gap) { vmcs_write32(PLE_GAP, ple_gap); - vmcs_write32(PLE_WINDOW, ple_window); + vmx->ple_window = ple_window; } vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0); @@ -7387,6 +7390,9 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu) if (vmx->emulation_required) return; + if (ple_gap) + vmcs_write32(PLE_WINDOW, vmx->ple_window); + if (vmx->nested.sync_shadow_vmcs) { copy_vmcs12_to_shadow(vmx); vmx->nested.sync_shadow_vmcs = false; -- 2.1.0
[PATCH v3 1/7] KVM: add kvm_arch_sched_in
Introduce preempt notifiers for architecture specific code. Advantage over creating a new notifier in every arch is slightly simpler code and guaranteed call order with respect to kvm_sched_in. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/arm/kvm/arm.c | 4 ++++ arch/mips/kvm/mips.c | 4 ++++ arch/powerpc/kvm/powerpc.c | 4 ++++ arch/s390/kvm/kvm-s390.c | 4 ++++ arch/x86/kvm/x86.c | 4 ++++ include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 2 ++ 7 files changed, 24 insertions(+) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index a99e0cd..9f788eb 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -288,6 +288,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { vcpu->cpu = cpu; diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index cd71141..2362df2 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -1002,6 +1002,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, struct kvm_translation *tr) { diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 4c79284..cbc432f 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -720,6 +720,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) kvmppc_subarch_vcpu_uninit(vcpu); } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { #ifdef CONFIG_BOOKE diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index ce81eb2..a3c324e 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -555,6 +555,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) /* Nothing todo */ } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { 
save_fp_ctl(&vcpu->arch.host_fpregs.fpc); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8f1e22d..d7c214f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7146,6 +7146,10 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) static_key_slow_dec(kvm_no_apic_vcpu); } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) { if (type) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index a4c33b3..ebd7236 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -624,6 +624,8 @@ void kvm_arch_exit(void); int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu); void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu); +void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu); + void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu); void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu); void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 33712fb..d3c3ed0 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -3123,6 +3123,8 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu) if (vcpu->preempted) vcpu->preempted = false; + kvm_arch_sched_in(vcpu, cpu); + kvm_arch_vcpu_load(vcpu, cpu); } -- 2.1.0
[PATCH v3 2/7] KVM: x86: introduce sched_in to kvm_x86_ops
sched_in preempt notifier is available for x86, allow its use in specific virtualization technologies as well. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/svm.c | 6 ++++++ arch/x86/kvm/vmx.c | 6 ++++++ arch/x86/kvm/x86.c | 1 + 4 files changed, 15 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 5724601..358e2f3 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -772,6 +772,8 @@ struct kvm_x86_ops { bool (*mpx_supported)(void); int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr); + + void (*sched_in)(struct kvm_vcpu *kvm, int cpu); }; struct kvm_arch_async_pf { diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ddf7427..4baf1bc 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -4305,6 +4305,10 @@ static void svm_handle_external_intr(struct kvm_vcpu *vcpu) local_irq_enable(); } +static void svm_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + static struct kvm_x86_ops svm_x86_ops = { .cpu_has_kvm_support = has_svm, .disabled_by_bios = is_disabled, @@ -4406,6 +4410,8 @@ static struct kvm_x86_ops svm_x86_ops = { .check_intercept = svm_check_intercept, .handle_external_intr = svm_handle_external_intr, + + .sched_in = svm_sched_in, }; static int __init svm_init(void) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index bfe11cf..2b306f9 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -8846,6 +8846,10 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu, return X86EMUL_CONTINUE; } +void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu) +{ +} + static struct kvm_x86_ops vmx_x86_ops = { .cpu_has_kvm_support = cpu_has_kvm_support, .disabled_by_bios = vmx_disabled_by_bios, @@ -8951,6 +8955,8 @@ static struct kvm_x86_ops vmx_x86_ops = { .mpx_supported = vmx_mpx_supported, .check_nested_events = vmx_check_nested_events, + + .sched_in = vmx_sched_in, }; static int __init vmx_init(void) diff --git 
a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index d7c214f..5696ee7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7148,6 +7148,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) { + kvm_x86_ops-sched_in(vcpu, cpu); } int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
2014-08-21 18:30+0200, Paolo Bonzini: On 21/08/2014 18:08, Radim Krčmář wrote: I'm not sure of the usefulness of patch 6, so I'm going to drop it. I'll keep it in my local junkyard branch in case it's going to be useful in some scenario I didn't think of. I've been using it to benchmark different values, because it is more convenient than reloading the module after shutting down guests. (And easier to sell than writing to kernel memory.) I don't think the additional code is worth it though. Patch 7 can be easily rebased, so no need to repost (and I might even squash it into patch 3, what do you think?). Yeah, the core is already a huge patch, so it does look weird without squashing. (No-one wants to rebase to that point anyway.) Thanks.
Re: [PATCH v3 1/7] KVM: add kvm_arch_sched_in
2014-08-22 00:19+0530, Raghavendra K T: On 08/21/2014 09:38 PM, Radim Krčmář wrote: Introduce preempt notifiers for architecture specific code. Advantage over creating a new notifier in every arch is slightly simpler code and guaranteed call order with respect to kvm_sched_in. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- Reviewed-by: Raghavendra KT raghavendra...@linux.vnet.ibm.com No surprise that ia64 doesn't show here :). Oh, totally forgot about its predicted comeback ... Paolo, do you want to fix this up? (It removes a double newline.) diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 0729ba6..1630624 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -1468,6 +1468,9 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) kfree(vcpu->arch.apic); } +void kvm_arch_sched_in(struct kvm_vcpu *vcpu) +{ +} long kvm_arch_vcpu_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) and I also would have liked static inlines (as indicated by Paolo). I'll send a patch that converts empty functions into static inlines, maybe return-0s too, soon(ish).
Re: [PATCH v3 4/7] KVM: VMX: dynamise PLE window
2014-08-22 00:40+0530, Raghavendra K T: On 08/21/2014 09:38 PM, Radim Krčmář wrote: Thanks for the nice patch. default of grow = 2 and shrink = 0 is very good, which aids fast clamping in overcommit and less ple_exits in undercommit. with a small concern over modifier (shrinker) value in patch 6, Reviewed-by: Raghavendra KT raghavendra...@linux.vnet.ibm.com Thank you for the review and testing. -#define KVM_VMX_DEFAULT_PLE_GAP 128 -#define KVM_VMX_DEFAULT_PLE_WINDOW 4096 +#define KVM_VMX_DEFAULT_PLE_GAP 128 +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096 +#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW 2 +#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0 +#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX \ +INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW Some observations: 1. I think a large ple_window, for e.g. 16MB, itself would be a very large value after which the ple_window change would not have much effect. Agreed, 16M is around 4ms@4GHz, holding a spinlock for that long is closer to a bug. So could KVM_VMX_DEFAULT_PLE_WINDOW_MAX be just KVM_VMX_DEFAULT_PLE_WINDOW^2 or something ? or otherwise.. I'd be perfectly content with a high default maximum like this, yet there isn't much point in having that default at all if we can't reach it in practice, so with the current default, we are at least ready for THz+ x86 intels :) I thought about defaulting it to some guessed fraction of the scheduler tick, but then, I wanted to merge at least something :) How about .. define KVM_VMX_DEFAULT_PLE_WINDOW_MAX = INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW = 524288 (2^19), too low IMO, no-overcommit scenarios seem to go higher quite often. (considering we don't allow default grow to be greater than default ple window). 2. can we consider min_ple_window value of 2k. tracing showed that in overcommit there were several occasions of 4k - 4k situations. Low values are more affected by KVM's overhead on every PLE exit, but benchmarks really decide this ... 
Having a separate ple_window_min would allow more tuning, so I can do that if there are benefits of lower windows. On the other hand, I thought that increasing the default window would be a good default, which would make a separate minimum even better. 3. Do you think using (<<, >>) instead of (*, /) saves some cycles, considering we could potentially have grow/shrink numbers != power of 2 ? * takes one or two cycles more than <<, so I wouldn't replace it, because it limits available grows a lot. (I don't expect we would set values above 5.) / is slow (around ten times slower than *), which is the reason why we avoid it by default, but I still prefer more options. +static int __grow_ple_window(int val) +{ + if (ple_window_grow < 1) why not ple_window_grow <= 1 ? Emergent mini-feature: have different static levels of ple_window, in combination with dynamic knobs. (set grow and shrink to 1, choose ple_window before starting a guest) And because you have to willingly set 1, I'm not aware of any advantage of '<= 1'. +static int __shrink_ple_window(int val, int modifier, int minimum) +{ + if (modifier < 1) why not modifier <= 1. Maybe this would address a concern in patch 6. '/= 1' gives no shrinking, which I considered as a feature -- you can easily search for the maximal achievable ple_window that way.
Re: [PATCH v3 6/7] KVM: VMX: runtime knobs for dynamic PLE window
2014-08-22 00:47+0530, Raghavendra K T: A positive thing about being able to change the default grow/shrink values is the flexibility of tuning the ple window to different workloads and different numbers of cpus. But is it that a value of shrink = 1 and grow > 1 is problematic ? (running an undercommit workload and then an overcommit workload) Yeah, could have mentioned it in the commit message, but I hope that my previous email explained it. (1 = don't shrink, design decision :)
Re: [PATCH v3 5/7] KVM: trace kvm_ple_window grow/shrink
2014-08-25 15:53+0200, Sabrina Dubroca: Hello, 2014-08-21, 18:08:09 +0200, Radim Krčmář wrote: Tracepoint for dynamic PLE window, fired on every potential change. +#define trace_kvm_ple_window_grow(vcpu_id, new, old) \ + trace_kvm_ple_window(true, vcpu_id, new, old) +#define trace_kvm_ple_window_shrink(vcpu_id, new, old) \ + trace_kvm_ple_window(false, vcpu_id, new, old) + #endif /* CONFIG_X86_64 */ Looks like these are needed on 32-bit as well. Today's linux-next doesn't build: [...] I moved the line #endif /* CONFIG_X86_64 */ above TRACE_EVENT(kvm_ple_window, and it builds. Thanks! Paolo, can you still fix this just by rebasing? --- diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 1742dfb..4c2868f 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -848,6 +848,8 @@ TRACE_EVENT(kvm_track_tsc, __print_symbolic(__entry->host_clock, host_clocks)) ); +#endif /* CONFIG_X86_64 */ + TRACE_EVENT(kvm_ple_window, TP_PROTO(bool grow, unsigned int vcpu_id, int new, int old), TP_ARGS(grow, vcpu_id, new, old), @@ -878,8 +880,6 @@ TRACE_EVENT(kvm_ple_window, #define trace_kvm_ple_window_shrink(vcpu_id, new, old) \ trace_kvm_ple_window(false, vcpu_id, new, old) -#endif /* CONFIG_X86_64 */ - #endif /* _TRACE_KVM_H */ #undef TRACE_INCLUDE_PATH
Re: [PATCH v3 0/7] Dynamic Pause Loop Exiting window.
2014-08-22 12:45+0800, Wanpeng Li: Hi Radim, On Thu, Aug 21, 2014 at 06:50:03PM +0200, Radim Krčmář wrote: 2014-08-21 18:30+0200, Paolo Bonzini: On 21/08/2014 18:08, Radim Krčmář wrote: I'm not sure of the usefulness of patch 6, so I'm going to drop it. I'll keep it in my local junkyard branch in case it's going to be useful in some scenario I didn't think of. I've been using it to benchmark different values, because it is more Is there any benchmark data for this patchset? Sorry, I already returned the testing machine and it wasn't quality benchmarking, so I haven't kept the results ... I used ebizzy and dbench, because ebizzy had a large difference between PLE on/off and dbench a minimal one (without overcommit), so one was looking for improvements while the other was checking regressions. (And they are easy to set up.) From what I remember, this patch had roughly 5x better performance with ebizzy on 60 VCPU guests and no obvious difference for dbench. (And improvement under overcommit was visible for both.) There was a significant reduction in %sys, which never rose much above 30%, as opposed to the original 90%+.
[PATCH 1/2] KVM: static inline empty kvm_arch functions
Using static inline is going to save a few bytes and cycles. For example on powerpc, the difference is 700 B after stripping. (5 kB before) This patch also deals with two overlooked empty functions: kvm_arch_flush_shadow was not removed from arch/mips/kvm/mips.c by 2df72e9bc (KVM: split kvm_arch_flush_shadow), and kvm_arch_sched_in never made it into arch/ia64/kvm/kvm-ia64.c in e790d9ef6 (KVM: add kvm_arch_sched_in). Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/arm/include/asm/kvm_host.h | 6 ++ arch/arm/kvm/arm.c | 19 arch/arm64/include/asm/kvm_host.h | 6 ++ arch/ia64/include/asm/kvm_host.h | 12 +++ arch/ia64/kvm/kvm-ia64.c | 30 -- arch/mips/include/asm/kvm_host.h | 11 ++ arch/mips/kvm/mips.c | 42 arch/powerpc/include/asm/kvm_host.h | 8 +++ arch/powerpc/kvm/powerpc.c | 29 - arch/s390/include/asm/kvm_host.h | 14 arch/s390/kvm/kvm-s390.c | 43 - 11 files changed, 57 insertions(+), 163 deletions(-) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 6dfb404..84291fe 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -233,4 +233,10 @@ static inline void vgic_arch_setup(const struct vgic_params *vgic) int kvm_perf_init(void); int kvm_perf_teardown(void); +static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_unsetup(void) {} +static inline void kvm_arch_sync_events(struct kvm *kvm) {} +static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} +static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} + #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 9f788eb..132bb0d 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -97,27 +97,16 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu) return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE; } -void kvm_arch_hardware_disable(void *garbage) -{ -} - int kvm_arch_hardware_setup(void) { return 0; } -void kvm_arch_hardware_unsetup(void) -{ -} - 
void kvm_arch_check_processor_compat(void *rtn) { *(int *)rtn = 0; } -void kvm_arch_sync_events(struct kvm *kvm) -{ -} /** * kvm_arch_init_vm - initializes a VM data structure @@ -284,14 +273,6 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) return 0; } -void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) -{ -} - -void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) -{ -} - void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { vcpu->cpu = cpu; diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index e10c45a..94d8a3c 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -244,4 +244,10 @@ static inline void vgic_arch_setup(const struct vgic_params *vgic) } } +static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_unsetup(void) {} +static inline void kvm_arch_sync_events(struct kvm *kvm) {} +static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} +static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu) {} + #endif /* __ARM64_KVM_HOST_H__ */ diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h index db95f57..353167d 100644 --- a/arch/ia64/include/asm/kvm_host.h +++ b/arch/ia64/include/asm/kvm_host.h @@ -595,6 +595,18 @@ void kvm_sal_emul(struct kvm_vcpu *vcpu); struct kvm *kvm_arch_alloc_vm(void); void kvm_arch_free_vm(struct kvm *kvm); +static inline void kvm_arch_sync_events(struct kvm *kvm) {} +static inline void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu) {} +static inline void kvm_arch_sched_in(struct kvm_vcpu *vcpu) {} +static inline void kvm_arch_free_memslot(struct kvm *kvm, + struct kvm_memory_slot *free, struct kvm_memory_slot *dont) {} +static inline void kvm_arch_memslots_updated(struct kvm *kvm) {} +static inline void kvm_arch_commit_memory_region(struct kvm *kvm, + struct kvm_userspace_memory_region *mem, + const struct kvm_memory_slot *old, + enum kvm_mr_change change) {} +static inline void 
kvm_arch_hardware_unsetup(void) {} + #endif /* __ASSEMBLY__*/ #endif diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 0729ba6..5e14dca 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -1364,10 +1364,6 @@ static void kvm_release_vm_pages(struct kvm *kvm) } } -void kvm_arch_sync_events(struct kvm *kvm) -{ -} - void kvm_arch_destroy_vm(struct kvm *kvm) { kvm_iommu_unmap_guest(kvm); @@ -1376,10 +1372,6 @@ void kvm_arch_destroy_vm(struct kvm *kvm) kvm_release_vm_pages(kvm); } -void
[PATCH 0/2] KVM: minor cleanup and optimizations
The first patch answers a demand for inline arch functions. (There are a lot of constant functions that could be inlined as well.) The second patch digs a bit into the history of KVM and removes a useless argument that seemed suspicious when preparing the first patch. Radim Krčmář (2): KVM: static inline empty kvm_arch functions KVM: remove garbage arg to *hardware_{en,dis}able arch/arm/include/asm/kvm_host.h | 6 + arch/arm/kvm/arm.c | 21 + arch/arm64/include/asm/kvm_host.h | 6 + arch/ia64/include/asm/kvm_host.h | 12 ++ arch/ia64/kvm/kvm-ia64.c | 34 ++-- arch/mips/include/asm/kvm_host.h | 11 + arch/mips/kvm/mips.c | 44 +--- arch/powerpc/include/asm/kvm_host.h | 8 +++ arch/powerpc/kvm/powerpc.c | 31 + arch/s390/include/asm/kvm_host.h | 14 arch/s390/kvm/kvm-s390.c | 45 + arch/x86/include/asm/kvm_host.h | 4 ++-- arch/x86/kvm/svm.c | 4 ++-- arch/x86/kvm/vmx.c | 4 ++-- arch/x86/kvm/x86.c | 12 +- include/linux/kvm_host.h | 4 ++-- virt/kvm/kvm_main.c | 4 ++-- 17 files changed, 79 insertions(+), 185 deletions(-) -- 2.1.0
[PATCH 2/2] KVM: remove garbage arg to *hardware_{en,dis}able
In the beginning was on_each_cpu(), which required an unused argument to kvm_arch_ops.hardware_{en,dis}able, but this was soon forgotten. Remove unnecessary arguments that stem from this. Signed-off-by: Radim Krčmář rkrc...@redhat.com --- arch/arm/include/asm/kvm_host.h | 2 +- arch/arm/kvm/arm.c | 2 +- arch/arm64/include/asm/kvm_host.h | 2 +- arch/ia64/kvm/kvm-ia64.c | 4 ++-- arch/mips/include/asm/kvm_host.h | 2 +- arch/mips/kvm/mips.c | 2 +- arch/powerpc/include/asm/kvm_host.h | 2 +- arch/powerpc/kvm/powerpc.c | 2 +- arch/s390/include/asm/kvm_host.h | 2 +- arch/s390/kvm/kvm-s390.c | 2 +- arch/x86/include/asm/kvm_host.h | 4 ++-- arch/x86/kvm/svm.c | 4 ++-- arch/x86/kvm/vmx.c | 4 ++-- arch/x86/kvm/x86.c | 12 ++-- include/linux/kvm_host.h | 4 ++-- virt/kvm/kvm_main.c | 4 ++-- 16 files changed, 27 insertions(+), 27 deletions(-) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 84291fe..278ecfa 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -233,7 +233,7 @@ static inline void vgic_arch_setup(const struct vgic_params *vgic) int kvm_perf_init(void); int kvm_perf_teardown(void); -static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_disable(void) {} static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 132bb0d..005a7b5 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -87,7 +87,7 @@ struct kvm_vcpu __percpu **kvm_get_running_vcpus(void) return kvm_arm_running_vcpu; } -int kvm_arch_hardware_enable(void *garbage) +int kvm_arch_hardware_enable(void) { return 0; } diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h index 94d8a3c..3816c52 100644 --- a/arch/arm64/include/asm/kvm_host.h +++ b/arch/arm64/include/asm/kvm_host.h @@ -244,7 +244,7 
@@ static inline void vgic_arch_setup(const struct vgic_params *vgic) } } -static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_disable(void) {} static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) {} diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c index 5e14dca..ec6b9ac 100644 --- a/arch/ia64/kvm/kvm-ia64.c +++ b/arch/ia64/kvm/kvm-ia64.c @@ -125,7 +125,7 @@ long ia64_pal_vp_create(u64 *vpd, u64 *host_iva, u64 *opt_handler) static DEFINE_SPINLOCK(vp_lock); -int kvm_arch_hardware_enable(void *garbage) +int kvm_arch_hardware_enable(void) { long status; long tmp_base; @@ -160,7 +160,7 @@ int kvm_arch_hardware_enable(void *garbage) return 0; } -void kvm_arch_hardware_disable(void *garbage) +void kvm_arch_hardware_disable(void) { long status; diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index b4d900a..2f63ca2 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -767,7 +767,7 @@ extern int kvm_mips_trans_mtc0(uint32_t inst, uint32_t *opc, extern void kvm_mips_dump_stats(struct kvm_vcpu *vcpu); extern unsigned long kvm_mips_get_ramsize(struct kvm *kvm); -static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_disable(void) {} static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_free_memslot(struct kvm *kvm, diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index 0ec7490..e3b21e5 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -77,7 +77,7 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu) return 1; } -int kvm_arch_hardware_enable(void *garbage) +int kvm_arch_hardware_enable(void) { return 0; } diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h index 0e646c1..b00113b 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -687,7 +687,7 @@ struct kvm_vcpu_arch { #define __KVM_HAVE_ARCH_WQP #define __KVM_HAVE_CREATE_DEVICE -static inline void kvm_arch_hardware_disable(void *garbage) {} +static inline void kvm_arch_hardware_disable(void) {} static inline void kvm_arch_hardware_unsetup(void) {} static inline void kvm_arch_sync_events(struct kvm *kvm) {} static inline void kvm_arch_memslots_updated(struct kvm *kvm) {} diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 72c3fc0..da50523 100644 --- a/arch/powerpc/kvm
Re: [PATCH] KVM: forward declare structs in kvm_types.h
2014-08-29 14:01+0200, Paolo Bonzini:
> Opaque KVM structs are useful for prototypes in asm/kvm_host.h, to
> avoid 'struct foo' declared inside parameter list warnings (and
> consequent breakage due to conflicting types).
>
> Move them from individual files to a generic place in
> linux/kvm_types.h.
>
> Signed-off-by: Paolo Bonzini pbonz...@redhat.com
> ---

Reviewed-by: Radim Krčmář rkrc...@redhat.com

(Inclusion of linux/types.h seems to be piggybacking, but is a nice
 thing to do :)

And I'd apply these changes: (Definitely the first hunk.)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 33d8d0a..e098dce 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -140,8 +140,6 @@ static inline bool is_error_page(struct page *page)
 #define KVM_USERSPACE_IRQ_SOURCE_ID		0
 #define KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID	1
 
-struct kvm;
-struct kvm_vcpu;
 extern struct kmem_cache *kvm_vcpu_cache;
 
 extern spinlock_t kvm_lock;
@@ -325,8 +323,6 @@ struct kvm_kernel_irq_routing_entry {
 	struct hlist_node link;
 };
 
-struct kvm_irq_routing_table;
-
 #ifndef KVM_PRIVATE_MEM_SLOTS
 #define KVM_PRIVATE_MEM_SLOTS 0
 #endif
@@ -1036,8 +1032,6 @@ static inline bool kvm_check_request(int req, struct kvm_vcpu *vcpu)
 
 extern bool kvm_rebooting;
 
-struct kvm_device_ops;
-
 struct kvm_device {
 	struct kvm_device_ops *ops;
 	struct kvm *kvm;
diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 1d6daca..53c4f20 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -19,7 +19,9 @@
 struct kvm;
 struct kvm_async_pf;
+struct kvm_device_ops;
 struct kvm_interrupt;
+struct kvm_irq_routing_table;
 struct kvm_memory_slot;
 struct kvm_one_reg;
 struct kvm_run;
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] virtio-rng: fix stuck in catting hwrng attributes
2014-09-14 10:25+0800, Amos Kong:
> On Sun, Sep 14, 2014 at 09:12:08AM +0800, Amos Kong wrote:
> ...
> > diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
> > index c591d7e..b5d1b6f 100644
> > --- a/drivers/char/hw_random/core.c
> > +++ b/drivers/char/hw_random/core.c
> > @@ -195,8 +195,7 @@ static ssize_t rng_dev_read(struct file *filp, char __user *buf,
> >  		mutex_unlock(&rng_mutex);
> >  
> > -		if (need_resched())
> > -			schedule_timeout_interruptible(1);
> > +		schedule_timeout_interruptible(10);

If cond_resched() does not work, it is a bug elsewhere.

> Problem only occurred in non-smp guest, we can improve it to:
>
> 	if (!is_smp())
> 		schedule_timeout_interruptible(10);
>
> is_smp() is only available for arm arch, we need a general one.

(It is num_online_cpus() > 1.)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Using the tlb flush util function where applicable
2014-09-12 17:06-0400, Liang Chen:
> Using kvm_mmu_flush_tlb() as the other places to make sure vcpu stat
> is incremented
>
> Signed-off-by: Liang Chen liangchen.li...@gmail.com
> ---

Good catch.

>  arch/x86/kvm/vmx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index bfe11cf..439682e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -1810,7 +1810,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> 		struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
> 		unsigned long sysenter_esp;
> 
> -		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
> +		kvm_mmu_flush_tlb(vcpu);
> 
> 		local_irq_disable();
> 		crash_disable_local_vmclear(cpu);

And to hijack this thread ... I noticed three other deficits in
stat.tlb_flush, patch below.  Do you prefer the current behavior?

--- 8< ---
KVM: x86: count actual tlb flushes

 - we count KVM_REQ_TLB_FLUSH requests, not actual flushes
   (KVM can have multiple requests for one flush)
 - flushes from kvm_flush_remote_tlbs() aren't counted
 - it's easy to make a direct request by mistake

Solve these by postponing the counting to kvm_check_request().

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
(And what about a possible followup patch that replaces
 kvm_mmu_flush_tlb() with kvm_make_request() again?  It would free the
 namespace a bit and we could call something similarly named from
 vcpu_enter_guest() to do the job.)
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..b41fd97 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3452,7 +3452,6 @@ static void nonpaging_init_context(struct kvm_vcpu *vcpu,
 
 void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu)
 {
-	++vcpu->stat.tlb_flush;
 	kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_flush_tlb);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 916e895..0f0ad08 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6041,8 +6041,10 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		}
 		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
 			kvm_mmu_sync_roots(vcpu);
-		if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
+		if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu)) {
+			++vcpu->stat.tlb_flush;
 			kvm_x86_ops->tlb_flush(vcpu);
+		}
 		if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
 			vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
 			r = 0;
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
2014-09-15 13:11-0700, Andres Lagar-Cavilla:
> +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,

The suffix '_retry' is not best suited for this.

On first reading, I imagined we will be retrying something from
before, possibly calling it in a loop, but we are actually doing the
first and last try in one call.

Hard to find something that conveys our lock-dropping mechanic,
'_polite' is my best candidate at the moment.

> +	int flags = FOLL_TOUCH | FOLL_HWPOISON |

(FOLL_HWPOISON wasn't used before, but it's harmless.)

2014-09-16 15:51+0200, Paolo Bonzini:
> Il 15/09/2014 22:11, Andres Lagar-Cavilla ha scritto:
> > @@ -1177,9 +1210,15 @@ static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> >  		npages = get_user_page_nowait(current, current->mm,
> >  					      addr, write_fault, page);
> >  		up_read(&current->mm->mmap_sem);
> > -	} else
> > -		npages = get_user_pages_fast(addr, 1, write_fault,
> > -					     page);
> > +	} else {
> > +		/*
> > +		 * By now we have tried gup_fast, and possible async_pf, and we
		                           ^
(If we really tried get_user_pages_fast, we wouldn't be here, so I'd
 prepend two underscores here as well.)

> > +		 * are certainly not atomic. Time to retry the gup, allowing
> > +		 * mmap semaphore to be relinquished in the case of IO.
> > +		 */
> > +		npages = kvm_get_user_page_retry(current, current->mm, addr,
> > +						 write_fault, page);
>
> This is a separate logical change.  Was this:
>
> 	down_read(&mm->mmap_sem);
> 	npages = get_user_pages(NULL, mm, addr, 1, 1, 0, NULL, NULL);
> 	up_read(&mm->mmap_sem);
>
> the intention rather than get_user_pages_fast?

I believe so as well.
(Looking at get_user_pages_fast and __get_user_pages_fast made my
 abstraction detector very sad.)

I think a first patch should introduce kvm_get_user_page_retry (Retry
a fault after a gup with FOLL_NOWAIT.) and the second would add
FOLL_TRIED (This properly relinquishes mmap semaphore if the
filemap/swap has to wait on page lock (and retries the gup to
completion after that).

Not sure if that would help to understand the goal ...
> Apart from this, the patch looks good.  The mm/ parts are minimal, so
> I think it's best to merge it through the KVM tree with someone's
> Acked-by.

I would prefer to have the last hunk in a separate patch, but still,

Acked-by: Radim Krčmář rkrc...@redhat.com
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
[Emergency posting to fix the tag and couldn't find unmangled Cc list,
 so some recipients were dropped, sorry.  (I guess you are glad
 though).]

2014-09-16 14:01-0700, Andres Lagar-Cavilla:
> On Tue, Sep 16, 2014 at 1:51 PM, Radim Krčmář rkrc...@redhat.com wrote:
> > 2014-09-15 13:11-0700, Andres Lagar-Cavilla:
> > > +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> >
> > The suffix '_retry' is not best suited for this.
> >
> > On first reading, I imagined we will be retrying something from
> > before, possibly calling it in a loop, but we are actually doing the
> > first and last try in one call.
>
> We are doing ... the second and third in most scenarios.  async_pf
> did the first with _NOWAIT.  We call this from the async pf retrier,
> or if async pf couldn't be notified to the guest.

I was thinking more about what the function does, not how we currently
use it -- nothing prevents us from using it as first somewhere -- but
yeah, even comments would be off then.

> > Apart from this, the patch looks good.  The mm/ parts are minimal,
> > so I think it's best to merge it through the KVM tree with
> > someone's Acked-by.
> >
> > I would prefer to have the last hunk in a separate patch, but still,
> >
> > Acked-by: Radim Krčmář rkrc...@redhat.com
>
> Awesome, thanks much.  I'll recut with the VM_BUG_ON from Paolo and
> your Ack.  LMK if anything else from this email should go into the
> recut.

Ah, sorry, I'm not maintaining mm ... what I meant was
Reviewed-by: Radim Krčmář rkrc...@redhat.com

and I had to leave before I could find a good apology for
VM_WARN_ON_ONCE(), so if you are replacing BUG_ON, you might want to
look at that one as well.
Re: [PATCH] Using the tlb flush util function where applicable
2014-09-17 08:15+0800, Wanpeng Li:
> Hi Radim,
>
> On Mon, Sep 15, 2014 at 09:33:52PM +0200, Radim Krčmář wrote:
> > Do you prefer the current behavior?
> >
> > --- 8< ---
> > KVM: x86: count actual tlb flushes
> >
> >  - we count KVM_REQ_TLB_FLUSH requests, not actual flushes
>
> So there maybe multiple requests accumulated at the point of
> kvm_check_request, if your patch account these accumulations
> correctly?

It will ignore request accumulations and count it as one TLB flush;
we have to decide what is correct (the value is just statistics):

 a) count local KVM_REQ_TLB_FLUSH requests
 b) count all TLB flushes
 c) both (a) and (b)

I was thinking that when you look at /sys/kernel/debug/kvm/tlb_flush,
you are interested in the number of TLB flushes that VMs did, not
requests, so you won't have to add remote_tlb_flushes multiplied by
maximal vcpu count and guess their amount from this upper bound.
And that we don't even care about requests, so (c) is just a
complication.

---
I tried to get an idea about the number of coalesced requests and
added a counter called tlb_flush_real, making option (c).
After a night of reading virtio-rng (different experiment) on 1 VCPU VM:

  # cat /sys/kernel/debug/kvm/{tlb_flush{,_real},remote_tlb_flush}
  5927
  5206
  0
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
2014-09-17 13:26+0300, Gleb Natapov:
> For async_pf_execute() you do not need to even retry.  Next guest's
> page fault will retry it for you.

Wouldn't that be a waste of vmentries?

The guest might be able to handle interrupts while we are waiting, so
if we used async-io-done notifier, this could work without CPU
spinning.  (Probably with added latency.)
Re: [PATCH] kvm: Faults which trigger IO release the mmap_sem
[Repost for lists, the last mail was eaten by a security troll.]

2014-09-16 14:01-0700, Andres Lagar-Cavilla:
> On Tue, Sep 16, 2014 at 1:51 PM, Radim Krčmář rkrc...@redhat.com wrote:
> > 2014-09-15 13:11-0700, Andres Lagar-Cavilla:
> > > +int kvm_get_user_page_retry(struct task_struct *tsk, struct mm_struct *mm,
> >
> > The suffix '_retry' is not best suited for this.
> >
> > On first reading, I imagined we will be retrying something from
> > before, possibly calling it in a loop, but we are actually doing the
> > first and last try in one call.
>
> We are doing ... the second and third in most scenarios.  async_pf
> did the first with _NOWAIT.  We call this from the async pf retrier,
> or if async pf couldn't be notified to the guest.

I was thinking more about what the function does, not how we currently
use it -- nothing prevents us from using it as first somewhere -- but
yeah, even comments would be off then.

> > Apart from this, the patch looks good.  The mm/ parts are minimal,
> > so I think it's best to merge it through the KVM tree with
> > someone's Acked-by.
> >
> > I would prefer to have the last hunk in a separate patch, but still,
> >
> > Acked-by: Radim Krčmář rkrc...@redhat.com
>
> Awesome, thanks much.  I'll recut with the VM_BUG_ON from Paolo and
> your Ack.  LMK if anything else from this email should go into the
> recut.

Ah, sorry, I'm not maintaining mm ... what I meant was
Reviewed-by: Radim Krčmář rkrc...@redhat.com

and I had to leave before I could find a good apology for
VM_WARN_ON_ONCE(), so if you are replacing BUG_ON, you might want to
look at that one as well.
Re: [RESEND PATCH 1/3] x86: Adding structs to reflect cpuid fields
2014-09-17 16:06+0200, Borislav Petkov:
> On Wed, Sep 17, 2014 at 04:53:39PM +0300, Nadav Amit wrote:
> > AFAIK backward compatibility is usually maintained in x86.  I did
> > not see in Intel SDM anything that says this CPUID field means
> > something for CPU X and something else for CPU Y.  Anyhow, it is
> > not different than bitmasks in this respect.
>
> You still don't get my point: what are you going to do when
> min_monitor_line_size needs to be 17 bits all of a sudden?
>
> Currently, you simply do an if-else check before using the respective
> mask and with your defined structs, you need to keep two versions:
>
> union cpuid5_ebx_before_family_X {
> 	struct {
> 		unsigned int max_monitor_line_size:16;
> 		unsigned int reserved:16;
> 	} split;
> 	unsigned int full;
> };
>
> union cpuid5_ebx_after_family_X {
> 	struct {
> 		unsigned int max_monitor_line_size:17;
> 		unsigned int reserved:15;
> 	} split;
> 	unsigned int full;
> };

New union wouldn't be very convenient if the change touched just a
small part of the register ... probably the best choice is using
anonymous elements like this,

union cpuid5_ebx {
	union {
		struct {
			unsigned int max_monitor_line_size:16;
			unsigned int reserved:16;
		};
		struct {
			unsigned int max_monitor_line_size_after_family_X:17;
			unsigned int reserved_after_family_X:15;
		};
	} split;
	unsigned int full;
};

which would result in a similar if-else hack

	if (family X)
		ebx.split.max_monitor_line_size_after_family_X = 0;
	else
		ebx.split.max_monitor_line_size = 0;

other options are

	ebx.split.after_family_X.max_monitor_line_size

or even

	ebx.split.max_monitor_line_size.after_family_X

Flat namespace is more flexible wrt. code.