Re: [Xen-devel] [PATCH-tip v2 2/2] x86/xen: Deprecate xen_nopvspin
On 11/01/2017 06:01 PM, Boris Ostrovsky wrote: > On 11/01/2017 04:58 PM, Waiman Long wrote: >> +/* TODO: To be removed in a future kernel version */ >> static __init int xen_parse_nopvspin(char *arg) >> { >> -xen_pvspin = false; >> +pr_warn("xen_nopvspin is deprecated, replace it with >> \"pvlock_type=queued\"!\n"); >> +if (!pv_spinlock_type) >> +pv_spinlock_type = locktype_queued; > Since we currently end up using unfair locks and because you are > deprecating xen_nopvspin I wonder whether it would be better to set this > to locktype_unfair so that current behavior doesn't change. (Sorry, I > haven't responded to your earlier message before you posted this). Juergen? I think the latest patch from Juergen in tip is to use native qspinlock when xen_nopvspin is specified. Right? That is why I made the current choice. I can certainly change to unfair if it is what you guys want. > I am also not sure I agree with making pv_spinlock an enum *and* a > bitmask at the same time. I understand that it makes checks easier but I > think not assuming a value or a pattern would be better, especially > since none of the uses is on a critical path. > > (For example, !pv_spinlock_type is the same as locktype_auto, which is > defined but never used) OK, I will take out the enum and make explicit use of locktype_auto. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH-tip v2 0/2] x86/paravirt: Enable users to choose PV lock type
v1->v2: - Make pv_spinlock_type a bit mask for easier checking. - Add patch 2 to deprecate xen_nopvspin v1 - https://lkml.org/lkml/2017/11/1/381 Patch 1 adds a new pvlock_type parameter for the administrators to specify the type of lock to be used in a para-virtualized kernel. Patch 2 deprecates Xen's xen_nopvspin parameter as it is no longer needed. Waiman Long (2): x86/paravirt: Add kernel parameter to choose paravirt lock type x86/xen: Deprecate xen_nopvspin Documentation/admin-guide/kernel-parameters.txt | 11 --- arch/x86/include/asm/paravirt.h | 9 ++ arch/x86/kernel/kvm.c | 3 ++ arch/x86/kernel/paravirt.c | 40 - arch/x86/xen/spinlock.c | 17 +-- 5 files changed, 65 insertions(+), 15 deletions(-) -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH-tip v2 1/2] x86/paravirt: Add kernel parameter to choose paravirt lock type
Currently, there are 3 different lock types that can be chosen for the x86 architecture: - qspinlock - pvqspinlock - unfair lock One of the above lock types will be chosen at boot time depending on a number of different factors. Ideally, the hypervisors should be able to pick the best performing lock type for the current VM configuration. That is not currently the case as the performance of each lock type are affected by many different factors like the number of vCPUs in the VM, the amount vCPU overcommitment, the CPU type and so on. Generally speaking, unfair lock performs well for VMs with a small number of vCPUs. Native qspinlock may perform better than pvqspinlock if there is vCPU pinning and there is no vCPU over-commitment. This patch adds a new kernel parameter to allow administrator to choose the paravirt spinlock type to be used. VM administrators can experiment with the different lock types and choose one that can best suit their need, if they want to. Hypervisor developers can also use that to experiment with different lock types so that they can come up with a better algorithm to pick the best lock type. The hypervisor paravirt spinlock code will override this new parameter in determining if pvqspinlock should be used. The parameter, however, will override Xen's xen_nopvspin in term of disabling unfair lock. Signed-off-by: Waiman Long <long...@redhat.com> --- Documentation/admin-guide/kernel-parameters.txt | 7 + arch/x86/include/asm/paravirt.h | 9 ++ arch/x86/kernel/kvm.c | 3 ++ arch/x86/kernel/paravirt.c | 40 - arch/x86/xen/spinlock.c | 12 5 files changed, 65 insertions(+), 6 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index f7df49d..c98d9c7 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3275,6 +3275,13 @@ [KNL] Number of legacy pty's. Overwrites compiled-in default number. + pvlock_type=[X86,PV_OPS] + Specify the paravirt spinlock type to be used. + Options are: + queued - native queued spinlock + pv - paravirt queued spinlock + unfair - simple TATAS unfair lock + quiet [KNL] Disable most log messages r128= [HW,DRM] diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 12deec7..c8f9ad9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -690,6 +690,15 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu) #endif /* SMP && PARAVIRT_SPINLOCKS */ +enum pv_spinlock_type { + locktype_auto = 0, + locktype_queued = 1, + locktype_paravirt = 2, + locktype_unfair = 4, +}; + +extern enum pv_spinlock_type pv_spinlock_type; + #ifdef CONFIG_X86_32 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;" #define PV_RESTORE_REGS "popl %edx; popl %ecx;" diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8bb9594..3faee63 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -646,6 +646,9 @@ void __init kvm_spinlock_init(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return; + if (pv_spinlock_type & (locktype_queued|locktype_unfair)) + return; + __pv_init_lock_hash(); pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath; pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 041096b..aee6756 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -115,11 +115,48 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target, return 5; } +/* + * The kernel argument "pvlock_type=" can be used to explicitly specify + * which type of spinlocks to be used. Currently, there are 3 options: + * 1) queued - the native queued spinlock + * 2) pv - the paravirt queued spinlock (if CONFIG_PARAVIRT_SPINLOCKS) + * 3) unfair - the simple TATAS unfair lock + * + * If this argument is not specified, the kernel will automatically choose + * an appropriate one depending on X86_FEATURE_HYPERVISOR and hypervisor + * specific settings. + */ +enum pv_spinlock_type __read_mostly pv_spinlock_type = locktype_auto; + +static int __init pvlock_setup(char *s) +{ + if (!s) + return -EINVAL; + + if (!strcmp(s, "queued")) + pv_spinlock_type = locktype_queued; + else if (!strcmp(s, "pv")) + pv_spinlock_type = locktype_paravirt; + else if (!strcmp(s, "u
[Xen-devel] [PATCH-tip v2 2/2] x86/xen: Deprecate xen_nopvspin
With the new pvlock_type kernel parameter, xen_nopvspin is no longer needed. This patch deprecates the xen_nopvspin parameter by removing its documentation and treating it as an alias of "pvlock_type=queued". Signed-off-by: Waiman Long <long...@redhat.com> --- Documentation/admin-guide/kernel-parameters.txt | 4 arch/x86/xen/spinlock.c | 19 +++ 2 files changed, 7 insertions(+), 16 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index c98d9c7..683a817 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4596,10 +4596,6 @@ the unplug protocol never -- do not unplug even if version check succeeds - xen_nopvspin[X86,XEN] - Disables the ticketlock slowpath using Xen PV - optimizations. - xen_nopv[X86] Disables the PV optimizations forcing the HVM guest to run as generic HVM guest with no PV drivers. diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index d5f79ac..19e2e75 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -20,7 +20,6 @@ static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; static DEFINE_PER_CPU(char *, irq_name); -static bool xen_pvspin = true; #include @@ -81,12 +80,8 @@ void xen_init_lock_cpu(int cpu) int irq; char *name; - if (!xen_pvspin || - (pv_spinlock_type & (locktype_queued|locktype_unfair))) { - if ((cpu == 0) && !pv_spinlock_type) - static_branch_disable(_spin_lock_key); + if (pv_spinlock_type & (locktype_queued|locktype_unfair)) return; - } WARN(per_cpu(lock_kicker_irq, cpu) >= 0, "spinlock on CPU%d exists on IRQ%d!\n", cpu, per_cpu(lock_kicker_irq, cpu)); @@ -110,8 +105,7 @@ void xen_init_lock_cpu(int cpu) void xen_uninit_lock_cpu(int cpu) { - if (!xen_pvspin || - (pv_spinlock_type & (locktype_queued|locktype_unfair))) + if (pv_spinlock_type & (locktype_queued|locktype_unfair)) return; unbind_from_irqhandler(per_cpu(lock_kicker_irq, cpu), NULL); @@ -132,8 +126,7 @@ void xen_uninit_lock_cpu(int cpu) */ void __init xen_init_spinlocks(void) { - if (!xen_pvspin || - (pv_spinlock_type & (locktype_queued|locktype_unfair))) { + if (pv_spinlock_type & (locktype_queued|locktype_unfair)) { printk(KERN_DEBUG "xen: PV spinlocks disabled\n"); return; } @@ -147,10 +140,12 @@ void __init xen_init_spinlocks(void) pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen); } +/* TODO: To be removed in a future kernel version */ static __init int xen_parse_nopvspin(char *arg) { - xen_pvspin = false; + pr_warn("xen_nopvspin is deprecated, replace it with \"pvlock_type=queued\"!\n"); + if (!pv_spinlock_type) + pv_spinlock_type = locktype_queued; return 0; } early_param("xen_nopvspin", xen_parse_nopvspin); - -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type
On 11/01/2017 03:01 PM, Boris Ostrovsky wrote: > On 11/01/2017 12:28 PM, Waiman Long wrote: >> On 11/01/2017 11:51 AM, Juergen Gross wrote: >>> On 01/11/17 16:32, Waiman Long wrote: >>>> Currently, there are 3 different lock types that can be chosen for >>>> the x86 architecture: >>>> >>>> - qspinlock >>>> - pvqspinlock >>>> - unfair lock >>>> >>>> One of the above lock types will be chosen at boot time depending on >>>> a number of different factors. >>>> >>>> Ideally, the hypervisors should be able to pick the best performing >>>> lock type for the current VM configuration. That is not currently >>>> the case as the performance of each lock type are affected by many >>>> different factors like the number of vCPUs in the VM, the amount vCPU >>>> overcommitment, the CPU type and so on. >>>> >>>> Generally speaking, unfair lock performs well for VMs with a small >>>> number of vCPUs. Native qspinlock may perform better than pvqspinlock >>>> if there is vCPU pinning and there is no vCPU over-commitment. >>>> >>>> This patch adds a new kernel parameter to allow administrator to >>>> choose the paravirt spinlock type to be used. VM administrators can >>>> experiment with the different lock types and choose one that can best >>>> suit their need, if they want to. Hypervisor developers can also use >>>> that to experiment with different lock types so that they can come >>>> up with a better algorithm to pick the best lock type. >>>> >>>> The hypervisor paravirt spinlock code will override this new parameter >>>> in determining if pvqspinlock should be used. The parameter, however, >>>> will override Xen's xen_nopvspin in term of disabling unfair lock. >>> Hmm, I'm not sure we need pvlock_type _and_ xen_nopvspin. What do others >>> think? >> I don't think we need xen_nopvspin, but I don't want to remove that >> without agreement from the community. > I also don't think xen_nopvspin will be needed after pvlock_type is > introduced. > > -boris Another reason that I didn't try to remove xen_nopvspin is backward compatibility concern. One way to handle it is to depreciate it and treat it as an alias to pvlock_type=queued. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type
On 11/01/2017 11:51 AM, Juergen Gross wrote: > On 01/11/17 16:32, Waiman Long wrote: >> Currently, there are 3 different lock types that can be chosen for >> the x86 architecture: >> >> - qspinlock >> - pvqspinlock >> - unfair lock >> >> One of the above lock types will be chosen at boot time depending on >> a number of different factors. >> >> Ideally, the hypervisors should be able to pick the best performing >> lock type for the current VM configuration. That is not currently >> the case as the performance of each lock type are affected by many >> different factors like the number of vCPUs in the VM, the amount vCPU >> overcommitment, the CPU type and so on. >> >> Generally speaking, unfair lock performs well for VMs with a small >> number of vCPUs. Native qspinlock may perform better than pvqspinlock >> if there is vCPU pinning and there is no vCPU over-commitment. >> >> This patch adds a new kernel parameter to allow administrator to >> choose the paravirt spinlock type to be used. VM administrators can >> experiment with the different lock types and choose one that can best >> suit their need, if they want to. Hypervisor developers can also use >> that to experiment with different lock types so that they can come >> up with a better algorithm to pick the best lock type. >> >> The hypervisor paravirt spinlock code will override this new parameter >> in determining if pvqspinlock should be used. The parameter, however, >> will override Xen's xen_nopvspin in term of disabling unfair lock. > Hmm, I'm not sure we need pvlock_type _and_ xen_nopvspin. What do others > think? I don't think we need xen_nopvspin, but I don't want to remove that without agreement from the community. >> DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key); >> >> void __init native_pv_lock_init(void) >> { >> -if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >> +if (pv_spinlock_type == locktype_unfair) >> +return; >> + >> +if (!static_cpu_has(X86_FEATURE_HYPERVISOR) || >> + (pv_spinlock_type != locktype_auto)) >> static_branch_disable(_spin_lock_key); > Really? I don't think locktype_paravirt should disable the static key. With paravirt spinlock, it doesn't matter if the static key is disabled or not. Without CONFIG_PARAVIRT_SPINLOCKS, however, it does degenerate into the native qspinlock. So you are right, I should check for paravirt type as well. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type
Currently, there are 3 different lock types that can be chosen for the x86 architecture: - qspinlock - pvqspinlock - unfair lock One of the above lock types will be chosen at boot time depending on a number of different factors. Ideally, the hypervisors should be able to pick the best performing lock type for the current VM configuration. That is not currently the case as the performance of each lock type are affected by many different factors like the number of vCPUs in the VM, the amount vCPU overcommitment, the CPU type and so on. Generally speaking, unfair lock performs well for VMs with a small number of vCPUs. Native qspinlock may perform better than pvqspinlock if there is vCPU pinning and there is no vCPU over-commitment. This patch adds a new kernel parameter to allow administrator to choose the paravirt spinlock type to be used. VM administrators can experiment with the different lock types and choose one that can best suit their need, if they want to. Hypervisor developers can also use that to experiment with different lock types so that they can come up with a better algorithm to pick the best lock type. The hypervisor paravirt spinlock code will override this new parameter in determining if pvqspinlock should be used. The parameter, however, will override Xen's xen_nopvspin in term of disabling unfair lock. Signed-off-by: Waiman Long <long...@redhat.com> --- Documentation/admin-guide/kernel-parameters.txt | 7 + arch/x86/include/asm/paravirt.h | 9 ++ arch/x86/kernel/kvm.c | 4 +++ arch/x86/kernel/paravirt.c | 40 - arch/x86/xen/spinlock.c | 6 ++-- 5 files changed, 62 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index f7df49d..c98d9c7 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3275,6 +3275,13 @@ [KNL] Number of legacy pty's. Overwrites compiled-in default number. + pvlock_type=[X86,PV_OPS] + Specify the paravirt spinlock type to be used. + Options are: + queued - native queued spinlock + pv - paravirt queued spinlock + unfair - simple TATAS unfair lock + quiet [KNL] Disable most log messages r128= [HW,DRM] diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 12deec7..941a046 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -690,6 +690,15 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu) #endif /* SMP && PARAVIRT_SPINLOCKS */ +enum pv_spinlock_type { + locktype_auto, + locktype_queued, + locktype_paravirt, + locktype_unfair, +}; + +extern enum pv_spinlock_type pv_spinlock_type; + #ifdef CONFIG_X86_32 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;" #define PV_RESTORE_REGS "popl %edx; popl %ecx;" diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8bb9594..3a5d3ec4 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -646,6 +646,10 @@ void __init kvm_spinlock_init(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return; + if ((pv_spinlock_type == locktype_queued) || + (pv_spinlock_type == locktype_unfair)) + return; + __pv_init_lock_hash(); pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath; pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 041096b..ca35cd3 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -115,11 +115,48 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target, return 5; } +/* + * The kernel argument "pvlock_type=" can be used to explicitly specify + * which type of spinlocks to be used. Currently, there are 3 options: + * 1) queued - the native queued spinlock + * 2) pv - the paravirt queued spinlock (if CONFIG_PARAVIRT_SPINLOCKS) + * 3) unfair - the simple TATAS unfair lock + * + * If this argument is not specified, the kernel will automatically choose + * an appropriate one depending on X86_FEATURE_HYPERVISOR and hypervisor + * specific settings. + */ +enum pv_spinlock_type __read_mostly pv_spinlock_type = locktype_auto; + +static int __init pvlock_setup(char *s) +{ + if (!s) + return -EINVAL; + + if (!strcmp(s, "queued")) + pv_spinlock_type = locktype_queued; + else if (!strcmp(s, "pv")) + pv_spinlock_type = locktype_paravirt; + else if (!strcmp(
Re: [Xen-devel] [PATCH v3 0/2] guard virt_spin_lock() with a static key
On 09/25/2017 09:59 AM, Juergen Gross wrote: > Ping? > > On 06/09/17 19:36, Juergen Gross wrote: >> With virt_spin_lock() being guarded by a static key the bare metal case >> can be optimized by patching the call away completely. In case a kernel >> running as a guest it can decide whether to use paravitualized >> spinlocks, the current fallback to the unfair test-and-set scheme, or >> to mimic the bare metal behavior. >> >> V3: >> - remove test for hypervisor environment from virt_spin_lock(9 as >> suggested by Waiman Long >> >> V2: >> - use static key instead of making virt_spin_lock() a pvops function >> >> Juergen Gross (2): >> paravirt/locks: use new static key for controlling call of >> virt_spin_lock() >> paravirt,xen: correct xen_nopvspin case >> >> arch/x86/include/asm/qspinlock.h | 11 ++- >> arch/x86/kernel/paravirt-spinlocks.c | 6 ++ >> arch/x86/kernel/smpboot.c| 2 ++ >> arch/x86/xen/spinlock.c | 2 ++ >> kernel/locking/qspinlock.c | 4 >> 5 files changed, 24 insertions(+), 1 deletion(-) >> Acked-by: Waiman Long <long...@redhat.com> ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2 1/2] paravirt/locks: use new static key for controlling call of virt_spin_lock()
On 09/06/2017 12:04 PM, Peter Zijlstra wrote: > On Wed, Sep 06, 2017 at 11:49:49AM -0400, Waiman Long wrote: >>> #define virt_spin_lock virt_spin_lock >>> static inline bool virt_spin_lock(struct qspinlock *lock) >>> { >>> + if (!static_branch_likely(_spin_lock_key)) >>> + return false; >>> if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>> return false; >>> > Now native has two NOPs instead of one. Can't we merge these two static > branches? I guess we can remove the static_cpu_has() call. Just that any spin_lock calls before native_pv_lock_init() will use the virt_spin_lock(). That is still OK as the init call is done before SMP starts. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2 1/2] paravirt/locks: use new static key for controlling call of virt_spin_lock()
On 09/06/2017 11:29 AM, Juergen Gross wrote: > There are cases where a guest tries to switch spinlocks to bare metal > behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this > has the downside of falling back to unfair test and set scheme for > qspinlocks due to virt_spin_lock() detecting the virtualized > environment. > > Add a static key controlling whether virt_spin_lock() should be > called or not. When running on bare metal set the new key to false. > > Signed-off-by: Juergen Gross <jgr...@suse.com> > --- > arch/x86/include/asm/qspinlock.h | 11 +++ > arch/x86/kernel/paravirt-spinlocks.c | 6 ++ > arch/x86/kernel/smpboot.c| 2 ++ > kernel/locking/qspinlock.c | 4 > 4 files changed, 23 insertions(+) > > diff --git a/arch/x86/include/asm/qspinlock.h > b/arch/x86/include/asm/qspinlock.h > index 48a706f641f2..fc39389f196b 100644 > --- a/arch/x86/include/asm/qspinlock.h > +++ b/arch/x86/include/asm/qspinlock.h > @@ -1,6 +1,7 @@ > #ifndef _ASM_X86_QSPINLOCK_H > #define _ASM_X86_QSPINLOCK_H > > +#include > #include > #include > #include > @@ -46,9 +47,15 @@ static inline void queued_spin_unlock(struct qspinlock > *lock) > #endif > > #ifdef CONFIG_PARAVIRT > +DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key); > + > +void native_pv_lock_init(void) __init; > + > #define virt_spin_lock virt_spin_lock > static inline bool virt_spin_lock(struct qspinlock *lock) > { > + if (!static_branch_likely(_spin_lock_key)) > + return false; > if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > return false; > > @@ -65,6 +72,10 @@ static inline bool virt_spin_lock(struct qspinlock *lock) > > return true; > } > +#else > +static inline void native_pv_lock_init(void) > +{ > +} > #endif /* CONFIG_PARAVIRT */ > > #include > diff --git a/arch/x86/kernel/paravirt-spinlocks.c > b/arch/x86/kernel/paravirt-spinlocks.c > index 8f2d1c9d43a8..2fc65ddea40d 100644 > --- a/arch/x86/kernel/paravirt-spinlocks.c > +++ b/arch/x86/kernel/paravirt-spinlocks.c > @@ -42,3 +42,9 @@ struct pv_lock_ops pv_lock_ops = { > #endif /* SMP */ > }; > EXPORT_SYMBOL(pv_lock_ops); > + > +void __init native_pv_lock_init(void) > +{ > + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > + static_branch_disable(_spin_lock_key); > +} > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 54b9e89d4d6b..21500d3ba359 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -77,6 +77,7 @@ > #include > #include > #include > +#include > > /* Number of siblings per CPU package */ > int smp_num_siblings = 1; > @@ -1381,6 +1382,7 @@ void __init native_smp_prepare_boot_cpu(void) > /* already set me in cpu_online_mask in boot_cpu_init() */ > cpumask_set_cpu(me, cpu_callout_mask); > cpu_set_state_online(me); > + native_pv_lock_init(); > } > > void __init native_smp_cpus_done(unsigned int max_cpus) > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c > index 294294c71ba4..838d235b87ef 100644 > --- a/kernel/locking/qspinlock.c > +++ b/kernel/locking/qspinlock.c > @@ -76,6 +76,10 @@ > #define MAX_NODES4 > #endif > > +#ifdef CONFIG_PARAVIRT > +DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key); > +#endif > + > /* > * Per-CPU queue node structures; we can never have more than 4 nested > * contexts: task, softirq, hardirq, nmi. Acked-by: Waiman Long <long...@redhat.com> ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2 2/2] paravirt, xen: correct xen_nopvspin case
On 09/06/2017 11:29 AM, Juergen Gross wrote: > With the boot parameter "xen_nopvspin" specified a Xen guest should not > make use of paravirt spinlocks, but behave as if running on bare > metal. This is not true, however, as the qspinlock code will fall back > to a test-and-set scheme when it is detecting a hypervisor. > > In order to avoid this disable the virt_spin_lock_key. > > Signed-off-by: Juergen Gross <jgr...@suse.com> > --- > arch/x86/xen/spinlock.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c > index 25a7c4302ce7..e8ab80ad7a6f 100644 > --- a/arch/x86/xen/spinlock.c > +++ b/arch/x86/xen/spinlock.c > @@ -10,6 +10,7 @@ > #include > > #include > +#include > > #include > #include > @@ -129,6 +130,7 @@ void __init xen_init_spinlocks(void) > > if (!xen_pvspin) { > printk(KERN_DEBUG "xen: PV spinlocks disabled\n"); > + static_branch_disable(_spin_lock_key); > return; > } > printk(KERN_DEBUG "xen: PV spinlocks enabled\n"); Acked-by: Waiman Long <long...@redhat.com> ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/06/2017 09:06 AM, Peter Zijlstra wrote: > On Wed, Sep 06, 2017 at 08:44:09AM -0400, Waiman Long wrote: >> On 09/06/2017 03:08 AM, Peter Zijlstra wrote: >>> Guys, please trim email. >>> >>> On Tue, Sep 05, 2017 at 10:31:46AM -0400, Waiman Long wrote: >>>> For clarification, I was actually asking if you consider just adding one >>>> more jump label to skip it for Xen/KVM instead of making >>>> virt_spin_lock() a pv-op. >>> I don't understand. What performance are you worried about. Native will >>> now do: "xor rax,rax; jnz some_cold_label" that's fairly trival code. >> It is not native that I am talking about. I am worry about VM with >> non-Xen/KVM hypervisor where virt_spin_lock() will actually be called. >> Now that function will become a callee-saved function call instead of >> being inlined into the native slowpath function. > But only if we actually end up using the test-and-set thing, because if > you have paravirt we end up using that. I am talking about scenario like a kernel with paravirt spinlock running in a VM managed by VMware, for example. We may not want to penalize them while there are alternatives with less penalty. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/06/2017 03:08 AM, Peter Zijlstra wrote: > Guys, please trim email. > > On Tue, Sep 05, 2017 at 10:31:46AM -0400, Waiman Long wrote: >> For clarification, I was actually asking if you consider just adding one >> more jump label to skip it for Xen/KVM instead of making >> virt_spin_lock() a pv-op. > I don't understand. What performance are you worried about. Native will > now do: "xor rax,rax; jnz some_cold_label" that's fairly trival code. It is not native that I am talking about. I am worry about VM with non-Xen/KVM hypervisor where virt_spin_lock() will actually be called. Now that function will become a callee-saved function call instead of being inlined into the native slowpath function. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 10:24 AM, Waiman Long wrote: > On 09/05/2017 10:18 AM, Juergen Gross wrote: >> On 05/09/17 16:10, Waiman Long wrote: >>> On 09/05/2017 09:24 AM, Juergen Gross wrote: >>>> There are cases where a guest tries to switch spinlocks to bare metal >>>> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this >>>> has the downside of falling back to unfair test and set scheme for >>>> qspinlocks due to virt_spin_lock() detecting the virtualized >>>> environment. >>>> >>>> Make virt_spin_lock() a paravirt operation in order to enable users >>>> to select an explicit behavior like bare metal. >>>> >>>> Signed-off-by: Juergen Gross <jgr...@suse.com> >>>> --- >>>> arch/x86/include/asm/paravirt.h | 5 >>>> arch/x86/include/asm/paravirt_types.h | 1 + >>>> arch/x86/include/asm/qspinlock.h | 48 >>>> --- >>>> arch/x86/kernel/paravirt-spinlocks.c | 14 ++ >>>> arch/x86/kernel/smpboot.c | 2 ++ >>>> 5 files changed, 55 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/arch/x86/include/asm/paravirt.h >>>> b/arch/x86/include/asm/paravirt.h >>>> index c25dd22f7c70..d9e954fb37df 100644 >>>> --- a/arch/x86/include/asm/paravirt.h >>>> +++ b/arch/x86/include/asm/paravirt.h >>>> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long >>>> cpu) >>>>return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); >>>> } >>>> >>>> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock) >>>> +{ >>>> + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock); >>>> +} >>>> + >>>> #endif /* SMP && PARAVIRT_SPINLOCKS */ >>>> >>>> #ifdef CONFIG_X86_32 >>>> diff --git a/arch/x86/include/asm/paravirt_types.h >>>> b/arch/x86/include/asm/paravirt_types.h >>>> index 19efefc0e27e..928f5e7953a7 100644 >>>> --- a/arch/x86/include/asm/paravirt_types.h >>>> +++ b/arch/x86/include/asm/paravirt_types.h >>>> @@ -319,6 +319,7 @@ struct pv_lock_ops { >>>>void (*kick)(int cpu); >>>> >>>>struct paravirt_callee_save vcpu_is_preempted; >>>> + struct paravirt_callee_save virt_spin_lock; >>>> } __no_randomize_layout; >>>> >>>> /* This contains all the paravirt structures: we get a convenient >>>> diff --git a/arch/x86/include/asm/qspinlock.h >>>> b/arch/x86/include/asm/qspinlock.h >>>> index 48a706f641f2..fbd98896385c 100644 >>>> --- a/arch/x86/include/asm/qspinlock.h >>>> +++ b/arch/x86/include/asm/qspinlock.h >>>> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct >>>> qspinlock *lock) >>>>smp_store_release((u8 *)lock, 0); >>>> } >>>> >>>> +static inline bool native_virt_spin_lock(struct qspinlock *lock) >>>> +{ >>>> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>>> + return false; >>>> + >>>> + /* >>>> + * On hypervisors without PARAVIRT_SPINLOCKS support we fall >>>> + * back to a Test-and-Set spinlock, because fair locks have >>>> + * horrible lock 'holder' preemption issues. >>>> + */ >>>> + >>>> + do { >>>> + while (atomic_read(>val) != 0) >>>> + cpu_relax(); >>>> + } while (atomic_cmpxchg(>val, 0, _Q_LOCKED_VAL) != 0); >>>> + >>>> + return true; >>>> +} >>>> + >>>> #ifdef CONFIG_PARAVIRT_SPINLOCKS >>>> extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 >>>> val); >>>> extern void __pv_init_lock_hash(void); >>>> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu) >>>> { >>>>return pv_vcpu_is_preempted(cpu); >>>> } >>>> + >>>> +void native_pv_lock_init(void) __init; >>>> #else >>>> static inline void queued_spin_unlock(struct qspinlock *lock) >>>> { >>>>native_queued_spin_unlock(lock); >>>> } >>>> + >>>> +static inline void native_pv_lock_init(void) >>>> +{ >>>> +} >>>> #endif >>>> >>>> #ifdef CONFIG_PARAVIRT >>>> #define virt_spin_lock virt_spin_lock >>>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS >>>> static inline bool virt_spin_lock(struct qspinlock *lock) >>>> { >>>> - if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>>> - return false; >>> Have you consider just add one more jump label here to skip >>> virt_spin_lock when KVM or Xen want to do so? >> Why? Did you look at patch 4? This is the way to do it... > I asked this because of my performance concern as stated later in the email. For clarification, I was actually asking if you consider just adding one more jump label to skip it for Xen/KVM instead of making virt_spin_lock() a pv-op. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 10:18 AM, Juergen Gross wrote: > On 05/09/17 16:10, Waiman Long wrote: >> On 09/05/2017 09:24 AM, Juergen Gross wrote: >>> There are cases where a guest tries to switch spinlocks to bare metal >>> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this >>> has the downside of falling back to unfair test and set scheme for >>> qspinlocks due to virt_spin_lock() detecting the virtualized >>> environment. >>> >>> Make virt_spin_lock() a paravirt operation in order to enable users >>> to select an explicit behavior like bare metal. >>> >>> Signed-off-by: Juergen Gross <jgr...@suse.com> >>> --- >>> arch/x86/include/asm/paravirt.h | 5 >>> arch/x86/include/asm/paravirt_types.h | 1 + >>> arch/x86/include/asm/qspinlock.h | 48 >>> --- >>> arch/x86/kernel/paravirt-spinlocks.c | 14 ++ >>> arch/x86/kernel/smpboot.c | 2 ++ >>> 5 files changed, 55 insertions(+), 15 deletions(-) >>> >>> diff --git a/arch/x86/include/asm/paravirt.h >>> b/arch/x86/include/asm/paravirt.h >>> index c25dd22f7c70..d9e954fb37df 100644 >>> --- a/arch/x86/include/asm/paravirt.h >>> +++ b/arch/x86/include/asm/paravirt.h >>> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long >>> cpu) >>> return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); >>> } >>> >>> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock) >>> +{ >>> + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock); >>> +} >>> + >>> #endif /* SMP && PARAVIRT_SPINLOCKS */ >>> >>> #ifdef CONFIG_X86_32 >>> diff --git a/arch/x86/include/asm/paravirt_types.h >>> b/arch/x86/include/asm/paravirt_types.h >>> index 19efefc0e27e..928f5e7953a7 100644 >>> --- a/arch/x86/include/asm/paravirt_types.h >>> +++ b/arch/x86/include/asm/paravirt_types.h >>> @@ -319,6 +319,7 @@ struct pv_lock_ops { >>> void (*kick)(int cpu); >>> >>> struct paravirt_callee_save vcpu_is_preempted; >>> + struct paravirt_callee_save virt_spin_lock; >>> } __no_randomize_layout; >>> >>> /* This contains all the paravirt structures: we get a convenient >>> diff --git a/arch/x86/include/asm/qspinlock.h >>> b/arch/x86/include/asm/qspinlock.h >>> index 48a706f641f2..fbd98896385c 100644 >>> --- a/arch/x86/include/asm/qspinlock.h >>> +++ b/arch/x86/include/asm/qspinlock.h >>> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct >>> qspinlock *lock) >>> smp_store_release((u8 *)lock, 0); >>> } >>> >>> +static inline bool native_virt_spin_lock(struct qspinlock *lock) >>> +{ >>> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>> + return false; >>> + >>> + /* >>> +* On hypervisors without PARAVIRT_SPINLOCKS support we fall >>> +* back to a Test-and-Set spinlock, because fair locks have >>> +* horrible lock 'holder' preemption issues. >>> +*/ >>> + >>> + do { >>> + while (atomic_read(>val) != 0) >>> + cpu_relax(); >>> + } while (atomic_cmpxchg(>val, 0, _Q_LOCKED_VAL) != 0); >>> + >>> + return true; >>> +} >>> + >>> #ifdef CONFIG_PARAVIRT_SPINLOCKS >>> extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 >>> val); >>> extern void __pv_init_lock_hash(void); >>> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu) >>> { >>> return pv_vcpu_is_preempted(cpu); >>> } >>> + >>> +void native_pv_lock_init(void) __init; >>> #else >>> static inline void queued_spin_unlock(struct qspinlock *lock) >>> { >>> native_queued_spin_unlock(lock); >>> } >>> + >>> +static inline void native_pv_lock_init(void) >>> +{ >>> +} >>> #endif >>> >>> #ifdef CONFIG_PARAVIRT >>> #define virt_spin_lock virt_spin_lock >>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS >>> static inline bool virt_spin_lock(struct qspinlock *lock) >>> { >>> - if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>> - return false; >> Have you consider just add one more j
Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 10:08 AM, Peter Zijlstra wrote: > On Tue, Sep 05, 2017 at 10:02:57AM -0400, Waiman Long wrote: >> On 09/05/2017 09:24 AM, Juergen Gross wrote: >>> +static inline bool native_virt_spin_lock(struct qspinlock *lock) >>> +{ >>> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>> + return false; >>> + >> I think you can take the above if statement out as you has done test in >> native_pv_lock_init(). So the test will also be false here. > That does mean we'll run a test-and-set spinlock until paravirt patching > happens though. I prefer to not do that. > > One important point.. we must not be holding any locks when we switch > over between the two locks. Back then I spend some time making sure that > didn't happen with the X86 feature flag muck. AFAICT, native_pv_lock_init() is called before SMP init. So it shouldn't matter. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 09:24 AM, Juergen Gross wrote: > There are cases where a guest tries to switch spinlocks to bare metal > behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this > has the downside of falling back to unfair test and set scheme for > qspinlocks due to virt_spin_lock() detecting the virtualized > environment. > > Make virt_spin_lock() a paravirt operation in order to enable users > to select an explicit behavior like bare metal. > > Signed-off-by: Juergen Gross> --- > arch/x86/include/asm/paravirt.h | 5 > arch/x86/include/asm/paravirt_types.h | 1 + > arch/x86/include/asm/qspinlock.h | 48 > --- > arch/x86/kernel/paravirt-spinlocks.c | 14 ++ > arch/x86/kernel/smpboot.c | 2 ++ > 5 files changed, 55 insertions(+), 15 deletions(-) > > diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h > index c25dd22f7c70..d9e954fb37df 100644 > --- a/arch/x86/include/asm/paravirt.h > +++ b/arch/x86/include/asm/paravirt.h > @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long > cpu) > return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); > } > > +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock) > +{ > + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock); > +} > + > #endif /* SMP && PARAVIRT_SPINLOCKS */ > > #ifdef CONFIG_X86_32 > diff --git a/arch/x86/include/asm/paravirt_types.h > b/arch/x86/include/asm/paravirt_types.h > index 19efefc0e27e..928f5e7953a7 100644 > --- a/arch/x86/include/asm/paravirt_types.h > +++ b/arch/x86/include/asm/paravirt_types.h > @@ -319,6 +319,7 @@ struct pv_lock_ops { > void (*kick)(int cpu); > > struct paravirt_callee_save vcpu_is_preempted; > + struct paravirt_callee_save virt_spin_lock; > } __no_randomize_layout; > > /* This contains all the paravirt structures: we get a convenient > diff --git a/arch/x86/include/asm/qspinlock.h > b/arch/x86/include/asm/qspinlock.h > index 48a706f641f2..fbd98896385c 100644 > --- a/arch/x86/include/asm/qspinlock.h > +++ b/arch/x86/include/asm/qspinlock.h > @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct > qspinlock *lock) > smp_store_release((u8 *)lock, 0); > } > > +static inline bool native_virt_spin_lock(struct qspinlock *lock) > +{ > + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > + return false; > + > + /* > + * On hypervisors without PARAVIRT_SPINLOCKS support we fall > + * back to a Test-and-Set spinlock, because fair locks have > + * horrible lock 'holder' preemption issues. > + */ > + > + do { > + while (atomic_read(>val) != 0) > + cpu_relax(); > + } while (atomic_cmpxchg(>val, 0, _Q_LOCKED_VAL) != 0); > + > + return true; > +} > + > #ifdef CONFIG_PARAVIRT_SPINLOCKS > extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 > val); > extern void __pv_init_lock_hash(void); > @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu) > { > return pv_vcpu_is_preempted(cpu); > } > + > +void native_pv_lock_init(void) __init; > #else > static inline void queued_spin_unlock(struct qspinlock *lock) > { > native_queued_spin_unlock(lock); > } > + > +static inline void native_pv_lock_init(void) > +{ > +} > #endif > > #ifdef CONFIG_PARAVIRT > #define virt_spin_lock virt_spin_lock > +#ifdef CONFIG_PARAVIRT_SPINLOCKS > static inline bool virt_spin_lock(struct qspinlock *lock) > { > - if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > - return false; Have you consider just add one more jump label here to skip virt_spin_lock when KVM or Xen want to do so? > - > - /* > - * On hypervisors without PARAVIRT_SPINLOCKS support we fall > - * back to a Test-and-Set spinlock, because fair locks have > - * horrible lock 'holder' preemption issues. > - */ > - > - do { > - while (atomic_read(>val) != 0) > - cpu_relax(); > - } while (atomic_cmpxchg(>val, 0, _Q_LOCKED_VAL) != 0); > - > - return true; > + return pv_virt_spin_lock(lock); > +} > +#else > +static inline bool virt_spin_lock(struct qspinlock *lock) > +{ > + return native_virt_spin_lock(lock); > } > +#endif /* CONFIG_PARAVIRT_SPINLOCKS */ > #endif /* CONFIG_PARAVIRT */ > > #include > diff --git a/arch/x86/kernel/paravirt-spinlocks.c > b/arch/x86/kernel/paravirt-spinlocks.c > index 26e4bd92f309..1be187ef8a38 100644 > --- a/arch/x86/kernel/paravirt-spinlocks.c > +++ b/arch/x86/kernel/paravirt-spinlocks.c > @@ -20,6 +20,12 @@ bool pv_is_native_spin_unlock(void) > __raw_callee_save___native_queued_spin_unlock; > } > > +__visible bool __native_virt_spin_lock(struct qspinlock *lock) > +{ > + return native_virt_spin_lock(lock); > +} > +PV_CALLEE_SAVE_REGS_THUNK(__native_virt_spin_lock); I have some
Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 09:24 AM, Juergen Gross wrote: > There are cases where a guest tries to switch spinlocks to bare metal > behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this > has the downside of falling back to unfair test and set scheme for > qspinlocks due to virt_spin_lock() detecting the virtualized > environment. > > Make virt_spin_lock() a paravirt operation in order to enable users > to select an explicit behavior like bare metal. > > Signed-off-by: Juergen Gross> --- > arch/x86/include/asm/paravirt.h | 5 > arch/x86/include/asm/paravirt_types.h | 1 + > arch/x86/include/asm/qspinlock.h | 48 > --- > arch/x86/kernel/paravirt-spinlocks.c | 14 ++ > arch/x86/kernel/smpboot.c | 2 ++ > 5 files changed, 55 insertions(+), 15 deletions(-) > > diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h > index c25dd22f7c70..d9e954fb37df 100644 > --- a/arch/x86/include/asm/paravirt.h > +++ b/arch/x86/include/asm/paravirt.h > @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long > cpu) > return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); > } > > +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock) > +{ > + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock); > +} > + > #endif /* SMP && PARAVIRT_SPINLOCKS */ > > #ifdef CONFIG_X86_32 > diff --git a/arch/x86/include/asm/paravirt_types.h > b/arch/x86/include/asm/paravirt_types.h > index 19efefc0e27e..928f5e7953a7 100644 > --- a/arch/x86/include/asm/paravirt_types.h > +++ b/arch/x86/include/asm/paravirt_types.h > @@ -319,6 +319,7 @@ struct pv_lock_ops { > void (*kick)(int cpu); > > struct paravirt_callee_save vcpu_is_preempted; > + struct paravirt_callee_save virt_spin_lock; > } __no_randomize_layout; > > /* This contains all the paravirt structures: we get a convenient > diff --git a/arch/x86/include/asm/qspinlock.h > b/arch/x86/include/asm/qspinlock.h > index 48a706f641f2..fbd98896385c 100644 > --- a/arch/x86/include/asm/qspinlock.h > +++ b/arch/x86/include/asm/qspinlock.h > @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct > qspinlock *lock) > smp_store_release((u8 *)lock, 0); > } > > +static inline bool native_virt_spin_lock(struct qspinlock *lock) > +{ > + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > + return false; > + I think you can take the above if statement out as you has done test in native_pv_lock_init(). So the test will also be false here. As this patch series is x86 specific, you should probably add "x86/" in front of paravirt in the patch titles. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v5 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
It was found when running fio sequential write test with a XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. To reduce this performance overhead, an optimized assembly version of the the __raw_callee_save___kvm_vcpu_is_preempt() function is provided for x86-64. With this patch applied on a KVM guest on a 2-socekt 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: I/O Type w/o patchwith patch --- random read 8141.2 MB/s 8497.1 MB/s seq read 8229.4 MB/s 8304.2 MB/s random write 1675.5 MB/s 1701.5 MB/s seq write 1681.3 MB/s 1699.9 MB/s There are some increases in the aggregated bandwidth because of the patch. The perf data now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted The assembly code was verified by using a test kernel module to compare the output of C __kvm_vcpu_is_preempted() and that of assembly __raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched. Suggested-by: Peter Zijlstra <pet...@infradead.org> Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/kernel/asm-offsets_64.c | 9 + arch/x86/kernel/kvm.c| 24 2 files changed, 33 insertions(+) diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c index 210927e..99332f5 100644 --- a/arch/x86/kernel/asm-offsets_64.c +++ b/arch/x86/kernel/asm-offsets_64.c @@ -13,6 +13,10 @@ #include }; +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS) +#include +#endif + int main(void) { #ifdef CONFIG_PARAVIRT @@ -22,6 +26,11 @@ int main(void) BLANK(); #endif +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS) + OFFSET(KVM_STEAL_TIME_preempted, kvm_steal_time, preempted); + BLANK(); +#endif + #define ENTRY(entry) OFFSET(pt_regs_ ## entry, pt_regs, entry) ENTRY(bx); ENTRY(cx); diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 85ed343..14f65a5 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } +#ifdef CONFIG_X86_32 __visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); @@ -597,6 +598,29 @@ __visible bool __kvm_vcpu_is_preempted(long cpu) } PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); +#else + +#include + +extern bool __raw_callee_save___kvm_vcpu_is_preempted(long); + +/* + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and + * restoring to/from the stack. + */ +asm( +".pushsection .text;" +".global __raw_callee_save___kvm_vcpu_is_preempted;" +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" +"__raw_callee_save___kvm_vcpu_is_preempted:" +"movq __per_cpu_offset(,%rdi,8), %rax;" +"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);" +"setne %al;" +"ret;" +".popsection"); + +#endif + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v5 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead
v4->v5: - As suggested by PeterZ, use the asm-offsets header file generation mechanism to get the offset of the preempted field in kvm_steal_time instead of hardcoding it. v3->v4: - Fix x86-32 build error. v2->v3: - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted() in assembly as suggested by PeterZ. - Add a new patch to change vcpu_is_preempted() argument type to long to ease the writing of the assembly code. v1->v2: - Rerun the fio test on a different system on both bare-metal and a KVM guest. Both sockets were utilized in this test. - The commit log was updated with new performance numbers, but the patch wasn't changed. - Drop patch 2. As it was found that the overhead of callee-save vcpu_is_preempted() can have some impact on system performance on a VM guest, especially of x86-64 guest, this patch set intends to reduce this performance overhead by replacing the C __kvm_vcpu_is_preempted() function by an optimized version of __raw_callee_save___kvm_vcpu_is_preempted() written in assembly. Waiman Long (2): x86/paravirt: Change vcp_is_preempted() arg type to long x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/asm-offsets_64.c | 9 + arch/x86/kernel/kvm.c| 26 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 5 files changed, 37 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v5 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
The cpu argument in the function prototype of vcpu_is_preempted() is changed from int to long. That makes it easier to provide a better optimized assembly version of that function. For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the downcast from long to int is not a problem as vCPU number won't exceed 32 bits. Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 2 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 1eea6ca..f75fbfe 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -673,7 +673,7 @@ static __always_inline void pv_kick(int cpu) PVOP_VCALL1(pv_lock_ops.kick, cpu); } -static __always_inline bool pv_vcpu_is_preempted(int cpu) +static __always_inline bool pv_vcpu_is_preempted(long cpu) { return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index c343ab5..48a706f 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock) } #define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) +static inline bool vcpu_is_preempted(long cpu) { return pv_vcpu_is_preempted(cpu); } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..85ed343 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } -__visible bool __kvm_vcpu_is_preempted(int cpu) +__visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..8f2d1c9 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void) __raw_callee_save___native_queued_spin_unlock; } -__visible bool __native_vcpu_is_preempted(int cpu) +__visible bool __native_vcpu_is_preempted(long cpu) { return false; } -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
On 02/16/2017 11:09 AM, Peter Zijlstra wrote: > On Wed, Feb 15, 2017 at 04:37:49PM -0500, Waiman Long wrote: >> The cpu argument in the function prototype of vcpu_is_preempted() >> is changed from int to long. That makes it easier to provide a better >> optimized assembly version of that function. >> >> For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the >> downcast from long to int is not a problem as vCPU number won't exceed >> 32 bits. >> > Note that because of the cast in PVOP_CALL_ARG1() this patch is > pointless. > > Then again, it doesn't seem to affect code generation, so why not. Takes > away the reliance on that weird cast. I add this patch because I am a bit uneasy about clearing the upper 32 bits of rdi and assuming that the compiler won't have a previous use of those bits. It gives me peace of mind. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
On 02/16/2017 11:48 AM, Peter Zijlstra wrote: > On Wed, Feb 15, 2017 at 04:37:50PM -0500, Waiman Long wrote: >> +/* >> + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and >> + * restoring to/from the stack. It is assumed that the preempted value >> + * is at an offset of 16 from the beginning of the kvm_steal_time structure >> + * which is verified by the BUILD_BUG_ON() macro below. >> + */ >> +#define PREEMPTED_OFFSET16 > As per Andrew's suggestion, the 'right' way is something like so. Thanks for the tip. I was not aware of the asm-offsets stuff. I will update the patch to use it. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v4 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead
v3->v4: - Fix x86-32 build error. v2->v3: - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted() in assembly as suggested by PeterZ. - Add a new patch to change vcpu_is_preempted() argument type to long to ease the writing of the assembly code. v1->v2: - Rerun the fio test on a different system on both bare-metal and a KVM guest. Both sockets were utilized in this test. - The commit log was updated with new performance numbers, but the patch wasn't changed. - Drop patch 2. As it was found that the overhead of callee-save vcpu_is_preempted() can have some impact on system performance on a VM guest, especially of x86-64 guest, this patch set intends to reduce this performance overhead by replacing the C __kvm_vcpu_is_preempted() function by an optimized version of __raw_callee_save___kvm_vcpu_is_preempted() written in assembly. Waiman Long (2): x86/paravirt: Change vcp_is_preempted() arg type to long x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 32 +++- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 34 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
The cpu argument in the function prototype of vcpu_is_preempted() is changed from int to long. That makes it easier to provide a better optimized assembly version of that function. For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the downcast from long to int is not a problem as vCPU number won't exceed 32 bits. Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 2 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 1eea6ca..f75fbfe 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -673,7 +673,7 @@ static __always_inline void pv_kick(int cpu) PVOP_VCALL1(pv_lock_ops.kick, cpu); } -static __always_inline bool pv_vcpu_is_preempted(int cpu) +static __always_inline bool pv_vcpu_is_preempted(long cpu) { return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index c343ab5..48a706f 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock) } #define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) +static inline bool vcpu_is_preempted(long cpu) { return pv_vcpu_is_preempted(cpu); } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..85ed343 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } -__visible bool __kvm_vcpu_is_preempted(int cpu) +__visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..8f2d1c9 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void) __raw_callee_save___native_queued_spin_unlock; } -__visible bool __native_vcpu_is_preempted(int cpu) +__visible bool __native_vcpu_is_preempted(long cpu) { return false; } -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
It was found when running fio sequential write test with a XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. To reduce this performance overhead, an optimized assembly version of the the __raw_callee_save___kvm_vcpu_is_preempt() function is provided for x86-64. With this patch applied on a KVM guest on a 2-socekt 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: I/O Type w/o patchwith patch --- random read 8141.2 MB/s 8497.1 MB/s seq read 8229.4 MB/s 8304.2 MB/s random write 1675.5 MB/s 1701.5 MB/s seq write 1681.3 MB/s 1699.9 MB/s There are some increases in the aggregated bandwidth because of the patch. The perf data now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted The assembly code was verified by using a test kernel module to compare the output of C __kvm_vcpu_is_preempted() and that of assembly __raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched. Suggested-by: Peter Zijlstra <pet...@infradead.org> Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/kernel/kvm.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 85ed343..e423435 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } +#ifdef CONFIG_X86_32 __visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); @@ -597,11 +598,40 @@ __visible bool __kvm_vcpu_is_preempted(long cpu) } PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); +#else + +extern bool __raw_callee_save___kvm_vcpu_is_preempted(long); + +/* + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and + * restoring to/from the stack. It is assumed that the preempted value + * is at an offset of 16 from the beginning of the kvm_steal_time structure + * which is verified by the BUILD_BUG_ON() macro below. + */ +#define PREEMPTED_OFFSET 16 +asm( +".pushsection .text;" +".global __raw_callee_save___kvm_vcpu_is_preempted;" +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" +"__raw_callee_save___kvm_vcpu_is_preempted:" +"movq __per_cpu_offset(,%rdi,8), %rax;" +"cmpb $0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);" +"setne %al;" +"ret;" +".popsection"); + +#endif + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ void __init kvm_spinlock_init(void) { +#ifdef CONFIG_X86_64 + BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted) + != PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1)); +#endif + if (!kvm_para_available()) return; /* Does host kernel support KVM_FEATURE_PV_UNHALT? */ -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v3 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead
v2->v3: - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted() in assembly as suggested by PeterZ. - Add a new patch to change vcpu_is_preempted() argument type to long to ease the writing of the assembly code. v1->v2: - Rerun the fio test on a different system on both bare-metal and a KVM guest. Both sockets were utilized in this test. - The commit log was updated with new performance numbers, but the patch wasn't changed. - Drop patch 2. As it was found that the overhead of callee-save vcpu_is_preempted() can have some impact on system performance on a VM guest, especially of x86-64 guest, this patch set intends to reduce this performance overhead by replacing the C __kvm_vcpu_is_preempted() function by an optimized version of __raw_callee_save___kvm_vcpu_is_preempted() written in assembly. Waiman Long (2): x86/paravirt: Change vcp_is_preempted() arg type to long x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 30 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 32 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v3 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
It was found when running fio sequential write test with a XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. To reduce this performance overhead, an optimized assembly version of the the __raw_callee_save___kvm_vcpu_is_preempt() function is provided for x86-64. With this patch applied on a KVM guest on a 2-socekt 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: I/O Type w/o patchwith patch --- random read 8141.2 MB/s 8497.1 MB/s seq read 8229.4 MB/s 8304.2 MB/s random write 1675.5 MB/s 1701.5 MB/s seq write 1681.3 MB/s 1699.9 MB/s There are some increases in the aggregated bandwidth because of the patch. The perf data now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted The assembly code was verified by using a test kernel module to compare the output of C __kvm_vcpu_is_preempted() and that of assembly __raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched. Suggested-by: Peter Zijlstra <pet...@infradead.org> Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/kernel/kvm.c | 28 1 file changed, 28 insertions(+) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 85ed343..83e22c1 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } +#ifdef CONFIG_X86_32 __visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); @@ -597,11 +598,38 @@ __visible bool __kvm_vcpu_is_preempted(long cpu) } PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); +#else + +extern bool __raw_callee_save___kvm_vcpu_is_preempted(long); + +/* + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and + * restoring to/from the stack. It is assumed that the preempted value + * is at an offset of 16 from the beginning of the kvm_steal_time structure + * which is verified by the BUILD_BUG_ON() macro below. + */ +#define PREEMPTED_OFFSET 16 +asm( +".pushsection .text;" +".global __raw_callee_save___kvm_vcpu_is_preempted;" +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" +"__raw_callee_save___kvm_vcpu_is_preempted:" +"movq __per_cpu_offset(,%rdi,8), %rax;" +"cmpb $0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);" +"setne %al;" +"ret;" +".popsection"); + +#endif + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ void __init kvm_spinlock_init(void) { + BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted) + != PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1)); + if (!kvm_para_available()) return; /* Does host kernel support KVM_FEATURE_PV_UNHALT? */ -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v3 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
The cpu argument in the function prototype of vcpu_is_preempted() is changed from int to long. That makes it easier to provide a better optimized assembly version of that function. For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the downcast from long to int is not a problem as vCPU number won't exceed 32 bits. Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 2 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 864f57b..2a46f73 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -674,7 +674,7 @@ static __always_inline void pv_kick(int cpu) PVOP_VCALL1(pv_lock_ops.kick, cpu); } -static __always_inline bool pv_vcpu_is_preempted(int cpu) +static __always_inline bool pv_vcpu_is_preempted(long cpu) { return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index c343ab5..48a706f 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock) } #define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) +static inline bool vcpu_is_preempted(long cpu) { return pv_vcpu_is_preempted(cpu); } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..85ed343 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } -__visible bool __kvm_vcpu_is_preempted(int cpu) +__visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..8f2d1c9 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void) __raw_callee_save___native_queued_spin_unlock; } -__visible bool __native_vcpu_is_preempted(int cpu) +__visible bool __native_vcpu_is_preempted(long cpu) { return false; } -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/14/2017 04:39 AM, Peter Zijlstra wrote: > On Mon, Feb 13, 2017 at 05:34:01PM -0500, Waiman Long wrote: >> It is the address of _time that will exceed the 32-bit limit. > That seems extremely unlikely. That would mean we have more than 4G > worth of per-cpu variables declared in the kernel. I have some doubt about if the compiler is able to properly use RIP-relative addressing for this. Anyway, it seems like constraints aren't allowed for asm() when not in the function context, at least for the the compiler that I am using (4.8.5). So it is a moot point. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 04:52 PM, Peter Zijlstra wrote: > On Mon, Feb 13, 2017 at 03:12:45PM -0500, Waiman Long wrote: >> On 02/13/2017 02:42 PM, Waiman Long wrote: >>> On 02/13/2017 05:53 AM, Peter Zijlstra wrote: >>>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote: >>>>> That way we'd end up with something like: >>>>> >>>>> asm(" >>>>> push %rdi; >>>>> movslq %edi, %rdi; >>>>> movq __per_cpu_offset(,%rdi,8), %rax; >>>>> cmpb $0, %[offset](%rax); >>>>> setne %al; >>>>> pop %rdi; >>>>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct >>>>> steal_time, preempted))); >>>>> >>>>> And if we could get rid of the sign extend on edi we could avoid all the >>>>> push-pop nonsense, but I'm not sure I see how to do that (then again, >>>>> this asm foo isn't my strongest point). >>>> Maybe: >>>> >>>> movsql %edi, %rax; >>>> movq __per_cpu_offset(,%rax,8), %rax; >>>> cmpb $0, %[offset](%rax); >>>> setne %al; >>>> >>>> ? >>> Yes, that looks good to me. >>> >>> Cheers, >>> Longman >>> >> Sorry, I am going to take it back. The displacement or offset can only >> be up to 32-bit. So we will still need to use at least one more >> register, I think. > I don't think that would be a problem, I very much doubt we declare more > than 4G worth of per-cpu variables in the kernel. > > In any case, use "e" or "Z" as constraint (I never quite know when to > use which). That are s32 and u32 displacement immediates resp. and > should fail compile with a semi-sensible failure if the displacement is > too big. > It is the address of _time that will exceed the 32-bit limit. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 03:06 PM, h...@zytor.com wrote: > On February 13, 2017 2:53:43 AM PST, Peter Zijlstra> wrote: >> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote: >>> That way we'd end up with something like: >>> >>> asm(" >>> push %rdi; >>> movslq %edi, %rdi; >>> movq __per_cpu_offset(,%rdi,8), %rax; >>> cmpb $0, %[offset](%rax); >>> setne %al; >>> pop %rdi; >>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct >> steal_time, preempted))); >>> And if we could get rid of the sign extend on edi we could avoid all >> the >>> push-pop nonsense, but I'm not sure I see how to do that (then again, >>> this asm foo isn't my strongest point). >> Maybe: >> >> movsql %edi, %rax; >> movq __per_cpu_offset(,%rax,8), %rax; >> cmpb $0, %[offset](%rax); >> setne %al; >> >> ? > We could kill the zero or sign extend by changing the calling interface to > pass an unsigned long instead of an int. It is much more likely that a zero > extend is free for the caller than a sign extend. I have thought of that too. However, the goal is to eliminate memory read/write from/to stack. Eliminating a register sign-extend instruction won't help much in term of performance. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 02:42 PM, Waiman Long wrote: > On 02/13/2017 05:53 AM, Peter Zijlstra wrote: >> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote: >>> That way we'd end up with something like: >>> >>> asm(" >>> push %rdi; >>> movslq %edi, %rdi; >>> movq __per_cpu_offset(,%rdi,8), %rax; >>> cmpb $0, %[offset](%rax); >>> setne %al; >>> pop %rdi; >>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct >>> steal_time, preempted))); >>> >>> And if we could get rid of the sign extend on edi we could avoid all the >>> push-pop nonsense, but I'm not sure I see how to do that (then again, >>> this asm foo isn't my strongest point). >> Maybe: >> >> movsql %edi, %rax; >> movq __per_cpu_offset(,%rax,8), %rax; >> cmpb $0, %[offset](%rax); >> setne %al; >> >> ? > Yes, that looks good to me. > > Cheers, > Longman > Sorry, I am going to take it back. The displacement or offset can only be up to 32-bit. So we will still need to use at least one more register, I think. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 05:53 AM, Peter Zijlstra wrote: > On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote: >> That way we'd end up with something like: >> >> asm(" >> push %rdi; >> movslq %edi, %rdi; >> movq __per_cpu_offset(,%rdi,8), %rax; >> cmpb $0, %[offset](%rax); >> setne %al; >> pop %rdi; >> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct >> steal_time, preempted))); >> >> And if we could get rid of the sign extend on edi we could avoid all the >> push-pop nonsense, but I'm not sure I see how to do that (then again, >> this asm foo isn't my strongest point). > Maybe: > > movsql %edi, %rax; > movq __per_cpu_offset(,%rax,8), %rax; > cmpb $0, %[offset](%rax); > setne %al; > > ? Yes, that looks good to me. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 05:47 AM, Peter Zijlstra wrote: > On Fri, Feb 10, 2017 at 12:00:43PM -0500, Waiman Long wrote: > >>>> +asm( >>>> +".pushsection .text;" >>>> +".global __raw_callee_save___kvm_vcpu_is_preempted;" >>>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" >>>> +"__raw_callee_save___kvm_vcpu_is_preempted:" >>>> +FRAME_BEGIN >>>> +"push %rdi;" >>>> +"push %rdx;" >>>> +"movslq %edi, %rdi;" >>>> +"movq$steal_time+16, %rax;" >>>> +"movq__per_cpu_offset(,%rdi,8), %rdx;" >>>> +"cmpb$0, (%rdx,%rax);" > Could we not put the $steal_time+16 displacement as an immediate in the > cmpb and save a whole register here? > > That way we'd end up with something like: > > asm(" > push %rdi; > movslq %edi, %rdi; > movq __per_cpu_offset(,%rdi,8), %rax; > cmpb $0, %[offset](%rax); > setne %al; > pop %rdi; > " : : [offset] "i" (((unsigned long)_time) + offsetof(struct > steal_time, preempted))); > > And if we could get rid of the sign extend on edi we could avoid all the > push-pop nonsense, but I'm not sure I see how to do that (then again, > this asm foo isn't my strongest point). Yes, I think that can work. I will try to ran this patch to see how thing goes. >>>> +"setne %al;" >>>> +"pop %rdx;" >>>> +"pop %rdi;" >>>> +FRAME_END >>>> +"ret;" >>>> +".popsection"); >>>> + >>>> +#endif >>>> + >>>> /* >>>> * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. >>>> */ >>> That should work for now. I have done something similar for >>> __pv_queued_spin_unlock. However, this has the problem of creating a >>> dependency on the exact layout of the steal_time structure. Maybe the >>> constant 16 can be passed in as a parameter offsetof(struct >>> kvm_steal_time, preempted) to the asm call. > Yeah it should be well possible to pass that in. But ideally we'd have > GCC grow something like __attribute__((callee_saved)) or somesuch and it > would do all this for us. That will be really nice too. I am not too fond of working in assembly. >> One more thing, that will improve KVM performance, but it won't help Xen. > People still use Xen? ;-) In any case, their implementation looks very > similar and could easily crib this. In Red Hat, my focus will be on KVM performance. I do believe that there are still Xen users out there. So we still need to keep their interest into consideration. Given that, I am OK to make it work better in KVM first and then think about Xen later. >> I looked into the assembly code for rwsem_spin_on_owner, It need to save >> and restore 2 additional registers with my patch. Doing it your way, >> will transfer the save and restore overhead to the assembly code. >> However, __kvm_vcpu_is_preempted() is called multiple times per >> invocation of rwsem_spin_on_owner. That function is simple enough that >> making __kvm_vcpu_is_preempted() callee-save won't produce much compiler >> optimization opportunity. > This is because of that noinline, right? Otherwise it would've been > folded and register pressure would be much higher. Yes, I guess so. The noinline is there so that we know what the CPU time is for spinning rather than other activities within the slowpath. > >> The outer function rwsem_down_write_failed() >> does appear to be a bit bigger (from 866 bytes to 884 bytes) though. > I suspect GCC is being clever and since all this is static it plays > games with the calling convention and pushes these clobbers out. > > Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/10/2017 11:35 AM, Waiman Long wrote: > On 02/10/2017 11:19 AM, Peter Zijlstra wrote: >> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote: >>> It was found when running fio sequential write test with a XFS ramdisk >>> on a VM running on a 2-socket x86-64 system, the %CPU times as reported >>> by perf were as follows: >>> >>> 69.75% 0.59% fio [k] down_write >>> 69.15% 0.01% fio [k] call_rwsem_down_write_failed >>> 67.12% 1.12% fio [k] rwsem_down_write_failed >>> 63.48% 52.77% fio [k] osq_lock >>> 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt >>> 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted >>> >> Thinking about this again, wouldn't something like the below also work? >> >> >> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c >> index 099fcba4981d..6aa33702c15c 100644 >> --- a/arch/x86/kernel/kvm.c >> +++ b/arch/x86/kernel/kvm.c >> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) >> local_irq_restore(flags); >> } >> >> +#ifdef CONFIG_X86_32 >> __visible bool __kvm_vcpu_is_preempted(int cpu) >> { >> struct kvm_steal_time *src = _cpu(steal_time, cpu); >> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu) >> } >> PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); >> >> +#else >> + >> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int); >> + >> +asm( >> +".pushsection .text;" >> +".global __raw_callee_save___kvm_vcpu_is_preempted;" >> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" >> +"__raw_callee_save___kvm_vcpu_is_preempted:" >> +FRAME_BEGIN >> +"push %rdi;" >> +"push %rdx;" >> +"movslq %edi, %rdi;" >> +"movq$steal_time+16, %rax;" >> +"movq__per_cpu_offset(,%rdi,8), %rdx;" >> +"cmpb$0, (%rdx,%rax);" >> +"setne %al;" >> +"pop %rdx;" >> +"pop %rdi;" >> +FRAME_END >> +"ret;" >> +".popsection"); >> + >> +#endif >> + >> /* >> * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. >> */ > That should work for now. I have done something similar for > __pv_queued_spin_unlock. However, this has the problem of creating a > dependency on the exact layout of the steal_time structure. Maybe the > constant 16 can be passed in as a parameter offsetof(struct > kvm_steal_time, preempted) to the asm call. > > Cheers, > Longman One more thing, that will improve KVM performance, but it won't help Xen. I looked into the assembly code for rwsem_spin_on_owner, It need to save and restore 2 additional registers with my patch. Doing it your way, will transfer the save and restore overhead to the assembly code. However, __kvm_vcpu_is_preempted() is called multiple times per invocation of rwsem_spin_on_owner. That function is simple enough that making __kvm_vcpu_is_preempted() callee-save won't produce much compiler optimization opportunity. The outer function rwsem_down_write_failed() does appear to be a bit bigger (from 866 bytes to 884 bytes) though. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/10/2017 11:19 AM, Peter Zijlstra wrote: > On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote: >> It was found when running fio sequential write test with a XFS ramdisk >> on a VM running on a 2-socket x86-64 system, the %CPU times as reported >> by perf were as follows: >> >> 69.75% 0.59% fio [k] down_write >> 69.15% 0.01% fio [k] call_rwsem_down_write_failed >> 67.12% 1.12% fio [k] rwsem_down_write_failed >> 63.48% 52.77% fio [k] osq_lock >> 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt >> 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted >> > Thinking about this again, wouldn't something like the below also work? > > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c > index 099fcba4981d..6aa33702c15c 100644 > --- a/arch/x86/kernel/kvm.c > +++ b/arch/x86/kernel/kvm.c > @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) > local_irq_restore(flags); > } > > +#ifdef CONFIG_X86_32 > __visible bool __kvm_vcpu_is_preempted(int cpu) > { > struct kvm_steal_time *src = _cpu(steal_time, cpu); > @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu) > } > PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); > > +#else > + > +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int); > + > +asm( > +".pushsection .text;" > +".global __raw_callee_save___kvm_vcpu_is_preempted;" > +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" > +"__raw_callee_save___kvm_vcpu_is_preempted:" > +FRAME_BEGIN > +"push %rdi;" > +"push %rdx;" > +"movslq %edi, %rdi;" > +"movq$steal_time+16, %rax;" > +"movq__per_cpu_offset(,%rdi,8), %rdx;" > +"cmpb$0, (%rdx,%rax);" > +"setne %al;" > +"pop %rdx;" > +"pop %rdi;" > +FRAME_END > +"ret;" > +".popsection"); > + > +#endif > + > /* > * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. > */ That should work for now. I have done something similar for __pv_queued_spin_unlock. However, this has the problem of creating a dependency on the exact layout of the steal_time structure. Maybe the constant 16 can be passed in as a parameter offsetof(struct kvm_steal_time, preempted) to the asm call. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
It was found when running fio sequential write test with a XFS ramdisk on a VM running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. As vcpu_is_preempted() is called within the spinlock, mutex and rwsem slowpaths, there isn't much to gain by making it callee-save. So it is now changed to a normal function call instead. With this patch applied on both bare-metal & KVM guest on a 2-socekt 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: Bare MetalKVM Guest I/O Type w/o patchwith patch w/o patchwith patch --- --- random read 8650.5 MB/s 8560.9 MB/s 7602.9 MB/s 8196.1 MB/s seq read 9104.8 MB/s 9397.2 MB/s 8293.7 MB/s 8566.9 MB/s random write 1623.8 MB/s 1626.7 MB/s 1590.6 MB/s 1700.7 MB/s seq write 1626.4 MB/s 1624.9 MB/s 1604.8 MB/s 1726.3 MB/s The perf data (on KVM guest) now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted On bare metal, the patch doesn't introduce any performance regression. On KVM guest, it produces noticeable performance improvement (up to 7%). Signed-off-by: Waiman Long <long...@redhat.com> --- v1->v2: - Rerun the fio test on a different system on both bare-metal and a KVM guest. Both sockets were utilized in this test. - The commit log was updated with new performance numbers, but the patch wasn't changed. - Drop patch 2. arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/paravirt_types.h | 2 +- arch/x86/kernel/kvm.c | 7 ++- arch/x86/kernel/paravirt-spinlocks.c | 6 ++ arch/x86/xen/spinlock.c | 4 +--- 5 files changed, 7 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 864f57b..2515885 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -676,7 +676,7 @@ static __always_inline void pv_kick(int cpu) static __always_inline bool pv_vcpu_is_preempted(int cpu) { - return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); + return PVOP_CALL1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } #endif /* SMP && PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index bb2de45..88dc852 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -309,7 +309,7 @@ struct pv_lock_ops { void (*wait)(u8 *ptr, u8 val); void (*kick)(int cpu); - struct paravirt_callee_save vcpu_is_preempted; + bool (*vcpu_is_preempted)(int cpu); }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..eb3753d 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -595,7 +595,6 @@ __visible bool __kvm_vcpu_is_preempted(int cpu) return !!src->preempted; } -PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. @@ -614,10 +613,8 @@ void __init kvm_spinlock_init(void) pv_lock_ops.wait = kvm_wait; pv_lock_ops.kick = kvm_kick_cpu; - if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { - pv_lock_ops.vcpu_is_preempted = - PV_CALLEE_SAVE(__kvm_vcpu_is_preempted); - } + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) + pv_lock_ops.vcpu_is_preempted = __kvm_vcpu_is_preempted; } #endif /* CONFIG_PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..da050bc 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -24,12 +24,10 @@ __visible bool __native_vcpu_is_preempted(int cpu) { return false; } -PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted); bool pv_is_native_vcpu_is_preempted(void) { - return pv_lock_ops.vcpu_is_preempted.func == - __raw_callee_save___
Re: [Xen-devel] [PATCH 1/2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/08/2017 02:05 PM, Peter Zijlstra wrote: > On Wed, Feb 08, 2017 at 01:00:24PM -0500, Waiman Long wrote: >> It was found when running fio sequential write test with a XFS ramdisk >> on a 2-socket x86-64 system, the %CPU times as reported by perf were >> as follows: >> >> 71.27% 0.28% fio [k] down_write >> 70.99% 0.01% fio [k] call_rwsem_down_write_failed >> 69.43% 1.18% fio [k] rwsem_down_write_failed >> 65.51% 54.57% fio [k] osq_lock >> 9.72% 7.99% fio [k] __raw_callee_save___kvm_vcpu_is_preempted >> 4.16% 4.16% fio [k] __kvm_vcpu_is_preempted >> >> So making vcpu_is_preempted() a callee-save function has a pretty high >> cost associated with it. As vcpu_is_preempted() is called within the >> spinlock, mutex and rwsem slowpaths, there isn't much to gain by making >> it callee-save. So it is now changed to a normal function call instead. >> > Numbers for bare metal too please. I will run the test on bare metal, but I doubt there will be noticeable difference. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 2/2] locking/mutex, rwsem: Reduce vcpu_is_preempted() calling frequency
On 02/08/2017 02:05 PM, Peter Zijlstra wrote: > On Wed, Feb 08, 2017 at 01:00:25PM -0500, Waiman Long wrote: >> As the vcpu_is_preempted() call is pretty costly compared with other >> checks within mutex_spin_on_owner() and rwsem_spin_on_owner(), they >> are done at a reduce frequency of once every 256 iterations. > That's just disgusting. I do have some doubt myself on the effectiveness of this patch. Anyway, it is the first patch that I think is more beneficial. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH 2/2] locking/mutex, rwsem: Reduce vcpu_is_preempted() calling frequency
As the vcpu_is_preempted() call is pretty costly compared with other checks within mutex_spin_on_owner() and rwsem_spin_on_owner(), they are done at a reduce frequency of once every 256 iterations. Signed-off-by: Waiman Long <long...@redhat.com> --- kernel/locking/mutex.c | 5 - kernel/locking/rwsem-xadd.c | 6 -- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index ad2d9e2..2ece0c4 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -423,6 +423,7 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner, struct ww_acquire_ctx *ww_ctx, struct mutex_waiter *waiter) { bool ret = true; + int loop = 0; rcu_read_lock(); while (__mutex_owner(lock) == owner) { @@ -436,9 +437,11 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner, /* * Use vcpu_is_preempted to detect lock holder preemption issue. +* As vcpu_is_preempted is more costly to use, it is called at +* a reduced frequencey (once every 256 iterations). */ if (!owner->on_cpu || need_resched() || - vcpu_is_preempted(task_cpu(owner))) { + (!(++loop & 0xff) && vcpu_is_preempted(task_cpu(owner { ret = false; break; } diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c index 2ad8d8d..7a884a6 100644 --- a/kernel/locking/rwsem-xadd.c +++ b/kernel/locking/rwsem-xadd.c @@ -351,6 +351,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem) static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem) { struct task_struct *owner = READ_ONCE(sem->owner); + int loop = 0; if (!rwsem_owner_is_writer(owner)) goto out; @@ -367,10 +368,11 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem) /* * abort spinning when need_resched or owner is not running or -* owner's cpu is preempted. +* owner's cpu is preempted. The preemption check is done at +* a lower frequencey because of its high cost. */ if (!owner->on_cpu || need_resched() || - vcpu_is_preempted(task_cpu(owner))) { + (!(++loop & 0xff) && vcpu_is_preempted(task_cpu(owner { rcu_read_unlock(); return false; } -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH 1/2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
It was found when running fio sequential write test with a XFS ramdisk on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 71.27% 0.28% fio [k] down_write 70.99% 0.01% fio [k] call_rwsem_down_write_failed 69.43% 1.18% fio [k] rwsem_down_write_failed 65.51% 54.57% fio [k] osq_lock 9.72% 7.99% fio [k] __raw_callee_save___kvm_vcpu_is_preempted 4.16% 4.16% fio [k] __kvm_vcpu_is_preempted So making vcpu_is_preempted() a callee-save function has a pretty high cost associated with it. As vcpu_is_preempted() is called within the spinlock, mutex and rwsem slowpaths, there isn't much to gain by making it callee-save. So it is now changed to a normal function call instead. With this patch applied, the aggregrate bandwidth of the fio sequential write test increased slightly from 2563.3MB/s to 2588.1MB/s (about 1%). Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/paravirt_types.h | 2 +- arch/x86/kernel/kvm.c | 7 ++- arch/x86/kernel/paravirt-spinlocks.c | 6 ++ arch/x86/xen/spinlock.c | 4 +--- 5 files changed, 7 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 864f57b..2515885 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -676,7 +676,7 @@ static __always_inline void pv_kick(int cpu) static __always_inline bool pv_vcpu_is_preempted(int cpu) { - return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); + return PVOP_CALL1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } #endif /* SMP && PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index bb2de45..88dc852 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -309,7 +309,7 @@ struct pv_lock_ops { void (*wait)(u8 *ptr, u8 val); void (*kick)(int cpu); - struct paravirt_callee_save vcpu_is_preempted; + bool (*vcpu_is_preempted)(int cpu); }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..eb3753d 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -595,7 +595,6 @@ __visible bool __kvm_vcpu_is_preempted(int cpu) return !!src->preempted; } -PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. @@ -614,10 +613,8 @@ void __init kvm_spinlock_init(void) pv_lock_ops.wait = kvm_wait; pv_lock_ops.kick = kvm_kick_cpu; - if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { - pv_lock_ops.vcpu_is_preempted = - PV_CALLEE_SAVE(__kvm_vcpu_is_preempted); - } + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) + pv_lock_ops.vcpu_is_preempted = __kvm_vcpu_is_preempted; } #endif /* CONFIG_PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..da050bc 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -24,12 +24,10 @@ __visible bool __native_vcpu_is_preempted(int cpu) { return false; } -PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted); bool pv_is_native_vcpu_is_preempted(void) { - return pv_lock_ops.vcpu_is_preempted.func == - __raw_callee_save___native_vcpu_is_preempted; + return pv_lock_ops.vcpu_is_preempted == __native_vcpu_is_preempted; } struct pv_lock_ops pv_lock_ops = { @@ -38,7 +36,7 @@ struct pv_lock_ops pv_lock_ops = { .queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock), .wait = paravirt_nop, .kick = paravirt_nop, - .vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted), + .vcpu_is_preempted = __native_vcpu_is_preempted, #endif /* SMP */ }; EXPORT_SYMBOL(pv_lock_ops); diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 25a7c43..c85bb8f 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -114,8 +114,6 @@ void xen_uninit_lock_cpu(int cpu) per_cpu(irq_name, cpu) = NULL; } -PV_CALLEE_SAVE_REGS_THUNK(xen_vcpu_stolen); - /* * Our init of PV spinlocks is split in two init functions due to us * using paravirt patching and jump labels patching and having to do @@ -138,7 +136,7 @@ void __init xen_init_spinlocks(void) pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); pv_lock_ops.wait = xen_qlock_wait; pv_lock_ops.kick = xen_qlock_kick; - pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen); + pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen; } static __init int xen_parse_nopvspin(char
[Xen-devel] [PATCH v2] x86, locking/spinlocks: Remove paravirt_ticketlocks_enabled
This is a follow-up of commit cfd8983f03c7b2 ("x86, locking/spinlocks: Remove ticket (spin)lock implementation"). The static_key structure paravirt_ticketlocks_enabled is now removed as it is no longer used. As a result, the init functions kvm_spinlock_init_jump() and xen_init_spinlocks_jump() are also removed. A simple build and boot test was done to verify it. Signed-off-by: Waiman Long <long...@redhat.com> --- v1->v2: - Remove init functions kvm_spinlock_init_jump() and xen_init_spinlocks_jump(). arch/x86/include/asm/spinlock.h | 3 --- arch/x86/kernel/kvm.c| 14 -- arch/x86/kernel/paravirt-spinlocks.c | 3 --- arch/x86/xen/spinlock.c | 19 --- 4 files changed, 39 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 921bea7..6d39190 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -23,9 +23,6 @@ /* How long a lock should spin before we consider blocking */ #define SPIN_THRESHOLD (1 << 15) -extern struct static_key paravirt_ticketlocks_enabled; -static __always_inline bool static_key_false(struct static_key *key); - #include /* diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 36bc664..099fcba 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -620,18 +620,4 @@ void __init kvm_spinlock_init(void) } } -static __init int kvm_spinlock_init_jump(void) -{ - if (!kvm_para_available()) - return 0; - if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) - return 0; - - static_key_slow_inc(_ticketlocks_enabled); - printk(KERN_INFO "KVM setup paravirtual spinlock\n"); - - return 0; -} -early_initcall(kvm_spinlock_init_jump); - #endif /* CONFIG_PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6d4bf81..6259327 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -42,6 +42,3 @@ struct pv_lock_ops pv_lock_ops = { #endif /* SMP */ }; EXPORT_SYMBOL(pv_lock_ops); - -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; -EXPORT_SYMBOL(paravirt_ticketlocks_enabled); diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index e8a9ea7..25a7c43 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -141,25 +141,6 @@ void __init xen_init_spinlocks(void) pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen); } -/* - * While the jump_label init code needs to happend _after_ the jump labels are - * enabled and before SMP is started. Hence we use pre-SMP initcall level - * init. We cannot do it in xen_init_spinlocks as that is done before - * jump labels are activated. - */ -static __init int xen_init_spinlocks_jump(void) -{ - if (!xen_pvspin) - return 0; - - if (!xen_domain()) - return 0; - - static_key_slow_inc(_ticketlocks_enabled); - return 0; -} -early_initcall(xen_init_spinlocks_jump); - static __init int xen_parse_nopvspin(char *arg) { xen_pvspin = false; -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] [PATCH] x86, locking/spinlocks: Remove paravirt_ticketlocks_enabled
This is a follow-up of commit cfd8983f03c7b2 ("x86, locking/spinlocks: Remove ticket (spin)lock implementation"). The static_key structure paravirt_ticketlocks_enabled is now removed as it is no longer used. A simple build and boot test was done to verify it. Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/include/asm/spinlock.h | 3 --- arch/x86/kernel/kvm.c| 1 - arch/x86/kernel/paravirt-spinlocks.c | 3 --- arch/x86/xen/spinlock.c | 1 - 4 files changed, 8 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 921bea7..6d39190 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -23,9 +23,6 @@ /* How long a lock should spin before we consider blocking */ #define SPIN_THRESHOLD (1 << 15) -extern struct static_key paravirt_ticketlocks_enabled; -static __always_inline bool static_key_false(struct static_key *key); - #include /* diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 36bc664..6750fdc 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -627,7 +627,6 @@ static __init int kvm_spinlock_init_jump(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return 0; - static_key_slow_inc(_ticketlocks_enabled); printk(KERN_INFO "KVM setup paravirtual spinlock\n"); return 0; diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6d4bf81..6259327 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -42,6 +42,3 @@ struct pv_lock_ops pv_lock_ops = { #endif /* SMP */ }; EXPORT_SYMBOL(pv_lock_ops); - -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; -EXPORT_SYMBOL(paravirt_ticketlocks_enabled); diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index e8a9ea7..a822606 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -155,7 +155,6 @@ static __init int xen_init_spinlocks_jump(void) if (!xen_domain()) return 0; - static_key_slow_inc(_ticketlocks_enabled); return 0; } early_initcall(xen_init_spinlocks_jump); -- 1.8.3.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v16 08/14] pvqspinlock: Implement simple paravirt support for the qspinlock
On 05/04/2015 10:20 AM, Peter Zijlstra wrote: I changed it to the below; I've not gotten around to compiling or even running it yet :-( The biggest change is the pv_hash/pv_unhash functions, which I've rewritten to hopefully be clearer (and also hopefully not wrecked them). I took out the cacheline sized structure which takes out that double loop and simplifies things. I've also added some comments which hopefully explain how/why we ended up with this exact scheme. I've also moved the __pv_queue_spin_unlock() function to the tail, such that we keep the 'wait'/'kick' order for both node and head. In any case, like I just wrote on the other email, I've stuck some things in my queue (up to and including patch 11) and if it all works out we can continue from there. --- Subject: pvqspinlock: Implement simple paravirt support for the qspinlock From: Waiman Longwaiman.l...@hp.com Date: Fri, 24 Apr 2015 14:56:37 -0400 Provide a separate (second) version of the spin_lock_slowpath for paravirt along with a special unlock path. The second slowpath is generated by adding a few pv hooks to the normal slowpath, but where those will compile away for the native case, they expand into special wait/wake code for the pv version. The actual MCS queue can use extra storage in the mcs_nodes[] array to keep track of state and therefore uses directed wakeups. The head contender has no such storage directly visible to the unlocker. So the unlocker searches a hash table with open addressing using a simple binary Galois linear feedback shift register. I am fine with the change as it makes it simpler. BTW, I just saw a build error mail. That should be fixed easily with some minor edit. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg
On 05/04/2015 10:05 AM, Peter Zijlstra wrote: On Thu, Apr 30, 2015 at 02:49:26PM -0400, Waiman Long wrote: On 04/29/2015 02:11 PM, Peter Zijlstra wrote: On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote: In the pv_scan_next() function, the slow cmpxchg atomic operation is performed even if the other CPU is not even close to being halted. This extra cmpxchg can harm slowpath performance. This patch introduces the new mayhalt flag to indicate if the other spinning CPU is close to being halted or not. The current threshold for x86 is 2k cpu_relax() calls. If this flag is not set, the other spinning CPU will have at least 2k more cpu_relax() calls before it can enter the halt state. This should give enough time for the setting of the locked flag in struct mcs_spinlock to propagate to that CPU without using atomic op. Yuck! I'm not at all sure you can make assumptions like that. And the worst part is, if it goes wrong the borkage is subtle and painful. I do think the code is OK. However, you are right that if my reasoning is incorrect, the resulting bug will be really subtle. So I do not think its correct, imagine the fabrics used for the 4096 cpu SGI machine, now add some serious traffic to them. There is no saying your random 2k relax loop will be enough to propagate the change. Equally, another arch (this is generic code) might have starvation issues on its inter-cpu fabric and delay the store just long enough. The thing is, one should _never_ rely on timing for correctness, _ever_. Yes, you are right. Having a dependency on timing can be dangerous. So I am going to withdraw this particular patch as it has no functional impact to the overall patch series. Please let me know if you have any other comments on other parts of the series and I will send send out a new series without this particular patch. Please wait a little while, I've queued the 'basic' patches, once that settles in tip we can look at the others. Also, I have some local changes (sorry, I could not help mysef) I should post, I've been somewhat delayed by illness. Sure. I will wait until you finish your tip test. I am sorry to hear that you are bothered with illness. I hope you get well by now. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg
On 04/29/2015 02:27 PM, Linus Torvalds wrote: On Wed, Apr 29, 2015 at 11:11 AM, Peter Zijlstrapet...@infradead.org wrote: On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote: In the pv_scan_next() function, the slow cmpxchg atomic operation is performed even if the other CPU is not even close to being halted. This extra cmpxchg can harm slowpath performance. This patch introduces the new mayhalt flag to indicate if the other spinning CPU is close to being halted or not. The current threshold for x86 is 2k cpu_relax() calls. If this flag is not set, the other spinning CPU will have at least 2k more cpu_relax() calls before it can enter the halt state. This should give enough time for the setting of the locked flag in struct mcs_spinlock to propagate to that CPU without using atomic op. Yuck! I'm not at all sure you can make assumptions like that. And the worst part is, if it goes wrong the borkage is subtle and painful.\ I have to agree with Peter. But it goes beyond this particular patch. Patterns like this: xchg(pn-mayhalt, true); are just evil and disgusting. Even befoe this patch, that code had (void)xchg(pn-state, vcpu_halted); which is *wrong* and should never be done. If you want it to be set_mb() (which sets a value and has a memory barrier), then use set_mb(). Yes, it happens to use a xchg() to do so, but dammit, it documents that whole this is a memory barrier in the name. Also, anybody who does this should damn well document why the memory barrier is needed. The xchg(pn-state, vcpu_halted) at least is preceded by a comment about the barriers. The new mayhalt has no sane comment in it, and the reason seems to be that no sane comment is possible. The xchg() seems to be some black magic thing. Let's not introduce magic stuff in our locking primitives. At least not undocumented magic that makes no sense. Linus Thanks for the comments. I will withdraw this patch and use set_mb() in the code as suggested for better readability. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg
On 04/29/2015 02:11 PM, Peter Zijlstra wrote: On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote: In the pv_scan_next() function, the slow cmpxchg atomic operation is performed even if the other CPU is not even close to being halted. This extra cmpxchg can harm slowpath performance. This patch introduces the new mayhalt flag to indicate if the other spinning CPU is close to being halted or not. The current threshold for x86 is 2k cpu_relax() calls. If this flag is not set, the other spinning CPU will have at least 2k more cpu_relax() calls before it can enter the halt state. This should give enough time for the setting of the locked flag in struct mcs_spinlock to propagate to that CPU without using atomic op. Yuck! I'm not at all sure you can make assumptions like that. And the worst part is, if it goes wrong the borkage is subtle and painful. I do think the code is OK. However, you are right that if my reasoning is incorrect, the resulting bug will be really subtle. So I am going to withdraw this particular patch as it has no functional impact to the overall patch series. Please let me know if you have any other comments on other parts of the series and I will send send out a new series without this particular patch. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v16 04/14] qspinlock: Extract out code snippets for the next patch
This is a preparatory patch that extracts out the following 2 code snippets to prepare for the next performance optimization patch. 1) the logic for the exchange of new and previous tail code words into a new xchg_tail() function. 2) the logic for clearing the pending bit and setting the locked bit into a new clear_pending_set_locked() function. This patch also simplifies the trylock operation before queuing by calling queue_spin_trylock() directly. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- include/asm-generic/qspinlock_types.h |2 + kernel/locking/qspinlock.c| 79 - 2 files changed, 50 insertions(+), 31 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index 9c3f5c2..ef36613 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -58,6 +58,8 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) + #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) #define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0351f78..11f6ad9 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -97,6 +97,42 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) /** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * + * *,1,0 - *,0,1 + */ +static __always_inline void clear_pending_set_locked(struct qspinlock *lock) +{ + atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, lock-val); +} + +/** + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + u32 old, new, val = atomic_read(lock-val); + + for (;;) { + new = (val _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + return old; +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word @@ -178,15 +214,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) * * *,1,0 - *,0,1 */ - for (;;) { - new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + clear_pending_set_locked(lock); return; /* @@ -203,37 +231,26 @@ queue: node-next = NULL; /* -* We have already touched the queueing cacheline; don't bother with -* pending stuff. -* -* trylock || xchg(lock, node) -* -* 0,0,0 - 0,0,1 ; no tail, not locked - no tail, locked. -* p,y,x - n,y,x ; tail was p - tail is n; preserving locked. +* We touched a (possibly) cold cacheline in the per-cpu queue node; +* attempt the trylock once more in the hope someone let go while we +* weren't watching. */ - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + if (queue_spin_trylock(lock)) + goto release; /* -* we won the trylock; forget about queueing. +* We have already touched the queueing cacheline; don't bother with +* pending stuff. +* +* p,*,* - n,*,* */ - if (new == _Q_LOCKED_VAL) - goto release; + old = xchg_tail(lock, tail); /* * if there was a previous node; link it and wait until reaching the * head of the waitqueue. */ - if (old ~_Q_LOCKED_PENDING_MASK) { + if (old _Q_TAIL_MASK) { prev = decode_tail(old); WRITE_ONCE(prev-next, node); -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v16 01/14] qspinlock: A simple generic 4-byte queue spinlock
This patch introduces a new generic queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair as the ticket spinlock. It has about the same speed in single-thread and it can be much faster in high contention situations especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention where the average queue depth is around 1-3 will this queue spinlock be potentially a bit slower due to the higher slowpath overhead. This queue spinlock is especially suit to NUMA machines with a large number of cores as the chance of spinlock contention is much higher in those machines. The cost of contention is also higher because of slower inter-node memory traffic. Due to the fact that spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending in one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and used them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit size (including CPU number and queue node index) leaving one byte for the lock. Please note that the queue node is only needed when waiting for the lock. Once the lock is acquired, the queue node can be released to be used later. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- include/asm-generic/qspinlock.h | 132 + include/asm-generic/qspinlock_types.h | 58 + kernel/Kconfig.locks |7 + kernel/locking/Makefile |1 + kernel/locking/mcs_spinlock.h |1 + kernel/locking/qspinlock.c| 209 + 6 files changed, 408 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 kernel/locking/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 000..315d6dc --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,132 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long waiman.l...@hp.com + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return atomic_read(lock-val); +} + +/** + * queue_spin_value_unlocked - is the spinlock structure unlocked? + * @lock: queue spinlock structure + * Return: 1 if it is unlocked, 0 otherwise + * + * N.B. Whenever there are tasks waiting for the lock, it is considered + * locked wrt the lockref code to avoid lock stealing by the lockref + * code and change things underneath the lock. This also allows some + * optimizations to be applied without conflict with lockref. + */ +static __always_inline int queue_spin_value_unlocked(struct qspinlock lock) +{ + return !atomic_read(lock.val); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return atomic_read(lock-val) ~_Q_LOCKED_MASK; +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!atomic_read(lock-val) + (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) == 0)) + return 1; + return 0; +} + +extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + u32
[Xen-devel] [PATCH v16 00/14] qspinlock: a 4-byte queue spinlock with PV support
that the queue spinlock patch can improve the performance of an I/O and interrupt intensive stress test with a lot of spinlock contention on a 2-socket system by up to 20%. Daniel Blueman of NumaScale had tested the v14 qspinlock patch on an older 512-core NumaConnect system for testing with 3.19.1: $ src/reaim -c data/reaim.config -f data/workfile.compute -i 16 -e 256 Num Parent Child Child Jobs per Jobs/min/ Std_dev Std_dev JTI Forked Time SysTime UTime Minute Child Time Percent 1 2.15 0.361.192818.602818.600.00 0.00100 17 6.29 42.41 22.43 16378.38 963.43 0.27 4.4395 33 56.241611.77 47.04 3555.83107.75 1.58 2.8897 49 120.88 5576.87 63.54 2456.4950.13 1.75 1.4798 65 199.17 12328.42 81.27 1977.7130.43 2.42 1.2398 81 270.89 20943.99 103.99 1812.0322.37 3.85 1.4498 97 315.31 29211.63 121.30 1864.2619.22 4.48 1.4498 113 397.68 42766.15 144.48 1721.9415.24 7.04 1.8098 129 520.74 63442.27 162.03 1501.2111.64 10.241.9998 ... Patched with qspinlocks v14: Num Parent Child Child Jobs per Jobs/min/ Std_dev Std_dev JTI Forked Time SysTime UTime Minute Child Time Percent 1 1.98 0.361.213060.613060.610.00 0.00100 17 5.60 25.41 22.39 18396.43 1082.140.16 3.0097 33 12.67153.64 49.17 15783.74 478.30 0.50 4.1695 49 21.15365.55 76.61 14039.72 286.52 0.83 4.1995 65 27.57556.87 105.96 14287.27 219.80 0.75 2.8097 81 33.07712.92 127.85 14843.06 183.25 1.16 3.6996 97 42.811009.02 161.87 13730.90 141.56 1.75 4.2795 113 49.131166.04 185.10 13938.12 123.35 2.11 4.5395 129 56.621262.07 231.92 13806.78 107.03 2.67 4.9995 145 66.411420.29 250.24 13231.44 91.25 2.95 4.7095 179 90.152237.78 325.18 12032.61 67.22 4.23 4.9495 193 85.411865.81 326.64 13693.71 70.95 4.29 5.3294 220 93.582298.77 376.46 14246.63 64.76 4.97 5.6594 Max Jobs per Minute 18396.43 The compute workload is mostly userspace code with some synchronization between working threads (HPC type workload). With large user count, the patched kernel has close to 8X the performance of an unpatched kernel. Paravirt qspinlock patch performance A linux kernel build workload (make -j n) was run on KVM and Xen guests with and without the qspinlock patch. The host machine was an 8-socket Westmere-EX server with 4 active cores/socket (8x4). For the KVM test, the VM configurations were: 1) 32 NUMA-pinned vCPUs (8x4) 2) 32 vCPUs (no pinning or NUMA) 3) 60 vCPUs (overcommitted) The kernel build times for different 4.0-rc6 based kernels were: KernelVM1 VM2 VM3 ----- --- --- PV ticket lock 3m49.7s 11m45.6s15m59.3s PV qspinlock 3m50.4s5m51.6s15m33.6s For VM1, both the qspinlock and ticket lock have essentially the same performance. VM2 illustrated the performance benefit of qspinlock versus ticket spinlock which got reduced in VM3 due to the overhead of constant vCPUs halting and kicking. For the Xen test, the VM configurations were: 1) 32 vCPUs (no pinning or NUMA) 2) 64 vCPUs 2) 128 vCPUs The kernel build times for different 4.0 based kernels were: KernelVM1 VM2 VM3 ----- --- --- PV ticket lock 7m27.3s 16m14.1s18m26.9s PV qspinlock 5m51.6s 12m02.6s14m19.2s Here, the PV qspinlock was faster than the PV ticket lock. David Vrabel (1): pvqspinlock, x86: Enable PV qspinlock for Xen Peter Zijlstra (Intel) (4): qspinlock: Add pending bit qspinlock: Optimize for smaller NR_CPUS qspinlock: Revert to test-and-set on hypervisors pvqspinlock, x86: Implement the paravirt qspinlock call patching Waiman Long (9): qspinlock: A simple generic 4-byte queue spinlock qspinlock, x86: Enable x86-64 to use queue spinlock qspinlock: Extract out code snippets for the next patch qspinlock: Use a simple write to grab the lock pvqspinlock: Implement simple paravirt support for the qspinlock pvqspinlock, x86: Enable PV qspinlock for KVM pvqspinlock: Only kick CPU at unlock time pvqspinlock: Improve slowpath performance by avoiding cmpxchg pvqspinlock: Collect slowpath lock statistics arch/x86/Kconfig |3 +- arch/x86/include/asm/paravirt.h | 29 ++- arch/x86/include/asm/paravirt_types.h | 10 + arch/x86/include/asm/qspinlock.h | 57
[Xen-devel] [PATCH v16 09/14] pvqspinlock, x86: Implement the paravirt qspinlock call patching
From: Peter Zijlstra (Intel) pet...@infradead.org We use the regular paravirt call patching to switch between: native_queue_spin_lock_slowpath() __pv_queue_spin_lock_slowpath() native_queue_spin_unlock()__pv_queue_spin_unlock() We use a callee saved call for the unlock function which reduces the i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions again. We further optimize the unlock path by patching the direct call with a movb $0,%arg1 if we are indeed using the native unlock code. This makes the unlock code almost as fast as the !PARAVIRT case. This significantly lowers the overhead of having CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/Kconfig |2 +- arch/x86/include/asm/paravirt.h | 29 - arch/x86/include/asm/paravirt_types.h | 10 ++ arch/x86/include/asm/qspinlock.h | 25 - arch/x86/include/asm/qspinlock_paravirt.h |6 ++ arch/x86/kernel/paravirt-spinlocks.c | 24 +++- arch/x86/kernel/paravirt_patch_32.c | 22 ++ arch/x86/kernel/paravirt_patch_64.c | 22 ++ 8 files changed, 128 insertions(+), 12 deletions(-) create mode 100644 arch/x86/include/asm/qspinlock_paravirt.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 49fecb1..a0946e7 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -661,7 +661,7 @@ config PARAVIRT_DEBUG config PARAVIRT_SPINLOCKS bool Paravirtualization layer for spinlocks depends on PARAVIRT SMP - select UNINLINE_SPIN_UNLOCK + select UNINLINE_SPIN_UNLOCK if !QUEUE_SPINLOCK ---help--- Paravirtualized spinlocks allow a pvops backend to replace the spinlock implementation with something virtualization-friendly diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 965c47d..f76199a 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -712,6 +712,31 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUE_SPINLOCK + +static __always_inline void pv_queue_spin_lock_slowpath(struct qspinlock *lock, + u32 val) +{ + PVOP_VCALL2(pv_lock_ops.queue_spin_lock_slowpath, lock, val); +} + +static __always_inline void pv_queue_spin_unlock(struct qspinlock *lock) +{ + PVOP_VCALLEE1(pv_lock_ops.queue_spin_unlock, lock); +} + +static __always_inline void pv_wait(u8 *ptr, u8 val) +{ + PVOP_VCALL2(pv_lock_ops.wait, ptr, val); +} + +static __always_inline void pv_kick(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick, cpu); +} + +#else /* !CONFIG_QUEUE_SPINLOCK */ + static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -724,7 +749,9 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } -#endif +#endif /* CONFIG_QUEUE_SPINLOCK */ + +#endif /* SMP PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 #define PV_SAVE_REGS pushl %ecx; pushl %edx; diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 7549b8b..f6acaea 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -333,9 +333,19 @@ struct arch_spinlock; typedef u16 __ticket_t; #endif +struct qspinlock; + struct pv_lock_ops { +#ifdef CONFIG_QUEUE_SPINLOCK + void (*queue_spin_lock_slowpath)(struct qspinlock *lock, u32 val); + struct paravirt_callee_save queue_spin_unlock; + + void (*wait)(u8 *ptr, u8 val); + void (*kick)(int cpu); +#else /* !CONFIG_QUEUE_SPINLOCK */ struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket); +#endif /* !CONFIG_QUEUE_SPINLOCK */ }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index 64c925e..c8290db 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -3,6 +3,7 @@ #include asm/cpufeature.h #include asm-generic/qspinlock_types.h +#include asm/paravirt.h #definequeue_spin_unlock queue_spin_unlock /** @@ -11,11 +12,33 @@ * * A smp_store_release() on the least-significant byte. */ -static inline void queue_spin_unlock(struct qspinlock *lock) +static inline void native_queue_spin_unlock(struct qspinlock *lock) { smp_store_release((u8 *)lock, 0); } +#ifdef CONFIG_PARAVIRT_SPINLOCKS +extern void
[Xen-devel] [PATCH v16 10/14] pvqspinlock, x86: Enable PV qspinlock for KVM
This patch adds the necessary KVM specific code to allow KVM to support the CPU halting and kicking operations needed by the queue spinlock PV code. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/kernel/kvm.c | 43 +++ kernel/Kconfig.locks |2 +- 2 files changed, 44 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index e354cc6..4bb42c0 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -584,6 +584,39 @@ static void kvm_kick_cpu(int cpu) kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid); } + +#ifdef CONFIG_QUEUE_SPINLOCK + +#include asm/qspinlock.h + +static void kvm_wait(u8 *ptr, u8 val) +{ + unsigned long flags; + + if (in_nmi()) + return; + + local_irq_save(flags); + + if (READ_ONCE(*ptr) != val) + goto out; + + /* +* halt until it's our turn and kicked. Note that we do safe halt +* for irq enabled case to avoid hang when lock info is overwritten +* in irq spinlock slowpath and no spurious interrupt occur to save us. +*/ + if (arch_irqs_disabled_flags(flags)) + halt(); + else + safe_halt(); + +out: + local_irq_restore(flags); +} + +#else /* !CONFIG_QUEUE_SPINLOCK */ + enum kvm_contention_stat { TAKEN_SLOW, TAKEN_SLOW_PICKUP, @@ -817,6 +850,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket) } } +#endif /* !CONFIG_QUEUE_SPINLOCK */ + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ @@ -828,8 +863,16 @@ void __init kvm_spinlock_init(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return; +#ifdef CONFIG_QUEUE_SPINLOCK + __pv_init_lock_hash(); + pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath; + pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock); + pv_lock_ops.wait = kvm_wait; + pv_lock_ops.kick = kvm_kick_cpu; +#else /* !CONFIG_QUEUE_SPINLOCK */ pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning); pv_lock_ops.unlock_kick = kvm_unlock_kick; +#endif } static __init int kvm_spinlock_init_jump(void) diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index c6a8f7c..537b13e 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK config QUEUE_SPINLOCK def_bool y if ARCH_USE_QUEUE_SPINLOCK - depends on SMP !PARAVIRT_SPINLOCKS + depends on SMP (!PARAVIRT_SPINLOCKS || !XEN) config ARCH_USE_QUEUE_RWLOCK bool -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v16 06/14] qspinlock: Use a simple write to grab the lock
Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head don't need to reset the tail code. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the tail code in the lock is set. With that change, the are some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westere-EX machine as shown in the tables below. [Standalone/Embedded - same node] # of tasksBefore patchAfter patch %Change ----- -- --- 3 2324/2321 2248/2265-3%/-2% 4 2890/2896 2819/2831-2%/-2% 5 3611/3595 3522/3512-2%/-2% 6 4281/4276 4173/4160-3%/-3% 7 5018/5001 4875/4861-3%/-3% 8 5759/5750 5563/5568-3%/-3% [Standalone/Embedded - different nodes] # of tasksBefore patchAfter patch %Change ----- -- --- 312242/12237 12087/12093 -1%/-1% 410688/10696 10507/10521 -2%/-2% It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and was essentially to close the performance gap between the ticket spinlock and queue spinlock. The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were: AIM7 XFS Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock56782333.17 96.61 5.81 qspinlock 57507993.13 94.83 5.97 AIM7 EXT4 Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock1114551 16.15 509.72 7.11 qspinlock 21844668.24 232.99 6.01 The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. The ebizzy -m test was also run with the following results: kernel records/s Real Time Sys TimeUsr Time -- - ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- kernel/locking/qspinlock.c | 66 +-- 1 files changed, 50 insertions(+), 16 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index bcc99e6..99503ef 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -105,24 +105,37 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) * By using the whole 2nd least significant byte for the pending bit, we * can allow better optimization of the lock acquisition for the pending * bit holder. + * + * This internal structure is also used by the set_locked function which + * is not restricted to _Q_PENDING_BITS == 8. */ -#if _Q_PENDING_BITS == 8 - struct __qspinlock { union { atomic_t val; - struct { #ifdef __LITTLE_ENDIAN + struct { + u8 locked; + u8 pending; + }; + struct { u16 locked_pending; u16 tail; + }; #else + struct { u16 tail; u16 locked_pending; -#endif }; + struct { + u8 reserved[2]; + u8 pending; + u8 locked; + }; +#endif }; }; +#if _Q_PENDING_BITS == 8 /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -195,6 +208,19 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) #endif /* _Q_PENDING_BITS == 8 */ /** + * set_locked - Set the lock bit and own the lock + * @lock: Pointer to queue spinlock structure + * + * *,*,0 - *,0,1 + */ +static __always_inline void set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + WRITE_ONCE(l-locked, _Q_LOCKED_VAL
[Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg
In the pv_scan_next() function, the slow cmpxchg atomic operation is performed even if the other CPU is not even close to being halted. This extra cmpxchg can harm slowpath performance. This patch introduces the new mayhalt flag to indicate if the other spinning CPU is close to being halted or not. The current threshold for x86 is 2k cpu_relax() calls. If this flag is not set, the other spinning CPU will have at least 2k more cpu_relax() calls before it can enter the halt state. This should give enough time for the setting of the locked flag in struct mcs_spinlock to propagate to that CPU without using atomic op. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock_paravirt.h | 28 +--- 1 files changed, 25 insertions(+), 3 deletions(-) diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h index 9b4ac3d..41ee033 100644 --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -19,7 +19,8 @@ * native_queue_spin_unlock(). */ -#define _Q_SLOW_VAL(3U _Q_LOCKED_OFFSET) +#define _Q_SLOW_VAL(3U _Q_LOCKED_OFFSET) +#define MAYHALT_THRESHOLD (SPIN_THRESHOLD 4) /* * The vcpu_hashed is a special state that is set by the new lock holder on @@ -39,6 +40,7 @@ struct pv_node { int cpu; u8 state; + u8 mayhalt; }; /* @@ -198,6 +200,7 @@ static void pv_init_node(struct mcs_spinlock *node) pn-cpu = smp_processor_id(); pn-state = vcpu_running; + pn-mayhalt = false; } /* @@ -214,23 +217,34 @@ static void pv_wait_node(struct mcs_spinlock *node) for (loop = SPIN_THRESHOLD; loop; loop--) { if (READ_ONCE(node-locked)) return; + if (loop == MAYHALT_THRESHOLD) + xchg(pn-mayhalt, true); cpu_relax(); } /* -* Order pn-state vs pn-locked thusly: +* Order pn-state/pn-mayhalt vs pn-locked thusly: * -* [S] pn-state = vcpu_halted[S] next-locked = 1 +* [S] pn-mayhalt = 1[S] next-locked = 1 +* MB, delay barrier() +* [S] pn-state = vcpu_halted[L] pn-mayhalt * MB MB * [L] pn-locked [RmW] pn-state = vcpu_hashed * * Matches the cmpxchg() from pv_scan_next(). +* +* As the new lock holder may quit (when pn-mayhalt is not +* set) without memory barrier, a sufficiently long delay is +* inserted between the setting of pn-mayhalt and pn-state +* to ensure that there is enough time for the new pn-locked +* value to be propagated here to be checked below. */ (void)xchg(pn-state, vcpu_halted); if (!READ_ONCE(node-locked)) pv_wait(pn-state, vcpu_halted); + pn-mayhalt = false; /* * Reset the state except when vcpu_hashed is set. */ @@ -263,6 +277,14 @@ static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node) struct __qspinlock *l = (void *)lock; /* +* If mayhalt is not set, there is enough time for the just set value +* in pn-locked to be propagated to the other CPU before it is time +* to halt. +*/ + if (!READ_ONCE(pn-mayhalt)) + return; + + /* * Transition CPU state: halted = hashed * Quit if the transition failed. */ -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v16 03/14] qspinlock: Add pending bit
From: Peter Zijlstra (Intel) pet...@infradead.org Because the qspinlock needs to touch a second cacheline (the per-cpu mcs_nodes[]); add a pending bit and allow a single in-word spinner before we punt to the second cacheline. It is possible so observe the pending bit without the locked bit when the last owner has just released but the pending owner has not yet taken ownership. In this case we would normally queue -- because the pending bit is already taken. However, in this case the pending bit is guaranteed to be released 'soon', therefore wait for it and avoid queueing. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 12 +++- kernel/locking/qspinlock.c| 119 +++-- 2 files changed, 107 insertions(+), 24 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index c9348d8..9c3f5c2 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -36,8 +36,9 @@ typedef struct qspinlock { * Bitfields in the atomic value: * * 0- 7: locked byte - * 8- 9: tail index - * 10-31: tail cpu (+1) + * 8: pending + * 9-10: tail index + * 11-31: tail cpu (+1) */ #define_Q_SET_MASK(type) (((1U _Q_ ## type ## _BITS) - 1)\ _Q_ ## type ## _OFFSET) @@ -45,7 +46,11 @@ typedef struct qspinlock { #define _Q_LOCKED_BITS 8 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) -#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_BITS1 +#define _Q_PENDING_MASK_Q_SET_MASK(PENDING) + +#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) #define _Q_TAIL_IDX_BITS 2 #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX) @@ -54,5 +59,6 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) +#define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 3456819..0351f78 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) return per_cpu_ptr(mcs_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word * - * (queue tail, lock value) - * - * fast :slow : unlock - *: : - * uncontended (0,0) --:-- (0,1) :-- (*,0) - *: | ^./ : - *: v \ | : - * uncontended:(n,x) --+-- (n,0) | : - * queue: | ^--' | : - *: v | : - * contended :(*,x) --+-- (*,0) - (*,1) ---' : - * queue: ^--' : + * (queue tail, pending bit, lock value) * + * fast :slow :unlock + * : : + * uncontended (0,0,0) -:-- (0,0,1) --:-- (*,*,0) + * : | ^.--. / : + * : v \ \| : + * pending :(0,1,1) +-- (0,1,0) \ | : + * : | ^--' | | : + * : v | | : + * uncontended :(n,x,y) +-- (n,0,0) --' | : + * queue : | ^--' | : + * : v | : + * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -' : + * queue : ^--' : */ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) { @@ -121,6 +125,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + /* +* wait for in-progress pending-locked hand-overs +* +* 0,1,0 - 0,0,1 +*/ + if (val == _Q_PENDING_VAL) { + while ((val = atomic_read(lock-val)) == _Q_PENDING_VAL
[Xen-devel] [PATCH v16 14/14] pvqspinlock: Collect slowpath lock statistics
This patch enables the accumulation of PV qspinlock statistics when either one of the following three sets of CONFIG parameters are enabled: 1) CONFIG_LOCK_STAT CONFIG_DEBUG_FS 2) CONFIG_KVM_DEBUG_FS 3) CONFIG_XEN_DEBUG_FS The accumulated lock statistics will be reported in debugfs under the pv-qspinlock directory. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock_paravirt.h | 100 ++- 1 files changed, 98 insertions(+), 2 deletions(-) diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h index 41ee033..d512d9b 100644 --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -43,6 +43,86 @@ struct pv_node { u8 mayhalt; }; +#if defined(CONFIG_KVM_DEBUG_FS) || defined(CONFIG_XEN_DEBUG_FS) ||\ + (defined(CONFIG_LOCK_STAT) defined(CONFIG_DEBUG_FS)) +#define PV_QSPINLOCK_STAT +#endif + +/* + * PV qspinlock statistics + */ +enum pv_qlock_stat { + pv_stat_wait_head, + pv_stat_wait_node, + pv_stat_wait_hash, + pv_stat_kick_cpu, + pv_stat_no_kick, + pv_stat_spurious, + pv_stat_hash, + pv_stat_hops, + pv_stat_num /* Total number of statistics counts */ +}; + +#ifdef PV_QSPINLOCK_STAT + +#include linux/debugfs.h + +static const char * const stat_fsnames[pv_stat_num] = { + [pv_stat_wait_head] = wait_head_count, + [pv_stat_wait_node] = wait_node_count, + [pv_stat_wait_hash] = wait_hash_count, + [pv_stat_kick_cpu] = kick_cpu_count, + [pv_stat_no_kick] = no_kick_count, + [pv_stat_spurious] = spurious_wakeup, + [pv_stat_hash] = hash_count, + [pv_stat_hops] = hash_hops_count, +}; + +static atomic_t pv_stats[pv_stat_num]; + +/* + * Initialize debugfs for the PV qspinlock statistics + */ +static int __init pv_qspinlock_debugfs(void) +{ + struct dentry *d_pvqlock = debugfs_create_dir(pv-qspinlock, NULL); + int i; + + if (!d_pvqlock) + printk(KERN_WARNING + Could not create 'pv-qspinlock' debugfs directory\n); + + for (i = 0; i pv_stat_num; i++) + debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock, + (u32 *)pv_stats[i]); + return 0; +} +fs_initcall(pv_qspinlock_debugfs); + +/* + * Increment the PV qspinlock statistics counts + */ +static inline void pvstat_inc(enum pv_qlock_stat stat) +{ + atomic_inc(pv_stats[stat]); +} + +/* + * PV hash hop count + */ +static inline void pvstat_hop(int hopcnt) +{ + atomic_inc(pv_stats[pv_stat_hash]); + atomic_add(hopcnt, pv_stats[pv_stat_hops]); +} + +#else /* PV_QSPINLOCK_STAT */ + +static inline void pvstat_inc(enum pv_qlock_stat stat) { } +static inline void pvstat_hop(int hopcnt) { } + +#endif /* PV_QSPINLOCK_STAT */ + /* * Lock and MCS node addresses hash table for fast lookup * @@ -102,11 +182,13 @@ pv_hash(struct qspinlock *lock, struct pv_node *node) { unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits); struct pv_hash_entry *he, *end; + int hopcnt = 0; init_hash = hash; for (;;) { he = pv_lock_hash[hash].ent; for (end = he + PV_HE_PER_LINE; he end; he++) { + hopcnt++; if (!cmpxchg(he-lock, NULL, lock)) { /* * We haven't set the _Q_SLOW_VAL yet. So @@ -122,6 +204,7 @@ pv_hash(struct qspinlock *lock, struct pv_node *node) } done: + pvstat_hop(hopcnt); return he-lock; } @@ -177,8 +260,12 @@ __visible void __pv_queue_spin_unlock(struct qspinlock *lock) * At this point the memory pointed at by lock can be freed/reused, * however we can still use the PV node to kick the CPU. */ - if (READ_ONCE(node-state) != vcpu_running) + if (READ_ONCE(node-state) != vcpu_running) { + pvstat_inc(pv_stat_kick_cpu); pv_kick(node-cpu); + } else { + pvstat_inc(pv_stat_no_kick); + } } /* * Include the architecture specific callee-save thunk of the @@ -241,8 +328,10 @@ static void pv_wait_node(struct mcs_spinlock *node) */ (void)xchg(pn-state, vcpu_halted); - if (!READ_ONCE(node-locked)) + if (!READ_ONCE(node-locked)) { + pvstat_inc(pv_stat_wait_node); pv_wait(pn-state, vcpu_halted); + } pn-mayhalt = false; /* @@ -250,6 +339,8 @@ static void pv_wait_node(struct mcs_spinlock *node) */ (void)cmpxchg(pn-state, vcpu_halted, vcpu_running); + if (READ_ONCE(node-locked)) + break; /* * If the locked flag is still
[Xen-devel] [PATCH v16 12/14] pvqspinlock: Only kick CPU at unlock time
Before this patch, a CPU may have been kicked twice before getting the lock - one before it becomes queue head and once before it gets the lock. All these CPU kicking and halting (VMEXIT) can be expensive and slow down system performance, especially in an overcommitted guest. This patch adds a new vCPU state (vcpu_hashed) which enables the code to delay CPU kicking until at unlock time. Once this state is set, the new lock holder will set _Q_SLOW_VAL and fill in the hash table on behalf of the halted queue head vCPU. It also adds a second synchronization point in __pv_queue_spin_unlock() so as to do the pv_kick() only if it is really necessary. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 10 ++-- kernel/locking/qspinlock_paravirt.h | 76 +- 2 files changed, 61 insertions(+), 25 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index c009120..0a3a109 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -239,8 +239,8 @@ static __always_inline void set_locked(struct qspinlock *lock) static __always_inline void __pv_init_node(struct mcs_spinlock *node) { } static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { } -static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { } - +static __always_inline void __pv_scan_next(struct qspinlock *lock, + struct mcs_spinlock *node) { } static __always_inline void __pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node) { } @@ -248,7 +248,7 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock, #define pv_init_node __pv_init_node #define pv_wait_node __pv_wait_node -#define pv_kick_node __pv_kick_node +#define pv_scan_next __pv_scan_next #define pv_wait_head __pv_wait_head #ifdef CONFIG_PARAVIRT_SPINLOCKS @@ -440,7 +440,7 @@ queue: cpu_relax(); arch_mcs_spin_unlock_contended(next-locked); - pv_kick_node(next); + pv_scan_next(lock, next); release: /* @@ -461,7 +461,7 @@ EXPORT_SYMBOL(queue_spin_lock_slowpath); #undef pv_init_node #undef pv_wait_node -#undef pv_kick_node +#undef pv_scan_next #undef pv_wait_head #undef queue_spin_lock_slowpath diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h index 084e5c1..9b4ac3d 100644 --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -21,9 +21,16 @@ #define _Q_SLOW_VAL(3U _Q_LOCKED_OFFSET) +/* + * The vcpu_hashed is a special state that is set by the new lock holder on + * the new queue head to indicate that _Q_SLOW_VAL is set and hash entry + * filled. With this state, the queue head CPU will always be kicked even + * if it is not halted to avoid potential racing condition. + */ enum vcpu_state { vcpu_running = 0, vcpu_halted, + vcpu_hashed }; struct pv_node { @@ -162,13 +169,13 @@ __visible void __pv_queue_spin_unlock(struct qspinlock *lock) * The queue head has been halted. Need to locate it and wake it up. */ node = pv_hash_find(lock); - smp_store_release(l-locked, 0); + (void)xchg(l-locked, 0); /* * At this point the memory pointed at by lock can be freed/reused, * however we can still use the PV node to kick the CPU. */ - if (READ_ONCE(node-state) == vcpu_halted) + if (READ_ONCE(node-state) != vcpu_running) pv_kick(node-cpu); } /* @@ -195,7 +202,8 @@ static void pv_init_node(struct mcs_spinlock *node) /* * Wait for node-locked to become true, halt the vcpu after a short spin. - * pv_kick_node() is used to wake the vcpu again. + * pv_scan_next() is used to set _Q_SLOW_VAL and fill in hash table on its + * behalf. */ static void pv_wait_node(struct mcs_spinlock *node) { @@ -214,9 +222,9 @@ static void pv_wait_node(struct mcs_spinlock *node) * * [S] pn-state = vcpu_halted[S] next-locked = 1 * MB MB -* [L] pn-locked [RmW] pn-state = vcpu_running +* [L] pn-locked [RmW] pn-state = vcpu_hashed * -* Matches the xchg() from pv_kick_node(). +* Matches the cmpxchg() from pv_scan_next(). */ (void)xchg(pn-state, vcpu_halted); @@ -224,9 +232,9 @@ static void pv_wait_node(struct mcs_spinlock *node) pv_wait(pn-state, vcpu_halted); /* -* Reset the vCPU state to avoid unncessary CPU kicking +* Reset the state except when vcpu_hashed is set. */ - WRITE_ONCE(pn-state, vcpu_running); + (void
[Xen-devel] [PATCH v16 08/14] pvqspinlock: Implement simple paravirt support for the qspinlock
Provide a separate (second) version of the spin_lock_slowpath for paravirt along with a special unlock path. The second slowpath is generated by adding a few pv hooks to the normal slowpath, but where those will compile away for the native case, they expand into special wait/wake code for the pv version. The actual MCS queue can use extra storage in the mcs_nodes[] array to keep track of state and therefore uses directed wakeups. The head contender has no such storage directly visible to the unlocker. So the unlocker searches a hash table with open addressing using a simple binary Galois linear feedback shift register. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 68 +++- kernel/locking/qspinlock_paravirt.h | 324 +++ 2 files changed, 391 insertions(+), 1 deletions(-) create mode 100644 kernel/locking/qspinlock_paravirt.h diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index fc2e5ab..c009120 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -18,6 +18,9 @@ * Authors: Waiman Long waiman.l...@hp.com * Peter Zijlstra pet...@infradead.org */ + +#ifndef _GEN_PV_LOCK_SLOWPATH + #include linux/smp.h #include linux/bug.h #include linux/cpumask.h @@ -65,13 +68,21 @@ #include mcs_spinlock.h +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define MAX_NODES 8 +#else +#define MAX_NODES 4 +#endif + /* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. * * Exactly fits one 64-byte cacheline on a 64-bit architecture. + * + * PV doubles the storage and uses the second cacheline for PV state. */ -static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]); +static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]); /* * We must be able to distinguish between no-tail and the tail at 0:0, @@ -220,6 +231,32 @@ static __always_inline void set_locked(struct qspinlock *lock) WRITE_ONCE(l-locked, _Q_LOCKED_VAL); } + +/* + * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for + * all the PV callbacks. + */ + +static __always_inline void __pv_init_node(struct mcs_spinlock *node) { } +static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { } +static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { } + +static __always_inline void __pv_wait_head(struct qspinlock *lock, + struct mcs_spinlock *node) { } + +#define pv_enabled() false + +#define pv_init_node __pv_init_node +#define pv_wait_node __pv_wait_node +#define pv_kick_node __pv_kick_node +#define pv_wait_head __pv_wait_head + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define queue_spin_lock_slowpath native_queue_spin_lock_slowpath +#endif + +#endif /* _GEN_PV_LOCK_SLOWPATH */ + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure @@ -249,6 +286,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + if (pv_enabled()) + goto queue; + if (virt_queue_spin_lock(lock)) return; @@ -325,6 +365,7 @@ queue: node += idx; node-locked = 0; node-next = NULL; + pv_init_node(node); /* * We touched a (possibly) cold cacheline in the per-cpu queue node; @@ -350,6 +391,7 @@ queue: prev = decode_tail(old); WRITE_ONCE(prev-next, node); + pv_wait_node(node); arch_mcs_spin_lock_contended(node-locked); } @@ -365,6 +407,7 @@ queue: * does not imply a full barrier. * */ + pv_wait_head(lock, node); while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_PENDING_MASK) cpu_relax(); @@ -397,6 +440,7 @@ queue: cpu_relax(); arch_mcs_spin_unlock_contended(next-locked); + pv_kick_node(next); release: /* @@ -405,3 +449,25 @@ release: this_cpu_dec(mcs_nodes[0].count); } EXPORT_SYMBOL(queue_spin_lock_slowpath); + +/* + * Generate the paravirt code for queue_spin_unlock_slowpath(). + */ +#if !defined(_GEN_PV_LOCK_SLOWPATH) defined(CONFIG_PARAVIRT_SPINLOCKS) +#define _GEN_PV_LOCK_SLOWPATH + +#undef pv_enabled +#define pv_enabled() true + +#undef pv_init_node +#undef pv_wait_node +#undef pv_kick_node +#undef pv_wait_head + +#undef queue_spin_lock_slowpath +#define queue_spin_lock_slowpath __pv_queue_spin_lock_slowpath + +#include qspinlock_paravirt.h +#include qspinlock.c + +#endif diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h new file mode 100644 index 000..084e5c1 --- /dev/null +++ b/kernel/locking/qspinlock_paravirt.h @@ -0,0 +1,324
[Xen-devel] [PATCH v16 05/14] qspinlock: Optimize for smaller NR_CPUS
From: Peter Zijlstra (Intel) pet...@infradead.org When we allow for a max NR_CPUS 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. This optimization is needed to make the qspinlock achieve performance parity with ticket spinlock at light load. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 13 ++ kernel/locking/qspinlock.c| 69 - 2 files changed, 81 insertions(+), 1 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index ef36613..f01b55d 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -35,6 +35,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS = 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -47,7 +55,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS (1U 14) +#define _Q_PENDING_BITS8 +#else #define _Q_PENDING_BITS1 +#endif #define _Q_PENDING_MASK_Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -58,6 +70,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 11f6ad9..bcc99e6 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -24,6 +24,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/mutex.h +#include asm/byteorder.h #include asm/qspinlock.h /* @@ -56,6 +57,10 @@ * node; whereby avoiding the need to carry a node from lock to unlock, and * preserving existing lock API. This also makes the unlock code simpler and * faster. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. + * */ #include mcs_spinlock.h @@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * + * *,1,0 - *,0,1 + * + * Lock stealing is not allowed if this function is used. + */ +static __always_inline void clear_pending_set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + WRITE_ONCE(l-locked_pending, _Q_LOCKED_VAL); +} + +/* + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + struct __qspinlock *l = (void *)lock; + + return (u32)xchg(l-tail, tail _Q_TAIL_OFFSET) _Q_TAIL_OFFSET; +} + +#else /* _Q_PENDING_BITS == 8 */ + /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -131,6 +192,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) } return old; } +#endif /* _Q_PENDING_BITS == 8
[Xen-devel] [PATCH v16 11/14] pvqspinlock, x86: Enable PV qspinlock for Xen
From: David Vrabel david.vra...@citrix.com This patch adds the necessary Xen specific code to allow Xen to support the CPU halting and kicking operations needed by the queue spinlock PV code. Signed-off-by: David Vrabel david.vra...@citrix.com Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/xen/spinlock.c | 64 --- kernel/Kconfig.locks|2 +- 2 files changed, 61 insertions(+), 5 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 956374c..3215ffc 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -17,6 +17,56 @@ #include xen-ops.h #include debugfs.h +static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; +static DEFINE_PER_CPU(char *, irq_name); +static bool xen_pvspin = true; + +#ifdef CONFIG_QUEUE_SPINLOCK + +#include asm/qspinlock.h + +static void xen_qlock_kick(int cpu) +{ + xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR); +} + +/* + * Halt the current CPU release it back to the host + */ +static void xen_qlock_wait(u8 *byte, u8 val) +{ + int irq = __this_cpu_read(lock_kicker_irq); + + /* If kicker interrupts not initialized yet, just spin */ + if (irq == -1) + return; + + /* clear pending */ + xen_clear_irq_pending(irq); + barrier(); + + /* +* We check the byte value after clearing pending IRQ to make sure +* that we won't miss a wakeup event because of the clearing. +* +* The sync_clear_bit() call in xen_clear_irq_pending() is atomic. +* So it is effectively a memory barrier for x86. +*/ + if (READ_ONCE(*byte) != val) + return; + + /* +* If an interrupt happens here, it will leave the wakeup irq +* pending, which will cause xen_poll_irq() to return +* immediately. +*/ + + /* Block until irq becomes pending (or perhaps a spurious wakeup) */ + xen_poll_irq(irq); +} + +#else /* CONFIG_QUEUE_SPINLOCK */ + enum xen_contention_stat { TAKEN_SLOW, TAKEN_SLOW_PICKUP, @@ -100,12 +150,9 @@ struct xen_lock_waiting { __ticket_t want; }; -static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; -static DEFINE_PER_CPU(char *, irq_name); static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting); static cpumask_t waiting_cpus; -static bool xen_pvspin = true; __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want) { int irq = __this_cpu_read(lock_kicker_irq); @@ -217,6 +264,7 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next) } } } +#endif /* CONFIG_QUEUE_SPINLOCK */ static irqreturn_t dummy_handler(int irq, void *dev_id) { @@ -280,8 +328,16 @@ void __init xen_init_spinlocks(void) return; } printk(KERN_DEBUG xen: PV spinlocks enabled\n); +#ifdef CONFIG_QUEUE_SPINLOCK + __pv_init_lock_hash(); + pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath; + pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock); + pv_lock_ops.wait = xen_qlock_wait; + pv_lock_ops.kick = xen_qlock_kick; +#else pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning); pv_lock_ops.unlock_kick = xen_unlock_kick; +#endif } /* @@ -310,7 +366,7 @@ static __init int xen_parse_nopvspin(char *arg) } early_param(xen_nopvspin, xen_parse_nopvspin); -#ifdef CONFIG_XEN_DEBUG_FS +#if defined(CONFIG_XEN_DEBUG_FS) !defined(CONFIG_QUEUE_SPINLOCK) static struct dentry *d_spin_debug; diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index 537b13e..0b42933 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK config QUEUE_SPINLOCK def_bool y if ARCH_USE_QUEUE_SPINLOCK - depends on SMP (!PARAVIRT_SPINLOCKS || !XEN) + depends on SMP config ARCH_USE_QUEUE_RWLOCK bool -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v16 07/14] qspinlock: Revert to test-and-set on hypervisors
From: Peter Zijlstra (Intel) pet...@infradead.org When we detect a hypervisor (!paravirt, see qspinlock paravirt support patches), revert to a simple test-and-set lock to avoid the horrors of queue preemption. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/qspinlock.h | 14 ++ include/asm-generic/qspinlock.h |7 +++ kernel/locking/qspinlock.c |3 +++ 3 files changed, 24 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index 222995b..64c925e 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -1,6 +1,7 @@ #ifndef _ASM_X86_QSPINLOCK_H #define _ASM_X86_QSPINLOCK_H +#include asm/cpufeature.h #include asm-generic/qspinlock_types.h #definequeue_spin_unlock queue_spin_unlock @@ -15,6 +16,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock) smp_store_release((u8 *)lock, 0); } +#define virt_queue_spin_lock virt_queue_spin_lock + +static inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) + return false; + + while (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) != 0) + cpu_relax(); + + return true; +} + #include asm-generic/qspinlock.h #endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index 315d6dc..bcbbc5e 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -111,6 +111,13 @@ static inline void queue_spin_unlock_wait(struct qspinlock *lock) cpu_relax(); } +#ifndef virt_queue_spin_lock +static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + return false; +} +#endif + /* * Initializier */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 99503ef..fc2e5ab 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -249,6 +249,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + if (virt_queue_spin_lock(lock)) + return; + /* * wait for in-progress pending-locked hand-overs * -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v16 02/14] qspinlock, x86: Enable x86-64 to use queue spinlock
This patch makes the necessary changes at the x86 architecture specific layer to enable the use of queue spinlock for x86-64. As x86-32 machines are typically not multi-socket. The benefit of queue spinlock may not be apparent. So queue spinlock is not enabled. Currently, there is some incompatibilities between the para-virtualized spinlock code (which hard-codes the use of ticket spinlock) and the queue spinlock. Therefore, the use of queue spinlock is disabled when the para-virtualized spinlock is enabled. The arch/x86/include/asm/qspinlock.h header file includes some x86 specific optimization which will make the queue spinlock code perform better than the generic implementation. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- arch/x86/Kconfig |1 + arch/x86/include/asm/qspinlock.h | 20 arch/x86/include/asm/spinlock.h |5 + arch/x86/include/asm/spinlock_types.h |4 4 files changed, 30 insertions(+), 0 deletions(-) create mode 100644 arch/x86/include/asm/qspinlock.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b7d31ca..49fecb1 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -125,6 +125,7 @@ config X86 select MODULES_USE_ELF_RELA if X86_64 select CLONE_BACKWARDS if X86_32 select ARCH_USE_BUILTIN_BSWAP + select ARCH_USE_QUEUE_SPINLOCK select ARCH_USE_QUEUE_RWLOCK select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION select OLD_SIGACTION if X86_32 diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h new file mode 100644 index 000..222995b --- /dev/null +++ b/arch/x86/include/asm/qspinlock.h @@ -0,0 +1,20 @@ +#ifndef _ASM_X86_QSPINLOCK_H +#define _ASM_X86_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +#definequeue_spin_unlock queue_spin_unlock +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + * + * A smp_store_release() on the least-significant byte. + */ +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + smp_store_release((u8 *)lock, 0); +} + +#include asm-generic/qspinlock.h + +#endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index cf87de3..a9c01fd 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -42,6 +42,10 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm/qspinlock.h +#else + #ifdef CONFIG_PARAVIRT_SPINLOCKS static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) @@ -196,6 +200,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) cpu_relax(); } } +#endif /* CONFIG_QUEUE_SPINLOCK */ /* * Read-write spinlocks, allowing multiple readers diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 5f9d757..5d654a1 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -23,6 +23,9 @@ typedef u32 __ticketpair_t; #define TICKET_SHIFT (sizeof(__ticket_t) * 8) +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm-generic/qspinlock_types.h +#else typedef struct arch_spinlock { union { __ticketpair_t head_tail; @@ -33,6 +36,7 @@ typedef struct arch_spinlock { } arch_spinlock_t; #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } +#endif /* CONFIG_QUEUE_SPINLOCK */ #include asm-generic/qrwlock_types.h -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
On 04/13/2015 11:09 AM, Peter Zijlstra wrote: On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote: +__visible void __pv_queue_spin_unlock(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + struct pv_node *node; + + if (likely(cmpxchg(l-locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL)) + return; + + /* +* The queue head has been halted. Need to locate it and wake it up. +*/ + node = pv_hash_find(lock); + smp_store_release(l-locked, 0); Ah yes, clever that. + /* +* At this point the memory pointed at by lock can be freed/reused, +* however we can still use the PV node to kick the CPU. +*/ + if (READ_ONCE(node-state) == vcpu_halted) + pv_kick(node-cpu); +} +PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock); However I feel the PV_CALLEE_SAVE_REGS_THUNK thing belongs in the x86 code. That is why I originally put my version of the qspinlock_paravirt.h header file under arch/x86/include/asm. Maybe we should move it back there. Putting the thunk in arch/x86/kernel/kvm.c didn't work when you consider that the Xen code also need that. Well the function is 'generic' and belong here I think. Its just the PV_CALLEE_SAVE_REGS_THUNK thing that arch specific. Should have live in arch/x86/kernel/paravirt-spinlocks.c instead? Another alternative is to put the PV_CALLEE_SAVE_REGS_THUNK into a separate asm/qspinlock_paravirt.h file and included in the generic qspinlock_paravirt.h file. By putting the thunk with the __pv_queue_spin_unlock together in the same file, we can make the thunk and the unlock fast path (_Q_SLOW_VAL not set) go into two adjacent instruction cachelines. I think that may help a bit in term of performance. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
On 04/13/2015 10:47 AM, Peter Zijlstra wrote: On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote: +void __init __pv_init_lock_hash(void) +{ + int pv_hash_size = 4 * num_possible_cpus(); + + if (pv_hash_size (1U LFSR_MIN_BITS)) + pv_hash_size = (1U LFSR_MIN_BITS); + /* +* Allocate space from bootmem which should be page-size aligned +* and hence cacheline aligned. +*/ + pv_lock_hash = alloc_large_system_hash(PV qspinlock, + sizeof(struct pv_hash_bucket), + pv_hash_size, 0, HASH_EARLY, + pv_lock_hash_bits, NULL, + pv_hash_size, pv_hash_size); pv_taps = lfsr_taps(pv_lock_hash_bits); I don't understand what you meant here. Let me explain (even though I propose taking all the LFSR stuff out). pv_lock_hash_bit is a runtime variable, therefore it cannot compile time evaluate the forest of if statements required to compute the taps value. Therefore its best to compute the taps _once_, and this seems like a good place to do so. OK, I got it. That make sense. + goto done; + } + } + + hash = lfsr(hash, pv_lock_hash_bits, 0); Since pv_lock_hash_bits is a variable, you end up running through that massive if() forest to find the corresponding tap every single time. It cannot compile-time optimize it. The minimum bits size is now 8. So unless the system has more than 64 vCPUs, it will get the right value in the first if statement. Still, no reason to not pre-compute the taps value, its simple enough. Still, we need to keep the hash_bits value as it will needed by the hashing function. I have taken out the lfsr code and use linear probing in the updated qspinlock patch that I am working on. However, we can always add that back in as an additional patch. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
On 04/13/2015 11:08 AM, Peter Zijlstra wrote: On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote: +static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node) +{ + struct __qspinlock *l = (void *)lock; + struct qspinlock **lp = NULL; + struct pv_node *pn = (struct pv_node *)node; + int slow_set = false; + int loop; + + for (;;) { + for (loop = SPIN_THRESHOLD; loop; loop--) { + if (!READ_ONCE(l-locked)) + return; + + cpu_relax(); + } + + WRITE_ONCE(pn-state, vcpu_halted); + if (!lp) + lp = pv_hash(lock, pn); + /* +* lp must be set before setting _Q_SLOW_VAL +* +* [S] lp = lock[RmW] l = l-locked = 0 +* MB MB +* [S] l-locked = _Q_SLOW_VAL [L] lp +* +* Matches the cmpxchg() in pv_queue_spin_unlock(). +*/ + if (!slow_set + !cmpxchg(l-locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) { + /* +* The lock is free and _Q_SLOW_VAL has never been +* set. Need to clear the hash bucket before getting +* the lock. +*/ + WRITE_ONCE(*lp, NULL); + return; + } else if (slow_set !READ_ONCE(l-locked)) + return; + slow_set = true; I'm somewhat puzzled by the slow_set thing; what is wrong with the thing I had, namely: if (!lp) { lp = pv_hash(lock, pn); /* * comment */ lv = cmpxchg(l-locked, _Q_LOCKED_VAL, _Q_SLOW_VAL); if (lv != _Q_LOCKED_VAL) { /* we're woken, unhash and return */ WRITE_ONCE(*lp, NULL); return; } } + + pv_wait(l-locked, _Q_SLOW_VAL); If we get a spurious wakeup (due to device interrupts or random kick) we'll loop around but -locked will remain _Q_SLOW_VAL. The purpose of the slow_set flag is not about the lock value. It is to make sure that pv_hash_find() will always find a match. Consider the following scenario: cpu1cpu2cpu3 pv_wait spurious wakeup loop l-locked read _Q_SLOW_VAL pv_hash_find() unlock pv_hash()- same entry cmpxchg(l-locked) clear hash (?) Here, the entry for cpu3 will be removed leading to panic when pv_hash_find() can find the entry. So the hash entry can only be removed if the other cpu has no chance to see _Q_SLOW_VAL. Still confused. Afaict that cannot happen with my code. We only do the cmpxchg(, _Q_SLOW_VAL) _once_. Only on the first time around that loop will we hash the lock and set the slow flag. And cpu3 cannot come in on the same entry because we've not yet released the lock when we find and unhash. Maybe I am not clear in my illustration. More than one locks can be hashed to the same value. So cpu3 is accessing a different lock which has the same hashed value as the lock used by cpu1 and cpu2. Anyway, I remove the slow_set flag by unrolling the retry loop so that after pv_wait(), it goes into the 2nd loop instead of going back to the top. As a result, cmpxchg and pv_hash can only be called once. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
On 04/09/2015 02:23 PM, Peter Zijlstra wrote: On Thu, Apr 09, 2015 at 08:13:27PM +0200, Peter Zijlstra wrote: On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote: +#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket)) +static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node) +{ + unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits); + struct pv_hash_bucket *hb, *end; + + if (!hash) + hash = 1; + + init_hash = hash; + hb =pv_lock_hash[hash_align(hash)]; + for (;;) { + for (end = hb + PV_HB_PER_LINE; hb end; hb++) { + if (!cmpxchg(hb-lock, NULL, lock)) { + WRITE_ONCE(hb-node, node); + /* +* We haven't set the _Q_SLOW_VAL yet. So +* the order of writing doesn't matter. +*/ + smp_wmb(); /* matches rmb from pv_hash_find */ + goto done; + } + } + + hash = lfsr(hash, pv_lock_hash_bits, 0); Since pv_lock_hash_bits is a variable, you end up running through that massive if() forest to find the corresponding tap every single time. It cannot compile-time optimize it. Hence: hash = lfsr(hash, pv_taps); (I don't get the bits argument to the lfsr). In any case, like I said before, I think we should try a linear probe sequence first, the lfsr was over engineering from my side. + hb =pv_lock_hash[hash_align(hash)]; So one thing this does -- and one of the reasons I figured I should ditch the LFSR instead of fixing it -- is that you end up scanning each bucket HB_PER_LINE times. I am aware of that when I was trying to add the hash table debug code, but I want to get the code out for review and so hasn't made any change yet. I have just done testing by adding some debug code to check the hashing efficiency. With the kernel build workload, with over 1M calls to pv_hash(), all of them get an empty entry on the first try. Maybe the minimum hash table size of 256 helps. The 'fix' would be to LFSR on cachelines instead of HBs but then you're stuck with the 0-th cacheline. This should not be a big problem. I just need to add a check at the end of the for loop that if hash is 0, change it to a certain non-0 value instead of calling lfsr(). As for ditching the lfsr idea, I am fine with that. So there will be 4 entries (1 cacheline) for each hash value. If all the entries are full, we proceed to the next cacheline. Right? Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock
For a virtual guest with the qspinlock patch, a simple unfair byte lock will be used if PV spinlock is not configured in or the hypervisor isn't either KVM or Xen. The byte lock works fine with small guest of just a few vCPUs. On a much larger guest, however, byte lock can have serious performance problem. There are 2 major drawbacks for the unfair byte lock: 1) The constant reading and occasionally writing to the lock word can put a lot of cacheline contention traffic on the affected cacheline. 2) Lock starvation is a real possibility especially if the number of vCPUs is large. This patch introduces a queue based unfair lock where all the vCPUs on the queue can opportunistically steal the lock, but the frequency of doing so decreases exponentially the further it is away from the queue head. This can greatly reduce cacheline contention problem as well as encouraging a more FIFO like order of getting the lock. This patch has no impact on native qspinlock performance at all. To measure the performance benefit of such a scheme, a linux kernel build workload (make -j n) was run on a virtual KVM guest with different configurations on a 8-socket with 4 active cores/socket Westmere-EX server (8x4). n is the number of vCPUs in the guest. The test configurations were: 1) 32 NUMA-pinned vCPUs (8x4) 2) 32 vCPUs (no pinning or NUMA) 3) 60 vCPUs (overcommitted) The kernel build times for different 4.0-rc6 based kernels were: KernelVM1 VM2 VM3 ----- --- --- PV ticket lock 3m49.7s 11m45.6s15m59.3s PV qspinlock 3m50.4s5m51.6s15m33.6s unfair byte lock 3m50.6s 61m01.1s 122m07.7s unfair qspinlock 3m48.3s5m03.1s 4m31.1s For VM1, all the locks give essentially the same performance. In both VM2 and VM3, the performance of byte lock was terrible whereas the unfair qspinlock was the fastest of all. For this particular test, the unfair qspinlock can be more than 10X faster than a simple byte lock. VM2 also illustrated the performance benefit of qspinlock versus ticket spinlock which got reduced in VM3 due to the overhead of constant vCPUs halting and kicking. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/qspinlock.h | 15 +-- kernel/locking/qspinlock.c| 94 +-- kernel/locking/qspinlock_unfair.h | 231 + 3 files changed, 319 insertions(+), 21 deletions(-) create mode 100644 kernel/locking/qspinlock_unfair.h diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index c8290db..8113685 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -39,17 +39,16 @@ static inline void queue_spin_unlock(struct qspinlock *lock) } #endif -#define virt_queue_spin_lock virt_queue_spin_lock +#ifndef static_cpu_has_hypervisor +#define static_cpu_has_hypervisor static_cpu_has(X86_FEATURE_HYPERVISOR) +#endif -static inline bool virt_queue_spin_lock(struct qspinlock *lock) +#define queue_spin_trylock_unfair queue_spin_trylock_unfair +static inline bool queue_spin_trylock_unfair(struct qspinlock *lock) { - if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) - return false; - - while (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) != 0) - cpu_relax(); + u8 *l = (u8 *)lock; - return true; + return !READ_ONCE(*l) (xchg(l, _Q_LOCKED_VAL) == 0); } #include asm-generic/qspinlock.h diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index b9ba83b..5fda6d5 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -19,7 +19,11 @@ * Peter Zijlstra pet...@infradead.org */ -#ifndef _GEN_PV_LOCK_SLOWPATH +#if defined(_GEN_PV_LOCK_SLOWPATH) || defined(_GEN_UNFAIR_LOCK_SLOWPATH) +#define_GEN_LOCK_SLOWPATH +#endif + +#ifndef _GEN_LOCK_SLOWPATH #include linux/smp.h #include linux/bug.h @@ -68,12 +72,6 @@ #include mcs_spinlock.h -#ifdef CONFIG_PARAVIRT_SPINLOCKS -#define MAX_NODES 8 -#else -#define MAX_NODES 4 -#endif - /* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. @@ -81,7 +79,9 @@ * Exactly fits one 64-byte cacheline on a 64-bit architecture. * * PV doubles the storage and uses the second cacheline for PV state. + * Unfair lock (mutually exclusive to PV) also uses the second cacheline. */ +#define MAX_NODES 8 static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]); /* @@ -234,7 +234,7 @@ static __always_inline void set_locked(struct qspinlock *lock) /* * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for - * all the PV callbacks. + * all the PV and unfair callbacks. */ static __always_inline void __pv_init_node(struct mcs_spinlock *node) { } @@ -244,19 +244,36 @@ static __always_inline
Re: [Xen-devel] [PATCH v15 12/15] pvqspinlock, x86: Enable PV qspinlock for Xen
On 04/08/2015 08:01 AM, David Vrabel wrote: On 07/04/15 03:55, Waiman Long wrote: This patch adds the necessary Xen specific code to allow Xen to support the CPU halting and kicking operations needed by the queue spinlock PV code. This basically looks the same as the version I wrote, except I think you broke it. +static void xen_qlock_wait(u8 *byte, u8 val) +{ + int irq = __this_cpu_read(lock_kicker_irq); + + /* If kicker interrupts not initialized yet, just spin */ + if (irq == -1) + return; + + /* clear pending */ + xen_clear_irq_pending(irq); + + /* +* We check the byte value after clearing pending IRQ to make sure +* that we won't miss a wakeup event because of the clearing. My version had a barrier() here to ensure this. The documentation of READ_ONCE() suggests that it is not sufficient to meet this requirement (and a READ_ONCE() here is not required anyway). I have the misconception that a READ_ONCE/WRITE_ONCE() implies a compiler barrier. You are right that it may not be the case. I will add back the missing barrier when I update the patch and add you as the author of this patch if you don't mind. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 01/15] qspinlock: A simple generic 4-byte queue spinlock
This patch introduces a new generic queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair as the ticket spinlock. It has about the same speed in single-thread and it can be much faster in high contention situations especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention where the average queue depth is around 1-3 will this queue spinlock be potentially a bit slower due to the higher slowpath overhead. This queue spinlock is especially suit to NUMA machines with a large number of cores as the chance of spinlock contention is much higher in those machines. The cost of contention is also higher because of slower inter-node memory traffic. Due to the fact that spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending in one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and used them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit size (including CPU number and queue node index) leaving one byte for the lock. Please note that the queue node is only needed when waiting for the lock. Once the lock is acquired, the queue node can be released to be used later. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- include/asm-generic/qspinlock.h | 132 + include/asm-generic/qspinlock_types.h | 58 + kernel/Kconfig.locks |7 + kernel/locking/Makefile |1 + kernel/locking/mcs_spinlock.h |1 + kernel/locking/qspinlock.c| 209 + 6 files changed, 408 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 kernel/locking/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 000..315d6dc --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,132 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long waiman.l...@hp.com + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return atomic_read(lock-val); +} + +/** + * queue_spin_value_unlocked - is the spinlock structure unlocked? + * @lock: queue spinlock structure + * Return: 1 if it is unlocked, 0 otherwise + * + * N.B. Whenever there are tasks waiting for the lock, it is considered + * locked wrt the lockref code to avoid lock stealing by the lockref + * code and change things underneath the lock. This also allows some + * optimizations to be applied without conflict with lockref. + */ +static __always_inline int queue_spin_value_unlocked(struct qspinlock lock) +{ + return !atomic_read(lock.val); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return atomic_read(lock-val) ~_Q_LOCKED_MASK; +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!atomic_read(lock-val) + (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) == 0)) + return 1; + return 0; +} + +extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + u32
[Xen-devel] [PATCH v15 11/15] pvqspinlock, x86: Enable PV qspinlock for KVM
This patch adds the necessary KVM specific code to allow KVM to support the CPU halting and kicking operations needed by the queue spinlock PV code. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/kernel/kvm.c | 43 +++ kernel/Kconfig.locks |2 +- 2 files changed, 44 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index e354cc6..4bb42c0 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -584,6 +584,39 @@ static void kvm_kick_cpu(int cpu) kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid); } + +#ifdef CONFIG_QUEUE_SPINLOCK + +#include asm/qspinlock.h + +static void kvm_wait(u8 *ptr, u8 val) +{ + unsigned long flags; + + if (in_nmi()) + return; + + local_irq_save(flags); + + if (READ_ONCE(*ptr) != val) + goto out; + + /* +* halt until it's our turn and kicked. Note that we do safe halt +* for irq enabled case to avoid hang when lock info is overwritten +* in irq spinlock slowpath and no spurious interrupt occur to save us. +*/ + if (arch_irqs_disabled_flags(flags)) + halt(); + else + safe_halt(); + +out: + local_irq_restore(flags); +} + +#else /* !CONFIG_QUEUE_SPINLOCK */ + enum kvm_contention_stat { TAKEN_SLOW, TAKEN_SLOW_PICKUP, @@ -817,6 +850,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket) } } +#endif /* !CONFIG_QUEUE_SPINLOCK */ + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ @@ -828,8 +863,16 @@ void __init kvm_spinlock_init(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return; +#ifdef CONFIG_QUEUE_SPINLOCK + __pv_init_lock_hash(); + pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath; + pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock); + pv_lock_ops.wait = kvm_wait; + pv_lock_ops.kick = kvm_kick_cpu; +#else /* !CONFIG_QUEUE_SPINLOCK */ pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning); pv_lock_ops.unlock_kick = kvm_unlock_kick; +#endif } static __init int kvm_spinlock_init_jump(void) diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index c6a8f7c..537b13e 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK config QUEUE_SPINLOCK def_bool y if ARCH_USE_QUEUE_SPINLOCK - depends on SMP !PARAVIRT_SPINLOCKS + depends on SMP (!PARAVIRT_SPINLOCKS || !XEN) config ARCH_USE_QUEUE_RWLOCK bool -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 04/15] qspinlock: Extract out code snippets for the next patch
This is a preparatory patch that extracts out the following 2 code snippets to prepare for the next performance optimization patch. 1) the logic for the exchange of new and previous tail code words into a new xchg_tail() function. 2) the logic for clearing the pending bit and setting the locked bit into a new clear_pending_set_locked() function. This patch also simplifies the trylock operation before queuing by calling queue_spin_trylock() directly. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- include/asm-generic/qspinlock_types.h |2 + kernel/locking/qspinlock.c| 79 - 2 files changed, 50 insertions(+), 31 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index 9c3f5c2..ef36613 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -58,6 +58,8 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) + #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) #define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 0351f78..11f6ad9 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -97,6 +97,42 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) /** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * + * *,1,0 - *,0,1 + */ +static __always_inline void clear_pending_set_locked(struct qspinlock *lock) +{ + atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, lock-val); +} + +/** + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + u32 old, new, val = atomic_read(lock-val); + + for (;;) { + new = (val _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + return old; +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word @@ -178,15 +214,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) * * *,1,0 - *,0,1 */ - for (;;) { - new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + clear_pending_set_locked(lock); return; /* @@ -203,37 +231,26 @@ queue: node-next = NULL; /* -* We have already touched the queueing cacheline; don't bother with -* pending stuff. -* -* trylock || xchg(lock, node) -* -* 0,0,0 - 0,0,1 ; no tail, not locked - no tail, locked. -* p,y,x - n,y,x ; tail was p - tail is n; preserving locked. +* We touched a (possibly) cold cacheline in the per-cpu queue node; +* attempt the trylock once more in the hope someone let go while we +* weren't watching. */ - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + if (queue_spin_trylock(lock)) + goto release; /* -* we won the trylock; forget about queueing. +* We have already touched the queueing cacheline; don't bother with +* pending stuff. +* +* p,*,* - n,*,* */ - if (new == _Q_LOCKED_VAL) - goto release; + old = xchg_tail(lock, tail); /* * if there was a previous node; link it and wait until reaching the * head of the waitqueue. */ - if (old ~_Q_LOCKED_PENDING_MASK) { + if (old _Q_TAIL_MASK) { prev = decode_tail(old); WRITE_ONCE(prev-next, node); -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 07/15] qspinlock: Revert to test-and-set on hypervisors
From: Peter Zijlstra (Intel) pet...@infradead.org When we detect a hypervisor (!paravirt, see qspinlock paravirt support patches), revert to a simple test-and-set lock to avoid the horrors of queue preemption. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/include/asm/qspinlock.h | 14 ++ include/asm-generic/qspinlock.h |7 +++ kernel/locking/qspinlock.c |3 +++ 3 files changed, 24 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index 222995b..64c925e 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -1,6 +1,7 @@ #ifndef _ASM_X86_QSPINLOCK_H #define _ASM_X86_QSPINLOCK_H +#include asm/cpufeature.h #include asm-generic/qspinlock_types.h #definequeue_spin_unlock queue_spin_unlock @@ -15,6 +16,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock) smp_store_release((u8 *)lock, 0); } +#define virt_queue_spin_lock virt_queue_spin_lock + +static inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) + return false; + + while (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) != 0) + cpu_relax(); + + return true; +} + #include asm-generic/qspinlock.h #endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h index 315d6dc..bcbbc5e 100644 --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -111,6 +111,13 @@ static inline void queue_spin_unlock_wait(struct qspinlock *lock) cpu_relax(); } +#ifndef virt_queue_spin_lock +static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + return false; +} +#endif + /* * Initializier */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 99503ef..fc2e5ab 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -249,6 +249,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + if (virt_queue_spin_lock(lock)) + return; + /* * wait for in-progress pending-locked hand-overs * -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 03/15] qspinlock: Add pending bit
From: Peter Zijlstra (Intel) pet...@infradead.org Because the qspinlock needs to touch a second cacheline (the per-cpu mcs_nodes[]); add a pending bit and allow a single in-word spinner before we punt to the second cacheline. It is possible so observe the pending bit without the locked bit when the last owner has just released but the pending owner has not yet taken ownership. In this case we would normally queue -- because the pending bit is already taken. However, in this case the pending bit is guaranteed to be released 'soon', therefore wait for it and avoid queueing. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 12 +++- kernel/locking/qspinlock.c| 119 +++-- 2 files changed, 107 insertions(+), 24 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index c9348d8..9c3f5c2 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -36,8 +36,9 @@ typedef struct qspinlock { * Bitfields in the atomic value: * * 0- 7: locked byte - * 8- 9: tail index - * 10-31: tail cpu (+1) + * 8: pending + * 9-10: tail index + * 11-31: tail cpu (+1) */ #define_Q_SET_MASK(type) (((1U _Q_ ## type ## _BITS) - 1)\ _Q_ ## type ## _OFFSET) @@ -45,7 +46,11 @@ typedef struct qspinlock { #define _Q_LOCKED_BITS 8 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) -#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_BITS1 +#define _Q_PENDING_MASK_Q_SET_MASK(PENDING) + +#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) #define _Q_TAIL_IDX_BITS 2 #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX) @@ -54,5 +59,6 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) +#define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 3456819..0351f78 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) return per_cpu_ptr(mcs_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word * - * (queue tail, lock value) - * - * fast :slow : unlock - *: : - * uncontended (0,0) --:-- (0,1) :-- (*,0) - *: | ^./ : - *: v \ | : - * uncontended:(n,x) --+-- (n,0) | : - * queue: | ^--' | : - *: v | : - * contended :(*,x) --+-- (*,0) - (*,1) ---' : - * queue: ^--' : + * (queue tail, pending bit, lock value) * + * fast :slow :unlock + * : : + * uncontended (0,0,0) -:-- (0,0,1) --:-- (*,*,0) + * : | ^.--. / : + * : v \ \| : + * pending :(0,1,1) +-- (0,1,0) \ | : + * : | ^--' | | : + * : v | | : + * uncontended :(n,x,y) +-- (n,0,0) --' | : + * queue : | ^--' | : + * : v | : + * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -' : + * queue : ^--' : */ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) { @@ -121,6 +125,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + /* +* wait for in-progress pending-locked hand-overs +* +* 0,1,0 - 0,0,1 +*/ + if (val == _Q_PENDING_VAL) { + while ((val = atomic_read(lock-val)) == _Q_PENDING_VAL
[Xen-devel] [PATCH v15 05/15] qspinlock: Optimize for smaller NR_CPUS
From: Peter Zijlstra (Intel) pet...@infradead.org When we allow for a max NR_CPUS 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated compxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. This optimization is needed to make the qspinlock achieve performance parity with ticket spinlock at light load. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 13 ++ kernel/locking/qspinlock.c| 69 - 2 files changed, 81 insertions(+), 1 deletions(-) diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h index ef36613..f01b55d 100644 --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -35,6 +35,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS = 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -47,7 +55,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS (1U 14) +#define _Q_PENDING_BITS8 +#else #define _Q_PENDING_BITS1 +#endif #define _Q_PENDING_MASK_Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -58,6 +70,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 11f6ad9..bcc99e6 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -24,6 +24,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/mutex.h +#include asm/byteorder.h #include asm/qspinlock.h /* @@ -56,6 +57,10 @@ * node; whereby avoiding the need to carry a node from lock to unlock, and * preserving existing lock API. This also makes the unlock code simpler and * faster. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. + * */ #include mcs_spinlock.h @@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * + * *,1,0 - *,0,1 + * + * Lock stealing is not allowed if this function is used. + */ +static __always_inline void clear_pending_set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + WRITE_ONCE(l-locked_pending, _Q_LOCKED_VAL); +} + +/* + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + struct __qspinlock *l = (void *)lock; + + return (u32)xchg(l-tail, tail _Q_TAIL_OFFSET) _Q_TAIL_OFFSET; +} + +#else /* _Q_PENDING_BITS == 8 */ + /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -131,6 +192,7 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) } return old; } +#endif /* _Q_PENDING_BITS == 8
[Xen-devel] [PATCH v15 14/15] pvqspinlock: Improve slowpath performance by avoiding cmpxchg
In the pv_scan_next() function, the slow cmpxchg atomic operation is performed even if the other CPU is not even close to being halted. This extra cmpxchg can harm slowpath performance. This patch introduces the new mayhalt flag to indicate if the other spinning CPU is close to being halted or not. The current threshold for x86 is 2k cpu_relax() calls. If this flag is not set, the other spinning CPU will have at least 2k more cpu_relax() calls before it can enter the halt state. This should give enough time for the setting of the locked flag in struct mcs_spinlock to propagate to that CPU without using atomic op. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock_paravirt.h | 28 +--- 1 files changed, 25 insertions(+), 3 deletions(-) diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h index a210061..a9fe10d 100644 --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -16,7 +16,8 @@ * native_queue_spin_unlock(). */ -#define _Q_SLOW_VAL(3U _Q_LOCKED_OFFSET) +#define _Q_SLOW_VAL(3U _Q_LOCKED_OFFSET) +#define MAYHALT_THRESHOLD (SPIN_THRESHOLD 4) /* * The vcpu_hashed is a special state that is set by the new lock holder on @@ -36,6 +37,7 @@ struct pv_node { int cpu; u8 state; + u8 mayhalt; }; /* @@ -187,6 +189,7 @@ static void pv_init_node(struct mcs_spinlock *node) pn-cpu = smp_processor_id(); pn-state = vcpu_running; + pn-mayhalt = false; } /* @@ -203,17 +206,27 @@ static void pv_wait_node(struct mcs_spinlock *node) for (loop = SPIN_THRESHOLD; loop; loop--) { if (READ_ONCE(node-locked)) return; + if (loop == MAYHALT_THRESHOLD) + xchg(pn-mayhalt, true); cpu_relax(); } /* -* Order pn-state vs pn-locked thusly: +* Order pn-state/pn-mayhalt vs pn-locked thusly: * -* [S] pn-state = vcpu_halted[S] next-locked = 1 +* [S] pn-mayhalt = 1[S] next-locked = 1 +* MB, delay barrier() +* [S] pn-state = vcpu_halted[L] pn-mayhalt * MB MB * [L] pn-locked [RmW] pn-state = vcpu_hashed * * Matches the cmpxchg() from pv_scan_next(). +* +* As the new lock holder may quit (when pn-mayhalt is not +* set) without memory barrier, a sufficiently long delay is +* inserted between the setting of pn-mayhalt and pn-state +* to ensure that there is enough time for the new pn-locked +* value to be propagated here to be checked below. */ (void)xchg(pn-state, vcpu_halted); @@ -226,6 +239,7 @@ static void pv_wait_node(struct mcs_spinlock *node) * needs to move on to pv_wait_head(). */ (void)cmpxchg(pn-state, vcpu_halted, vcpu_running); + pn-mayhalt = false; } /* @@ -246,6 +260,14 @@ static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node) struct __qspinlock *l = (void *)lock; /* +* If mayhalt is not set, there is enough time for the just set value +* in pn-locked to be propagated to the other CPU before it is time +* to halt. +*/ + if (!READ_ONCE(pn-mayhalt)) + return; + + /* * Transition CPU state: halted = hashed * Quit if the transition failed. */ -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 00/15] qspinlock: a 4-byte queue spinlock with PV support
data/reaim.config -f data/workfile.compute -i 16 -e 256 Num Parent Child Child Jobs per Jobs/min/ Std_dev Std_dev JTI Forked Time SysTime UTime Minute Child Time Percent 1 2.15 0.361.192818.602818.600.00 0.00100 17 6.29 42.41 22.43 16378.38 963.43 0.27 4.4395 33 56.241611.77 47.04 3555.83107.75 1.58 2.8897 49 120.88 5576.87 63.54 2456.4950.13 1.75 1.4798 65 199.17 12328.42 81.27 1977.7130.43 2.42 1.2398 81 270.89 20943.99 103.99 1812.0322.37 3.85 1.4498 97 315.31 29211.63 121.30 1864.2619.22 4.48 1.4498 113 397.68 42766.15 144.48 1721.9415.24 7.04 1.8098 129 520.74 63442.27 162.03 1501.2111.64 10.241.9998 ... Patched with qspinlocks v14: Num Parent Child Child Jobs per Jobs/min/ Std_dev Std_dev JTI Forked Time SysTime UTime Minute Child Time Percent 1 1.98 0.361.213060.613060.610.00 0.00100 17 5.60 25.41 22.39 18396.43 1082.140.16 3.0097 33 12.67153.64 49.17 15783.74 478.30 0.50 4.1695 49 21.15365.55 76.61 14039.72 286.52 0.83 4.1995 65 27.57556.87 105.96 14287.27 219.80 0.75 2.8097 81 33.07712.92 127.85 14843.06 183.25 1.16 3.6996 97 42.811009.02 161.87 13730.90 141.56 1.75 4.2795 113 49.131166.04 185.10 13938.12 123.35 2.11 4.5395 129 56.621262.07 231.92 13806.78 107.03 2.67 4.9995 145 66.411420.29 250.24 13231.44 91.25 2.95 4.7095 179 90.152237.78 325.18 12032.61 67.22 4.23 4.9495 193 85.411865.81 326.64 13693.71 70.95 4.29 5.3294 220 93.582298.77 376.46 14246.63 64.76 4.97 5.6594 Max Jobs per Minute 18396.43 The compute workload is mostly userspace code with some synchronization between working threads (HPC type workload). With large user count, the patched kernel has close to 8X the performance of an unpatched kernel. Peter Zijlstra (Intel) (4): qspinlock: Add pending bit qspinlock: Optimize for smaller NR_CPUS qspinlock: Revert to test-and-set on hypervisors pvqspinlock: Implement the paravirt qspinlock for x86 Waiman Long (11): qspinlock: A simple generic 4-byte queue spinlock qspinlock, x86: Enable x86-64 to use queue spinlock qspinlock: Extract out code snippets for the next patch qspinlock: Use a simple write to grab the lock lfsr: a simple binary Galois linear feedback shift register pvqspinlock: Implement simple paravirt support for the qspinlock pvqspinlock, x86: Enable PV qspinlock for KVM pvqspinlock, x86: Enable PV qspinlock for Xen pvqspinlock: Only kick CPU at unlock time pvqspinlock: Improve slowpath performance by avoiding cmpxchg pvqspinlock: Add debug code to check for PV lock hash sanity arch/x86/Kconfig |3 +- arch/x86/include/asm/paravirt.h | 28 ++- arch/x86/include/asm/paravirt_types.h | 10 + arch/x86/include/asm/qspinlock.h | 57 arch/x86/include/asm/spinlock.h |5 + arch/x86/include/asm/spinlock_types.h |4 + arch/x86/kernel/kvm.c | 43 +++ arch/x86/kernel/paravirt-spinlocks.c | 24 ++- arch/x86/kernel/paravirt_patch_32.c | 22 ++- arch/x86/kernel/paravirt_patch_64.c | 22 ++- arch/x86/xen/spinlock.c | 63 - include/asm-generic/qspinlock.h | 139 ++ include/asm-generic/qspinlock_types.h | 79 ++ include/linux/lfsr.h | 80 ++ kernel/Kconfig.locks |7 + kernel/locking/Makefile |1 + kernel/locking/mcs_spinlock.h |1 + kernel/locking/qspinlock.c| 474 + kernel/locking/qspinlock_paravirt.h | 433 ++ 19 files changed, 1480 insertions(+), 15 deletions(-) create mode 100644 arch/x86/include/asm/qspinlock.h create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 include/linux/lfsr.h create mode 100644 kernel/locking/qspinlock.c create mode 100644 kernel/locking/qspinlock_paravirt.h ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 13/15] pvqspinlock: Only kick CPU at unlock time
Before this patch, a CPU may have been kicked twice before getting the lock - one before it becomes queue head and once before it gets the lock. All these CPU kicking and halting (VMEXIT) can be expensive and slow down system performance, especially in an overcommitted guest. This patch add a new vCPU state (vcpu_hashed) which enables the code to delay CPU kicking until at unlock time. Once this state is set, the new lock holder will set _Q_SLOW_VAL and fill in the hash table on behalf of the halted queue head vCPU. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 10 ++-- kernel/locking/qspinlock_paravirt.h | 76 +-- 2 files changed, 59 insertions(+), 27 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index 33b3f54..b9ba83b 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -239,8 +239,8 @@ static __always_inline void set_locked(struct qspinlock *lock) static __always_inline void __pv_init_node(struct mcs_spinlock *node) { } static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { } -static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { } - +static __always_inline void __pv_scan_next(struct qspinlock *lock, + struct mcs_spinlock *node) { } static __always_inline void __pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node) { } @@ -248,7 +248,7 @@ static __always_inline void __pv_wait_head(struct qspinlock *lock, #define pv_init_node __pv_init_node #define pv_wait_node __pv_wait_node -#define pv_kick_node __pv_kick_node +#define pv_scan_next __pv_scan_next #define pv_wait_head __pv_wait_head @@ -441,7 +441,7 @@ queue: cpu_relax(); arch_mcs_spin_unlock_contended(next-locked); - pv_kick_node(next); + pv_scan_next(lock, next); release: /* @@ -462,7 +462,7 @@ EXPORT_SYMBOL(queue_spin_lock_slowpath); #undef pv_init_node #undef pv_wait_node -#undef pv_kick_node +#undef pv_scan_next #undef pv_wait_head #undef queue_spin_lock_slowpath diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h index 49dbd39..a210061 100644 --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -18,9 +18,16 @@ #define _Q_SLOW_VAL(3U _Q_LOCKED_OFFSET) +/* + * The vcpu_hashed is a special state that is set by the new lock holder on + * the new queue head to indicate that _Q_SLOW_VAL is set and hash entry + * filled. With this state, the queue head CPU will always be kicked even + * if it is not halted to avoid potential racing condition. + */ enum vcpu_state { vcpu_running = 0, vcpu_halted, + vcpu_hashed }; struct pv_node { @@ -97,7 +104,13 @@ static inline u32 hash_align(u32 hash) return hash ~(PV_HB_PER_LINE - 1); } -static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node) +/* + * Set up an entry in the lock hash table + * This is not inlined to reduce size of generated code as it is included + * twice and is used only in the slowest path of handling CPU halting. + */ +static noinline struct qspinlock ** +pv_hash(struct qspinlock *lock, struct pv_node *node) { unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits); struct pv_hash_bucket *hb, *end; @@ -178,7 +191,8 @@ static void pv_init_node(struct mcs_spinlock *node) /* * Wait for node-locked to become true, halt the vcpu after a short spin. - * pv_kick_node() is used to wake the vcpu again. + * pv_scan_next() is used to set _Q_SLOW_VAL and fill in hash table on its + * behalf. */ static void pv_wait_node(struct mcs_spinlock *node) { @@ -189,7 +203,6 @@ static void pv_wait_node(struct mcs_spinlock *node) for (loop = SPIN_THRESHOLD; loop; loop--) { if (READ_ONCE(node-locked)) return; - cpu_relax(); } @@ -198,17 +211,21 @@ static void pv_wait_node(struct mcs_spinlock *node) * * [S] pn-state = vcpu_halted[S] next-locked = 1 * MB MB -* [L] pn-locked [RmW] pn-state = vcpu_running +* [L] pn-locked [RmW] pn-state = vcpu_hashed * -* Matches the xchg() from pv_kick_node(). +* Matches the cmpxchg() from pv_scan_next(). */ (void)xchg(pn-state, vcpu_halted); if (!READ_ONCE(node-locked)) pv_wait(pn-state, vcpu_halted); - /* Make sure that state is correct for spurious wakeup */ - WRITE_ONCE(pn-state, vcpu_running
[Xen-devel] [PATCH v15 02/15] qspinlock, x86: Enable x86-64 to use queue spinlock
This patch makes the necessary changes at the x86 architecture specific layer to enable the use of queue spinlock for x86-64. As x86-32 machines are typically not multi-socket. The benefit of queue spinlock may not be apparent. So queue spinlock is not enabled. Currently, there is some incompatibilities between the para-virtualized spinlock code (which hard-codes the use of ticket spinlock) and the queue spinlock. Therefore, the use of queue spinlock is disabled when the para-virtualized spinlock is enabled. The arch/x86/include/asm/qspinlock.h header file includes some x86 specific optimization which will make the queue spinlock code perform better than the generic implementation. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- arch/x86/Kconfig |1 + arch/x86/include/asm/qspinlock.h | 20 arch/x86/include/asm/spinlock.h |5 + arch/x86/include/asm/spinlock_types.h |4 4 files changed, 30 insertions(+), 0 deletions(-) create mode 100644 arch/x86/include/asm/qspinlock.h diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index b7d31ca..49fecb1 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -125,6 +125,7 @@ config X86 select MODULES_USE_ELF_RELA if X86_64 select CLONE_BACKWARDS if X86_32 select ARCH_USE_BUILTIN_BSWAP + select ARCH_USE_QUEUE_SPINLOCK select ARCH_USE_QUEUE_RWLOCK select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION select OLD_SIGACTION if X86_32 diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h new file mode 100644 index 000..222995b --- /dev/null +++ b/arch/x86/include/asm/qspinlock.h @@ -0,0 +1,20 @@ +#ifndef _ASM_X86_QSPINLOCK_H +#define _ASM_X86_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +#definequeue_spin_unlock queue_spin_unlock +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + * + * A smp_store_release() on the least-significant byte. + */ +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + smp_store_release((u8 *)lock, 0); +} + +#include asm-generic/qspinlock.h + +#endif /* _ASM_X86_QSPINLOCK_H */ diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index cf87de3..a9c01fd 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -42,6 +42,10 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm/qspinlock.h +#else + #ifdef CONFIG_PARAVIRT_SPINLOCKS static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) @@ -196,6 +200,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock) cpu_relax(); } } +#endif /* CONFIG_QUEUE_SPINLOCK */ /* * Read-write spinlocks, allowing multiple readers diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 5f9d757..5d654a1 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -23,6 +23,9 @@ typedef u32 __ticketpair_t; #define TICKET_SHIFT (sizeof(__ticket_t) * 8) +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm-generic/qspinlock_types.h +#else typedef struct arch_spinlock { union { __ticketpair_t head_tail; @@ -33,6 +36,7 @@ typedef struct arch_spinlock { } arch_spinlock_t; #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } +#endif /* CONFIG_QUEUE_SPINLOCK */ #include asm-generic/qrwlock_types.h -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 06/15] qspinlock: Use a simple write to grab the lock
Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head don't need to reset the tail code. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the tail code in the lock is set. With that change, the are some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westere-EX machine as shown in the tables below. [Standalone/Embedded - same node] # of tasksBefore patchAfter patch %Change ----- -- --- 3 2324/2321 2248/2265-3%/-2% 4 2890/2896 2819/2831-2%/-2% 5 3611/3595 3522/3512-2%/-2% 6 4281/4276 4173/4160-3%/-3% 7 5018/5001 4875/4861-3%/-3% 8 5759/5750 5563/5568-3%/-3% [Standalone/Embedded - different nodes] # of tasksBefore patchAfter patch %Change ----- -- --- 312242/12237 12087/12093 -1%/-1% 410688/10696 10507/10521 -2%/-2% It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and was essentially to close the performance gap between the ticket spinlock and queue spinlock. The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were: AIM7 XFS Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock56782333.17 96.61 5.81 qspinlock 57507993.13 94.83 5.97 AIM7 EXT4 Disk Test kernel JPMReal Time Sys TimeUsr Time - ---- ticketlock1114551 16.15 509.72 7.11 qspinlock 21844668.24 232.99 6.01 The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run. The ebizzy -m test was also run with the following results: kernel records/s Real Time Sys TimeUsr Time -- - ticketlock 2075 10.00 216.35 3.49 qspinlock 3023 10.00 198.20 4.80 Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- kernel/locking/qspinlock.c | 66 +-- 1 files changed, 50 insertions(+), 16 deletions(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index bcc99e6..99503ef 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -105,24 +105,37 @@ static inline struct mcs_spinlock *decode_tail(u32 tail) * By using the whole 2nd least significant byte for the pending bit, we * can allow better optimization of the lock acquisition for the pending * bit holder. + * + * This internal structure is also used by the set_locked function which + * is not restricted to _Q_PENDING_BITS == 8. */ -#if _Q_PENDING_BITS == 8 - struct __qspinlock { union { atomic_t val; - struct { #ifdef __LITTLE_ENDIAN + struct { + u8 locked; + u8 pending; + }; + struct { u16 locked_pending; u16 tail; + }; #else + struct { u16 tail; u16 locked_pending; -#endif }; + struct { + u8 reserved[2]; + u8 pending; + u8 locked; + }; +#endif }; }; +#if _Q_PENDING_BITS == 8 /** * clear_pending_set_locked - take ownership and clear the pending bit. * @lock: Pointer to queue spinlock structure @@ -195,6 +208,19 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) #endif /* _Q_PENDING_BITS == 8 */ /** + * set_locked - Set the lock bit and own the lock + * @lock: Pointer to queue spinlock structure + * + * *,*,0 - *,0,1 + */ +static __always_inline void set_locked(struct qspinlock *lock) +{ + struct __qspinlock *l = (void *)lock; + + WRITE_ONCE(l-locked, _Q_LOCKED_VAL
[Xen-devel] [PATCH v15 10/15] pvqspinlock: Implement the paravirt qspinlock for x86
From: Peter Zijlstra (Intel) pet...@infradead.org We use the regular paravirt call patching to switch between: native_queue_spin_lock_slowpath() __pv_queue_spin_lock_slowpath() native_queue_spin_unlock()__pv_queue_spin_unlock() We use a callee saved call for the unlock function which reduces the i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions again. We further optimize the unlock path by patching the direct call with a movb $0,%arg1 if we are indeed using the native unlock code. This makes the unlock code almost as fast as the !PARAVIRT case. This significantly lowers the overhead of having CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/Kconfig |2 +- arch/x86/include/asm/paravirt.h | 28 +++- arch/x86/include/asm/paravirt_types.h | 10 ++ arch/x86/include/asm/qspinlock.h | 25 - arch/x86/kernel/paravirt-spinlocks.c | 24 +++- arch/x86/kernel/paravirt_patch_32.c | 22 ++ arch/x86/kernel/paravirt_patch_64.c | 22 ++ 7 files changed, 121 insertions(+), 12 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 49fecb1..a0946e7 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -661,7 +661,7 @@ config PARAVIRT_DEBUG config PARAVIRT_SPINLOCKS bool Paravirtualization layer for spinlocks depends on PARAVIRT SMP - select UNINLINE_SPIN_UNLOCK + select UNINLINE_SPIN_UNLOCK if !QUEUE_SPINLOCK ---help--- Paravirtualized spinlocks allow a pvops backend to replace the spinlock implementation with something virtualization-friendly diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 965c47d..dd40269 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -712,6 +712,30 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUE_SPINLOCK + +static __always_inline void pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) +{ + PVOP_VCALL2(pv_lock_ops.queue_spin_lock_slowpath, lock, val); +} + +static __always_inline void pv_queue_spin_unlock(struct qspinlock *lock) +{ + PVOP_VCALLEE1(pv_lock_ops.queue_spin_unlock, lock); +} + +static __always_inline void pv_wait(u8 *ptr, u8 val) +{ + PVOP_VCALL2(pv_lock_ops.wait, ptr, val); +} + +static __always_inline void pv_kick(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick, cpu); +} + +#else /* !CONFIG_QUEUE_SPINLOCK */ + static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -724,7 +748,9 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } -#endif +#endif /* CONFIG_QUEUE_SPINLOCK */ + +#endif /* SMP PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 #define PV_SAVE_REGS pushl %ecx; pushl %edx; diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 7549b8b..f6acaea 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -333,9 +333,19 @@ struct arch_spinlock; typedef u16 __ticket_t; #endif +struct qspinlock; + struct pv_lock_ops { +#ifdef CONFIG_QUEUE_SPINLOCK + void (*queue_spin_lock_slowpath)(struct qspinlock *lock, u32 val); + struct paravirt_callee_save queue_spin_unlock; + + void (*wait)(u8 *ptr, u8 val); + void (*kick)(int cpu); +#else /* !CONFIG_QUEUE_SPINLOCK */ struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket); +#endif /* !CONFIG_QUEUE_SPINLOCK */ }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index 64c925e..c8290db 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -3,6 +3,7 @@ #include asm/cpufeature.h #include asm-generic/qspinlock_types.h +#include asm/paravirt.h #definequeue_spin_unlock queue_spin_unlock /** @@ -11,11 +12,33 @@ * * A smp_store_release() on the least-significant byte. */ -static inline void queue_spin_unlock(struct qspinlock *lock) +static inline void native_queue_spin_unlock(struct qspinlock *lock) { smp_store_release((u8 *)lock, 0); } +#ifdef CONFIG_PARAVIRT_SPINLOCKS +extern void native_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_init_lock_hash(void); +extern void __pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void
[Xen-devel] [PATCH v15 08/15] lfsr: a simple binary Galois linear feedback shift register
This patch is based on the code sent out by Peter Zijstra as part of his queue spinlock patch to provide a hashing function with open addressing. The lfsr() function can be used to return a sequence of numbers that cycle through all the bit patterns (2^n -1) of a given bit width n except the value 0 in a somewhat random fashion depending on the LFSR taps that is being used. Callers can provide their own taps value or use the default. Signed-off-by: Waiman Long waiman.l...@hp.com --- include/linux/lfsr.h | 80 ++ 1 files changed, 80 insertions(+), 0 deletions(-) create mode 100644 include/linux/lfsr.h diff --git a/include/linux/lfsr.h b/include/linux/lfsr.h new file mode 100644 index 000..f570819 --- /dev/null +++ b/include/linux/lfsr.h @@ -0,0 +1,80 @@ +#ifndef _LINUX_LFSR_H +#define _LINUX_LFSR_H + +/* + * Simple Binary Galois Linear Feedback Shift Register + * + * http://en.wikipedia.org/wiki/Linear_feedback_shift_register + * + * This function only currently supports only bits values of 4-30. Callers + * that doesn't pass in a constant bits value can optionally define + * LFSR_MIN_BITS and LFSR_MAX_BITS before including the lfsr.h header file + * to reduce the size of the jump table in the compiled code, if desired. + */ +#ifndef LFSR_MIN_BITS +#define LFSR_MIN_BITS 4 +#endif + +#ifndef LFSR_MAX_BITS +#define LFSR_MAX_BITS 30 +#endif + +static __always_inline u32 lfsr_taps(int bits) +{ + BUG_ON((bits LFSR_MIN_BITS) || (bits LFSR_MAX_BITS)); + BUILD_BUG_ON((LFSR_MIN_BITS 4) || (LFSR_MAX_BITS 30)); + +#define _IF_BITS_EQ(x) \ + if (((x) = LFSR_MIN_BITS) ((x) = LFSR_MAX_BITS) ((x) == bits)) + + /* +* Feedback terms copied from +* http://users.ece.cmu.edu/~koopman/lfsr/index.html +*/ + _IF_BITS_EQ(4) return 0x0009; + _IF_BITS_EQ(5) return 0x0012; + _IF_BITS_EQ(6) return 0x0021; + _IF_BITS_EQ(7) return 0x0041; + _IF_BITS_EQ(8) return 0x008E; + _IF_BITS_EQ(9) return 0x0108; + _IF_BITS_EQ(10) return 0x0204; + _IF_BITS_EQ(11) return 0x0402; + _IF_BITS_EQ(12) return 0x0829; + _IF_BITS_EQ(13) return 0x100D; + _IF_BITS_EQ(14) return 0x2015; + _IF_BITS_EQ(15) return 0x4122; + _IF_BITS_EQ(16) return 0x8112; + _IF_BITS_EQ(17) return 0x102C9; + _IF_BITS_EQ(18) return 0x20195; + _IF_BITS_EQ(19) return 0x403FE; + _IF_BITS_EQ(20) return 0x80637; + _IF_BITS_EQ(21) return 0x100478; + _IF_BITS_EQ(22) return 0x20069E; + _IF_BITS_EQ(23) return 0x4004B2; + _IF_BITS_EQ(24) return 0x800B87; + _IF_BITS_EQ(25) return 0x10004F3; + _IF_BITS_EQ(26) return 0x200072D; + _IF_BITS_EQ(27) return 0x40006AE; + _IF_BITS_EQ(28) return 0x80009E3; + _IF_BITS_EQ(29) return 0x1583; + _IF_BITS_EQ(30) return 0x2C92; +#undef _IF_BITS_EQ + + /* Unreachable */ + return 0; +} + +/* + * Please note that LFSR doesn't work with a start state of 0. + */ +static inline u32 lfsr(u32 val, int bits, u32 taps) +{ + u32 bit = val 1; + + val = 1; + if (bit) + val ^= taps ? taps : lfsr_taps(bits); + return val; +} + +#endif /* _LINUX_LFSR_H */ -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 12/15] pvqspinlock, x86: Enable PV qspinlock for Xen
This patch adds the necessary Xen specific code to allow Xen to support the CPU halting and kicking operations needed by the queue spinlock PV code. Signed-off-by: Waiman Long waiman.l...@hp.com --- arch/x86/xen/spinlock.c | 63 --- kernel/Kconfig.locks|2 +- 2 files changed, 60 insertions(+), 5 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 956374c..728b45b 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -17,6 +17,55 @@ #include xen-ops.h #include debugfs.h +static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; +static DEFINE_PER_CPU(char *, irq_name); +static bool xen_pvspin = true; + +#ifdef CONFIG_QUEUE_SPINLOCK + +#include asm/qspinlock.h + +static void xen_qlock_kick(int cpu) +{ + xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR); +} + +/* + * Halt the current CPU release it back to the host + */ +static void xen_qlock_wait(u8 *byte, u8 val) +{ + int irq = __this_cpu_read(lock_kicker_irq); + + /* If kicker interrupts not initialized yet, just spin */ + if (irq == -1) + return; + + /* clear pending */ + xen_clear_irq_pending(irq); + + /* +* We check the byte value after clearing pending IRQ to make sure +* that we won't miss a wakeup event because of the clearing. +* +* The sync_clear_bit() call in xen_clear_irq_pending() is atomic. +* So it is effectively a memory barrier for x86. +*/ + if (READ_ONCE(*byte) != val) + return; + + /* +* If an interrupt happens here, it will leave the wakeup irq +* pending, which will cause xen_poll_irq() to return +* immediately. +*/ + + /* Block until irq becomes pending (or perhaps a spurious wakeup) */ + xen_poll_irq(irq); +} + +#else /* CONFIG_QUEUE_SPINLOCK */ + enum xen_contention_stat { TAKEN_SLOW, TAKEN_SLOW_PICKUP, @@ -100,12 +149,9 @@ struct xen_lock_waiting { __ticket_t want; }; -static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; -static DEFINE_PER_CPU(char *, irq_name); static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting); static cpumask_t waiting_cpus; -static bool xen_pvspin = true; __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want) { int irq = __this_cpu_read(lock_kicker_irq); @@ -217,6 +263,7 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next) } } } +#endif /* CONFIG_QUEUE_SPINLOCK */ static irqreturn_t dummy_handler(int irq, void *dev_id) { @@ -280,8 +327,16 @@ void __init xen_init_spinlocks(void) return; } printk(KERN_DEBUG xen: PV spinlocks enabled\n); +#ifdef CONFIG_QUEUE_SPINLOCK + __pv_init_lock_hash(); + pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath; + pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock); + pv_lock_ops.wait = xen_qlock_wait; + pv_lock_ops.kick = xen_qlock_kick; +#else pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning); pv_lock_ops.unlock_kick = xen_unlock_kick; +#endif } /* @@ -310,7 +365,7 @@ static __init int xen_parse_nopvspin(char *arg) } early_param(xen_nopvspin, xen_parse_nopvspin); -#ifdef CONFIG_XEN_DEBUG_FS +#if defined(CONFIG_XEN_DEBUG_FS) !defined(CONFIG_QUEUE_SPINLOCK) static struct dentry *d_spin_debug; diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index 537b13e..0b42933 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK config QUEUE_SPINLOCK def_bool y if ARCH_USE_QUEUE_SPINLOCK - depends on SMP (!PARAVIRT_SPINLOCKS || !XEN) + depends on SMP config ARCH_USE_QUEUE_RWLOCK bool -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
Provide a separate (second) version of the spin_lock_slowpath for paravirt along with a special unlock path. The second slowpath is generated by adding a few pv hooks to the normal slowpath, but where those will compile away for the native case, they expand into special wait/wake code for the pv version. The actual MCS queue can use extra storage in the mcs_nodes[] array to keep track of state and therefore uses directed wakeups. The head contender has no such storage directly visible to the unlocker. So the unlocker searches a hash table with open addressing using a simple binary Galois linear feedback shift register. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock.c | 69 - kernel/locking/qspinlock_paravirt.h | 321 +++ 2 files changed, 389 insertions(+), 1 deletions(-) create mode 100644 kernel/locking/qspinlock_paravirt.h diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index fc2e5ab..33b3f54 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -18,6 +18,9 @@ * Authors: Waiman Long waiman.l...@hp.com * Peter Zijlstra pet...@infradead.org */ + +#ifndef _GEN_PV_LOCK_SLOWPATH + #include linux/smp.h #include linux/bug.h #include linux/cpumask.h @@ -65,13 +68,21 @@ #include mcs_spinlock.h +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define MAX_NODES 8 +#else +#define MAX_NODES 4 +#endif + /* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. * * Exactly fits one 64-byte cacheline on a 64-bit architecture. + * + * PV doubles the storage and uses the second cacheline for PV state. */ -static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]); +static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]); /* * We must be able to distinguish between no-tail and the tail at 0:0, @@ -220,6 +231,33 @@ static __always_inline void set_locked(struct qspinlock *lock) WRITE_ONCE(l-locked, _Q_LOCKED_VAL); } + +/* + * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for + * all the PV callbacks. + */ + +static __always_inline void __pv_init_node(struct mcs_spinlock *node) { } +static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { } +static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { } + +static __always_inline void __pv_wait_head(struct qspinlock *lock, + struct mcs_spinlock *node) { } + +#define pv_enabled() false + +#define pv_init_node __pv_init_node +#define pv_wait_node __pv_wait_node +#define pv_kick_node __pv_kick_node + +#define pv_wait_head __pv_wait_head + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define queue_spin_lock_slowpath native_queue_spin_lock_slowpath +#endif + +#endif /* _GEN_PV_LOCK_SLOWPATH */ + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure @@ -249,6 +287,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + if (pv_enabled()) + goto queue; + if (virt_queue_spin_lock(lock)) return; @@ -325,6 +366,7 @@ queue: node += idx; node-locked = 0; node-next = NULL; + pv_init_node(node); /* * We touched a (possibly) cold cacheline in the per-cpu queue node; @@ -350,6 +392,7 @@ queue: prev = decode_tail(old); WRITE_ONCE(prev-next, node); + pv_wait_node(node); arch_mcs_spin_lock_contended(node-locked); } @@ -365,6 +408,7 @@ queue: * does not imply a full barrier. * */ + pv_wait_head(lock, node); while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_PENDING_MASK) cpu_relax(); @@ -397,6 +441,7 @@ queue: cpu_relax(); arch_mcs_spin_unlock_contended(next-locked); + pv_kick_node(next); release: /* @@ -405,3 +450,25 @@ release: this_cpu_dec(mcs_nodes[0].count); } EXPORT_SYMBOL(queue_spin_lock_slowpath); + +/* + * Generate the paravirt code for queue_spin_unlock_slowpath(). + */ +#if !defined(_GEN_PV_LOCK_SLOWPATH) defined(CONFIG_PARAVIRT_SPINLOCKS) +#define _GEN_PV_LOCK_SLOWPATH + +#undef pv_enabled +#define pv_enabled() true + +#undef pv_init_node +#undef pv_wait_node +#undef pv_kick_node +#undef pv_wait_head + +#undef queue_spin_lock_slowpath +#define queue_spin_lock_slowpath __pv_queue_spin_lock_slowpath + +#include qspinlock_paravirt.h +#include qspinlock.c + +#endif diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h new file mode 100644 index 000..49dbd39 --- /dev/null +++ b/kernel/locking/qspinlock_paravirt.h @@ -0,0 +1,321
[Xen-devel] [PATCH v15 15/15] pvqspinlock: Add debug code to check for PV lock hash sanity
The current code for PV lock hash table processing will panic the system if pv_hash_find() can't find the desired hash bucket. However, there is no check to see if there is more than one entry for a given lock which should never happen. This patch adds a pv_hash_check_duplicate() function to do that which will only be enabled if CONFIG_DEBUG_SPINLOCK is defined because of the performance overhead it introduces. Signed-off-by: Waiman Long waiman.l...@hp.com --- kernel/locking/qspinlock_paravirt.h | 58 +++ 1 files changed, 58 insertions(+), 0 deletions(-) diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h index a9fe10d..4d39c8b 100644 --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -107,6 +107,63 @@ static inline u32 hash_align(u32 hash) } /* + * Hash table debugging code + */ +#ifdef CONFIG_DEBUG_SPINLOCK + +#define _NODE_IDX(pn) unsigned long)pn) (SMP_CACHE_BYTES - 1)) /\ + sizeof(struct mcs_spinlock)) +/* + * Check if there is additional hash buckets with the same lock which + * should not happen. + */ +static inline void pv_hash_check_duplicate(struct qspinlock *lock) +{ + struct pv_hash_bucket *hb, *end, *hb1 = NULL; + int count = 0, used = 0; + + end = pv_lock_hash[1 pv_lock_hash_bits]; + for (hb = pv_lock_hash; hb end; hb++) { + struct qspinlock *l = READ_ONCE(hb-lock); + struct pv_node *pn; + + if (l) + used++; + if (l != lock) + continue; + if (++count == 1) { + hb1 = hb; + continue; + } + WARN_ON(count == 2); + if (hb1) { + pn = READ_ONCE(hb1-node); + printk(KERN_ERR PV lock hash error: duplicated entry + #%d - hash %ld, node %ld, cpu %d\n, 1, + hb1 - pv_lock_hash, _NODE_IDX(pn), + pn ? pn-cpu : -1); + hb1 = NULL; + } + pn = READ_ONCE(hb-node); + printk(KERN_ERR PV lock hash error: duplicated entry #%d - + hash %ld, node %ld, cpu %d\n, count, hb - pv_lock_hash, + _NODE_IDX(pn), pn ? pn-cpu : -1); + } + /* +* Warn if more than half of the buckets are used +*/ + if (used (1 (pv_lock_hash_bits - 1))) + printk(KERN_WARNING PV lock hash warning: + %d hash entries used!\n, used); +} + +#else /* CONFIG_DEBUG_SPINLOCK */ + +static inline void pv_hash_check_duplicate(struct qspinlock *lock) {} + +#endif /* CONFIG_DEBUG_SPINLOCK */ + +/* * Set up an entry in the lock hash table * This is not inlined to reduce size of generated code as it is included * twice and is used only in the slowest path of handling CPU halting. @@ -141,6 +198,7 @@ pv_hash(struct qspinlock *lock, struct pv_node *node) } done: + pv_hash_check_duplicate(lock); return hb-lock; } -- 1.7.1 ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support
On 04/02/2015 03:48 PM, Peter Zijlstra wrote: On Thu, Apr 02, 2015 at 07:20:57PM +0200, Peter Zijlstra wrote: pv_wait_head(): pv_hash() /* MB as per cmpxchg */ cmpxchg(l-locked, _Q_LOCKED_VAL, _Q_SLOW_VAL); VS __pv_queue_spin_unlock(): if (xchg(l-locked, 0) != _Q_SLOW_VAL) return; /* MB as per xchg */ pv_hash_find(lock); Something like so.. compile tested only. I took out the LFSR because that was likely over engineering from my side :-) --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -2,6 +2,8 @@ #error do not include this file #endif +#includelinux/hash.h + /* * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead * of spinning them. @@ -107,7 +109,84 @@ static void pv_kick_node(struct mcs_spin pv_kick(pn-cpu); } -static DEFINE_PER_CPU(struct qspinlock *, __pv_lock_wait); +/* + * Hash table using open addressing with an linear probe sequence. + * + * Since we should not be holding locks from NMI context (very rare indeed) the + * max load factor is 0.75, which is around the point where open adressing + * breaks down. + * + * Instead of probing just the immediate bucket we probe all buckets in the + * same cacheline. + * + * http://en.wikipedia.org/wiki/Hash_table#Open_addressing + * + */ + +struct pv_hash_bucket { + struct qspinlock *lock; + int cpu; +}; + +/* + * XXX dynamic allocate using nr_cpu_ids instead... + */ +#define PV_LOCK_HASH_BITS (2 + NR_CPUS_BITS) + +#if PV_LOCK_HASH_BITS 6 +#undef PV_LOCK_HASH_BITS +#define PB_LOCK_HASH_BITS 6 +#endif + +#define PV_LOCK_HASH_SIZE (1 PV_LOCK_HASH_BITS) + +static struct pv_hash_bucket __pv_lock_hash[PV_LOCK_HASH_SIZE] cacheline_aligned; + +#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket)) + +static inline u32 hash_align(u32 hash) +{ + return hash ~(PV_HB_PER_LINE - 1); +} + +#define for_each_hash_bucket(hb, off, hash) \ + for (hash = hash_align(hash), off = 0, hb =__pv_lock_hash[hash + off];\ + off PV_LOCK_HASH_SIZE; \ + off++, hb =__pv_lock_hash[(hash + off) % PV_LOCK_HASH_SIZE]) + +static struct pv_hash_bucket *pv_hash_insert(struct qspinlock *lock) +{ + u32 offset, hash = hash_ptr(lock, PV_LOCK_HASH_BITS); + struct pv_hash_bucket *hb; + + for_each_hash_bucket(hb, offset, hash) { + if (!cmpxchg(hb-lock, NULL, lock)) { + WRITE_ONCE(hb-cpu, smp_processor_id()); + return hb; + } + } + + /* +* Hard assumes there is an empty bucket somewhere. +*/ + BUG(); +} + +static struct pv_hash_bucket *pv_hash_find(struct qspinlock *lock) +{ + u32 offset, hash = hash_ptr(lock, PV_LOCK_HASH_BITS); + struct pv_hash_bucket *hb; + + for_each_hash_bucket(hb, offset, hash) { + if (READ_ONCE(hb-lock) == lock) + return hb; + } + + /* +* Hard assumes we _WILL_ find a match. +*/ + BUG(); +} /* * Wait for l-locked to become clear; halt the vcpu after a short spin. @@ -116,7 +195,9 @@ static DEFINE_PER_CPU(struct qspinlock * static void pv_wait_head(struct qspinlock *lock) { struct __qspinlock *l = (void *)lock; + struct pv_hash_bucket *hb = NULL; int loop; + u8 o; for (;;) { for (loop = SPIN_THRESHOLD; loop; loop--) { @@ -126,29 +207,47 @@ static void pv_wait_head(struct qspinloc cpu_relax(); } - this_cpu_write(__pv_lock_wait, lock); - /* -* __pv_lock_wait must be set before setting _Q_SLOW_VAL -* -* [S] __pv_lock_wait = lock[RmW] l = l-locked = 0 -* MB MB -* [S] l-locked = _Q_SLOW_VAL [L] __pv_lock_wait -* -* Matches the xchg() in pv_queue_spin_unlock(). -*/ - if (!cmpxchg(l-locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) - goto done; + if (!hb) { + hb = pv_hash_insert(lock); + /* +* hb must be set before setting _Q_SLOW_VAL +* +* [S] hb- lock [RmW] l = l-locked = 0 +* MB MB +* [RmW] l-locked ?= _Q_SLOW_VAL [L] hb +* +* Matches the xchg() in pv_queue_spin_unlock(). +*/ + o = cmpxchg(l-locked, _Q_LOCKED_VAL, _Q_SLOW_VAL); + if (!o) { + /* +
Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support
On 04/01/2015 05:03 PM, Peter Zijlstra wrote: On Wed, Apr 01, 2015 at 03:58:58PM -0400, Waiman Long wrote: On 04/01/2015 02:48 PM, Peter Zijlstra wrote: I am sorry that I don't quite get what you mean here. My point is that in the hashing step, a cpu will need to scan an empty bucket to put the lock in. In the interim, an previously used bucket before the empty one may get freed. In the lookup step for that lock, the scanning will stop because of an empty bucket in front of the target one. Right, that's broken. So we need to do something else to limit the lookup, because without that break, a lookup that needs to iterate the entire array in order to determine -ENOENT, which is expensive. So my alternative proposal is that IFF we can guarantee that every lookup will succeed -- the entry we're looking for is always there, we don't need the break on empty but can probe until we find the entry. This will be bound in cost to the same number if probes we required for insertion and avoids the full array scan. Now I think we can indeed do this, if as said earlier we do not clear the bucket on insert if the cmpxchg succeeds, in that case the unlock will observe _Q_SLOW_VAL and do the lookup, the lookup will then find the entry. And we then need the unlock to clear the entry. _Q_SLOW_VAL Does that explain this? Or should I try again with code? OK, I got your proposal now. However, there is still the issue that setting the _Q_SLOW_VAL flag and the hash bucket are not atomic wrt each other. It is possible a CPU has set the _Q_SLOW_VAL flag but not yet filling in the hash bucket while another one is trying to look for it. So we need to have some kind of synchronization mechanism to let the lookup CPU know when is a good time to look up. One possibility is to delay setting _Q_SLOW_VAL until the hash bucket is set up. Maybe we can make that work. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support
On 03/19/2015 08:25 AM, Peter Zijlstra wrote: On Thu, Mar 19, 2015 at 11:12:42AM +0100, Peter Zijlstra wrote: So I was now thinking of hashing the lock pointer; let me go and quickly put something together. A little something like so; ideally we'd allocate the hashtable since NR_CPUS is kinda bloated, but it shows the idea I think. And while this has loops in (the rehashing thing) their fwd progress does not depend on other CPUs. And I suspect that for the typical lock contention scenarios its unlikely we ever really get into long rehashing chains. --- include/linux/lfsr.h| 49 kernel/locking/qspinlock_paravirt.h | 143 2 files changed, 178 insertions(+), 14 deletions(-) --- /dev/null + +static int pv_hash_find(struct qspinlock *lock) +{ + u64 hash = hash_ptr(lock, PV_LOCK_HASH_BITS); + struct pv_hash_bucket *hb, *end; + int cpu = -1; + + if (!hash) + hash = 1; + + hb =__pv_lock_hash[hash_align(hash)]; + for (;;) { + for (end = hb + PV_HB_PER_LINE; hb end; hb++) { + struct qspinlock *l = READ_ONCE(hb-lock); + + /* +* If we hit an unused bucket, there is no match. +*/ + if (!l) + goto done; After more careful reading, I think the assumption that the presence of an unused bucket means there is no match is not true. Consider the scenario: 1. cpu 0 puts lock1 into hb[0] 2. cpu 1 puts lock2 into hb[1] 3. cpu 2 clears hb[0] 4. cpu 3 looks for lock2 and doesn't find it I was thinking about putting some USED flag in the buckets, but then we will eventually fill them all up as used. If we put the entries into a hashed linked list, we have to deal with the complicated synchronization issues with link list update. At this point, I am thinking using back your previous idea of passing the queue head information down the queue. I am now convinced that the unlock call site patching should work. So I will incorporate that in my next update. Please let me know if you think my reasoning is not correct. Thanks, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support
On 04/01/2015 02:17 PM, Peter Zijlstra wrote: On Wed, Apr 01, 2015 at 07:42:39PM +0200, Peter Zijlstra wrote: Hohumm.. time to think more I think ;-) So bear with me, I've not really pondered this well so it could be full of holes (again). After the cmpxchg(l-locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds the spin_unlock() must do the hash lookup, right? We can make the lookup unhash. If the cmpxchg() fails the unlock will not do the lookup and we must unhash. The idea being that the result is that any lookup is guaranteed to find an entry, which reduces our worst case lookup cost to whatever the worst case insertion cost was. I think it doesn't matter who did the unhashing. Multiple independent locks can be hashed to the same value. Since they can be unhashed independently, there is no way to know whether you have checked all the possible buckets. -Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support
On 04/01/2015 02:48 PM, Peter Zijlstra wrote: On Wed, Apr 01, 2015 at 02:54:45PM -0400, Waiman Long wrote: On 04/01/2015 02:17 PM, Peter Zijlstra wrote: On Wed, Apr 01, 2015 at 07:42:39PM +0200, Peter Zijlstra wrote: Hohumm.. time to think more I think ;-) So bear with me, I've not really pondered this well so it could be full of holes (again). After the cmpxchg(l-locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds the spin_unlock() must do the hash lookup, right? We can make the lookup unhash. If the cmpxchg() fails the unlock will not do the lookup and we must unhash. The idea being that the result is that any lookup is guaranteed to find an entry, which reduces our worst case lookup cost to whatever the worst case insertion cost was. I think it doesn't matter who did the unhashing. Multiple independent locks can be hashed to the same value. Since they can be unhashed independently, there is no way to know whether you have checked all the possible buckets. oh but the crux is that you guarantee a lookup will find an entry. it will never need to iterate the entire array. I am sorry that I don't quite get what you mean here. My point is that in the hashing step, a cpu will need to scan an empty bucket to put the lock in. In the interim, an previously used bucket before the empty one may get freed. In the lookup step for that lock, the scanning will stop because of an empty bucket in front of the target one. -Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support
On 04/01/2015 01:12 PM, Peter Zijlstra wrote: On Wed, Apr 01, 2015 at 12:20:30PM -0400, Waiman Long wrote: After more careful reading, I think the assumption that the presence of an unused bucket means there is no match is not true. Consider the scenario: 1. cpu 0 puts lock1 into hb[0] 2. cpu 1 puts lock2 into hb[1] 3. cpu 2 clears hb[0] 4. cpu 3 looks for lock2 and doesn't find it Hmm, yes. The only way I can see that being true is if we assume entries are never taken out again. The wikipedia page could use some clarification here, this is not clear. At this point, I am thinking using back your previous idea of passing the queue head information down the queue. Having to scan the entire array for a lookup sure sucks, but the wait loops involved in the other idea can get us in the exact predicament we were trying to get out, because their forward progress depends on other CPUs. For the waiting loop, the worst case is when a new CPU get queued right before we write the head value to the previous tail node. In the case, the maximum number of retries is equal to the total number of CPUs - 2. But that should rarely happen. I do find a way to guarantee forward progress in a few steps. I will try the normal way once. If that fails, I will insert the head node to the tail once again after saving the next pointer. After modifying the previous tail node, cmpxchg will be used to restore the previous tail. If that fails, we just have to wait until the next pointer is updated and write it out to the previous tail node. We can now restore the next pointer and move forward. Let me know if that looks reasonable to you. -Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15
On 03/25/2015 03:47 PM, Konrad Rzeszutek Wilk wrote: On Mon, Mar 16, 2015 at 02:16:13PM +0100, Peter Zijlstra wrote: Hi Waiman, As promised; here is the paravirt stuff I did during the trip to BOS last week. All the !paravirt patches are more or less the same as before (the only real change is the copyright lines in the first patch). The paravirt stuff is 'simple' and KVM only -- the Xen code was a little more convoluted and I've no real way to test that but it should be stright fwd to make work. I ran this using the virtme tool (thanks Andy) on my laptop with a 4x overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has) and it both booted and survived a hackbench run (perf bench sched messaging -g 20 -l 5000). So while the paravirt code isn't the most optimal code ever conceived it does work. Also, the paravirt patching includes replacing the call with movb $0, %arg1 for the native case, which should greatly reduce the cost of having CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware. Ah nice. That could be spun out as a seperate patch to optimize the existing ticket locks I presume. The goal is to replace ticket spinlock by queue spinlock. We may not want to support 2 different spinlock implementations in the kernel. Now with the old pv ticketlock code an vCPU would only go to sleep once and be woken up when it was its turn. With this new code it is woken up twice (and twice it goes to sleep). With an overcommit scenario this would imply that we will have at least twice as many VMEXIT as with the previous code. I did it differently in my PV portion of the qspinlock patch. Instead of just waking up the CPU, the new lock holder will check if the new queue head has been halted. If so, it will set the slowpath flag for the halted queue head in the lock so as to wake it up at unlock time. This should eliminate your concern of dong twice as many VMEXIT in an overcommitted scenario. BTW, I did some qspinlock vs. ticketspinlock benchmarks using AIM7 high_systime workload on a 4-socket IvyBridge-EX system (60 cores, 120 threads) with some interesting results. In term of the performance benefit of this patch, I ran the high_systime workload (which does a lot of fork() and exit()) at various load levels (500, 1000, 1500 and 2000 users) on a 4-socket IvyBridge-EX bare-metal system (60 cores, 120 threads) with intel_pstate driver and performance scaling governor. The JPM (jobs/minutes) and execution time results were as follows: Kernel JPMExecution Time -- ----- At 500 users: 3.19118857.1426.25s 3.19-qspinlock134889.7523.13s % change +13.5%-11.9% At 1000 users: 3.19204255.3230.55s 3.19-qspinlock239631.3426.04s % change +17.3%-14.8% At 1500 users: 3.19177272.7352.80s 3.19-qspinlock326132.4028.70s % change +84.0%-45.6% At 2000 users: 3.19196690.3163.45s 3.19-qspinlock341730.5636.52s % change +73.7%-42.4% It turns out that this workload was causing quite a lot of spinlock contention in the vanilla 3.19 kernel. The performance advantage of this patch increases with heavier loads. With the powersave governor, the JPM data were as follows: Users3.19 3.19-qspinlock % Change --- 500 112635.38 132596.69 +17.7% 1000 171240.40 240369.80 +40.4% 1500 130507.53 324436.74 +148.6% 2000 175972.93 341637.01 +94.1% With the qspinlock patch, there wasn't too much difference in performance between the 2 scaling governors. Without this patch, the powersave governor was much slower than the performance governor. By disabling the intel_pstate driver and use acpi_cpufreq instead, the benchmark performance (JPM) at 1000 users level for the performance and ondemand governors were: Governor 3.193.19-qspinlock % Change -- performance 124949.94 219950.65+76.0% ondemand 4838.90 206690.96+4171% The performance was just horrible when there was significant spinlock contention with the ondemand governor. There was also significant run-to-run variation. A second run of the same benchmark gave a result of 22115 JPMs. With the qspinlock patch, however, the performance was much more stable under different cpufreq drivers and governors. That is not the case with the default ticket spinlock implementation. The %CPU times spent on spinlock contention (from perf) with the performance governor and the intel_pstate driver were: Kernel Function3.19 kernel3.19-qspinlock kernel ------
Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15
On 03/30/2015 12:29 PM, Peter Zijlstra wrote: On Mon, Mar 30, 2015 at 12:25:12PM -0400, Waiman Long wrote: I did it differently in my PV portion of the qspinlock patch. Instead of just waking up the CPU, the new lock holder will check if the new queue head has been halted. If so, it will set the slowpath flag for the halted queue head in the lock so as to wake it up at unlock time. This should eliminate your concern of dong twice as many VMEXIT in an overcommitted scenario. We can still do that on top of all this right? As you might have realized I'm a fan of gradual complexity :-) Of course. I am just saying that the concern can be addressed with some additional code change. -Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15
On 03/27/2015 10:07 AM, Konrad Rzeszutek Wilk wrote: On Thu, Mar 26, 2015 at 09:21:53PM +0100, Peter Zijlstra wrote: On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote: Ah nice. That could be spun out as a seperate patch to optimize the existing ticket locks I presume. Yes I suppose we can do something similar for the ticket and patch in the right increment. We'd need to restructure the code a bit, but its not fundamentally impossible. We could equally apply the head hashing to the current ticket implementation and avoid the current bitmap iteration. Now with the old pv ticketlock code an vCPU would only go to sleep once and be woken up when it was its turn. With this new code it is woken up twice (and twice it goes to sleep). With an overcommit scenario this would imply that we will have at least twice as many VMEXIT as with the previous code. An astute observation, I had not considered that. Thank you. I presume when you did benchmarking this did not even register? Thought I wonder if it would if you ran the benchmark for a week or so. You presume I benchmarked :-) I managed to boot something virt and run hackbench in it. I wouldn't know a representative virt setup if I ran into it. The thing is, we want this qspinlock for real hardware because its faster and I really want to avoid having to carry two spinlock implementations -- although I suppose that if we really really have to we could. In some way you already have that - for virtualized environments where you don't have an PV mechanism you just use the byte spinlock - which is good. And switching to PV ticketlock implementation after boot.. ugh. I feel your pain. What if you used an PV bytelock implemenation? The code you posted already 'sprays' all the vCPUS to wake up. And that is exactly what you need for PV bytelocks - well, you only need to wake up the vCPUS that have gone to sleep waiting on an specific 'struct spinlock' and just stash those in an per-cpu area. The old Xen spinlock code (Before 3.11?) had this. Just an idea thought. The current code should have just waken up one sleeping vCPU. We shouldn't want to wake up all of them and have almost all except one go back to sleep. I think the PV bytelock you suggest is workable. It should also simplify the implementation. It is just a matter of how much we value the fairness attribute of the PV ticket or queue spinlock implementation that we have. -Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support
On 03/19/2015 08:25 AM, Peter Zijlstra wrote: On Thu, Mar 19, 2015 at 11:12:42AM +0100, Peter Zijlstra wrote: So I was now thinking of hashing the lock pointer; let me go and quickly put something together. A little something like so; ideally we'd allocate the hashtable since NR_CPUS is kinda bloated, but it shows the idea I think. And while this has loops in (the rehashing thing) their fwd progress does not depend on other CPUs. And I suspect that for the typical lock contention scenarios its unlikely we ever really get into long rehashing chains. --- include/linux/lfsr.h| 49 kernel/locking/qspinlock_paravirt.h | 143 2 files changed, 178 insertions(+), 14 deletions(-) This is a much better alternative. --- /dev/null +++ b/include/linux/lfsr.h @@ -0,0 +1,49 @@ +#ifndef _LINUX_LFSR_H +#define _LINUX_LFSR_H + +/* + * Simple Binary Galois Linear Feedback Shift Register + * + * http://en.wikipedia.org/wiki/Linear_feedback_shift_register + * + */ + +extern void __lfsr_needs_more_taps(void); + +static __always_inline u32 lfsr_taps(int bits) +{ + if (bits == 1) return 0x0001; + if (bits == 2) return 0x0001; + if (bits == 3) return 0x0003; + if (bits == 4) return 0x0009; + if (bits == 5) return 0x0012; + if (bits == 6) return 0x0021; + if (bits == 7) return 0x0041; + if (bits == 8) return 0x008E; + if (bits == 9) return 0x0108; + if (bits == 10) return 0x0204; + if (bits == 11) return 0x0402; + if (bits == 12) return 0x0829; + if (bits == 13) return 0x100D; + if (bits == 14) return 0x2015; + + /* +* For more taps see: +* http://users.ece.cmu.edu/~koopman/lfsr/index.html +*/ + __lfsr_needs_more_taps(); + + return 0; +} + +static inline u32 lfsr(u32 val, int bits) +{ + u32 bit = val 1; + + val= 1; + if (bit) + val ^= lfsr_taps(bits); + return val; +} + +#endif /* _LINUX_LFSR_H */ --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -2,6 +2,9 @@ #error do not include this file #endif +#includelinux/hash.h +#includelinux/lfsr.h + /* * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead * of spinning them. @@ -107,7 +110,120 @@ static void pv_kick_node(struct mcs_spin pv_kick(pn-cpu); } -static DEFINE_PER_CPU(struct qspinlock *, __pv_lock_wait); +/* + * Hash table using open addressing with an LFSR probe sequence. + * + * Since we should not be holding locks from NMI context (very rare indeed) the + * max load factor is 0.75, which is around the point where open addressing + * breaks down. + * + * Instead of probing just the immediate bucket we probe all buckets in the + * same cacheline. + * + * http://en.wikipedia.org/wiki/Hash_table#Open_addressing + * + */ + +#define HB_RESERVED((struct qspinlock *)1) + +struct pv_hash_bucket { + struct qspinlock *lock; + int cpu; +}; + +/* + * XXX dynamic allocate using nr_cpu_ids instead... + */ +#define PV_LOCK_HASH_BITS (2 + NR_CPUS_BITS) + As said here, we should make it dynamically allocated depending on num_possible_cpus(). +#if PV_LOCK_HASH_BITS 6 +#undef PV_LOCK_HASH_BITS +#define PB_LOCK_HASH_BITS 6 +#endif + +#define PV_LOCK_HASH_SIZE (1 PV_LOCK_HASH_BITS) + +static struct pv_hash_bucket __pv_lock_hash[PV_LOCK_HASH_SIZE] cacheline_aligned; + +#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket)) + +static inline u32 hash_align(u32 hash) +{ + return hash ~(PV_HB_PER_LINE - 1); +} + +static struct qspinlock **pv_hash(struct qspinlock *lock) +{ + u32 hash = hash_ptr(lock, PV_LOCK_HASH_BITS); + struct pv_hash_bucket *hb, *end; + + if (!hash) + hash = 1; + + hb =__pv_lock_hash[hash_align(hash)]; + for (;;) { + for (end = hb + PV_HB_PER_LINE; hb end; hb++) { + if (cmpxchg(hb-lock, NULL, HB_RESERVED)) { + WRITE_ONCE(hb-cpu, smp_processor_id()); + /* +* Since we must read lock first and cpu +* second, we must write cpu first and lock +* second, therefore use HB_RESERVE to mark an +* entry in use before writing the values. +* +* This can cause hb_hash_find() to not find a +* cpu even though _Q_SLOW_VAL, this is not a +* problem since we re-check l-locked before +* going to sleep and the unlock will have +* cleared l-locked already. +*/ +
Re: [Xen-devel] [PATCH 9/9] qspinlock, x86, kvm: Implement KVM support for paravirt qspinlock
On 03/19/2015 06:01 AM, Peter Zijlstra wrote: On Wed, Mar 18, 2015 at 10:45:55PM -0400, Waiman Long wrote: On 03/16/2015 09:16 AM, Peter Zijlstra wrote: I do have some concern about this call site patching mechanism as the modification is not atomic. The spin_unlock() calls are in many places in the kernel. There is a possibility that a thread is calling a certain spin_unlock call site while it is being patched by another one with the alternative() function call. So far, I don't see any problem with bare metal where paravirt_patch_insns() is used to patch it to the move instruction. However, in a virtual guest enivornment where paravirt_patch_call() was used, there were situations where the system panic because of page fault on some invalid memory in the kthread. If you look at the paravirt_patch_call(), you will see: : b-opcode = 0xe8; /* call */ b-delta = delta; If another CPU reads the instruction at the call site at the right moment, it will get the modified call instruction, but not the new delta value. It will then jump to a random location. I believe that was causing the system panic that I saw. So I think it is kind of risky to use it here unless we can guarantee that call site patching is atomic wrt other CPUs. Just look at where the patching is done: init/main.c:start_kernel() check_bugs() alternative_instructions() apply_paravirt() We're UP and not holding any locks, disable IRQs (see text_poke_early()) and have NMIs 'disabled'. You are probably right. The initial apply_paravirt() was done before the SMP boot. Subsequent ones were at kernel module load time. I put a counter in the __native_queue_spin_unlock() and it registered 26949 unlock calls in a 16-cpu guest before it got patched out. The panic that I observed before might be due to some coding error of my own. -Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15
On 03/16/2015 09:16 AM, Peter Zijlstra wrote: Hi Waiman, As promised; here is the paravirt stuff I did during the trip to BOS last week. All the !paravirt patches are more or less the same as before (the only real change is the copyright lines in the first patch). The paravirt stuff is 'simple' and KVM only -- the Xen code was a little more convoluted and I've no real way to test that but it should be stright fwd to make work. I ran this using the virtme tool (thanks Andy) on my laptop with a 4x overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has) and it both booted and survived a hackbench run (perf bench sched messaging -g 20 -l 5000). So while the paravirt code isn't the most optimal code ever conceived it does work. Also, the paravirt patching includes replacing the call with movb $0, %arg1 for the native case, which should greatly reduce the cost of having CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware. I feel that if someone were to do a Xen patch we can go ahead and merge this stuff (finally!). These patches do not implement the paravirt spinlock debug stats currently implemented (separately) by KVM and Xen, but that should not be too hard to do on top and in the 'generic' code -- no reason to duplicate all that. Of course; once this lands people can look at improving the paravirt nonsense. Thanks for sending this out. I have no problem with the !paravirt patch. I do have some comments on the paravirt one which I will reply individually. Cheers, Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support
On 03/16/2015 09:16 AM, Peter Zijlstra wrote: Implement simple paravirt support for the qspinlock. Provide a separate (second) version of the spin_lock_slowpath for paravirt along with a special unlock path. The second slowpath is generated by adding a few pv hooks to the normal slowpath, but where those will compile away for the native case, they expand into special wait/wake code for the pv version. The actual MCS queue can use extra storage in the mcs_nodes[] array to keep track of state and therefore uses directed wakeups. The head contender has no such storage available and reverts to the per-cpu lock entry similar to the current kvm code. We can do a single enrty because any nesting will wake the vcpu and cause the lower loop to retry. Signed-off-by: Peter Zijlstra (Intel)pet...@infradead.org --- include/asm-generic/qspinlock.h |3 kernel/locking/qspinlock.c | 69 +- kernel/locking/qspinlock_paravirt.h | 177 3 files changed, 248 insertions(+), 1 deletion(-) --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -118,6 +118,9 @@ static __always_inline bool virt_queue_s } #endif +extern void __pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_queue_spin_unlock(struct qspinlock *lock); + /* * Initializier */ --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -18,6 +18,9 @@ * Authors: Waiman Longwaiman.l...@hp.com * Peter Zijlstrapet...@infradead.org */ + +#ifndef _GEN_PV_LOCK_SLOWPATH + #includelinux/smp.h #includelinux/bug.h #includelinux/cpumask.h @@ -65,13 +68,21 @@ #include mcs_spinlock.h +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define MAX_NODES 8 +#else +#define MAX_NODES 4 +#endif + /* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. * * Exactly fits one 64-byte cacheline on a 64-bit architecture. + * + * PV doubles the storage and uses the second cacheline for PV state. */ -static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]); +static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]); /* * We must be able to distinguish between no-tail and the tail at 0:0, @@ -230,6 +241,32 @@ static __always_inline void set_locked(s WRITE_ONCE(l-locked, _Q_LOCKED_VAL); } + +/* + * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for + * all the PV callbacks. + */ + +static __always_inline void __pv_init_node(struct mcs_spinlock *node) { } +static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { } +static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { } + +static __always_inline void __pv_wait_head(struct qspinlock *lock) { } + +#define pv_enabled() false + +#define pv_init_node __pv_init_node +#define pv_wait_node __pv_wait_node +#define pv_kick_node __pv_kick_node + +#define pv_wait_head __pv_wait_head + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define queue_spin_lock_slowpath native_queue_spin_lock_slowpath +#endif + +#endif /* _GEN_PV_LOCK_SLOWPATH */ + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure @@ -259,6 +296,9 @@ void queue_spin_lock_slowpath(struct qsp BUILD_BUG_ON(CONFIG_NR_CPUS= (1U _Q_TAIL_CPU_BITS)); + if (pv_enabled()) + goto queue; + if (virt_queue_spin_lock(lock)) return; @@ -335,6 +375,7 @@ void queue_spin_lock_slowpath(struct qsp node += idx; node-locked = 0; node-next = NULL; + pv_init_node(node); /* * We touched a (possibly) cold cacheline in the per-cpu queue node; @@ -360,6 +401,7 @@ void queue_spin_lock_slowpath(struct qsp prev = decode_tail(old); WRITE_ONCE(prev-next, node); + pv_wait_node(node); arch_mcs_spin_lock_contended(node-locked); } @@ -374,6 +416,7 @@ void queue_spin_lock_slowpath(struct qsp * sequentiality; this is because the set_locked() function below * does not imply a full barrier. */ + pv_wait_head(lock); while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_PENDING_MASK) cpu_relax(); @@ -406,6 +449,7 @@ void queue_spin_lock_slowpath(struct qsp cpu_relax(); arch_mcs_spin_unlock_contended(next-locked); + pv_kick_node(next); release: /* @@ -414,3 +458,26 @@ void queue_spin_lock_slowpath(struct qsp this_cpu_dec(mcs_nodes[0].count); } EXPORT_SYMBOL(queue_spin_lock_slowpath); + +/* + * Generate the paravirt code for queue_spin_unlock_slowpath(). + */ +#if !defined(_GEN_PV_LOCK_SLOWPATH) defined(CONFIG_PARAVIRT_SPINLOCKS) +#define _GEN_PV_LOCK_SLOWPATH + +#undef pv_enabled +#define pv_enabled()
Re: [Xen-devel] [PATCH 9/9] qspinlock, x86, kvm: Implement KVM support for paravirt qspinlock
On 03/16/2015 09:16 AM, Peter Zijlstra wrote: Implement the paravirt qspinlock for x86-kvm. We use the regular paravirt call patching to switch between: native_queue_spin_lock_slowpath()__pv_queue_spin_lock_slowpath() native_queue_spin_unlock() __pv_queue_spin_unlock() We use a callee saved call for the unlock function which reduces the i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions again. We further optimize the unlock path by patching the direct call with a movb $0,%arg1 if we are indeed using the native unlock code. This makes the unlock code almost as fast as the !PARAVIRT case. This significantly lowers the overhead of having CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code. Signed-off-by: Peter Zijlstra (Intel)pet...@infradead.org I do have some concern about this call site patching mechanism as the modification is not atomic. The spin_unlock() calls are in many places in the kernel. There is a possibility that a thread is calling a certain spin_unlock call site while it is being patched by another one with the alternative() function call. So far, I don't see any problem with bare metal where paravirt_patch_insns() is used to patch it to the move instruction. However, in a virtual guest enivornment where paravirt_patch_call() was used, there were situations where the system panic because of page fault on some invalid memory in the kthread. If you look at the paravirt_patch_call(), you will see: : b-opcode = 0xe8; /* call */ b-delta = delta; If another CPU reads the instruction at the call site at the right moment, it will get the modified call instruction, but not the new delta value. It will then jump to a random location. I believe that was causing the system panic that I saw. So I think it is kind of risky to use it here unless we can guarantee that call site patching is atomic wrt other CPUs. -Longman ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
[Xen-devel] [PATCH v14 01/11] qspinlock: A simple generic 4-byte queue spinlock
This patch introduces a new generic queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair as the ticket spinlock. It has about the same speed in single-thread and it can be much faster in high contention situations especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention where the average queue depth is around 1-3 will this queue spinlock be potentially a bit slower due to the higher slowpath overhead. This queue spinlock is especially suit to NUMA machines with a large number of cores as the chance of spinlock contention is much higher in those machines. The cost of contention is also higher because of slower inter-node memory traffic. Due to the fact that spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending in one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and used them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit size (including CPU number and queue node index) leaving one byte for the lock. Please note that the queue node is only needed when waiting for the lock. Once the lock is acquired, the queue node can be released to be used later. Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra pet...@infradead.org --- include/asm-generic/qspinlock.h | 132 + include/asm-generic/qspinlock_types.h | 58 + kernel/Kconfig.locks |7 + kernel/locking/Makefile |1 + kernel/locking/mcs_spinlock.h |1 + kernel/locking/qspinlock.c| 207 + 6 files changed, 406 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 kernel/locking/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 000..aafb7b4 --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,132 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long waiman.l...@hp.com + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return atomic_read(lock-val); +} + +/** + * queue_spin_value_unlocked - is the spinlock structure unlocked? + * @lock: queue spinlock structure + * Return: 1 if it is unlocked, 0 otherwise + * + * N.B. Whenever there are tasks waiting for the lock, it is considered + * locked wrt the lockref code to avoid lock stealing by the lockref + * code and change things underneath the lock. This also allows some + * optimizations to be applied without conflict with lockref. + */ +static __always_inline int queue_spin_value_unlocked(struct qspinlock lock) +{ + return !atomic_read(lock.val); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return atomic_read(lock-val) ~_Q_LOCKED_MASK; +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!atomic_read(lock-val) + (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) == 0)) + return 1; + return 0; +} + +extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + u32 val