Re: [Xen-devel] [PATCH-tip v2 2/2] x86/xen: Deprecate xen_nopvspin

2017-11-02 Thread Waiman Long
On 11/01/2017 06:01 PM, Boris Ostrovsky wrote:
> On 11/01/2017 04:58 PM, Waiman Long wrote:
>> +/* TODO: To be removed in a future kernel version */
>>  static __init int xen_parse_nopvspin(char *arg)
>>  {
>> -xen_pvspin = false;
>> +pr_warn("xen_nopvspin is deprecated, replace it with \"pvlock_type=queued\"!\n");
>> +if (!pv_spinlock_type)
>> +pv_spinlock_type = locktype_queued;
> Since we currently end up using unfair locks and because you are
> deprecating xen_nopvspin I wonder whether it would be better to set this
> to locktype_unfair so that current behavior doesn't change. (Sorry, I
> haven't responded to your earlier message before you posted this). Juergen?

I think the latest patch from Juergen in tip is to use the native
qspinlock when xen_nopvspin is specified. Right? That is why I made the
current choice. I can certainly change it to unfair if that is what you
guys want.
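
For reference, if preserving the current unfair-lock behavior is the
preferred option, the deprecated handler could instead map to
locktype_unfair, e.g. (a sketch only, reusing the names from patch 1):

        /* TODO: To be removed in a future kernel version */
        static __init int xen_parse_nopvspin(char *arg)
        {
                pr_warn("xen_nopvspin is deprecated, replace it with \"pvlock_type=unfair\"!\n");
                if (!pv_spinlock_type)
                        pv_spinlock_type = locktype_unfair;
                return 0;
        }
        early_param("xen_nopvspin", xen_parse_nopvspin);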

> I am also not sure I agree with making pv_spinlock an enum *and* a
> bitmask at the same time. I understand that it makes checks easier but I
> think not assuming a value or a pattern would be better, especially
> since none of the uses is on a critical path.
>
> (For example, !pv_spinlock_type is the same as locktype_auto, which is
> defined but never used)

OK, I will take out the enum and make explicit use of locktype_auto.
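
For example, the xen_nopvspin handler above would then do something
like (sketch of the intended change only):

        if (pv_spinlock_type == locktype_auto)
                pv_spinlock_type = locktype_queued;

and similarly for the other !pv_spinlock_type tests.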

Cheers,
Longman



[Xen-devel] [PATCH-tip v2 0/2] x86/paravirt: Enable users to choose PV lock type

2017-11-01 Thread Waiman Long
v1->v2:
 - Make pv_spinlock_type a bit mask for easier checking.
 - Add patch 2 to deprecate xen_nopvspin

v1 - https://lkml.org/lkml/2017/11/1/381

Patch 1 adds a new pvlock_type parameter for the administrators to
specify the type of lock to be used in a para-virtualized kernel.

Patch 2 deprecates Xen's xen_nopvspin parameter as it is no longer
needed.

Waiman Long (2):
  x86/paravirt: Add kernel parameter to choose paravirt lock type
  x86/xen: Deprecate xen_nopvspin

 Documentation/admin-guide/kernel-parameters.txt | 11 ---
 arch/x86/include/asm/paravirt.h |  9 ++
 arch/x86/kernel/kvm.c   |  3 ++
 arch/x86/kernel/paravirt.c  | 40 -
 arch/x86/xen/spinlock.c | 17 +--
 5 files changed, 65 insertions(+), 15 deletions(-)

-- 
1.8.3.1




[Xen-devel] [PATCH-tip v2 1/2] x86/paravirt: Add kernel parameter to choose paravirt lock type

2017-11-01 Thread Waiman Long
Currently, there are 3 different lock types that can be chosen for
the x86 architecture:

 - qspinlock
 - pvqspinlock
 - unfair lock

One of the above lock types will be chosen at boot time depending on
a number of different factors.

Ideally, the hypervisors should be able to pick the best-performing
lock type for the current VM configuration. That is not currently
the case, as the performance of each lock type is affected by many
different factors, such as the number of vCPUs in the VM, the amount
of vCPU overcommitment, the CPU type and so on.

Generally speaking, the unfair lock performs well for VMs with a small
number of vCPUs. Native qspinlock may perform better than pvqspinlock
if the vCPUs are pinned and there is no vCPU overcommitment.

This patch adds a new kernel parameter to allow administrators to
choose the paravirt spinlock type to be used. VM administrators can
experiment with the different lock types and choose the one that best
suits their needs, if they want to. Hypervisor developers can also use
it to experiment with different lock types so that they can come up
with a better algorithm for picking the best lock type.

The hypervisor paravirt spinlock code will override this new parameter
in determining if pvqspinlock should be used. The parameter, however,
will override Xen's xen_nopvspin in terms of disabling the unfair lock.
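
For example (illustrative values only), an administrator could append
one of the following to the guest kernel command line:

        pvlock_type=queued
        pvlock_type=pv
        pvlock_type=unfair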

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  7 +
 arch/x86/include/asm/paravirt.h |  9 ++
 arch/x86/kernel/kvm.c   |  3 ++
 arch/x86/kernel/paravirt.c  | 40 -
 arch/x86/xen/spinlock.c | 12 
 5 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index f7df49d..c98d9c7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3275,6 +3275,13 @@
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
 
+   pvlock_type=[X86,PV_OPS]
+   Specify the paravirt spinlock type to be used.
+   Options are:
+   queued - native queued spinlock
+   pv - paravirt queued spinlock
+   unfair - simple TATAS unfair lock
+
quiet   [KNL] Disable most log messages
 
r128=   [HW,DRM]
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 12deec7..c8f9ad9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -690,6 +690,15 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu)
 
 #endif /* SMP && PARAVIRT_SPINLOCKS */
 
+enum pv_spinlock_type {
+   locktype_auto = 0,
+   locktype_queued   = 1,
+   locktype_paravirt = 2,
+   locktype_unfair   = 4,
+};
+
+extern enum pv_spinlock_type pv_spinlock_type;
+
 #ifdef CONFIG_X86_32
 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;"
 #define PV_RESTORE_REGS "popl %edx; popl %ecx;"
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594..3faee63 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -646,6 +646,9 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
 
+   if (pv_spinlock_type & (locktype_queued|locktype_unfair))
+   return;
+
__pv_init_lock_hash();
pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 041096b..aee6756 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -115,11 +115,48 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
return 5;
 }
 
+/*
+ * The kernel argument "pvlock_type=" can be used to explicitly specify
+ * which type of spinlocks to be used. Currently, there are 3 options:
+ * 1) queued - the native queued spinlock
+ * 2) pv - the paravirt queued spinlock (if CONFIG_PARAVIRT_SPINLOCKS)
+ * 3) unfair - the simple TATAS unfair lock
+ *
+ * If this argument is not specified, the kernel will automatically choose
+ * an appropriate one depending on X86_FEATURE_HYPERVISOR and hypervisor
+ * specific settings.
+ */
+enum pv_spinlock_type __read_mostly pv_spinlock_type = locktype_auto;
+
+static int __init pvlock_setup(char *s)
+{
+   if (!s)
+   return -EINVAL;
+
+   if (!strcmp(s, "queued"))
+   pv_spinlock_type = locktype_queued;
+   else if (!strcmp(s, "pv"))
+   pv_spinlock_type = locktype_paravirt;
+   else if (!strcmp(s, "u

[Xen-devel] [PATCH-tip v2 2/2] x86/xen: Deprecate xen_nopvspin

2017-11-01 Thread Waiman Long
With the new pvlock_type kernel parameter, xen_nopvspin is no longer
needed. This patch deprecates the xen_nopvspin parameter by removing
its documentation and treating it as an alias of "pvlock_type=queued".

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  4 
 arch/x86/xen/spinlock.c | 19 +++
 2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index c98d9c7..683a817 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4596,10 +4596,6 @@
the unplug protocol
never -- do not unplug even if version check succeeds
 
-   xen_nopvspin[X86,XEN]
-   Disables the ticketlock slowpath using Xen PV
-   optimizations.
-
xen_nopv[X86]
Disables the PV optimizations forcing the HVM guest to
run as generic HVM guest with no PV drivers.
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d5f79ac..19e2e75 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -20,7 +20,6 @@
 
 static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
 static DEFINE_PER_CPU(char *, irq_name);
-static bool xen_pvspin = true;
 
 #include 
 
@@ -81,12 +80,8 @@ void xen_init_lock_cpu(int cpu)
int irq;
char *name;
 
-   if (!xen_pvspin ||
-  (pv_spinlock_type & (locktype_queued|locktype_unfair))) {
-   if ((cpu == 0) && !pv_spinlock_type)
-   static_branch_disable(&virt_spin_lock_key);
+   if (pv_spinlock_type & (locktype_queued|locktype_unfair))
return;
-   }
 
WARN(per_cpu(lock_kicker_irq, cpu) >= 0, "spinlock on CPU%d exists on IRQ%d!\n",
 cpu, per_cpu(lock_kicker_irq, cpu));
@@ -110,8 +105,7 @@ void xen_init_lock_cpu(int cpu)
 
 void xen_uninit_lock_cpu(int cpu)
 {
-   if (!xen_pvspin ||
-  (pv_spinlock_type & (locktype_queued|locktype_unfair)))
+   if (pv_spinlock_type & (locktype_queued|locktype_unfair))
return;
 
unbind_from_irqhandler(per_cpu(lock_kicker_irq, cpu), NULL);
@@ -132,8 +126,7 @@ void xen_uninit_lock_cpu(int cpu)
  */
 void __init xen_init_spinlocks(void)
 {
-   if (!xen_pvspin ||
-  (pv_spinlock_type & (locktype_queued|locktype_unfair))) {
+   if (pv_spinlock_type & (locktype_queued|locktype_unfair)) {
printk(KERN_DEBUG "xen: PV spinlocks disabled\n");
return;
}
@@ -147,10 +140,12 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen);
 }
 
+/* TODO: To be removed in a future kernel version */
 static __init int xen_parse_nopvspin(char *arg)
 {
-   xen_pvspin = false;
+   pr_warn("xen_nopvspin is deprecated, replace it with \"pvlock_type=queued\"!\n");
+   if (!pv_spinlock_type)
+   pv_spinlock_type = locktype_queued;
return 0;
 }
 early_param("xen_nopvspin", xen_parse_nopvspin);
-
-- 
1.8.3.1




Re: [Xen-devel] [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type

2017-11-01 Thread Waiman Long
On 11/01/2017 03:01 PM, Boris Ostrovsky wrote:
> On 11/01/2017 12:28 PM, Waiman Long wrote:
>> On 11/01/2017 11:51 AM, Juergen Gross wrote:
>>> On 01/11/17 16:32, Waiman Long wrote:
>>>> Currently, there are 3 different lock types that can be chosen for
>>>> the x86 architecture:
>>>>
>>>>  - qspinlock
>>>>  - pvqspinlock
>>>>  - unfair lock
>>>>
>>>> One of the above lock types will be chosen at boot time depending on
>>>> a number of different factors.
>>>>
>>>> Ideally, the hypervisors should be able to pick the best performing
>>>> lock type for the current VM configuration. That is not currently
>>>> the case as the performance of each lock type are affected by many
>>>> different factors like the number of vCPUs in the VM, the amount vCPU
>>>> overcommitment, the CPU type and so on.
>>>>
>>>> Generally speaking, unfair lock performs well for VMs with a small
>>>> number of vCPUs. Native qspinlock may perform better than pvqspinlock
>>>> if there is vCPU pinning and there is no vCPU over-commitment.
>>>>
>>>> This patch adds a new kernel parameter to allow administrator to
>>>> choose the paravirt spinlock type to be used. VM administrators can
>>>> experiment with the different lock types and choose one that can best
>>>> suit their need, if they want to. Hypervisor developers can also use
>>>> that to experiment with different lock types so that they can come
>>>> up with a better algorithm to pick the best lock type.
>>>>
>>>> The hypervisor paravirt spinlock code will override this new parameter
>>>> in determining if pvqspinlock should be used. The parameter, however,
>>>> will override Xen's xen_nopvspin in term of disabling unfair lock.
>>> Hmm, I'm not sure we need pvlock_type _and_ xen_nopvspin. What do others
>>> think?
>> I don't think we need xen_nopvspin, but I don't want to remove that
>> without agreement from the community.
> I also don't think xen_nopvspin will be needed after pvlock_type is
> introduced.
>
> -boris

Another reason that I didn't try to remove xen_nopvspin is the backward
compatibility concern. One way to handle it is to deprecate it and
treat it as an alias of pvlock_type=queued.

Cheers,
Longman



Re: [Xen-devel] [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type

2017-11-01 Thread Waiman Long
On 11/01/2017 11:51 AM, Juergen Gross wrote:
> On 01/11/17 16:32, Waiman Long wrote:
>> Currently, there are 3 different lock types that can be chosen for
>> the x86 architecture:
>>
>>  - qspinlock
>>  - pvqspinlock
>>  - unfair lock
>>
>> One of the above lock types will be chosen at boot time depending on
>> a number of different factors.
>>
>> Ideally, the hypervisors should be able to pick the best performing
>> lock type for the current VM configuration. That is not currently
>> the case as the performance of each lock type are affected by many
>> different factors like the number of vCPUs in the VM, the amount vCPU
>> overcommitment, the CPU type and so on.
>>
>> Generally speaking, unfair lock performs well for VMs with a small
>> number of vCPUs. Native qspinlock may perform better than pvqspinlock
>> if there is vCPU pinning and there is no vCPU over-commitment.
>>
>> This patch adds a new kernel parameter to allow administrator to
>> choose the paravirt spinlock type to be used. VM administrators can
>> experiment with the different lock types and choose one that can best
>> suit their need, if they want to. Hypervisor developers can also use
>> that to experiment with different lock types so that they can come
>> up with a better algorithm to pick the best lock type.
>>
>> The hypervisor paravirt spinlock code will override this new parameter
>> in determining if pvqspinlock should be used. The parameter, however,
>> will override Xen's xen_nopvspin in term of disabling unfair lock.
> Hmm, I'm not sure we need pvlock_type _and_ xen_nopvspin. What do others
> think?

I don't think we need xen_nopvspin, but I don't want to remove that
without agreement from the community.
>>  DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
>>  
>>  void __init native_pv_lock_init(void)
>>  {
>> -if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>> +if (pv_spinlock_type == locktype_unfair)
>> +return;
>> +
>> +if (!static_cpu_has(X86_FEATURE_HYPERVISOR) ||
>> +   (pv_spinlock_type != locktype_auto))
>>  static_branch_disable(&virt_spin_lock_key);
> Really? I don't think locktype_paravirt should disable the static key.

With paravirt spinlocks, it doesn't matter whether the static key is
disabled or not. Without CONFIG_PARAVIRT_SPINLOCKS, however, it does
degenerate into the native qspinlock. So you are right, I should check
for the paravirt type as well.
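
One possible shape of the corrected check (a sketch only, not
necessarily what the next revision will end up doing):

        void __init native_pv_lock_init(void)
        {
                if (pv_spinlock_type == locktype_unfair)
                        return; /* keep the key enabled for the TAS fallback */

                if (!static_cpu_has(X86_FEATURE_HYPERVISOR) ||
                    (pv_spinlock_type == locktype_queued))
                        static_branch_disable(&virt_spin_lock_key);
        }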

Cheers,
Longman




[Xen-devel] [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type

2017-11-01 Thread Waiman Long
Currently, there are 3 different lock types that can be chosen for
the x86 architecture:

 - qspinlock
 - pvqspinlock
 - unfair lock

One of the above lock types will be chosen at boot time depending on
a number of different factors.

Ideally, the hypervisors should be able to pick the best-performing
lock type for the current VM configuration. That is not currently
the case, as the performance of each lock type is affected by many
different factors, such as the number of vCPUs in the VM, the amount
of vCPU overcommitment, the CPU type and so on.

Generally speaking, the unfair lock performs well for VMs with a small
number of vCPUs. Native qspinlock may perform better than pvqspinlock
if the vCPUs are pinned and there is no vCPU overcommitment.

This patch adds a new kernel parameter to allow administrators to
choose the paravirt spinlock type to be used. VM administrators can
experiment with the different lock types and choose the one that best
suits their needs, if they want to. Hypervisor developers can also use
it to experiment with different lock types so that they can come up
with a better algorithm for picking the best lock type.

The hypervisor paravirt spinlock code will override this new parameter
in determining if pvqspinlock should be used. The parameter, however,
will override Xen's xen_nopvspin in terms of disabling the unfair lock.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  7 +
 arch/x86/include/asm/paravirt.h |  9 ++
 arch/x86/kernel/kvm.c   |  4 +++
 arch/x86/kernel/paravirt.c  | 40 -
 arch/x86/xen/spinlock.c |  6 ++--
 5 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index f7df49d..c98d9c7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3275,6 +3275,13 @@
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
 
+   pvlock_type=[X86,PV_OPS]
+   Specify the paravirt spinlock type to be used.
+   Options are:
+   queued - native queued spinlock
+   pv - paravirt queued spinlock
+   unfair - simple TATAS unfair lock
+
quiet   [KNL] Disable most log messages
 
r128=   [HW,DRM]
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 12deec7..941a046 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -690,6 +690,15 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu)
 
 #endif /* SMP && PARAVIRT_SPINLOCKS */
 
+enum pv_spinlock_type {
+   locktype_auto,
+   locktype_queued,
+   locktype_paravirt,
+   locktype_unfair,
+};
+
+extern enum pv_spinlock_type pv_spinlock_type;
+
 #ifdef CONFIG_X86_32
 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;"
 #define PV_RESTORE_REGS "popl %edx; popl %ecx;"
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594..3a5d3ec4 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -646,6 +646,10 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
 
+   if ((pv_spinlock_type == locktype_queued) ||
+   (pv_spinlock_type == locktype_unfair))
+   return;
+
__pv_init_lock_hash();
pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 041096b..ca35cd3 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -115,11 +115,48 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target,
return 5;
 }
 
+/*
+ * The kernel argument "pvlock_type=" can be used to explicitly specify
+ * which type of spinlocks to be used. Currently, there are 3 options:
+ * 1) queued - the native queued spinlock
+ * 2) pv - the paravirt queued spinlock (if CONFIG_PARAVIRT_SPINLOCKS)
+ * 3) unfair - the simple TATAS unfair lock
+ *
+ * If this argument is not specified, the kernel will automatically choose
+ * an appropriate one depending on X86_FEATURE_HYPERVISOR and hypervisor
+ * specific settings.
+ */
+enum pv_spinlock_type __read_mostly pv_spinlock_type = locktype_auto;
+
+static int __init pvlock_setup(char *s)
+{
+   if (!s)
+   return -EINVAL;
+
+   if (!strcmp(s, "queued"))
+   pv_spinlock_type = locktype_queued;
+   else if (!strcmp(s, "pv"))
+   pv_spinlock_type = locktype_paravirt;
+   else if (!strcmp(

Re: [Xen-devel] [PATCH v3 0/2] guard virt_spin_lock() with a static key

2017-09-25 Thread Waiman Long
On 09/25/2017 09:59 AM, Juergen Gross wrote:
> Ping?
>
> On 06/09/17 19:36, Juergen Gross wrote:
>> With virt_spin_lock() being guarded by a static key the bare metal case
>> can be optimized by patching the call away completely. In case a kernel
>> running as a guest, it can decide whether to use paravirtualized
>> spinlocks, the current fallback to the unfair test-and-set scheme, or
>> to mimic the bare metal behavior.
>>
>> V3:
>> - remove test for hypervisor environment from virt_spin_lock() as
>>   suggested by Waiman Long
>>
>> V2:
>> - use static key instead of making virt_spin_lock() a pvops function
>>
>> Juergen Gross (2):
>>   paravirt/locks: use new static key for controlling call of
>> virt_spin_lock()
>>   paravirt,xen: correct xen_nopvspin case
>>
>>  arch/x86/include/asm/qspinlock.h | 11 ++-
>>  arch/x86/kernel/paravirt-spinlocks.c |  6 ++
>>  arch/x86/kernel/smpboot.c|  2 ++
>>  arch/x86/xen/spinlock.c  |  2 ++
>>  kernel/locking/qspinlock.c   |  4 
>>  5 files changed, 24 insertions(+), 1 deletion(-)
>>
Acked-by: Waiman Long <long...@redhat.com>




Re: [Xen-devel] [PATCH v2 1/2] paravirt/locks: use new static key for controlling call of virt_spin_lock()

2017-09-06 Thread Waiman Long
On 09/06/2017 12:04 PM, Peter Zijlstra wrote:
> On Wed, Sep 06, 2017 at 11:49:49AM -0400, Waiman Long wrote:
>>>  #define virt_spin_lock virt_spin_lock
>>>  static inline bool virt_spin_lock(struct qspinlock *lock)
>>>  {
>>> +   if (!static_branch_likely(&virt_spin_lock_key))
>>> +   return false;
>>> if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> return false;
>>>  
> Now native has two NOPs instead of one. Can't we merge these two static
> branches?


I guess we can remove the static_cpu_has() call. It just means that any
spin_lock calls before native_pv_lock_init() will use virt_spin_lock().
That is still OK, as the init call is done before SMP starts.
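
In other words, the check could collapse to something like the
following (a rough sketch based on the code quoted above):

        #define virt_spin_lock virt_spin_lock
        static inline bool virt_spin_lock(struct qspinlock *lock)
        {
                if (!static_branch_likely(&virt_spin_lock_key))
                        return false;

                /*
                 * On hypervisors without PARAVIRT_SPINLOCKS support we fall
                 * back to a Test-and-Set spinlock, because fair locks have
                 * horrible lock 'holder' preemption issues.
                 */
                do {
                        while (atomic_read(&lock->val) != 0)
                                cpu_relax();
                } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);

                return true;
        }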

Cheers,
Longman




Re: [Xen-devel] [PATCH v2 1/2] paravirt/locks: use new static key for controlling call of virt_spin_lock()

2017-09-06 Thread Waiman Long
On 09/06/2017 11:29 AM, Juergen Gross wrote:
> There are cases where a guest tries to switch spinlocks to bare metal
> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
> has the downside of falling back to unfair test and set scheme for
> qspinlocks due to virt_spin_lock() detecting the virtualized
> environment.
>
> Add a static key controlling whether virt_spin_lock() should be
> called or not. When running on bare metal set the new key to false.
>
> Signed-off-by: Juergen Gross <jgr...@suse.com>
> ---
>  arch/x86/include/asm/qspinlock.h | 11 +++
>  arch/x86/kernel/paravirt-spinlocks.c |  6 ++
>  arch/x86/kernel/smpboot.c|  2 ++
>  kernel/locking/qspinlock.c   |  4 
>  4 files changed, 23 insertions(+)
>
> diff --git a/arch/x86/include/asm/qspinlock.h 
> b/arch/x86/include/asm/qspinlock.h
> index 48a706f641f2..fc39389f196b 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -1,6 +1,7 @@
>  #ifndef _ASM_X86_QSPINLOCK_H
>  #define _ASM_X86_QSPINLOCK_H
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -46,9 +47,15 @@ static inline void queued_spin_unlock(struct qspinlock 
> *lock)
>  #endif
>  
>  #ifdef CONFIG_PARAVIRT
> +DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key);
> +
> +void native_pv_lock_init(void) __init;
> +
>  #define virt_spin_lock virt_spin_lock
>  static inline bool virt_spin_lock(struct qspinlock *lock)
>  {
> + if (!static_branch_likely(&virt_spin_lock_key))
> + return false;
>   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>   return false;
>  
> @@ -65,6 +72,10 @@ static inline bool virt_spin_lock(struct qspinlock *lock)
>  
>   return true;
>  }
> +#else
> +static inline void native_pv_lock_init(void)
> +{
> +}
>  #endif /* CONFIG_PARAVIRT */
>  
>  #include 
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
> b/arch/x86/kernel/paravirt-spinlocks.c
> index 8f2d1c9d43a8..2fc65ddea40d 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -42,3 +42,9 @@ struct pv_lock_ops pv_lock_ops = {
>  #endif /* SMP */
>  };
>  EXPORT_SYMBOL(pv_lock_ops);
> +
> +void __init native_pv_lock_init(void)
> +{
> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> + static_branch_disable(&virt_spin_lock_key);
> +}
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 54b9e89d4d6b..21500d3ba359 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -77,6 +77,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* Number of siblings per CPU package */
>  int smp_num_siblings = 1;
> @@ -1381,6 +1382,7 @@ void __init native_smp_prepare_boot_cpu(void)
>   /* already set me in cpu_online_mask in boot_cpu_init() */
>   cpumask_set_cpu(me, cpu_callout_mask);
>   cpu_set_state_online(me);
> + native_pv_lock_init();
>  }
>  
>  void __init native_smp_cpus_done(unsigned int max_cpus)
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 294294c71ba4..838d235b87ef 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -76,6 +76,10 @@
>  #define MAX_NODES4
>  #endif
>  
> +#ifdef CONFIG_PARAVIRT
> +DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
> +#endif
> +
>  /*
>   * Per-CPU queue node structures; we can never have more than 4 nested
>   * contexts: task, softirq, hardirq, nmi.

Acked-by: Waiman Long <long...@redhat.com>




Re: [Xen-devel] [PATCH v2 2/2] paravirt, xen: correct xen_nopvspin case

2017-09-06 Thread Waiman Long
On 09/06/2017 11:29 AM, Juergen Gross wrote:
> With the boot parameter "xen_nopvspin" specified a Xen guest should not
> make use of paravirt spinlocks, but behave as if running on bare
> metal. This is not true, however, as the qspinlock code will fall back
> to a test-and-set scheme when it is detecting a hypervisor.
>
> In order to avoid this disable the virt_spin_lock_key.
>
> Signed-off-by: Juergen Gross <jgr...@suse.com>
> ---
>  arch/x86/xen/spinlock.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
> index 25a7c4302ce7..e8ab80ad7a6f 100644
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -10,6 +10,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -129,6 +130,7 @@ void __init xen_init_spinlocks(void)
>  
>   if (!xen_pvspin) {
>   printk(KERN_DEBUG "xen: PV spinlocks disabled\n");
> + static_branch_disable(&virt_spin_lock_key);
>   return;
>   }
>   printk(KERN_DEBUG "xen: PV spinlocks enabled\n");

Acked-by: Waiman Long <long...@redhat.com>




Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-06 Thread Waiman Long
On 09/06/2017 09:06 AM, Peter Zijlstra wrote:
> On Wed, Sep 06, 2017 at 08:44:09AM -0400, Waiman Long wrote:
>> On 09/06/2017 03:08 AM, Peter Zijlstra wrote:
>>> Guys, please trim email.
>>>
>>> On Tue, Sep 05, 2017 at 10:31:46AM -0400, Waiman Long wrote:
>>>> For clarification, I was actually asking if you consider just adding one
>>>> more jump label to skip it for Xen/KVM instead of making
>>>> virt_spin_lock() a pv-op.
>>> I don't understand. What performance are you worried about. Native will
>>> now do: "xor rax,rax; jnz some_cold_label" that's fairly trival code.
>> It is not native that I am talking about. I am worry about VM with
>> non-Xen/KVM hypervisor where virt_spin_lock() will actually be called.
>> Now that function will become a callee-saved function call instead of
>> being inlined into the native slowpath function.
> But only if we actually end up using the test-and-set thing, because if
> you have paravirt we end up using that.

I am talking about a scenario like a kernel with paravirt spinlocks
running in a VM managed by VMware, for example. We may not want to
penalize them when there are alternatives with less penalty.

Cheers,
Longman



Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-06 Thread Waiman Long
On 09/06/2017 03:08 AM, Peter Zijlstra wrote:
> Guys, please trim email.
>
> On Tue, Sep 05, 2017 at 10:31:46AM -0400, Waiman Long wrote:
>> For clarification, I was actually asking if you consider just adding one
>> more jump label to skip it for Xen/KVM instead of making
>> virt_spin_lock() a pv-op.
> I don't understand. What performance are you worried about. Native will
> now do: "xor rax,rax; jnz some_cold_label" that's fairly trival code.

It is not the native case that I am talking about. I am worried about
VMs with a non-Xen/KVM hypervisor where virt_spin_lock() will actually
be called. Now that function will become a callee-saved function call
instead of being inlined into the native slowpath function.

Cheers,
Longman




Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 10:24 AM, Waiman Long wrote:
> On 09/05/2017 10:18 AM, Juergen Gross wrote:
>> On 05/09/17 16:10, Waiman Long wrote:
>>> On 09/05/2017 09:24 AM, Juergen Gross wrote:
>>>> There are cases where a guest tries to switch spinlocks to bare metal
>>>> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
>>>> has the downside of falling back to unfair test and set scheme for
>>>> qspinlocks due to virt_spin_lock() detecting the virtualized
>>>> environment.
>>>>
>>>> Make virt_spin_lock() a paravirt operation in order to enable users
>>>> to select an explicit behavior like bare metal.
>>>>
>>>> Signed-off-by: Juergen Gross <jgr...@suse.com>
>>>> ---
>>>>  arch/x86/include/asm/paravirt.h   |  5 
>>>>  arch/x86/include/asm/paravirt_types.h |  1 +
>>>>  arch/x86/include/asm/qspinlock.h  | 48 
>>>> ---
>>>>  arch/x86/kernel/paravirt-spinlocks.c  | 14 ++
>>>>  arch/x86/kernel/smpboot.c |  2 ++
>>>>  5 files changed, 55 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/paravirt.h 
>>>> b/arch/x86/include/asm/paravirt.h
>>>> index c25dd22f7c70..d9e954fb37df 100644
>>>> --- a/arch/x86/include/asm/paravirt.h
>>>> +++ b/arch/x86/include/asm/paravirt.h
>>>> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long 
>>>> cpu)
>>>>return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
>>>>  }
>>>>  
>>>> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock)
>>>> +{
>>>> +  return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock);
>>>> +}
>>>> +
>>>>  #endif /* SMP && PARAVIRT_SPINLOCKS */
>>>>  
>>>>  #ifdef CONFIG_X86_32
>>>> diff --git a/arch/x86/include/asm/paravirt_types.h 
>>>> b/arch/x86/include/asm/paravirt_types.h
>>>> index 19efefc0e27e..928f5e7953a7 100644
>>>> --- a/arch/x86/include/asm/paravirt_types.h
>>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>>> @@ -319,6 +319,7 @@ struct pv_lock_ops {
>>>>void (*kick)(int cpu);
>>>>  
>>>>struct paravirt_callee_save vcpu_is_preempted;
>>>> +  struct paravirt_callee_save virt_spin_lock;
>>>>  } __no_randomize_layout;
>>>>  
>>>>  /* This contains all the paravirt structures: we get a convenient
>>>> diff --git a/arch/x86/include/asm/qspinlock.h 
>>>> b/arch/x86/include/asm/qspinlock.h
>>>> index 48a706f641f2..fbd98896385c 100644
>>>> --- a/arch/x86/include/asm/qspinlock.h
>>>> +++ b/arch/x86/include/asm/qspinlock.h
>>>> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct 
>>>> qspinlock *lock)
>>>>smp_store_release((u8 *)lock, 0);
>>>>  }
>>>>  
>>>> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
>>>> +{
>>>> +  if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>>> +  return false;
>>>> +
>>>> +  /*
>>>> +   * On hypervisors without PARAVIRT_SPINLOCKS support we fall
>>>> +   * back to a Test-and-Set spinlock, because fair locks have
>>>> +   * horrible lock 'holder' preemption issues.
>>>> +   */
>>>> +
>>>> +  do {
>>>> +  while (atomic_read(&lock->val) != 0)
>>>> +  cpu_relax();
>>>> +  } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
>>>> +
>>>> +  return true;
>>>> +}
>>>> +
>>>>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 
>>>> val);
>>>>  extern void __pv_init_lock_hash(void);
>>>> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu)
>>>>  {
>>>>return pv_vcpu_is_preempted(cpu);
>>>>  }
>>>> +
>>>> +void native_pv_lock_init(void) __init;
>>>>  #else
>>>>  static inline void queued_spin_unlock(struct qspinlock *lock)
>>>>  {
>>>>native_queued_spin_unlock(lock);
>>>>  }
>>>> +
>>>> +static inline void native_pv_lock_init(void)
>>>> +{
>>>> +}
>>>>  #endif
>>>>  
>>>>  #ifdef CONFIG_PARAVIRT
>>>>  #define virt_spin_lock virt_spin_lock
>>>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>>  static inline bool virt_spin_lock(struct qspinlock *lock)
>>>>  {
>>>> -  if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>>> -  return false;
>>> Have you consider just add one more jump label here to skip
>>> virt_spin_lock when KVM or Xen want to do so?
>> Why? Did you look at patch 4? This is the way to do it...
> I asked this because of my performance concern as stated later in the email.

For clarification, I was actually asking whether you would consider
just adding one more jump label to skip it for Xen/KVM instead of
making virt_spin_lock() a pv-op.

Cheers,
Longman



Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 10:18 AM, Juergen Gross wrote:
> On 05/09/17 16:10, Waiman Long wrote:
>> On 09/05/2017 09:24 AM, Juergen Gross wrote:
>>> There are cases where a guest tries to switch spinlocks to bare metal
>>> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
>>> has the downside of falling back to unfair test and set scheme for
>>> qspinlocks due to virt_spin_lock() detecting the virtualized
>>> environment.
>>>
>>> Make virt_spin_lock() a paravirt operation in order to enable users
>>> to select an explicit behavior like bare metal.
>>>
>>> Signed-off-by: Juergen Gross <jgr...@suse.com>
>>> ---
>>>  arch/x86/include/asm/paravirt.h   |  5 
>>>  arch/x86/include/asm/paravirt_types.h |  1 +
>>>  arch/x86/include/asm/qspinlock.h  | 48 
>>> ---
>>>  arch/x86/kernel/paravirt-spinlocks.c  | 14 ++
>>>  arch/x86/kernel/smpboot.c |  2 ++
>>>  5 files changed, 55 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/paravirt.h 
>>> b/arch/x86/include/asm/paravirt.h
>>> index c25dd22f7c70..d9e954fb37df 100644
>>> --- a/arch/x86/include/asm/paravirt.h
>>> +++ b/arch/x86/include/asm/paravirt.h
>>> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long 
>>> cpu)
>>> return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
>>>  }
>>>  
>>> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock)
>>> +{
>>> +   return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock);
>>> +}
>>> +
>>>  #endif /* SMP && PARAVIRT_SPINLOCKS */
>>>  
>>>  #ifdef CONFIG_X86_32
>>> diff --git a/arch/x86/include/asm/paravirt_types.h 
>>> b/arch/x86/include/asm/paravirt_types.h
>>> index 19efefc0e27e..928f5e7953a7 100644
>>> --- a/arch/x86/include/asm/paravirt_types.h
>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>> @@ -319,6 +319,7 @@ struct pv_lock_ops {
>>> void (*kick)(int cpu);
>>>  
>>> struct paravirt_callee_save vcpu_is_preempted;
>>> +   struct paravirt_callee_save virt_spin_lock;
>>>  } __no_randomize_layout;
>>>  
>>>  /* This contains all the paravirt structures: we get a convenient
>>> diff --git a/arch/x86/include/asm/qspinlock.h 
>>> b/arch/x86/include/asm/qspinlock.h
>>> index 48a706f641f2..fbd98896385c 100644
>>> --- a/arch/x86/include/asm/qspinlock.h
>>> +++ b/arch/x86/include/asm/qspinlock.h
>>> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct 
>>> qspinlock *lock)
>>> smp_store_release((u8 *)lock, 0);
>>>  }
>>>  
>>> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
>>> +{
>>> +   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> +   return false;
>>> +
>>> +   /*
>>> +* On hypervisors without PARAVIRT_SPINLOCKS support we fall
>>> +* back to a Test-and-Set spinlock, because fair locks have
>>> +* horrible lock 'holder' preemption issues.
>>> +*/
>>> +
>>> +   do {
>>> +   while (atomic_read(&lock->val) != 0)
>>> +   cpu_relax();
>>> +   } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
>>> +
>>> +   return true;
>>> +}
>>> +
>>>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 
>>> val);
>>>  extern void __pv_init_lock_hash(void);
>>> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu)
>>>  {
>>> return pv_vcpu_is_preempted(cpu);
>>>  }
>>> +
>>> +void native_pv_lock_init(void) __init;
>>>  #else
>>>  static inline void queued_spin_unlock(struct qspinlock *lock)
>>>  {
>>> native_queued_spin_unlock(lock);
>>>  }
>>> +
>>> +static inline void native_pv_lock_init(void)
>>> +{
>>> +}
>>>  #endif
>>>  
>>>  #ifdef CONFIG_PARAVIRT
>>>  #define virt_spin_lock virt_spin_lock
>>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>  static inline bool virt_spin_lock(struct qspinlock *lock)
>>>  {
>>> -   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> -   return false;
>> Have you consider just add one more j

Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 10:08 AM, Peter Zijlstra wrote:
> On Tue, Sep 05, 2017 at 10:02:57AM -0400, Waiman Long wrote:
>> On 09/05/2017 09:24 AM, Juergen Gross wrote:
>>> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
>>> +{
>>> +   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> +   return false;
>>> +
>> I think you can take the above if statement out as you has done test in
>> native_pv_lock_init(). So the test will also be false here.
> That does mean we'll run a test-and-set spinlock until paravirt patching
> happens though. I prefer to not do that.
>
> One important point.. we must not be holding any locks when we switch
> over between the two locks. Back then I spend some time making sure that
> didn't happen with the X86 feature flag muck.

AFAICT, native_pv_lock_init() is called before SMP init. So it shouldn't
matter.
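
Roughly, the ordering I have in mind is (a simplified sketch of the
x86 boot flow, from my reading of the code):

        start_kernel()
          smp_prepare_boot_cpu()
            native_smp_prepare_boot_cpu()
              native_pv_lock_init()    /* static key settled here, still single CPU */
          ...
          rest_init() -> kernel_init() -> smp_init()   /* secondary CPUs come up later */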

Cheers,
Longman




Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 09:24 AM, Juergen Gross wrote:
> There are cases where a guest tries to switch spinlocks to bare metal
> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
> has the downside of falling back to unfair test and set scheme for
> qspinlocks due to virt_spin_lock() detecting the virtualized
> environment.
>
> Make virt_spin_lock() a paravirt operation in order to enable users
> to select an explicit behavior like bare metal.
>
> Signed-off-by: Juergen Gross 
> ---
>  arch/x86/include/asm/paravirt.h   |  5 
>  arch/x86/include/asm/paravirt_types.h |  1 +
>  arch/x86/include/asm/qspinlock.h  | 48 
> ---
>  arch/x86/kernel/paravirt-spinlocks.c  | 14 ++
>  arch/x86/kernel/smpboot.c |  2 ++
>  5 files changed, 55 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index c25dd22f7c70..d9e954fb37df 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long 
> cpu)
>   return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
>  }
>  
> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock)
> +{
> + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock);
> +}
> +
>  #endif /* SMP && PARAVIRT_SPINLOCKS */
>  
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/paravirt_types.h 
> b/arch/x86/include/asm/paravirt_types.h
> index 19efefc0e27e..928f5e7953a7 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -319,6 +319,7 @@ struct pv_lock_ops {
>   void (*kick)(int cpu);
>  
>   struct paravirt_callee_save vcpu_is_preempted;
> + struct paravirt_callee_save virt_spin_lock;
>  } __no_randomize_layout;
>  
>  /* This contains all the paravirt structures: we get a convenient
> diff --git a/arch/x86/include/asm/qspinlock.h 
> b/arch/x86/include/asm/qspinlock.h
> index 48a706f641f2..fbd98896385c 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct 
> qspinlock *lock)
>   smp_store_release((u8 *)lock, 0);
>  }
>  
> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
> +{
> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> + return false;
> +
> + /*
> +  * On hypervisors without PARAVIRT_SPINLOCKS support we fall
> +  * back to a Test-and-Set spinlock, because fair locks have
> +  * horrible lock 'holder' preemption issues.
> +  */
> +
> + do {
> + while (atomic_read(&lock->val) != 0)
> + cpu_relax();
> + } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
> +
> + return true;
> +}
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 
> val);
>  extern void __pv_init_lock_hash(void);
> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu)
>  {
>   return pv_vcpu_is_preempted(cpu);
>  }
> +
> +void native_pv_lock_init(void) __init;
>  #else
>  static inline void queued_spin_unlock(struct qspinlock *lock)
>  {
>   native_queued_spin_unlock(lock);
>  }
> +
> +static inline void native_pv_lock_init(void)
> +{
> +}
>  #endif
>  
>  #ifdef CONFIG_PARAVIRT
>  #define virt_spin_lock virt_spin_lock
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>  static inline bool virt_spin_lock(struct qspinlock *lock)
>  {
> - if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> - return false;

Have you considered just adding one more jump label here to skip
virt_spin_lock() when KVM or Xen wants to do so?

> -
> - /*
> -  * On hypervisors without PARAVIRT_SPINLOCKS support we fall
> -  * back to a Test-and-Set spinlock, because fair locks have
> -  * horrible lock 'holder' preemption issues.
> -  */
> -
> - do {
> - while (atomic_read(&lock->val) != 0)
> - cpu_relax();
> - } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
> -
> - return true;
> + return pv_virt_spin_lock(lock);
> +}
> +#else
> +static inline bool virt_spin_lock(struct qspinlock *lock)
> +{
> + return native_virt_spin_lock(lock);
>  }
> +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
>  #endif /* CONFIG_PARAVIRT */
>  
>  #include 
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
> b/arch/x86/kernel/paravirt-spinlocks.c
> index 26e4bd92f309..1be187ef8a38 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -20,6 +20,12 @@ bool pv_is_native_spin_unlock(void)
>   __raw_callee_save___native_queued_spin_unlock;
>  }
>  
> +__visible bool __native_virt_spin_lock(struct qspinlock *lock)
> +{
> + return native_virt_spin_lock(lock);
> +}
> +PV_CALLEE_SAVE_REGS_THUNK(__native_virt_spin_lock);

I have some 

Re: [Xen-devel] [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 09:24 AM, Juergen Gross wrote:
> There are cases where a guest tries to switch spinlocks to bare metal
> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
> has the downside of falling back to unfair test and set scheme for
> qspinlocks due to virt_spin_lock() detecting the virtualized
> environment.
>
> Make virt_spin_lock() a paravirt operation in order to enable users
> to select an explicit behavior like bare metal.
>
> Signed-off-by: Juergen Gross 
> ---
>  arch/x86/include/asm/paravirt.h   |  5 
>  arch/x86/include/asm/paravirt_types.h |  1 +
>  arch/x86/include/asm/qspinlock.h  | 48 
> ---
>  arch/x86/kernel/paravirt-spinlocks.c  | 14 ++
>  arch/x86/kernel/smpboot.c |  2 ++
>  5 files changed, 55 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index c25dd22f7c70..d9e954fb37df 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long 
> cpu)
>   return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
>  }
>  
> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock)
> +{
> + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock);
> +}
> +
>  #endif /* SMP && PARAVIRT_SPINLOCKS */
>  
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/paravirt_types.h 
> b/arch/x86/include/asm/paravirt_types.h
> index 19efefc0e27e..928f5e7953a7 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -319,6 +319,7 @@ struct pv_lock_ops {
>   void (*kick)(int cpu);
>  
>   struct paravirt_callee_save vcpu_is_preempted;
> + struct paravirt_callee_save virt_spin_lock;
>  } __no_randomize_layout;
>  
>  /* This contains all the paravirt structures: we get a convenient
> diff --git a/arch/x86/include/asm/qspinlock.h 
> b/arch/x86/include/asm/qspinlock.h
> index 48a706f641f2..fbd98896385c 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct 
> qspinlock *lock)
>   smp_store_release((u8 *)lock, 0);
>  }
>  
> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
> +{
> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> + return false;
> +

I think you can take the above if statement out, as you have already
done the test in native_pv_lock_init(). So the test will also be false
here.

As this patch series is x86 specific, you should probably add "x86/" in
front of "paravirt" in the patch titles.

Cheers,
Longman




[Xen-devel] [PATCH v5 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

2017-02-20 Thread Waiman Long
It was found that, when running a fio sequential write test with an XFS
ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU
times as reported by perf were as follows:

 69.75%  0.59%  fio  [k] down_write
 69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
 67.12%  1.12%  fio  [k] rwsem_down_write_failed
 63.48% 52.77%  fio  [k] osq_lock
  9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
  3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64, primarily due to at least one more cacheline of
data access from the saving and restoring of registers (8 of them)
to and from the stack, as well as one more level of function call.

To reduce this performance overhead, an optimized assembly version
of the __raw_callee_save___kvm_vcpu_is_preempt() function is
provided for x86-64.

With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
system with 16 parallel jobs (8 on each socket), the aggregate
bandwidth of the fio test on an XFS ramdisk was as follows:

   I/O Type      w/o patch     with patch
   --------      ---------     ----------
   random read   8141.2 MB/s   8497.1 MB/s
   seq read      8229.4 MB/s   8304.2 MB/s
   random write  1675.5 MB/s   1701.5 MB/s
   seq write     1681.3 MB/s   1699.9 MB/s

There are some increases in the aggregated bandwidth because of
the patch.

The perf data now became:

 70.78%  0.58%  fio  [k] down_write
 70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.70%  1.17%  fio  [k] rwsem_down_write_failed
 59.91% 55.42%  fio  [k] osq_lock
 10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

The assembly code was verified by using a test kernel module to compare
the output of the C __kvm_vcpu_is_preempted() with that of the assembly
__raw_callee_save___kvm_vcpu_is_preempt() and confirm that they matched.
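
For reference, a minimal sketch of such a comparison module (not the
actual test module used; it assumes both symbols are reachable, e.g.
by building the test into the kernel tree, since neither is exported
to loadable modules) could look like this:

        #include <linux/module.h>
        #include <linux/kernel.h>
        #include <linux/cpumask.h>

        extern bool __kvm_vcpu_is_preempted(long cpu);
        extern bool __raw_callee_save___kvm_vcpu_is_preempted(long cpu);

        static int __init pv_preempt_test_init(void)
        {
                int cpu;

                /* Compare the C helper against the assembly thunk on each CPU */
                for_each_online_cpu(cpu) {
                        bool c_ret   = __kvm_vcpu_is_preempted(cpu);
                        bool asm_ret = __raw_callee_save___kvm_vcpu_is_preempted(cpu);

                        /*
                         * The preempted state can change between the two reads,
                         * so a rare mismatch is only a hint, not proof of a bug.
                         */
                        WARN(c_ret != asm_ret,
                             "vcpu_is_preempted mismatch on cpu %d: C=%d asm=%d\n",
                             cpu, c_ret, asm_ret);
                }
                return 0;
        }

        static void __exit pv_preempt_test_exit(void)
        {
        }

        module_init(pv_preempt_test_init);
        module_exit(pv_preempt_test_exit);
        MODULE_LICENSE("GPL");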

Suggested-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/kernel/asm-offsets_64.c |  9 +
 arch/x86/kernel/kvm.c| 24 
 2 files changed, 33 insertions(+)

diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 210927e..99332f5 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -13,6 +13,10 @@
 #include 
 };
 
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#include 
+#endif
+
 int main(void)
 {
 #ifdef CONFIG_PARAVIRT
@@ -22,6 +26,11 @@ int main(void)
BLANK();
 #endif
 
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+   OFFSET(KVM_STEAL_TIME_preempted, kvm_steal_time, preempted);
+   BLANK();
+#endif
+
 #define ENTRY(entry) OFFSET(pt_regs_ ## entry, pt_regs, entry)
ENTRY(bx);
ENTRY(cx);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 85ed343..14f65a5 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
+#ifdef CONFIG_X86_32
 __visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
@@ -597,6 +598,29 @@ __visible bool __kvm_vcpu_is_preempted(long cpu)
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
+#else
+
+#include 
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
+ * restoring to/from the stack.
+ */
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+"movq  __per_cpu_offset(,%rdi,8), %rax;"
+"cmpb  $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
+"setne %al;"
+"ret;"
+".popsection");
+
+#endif
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
-- 
1.8.3.1




[Xen-devel] [PATCH v5 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead

2017-02-20 Thread Waiman Long
 v4->v5:
  - As suggested by PeterZ, use the asm-offsets header file generation
mechanism to get the offset of the preempted field in
kvm_steal_time instead of hardcoding it.

 v3->v4:
  - Fix x86-32 build error.

 v2->v3:
  - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted()
in assembly as suggested by PeterZ.
  - Add a new patch to change vcpu_is_preempted() argument type to long
to ease the writing of the assembly code.

 v1->v2:
  - Rerun the fio test on a different system on both bare-metal and a
KVM guest. Both sockets were utilized in this test.
  - The commit log was updated with new performance numbers, but the
patch wasn't changed.
  - Drop patch 2.

As it was found that the overhead of the callee-save
vcpu_is_preempted() can have some impact on system performance on a VM
guest, especially an x86-64 guest, this patch set intends to reduce
this performance overhead by replacing the C __kvm_vcpu_is_preempted()
function with an optimized version of
__raw_callee_save___kvm_vcpu_is_preempted() written in assembly.

Waiman Long (2):
  x86/paravirt: Change vcp_is_preempted() arg type to long
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

 arch/x86/include/asm/paravirt.h  |  2 +-
 arch/x86/include/asm/qspinlock.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c |  9 +
 arch/x86/kernel/kvm.c| 26 +-
 arch/x86/kernel/paravirt-spinlocks.c |  2 +-
 5 files changed, 37 insertions(+), 4 deletions(-)

-- 
1.8.3.1




[Xen-devel] [PATCH v5 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long

2017-02-20 Thread Waiman Long
The cpu argument in the function prototype of vcpu_is_preempted()
is changed from int to long. That makes it easier to provide a better
optimized assembly version of that function.

For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int); the
downcast from long to int is not a problem, as the vCPU number won't
exceed 32 bits.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/include/asm/paravirt.h  | 2 +-
 arch/x86/include/asm/qspinlock.h | 2 +-
 arch/x86/kernel/kvm.c| 2 +-
 arch/x86/kernel/paravirt-spinlocks.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 1eea6ca..f75fbfe 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -673,7 +673,7 @@ static __always_inline void pv_kick(int cpu)
PVOP_VCALL1(pv_lock_ops.kick, cpu);
 }
 
-static __always_inline bool pv_vcpu_is_preempted(int cpu)
+static __always_inline bool pv_vcpu_is_preempted(long cpu)
 {
return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index c343ab5..48a706f 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
 }
 
 #define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
+static inline bool vcpu_is_preempted(long cpu)
 {
return pv_vcpu_is_preempted(cpu);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..85ed343 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
-__visible bool __kvm_vcpu_is_preempted(int cpu)
+__visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..8f2d1c9 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
 }
 
-__visible bool __native_vcpu_is_preempted(int cpu)
+__visible bool __native_vcpu_is_preempted(long cpu)
 {
return false;
 }
-- 
1.8.3.1




Re: [Xen-devel] [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long

2017-02-16 Thread Waiman Long
On 02/16/2017 11:09 AM, Peter Zijlstra wrote:
> On Wed, Feb 15, 2017 at 04:37:49PM -0500, Waiman Long wrote:
>> The cpu argument in the function prototype of vcpu_is_preempted()
>> is changed from int to long. That makes it easier to provide a better
>> optimized assembly version of that function.
>>
>> For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the
>> downcast from long to int is not a problem as vCPU number won't exceed
>> 32 bits.
>>
> Note that because of the cast in PVOP_CALL_ARG1() this patch is
> pointless.
>
> Then again, it doesn't seem to affect code generation, so why not. Takes
> away the reliance on that weird cast.

I added this patch because I am a bit uneasy about clearing the upper
32 bits of %rdi and assuming that the compiler won't have a previous
use of those bits. It gives me peace of mind.
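
To illustrate the concern (the movq line below is from patch 2 of this
series), the thunk indexes the per-cpu offset array with the full
64-bit register:

        movq  __per_cpu_offset(,%rdi,8), %rax

With an 'int cpu' argument the upper 32 bits of %rdi are not guaranteed
to be zero by the ABI, so declaring the argument 'long' makes the whole
register well defined for that scaled index.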

Cheers,
Longman




Re: [Xen-devel] [PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

2017-02-16 Thread Waiman Long
On 02/16/2017 11:48 AM, Peter Zijlstra wrote:
> On Wed, Feb 15, 2017 at 04:37:50PM -0500, Waiman Long wrote:
>> +/*
>> + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
>> + * restoring to/from the stack. It is assumed that the preempted value
>> + * is at an offset of 16 from the beginning of the kvm_steal_time structure
>> + * which is verified by the BUILD_BUG_ON() macro below.
>> + */
>> +#define PREEMPTED_OFFSET 16
> As per Andrew's suggestion, the 'right' way is something like so.

Thanks for the tip. I was not aware of the asm-offsets stuff. I will
update the patch to use it.

Cheers,
Longman




[Xen-devel] [PATCH v4 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead

2017-02-15 Thread Waiman Long
 v3->v4:
  - Fix x86-32 build error.

 v2->v3:
  - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted()
in assembly as suggested by PeterZ.
  - Add a new patch to change vcpu_is_preempted() argument type to long
to ease the writing of the assembly code.

 v1->v2:
  - Rerun the fio test on a different system on both bare-metal and a
KVM guest. Both sockets were utilized in this test.
  - The commit log was updated with new performance numbers, but the
patch wasn't changed.
  - Drop patch 2.

As it was found that the overhead of callee-save vcpu_is_preempted()
can have some impact on system performance on a VM guest, especially
an x86-64 guest, this patch set intends to reduce this performance
overhead by replacing the C __kvm_vcpu_is_preempted() function by
an optimized version of __raw_callee_save___kvm_vcpu_is_preempted()
written in assembly.

Waiman Long (2):
  x86/paravirt: Change vcp_is_preempted() arg type to long
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

 arch/x86/include/asm/paravirt.h  |  2 +-
 arch/x86/include/asm/qspinlock.h |  2 +-
 arch/x86/kernel/kvm.c| 32 +++-
 arch/x86/kernel/paravirt-spinlocks.c |  2 +-
 4 files changed, 34 insertions(+), 4 deletions(-)

-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long

2017-02-15 Thread Waiman Long
The cpu argument in the function prototype of vcpu_is_preempted()
is changed from int to long. That makes it easier to provide a better
optimized assembly version of that function.

For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the
downcast from long to int is not a problem as vCPU number won't exceed
32 bits.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/include/asm/paravirt.h  | 2 +-
 arch/x86/include/asm/qspinlock.h | 2 +-
 arch/x86/kernel/kvm.c| 2 +-
 arch/x86/kernel/paravirt-spinlocks.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 1eea6ca..f75fbfe 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -673,7 +673,7 @@ static __always_inline void pv_kick(int cpu)
PVOP_VCALL1(pv_lock_ops.kick, cpu);
 }
 
-static __always_inline bool pv_vcpu_is_preempted(int cpu)
+static __always_inline bool pv_vcpu_is_preempted(long cpu)
 {
return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index c343ab5..48a706f 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
 }
 
 #define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
+static inline bool vcpu_is_preempted(long cpu)
 {
return pv_vcpu_is_preempted(cpu);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..85ed343 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
-__visible bool __kvm_vcpu_is_preempted(int cpu)
+__visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..8f2d1c9 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
 }
 
-__visible bool __native_vcpu_is_preempted(int cpu)
+__visible bool __native_vcpu_is_preempted(long cpu)
 {
return false;
 }
-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

2017-02-15 Thread Waiman Long
It was found when running fio sequential write test with a XFS ramdisk
on a KVM guest running on a 2-socket x86-64 system, the %CPU times
as reported by perf were as follows:

 69.75%  0.59%  fio  [k] down_write
 69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
 67.12%  1.12%  fio  [k] rwsem_down_write_failed
 63.48% 52.77%  fio  [k] osq_lock
  9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
  3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64 primarily due to at least one more cacheline of
data access from the saving and restoring of registers (8 of them)
to and from stack as well as one more level of function call.

To reduce this performance overhead, an optimized assembly version
of the __raw_callee_save___kvm_vcpu_is_preempt() function is
provided for x86-64.

With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
system with 16 parallel jobs (8 on each socket), the aggregate
bandwidth of the fio test on an XFS ramdisk were as follows:

   I/O Type      w/o patch    with patch
   -----------   -----------  -----------
   random read   8141.2 MB/s  8497.1 MB/s
   seq read  8229.4 MB/s  8304.2 MB/s
   random write  1675.5 MB/s  1701.5 MB/s
   seq write 1681.3 MB/s  1699.9 MB/s

There are some increases in the aggregated bandwidth because of
the patch.

The perf data now became:

 70.78%  0.58%  fio  [k] down_write
 70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.70%  1.17%  fio  [k] rwsem_down_write_failed
 59.91% 55.42%  fio  [k] osq_lock
 10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

The assembly code was verified by using a test kernel module to
compare the output of C __kvm_vcpu_is_preempted() and that of assembly
__raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched.

Suggested-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/kernel/kvm.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 85ed343..e423435 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
+#ifdef CONFIG_X86_32
 __visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
@@ -597,11 +598,40 @@ __visible bool __kvm_vcpu_is_preempted(long cpu)
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
+#else
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
+ * restoring to/from the stack. It is assumed that the preempted value
+ * is at an offset of 16 from the beginning of the kvm_steal_time structure
+ * which is verified by the BUILD_BUG_ON() macro below.
+ */
+#define PREEMPTED_OFFSET   16
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+"movq  __per_cpu_offset(,%rdi,8), %rax;"
+"cmpb  $0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);"
+"setne %al;"
+"ret;"
+".popsection");
+
+#endif
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
 void __init kvm_spinlock_init(void)
 {
+#ifdef CONFIG_X86_64
+   BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted)
+   != PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1));
+#endif
+
if (!kvm_para_available())
return;
/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v3 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead

2017-02-15 Thread Waiman Long
 v2->v3:
  - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted()
in assembly as suggested by PeterZ.
  - Add a new patch to change vcpu_is_preempted() argument type to long
to ease the writing of the assembly code.

 v1->v2:
  - Rerun the fio test on a different system on both bare-metal and a
KVM guest. Both sockets were utilized in this test.
  - The commit log was updated with new performance numbers, but the
patch wasn't changed.
  - Drop patch 2.

As it was found that the overhead of callee-save vcpu_is_preempted()
can have some impact on system performance on a VM guest, especially
an x86-64 guest, this patch set intends to reduce this performance
overhead by replacing the C __kvm_vcpu_is_preempted() function by
an optimized version of __raw_callee_save___kvm_vcpu_is_preempted()
written in assembly.

Waiman Long (2):
  x86/paravirt: Change vcp_is_preempted() arg type to long
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

 arch/x86/include/asm/paravirt.h  |  2 +-
 arch/x86/include/asm/qspinlock.h |  2 +-
 arch/x86/kernel/kvm.c| 30 +-
 arch/x86/kernel/paravirt-spinlocks.c |  2 +-
 4 files changed, 32 insertions(+), 4 deletions(-)

-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v3 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

2017-02-15 Thread Waiman Long
It was found when running fio sequential write test with a XFS ramdisk
on a KVM guest running on a 2-socket x86-64 system, the %CPU times
as reported by perf were as follows:

 69.75%  0.59%  fio  [k] down_write
 69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
 67.12%  1.12%  fio  [k] rwsem_down_write_failed
 63.48% 52.77%  fio  [k] osq_lock
  9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
  3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64 primarily due to at least one more cacheline of
data access from the saving and restoring of registers (8 of them)
to and from stack as well as one more level of function call.

To reduce this performance overhead, an optimized assembly version
of the __raw_callee_save___kvm_vcpu_is_preempt() function is
provided for x86-64.

With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
system with 16 parallel jobs (8 on each socket), the aggregate
bandwidth of the fio test on an XFS ramdisk were as follows:

   I/O Type      w/o patch    with patch
   -----------   -----------  -----------
   random read   8141.2 MB/s  8497.1 MB/s
   seq read  8229.4 MB/s  8304.2 MB/s
   random write  1675.5 MB/s  1701.5 MB/s
   seq write 1681.3 MB/s  1699.9 MB/s

There are some increases in the aggregated bandwidth because of
the patch.

The perf data now became:

 70.78%  0.58%  fio  [k] down_write
 70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.70%  1.17%  fio  [k] rwsem_down_write_failed
 59.91% 55.42%  fio  [k] osq_lock
 10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

The assembly code was verified by using a test kernel module to
compare the output of C __kvm_vcpu_is_preempted() and that of assembly
__raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched.

Suggested-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/kernel/kvm.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 85ed343..83e22c1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
+#ifdef CONFIG_X86_32
 __visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
@@ -597,11 +598,38 @@ __visible bool __kvm_vcpu_is_preempted(long cpu)
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
+#else
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
+ * restoring to/from the stack. It is assumed that the preempted value
+ * is at an offset of 16 from the beginning of the kvm_steal_time structure
+ * which is verified by the BUILD_BUG_ON() macro below.
+ */
+#define PREEMPTED_OFFSET   16
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+"movq  __per_cpu_offset(,%rdi,8), %rax;"
+"cmpb  $0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);"
+"setne %al;"
+"ret;"
+".popsection");
+
+#endif
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
 void __init kvm_spinlock_init(void)
 {
+   BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted)
+   != PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1));
+
if (!kvm_para_available())
return;
/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v3 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long

2017-02-15 Thread Waiman Long
The cpu argument in the function prototype of vcpu_is_preempted()
is changed from int to long. That makes it easier to provide a better
optimized assembly version of that function.

For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the
downcast from long to int is not a problem as vCPU number won't exceed
32 bits.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/include/asm/paravirt.h  | 2 +-
 arch/x86/include/asm/qspinlock.h | 2 +-
 arch/x86/kernel/kvm.c| 2 +-
 arch/x86/kernel/paravirt-spinlocks.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 864f57b..2a46f73 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -674,7 +674,7 @@ static __always_inline void pv_kick(int cpu)
PVOP_VCALL1(pv_lock_ops.kick, cpu);
 }
 
-static __always_inline bool pv_vcpu_is_preempted(int cpu)
+static __always_inline bool pv_vcpu_is_preempted(long cpu)
 {
return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index c343ab5..48a706f 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
 }
 
 #define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
+static inline bool vcpu_is_preempted(long cpu)
 {
return pv_vcpu_is_preempted(cpu);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..85ed343 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
-__visible bool __kvm_vcpu_is_preempted(int cpu)
+__visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..8f2d1c9 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
 }
 
-__visible bool __native_vcpu_is_preempted(int cpu)
+__visible bool __native_vcpu_is_preempted(long cpu)
 {
return false;
 }
-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-14 Thread Waiman Long
On 02/14/2017 04:39 AM, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 05:34:01PM -0500, Waiman Long wrote:
>> It is the address of &steal_time that will exceed the 32-bit limit.
> That seems extremely unlikely. That would mean we have more than 4G
> worth of per-cpu variables declared in the kernel.

I have some doubt about whether the compiler is able to properly use
RIP-relative addressing for this. Anyway, it seems that constraints
aren't allowed for asm() outside of a function context, at least for
the compiler that I am using (4.8.5). So it is a moot point.

Cheers,
Longman
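
The limitation being referred to fits in a few lines: GCC only accepts basic
asm (no operands, no constraints) outside of a function, which is why the
file-scope thunk has to stringify the offset rather than feed it in as an "i"
operand. A minimal sketch of the observed behaviour (illustrative, not a quote
from the GCC manual):

/* accepted at file scope: basic asm, no operands */
asm(".pushsection .text\n.popsection");

/* not accepted at file scope (extended asm with an "i" operand):
 *
 *	asm("cmpb $0, %c0(%rax)" : : "i" (16));
 */

static inline int constraints_work_inside_a_function(void)
{
	int ret;

	/* extended asm with constraints is fine here */
	asm volatile("movl %1, %0" : "=r" (ret) : "i" (16));
	return ret;
}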



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 04:52 PM, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 03:12:45PM -0500, Waiman Long wrote:
>> On 02/13/2017 02:42 PM, Waiman Long wrote:
>>> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>>>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>>>> That way we'd end up with something like:
>>>>>
>>>>> asm("
>>>>> push %rdi;
>>>>> movslq %edi, %rdi;
>>>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>>>> cmpb $0, %[offset](%rax);
>>>>> setne %al;
>>>>> pop %rdi;
>>>>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct 
>>>>> steal_time, preempted)));
>>>>>
>>>>> And if we could get rid of the sign extend on edi we could avoid all the
>>>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>>>> this asm foo isn't my strongest point).
>>>> Maybe:
>>>>
>>>> movsql %edi, %rax;
>>>> movq __per_cpu_offset(,%rax,8), %rax;
>>>> cmpb $0, %[offset](%rax);
>>>> setne %al;
>>>>
>>>> ?
>>> Yes, that looks good to me.
>>>
>>> Cheers,
>>> Longman
>>>
>> Sorry, I am going to take it back. The displacement or offset can only
>> be up to 32-bit. So we will still need to use at least one more
>> register, I think.
> I don't think that would be a problem, I very much doubt we declare more
> than 4G worth of per-cpu variables in the kernel.
>
> In any case, use "e" or "Z" as constraint (I never quite know when to
> use which). That are s32 and u32 displacement immediates resp. and
> should fail compile with a semi-sensible failure if the displacement is
> too big.
>
It is the address of &steal_time that will exceed the 32-bit limit.

Cheers,
Longman




___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 03:06 PM, h...@zytor.com wrote:
> On February 13, 2017 2:53:43 AM PST, Peter Zijlstra  
> wrote:
>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>> That way we'd end up with something like:
>>>
>>> asm("
>>> push %rdi;
>>> movslq %edi, %rdi;
>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>> cmpb $0, %[offset](%rax);
>>> setne %al;
>>> pop %rdi;
>>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct
>> steal_time, preempted)));
>>> And if we could get rid of the sign extend on edi we could avoid all
>> the
>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>> this asm foo isn't my strongest point).
>> Maybe:
>>
>> movsql %edi, %rax;
>> movq __per_cpu_offset(,%rax,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>>
>> ?
> We could kill the zero or sign extend by changing the calling interface to 
> pass an unsigned long instead of an int.  It is much more likely that a zero 
> extend is free for the caller than a sign extend.

I have thought of that too. However, the goal is to eliminate memory
reads/writes from/to the stack. Eliminating a register sign-extend instruction
won't help much in terms of performance.

Cheers,
Longman


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 02:42 PM, Waiman Long wrote:
> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>> That way we'd end up with something like:
>>>
>>> asm("
>>> push %rdi;
>>> movslq %edi, %rdi;
>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>> cmpb $0, %[offset](%rax);
>>> setne %al;
>>> pop %rdi;
>>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct 
>>> steal_time, preempted)));
>>>
>>> And if we could get rid of the sign extend on edi we could avoid all the
>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>> this asm foo isn't my strongest point).
>> Maybe:
>>
>> movsql %edi, %rax;
>> movq __per_cpu_offset(,%rax,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>>
>> ?
> Yes, that looks good to me.
>
> Cheers,
> Longman
>
Sorry, I am going to take it back. The displacement or offset can only
be up to 32-bit. So we will still need to use at least one more
register, I think.

Cheers,
Longman


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>> That way we'd end up with something like:
>>
>> asm("
>> push %rdi;
>> movslq %edi, %rdi;
>> movq __per_cpu_offset(,%rdi,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>> pop %rdi;
>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct 
>> steal_time, preempted)));
>>
>> And if we could get rid of the sign extend on edi we could avoid all the
>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>> this asm foo isn't my strongest point).
> Maybe:
>
> movsql %edi, %rax;
> movq __per_cpu_offset(,%rax,8), %rax;
> cmpb $0, %[offset](%rax);
> setne %al;
>
> ?

Yes, that looks good to me.

Cheers,
Longman


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 05:47 AM, Peter Zijlstra wrote:
> On Fri, Feb 10, 2017 at 12:00:43PM -0500, Waiman Long wrote:
>
>>>> +asm(
>>>> +".pushsection .text;"
>>>> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
>>>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
>>>> +"__raw_callee_save___kvm_vcpu_is_preempted:"
>>>> +FRAME_BEGIN
>>>> +"push %rdi;"
>>>> +"push %rdx;"
>>>> +"movslq  %edi, %rdi;"
>>>> +"movq$steal_time+16, %rax;"
>>>> +"movq__per_cpu_offset(,%rdi,8), %rdx;"
>>>> +"cmpb$0, (%rdx,%rax);"
> Could we not put the $steal_time+16 displacement as an immediate in the
> cmpb and save a whole register here?
>
> That way we'd end up with something like:
>
> asm("
> push %rdi;
> movslq %edi, %rdi;
> movq __per_cpu_offset(,%rdi,8), %rax;
> cmpb $0, %[offset](%rax);
> setne %al;
> pop %rdi;
> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct 
> steal_time, preempted)));
>
> And if we could get rid of the sign extend on edi we could avoid all the
> push-pop nonsense, but I'm not sure I see how to do that (then again,
> this asm foo isn't my strongest point).

Yes, I think that can work. I will try to run this patch to see how
things go.

>>>> +"setne   %al;"
>>>> +"pop %rdx;"
>>>> +"pop %rdi;"
>>>> +FRAME_END
>>>> +"ret;"
>>>> +".popsection");
>>>> +
>>>> +#endif
>>>> +
>>>>  /*
>>>>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>>>>   */
>>> That should work for now. I have done something similar for
>>> __pv_queued_spin_unlock. However, this has the problem of creating a
>>> dependency on the exact layout of the steal_time structure. Maybe the
>>> constant 16 can be passed in as a parameter offsetof(struct
>>> kvm_steal_time, preempted) to the asm call.
> Yeah it should be well possible to pass that in. But ideally we'd have
> GCC grow something like __attribute__((callee_saved)) or somesuch and it
> would do all this for us.

That will be really nice too. I am not too fond of working in assembly.

>> One more thing, that will improve KVM performance, but it won't help Xen.
> People still use Xen? ;-) In any case, their implementation looks very
> similar and could easily crib this.

At Red Hat, my focus will be on KVM performance. I do believe that there
are still Xen users out there, so we still need to take their interests
into consideration. Given that, I am OK with making it work better in KVM
first and then thinking about Xen later.

>> I looked into the assembly code for rwsem_spin_on_owner, It need to save
>> and restore 2 additional registers with my patch. Doing it your way,
>> will transfer the save and restore overhead to the assembly code.
>> However, __kvm_vcpu_is_preempted() is called multiple times per
>> invocation of rwsem_spin_on_owner. That function is simple enough that
>> making __kvm_vcpu_is_preempted() callee-save won't produce much compiler
>> optimization opportunity.
> This is because of that noinline, right? Otherwise it would've been
> folded and register pressure would be much higher.

Yes, I guess so. The noinline is there so that we know what the CPU time
is for spinning rather than other activities within the slowpath.

>
>> The outer function rwsem_down_write_failed()
>> does appear to be a bit bigger (from 866 bytes to 884 bytes) though.
> I suspect GCC is being clever and since all this is static it plays
> games with the calling convention and pushes these clobbers out.
>
>

Cheers,
Longman


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-10 Thread Waiman Long
On 02/10/2017 11:35 AM, Waiman Long wrote:
> On 02/10/2017 11:19 AM, Peter Zijlstra wrote:
>> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote:
>>> It was found when running fio sequential write test with a XFS ramdisk
>>> on a VM running on a 2-socket x86-64 system, the %CPU times as reported
>>> by perf were as follows:
>>>
>>>  69.75%  0.59%  fio  [k] down_write
>>>  69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
>>>  67.12%  1.12%  fio  [k] rwsem_down_write_failed
>>>  63.48% 52.77%  fio  [k] osq_lock
>>>   9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
>>>   3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
>>>
>> Thinking about this again, wouldn't something like the below also work?
>>
>>
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 099fcba4981d..6aa33702c15c 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
>>  local_irq_restore(flags);
>>  }
>>  
>> +#ifdef CONFIG_X86_32
>>  __visible bool __kvm_vcpu_is_preempted(int cpu)
>>  {
>>  struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
>> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
>>  }
>>  PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
>>  
>> +#else
>> +
>> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);
>> +
>> +asm(
>> +".pushsection .text;"
>> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
>> +"__raw_callee_save___kvm_vcpu_is_preempted:"
>> +FRAME_BEGIN
>> +"push %rdi;"
>> +"push %rdx;"
>> +"movslq  %edi, %rdi;"
>> +"movq$steal_time+16, %rax;"
>> +"movq__per_cpu_offset(,%rdi,8), %rdx;"
>> +"cmpb$0, (%rdx,%rax);"
>> +"setne   %al;"
>> +"pop %rdx;"
>> +"pop %rdi;"
>> +FRAME_END
>> +"ret;"
>> +".popsection");
>> +
>> +#endif
>> +
>>  /*
>>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>>   */
> That should work for now. I have done something similar for
> __pv_queued_spin_unlock. However, this has the problem of creating a
> dependency on the exact layout of the steal_time structure. Maybe the
> constant 16 can be passed in as a parameter offsetof(struct
> kvm_steal_time, preempted) to the asm call.
>
> Cheers,
> Longman

One more thing: that will improve KVM performance, but it won't help Xen.

I looked into the assembly code for rwsem_spin_on_owner. It needs to save
and restore 2 additional registers with my patch. Doing it your way
will transfer the save and restore overhead to the assembly code.
However, __kvm_vcpu_is_preempted() is called multiple times per
invocation of rwsem_spin_on_owner. That function is simple enough that
making __kvm_vcpu_is_preempted() callee-save won't produce much compiler
optimization opportunity. The outer function rwsem_down_write_failed()
does appear to be a bit bigger (from 866 bytes to 884 bytes) though.

Cheers,
Longman



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-10 Thread Waiman Long
On 02/10/2017 11:19 AM, Peter Zijlstra wrote:
> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote:
>> It was found when running fio sequential write test with a XFS ramdisk
>> on a VM running on a 2-socket x86-64 system, the %CPU times as reported
>> by perf were as follows:
>>
>>  69.75%  0.59%  fio  [k] down_write
>>  69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
>>  67.12%  1.12%  fio  [k] rwsem_down_write_failed
>>  63.48% 52.77%  fio  [k] osq_lock
>>   9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
>>   3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
>>
> Thinking about this again, wouldn't something like the below also work?
>
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 099fcba4981d..6aa33702c15c 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
>   local_irq_restore(flags);
>  }
>  
> +#ifdef CONFIG_X86_32
>  __visible bool __kvm_vcpu_is_preempted(int cpu)
>  {
>   struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
>  }
>  PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
>  
> +#else
> +
> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);
> +
> +asm(
> +".pushsection .text;"
> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
> +"__raw_callee_save___kvm_vcpu_is_preempted:"
> +FRAME_BEGIN
> +"push %rdi;"
> +"push %rdx;"
> +"movslq  %edi, %rdi;"
> +"movq$steal_time+16, %rax;"
> +"movq__per_cpu_offset(,%rdi,8), %rdx;"
> +"cmpb$0, (%rdx,%rax);"
> +"setne   %al;"
> +"pop %rdx;"
> +"pop %rdi;"
> +FRAME_END
> +"ret;"
> +".popsection");
> +
> +#endif
> +
>  /*
>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>   */

That should work for now. I have done something similar for
__pv_queued_spin_unlock. However, this has the problem of creating a
dependency on the exact layout of the steal_time structure. Maybe the
constant 16 can be passed in as a parameter offsetof(struct
kvm_steal_time, preempted) to the asm call.

Cheers,
Longman
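
A sketch of what passing the offset as an operand could look like when the asm
lives inside a function (stand-in struct, illustrative only); the catch,
discussed later in this thread, is that such operands are not accepted for the
file-scope asm used by the callee-save thunk:

#include <stddef.h>

struct kvm_steal_time_example {		/* stand-in; not the uapi layout */
	unsigned long steal;
	unsigned int version;
	unsigned int flags;
	unsigned char preempted;	/* ends up at offset 16 */
};

static unsigned char read_preempted(const void *base)
{
	unsigned char ret;

	/* the offset is fed in as an immediate instead of a hard-coded 16 */
	asm ("movb %c1(%2), %0"
	     : "=r" (ret)
	     : "i" (offsetof(struct kvm_steal_time_example, preempted)),
	       "r" (base));
	return ret;
}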



___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-10 Thread Waiman Long
It was found when running fio sequential write test with a XFS ramdisk
on a VM running on a 2-socket x86-64 system, the %CPU times as reported
by perf were as follows:

 69.75%  0.59%  fio  [k] down_write
 69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
 67.12%  1.12%  fio  [k] rwsem_down_write_failed
 63.48% 52.77%  fio  [k] osq_lock
  9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
  3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64 primarily due to at least one more cacheline of
data access from the saving and restoring of registers (8 of them)
to and from stack as well as one more level of function call. As
vcpu_is_preempted() is called within the spinlock, mutex and rwsem
slowpaths, there isn't much to gain by making it callee-save. So it
is now changed to a normal function call instead.

With this patch applied on both bare-metal & KVM guest on a 2-socket
16-core 32-thread system with 16 parallel jobs (8 on each socket), the
aggregate bandwidth of the fio test on an XFS ramdisk were as follows:

                 Bare Metal                 KVM Guest
   I/O Type      w/o patch    with patch    w/o patch    with patch
   -----------   -----------  -----------   -----------  -----------
   random read   8650.5 MB/s  8560.9 MB/s  7602.9 MB/s  8196.1 MB/s  
   seq read  9104.8 MB/s  9397.2 MB/s  8293.7 MB/s  8566.9 MB/s
   random write  1623.8 MB/s  1626.7 MB/s  1590.6 MB/s  1700.7 MB/s
   seq write 1626.4 MB/s  1624.9 MB/s  1604.8 MB/s  1726.3 MB/s

The perf data (on KVM guest) now became:

 70.78%  0.58%  fio  [k] down_write
 70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.70%  1.17%  fio  [k] rwsem_down_write_failed
 59.91% 55.42%  fio  [k] osq_lock
 10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

On bare metal, the patch doesn't introduce any performance
regression. On KVM guest, it produces noticeable performance
improvement (up to 7%).

Signed-off-by: Waiman Long <long...@redhat.com>
---
 v1->v2:
  - Rerun the fio test on a different system on both bare-metal and a
KVM guest. Both sockets were utilized in this test.
  - The commit log was updated with new performance numbers, but the
patch wasn't changed.
  - Drop patch 2.

 arch/x86/include/asm/paravirt.h   | 2 +-
 arch/x86/include/asm/paravirt_types.h | 2 +-
 arch/x86/kernel/kvm.c | 7 ++-
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++
 arch/x86/xen/spinlock.c   | 4 +---
 5 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 864f57b..2515885 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -676,7 +676,7 @@ static __always_inline void pv_kick(int cpu)
 
 static __always_inline bool pv_vcpu_is_preempted(int cpu)
 {
-   return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
+   return PVOP_CALL1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
 
 #endif /* SMP && PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index bb2de45..88dc852 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -309,7 +309,7 @@ struct pv_lock_ops {
void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);
 
-   struct paravirt_callee_save vcpu_is_preempted;
+   bool (*vcpu_is_preempted)(int cpu);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..eb3753d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -595,7 +595,6 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
 
return !!src->preempted;
 }
-PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
@@ -614,10 +613,8 @@ void __init kvm_spinlock_init(void)
pv_lock_ops.wait = kvm_wait;
pv_lock_ops.kick = kvm_kick_cpu;
 
-   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-   pv_lock_ops.vcpu_is_preempted =
-   PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
-   }
+   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME))
+   pv_lock_ops.vcpu_is_preempted = __kvm_vcpu_is_preempted;
 }
 
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..da050bc 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -24,12 +24,10 @@ __visible bool __native_vcpu_is_preempted(int cpu)
 {
return false;
 }
-PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted);
 
 bool pv_is_native_vcpu_is_preempted(void)
 {
-   return pv_lock_ops.vcpu_is_preempted.func ==
-   __raw_callee_save___

Re: [Xen-devel] [PATCH 1/2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-08 Thread Waiman Long
On 02/08/2017 02:05 PM, Peter Zijlstra wrote:
> On Wed, Feb 08, 2017 at 01:00:24PM -0500, Waiman Long wrote:
>> It was found when running fio sequential write test with a XFS ramdisk
>> on a 2-socket x86-64 system, the %CPU times as reported by perf were
>> as follows:
>>
>>  71.27%  0.28%  fio  [k] down_write
>>  70.99%  0.01%  fio  [k] call_rwsem_down_write_failed
>>  69.43%  1.18%  fio  [k] rwsem_down_write_failed
>>  65.51% 54.57%  fio  [k] osq_lock
>>   9.72%  7.99%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempted
>>   4.16%  4.16%  fio  [k] __kvm_vcpu_is_preempted
>>
>> So making vcpu_is_preempted() a callee-save function has a pretty high
>> cost associated with it. As vcpu_is_preempted() is called within the
>> spinlock, mutex and rwsem slowpaths, there isn't much to gain by making
>> it callee-save. So it is now changed to a normal function call instead.
>>
> Numbers for bare metal too please.

I will run the test on bare metal, but I doubt there will be a noticeable
difference.

Cheers,
Longman


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 2/2] locking/mutex, rwsem: Reduce vcpu_is_preempted() calling frequency

2017-02-08 Thread Waiman Long
On 02/08/2017 02:05 PM, Peter Zijlstra wrote:
> On Wed, Feb 08, 2017 at 01:00:25PM -0500, Waiman Long wrote:
>> As the vcpu_is_preempted() call is pretty costly compared with other
>> checks within mutex_spin_on_owner() and rwsem_spin_on_owner(), they
>> are done at a reduce frequency of once every 256 iterations.
> That's just disgusting.

I do have some doubts myself about the effectiveness of this patch. Anyway,
it is the first patch that I think is the more beneficial one.

Cheers,
Longman


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 2/2] locking/mutex, rwsem: Reduce vcpu_is_preempted() calling frequency

2017-02-08 Thread Waiman Long
As the vcpu_is_preempted() call is pretty costly compared with other
checks within mutex_spin_on_owner() and rwsem_spin_on_owner(), they
are done at a reduced frequency of once every 256 iterations.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 kernel/locking/mutex.c  | 5 -
 kernel/locking/rwsem-xadd.c | 6 --
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index ad2d9e2..2ece0c4 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -423,6 +423,7 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
 struct ww_acquire_ctx *ww_ctx, struct mutex_waiter *waiter)
 {
bool ret = true;
+   int loop = 0;
 
rcu_read_lock();
while (__mutex_owner(lock) == owner) {
@@ -436,9 +437,11 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner,
 
/*
 * Use vcpu_is_preempted to detect lock holder preemption issue.
+* As vcpu_is_preempted is more costly to use, it is called at
+* a reduced frequencey (once every 256 iterations).
 */
if (!owner->on_cpu || need_resched() ||
-   vcpu_is_preempted(task_cpu(owner))) {
+  (!(++loop & 0xff) && vcpu_is_preempted(task_cpu(owner)))) {
ret = false;
break;
}
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2ad8d8d..7a884a6 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -351,6 +351,7 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
 static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 {
struct task_struct *owner = READ_ONCE(sem->owner);
+   int loop = 0;
 
if (!rwsem_owner_is_writer(owner))
goto out;
@@ -367,10 +368,11 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
 
/*
 * abort spinning when need_resched or owner is not running or
-* owner's cpu is preempted.
+* owner's cpu is preempted. The preemption check is done at
+* a lower frequencey because of its high cost.
 */
if (!owner->on_cpu || need_resched() ||
-   vcpu_is_preempted(task_cpu(owner))) {
+  (!(++loop & 0xff) && vcpu_is_preempted(task_cpu(owner)))) {
rcu_read_unlock();
return false;
}
-- 
1.8.3.1
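
The masking idiom in the hunks above is just a cheap way to run the expensive
test on every 256th iteration; a standalone sketch of the same pattern, with
cheap_check()/expensive_check() as hypothetical stand-ins for the owner check
and vcpu_is_preempted():

#include <stdbool.h>

extern bool cheap_check(void);		/* stands in for owner->on_cpu / need_resched() */
extern bool expensive_check(void);	/* stands in for vcpu_is_preempted() */

static bool keep_spinning(void)
{
	int loop = 0;

	while (cheap_check()) {
		/* (++loop & 0xff) == 0 once every 256 iterations, so the
		 * expensive test runs at 1/256th of the loop rate */
		if (!(++loop & 0xff) && expensive_check())
			return false;	/* give up spinning early */
	}
	return true;
}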


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH 1/2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-08 Thread Waiman Long
It was found when running fio sequential write test with a XFS ramdisk
on a 2-socket x86-64 system, the %CPU times as reported by perf were
as follows:

 71.27%  0.28%  fio  [k] down_write
 70.99%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.43%  1.18%  fio  [k] rwsem_down_write_failed
 65.51% 54.57%  fio  [k] osq_lock
  9.72%  7.99%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempted
  4.16%  4.16%  fio  [k] __kvm_vcpu_is_preempted

So making vcpu_is_preempted() a callee-save function has a pretty high
cost associated with it. As vcpu_is_preempted() is called within the
spinlock, mutex and rwsem slowpaths, there isn't much to gain by making
it callee-save. So it is now changed to a normal function call instead.

With this patch applied, the aggregate bandwidth of the fio sequential
write test increased slightly from 2563.3MB/s to 2588.1MB/s (about 1%).

Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/include/asm/paravirt.h   | 2 +-
 arch/x86/include/asm/paravirt_types.h | 2 +-
 arch/x86/kernel/kvm.c | 7 ++-
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++
 arch/x86/xen/spinlock.c   | 4 +---
 5 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 864f57b..2515885 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -676,7 +676,7 @@ static __always_inline void pv_kick(int cpu)
 
 static __always_inline bool pv_vcpu_is_preempted(int cpu)
 {
-   return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
+   return PVOP_CALL1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
 
 #endif /* SMP && PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index bb2de45..88dc852 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -309,7 +309,7 @@ struct pv_lock_ops {
void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);
 
-   struct paravirt_callee_save vcpu_is_preempted;
+   bool (*vcpu_is_preempted)(int cpu);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..eb3753d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -595,7 +595,6 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
 
return !!src->preempted;
 }
-PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
@@ -614,10 +613,8 @@ void __init kvm_spinlock_init(void)
pv_lock_ops.wait = kvm_wait;
pv_lock_ops.kick = kvm_kick_cpu;
 
-   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-   pv_lock_ops.vcpu_is_preempted =
-   PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
-   }
+   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME))
+   pv_lock_ops.vcpu_is_preempted = __kvm_vcpu_is_preempted;
 }
 
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..da050bc 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -24,12 +24,10 @@ __visible bool __native_vcpu_is_preempted(int cpu)
 {
return false;
 }
-PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted);
 
 bool pv_is_native_vcpu_is_preempted(void)
 {
-   return pv_lock_ops.vcpu_is_preempted.func ==
-   __raw_callee_save___native_vcpu_is_preempted;
+   return pv_lock_ops.vcpu_is_preempted == __native_vcpu_is_preempted;
 }
 
 struct pv_lock_ops pv_lock_ops = {
@@ -38,7 +36,7 @@ struct pv_lock_ops pv_lock_ops = {
.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
.wait = paravirt_nop,
.kick = paravirt_nop,
-   .vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted),
+   .vcpu_is_preempted = __native_vcpu_is_preempted,
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 25a7c43..c85bb8f 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,8 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
 }
 
-PV_CALLEE_SAVE_REGS_THUNK(xen_vcpu_stolen);
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -138,7 +136,7 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.queued_spin_unlock = 
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = xen_qlock_wait;
pv_lock_ops.kick = xen_qlock_kick;
-   pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen);
+   pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 static __init int xen_parse_nopvspin(char 

[Xen-devel] [PATCH v2] x86, locking/spinlocks: Remove paravirt_ticketlocks_enabled

2017-01-12 Thread Waiman Long
This is a follow-up of commit cfd8983f03c7b2 ("x86, locking/spinlocks:
Remove ticket (spin)lock implementation"). The static_key structure
paravirt_ticketlocks_enabled is now removed as it is no longer used.
As a result, the init functions kvm_spinlock_init_jump() and
xen_init_spinlocks_jump() are also removed.

A simple build and boot test was done to verify it.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 v1->v2:
  - Remove init functions kvm_spinlock_init_jump() and
xen_init_spinlocks_jump().

 arch/x86/include/asm/spinlock.h  |  3 ---
 arch/x86/kernel/kvm.c| 14 --
 arch/x86/kernel/paravirt-spinlocks.c |  3 ---
 arch/x86/xen/spinlock.c  | 19 ---
 4 files changed, 39 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..6d39190 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -23,9 +23,6 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD (1 << 15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
-static __always_inline bool static_key_false(struct static_key *key);
-
 #include <asm/qspinlock.h>
 
 /*
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 36bc664..099fcba 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -620,18 +620,4 @@ void __init kvm_spinlock_init(void)
}
 }
 
-static __init int kvm_spinlock_init_jump(void)
-{
-   if (!kvm_para_available())
-   return 0;
-   if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
-   return 0;
-
-   static_key_slow_inc(&paravirt_ticketlocks_enabled);
-   printk(KERN_INFO "KVM setup paravirtual spinlock\n");
-
-   return 0;
-}
-early_initcall(kvm_spinlock_init_jump);
-
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6d4bf81..6259327 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -42,6 +42,3 @@ struct pv_lock_ops pv_lock_ops = {
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
-
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index e8a9ea7..25a7c43 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -141,25 +141,6 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen);
 }
 
-/*
- * While the jump_label init code needs to happend _after_ the jump labels are
- * enabled and before SMP is started. Hence we use pre-SMP initcall level
- * init. We cannot do it in xen_init_spinlocks as that is done before
- * jump labels are activated.
- */
-static __init int xen_init_spinlocks_jump(void)
-{
-   if (!xen_pvspin)
-   return 0;
-
-   if (!xen_domain())
-   return 0;
-
-   static_key_slow_inc(&paravirt_ticketlocks_enabled);
-   return 0;
-}
-early_initcall(xen_init_spinlocks_jump);
-
 static __init int xen_parse_nopvspin(char *arg)
 {
xen_pvspin = false;
-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] [PATCH] x86, locking/spinlocks: Remove paravirt_ticketlocks_enabled

2017-01-12 Thread Waiman Long
This is a follow-up of commit cfd8983f03c7b2 ("x86, locking/spinlocks:
Remove ticket (spin)lock implementation"). The static_key structure
paravirt_ticketlocks_enabled is now removed as it is no longer used.

A simple build and boot test was done to verify it.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/include/asm/spinlock.h  | 3 ---
 arch/x86/kernel/kvm.c| 1 -
 arch/x86/kernel/paravirt-spinlocks.c | 3 ---
 arch/x86/xen/spinlock.c  | 1 -
 4 files changed, 8 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..6d39190 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -23,9 +23,6 @@
 /* How long a lock should spin before we consider blocking */
 #define SPIN_THRESHOLD (1 << 15)
 
-extern struct static_key paravirt_ticketlocks_enabled;
-static __always_inline bool static_key_false(struct static_key *key);
-
 #include <asm/qspinlock.h>
 
 /*
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 36bc664..6750fdc 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -627,7 +627,6 @@ static __init int kvm_spinlock_init_jump(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return 0;
 
-   static_key_slow_inc(&paravirt_ticketlocks_enabled);
printk(KERN_INFO "KVM setup paravirtual spinlock\n");
 
return 0;
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6d4bf81..6259327 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -42,6 +42,3 @@ struct pv_lock_ops pv_lock_ops = {
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
-
-struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;
-EXPORT_SYMBOL(paravirt_ticketlocks_enabled);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index e8a9ea7..a822606 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -155,7 +155,6 @@ static __init int xen_init_spinlocks_jump(void)
if (!xen_domain())
return 0;
 
-   static_key_slow_inc(&paravirt_ticketlocks_enabled);
return 0;
 }
 early_initcall(xen_init_spinlocks_jump);
-- 
1.8.3.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v16 08/14] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-05-04 Thread Waiman Long

On 05/04/2015 10:20 AM, Peter Zijlstra wrote:

I changed it to the below; I've not gotten around to compiling or even
running it yet :-(

The biggest change is the pv_hash/pv_unhash functions, which I've
rewritten to hopefully be clearer (and also hopefully not wrecked them).
I took out the cacheline sized structure which takes out that double
loop and simplifies things. I've also added some comments which
hopefully explain how/why we ended up with this exact scheme.

I've also moved the __pv_queue_spin_unlock() function to the tail, such
that we keep the 'wait'/'kick' order for both node and head.

In any case, like I just wrote on the other email, I've stuck some
things in my queue (up to and including patch 11) and if it all works
out we can continue from there.

---
Subject: pvqspinlock: Implement simple paravirt support for the qspinlock
From: Waiman Long <waiman.l...@hp.com>
Date: Fri, 24 Apr 2015 14:56:37 -0400

Provide a separate (second) version of the spin_lock_slowpath for
paravirt along with a special unlock path.

The second slowpath is generated by adding a few pv hooks to the
normal slowpath, but where those will compile away for the native
case, they expand into special wait/wake code for the pv version.

The actual MCS queue can use extra storage in the mcs_nodes[] array to
keep track of state and therefore uses directed wakeups.

The head contender has no such storage directly visible to the
unlocker.  So the unlocker searches a hash table with open addressing
using a simple binary Galois linear feedback shift register.


I am fine with the change as it makes it simpler. BTW, I just saw a 
build error mail. That should be fixed easily with some minor edit.


Cheers,
Longman
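
To make the "hash table with open addressing using a Galois LFSR" idea above
concrete, here is an illustrative user-space sketch; the table size, the hash
of the lock pointer and the tap mask are all assumptions, not the kernel's
pv_hash()/pv_unhash():

#include <stdint.h>

#define PV_HB_BITS	8
#define PV_HB_SIZE	(1u << PV_HB_BITS)

struct pv_hash_entry {
	void *lock;		/* NULL means the slot is free */
	void *node;
};

static struct pv_hash_entry pv_hash_table[PV_HB_SIZE];

/* One Galois LFSR step; 0xB8 is a maximal-length tap mask for 8 bits, so the
 * probe sequence visits every non-zero slot once before repeating. */
static unsigned int lfsr_step(unsigned int x)
{
	return (x >> 1) ^ ((x & 1u) ? 0xB8u : 0u);
}

static struct pv_hash_entry *pv_hash_insert(void *lock, void *node)
{
	unsigned int slot = ((uintptr_t)lock >> 4) & (PV_HB_SIZE - 1);

	if (!slot)
		slot = 1;	/* an LFSR never reaches 0 */

	for (;;) {
		struct pv_hash_entry *he = &pv_hash_table[slot];

		if (!he->lock) {	/* single-threaded sketch; the real code
					 * claims the slot with cmpxchg */
			he->lock = lock;
			he->node = node;
			return he;
		}
		slot = lfsr_step(slot);	/* open-addressing probe */
	}
}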


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg

2015-05-04 Thread Waiman Long

On 05/04/2015 10:05 AM, Peter Zijlstra wrote:

On Thu, Apr 30, 2015 at 02:49:26PM -0400, Waiman Long wrote:

On 04/29/2015 02:11 PM, Peter Zijlstra wrote:

On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:

In the pv_scan_next() function, the slow cmpxchg atomic operation is
performed even if the other CPU is not even close to being halted. This
extra cmpxchg can harm slowpath performance.

This patch introduces the new mayhalt flag to indicate if the other
spinning CPU is close to being halted or not. The current threshold
for x86 is 2k cpu_relax() calls. If this flag is not set, the other
spinning CPU will have at least 2k more cpu_relax() calls before
it can enter the halt state. This should give enough time for the
setting of the locked flag in struct mcs_spinlock to propagate to
that CPU without using atomic op.

Yuck! I'm not at all sure you can make assumptions like that. And the
worst part is, if it goes wrong the borkage is subtle and painful.

I do think the code is OK. However, you are right that if my reasoning is
incorrect, the resulting bug will be really subtle.

So I do not think its correct, imagine the fabrics used for the 4096 cpu
SGI machine, now add some serious traffic to them. There is no saying
your random 2k relax loop will be enough to propagate the change.

Equally, another arch (this is generic code) might have starvation
issues on its inter-cpu fabric and delay the store just long enough.

The thing is, one should _never_ rely on timing for correctness, _ever_.



Yes, you are right. Having a dependency on timing can be dangerous.


So I am going to
withdraw this particular patch as it has no functional impact on the overall
patch series. Please let me know if you have any other comments on other
parts of the series and I will send out a new series without this
particular patch.

Please wait a little while, I've queued the 'basic' patches, once that
settles in tip we can look at the others.

Also, I have some local changes (sorry, I could not help myself) I should
post; I've been somewhat delayed by illness.


Sure. I will wait until you finish your tip test.

I am sorry to hear that you have been ill. I hope you are feeling
well by now.


Cheers,
Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg

2015-04-30 Thread Waiman Long

On 04/29/2015 02:27 PM, Linus Torvalds wrote:

On Wed, Apr 29, 2015 at 11:11 AM, Peter Zijlstra <pet...@infradead.org> wrote:

On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:

In the pv_scan_next() function, the slow cmpxchg atomic operation is
performed even if the other CPU is not even close to being halted. This
extra cmpxchg can harm slowpath performance.

This patch introduces the new mayhalt flag to indicate if the other
spinning CPU is close to being halted or not. The current threshold
for x86 is 2k cpu_relax() calls. If this flag is not set, the other
spinning CPU will have at least 2k more cpu_relax() calls before
it can enter the halt state. This should give enough time for the
setting of the locked flag in struct mcs_spinlock to propagate to
that CPU without using atomic op.

Yuck! I'm not at all sure you can make assumptions like that. And the
worst part is, if it goes wrong the borkage is subtle and painful.

I have to agree with Peter.

But it goes beyond this particular patch. Patterns like this:

xchg(&pn->mayhalt, true);

are just evil and disgusting. Even befoe this patch, that code had

 (void)xchg(&pn->state, vcpu_halted);

which is *wrong* and should never be done.

If you want it to be set_mb() (which sets a value and has a memory
barrier), then use set_mb(). Yes, it happens to use a xchg() to do
so, but dammit, it documents that whole "this is a memory barrier" thing in
the name.
Also, anybody who does this should damn well document why the memory
barrier is needed. The xchg(&pn->state, vcpu_halted) at least is
preceded by a comment about the barriers. The new mayhalt has no sane
comment in it, and the reason seems to be that no sane comment is
possible. The xchg() seems to be some black magic thing.

Let's not introduce magic stuff in our locking primitives. At least
not undocumented magic that makes no sense.

Linus


Thanks for the comments. I will withdraw this patch and use set_mb() in 
the code as suggested for better readability.


Cheers,
Longman
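
For readers outside the kernel tree, the readability point translates roughly
into the difference below, shown with C11 atomics as a user-space analogue
(not the kernel's set_mb()/xchg() macros):

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic bool mayhalt;

/* named store-with-barrier: roughly what set_mb(pn->mayhalt, true) spells out */
static void publish_mayhalt_clearly(void)
{
	atomic_store_explicit(&mayhalt, true, memory_order_seq_cst);
}

/* same ordering, but the barrier hides inside a discarded exchange */
static void publish_mayhalt_obscurely(void)
{
	(void)atomic_exchange_explicit(&mayhalt, true, memory_order_seq_cst);
}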

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg

2015-04-30 Thread Waiman Long

On 04/29/2015 02:11 PM, Peter Zijlstra wrote:

On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:

In the pv_scan_next() function, the slow cmpxchg atomic operation is
performed even if the other CPU is not even close to being halted. This
extra cmpxchg can harm slowpath performance.

This patch introduces the new mayhalt flag to indicate if the other
spinning CPU is close to being halted or not. The current threshold
for x86 is 2k cpu_relax() calls. If this flag is not set, the other
spinning CPU will have at least 2k more cpu_relax() calls before
it can enter the halt state. This should give enough time for the
setting of the locked flag in struct mcs_spinlock to propagate to
that CPU without using atomic op.

Yuck! I'm not at all sure you can make assumptions like that. And the
worst part is, if it goes wrong the borkage is subtle and painful.


I do think the code is OK. However, you are right that if my reasoning 
is incorrect, the resulting bug will be really subtle. So I am going to 
withdraw this particular patch as it has no functional impact on the
overall patch series. Please let me know if you have any other comments
on other parts of the series and I will send out a new series
without this particular patch.


Cheers,
Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v16 04/14] qspinlock: Extract out code snippets for the next patch

2015-04-24 Thread Waiman Long
This is a preparatory patch that extracts out the following 2 code
snippets to prepare for the next performance optimization patch.

 1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
 2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.

This patch also simplifies the trylock operation before queuing by
calling queue_spin_trylock() directly.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 include/asm-generic/qspinlock_types.h |2 +
 kernel/locking/qspinlock.c|   79 -
 2 files changed, 50 insertions(+), 31 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 9c3f5c2..ef36613 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -58,6 +58,8 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0351f78..11f6ad9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -97,6 +97,42 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 /**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+   atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
+}
+
+/**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   u32 old, new, val = atomic_read(&lock->val);
+
+   for (;;) {
+   new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+   return old;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -178,15 +214,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 *
 * *,1,0 -> *,0,1
 */
-   for (;;) {
-   new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   clear_pending_set_locked(lock);
return;
 
/*
@@ -203,37 +231,26 @@ queue:
node-next = NULL;
 
/*
-* We have already touched the queueing cacheline; don't bother with
-* pending stuff.
-*
-* trylock || xchg(lock, node)
-*
-* 0,0,0 -> 0,0,1 ; no tail, not locked -> no tail, locked.
-* p,y,x -> n,y,x ; tail was p -> tail is n; preserving locked.
+* We touched a (possibly) cold cacheline in the per-cpu queue node;
+* attempt the trylock once more in the hope someone let go while we
+* weren't watching.
 */
-   for (;;) {
-   new = _Q_LOCKED_VAL;
-   if (val)
-   new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   if (queue_spin_trylock(lock))
+   goto release;
 
/*
-* we won the trylock; forget about queueing.
+* We have already touched the queueing cacheline; don't bother with
+* pending stuff.
+*
+* p,*,* -> n,*,*
 */
-   if (new == _Q_LOCKED_VAL)
-   goto release;
+   old = xchg_tail(lock, tail);
 
/*
 * if there was a previous node; link it and wait until reaching the
 * head of the waitqueue.
 */
-   if (old & ~_Q_LOCKED_PENDING_MASK) {
+   if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
 
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v16 01/14] qspinlock: A simple generic 4-byte queue spinlock

2015-04-24 Thread Waiman Long
This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities.  By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
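As a rough sketch of the encoding described above (the helper names and
the _Q_TAIL_* constants follow the ones used later in this series; the
actual patch below may differ in detail):

    /* Pack (cpu, idx) into the upper 24 bits; cpu is stored +1 so that
     * an all-zero lock word means "no queue tail".
     */
    static inline u32 encode_tail(int cpu, int idx)
    {
            u32 tail;

            tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
            tail |= idx << _Q_TAIL_IDX_OFFSET;      /* idx < 4 */

            return tail;
    }

    /* Reverse the encoding to find the waiting CPU's queue node. */
    static inline struct mcs_spinlock *decode_tail(u32 tail)
    {
            int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
            int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;

            return per_cpu_ptr(&mcs_nodes[idx], cpu);
    }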

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 include/asm-generic/qspinlock.h   |  132 +
 include/asm-generic/qspinlock_types.h |   58 +
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 +
 kernel/locking/mcs_spinlock.h |1 +
 kernel/locking/qspinlock.c|  209 +
 6 files changed, 408 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 000..315d6dc
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,132 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *  locked wrt the lockref code to avoid lock stealing by the lockref
+ *  code and change things underneath the lock. This also allows some
+ *  optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+   return !atomic_read(&lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+   if (!atomic_read(&lock->val) &&
+  (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+   u32

[Xen-devel] [PATCH v16 00/14] qspinlock: a 4-byte queue spinlock with PV support

2015-04-24 Thread Waiman Long
 that the queue spinlock patch can improve the
performance of an I/O and interrupt intensive stress test with a lot
of spinlock contention on a 2-socket system by up to 20%.

Daniel Blueman of NumaScale had tested the v14 qspinlock patch on an
older 512-core NumaConnect system for testing with 3.19.1:

$ src/reaim -c data/reaim.config -f data/workfile.compute -i 16 -e 256
Num Parent   Child   Child  Jobs per   Jobs/min/  Std_dev  Std_dev  JTI
Forked  Time SysTime UTime   Minute Child  Time Percent
1    2.15     0.36      1.19    2818.60    2818.60   0.00    0.00   100
17   6.29     42.41     22.43   16378.38   963.43    0.27    4.43   95
33   56.24    1611.77   47.04   3555.83    107.75    1.58    2.88   97
49   120.88   5576.87   63.54   2456.49    50.13     1.75    1.47   98
65   199.17   12328.42  81.27   1977.71    30.43     2.42    1.23   98
81   270.89   20943.99  103.99  1812.03    22.37     3.85    1.44   98
97   315.31   29211.63  121.30  1864.26    19.22     4.48    1.44   98
113  397.68   42766.15  144.48  1721.94    15.24     7.04    1.80   98
129  520.74   63442.27  162.03  1501.21    11.64     10.24   1.99   98
...

Patched with qspinlocks v14:

Num Parent   Child   Child  Jobs per   Jobs/min/  Std_dev  Std_dev  JTI
Forked  Time SysTime UTime   Minute Child  Time Percent
1    1.98     0.36      1.21    3060.61    3060.61   0.00    0.00   100
17   5.60     25.41     22.39   18396.43   1082.14   0.16    3.00   97
33   12.67    153.64    49.17   15783.74   478.30    0.50    4.16   95
49   21.15    365.55    76.61   14039.72   286.52    0.83    4.19   95
65   27.57    556.87    105.96  14287.27   219.80    0.75    2.80   97
81   33.07    712.92    127.85  14843.06   183.25    1.16    3.69   96
97   42.81    1009.02   161.87  13730.90   141.56    1.75    4.27   95
113  49.13    1166.04   185.10  13938.12   123.35    2.11    4.53   95
129  56.62    1262.07   231.92  13806.78   107.03    2.67    4.99   95
145  66.41    1420.29   250.24  13231.44   91.25     2.95    4.70   95
179  90.15    2237.78   325.18  12032.61   67.22     4.23    4.94   95
193  85.41    1865.81   326.64  13693.71   70.95     4.29    5.32   94
220  93.58    2298.77   376.46  14246.63   64.76     4.97    5.65   94
Max Jobs per Minute 18396.43 

The compute workload is mostly userspace code with some synchronization
between working threads (HPC type workload). With large user count, the
patched kernel has close to 8X the performance of an unpatched kernel.

Paravirt qspinlock patch performance


A linux kernel build workload (make -j n) was run on KVM and Xen
guests with and without the qspinlock patch. The host machine was
an 8-socket Westmere-EX server with 4 active cores/socket (8x4).

For the KVM test, the VM configurations were:

 1) 32 NUMA-pinned vCPUs (8x4)
 2) 32 vCPUs (no pinning or NUMA)
 3) 60 vCPUs (overcommitted)
 
The kernel build times for different 4.0-rc6 based kernels were:

  Kernel            VM1        VM2         VM3
  ------            ---        ---         ---
  PV ticket lock    3m49.7s    11m45.6s    15m59.3s
  PV qspinlock      3m50.4s    5m51.6s     15m33.6s

For VM1, both the qspinlock and ticket lock have essentially the same
performance.  VM2 illustrated the performance benefit of qspinlock
versus ticket spinlock which got reduced in VM3 due to the overhead
of constant vCPUs halting and kicking.

For the Xen test, the VM configurations were:

 1) 32 vCPUs (no pinning or NUMA)
 2) 64 vCPUs
 2) 128 vCPUs

The kernel build times for different 4.0 based kernels were:

  Kernel            VM1        VM2         VM3
  ------            ---        ---         ---
  PV ticket lock    7m27.3s    16m14.1s    18m26.9s
  PV qspinlock      5m51.6s    12m02.6s    14m19.2s

Here, the PV qspinlock was faster than the PV ticket lock.

David Vrabel (1):
  pvqspinlock, x86: Enable PV qspinlock for Xen

Peter Zijlstra (Intel) (4):
  qspinlock: Add pending bit
  qspinlock: Optimize for smaller NR_CPUS
  qspinlock: Revert to test-and-set on hypervisors
  pvqspinlock, x86: Implement the paravirt qspinlock call patching

Waiman Long (9):
  qspinlock: A simple generic 4-byte queue spinlock
  qspinlock, x86: Enable x86-64 to use queue spinlock
  qspinlock: Extract out code snippets for the next patch
  qspinlock: Use a simple write to grab the lock
  pvqspinlock: Implement simple paravirt support for the qspinlock
  pvqspinlock, x86: Enable PV qspinlock for KVM
  pvqspinlock: Only kick CPU at unlock time
  pvqspinlock: Improve slowpath performance by avoiding cmpxchg
  pvqspinlock: Collect slowpath lock statistics

 arch/x86/Kconfig  |3 +-
 arch/x86/include/asm/paravirt.h   |   29 ++-
 arch/x86/include/asm/paravirt_types.h |   10 +
 arch/x86/include/asm/qspinlock.h  |   57

[Xen-devel] [PATCH v16 09/14] pvqspinlock, x86: Implement the paravirt qspinlock call patching

2015-04-24 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

We use the regular paravirt call patching to switch between:

  native_queue_spin_lock_slowpath() __pv_queue_spin_lock_slowpath()
  native_queue_spin_unlock()__pv_queue_spin_unlock()

We use a callee saved call for the unlock function which reduces the
i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions
again.

We further optimize the unlock path by patching the direct call with a
movb $0,%arg1 if we are indeed using the native unlock code. This
makes the unlock code almost as fast as the !PARAVIRT case.

This significantly lowers the overhead of having
CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code.

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/Kconfig  |2 +-
 arch/x86/include/asm/paravirt.h   |   29 -
 arch/x86/include/asm/paravirt_types.h |   10 ++
 arch/x86/include/asm/qspinlock.h  |   25 -
 arch/x86/include/asm/qspinlock_paravirt.h |6 ++
 arch/x86/kernel/paravirt-spinlocks.c  |   24 +++-
 arch/x86/kernel/paravirt_patch_32.c   |   22 ++
 arch/x86/kernel/paravirt_patch_64.c   |   22 ++
 8 files changed, 128 insertions(+), 12 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock_paravirt.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 49fecb1..a0946e7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -661,7 +661,7 @@ config PARAVIRT_DEBUG
 config PARAVIRT_SPINLOCKS
bool Paravirtualization layer for spinlocks
depends on PARAVIRT  SMP
-   select UNINLINE_SPIN_UNLOCK
+   select UNINLINE_SPIN_UNLOCK if !QUEUE_SPINLOCK
---help---
  Paravirtualized spinlocks allow a pvops backend to replace the
  spinlock implementation with something virtualization-friendly
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 965c47d..f76199a 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -712,6 +712,31 @@ static inline void __set_fixmap(unsigned /* enum 
fixed_addresses */ idx,
 
 #if defined(CONFIG_SMP)  defined(CONFIG_PARAVIRT_SPINLOCKS)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+static __always_inline void pv_queue_spin_lock_slowpath(struct qspinlock *lock,
+   u32 val)
+{
+   PVOP_VCALL2(pv_lock_ops.queue_spin_lock_slowpath, lock, val);
+}
+
+static __always_inline void pv_queue_spin_unlock(struct qspinlock *lock)
+{
+   PVOP_VCALLEE1(pv_lock_ops.queue_spin_unlock, lock);
+}
+
+static __always_inline void pv_wait(u8 *ptr, u8 val)
+{
+   PVOP_VCALL2(pv_lock_ops.wait, ptr, val);
+}
+
+static __always_inline void pv_kick(int cpu)
+{
+   PVOP_VCALL1(pv_lock_ops.kick, cpu);
+}
+
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
@@ -724,7 +749,9 @@ static __always_inline void __ticket_unlock_kick(struct 
arch_spinlock *lock,
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
 
-#endif
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
+#endif /* SMP  PARAVIRT_SPINLOCKS */
 
 #ifdef CONFIG_X86_32
 #define PV_SAVE_REGS pushl %ecx; pushl %edx;
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..f6acaea 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,19 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+struct qspinlock;
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+   void (*queue_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
+   struct paravirt_callee_save queue_spin_unlock;
+
+   void (*wait)(u8 *ptr, u8 val);
+   void (*kick)(int cpu);
+#else /* !CONFIG_QUEUE_SPINLOCK */
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif /* !CONFIG_QUEUE_SPINLOCK */
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 64c925e..c8290db 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -3,6 +3,7 @@
 
 #include asm/cpufeature.h
 #include asm-generic/qspinlock_types.h
+#include asm/paravirt.h
 
 #definequeue_spin_unlock queue_spin_unlock
 /**
@@ -11,11 +12,33 @@
  *
  * A smp_store_release() on the least-significant byte.
  */
-static inline void queue_spin_unlock(struct qspinlock *lock)
+static inline void native_queue_spin_unlock(struct qspinlock *lock)
 {
smp_store_release((u8 *)lock, 0);
 }
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void

[Xen-devel] [PATCH v16 10/14] pvqspinlock, x86: Enable PV qspinlock for KVM

2015-04-24 Thread Waiman Long
This patch adds the necessary KVM specific code to allow KVM to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/kernel/kvm.c |   43 +++
 kernel/Kconfig.locks  |2 +-
 2 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e354cc6..4bb42c0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -584,6 +584,39 @@ static void kvm_kick_cpu(int cpu)
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
 
+
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+#include asm/qspinlock.h
+
+static void kvm_wait(u8 *ptr, u8 val)
+{
+   unsigned long flags;
+
+   if (in_nmi())
+   return;
+
+   local_irq_save(flags);
+
+   if (READ_ONCE(*ptr) != val)
+   goto out;
+
+   /*
+* halt until it's our turn and kicked. Note that we do safe halt
+* for irq enabled case to avoid hang when lock info is overwritten
+* in irq spinlock slowpath and no spurious interrupt occur to save us.
+*/
+   if (arch_irqs_disabled_flags(flags))
+   halt();
+   else
+   safe_halt();
+
+out:
+   local_irq_restore(flags);
+}
+
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
 enum kvm_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -817,6 +850,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, 
__ticket_t ticket)
}
 }
 
+#endif /* !CONFIG_QUEUE_SPINLOCK */
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
@@ -828,8 +863,16 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+   __pv_init_lock_hash();
+   pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath;
+   pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock);
+   pv_lock_ops.wait = kvm_wait;
+   pv_lock_ops.kick = kvm_kick_cpu;
+#else /* !CONFIG_QUEUE_SPINLOCK */
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
pv_lock_ops.unlock_kick = kvm_unlock_kick;
+#endif
 }
 
 static __init int kvm_spinlock_init_jump(void)
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index c6a8f7c..537b13e 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
-   depends on SMP  !PARAVIRT_SPINLOCKS
+   depends on SMP  (!PARAVIRT_SPINLOCKS || !XEN)
 
 config ARCH_USE_QUEUE_RWLOCK
bool
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v16 06/14] qspinlock: Use a simple write to grab the lock

2015-04-24 Thread Waiman Long
Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head doesn't need to reset the tail code. For that case,
a simple write to set the lock bit is enough as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
tail code in the lock is set.
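For reference, the helper added below for this purpose boils down to a
single byte store (reproduced here outside the diff for readability; see
the set_locked() hunk further down):

    /* *,*,0 -> *,0,1 : the queue head takes the lock with a plain store
     * once it has observed both the pending bit and the locked byte clear.
     */
    static __always_inline void set_locked(struct qspinlock *lock)
    {
            struct __qspinlock *l = (void *)lock;

            WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
    }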

With that change, there is a slight improvement in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.

[Standalone/Embedded - same node]
  # of tasks    Before patch    After patch    %Change
  ----------    ------------    -----------    -------
       3        2324/2321       2248/2265      -3%/-2%
       4        2890/2896       2819/2831      -2%/-2%
       5        3611/3595       3522/3512      -2%/-2%
       6        4281/4276       4173/4160      -3%/-3%
       7        5018/5001       4875/4861      -3%/-3%
       8        5759/5750       5563/5568      -3%/-3%

[Standalone/Embedded - different nodes]
  # of tasks    Before patch    After patch    %Change
  ----------    ------------    -----------    -------
       3        12242/12237     12087/12093    -1%/-1%
       4        10688/10696     10507/10521    -2%/-2%

It was also found that this change produced a much bigger performance
improvement on the newer IvyBridge-EX chip, essentially closing
the performance gap between the ticket spinlock and queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
  kernel        JPM        Real Time   Sys Time   Usr Time
  ------        ---        ---------   --------   --------
  ticketlock    5678233    3.17        96.61      5.81
  qspinlock     5750799    3.13        94.83      5.97

AIM7 EXT4 Disk Test
  kernel        JPM        Real Time   Sys Time   Usr Time
  ------        ---        ---------   --------   --------
  ticketlock    1114551    16.15       509.72     7.11
  qspinlock     2184466    8.24        232.99     6.01

The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.

The ebizzy -m test was also run with the following results:

  kernel        records/s   Real Time   Sys Time   Usr Time
  ------        ---------   ---------   --------   --------
  ticketlock 2075   10.00  216.35   3.49
  qspinlock  3023   10.00  198.20   4.80

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 kernel/locking/qspinlock.c |   66 +--
 1 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index bcc99e6..99503ef 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -105,24 +105,37 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
union {
atomic_t val;
-   struct {
 #ifdef __LITTLE_ENDIAN
+   struct {
+   u8  locked;
+   u8  pending;
+   };
+   struct {
u16 locked_pending;
u16 tail;
+   };
 #else
+   struct {
u16 tail;
u16 locked_pending;
-#endif
};
+   struct {
+   u8  reserved[2];
+   u8  pending;
+   u8  locked;
+   };
+#endif
};
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -195,6 +208,19 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,*,0 - *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   WRITE_ONCE(l-locked, _Q_LOCKED_VAL

[Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg

2015-04-24 Thread Waiman Long
In the pv_scan_next() function, the slow cmpxchg atomic operation is
performed even if the other CPU is not even close to being halted. This
extra cmpxchg can harm slowpath performance.

This patch introduces the new mayhalt flag to indicate if the other
spinning CPU is close to being halted or not. The current threshold
for x86 is 2k cpu_relax() calls. If this flag is not set, the other
spinning CPU will have at least 2k more cpu_relax() calls before
it can enter the halt state. This should give enough time for the
setting of the locked flag in struct mcs_spinlock to propagate to
that CPU without using atomic op.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock_paravirt.h |   28 +---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 9b4ac3d..41ee033 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -19,7 +19,8 @@
  * native_queue_spin_unlock().
  */
 
-#define _Q_SLOW_VAL        (3U << _Q_LOCKED_OFFSET)
+#define _Q_SLOW_VAL        (3U << _Q_LOCKED_OFFSET)
+#define MAYHALT_THRESHOLD  (SPIN_THRESHOLD >> 4)
 
 /*
  * The vcpu_hashed is a special state that is set by the new lock holder on
@@ -39,6 +40,7 @@ struct pv_node {
 
int cpu;
u8  state;
+   u8  mayhalt;
 };
 
 /*
@@ -198,6 +200,7 @@ static void pv_init_node(struct mcs_spinlock *node)
 
pn-cpu = smp_processor_id();
pn-state = vcpu_running;
+   pn-mayhalt = false;
 }
 
 /*
@@ -214,23 +217,34 @@ static void pv_wait_node(struct mcs_spinlock *node)
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node-locked))
return;
+   if (loop == MAYHALT_THRESHOLD)
+   xchg(&pn->mayhalt, true);
cpu_relax();
}
 
/*
-* Order pn->state vs pn->locked thusly:
+* Order pn->state/pn->mayhalt vs pn->locked thusly:
 *
-* [S] pn->state = vcpu_halted      [S] next->locked = 1
+* [S] pn->mayhalt = 1              [S] next->locked = 1
+*     MB, delay                        barrier()
+* [S] pn->state = vcpu_halted      [L] pn->mayhalt
 *     MB                               MB
 * [L] pn->locked                   [RmW] pn->state = vcpu_hashed
 *
 * Matches the cmpxchg() from pv_scan_next().
+*
+* As the new lock holder may quit (when pn->mayhalt is not
+* set) without memory barrier, a sufficiently long delay is
+* inserted between the setting of pn->mayhalt and pn->state
+* to ensure that there is enough time for the new pn->locked
+* value to be propagated here to be checked below.
 */
 (void)xchg(&pn->state, vcpu_halted);

 if (!READ_ONCE(node->locked))
 pv_wait(&pn->state, vcpu_halted);

+   pn->mayhalt = false;
/*
 * Reset the state except when vcpu_hashed is set.
 */
@@ -263,6 +277,14 @@ static void pv_scan_next(struct qspinlock *lock, struct 
mcs_spinlock *node)
struct __qspinlock *l = (void *)lock;
 
/*
+* If mayhalt is not set, there is enough time for the just set value
+* in pn->locked to be propagated to the other CPU before it is time
+* to halt.
+*/
+   if (!READ_ONCE(pn->mayhalt))
+   return;
+
+   /*
 * Transition CPU state: halted = hashed
 * Quit if the transition failed.
 */
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v16 03/14] qspinlock: Add pending bit

2015-04-24 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.

It is possible to observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.

In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.
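For illustration, the pending-bit fast path added by this patch behaves
roughly like the sketch below (the helper name is made up for the example;
the real slowpath in the diff also copes with the pending->locked
hand-over window described above):

    /*
     * Illustrative only: a single in-word spinner sets the pending bit,
     * waits for the current owner to release the locked byte, and then
     * takes the lock without touching the MCS queue.
     */
    static inline bool pending_bit_spin(struct qspinlock *lock)
    {
            u32 val = atomic_read(&lock->val);

            /* only when there is no tail and nobody else is pending */
            if (val & ~_Q_LOCKED_MASK)
                    return false;
            if (atomic_cmpxchg(&lock->val, val, val | _Q_PENDING_VAL) != val)
                    return false;

            /* wait for the lock holder, if any, to go away */
            while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
                    cpu_relax();

            clear_pending_set_locked(lock);         /* *,1,0 -> *,0,1 */
            return true;
    }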

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   12 +++-
 kernel/locking/qspinlock.c|  119 +++--
 2 files changed, 107 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index c9348d8..9c3f5c2 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define_Q_SET_MASK(type)   (((1U  _Q_ ## type ## _BITS) - 1)\
   _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS 8
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS1
+#define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS   2
 #define _Q_TAIL_IDX_MASK   _Q_SET_MASK(TAIL_IDX)
 
@@ -54,5 +59,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U  _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 3456819..0351f78 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
return per_cpu_ptr(mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
  *
- * (queue tail, lock value)
- *
- *  fast  :slow  :
unlock
- *:  :
- * uncontended  (0,0)   --:-- (0,1) :-- (*,0)
- *:   | ^./  :
- *:   v   \   |  :
- * uncontended:(n,x) --+-- (n,0) |  :
- *   queue:   | ^--'  |  :
- *:   v   |  :
- * contended  :(*,x) --+-- (*,0) - (*,1) ---'  :
- *   queue: ^--' :
+ * (queue tail, pending bit, lock value)
  *
+ *  fast :slow  :unlock
+ *   :  :
+ * uncontended  (0,0,0) -:-- (0,0,1) --:-- 
(*,*,0)
+ *   :   | ^.--. /  :
+ *   :   v   \  \|  :
+ * pending   :(0,1,1) +-- (0,1,0)   \   |  :
+ *   :   | ^--'  |   |  :
+ *   :   v   |   |  :
+ * uncontended   :(n,x,y) +-- (n,0,0) --'   |  :
+ *   queue   :   | ^--'  |  :
+ *   :   v   |  :
+ * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -'  :
+ *   queue   : ^--' :
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
@@ -121,6 +125,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
+   /*
+* wait for in-progress pending-locked hand-overs
+*
+* 0,1,0 - 0,0,1
+*/
+   if (val == _Q_PENDING_VAL) {
+   while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL

[Xen-devel] [PATCH v16 14/14] pvqspinlock: Collect slowpath lock statistics

2015-04-24 Thread Waiman Long
This patch enables the accumulation of PV qspinlock statistics
when either one of the following three sets of CONFIG parameters
are enabled:

 1) CONFIG_LOCK_STAT  CONFIG_DEBUG_FS
 2) CONFIG_KVM_DEBUG_FS
 3) CONFIG_XEN_DEBUG_FS

The accumulated lock statistics will be reported in debugfs under the
pv-qspinlock directory.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock_paravirt.h |  100 ++-
 1 files changed, 98 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 41ee033..d512d9b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -43,6 +43,86 @@ struct pv_node {
u8  mayhalt;
 };
 
+#if defined(CONFIG_KVM_DEBUG_FS) || defined(CONFIG_XEN_DEBUG_FS) ||\
+   (defined(CONFIG_LOCK_STAT)  defined(CONFIG_DEBUG_FS))
+#define PV_QSPINLOCK_STAT
+#endif
+
+/*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+   pv_stat_wait_head,
+   pv_stat_wait_node,
+   pv_stat_wait_hash,
+   pv_stat_kick_cpu,
+   pv_stat_no_kick,
+   pv_stat_spurious,
+   pv_stat_hash,
+   pv_stat_hops,
+   pv_stat_num /* Total number of statistics counts */
+};
+
+#ifdef PV_QSPINLOCK_STAT
+
+#include linux/debugfs.h
+
+static const char * const stat_fsnames[pv_stat_num] = {
+   [pv_stat_wait_head] = wait_head_count,
+   [pv_stat_wait_node] = wait_node_count,
+   [pv_stat_wait_hash] = wait_hash_count,
+   [pv_stat_kick_cpu]  = kick_cpu_count,
+   [pv_stat_no_kick]   = no_kick_count,
+   [pv_stat_spurious]  = spurious_wakeup,
+   [pv_stat_hash]  = hash_count,
+   [pv_stat_hops]  = hash_hops_count,
+};
+
+static atomic_t pv_stats[pv_stat_num];
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+   struct dentry *d_pvqlock = debugfs_create_dir(pv-qspinlock, NULL);
+   int i;
+
+   if (!d_pvqlock)
+   printk(KERN_WARNING
+  Could not create 'pv-qspinlock' debugfs directory\n);
+
+   for (i = 0; i  pv_stat_num; i++)
+   debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+ (u32 *)pv_stats[i]);
+   return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat)
+{
+   atomic_inc(pv_stats[stat]);
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+   atomic_inc(pv_stats[pv_stat_hash]);
+   atomic_add(hopcnt, pv_stats[pv_stat_hops]);
+}
+
+#else /* PV_QSPINLOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat) { }
+static inline void pvstat_hop(int hopcnt)  { }
+
+#endif /* PV_QSPINLOCK_STAT */
+
 /*
  * Lock and MCS node addresses hash table for fast lookup
  *
@@ -102,11 +182,13 @@ pv_hash(struct qspinlock *lock, struct pv_node *node)
 {
unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he, *end;
+   int hopcnt = 0;
 
init_hash = hash;
for (;;) {
he = pv_lock_hash[hash].ent;
for (end = he + PV_HE_PER_LINE; he < end; he++) {
+   hopcnt++;
if (!cmpxchg(&he->lock, NULL, lock)) {
/*
 * We haven't set the _Q_SLOW_VAL yet. So
@@ -122,6 +204,7 @@ pv_hash(struct qspinlock *lock, struct pv_node *node)
}
 
 done:
+   pvstat_hop(hopcnt);
return &he->lock;
 }
 
@@ -177,8 +260,12 @@ __visible void __pv_queue_spin_unlock(struct qspinlock 
*lock)
 * At this point the memory pointed at by lock can be freed/reused,
 * however we can still use the PV node to kick the CPU.
 */
-   if (READ_ONCE(node->state) != vcpu_running)
+   if (READ_ONCE(node->state) != vcpu_running) {
+   pvstat_inc(pv_stat_kick_cpu);
pv_kick(node->cpu);
+   } else {
+   pvstat_inc(pv_stat_no_kick);
+   }
 }
 /*
  * Include the architecture specific callee-save thunk of the
@@ -241,8 +328,10 @@ static void pv_wait_node(struct mcs_spinlock *node)
 */
(void)xchg(pn-state, vcpu_halted);
 
-   if (!READ_ONCE(node-locked))
+   if (!READ_ONCE(node-locked)) {
+   pvstat_inc(pv_stat_wait_node);
pv_wait(pn-state, vcpu_halted);
+   }
 
pn-mayhalt = false;
/*
@@ -250,6 +339,8 @@ static void pv_wait_node(struct mcs_spinlock *node)
 */
(void)cmpxchg(pn-state, vcpu_halted, vcpu_running);
 
+   if (READ_ONCE(node-locked))
+   break;
/*
 * If the locked flag is still

[Xen-devel] [PATCH v16 12/14] pvqspinlock: Only kick CPU at unlock time

2015-04-24 Thread Waiman Long
Before this patch, a CPU may have been kicked twice before getting
the lock - once before it becomes queue head and once before it gets
the lock. All this CPU kicking and halting (VMEXIT) can be expensive
and slow down system performance, especially in an overcommitted guest.

This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU.

It also adds a second synchronization point in __pv_queue_spin_unlock()
so as to do the pv_kick() only if it is really necessary.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c  |   10 ++--
 kernel/locking/qspinlock_paravirt.h |   76 +-
 2 files changed, 61 insertions(+), 25 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index c009120..0a3a109 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -239,8 +239,8 @@ static __always_inline void set_locked(struct qspinlock 
*lock)
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_scan_next(struct qspinlock *lock,
+  struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_head(struct qspinlock *lock,
   struct mcs_spinlock *node) { }
 
@@ -248,7 +248,7 @@ static __always_inline void __pv_wait_head(struct qspinlock 
*lock,
 
 #define pv_init_node   __pv_init_node
 #define pv_wait_node   __pv_wait_node
-#define pv_kick_node   __pv_kick_node
+#define pv_scan_next   __pv_scan_next
 #define pv_wait_head   __pv_wait_head
 
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
@@ -440,7 +440,7 @@ queue:
cpu_relax();
 
arch_mcs_spin_unlock_contended(next-locked);
-   pv_kick_node(next);
+   pv_scan_next(lock, next);
 
 release:
/*
@@ -461,7 +461,7 @@ EXPORT_SYMBOL(queue_spin_lock_slowpath);
 
 #undef pv_init_node
 #undef pv_wait_node
-#undef pv_kick_node
+#undef pv_scan_next
 #undef pv_wait_head
 
 #undef  queue_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 084e5c1..9b4ac3d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -21,9 +21,16 @@
 
 #define _Q_SLOW_VAL(3U  _Q_LOCKED_OFFSET)
 
+/*
+ * The vcpu_hashed is a special state that is set by the new lock holder on
+ * the new queue head to indicate that _Q_SLOW_VAL is set and hash entry
+ * filled. With this state, the queue head CPU will always be kicked even
+ * if it is not halted to avoid potential racing condition.
+ */
 enum vcpu_state {
vcpu_running = 0,
vcpu_halted,
+   vcpu_hashed
 };
 
 struct pv_node {
@@ -162,13 +169,13 @@ __visible void __pv_queue_spin_unlock(struct qspinlock 
*lock)
 * The queue head has been halted. Need to locate it and wake it up.
 */
node = pv_hash_find(lock);
-   smp_store_release(&l->locked, 0);
+   (void)xchg(&l->locked, 0);
 
/*
 * At this point the memory pointed at by lock can be freed/reused,
 * however we can still use the PV node to kick the CPU.
 */
-   if (READ_ONCE(node->state) == vcpu_halted)
+   if (READ_ONCE(node->state) != vcpu_running)
pv_kick(node->cpu);
 }
 /*
@@ -195,7 +202,8 @@ static void pv_init_node(struct mcs_spinlock *node)
 
 /*
  * Wait for node-locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_scan_next() is used to set _Q_SLOW_VAL and fill in hash table on its
+ * behalf.
  */
 static void pv_wait_node(struct mcs_spinlock *node)
 {
@@ -214,9 +222,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
 *
 * [S] pn-state = vcpu_halted[S] next-locked = 1
 * MB MB
-* [L] pn-locked   [RmW] pn-state = vcpu_running
+* [L] pn-locked   [RmW] pn-state = vcpu_hashed
 *
-* Matches the xchg() from pv_kick_node().
+* Matches the cmpxchg() from pv_scan_next().
 */
(void)xchg(pn-state, vcpu_halted);
 
@@ -224,9 +232,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
pv_wait(pn-state, vcpu_halted);
 
/*
-* Reset the vCPU state to avoid unncessary CPU kicking
+* Reset the state except when vcpu_hashed is set.
 */
-   WRITE_ONCE(pn-state, vcpu_running);
+   (void

[Xen-devel] [PATCH v16 08/14] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-24 Thread Waiman Long
Provide a separate (second) version of the spin_lock_slowpath for
paravirt along with a special unlock path.

The second slowpath is generated by adding a few pv hooks to the
normal slowpath, but where those will compile away for the native
case, they expand into special wait/wake code for the pv version.

The actual MCS queue can use extra storage in the mcs_nodes[] array to
keep track of state and therefore uses directed wakeups.

The head contender has no such storage directly visible to the
unlocker.  So the unlocker searches a hash table with open addressing
using a simple binary Galois linear feedback shift register.
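The unlock-side lookup into that hash table is, in rough outline, the
following (a sketch only -- the entry layout, probing step and helper name
follow my reading of the series and may differ from the final
qspinlock_paravirt.h):

    /* Walk the open-addressed hash table, starting at hash_ptr(lock),
     * until the entry recording this lock is found; return the halted
     * queue head's pv_node and release the entry.
     */
    static struct pv_node *pv_hash_find(struct qspinlock *lock)
    {
            unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
            struct pv_hash_entry *he, *end;
            struct pv_node *node;

            init_hash = hash;
            for (;;) {
                    he = pv_lock_hash[hash].ent;
                    for (end = he + PV_HE_PER_LINE; he < end; he++) {
                            if (READ_ONCE(he->lock) == lock) {
                                    node = READ_ONCE(he->node);
                                    WRITE_ONCE(he->lock, NULL); /* free entry */
                                    return node;
                            }
                    }
                    hash = (hash + 1) & ((1UL << pv_lock_hash_bits) - 1);
                    BUG_ON(hash == init_hash);      /* must be hashed */
            }
    }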

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c  |   68 +++-
 kernel/locking/qspinlock_paravirt.h |  324 +++
 2 files changed, 391 insertions(+), 1 deletions(-)
 create mode 100644 kernel/locking/qspinlock_paravirt.h

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc2e5ab..c009120 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -18,6 +18,9 @@
  * Authors: Waiman Long waiman.l...@hp.com
  *  Peter Zijlstra pet...@infradead.org
  */
+
+#ifndef _GEN_PV_LOCK_SLOWPATH
+
 #include linux/smp.h
 #include linux/bug.h
 #include linux/cpumask.h
@@ -65,13 +68,21 @@
 
 #include mcs_spinlock.h
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define MAX_NODES  8
+#else
+#define MAX_NODES  4
+#endif
+
 /*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ *
+ * PV doubles the storage and uses the second cacheline for PV state.
  */
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
 
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
@@ -220,6 +231,32 @@ static __always_inline void set_locked(struct qspinlock 
*lock)
WRITE_ONCE(l-locked, _Q_LOCKED_VAL);
 }
 
+
+/*
+ * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for
+ * all the PV callbacks.
+ */
+
+static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
+
+static __always_inline void __pv_wait_head(struct qspinlock *lock,
+  struct mcs_spinlock *node) { }
+
+#define pv_enabled()   false
+
+#define pv_init_node   __pv_init_node
+#define pv_wait_node   __pv_wait_node
+#define pv_kick_node   __pv_kick_node
+#define pv_wait_head   __pv_wait_head
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define queue_spin_lock_slowpath   native_queue_spin_lock_slowpath
+#endif
+
+#endif /* _GEN_PV_LOCK_SLOWPATH */
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
@@ -249,6 +286,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
+   if (pv_enabled())
+   goto queue;
+
if (virt_queue_spin_lock(lock))
return;
 
@@ -325,6 +365,7 @@ queue:
node += idx;
node-locked = 0;
node-next = NULL;
+   pv_init_node(node);
 
/*
 * We touched a (possibly) cold cacheline in the per-cpu queue node;
@@ -350,6 +391,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev-next, node);
 
+   pv_wait_node(node);
arch_mcs_spin_lock_contended(node-locked);
}
 
@@ -365,6 +407,7 @@ queue:
 * does not imply a full barrier.
 *
 */
+   pv_wait_head(lock, node);
+   while ((val = smp_load_acquire(&lock->val.counter)) &
_Q_LOCKED_PENDING_MASK)
cpu_relax();
 
@@ -397,6 +440,7 @@ queue:
cpu_relax();
 
arch_mcs_spin_unlock_contended(next-locked);
+   pv_kick_node(next);
 
 release:
/*
@@ -405,3 +449,25 @@ release:
this_cpu_dec(mcs_nodes[0].count);
 }
 EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+/*
+ * Generate the paravirt code for queue_spin_unlock_slowpath().
+ */
+#if !defined(_GEN_PV_LOCK_SLOWPATH)  defined(CONFIG_PARAVIRT_SPINLOCKS)
+#define _GEN_PV_LOCK_SLOWPATH
+
+#undef  pv_enabled
+#define pv_enabled()   true
+
+#undef pv_init_node
+#undef pv_wait_node
+#undef pv_kick_node
+#undef pv_wait_head
+
+#undef  queue_spin_lock_slowpath
+#define queue_spin_lock_slowpath   __pv_queue_spin_lock_slowpath
+
+#include qspinlock_paravirt.h
+#include qspinlock.c
+
+#endif
diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
new file mode 100644
index 000..084e5c1
--- /dev/null
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -0,0 +1,324

[Xen-devel] [PATCH v16 05/14] qspinlock: Optimize for smaller NR_CPUS

2015-04-24 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16bit.
This means we can use xchg16 for the tail part and do away with all
the repeated compxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   13 ++
 kernel/locking/qspinlock.c|   69 -
 2 files changed, 81 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index ef36613..f01b55d 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -35,6 +35,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ * 8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-10: tail index
@@ -47,7 +55,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS8
+#else
 #define _Q_PENDING_BITS1
+#endif
 #define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -58,6 +70,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 11f6ad9..bcc99e6 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -24,6 +24,7 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/mutex.h
+#include asm/byteorder.h
 #include asm/qspinlock.h
 
 /*
@@ -56,6 +57,10 @@
  * node; whereby avoiding the need to carry a node from lock to unlock, and
  * preserving existing lock API. This also makes the unlock code simpler and
  * faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *  atomic operations on smaller 8-bit and 16-bit data types.
+ *
  */
 
 #include mcs_spinlock.h
@@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+   union {
+   atomic_t val;
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u16 locked_pending;
+   u16 tail;
+#else
+   u16 tail;
+   u16 locked_pending;
+#endif
+   };
+   };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,1,0 - *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* - n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -131,6 +192,7 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
}
return old;
 }
+#endif /* _Q_PENDING_BITS == 8

[Xen-devel] [PATCH v16 11/14] pvqspinlock, x86: Enable PV qspinlock for Xen

2015-04-24 Thread Waiman Long
From: David Vrabel david.vra...@citrix.com

This patch adds the necessary Xen specific code to allow Xen to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Signed-off-by: David Vrabel david.vra...@citrix.com
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/xen/spinlock.c |   64 ---
 kernel/Kconfig.locks|2 +-
 2 files changed, 61 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 956374c..3215ffc 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -17,6 +17,56 @@
 #include xen-ops.h
 #include debugfs.h
 
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(char *, irq_name);
+static bool xen_pvspin = true;
+
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+#include asm/qspinlock.h
+
+static void xen_qlock_kick(int cpu)
+{
+   xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+}
+
+/*
+ * Halt the current CPU  release it back to the host
+ */
+static void xen_qlock_wait(u8 *byte, u8 val)
+{
+   int irq = __this_cpu_read(lock_kicker_irq);
+
+   /* If kicker interrupts not initialized yet, just spin */
+   if (irq == -1)
+   return;
+
+   /* clear pending */
+   xen_clear_irq_pending(irq);
+   barrier();
+
+   /*
+* We check the byte value after clearing pending IRQ to make sure
+* that we won't miss a wakeup event because of the clearing.
+*
+* The sync_clear_bit() call in xen_clear_irq_pending() is atomic.
+* So it is effectively a memory barrier for x86.
+*/
+   if (READ_ONCE(*byte) != val)
+   return;
+
+   /*
+* If an interrupt happens here, it will leave the wakeup irq
+* pending, which will cause xen_poll_irq() to return
+* immediately.
+*/
+
+   /* Block until irq becomes pending (or perhaps a spurious wakeup) */
+   xen_poll_irq(irq);
+}
+
+#else /* CONFIG_QUEUE_SPINLOCK */
+
 enum xen_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -100,12 +150,9 @@ struct xen_lock_waiting {
__ticket_t want;
 };
 
-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
-static DEFINE_PER_CPU(char *, irq_name);
 static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
 static cpumask_t waiting_cpus;
 
-static bool xen_pvspin = true;
 __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 {
int irq = __this_cpu_read(lock_kicker_irq);
@@ -217,6 +264,7 @@ static void xen_unlock_kick(struct arch_spinlock *lock, 
__ticket_t next)
}
}
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static irqreturn_t dummy_handler(int irq, void *dev_id)
 {
@@ -280,8 +328,16 @@ void __init xen_init_spinlocks(void)
return;
}
printk(KERN_DEBUG xen: PV spinlocks enabled\n);
+#ifdef CONFIG_QUEUE_SPINLOCK
+   __pv_init_lock_hash();
+   pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath;
+   pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock);
+   pv_lock_ops.wait = xen_qlock_wait;
+   pv_lock_ops.kick = xen_qlock_kick;
+#else
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
pv_lock_ops.unlock_kick = xen_unlock_kick;
+#endif
 }
 
 /*
@@ -310,7 +366,7 @@ static __init int xen_parse_nopvspin(char *arg)
 }
 early_param(xen_nopvspin, xen_parse_nopvspin);
 
-#ifdef CONFIG_XEN_DEBUG_FS
+#if defined(CONFIG_XEN_DEBUG_FS)  !defined(CONFIG_QUEUE_SPINLOCK)
 
 static struct dentry *d_spin_debug;
 
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 537b13e..0b42933 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
-   depends on SMP  (!PARAVIRT_SPINLOCKS || !XEN)
+   depends on SMP
 
 config ARCH_USE_QUEUE_RWLOCK
bool
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v16 07/14] qspinlock: Revert to test-and-set on hypervisors

2015-04-24 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

When we detect a hypervisor (!paravirt, see qspinlock paravirt support
patches), revert to a simple test-and-set lock to avoid the horrors
of queue preemption.

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/qspinlock.h |   14 ++
 include/asm-generic/qspinlock.h  |7 +++
 kernel/locking/qspinlock.c   |3 +++
 3 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 222995b..64c925e 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_QSPINLOCK_H
 #define _ASM_X86_QSPINLOCK_H
 
+#include asm/cpufeature.h
 #include asm-generic/qspinlock_types.h
 
 #definequeue_spin_unlock queue_spin_unlock
@@ -15,6 +16,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
smp_store_release((u8 *)lock, 0);
 }
 
+#define virt_queue_spin_lock virt_queue_spin_lock
+
+static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+   return false;
+
+   while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
+   cpu_relax();
+
+   return true;
+}
+
 #include asm-generic/qspinlock.h
 
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 315d6dc..bcbbc5e 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -111,6 +111,13 @@ static inline void queue_spin_unlock_wait(struct qspinlock 
*lock)
cpu_relax();
 }
 
+#ifndef virt_queue_spin_lock
+static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   return false;
+}
+#endif
+
 /*
  * Initializier
  */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 99503ef..fc2e5ab 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -249,6 +249,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   if (virt_queue_spin_lock(lock))
+   return;
+
/*
 * wait for in-progress pending-locked hand-overs
 *
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v16 02/14] qspinlock, x86: Enable x86-64 to use queue spinlock

2015-04-24 Thread Waiman Long
This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket, the benefit of queue
spinlock may not be apparent, so queue spinlock is not enabled there.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/qspinlock.h  |   20 
 arch/x86/include/asm/spinlock.h   |5 +
 arch/x86/include/asm/spinlock_types.h |4 
 4 files changed, 30 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..49fecb1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -125,6 +125,7 @@ config X86
select MODULES_USE_ELF_RELA if X86_64
select CLONE_BACKWARDS if X86_32
select ARCH_USE_BUILTIN_BSWAP
+   select ARCH_USE_QUEUE_SPINLOCK
select ARCH_USE_QUEUE_RWLOCK
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 000..222995b
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,20 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#define	queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * A smp_store_release() on the least-significant byte.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   smp_store_release((u8 *)lock, 0);
+}
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index cf87de3..a9c01fd 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -196,6 +200,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t 
*lock)
cpu_relax();
}
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 /*
  * Read-write spinlocks, allowing multiple readers
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index 5f9d757..5d654a1 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm-generic/qrwlock_types.h>
 
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-13 Thread Waiman Long

On 04/13/2015 11:09 AM, Peter Zijlstra wrote:

On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:

+__visible void __pv_queue_spin_unlock(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+   struct pv_node *node;
+
+   if (likely(cmpxchg(l-locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
+   return;
+
+   /*
+* The queue head has been halted. Need to locate it and wake it up.
+*/
+   node = pv_hash_find(lock);
+   smp_store_release(l-locked, 0);

Ah yes, clever that.


+   /*
+* At this point the memory pointed at by lock can be freed/reused,
+* however we can still use the PV node to kick the CPU.
+*/
+   if (READ_ONCE(node-state) == vcpu_halted)
+   pv_kick(node-cpu);
+}
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);

However I feel the PV_CALLEE_SAVE_REGS_THUNK thing belongs in the x86
code.

That is why I originally put my version of the qspinlock_paravirt.h header
file under arch/x86/include/asm. Maybe we should move it back there. Putting
the thunk in arch/x86/kernel/kvm.c didn't work when you consider that the
Xen code also needs it.

Well, the function is 'generic' and belongs here I think. It's just the
PV_CALLEE_SAVE_REGS_THUNK thing that is arch specific. Should it live in
arch/x86/kernel/paravirt-spinlocks.c instead?


Another alternative is to put the PV_CALLEE_SAVE_REGS_THUNK into a 
separate asm/qspinlock_paravirt.h file and include it in the generic 
qspinlock_paravirt.h file. By putting the thunk and __pv_queue_spin_unlock 
together in the same file, we can make the thunk and the unlock fast path 
(_Q_SLOW_VAL not set) land in two adjacent instruction cachelines. I think 
that may help a bit in terms of performance.
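Roughly, the layout being floated would look like this (the file path and
contents are my assumption of the idea, not the final patch; the generic
kernel/locking/qspinlock_paravirt.h would then pull this arch header in
instead of carrying the thunk itself):

/* Hypothetical arch/x86/include/asm/qspinlock_paravirt.h (sketch only) */
#ifndef __ASM_QSPINLOCK_PARAVIRT_H
#define __ASM_QSPINLOCK_PARAVIRT_H

/*
 * Keep the register-saving thunk right next to the fast path of
 * __pv_queue_spin_unlock() so that both tend to land in adjacent
 * instruction cachelines.
 */
PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);

#endif /* __ASM_QSPINLOCK_PARAVIRT_H */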


Cheers,
Longman


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-13 Thread Waiman Long

On 04/13/2015 10:47 AM, Peter Zijlstra wrote:

On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:

+void __init __pv_init_lock_hash(void)
+{
+   int pv_hash_size = 4 * num_possible_cpus();
+
+   if (pv_hash_size < (1U << LFSR_MIN_BITS))
+   pv_hash_size = (1U << LFSR_MIN_BITS);
+   /*
+* Allocate space from bootmem which should be page-size aligned
+* and hence cacheline aligned.
+*/
+   pv_lock_hash = alloc_large_system_hash("PV qspinlock",
+  sizeof(struct pv_hash_bucket),
+  pv_hash_size, 0, HASH_EARLY,
+  &pv_lock_hash_bits, NULL,
+  pv_hash_size, pv_hash_size);

pv_taps = lfsr_taps(pv_lock_hash_bits);


I don't understand what you meant here.

Let me explain (even though I propose taking all the LFSR stuff out).

pv_lock_hash_bit is a runtime variable, therefore it cannot compile time
evaluate the forest of if statements required to compute the taps value.

Therefore its best to compute the taps _once_, and this seems like a
good place to do so.


OK, I got it. That make sense.


+   goto done;
+   }
+   }
+
+   hash = lfsr(hash, pv_lock_hash_bits, 0);

Since pv_lock_hash_bits is a variable, you end up running through that
massive if() forest to find the corresponding tap every single time. It
cannot compile-time optimize it.

The minimum bits size is now 8. So unless the system has more than 64 vCPUs,
it will get the right value in the first if statement.

Still, no reason to not pre-compute the taps value, its simple enough.



Still, we need to keep the hash_bits value as it will be needed by the 
hashing function.


I have taken out the lfsr code and use linear probing in the updated 
qspinlock patch that I am working on. However, we can always add that 
back in as an additional patch.
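For reference, here is a standalone model of the linear-probing scheme being
discussed (userspace C, not the kernel patch; the bucket count per cacheline
and the table size are illustrative assumptions):

/* Standalone model of per-cacheline linear probing (not kernel code). */
#include <stdatomic.h>
#include <stddef.h>

#define HB_PER_LINE 4                 /* assumed: buckets per 64-byte cacheline */
#define HASH_BITS   8                 /* assumed minimum table size of 256      */
#define HASH_SIZE   (1u << HASH_BITS)

struct hb {
	_Atomic(void *) lock;         /* NULL means the bucket is free */
	void *node;
};

static struct hb hash_tbl[HASH_SIZE];

static unsigned int hash_align(unsigned int h)
{
	return h & ~(HB_PER_LINE - 1u);
}

/*
 * Claim a bucket for @lock: probe the buckets of the hashed cacheline
 * first, then step to the next cacheline (wrapping around) until an
 * empty bucket is found.  The real table is sized so this terminates.
 */
static struct hb *pv_hash_model(void *lock, void *node, unsigned int hash)
{
	unsigned int line = hash_align(hash) & (HASH_SIZE - 1u);

	for (;; line = (line + HB_PER_LINE) & (HASH_SIZE - 1u)) {
		for (unsigned int i = 0; i < HB_PER_LINE; i++) {
			struct hb *b = &hash_tbl[line + i];
			void *expected = NULL;

			if (atomic_compare_exchange_strong(&b->lock, &expected, lock)) {
				b->node = node;  /* the real code orders this write */
				return b;
			}
		}
	}
}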


Cheers,
Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-13 Thread Waiman Long

On 04/13/2015 11:08 AM, Peter Zijlstra wrote:

On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:


+static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
+{
+   struct __qspinlock *l = (void *)lock;
+   struct qspinlock **lp = NULL;
+   struct pv_node *pn = (struct pv_node *)node;
+   int slow_set = false;
+   int loop;
+
+   for (;;) {
+   for (loop = SPIN_THRESHOLD; loop; loop--) {
+   if (!READ_ONCE(l->locked))
+   return;
+
+   cpu_relax();
+   }
+
+   WRITE_ONCE(pn->state, vcpu_halted);
+   if (!lp)
+   lp = pv_hash(lock, pn);
+   /*
+* lp must be set before setting _Q_SLOW_VAL
+*
+* [S] lp = lock                [RmW] l = l->locked = 0
+*     MB                            MB
+* [S] l->locked = _Q_SLOW_VAL  [L]  lp
+*
+* Matches the cmpxchg() in pv_queue_spin_unlock().
+*/
+   if (!slow_set &&
+   !cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
+   /*
+* The lock is free and _Q_SLOW_VAL has never been
+* set. Need to clear the hash bucket before getting
+* the lock.
+*/
+   WRITE_ONCE(*lp, NULL);
+   return;
+   } else if (slow_set && !READ_ONCE(l->locked))
+   return;
+   slow_set = true;

I'm somewhat puzzled by the slow_set thing; what is wrong with the thing
I had, namely:

if (!lp) {
lp = pv_hash(lock, pn);

/*
 * comment
 */
lv = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
if (lv != _Q_LOCKED_VAL) {
/* we're woken, unhash and return */
WRITE_ONCE(*lp, NULL);
return;
}
}

+
+   pv_wait(&l->locked, _Q_SLOW_VAL);

If we get a spurious wakeup (due to device interrupts or random kick)
we'll loop around but ->locked will remain _Q_SLOW_VAL.

The purpose of the slow_set flag is not about the lock value. It is to make
sure that pv_hash_find() will always find a match. Consider the following
scenario:

cpu1cpu2cpu3

pv_wait
spurious wakeup
loop on l->locked

 read _Q_SLOW_VAL
 pv_hash_find()
 unlock

 pv_hash()- same entry

cmpxchg(l-locked)
clear hash (?)

Here, the entry for cpu3 will be removed, leading to a panic when
pv_hash_find() cannot find the entry. So the hash entry can only be
removed if the other cpu has no chance to see _Q_SLOW_VAL.

Still confused. Afaict that cannot happen with my code. We only do the
cmpxchg(, _Q_SLOW_VAL) _once_.

Only on the first time around that loop will we hash the lock and set
the slow flag. And cpu3 cannot come in on the same entry because we've
not yet released the lock when we find and unhash.




Maybe I am not clear in my illustration. More than one lock can be 
hashed to the same value. So cpu3 is accessing a different lock which 
has the same hash value as the lock used by cpu1 and cpu2.


Anyway, I removed the slow_set flag by unrolling the retry loop so that 
after pv_wait(), it goes into the 2nd loop instead of going back to the 
top. As a result, cmpxchg and pv_hash can only be called once.
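The resulting control flow is roughly the following (a standalone sketch with
stub externs standing in for pv_hash()/pv_wait(); not the actual patch, and
the threshold and lock values are illustrative):

/* Standalone sketch of the unrolled wait-head flow. */
#include <stdatomic.h>

#define SPIN_THRESHOLD (1 << 15)
#define LOCKED_VAL      1
#define SLOW_VAL        3

extern void hash_lock(void);            /* stand-in for pv_hash()           */
extern void unhash_lock(void);          /* stand-in for clearing the bucket */
extern void block_until_kicked(void);   /* stand-in for pv_wait()           */

static void wait_head_model(_Atomic int *locked)
{
	/* 1st loop: plain spinning; no hashing, no cmpxchg. */
	for (int loop = SPIN_THRESHOLD; loop; loop--)
		if (atomic_load(locked) == 0)
			return;

	/* Hash exactly once and flag the lock as slow exactly once. */
	hash_lock();
	int expected = LOCKED_VAL;
	if (!atomic_compare_exchange_strong(locked, &expected, SLOW_VAL)) {
		/* Lock was freed before the slow flag was set: unhash and go. */
		unhash_lock();
		return;
	}

	/*
	 * 2nd loop: after a (possibly spurious) wakeup we only re-check
	 * ->locked; we never go back to the hash/cmpxchg step above, so
	 * both can run at most once.
	 */
	for (;;) {
		block_until_kicked();
		if (atomic_load(locked) == 0)
			return;
	}
}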


Cheers,
Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-09 Thread Waiman Long

On 04/09/2015 02:23 PM, Peter Zijlstra wrote:

On Thu, Apr 09, 2015 at 08:13:27PM +0200, Peter Zijlstra wrote:

On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote:

+#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket))
+static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
+{
+   unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
+   struct pv_hash_bucket *hb, *end;
+
+   if (!hash)
+   hash = 1;
+
+   init_hash = hash;
+   hb = &pv_lock_hash[hash_align(hash)];
+   for (;;) {
+   for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
+   if (!cmpxchg(&hb->lock, NULL, lock)) {
+   WRITE_ONCE(hb->node, node);
+   /*
+* We haven't set the _Q_SLOW_VAL yet. So
+* the order of writing doesn't matter.
+*/
+   smp_wmb(); /* matches rmb from pv_hash_find */
+   goto done;
+   }
+   }
+
+   hash = lfsr(hash, pv_lock_hash_bits, 0);

Since pv_lock_hash_bits is a variable, you end up running through that
massive if() forest to find the corresponding tap every single time. It
cannot compile-time optimize it.

Hence:
hash = lfsr(hash, pv_taps);

(I don't get the bits argument to the lfsr).

In any case, like I said before, I think we should try a linear probe
sequence first, the lfsr was over engineering from my side.


+   hb = &pv_lock_hash[hash_align(hash)];
  

So one thing this does -- and one of the reasons I figured I should
ditch the LFSR instead of fixing it -- is that you end up scanning each
bucket HB_PER_LINE times.


I was aware of that when I was trying to add the hash table debug code, 
but I wanted to get the code out for review and so haven't made any change 
yet. I have just done testing by adding some debug code to check the 
hashing efficiency. With the kernel build workload, with over 1M calls 
to pv_hash(), all of them got an empty entry on the first try. Maybe the 
minimum hash table size of 256 helps.
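The check was of this general shape (a hypothetical standalone sketch, not
the actual debug code): count how often the very first probed bucket is
already empty.

#include <stdatomic.h>

static _Atomic long pv_hash_calls;      /* total pv_hash() invocations         */
static _Atomic long pv_hash_first_try;  /* empty bucket claimed on first probe */

/* Called from the hashing path with the number of buckets probed. */
static inline void account_hash(int probes)
{
	atomic_fetch_add(&pv_hash_calls, 1);
	if (probes == 1)
		atomic_fetch_add(&pv_hash_first_try, 1);
}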




The 'fix' would be to LFSR on cachelines instead of HBs but then you're
stuck with the 0-th cacheline.


This should not be a big problem. I just need to add a check at the end 
of the for loop that if hash is 0, change it to a certain non-0 value 
instead of calling lfsr().


As for ditching the lfsr idea, I am fine with that. So there will be 4 
entries (1 cacheline) for each hash value. If all the entries are full, 
we proceed to the next cacheline.  Right?


Cheers,
Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock

2015-04-08 Thread Waiman Long
For a virtual guest with the qspinlock patch, a simple unfair byte lock
will be used if PV spinlock is not configured in or the hypervisor is
neither KVM nor Xen. The byte lock works fine with a small guest of just
a few vCPUs. On a much larger guest, however, the byte lock can have
serious performance problems.

There are 2 major drawbacks for the unfair byte lock:
 1) The constant reading and occasional writing to the lock word can
    put a lot of cacheline contention traffic on the affected
    cacheline.
 2) Lock starvation is a real possibility especially if the number
of vCPUs is large.

This patch introduces a queue based unfair lock where all the vCPUs
on the queue can opportunistically steal the lock, but the frequency
of doing so decreases exponentially the further it is away from the
queue head. This can greatly reduce cacheline contention problem as
well as encouraging a more FIFO like order of getting the lock.
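As a rough standalone model of that idea (the base period, the cap, and the
cmpxchg-based steal are assumptions for illustration, not the values or code
used in the patch):

/* Toy model: steal frequency decreases exponentially with distance
 * from the queue head. */
#include <stdatomic.h>
#include <stdbool.h>

#define STEAL_PERIOD_BASE 4u   /* assumed base period for the queue head */

/* A waiter `depth` nodes behind the queue head only attempts a steal
 * once every (STEAL_PERIOD_BASE << depth) spins. */
static bool may_steal(unsigned int depth, unsigned int spins)
{
	unsigned int shift = depth < 16 ? depth : 16;

	return (spins % (STEAL_PERIOD_BASE << shift)) == 0;
}

static bool try_steal(_Atomic int *lock, unsigned int depth, unsigned int spins)
{
	int expected = 0;

	return may_steal(depth, spins) &&
	       atomic_compare_exchange_strong(lock, &expected, 1);
}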

This patch has no impact on native qspinlock performance at all.

To measure the performance benefit of such a scheme, a linux kernel
build workload (make -j n) was run on a virtual KVM guest with
different configurations on a 8-socket with 4 active cores/socket
Westmere-EX server (8x4). n is the number of vCPUs in the guest. The
test configurations were:

 1) 32 NUMA-pinned vCPUs (8x4)
 2) 32 vCPUs (no pinning or NUMA)
 3) 60 vCPUs (overcommitted)

The kernel build times for different 4.0-rc6 based kernels were:

  Kernel              VM1        VM2         VM3
  ----------------    -------    --------    ---------
  PV ticket lock      3m49.7s    11m45.6s    15m59.3s
  PV qspinlock        3m50.4s     5m51.6s    15m33.6s
  unfair byte lock    3m50.6s    61m01.1s   122m07.7s
  unfair qspinlock    3m48.3s     5m03.1s     4m31.1s

For VM1, all the locks give essentially the same performance. In both
VM2 and VM3, the performance of byte lock was terrible whereas the
unfair qspinlock was the fastest of all. For this particular test, the
unfair qspinlock can be more than 10X faster than a simple byte lock.

VM2 also illustrated the performance benefit of qspinlock versus
ticket spinlock which got reduced in VM3 due to the overhead of
constant vCPUs halting and kicking.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/qspinlock.h  |   15 +--
 kernel/locking/qspinlock.c|   94 +--
 kernel/locking/qspinlock_unfair.h |  231 +
 3 files changed, 319 insertions(+), 21 deletions(-)
 create mode 100644 kernel/locking/qspinlock_unfair.h

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index c8290db..8113685 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -39,17 +39,16 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
 }
 #endif
 
-#define virt_queue_spin_lock virt_queue_spin_lock
+#ifndef static_cpu_has_hypervisor
+#define static_cpu_has_hypervisor  static_cpu_has(X86_FEATURE_HYPERVISOR)
+#endif
 
-static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+#define queue_spin_trylock_unfair queue_spin_trylock_unfair
+static inline bool queue_spin_trylock_unfair(struct qspinlock *lock)
 {
-   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
-   return false;
-
-   while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
-   cpu_relax();
+   u8 *l = (u8 *)lock;
 
-   return true;
+   return !READ_ONCE(*l) && (xchg(l, _Q_LOCKED_VAL) == 0);
 }
 
 #include asm-generic/qspinlock.h
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b9ba83b..5fda6d5 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -19,7 +19,11 @@
  *  Peter Zijlstra pet...@infradead.org
  */
 
-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if defined(_GEN_PV_LOCK_SLOWPATH) || defined(_GEN_UNFAIR_LOCK_SLOWPATH)
+#define _GEN_LOCK_SLOWPATH
+#endif
+
+#ifndef _GEN_LOCK_SLOWPATH
 
 #include <linux/smp.h>
 #include <linux/bug.h>
@@ -68,12 +72,6 @@
 
 #include mcs_spinlock.h
 
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define MAX_NODES  8
-#else
-#define MAX_NODES  4
-#endif
-
 /*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
@@ -81,7 +79,9 @@
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
  *
  * PV doubles the storage and uses the second cacheline for PV state.
+ * Unfair lock (mutually exclusive to PV) also uses the second cacheline.
  */
+#define MAX_NODES  8
 static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
 
 /*
@@ -234,7 +234,7 @@ static __always_inline void set_locked(struct qspinlock 
*lock)
 
 /*
  * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for
- * all the PV callbacks.
+ * all the PV and unfair callbacks.
  */
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
@@ -244,19 +244,36 @@ static __always_inline

Re: [Xen-devel] [PATCH v15 12/15] pvqspinlock, x86: Enable PV qspinlock for Xen

2015-04-08 Thread Waiman Long

On 04/08/2015 08:01 AM, David Vrabel wrote:

On 07/04/15 03:55, Waiman Long wrote:

This patch adds the necessary Xen specific code to allow Xen to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

This basically looks the same as the version I wrote, except I think you
broke it.


+static void xen_qlock_wait(u8 *byte, u8 val)
+{
+   int irq = __this_cpu_read(lock_kicker_irq);
+
+   /* If kicker interrupts not initialized yet, just spin */
+   if (irq == -1)
+   return;
+
+   /* clear pending */
+   xen_clear_irq_pending(irq);
+
+   /*
+* We check the byte value after clearing pending IRQ to make sure
+* that we won't miss a wakeup event because of the clearing.

My version had a barrier() here to ensure this.  The documentation of
READ_ONCE() suggests that it is not sufficient to meet this requirement
(and a READ_ONCE() here is not required anyway).


I had the misconception that READ_ONCE/WRITE_ONCE() implies a 
compiler barrier. You are right that it may not be the case. I will add 
back the missing barrier when I update the patch and add you as the 
author of this patch if you don't mind.
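For clarity, this is roughly what the updated function would look like with
the barrier added back (a sketch only -- the final hunk may differ, and the
use of xen_poll_irq() as the block-until-kicked primitive is an assumption
on my part):

static void xen_qlock_wait(u8 *byte, u8 val)
{
	int irq = __this_cpu_read(lock_kicker_irq);

	/* If kicker interrupts not initialized yet, just spin */
	if (irq == -1)
		return;

	/* clear pending */
	xen_clear_irq_pending(irq);

	/*
	 * The re-read of *byte below must not be reordered (by the
	 * compiler) above the clearing of the pending bit, or a wakeup
	 * event could be missed.
	 */
	barrier();

	if (READ_ONCE(*byte) != val)
		return;

	/* Block until kicked (or until a spurious event arrives). */
	xen_poll_irq(irq);
}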


Cheers,
Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 01/15] qspinlock: A simple generic 4-byte queue spinlock

2015-04-06 Thread Waiman Long
This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities.  By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
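To make the 24-bit encoding concrete, here is a small standalone illustration
(userspace C, not the kernel code; the bit offsets follow the final layout in
qspinlock_types.h once the pending bit is added, and the demo values are
arbitrary):

#include <stdio.h>

#define TAIL_IDX_OFFSET 9
#define TAIL_IDX_BITS   2
#define TAIL_CPU_OFFSET (TAIL_IDX_OFFSET + TAIL_IDX_BITS)

static unsigned int encode_tail(int cpu, int idx)
{
	/* the CPU number is stored +1 so that an empty tail reads as 0 */
	return ((unsigned int)(cpu + 1) << TAIL_CPU_OFFSET) |
	       ((unsigned int)idx << TAIL_IDX_OFFSET);
}

static void decode_tail(unsigned int tail, int *cpu, int *idx)
{
	*cpu = (int)(tail >> TAIL_CPU_OFFSET) - 1;
	*idx = (tail >> TAIL_IDX_OFFSET) & ((1 << TAIL_IDX_BITS) - 1);
}

int main(void)
{
	int cpu, idx;
	unsigned int t = encode_tail(5, 2);   /* CPU 5, nesting level 2 */

	decode_tail(t, &cpu, &idx);
	printf("tail=0x%x cpu=%d idx=%d\n", t, cpu, idx);
	return 0;
}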

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 include/asm-generic/qspinlock.h   |  132 +
 include/asm-generic/qspinlock_types.h |   58 +
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 +
 kernel/locking/mcs_spinlock.h |1 +
 kernel/locking/qspinlock.c|  209 +
 6 files changed, 408 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 000..315d6dc
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,132 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *  locked wrt the lockref code to avoid lock stealing by the lockref
+ *  code and change things underneath the lock. This also allows some
+ *  optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+   return !atomic_read(&lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+   if (!atomic_read(&lock->val) &&
+  (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+   u32

[Xen-devel] [PATCH v15 11/15] pvqspinlock, x86: Enable PV qspinlock for KVM

2015-04-06 Thread Waiman Long
This patch adds the necessary KVM specific code to allow KVM to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/kernel/kvm.c |   43 +++
 kernel/Kconfig.locks  |2 +-
 2 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index e354cc6..4bb42c0 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -584,6 +584,39 @@ static void kvm_kick_cpu(int cpu)
kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
 }
 
+
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+#include <asm/qspinlock.h>
+
+static void kvm_wait(u8 *ptr, u8 val)
+{
+   unsigned long flags;
+
+   if (in_nmi())
+   return;
+
+   local_irq_save(flags);
+
+   if (READ_ONCE(*ptr) != val)
+   goto out;
+
+   /*
+* halt until it's our turn and kicked. Note that we do safe halt
+* for irq enabled case to avoid hang when lock info is overwritten
+* in irq spinlock slowpath and no spurious interrupt occur to save us.
+*/
+   if (arch_irqs_disabled_flags(flags))
+   halt();
+   else
+   safe_halt();
+
+out:
+   local_irq_restore(flags);
+}
+
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
 enum kvm_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -817,6 +850,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, 
__ticket_t ticket)
}
 }
 
+#endif /* !CONFIG_QUEUE_SPINLOCK */
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
@@ -828,8 +863,16 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+   __pv_init_lock_hash();
+   pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath;
+   pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock);
+   pv_lock_ops.wait = kvm_wait;
+   pv_lock_ops.kick = kvm_kick_cpu;
+#else /* !CONFIG_QUEUE_SPINLOCK */
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(kvm_lock_spinning);
pv_lock_ops.unlock_kick = kvm_unlock_kick;
+#endif
 }
 
 static __init int kvm_spinlock_init_jump(void)
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index c6a8f7c..537b13e 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
-   depends on SMP && !PARAVIRT_SPINLOCKS
+   depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
 
 config ARCH_USE_QUEUE_RWLOCK
bool
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 04/15] qspinlock: Extract out code snippets for the next patch

2015-04-06 Thread Waiman Long
This is a preparatory patch that extracts out the following 2 code
snippets to prepare for the next performance optimization patch.

 1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
 2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.

This patch also simplifies the trylock operation before queuing by
calling queue_spin_trylock() directly.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 include/asm-generic/qspinlock_types.h |2 +
 kernel/locking/qspinlock.c|   79 -
 2 files changed, 50 insertions(+), 31 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index 9c3f5c2..ef36613 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -58,6 +58,8 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 0351f78..11f6ad9 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -97,6 +97,42 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 /**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+   atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
+}
+
+/**
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   u32 old, new, val = atomic_read(&lock->val);
+
+   for (;;) {
+   new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+   return old;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -178,15 +214,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 *
 * *,1,0 -> *,0,1
 */
-   for (;;) {
-   new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   clear_pending_set_locked(lock);
return;
 
/*
@@ -203,37 +231,26 @@ queue:
node-next = NULL;
 
/*
-* We have already touched the queueing cacheline; don't bother with
-* pending stuff.
-*
-* trylock || xchg(lock, node)
-*
-* 0,0,0 -> 0,0,1 ; no tail, not locked -> no tail, locked.
-* p,y,x -> n,y,x ; tail was p -> tail is n; preserving locked.
+* We touched a (possibly) cold cacheline in the per-cpu queue node;
+* attempt the trylock once more in the hope someone let go while we
+* weren't watching.
 */
-   for (;;) {
-   new = _Q_LOCKED_VAL;
-   if (val)
-   new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   if (queue_spin_trylock(lock))
+   goto release;
 
/*
-* we won the trylock; forget about queueing.
+* We have already touched the queueing cacheline; don't bother with
+* pending stuff.
+*
+* p,*,* -> n,*,*
 */
-   if (new == _Q_LOCKED_VAL)
-   goto release;
+   old = xchg_tail(lock, tail);
 
/*
 * if there was a previous node; link it and wait until reaching the
 * head of the waitqueue.
 */
-   if (old & ~_Q_LOCKED_PENDING_MASK) {
+   if (old & _Q_TAIL_MASK) {
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
 
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 07/15] qspinlock: Revert to test-and-set on hypervisors

2015-04-06 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

When we detect a hypervisor (!paravirt, see qspinlock paravirt support
patches), revert to a simple test-and-set lock to avoid the horrors
of queue preemption.

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/include/asm/qspinlock.h |   14 ++
 include/asm-generic/qspinlock.h  |7 +++
 kernel/locking/qspinlock.c   |3 +++
 3 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 222995b..64c925e 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_QSPINLOCK_H
 #define _ASM_X86_QSPINLOCK_H
 
+#include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
 
 #define	queue_spin_unlock queue_spin_unlock
@@ -15,6 +16,19 @@ static inline void queue_spin_unlock(struct qspinlock *lock)
smp_store_release((u8 *)lock, 0);
 }
 
+#define virt_queue_spin_lock virt_queue_spin_lock
+
+static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+   return false;
+
+   while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
+   cpu_relax();
+
+   return true;
+}
+
 #include <asm-generic/qspinlock.h>
 
 #endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 315d6dc..bcbbc5e 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -111,6 +111,13 @@ static inline void queue_spin_unlock_wait(struct qspinlock 
*lock)
cpu_relax();
 }
 
+#ifndef virt_queue_spin_lock
+static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   return false;
+}
+#endif
+
 /*
  * Initializier
  */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 99503ef..fc2e5ab 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -249,6 +249,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   if (virt_queue_spin_lock(lock))
+   return;
+
/*
 * wait for in-progress pending-locked hand-overs
 *
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 03/15] qspinlock: Add pending bit

2015-04-06 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.

It is possible to observe the pending bit without the locked bit when
the last owner has just released the lock but the pending owner has not
yet taken ownership.

In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.
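The key observation can be modelled in a few lines of standalone C (a
simplified sketch, not the slowpath itself; the bit positions are taken from
qspinlock_types.h):

#include <stdatomic.h>

#define LOCKED_MASK  (1u << 0)    /* locked byte, reduced to one bit here */
#define PENDING_MASK (1u << 8)    /* the pending bit                      */

static unsigned int wait_for_pending_handover(_Atomic unsigned int *val)
{
	unsigned int v = atomic_load(val);

	/*
	 * (*,1,0): the previous owner released the lock but the pending
	 * waiter has not yet taken ownership.  That hand-over completes
	 * soon, so spin here instead of touching the queueing cacheline.
	 */
	while (v == PENDING_MASK)
		v = atomic_load(val);

	return v;   /* caller then tries the pending bit itself, or queues */
}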

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   12 +++-
 kernel/locking/qspinlock.c|  119 +++--
 2 files changed, 107 insertions(+), 24 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index c9348d8..9c3f5c2 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define _Q_SET_MASK(type)  (((1U << _Q_ ## type ## _BITS) - 1)\
				      << _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS 8
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS	1
+#define _Q_PENDING_MASK	_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS   2
 #define _Q_TAIL_IDX_MASK   _Q_SET_MASK(TAIL_IDX)
 
@@ -54,5 +59,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 3456819..0351f78 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
return per_cpu_ptr(mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
  *
- * (queue tail, lock value)
- *
- *              fast      :    slow                                  :    unlock
- *                        :                                          :
- * uncontended  (0,0)   --:--> (0,1) --------------------------------:--> (*,0)
- *                        :       | ^--------.                    /  :
- *                        :       v           \                   |  :
- * uncontended            :    (n,x) --+--> (n,0)                 |  :
- *   queue                :       | ^--'                          |  :
- *                        :       v                               |  :
- * contended              :    (*,x) --+--> (*,0) -----> (*,1) ---'  :
- *   queue                :         ^--'                             :
+ * (queue tail, pending bit, lock value)
  *
+ *              fast     :    slow                                  :    unlock
+ *                       :                                          :
+ * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ *                       :       | ^--------.------.             /  :
+ *                       :       v           \      \            |  :
+ * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
+ *                       :       | ^--'              |           |  :
+ *                       :       v                   |           |  :
+ * uncontended           :    (n,x,y) +--> (n,0,0) --'            |  :
+ *   queue               :       | ^--'                          |  :
+ *                       :       v                               |  :
+ * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
+ *   queue               :         ^--'                             :
  */
 void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 {
@@ -121,6 +125,75 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
BUILD_BUG_ON(CONFIG_NR_CPUS = (1U  _Q_TAIL_CPU_BITS));
 
+   /*
+* wait for in-progress pending-locked hand-overs
+*
+* 0,1,0 - 0,0,1
+*/
+   if (val == _Q_PENDING_VAL) {
+   while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL

[Xen-devel] [PATCH v15 05/15] qspinlock: Optimize for smaller NR_CPUS

2015-04-06 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16 bits.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   13 ++
 kernel/locking/qspinlock.c|   69 -
 2 files changed, 81 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/qspinlock_types.h 
b/include/asm-generic/qspinlock_types.h
index ef36613..f01b55d 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -35,6 +35,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ * 8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-10: tail index
@@ -47,7 +55,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS8
+#else
 #define _Q_PENDING_BITS1
+#endif
 #define _Q_PENDING_MASK_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -58,6 +70,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 11f6ad9..bcc99e6 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -24,6 +24,7 @@
 #include linux/percpu.h
 #include linux/hardirq.h
 #include linux/mutex.h
+#include <asm/byteorder.h>
 #include <asm/qspinlock.h>
 
 /*
@@ -56,6 +57,10 @@
  * node; whereby avoiding the need to carry a node from lock to unlock, and
  * preserving existing lock API. This also makes the unlock code simpler and
  * faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *  atomic operations on smaller 8-bit and 16-bit data types.
+ *
  */
 
 #include mcs_spinlock.h
@@ -96,6 +101,62 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+   union {
+   atomic_t val;
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u16 locked_pending;
+   u16 tail;
+#else
+   u16 tail;
+   u16 locked_pending;
+#endif
+   };
+   };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+}
+
+#else /* _Q_PENDING_BITS == 8 */
+
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -131,6 +192,7 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
}
return old;
 }
+#endif /* _Q_PENDING_BITS == 8

[Xen-devel] [PATCH v15 14/15] pvqspinlock: Improve slowpath performance by avoiding cmpxchg

2015-04-06 Thread Waiman Long
In the pv_scan_next() function, the slow cmpxchg atomic operation is
performed even if the other CPU is not even close to being halted. This
extra cmpxchg can harm slowpath performance.

This patch introduces the new mayhalt flag to indicate if the other
spinning CPU is close to being halted or not. The current threshold
for x86 is 2k cpu_relax() calls. If this flag is not set, the other
spinning CPU will have at least 2k more cpu_relax() calls before
it can enter the halt state. This should give enough time for the
setting of the locked flag in struct mcs_spinlock to propagate to
that CPU without using atomic op.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock_paravirt.h |   28 +---
 1 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index a210061..a9fe10d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -16,7 +16,8 @@
  * native_queue_spin_unlock().
  */
 
-#define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
+#define _Q_SLOW_VAL		(3U << _Q_LOCKED_OFFSET)
+#define MAYHALT_THRESHOLD	(SPIN_THRESHOLD >> 4)
 
 /*
  * The vcpu_hashed is a special state that is set by the new lock holder on
@@ -36,6 +37,7 @@ struct pv_node {
 
int cpu;
u8  state;
+   u8  mayhalt;
 };
 
 /*
@@ -187,6 +189,7 @@ static void pv_init_node(struct mcs_spinlock *node)
 
 	pn->cpu = smp_processor_id();
 	pn->state = vcpu_running;
+	pn->mayhalt = false;
 }
 
 /*
@@ -203,17 +206,27 @@ static void pv_wait_node(struct mcs_spinlock *node)
for (loop = SPIN_THRESHOLD; loop; loop--) {
 		if (READ_ONCE(node->locked))
return;
+   if (loop == MAYHALT_THRESHOLD)
+   xchg(&pn->mayhalt, true);
cpu_relax();
}
 
/*
-* Order pn->state vs pn->locked thusly:
+* Order pn->state/pn->mayhalt vs pn->locked thusly:
 *
-* [S] pn->state = vcpu_halted    [S] next->locked = 1
+* [S] pn->mayhalt = 1            [S] next->locked = 1
+*     MB, delay                      barrier()
+* [S] pn->state = vcpu_halted    [L] pn->mayhalt
 *     MB                             MB
 * [L] pn->locked                 [RmW] pn->state = vcpu_hashed
 *
 * Matches the cmpxchg() from pv_scan_next().
+*
+* As the new lock holder may quit (when pn->mayhalt is not
+* set) without memory barrier, a sufficiently long delay is
+* inserted between the setting of pn->mayhalt and pn->state
+* to ensure that there is enough time for the new pn->locked
+* value to be propagated here to be checked below.
 */
 (void)xchg(&pn->state, vcpu_halted);
 
@@ -226,6 +239,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
 * needs to move on to pv_wait_head().
 */
 	(void)cmpxchg(&pn->state, vcpu_halted, vcpu_running);
+	pn->mayhalt = false;
}
 
/*
@@ -246,6 +260,14 @@ static void pv_scan_next(struct qspinlock *lock, struct 
mcs_spinlock *node)
struct __qspinlock *l = (void *)lock;
 
/*
+* If mayhalt is not set, there is enough time for the just set value
+* in pn-locked to be propagated to the other CPU before it is time
+* to halt.
+*/
+   if (!READ_ONCE(pn->mayhalt))
+   return;
+
+   /*
 * Transition CPU state: halted = hashed
 * Quit if the transition failed.
 */
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 00/15] qspinlock: a 4-byte queue spinlock with PV support

2015-04-06 Thread Waiman Long
 data/reaim.config -f data/workfile.compute -i 16 -e 256
Num     Parent    Child     Child    Jobs per    Jobs/min/   Std_dev   Std_dev   JTI
Forked  Time      SysTime   UTime    Minute      Child       Time      Percent
1         2.15       0.36     1.19    2818.60     2818.60     0.00      0.00     100
17        6.29      42.41    22.43   16378.38      963.43     0.27      4.43      95
33       56.24    1611.77    47.04    3555.83      107.75     1.58      2.88      97
49      120.88    5576.87    63.54    2456.49       50.13     1.75      1.47      98
65      199.17   12328.42    81.27    1977.71       30.43     2.42      1.23      98
81      270.89   20943.99   103.99    1812.03       22.37     3.85      1.44      98
97      315.31   29211.63   121.30    1864.26       19.22     4.48      1.44      98
113     397.68   42766.15   144.48    1721.94       15.24     7.04      1.80      98
129     520.74   63442.27   162.03    1501.21       11.64    10.24      1.99      98
...

Patched with qspinlocks v14:

Num     Parent    Child     Child    Jobs per    Jobs/min/   Std_dev   Std_dev   JTI
Forked  Time      SysTime   UTime    Minute      Child       Time      Percent
1         1.98       0.36     1.21    3060.61     3060.61     0.00      0.00     100
17        5.60      25.41    22.39   18396.43     1082.14     0.16      3.00      97
33       12.67     153.64    49.17   15783.74      478.30     0.50      4.16      95
49       21.15     365.55    76.61   14039.72      286.52     0.83      4.19      95
65       27.57     556.87   105.96   14287.27      219.80     0.75      2.80      97
81       33.07     712.92   127.85   14843.06      183.25     1.16      3.69      96
97       42.81    1009.02   161.87   13730.90      141.56     1.75      4.27      95
113      49.13    1166.04   185.10   13938.12      123.35     2.11      4.53      95
129      56.62    1262.07   231.92   13806.78      107.03     2.67      4.99      95
145      66.41    1420.29   250.24   13231.44       91.25     2.95      4.70      95
179      90.15    2237.78   325.18   12032.61       67.22     4.23      4.94      95
193      85.41    1865.81   326.64   13693.71       70.95     4.29      5.32      94
220      93.58    2298.77   376.46   14246.63       64.76     4.97      5.65      94
Max Jobs per Minute 18396.43 

The compute workload is mostly userspace code with some synchronization
between working threads (HPC type workload). With a large user count, the
patched kernel has close to 8X the performance of an unpatched kernel.

Peter Zijlstra (Intel) (4):
  qspinlock: Add pending bit
  qspinlock: Optimize for smaller NR_CPUS
  qspinlock: Revert to test-and-set on hypervisors
  pvqspinlock: Implement the paravirt qspinlock for x86

Waiman Long (11):
  qspinlock: A simple generic 4-byte queue spinlock
  qspinlock, x86: Enable x86-64 to use queue spinlock
  qspinlock: Extract out code snippets for the next patch
  qspinlock: Use a simple write to grab the lock
  lfsr: a simple binary Galois linear feedback shift register
  pvqspinlock: Implement simple paravirt support for the qspinlock
  pvqspinlock, x86: Enable PV qspinlock for KVM
  pvqspinlock, x86: Enable PV qspinlock for Xen
  pvqspinlock: Only kick CPU at unlock time
  pvqspinlock: Improve slowpath performance by avoiding cmpxchg
  pvqspinlock: Add debug code to check for PV lock hash sanity

 arch/x86/Kconfig  |3 +-
 arch/x86/include/asm/paravirt.h   |   28 ++-
 arch/x86/include/asm/paravirt_types.h |   10 +
 arch/x86/include/asm/qspinlock.h  |   57 
 arch/x86/include/asm/spinlock.h   |5 +
 arch/x86/include/asm/spinlock_types.h |4 +
 arch/x86/kernel/kvm.c |   43 +++
 arch/x86/kernel/paravirt-spinlocks.c  |   24 ++-
 arch/x86/kernel/paravirt_patch_32.c   |   22 ++-
 arch/x86/kernel/paravirt_patch_64.c   |   22 ++-
 arch/x86/xen/spinlock.c   |   63 -
 include/asm-generic/qspinlock.h   |  139 ++
 include/asm-generic/qspinlock_types.h |   79 ++
 include/linux/lfsr.h  |   80 ++
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 +
 kernel/locking/mcs_spinlock.h |1 +
 kernel/locking/qspinlock.c|  474 +
 kernel/locking/qspinlock_paravirt.h   |  433 ++
 19 files changed, 1480 insertions(+), 15 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 include/linux/lfsr.h
 create mode 100644 kernel/locking/qspinlock.c
 create mode 100644 kernel/locking/qspinlock_paravirt.h


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 13/15] pvqspinlock: Only kick CPU at unlock time

2015-04-06 Thread Waiman Long
Before this patch, a CPU may have been kicked twice before getting
the lock - one before it becomes queue head and once before it gets
the lock. All these CPU kicking and halting (VMEXIT) can be expensive
and slow down system performance, especially in an overcommitted guest.

This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU.
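A minimal standalone model of the unlock-side decision (kick_cpu() stands in
for pv_kick(), and treating vcpu_hashed the same as vcpu_halted here is an
assumption based on the description above, not the exact patch logic):

#include <stdatomic.h>

enum vcpu_state { vcpu_running = 0, vcpu_halted, vcpu_hashed };

struct pv_node_model {
	_Atomic int state;
	int cpu;
};

extern void kick_cpu(int cpu);   /* stand-in for pv_kick() */

static void unlock_kick_model(struct pv_node_model *head)
{
	int s = atomic_load(&head->state);

	/* A still-running queue head is never kicked; the VMEXIT is only
	 * paid when the head actually parked or was marked hashed. */
	if (s == vcpu_halted || s == vcpu_hashed)
		kick_cpu(head->cpu);
}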

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c  |   10 ++--
 kernel/locking/qspinlock_paravirt.h |   76 +--
 2 files changed, 59 insertions(+), 27 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 33b3f54..b9ba83b 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -239,8 +239,8 @@ static __always_inline void set_locked(struct qspinlock 
*lock)
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-
+static __always_inline void __pv_scan_next(struct qspinlock *lock,
+  struct mcs_spinlock *node) { }
 static __always_inline void __pv_wait_head(struct qspinlock *lock,
   struct mcs_spinlock *node) { }
 
@@ -248,7 +248,7 @@ static __always_inline void __pv_wait_head(struct qspinlock 
*lock,
 
 #define pv_init_node   __pv_init_node
 #define pv_wait_node   __pv_wait_node
-#define pv_kick_node   __pv_kick_node
+#define pv_scan_next   __pv_scan_next
 
 #define pv_wait_head   __pv_wait_head
 
@@ -441,7 +441,7 @@ queue:
cpu_relax();
 
 	arch_mcs_spin_unlock_contended(&next->locked);
-   pv_kick_node(next);
+   pv_scan_next(lock, next);
 
 release:
/*
@@ -462,7 +462,7 @@ EXPORT_SYMBOL(queue_spin_lock_slowpath);
 
 #undef pv_init_node
 #undef pv_wait_node
-#undef pv_kick_node
+#undef pv_scan_next
 #undef pv_wait_head
 
 #undef  queue_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index 49dbd39..a210061 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -18,9 +18,16 @@
 
 #define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
 
+/*
+ * The vcpu_hashed is a special state that is set by the new lock holder on
+ * the new queue head to indicate that _Q_SLOW_VAL is set and hash entry
+ * filled. With this state, the queue head CPU will always be kicked even
+ * if it is not halted to avoid potential racing condition.
+ */
 enum vcpu_state {
vcpu_running = 0,
vcpu_halted,
+   vcpu_hashed
 };
 
 struct pv_node {
@@ -97,7 +104,13 @@ static inline u32 hash_align(u32 hash)
	return hash & ~(PV_HB_PER_LINE - 1);
 }
 
-static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
+/*
+ * Set up an entry in the lock hash table
+ * This is not inlined to reduce size of generated code as it is included
+ * twice and is used only in the slowest path of handling CPU halting.
+ */
+static noinline struct qspinlock **
+pv_hash(struct qspinlock *lock, struct pv_node *node)
 {
unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_bucket *hb, *end;
@@ -178,7 +191,8 @@ static void pv_init_node(struct mcs_spinlock *node)
 
 /*
  * Wait for node-locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to wake the vcpu again.
+ * pv_scan_next() is used to set _Q_SLOW_VAL and fill in hash table on its
+ * behalf.
  */
 static void pv_wait_node(struct mcs_spinlock *node)
 {
@@ -189,7 +203,6 @@ static void pv_wait_node(struct mcs_spinlock *node)
for (loop = SPIN_THRESHOLD; loop; loop--) {
 		if (READ_ONCE(node->locked))
return;
-
cpu_relax();
}
 
@@ -198,17 +211,21 @@ static void pv_wait_node(struct mcs_spinlock *node)
 *
 * [S] pn->state = vcpu_halted    [S] next->locked = 1
 *     MB                             MB
-* [L] pn->locked                 [RmW] pn->state = vcpu_running
+* [L] pn->locked                 [RmW] pn->state = vcpu_hashed
 *
-* Matches the xchg() from pv_kick_node().
+* Matches the cmpxchg() from pv_scan_next().
 */
 (void)xchg(&pn->state, vcpu_halted);

 if (!READ_ONCE(node->locked))
 	pv_wait(&pn->state, vcpu_halted);
 
-   /* Make sure that state is correct for spurious wakeup */
-   WRITE_ONCE(pn-state, vcpu_running

[Xen-devel] [PATCH v15 02/15] qspinlock, x86: Enable x86-64 to use queue spinlock

2015-04-06 Thread Waiman Long
This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket, the benefit of queue
spinlock may not be apparent, so queue spinlock is not enabled there.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/qspinlock.h  |   20 
 arch/x86/include/asm/spinlock.h   |5 +
 arch/x86/include/asm/spinlock_types.h |4 
 4 files changed, 30 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/qspinlock.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b7d31ca..49fecb1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -125,6 +125,7 @@ config X86
select MODULES_USE_ELF_RELA if X86_64
select CLONE_BACKWARDS if X86_32
select ARCH_USE_BUILTIN_BSWAP
+   select ARCH_USE_QUEUE_SPINLOCK
select ARCH_USE_QUEUE_RWLOCK
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
new file mode 100644
index 000..222995b
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,20 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#define	queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * A smp_store_release() on the least-significant byte.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   smp_store_release((u8 *)lock, 0);
+}
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index cf87de3..a9c01fd 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -196,6 +200,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t 
*lock)
cpu_relax();
}
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 /*
  * Read-write spinlocks, allowing multiple readers
diff --git a/arch/x86/include/asm/spinlock_types.h 
b/arch/x86/include/asm/spinlock_types.h
index 5f9d757..5d654a1 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm-generic/qrwlock_types.h>
 
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 06/15] qspinlock: Use a simple write to grab the lock

2015-04-06 Thread Waiman Long
Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head doesn't need to reset the tail code. In that case,
a simple write to set the lock bit is enough, as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
tail code in the lock is set.

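For illustration only (not a hunk from this patch), the queue head's
acquisition step then looks roughly like the sketch below, with set_locked()
being the new helper introduced further down and the control flow condensed:

	/*
	 * Sketch: the queue head has already seen both the lock and
	 * pending bits clear; val holds the current lock word.
	 */
	if (val != tail) {
		/*
		 * Someone is queued behind us, so the tail stays put and
		 * a simple byte store is enough to grab the lock.
		 */
		set_locked(lock);	/* WRITE_ONCE(l->locked, _Q_LOCKED_VAL) */
	} else {
		/*
		 * We are the last queuer and must also clear the tail,
		 * so keep the atomic cmpxchg for this case.
		 */
		if (atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL) != val)
			set_locked(lock);	/* a new queuer slipped in; keep the tail */
	}
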
With that change, there are some slight improvements in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine, as shown in the tables below.

[Standalone/Embedded - same node]
  # of tasks    Before patch    After patch    %Change
  ----------    ------------    -----------    -------
       3         2324/2321       2248/2265     -3%/-2%
       4         2890/2896       2819/2831     -2%/-2%
       5         3611/3595       3522/3512     -2%/-2%
       6         4281/4276       4173/4160     -3%/-3%
       7         5018/5001       4875/4861     -3%/-3%
       8         5759/5750       5563/5568     -3%/-3%

[Standalone/Embedded - different nodes]
  # of tasks    Before patch    After patch    %Change
  ----------    ------------    -----------    -------
       3        12242/12237     12087/12093    -1%/-1%
       4        10688/10696     10507/10521    -2%/-2%

It was also found that this change produced a much bigger performance
improvement on the newer IvyBridge-EX chip and essentially closed
the performance gap between the ticket spinlock and queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
  kernel        JPM        Real Time   Sys Time    Usr Time
  ------        ---        ---------   --------    --------
  ticketlock    5678233    3.17        96.61       5.81
  qspinlock     5750799    3.13        94.83       5.97

AIM7 EXT4 Disk Test
  kernel        JPM        Real Time   Sys Time    Usr Time
  ------        ---        ---------   --------    --------
  ticketlock    1114551    16.15       509.72      7.11
  qspinlock     2184466    8.24        232.99      6.01

The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.

The ebizzy -m test was also run with the following results:

  kernel        records/s  Real Time   Sys Time    Usr Time
  ------        ---------  ---------   --------    --------
  ticketlock    2075       10.00       216.35      3.49
  qspinlock     3023       10.00       198.20      4.80

Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 kernel/locking/qspinlock.c |   66 +--
 1 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index bcc99e6..99503ef 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -105,24 +105,37 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
union {
atomic_t val;
-   struct {
 #ifdef __LITTLE_ENDIAN
+   struct {
+   u8  locked;
+   u8  pending;
+   };
+   struct {
u16 locked_pending;
u16 tail;
+   };
 #else
+   struct {
u16 tail;
u16 locked_pending;
-#endif
};
+   struct {
+   u8  reserved[2];
+   u8  pending;
+   u8  locked;
+   };
+#endif
};
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue spinlock structure
@@ -195,6 +208,19 @@ static __always_inline u32 xchg_tail(struct qspinlock 
*lock, u32 tail)
 #endif /* _Q_PENDING_BITS == 8 */
 
 /**
+ * set_locked - Set the lock bit and own the lock
+ * @lock: Pointer to queue spinlock structure
+ *
+ * *,*,0 - *,0,1
+ */
+static __always_inline void set_locked(struct qspinlock *lock)
+{
+   struct __qspinlock *l = (void *)lock;
+
+	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);

[Xen-devel] [PATCH v15 10/15] pvqspinlock: Implement the paravirt qspinlock for x86

2015-04-06 Thread Waiman Long
From: Peter Zijlstra (Intel) pet...@infradead.org

We use the regular paravirt call patching to switch between:

  native_queue_spin_lock_slowpath() __pv_queue_spin_lock_slowpath()
  native_queue_spin_unlock()__pv_queue_spin_unlock()

We use a callee saved call for the unlock function which reduces the
i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions
again.

We further optimize the unlock path by patching the direct call with a
movb $0,%arg1 if we are indeed using the native unlock code. This
makes the unlock code almost as fast as the !PARAVIRT case.

This significantly lowers the overhead of having
CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code.
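
As a point of reference, the native-case patching mentioned above amounts
to emitting a single byte store in place of the call. A sketch of what the
x86-64 patch site boils down to (assuming the usual DEF_NATIVE() machinery
in paravirt_patch_64.c; the exact hunk is not shown in this excerpt):

#ifdef CONFIG_PARAVIRT_SPINLOCKS
/* native unlock: store 0 to the lock byte whose address is in %rdi */
DEF_NATIVE(pv_lock_ops, queue_spin_unlock, "movb $0, (%rdi)");
#endif

When the unlock op is still the native one, the patcher copies these few
bytes over the call site; otherwise the call to __pv_queue_spin_unlock()
is left in place.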

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/Kconfig  |2 +-
 arch/x86/include/asm/paravirt.h   |   28 +++-
 arch/x86/include/asm/paravirt_types.h |   10 ++
 arch/x86/include/asm/qspinlock.h  |   25 -
 arch/x86/kernel/paravirt-spinlocks.c  |   24 +++-
 arch/x86/kernel/paravirt_patch_32.c   |   22 ++
 arch/x86/kernel/paravirt_patch_64.c   |   22 ++
 7 files changed, 121 insertions(+), 12 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 49fecb1..a0946e7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -661,7 +661,7 @@ config PARAVIRT_DEBUG
 config PARAVIRT_SPINLOCKS
	bool "Paravirtualization layer for spinlocks"
	depends on PARAVIRT && SMP
-   select UNINLINE_SPIN_UNLOCK
+   select UNINLINE_SPIN_UNLOCK if !QUEUE_SPINLOCK
---help---
  Paravirtualized spinlocks allow a pvops backend to replace the
  spinlock implementation with something virtualization-friendly
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 965c47d..dd40269 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -712,6 +712,30 @@ static inline void __set_fixmap(unsigned /* enum 
fixed_addresses */ idx,
 
 #if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT_SPINLOCKS)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+static __always_inline void pv_queue_spin_lock_slowpath(struct qspinlock 
*lock, u32 val)
+{
+   PVOP_VCALL2(pv_lock_ops.queue_spin_lock_slowpath, lock, val);
+}
+
+static __always_inline void pv_queue_spin_unlock(struct qspinlock *lock)
+{
+   PVOP_VCALLEE1(pv_lock_ops.queue_spin_unlock, lock);
+}
+
+static __always_inline void pv_wait(u8 *ptr, u8 val)
+{
+   PVOP_VCALL2(pv_lock_ops.wait, ptr, val);
+}
+
+static __always_inline void pv_kick(int cpu)
+{
+   PVOP_VCALL1(pv_lock_ops.kick, cpu);
+}
+
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
@@ -724,7 +748,9 @@ static __always_inline void __ticket_unlock_kick(struct 
arch_spinlock *lock,
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
 
-#endif
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
+#endif /* SMP && PARAVIRT_SPINLOCKS */
 
 #ifdef CONFIG_X86_32
 #define PV_SAVE_REGS pushl %ecx; pushl %edx;
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 7549b8b..f6acaea 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,19 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+struct qspinlock;
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+   void (*queue_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
+   struct paravirt_callee_save queue_spin_unlock;
+
+   void (*wait)(u8 *ptr, u8 val);
+   void (*kick)(int cpu);
+#else /* !CONFIG_QUEUE_SPINLOCK */
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif /* !CONFIG_QUEUE_SPINLOCK */
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 64c925e..c8290db 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -3,6 +3,7 @@
 
 #include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
+#include <asm/paravirt.h>
 
 #define	queue_spin_unlock queue_spin_unlock
 /**
@@ -11,11 +12,33 @@
  *
  * A smp_store_release() on the least-significant byte.
  */
-static inline void queue_spin_unlock(struct qspinlock *lock)
+static inline void native_queue_spin_unlock(struct qspinlock *lock)
 {
smp_store_release((u8 *)lock, 0);
 }
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void native_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_init_lock_hash(void);
+extern void __pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void

[Xen-devel] [PATCH v15 08/15] lfsr: a simple binary Galois linear feedback shift register

2015-04-06 Thread Waiman Long
This patch is based on the code sent out by Peter Zijlstra as part
of his queue spinlock patch to provide a hashing function with open
addressing.  The lfsr() function can be used to return a sequence of
numbers that cycles through all the bit patterns (2^n - 1 of them) of a
given bit width n, except the value 0, in a somewhat random fashion
depending on the LFSR taps that are being used. Callers can provide
their own taps value or use the default.

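As a quick illustration of the intended use (a hypothetical stand-alone
test program, not part of the patch), a 4-bit LFSR with the default taps
value 0x9 visits every value 1..15 exactly once before repeating:

#include <stdio.h>
#include <stdint.h>

/* same update step as lfsr() above, specialized to 4 bits / taps 0x9 */
static uint32_t lfsr4(uint32_t val)
{
	uint32_t bit = val & 1;

	val >>= 1;
	if (bit)
		val ^= 0x9;
	return val;
}

int main(void)
{
	uint32_t v = 1;	/* any non-zero start state works; 0 is a fixed point */
	int i;

	for (i = 0; i < 15; i++) {
		printf("%u ", v);
		v = lfsr4(v);
	}
	printf("\n");	/* prints: 1 9 13 15 14 7 10 5 11 12 6 3 8 4 2 */
	return 0;
}
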
Signed-off-by: Waiman Long waiman.l...@hp.com
---
 include/linux/lfsr.h |   80 ++
 1 files changed, 80 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/lfsr.h

diff --git a/include/linux/lfsr.h b/include/linux/lfsr.h
new file mode 100644
index 000..f570819
--- /dev/null
+++ b/include/linux/lfsr.h
@@ -0,0 +1,80 @@
+#ifndef _LINUX_LFSR_H
+#define _LINUX_LFSR_H
+
+/*
+ * Simple Binary Galois Linear Feedback Shift Register
+ *
+ * http://en.wikipedia.org/wiki/Linear_feedback_shift_register
+ *
+ * This function currently supports only bits values of 4-30. Callers
+ * that don't pass in a constant bits value can optionally define
+ * LFSR_MIN_BITS and LFSR_MAX_BITS before including the lfsr.h header file
+ * to reduce the size of the jump table in the compiled code, if desired.
+ */
+#ifndef LFSR_MIN_BITS
+#define LFSR_MIN_BITS  4
+#endif
+
+#ifndef LFSR_MAX_BITS
+#define LFSR_MAX_BITS  30
+#endif
+
+static __always_inline u32 lfsr_taps(int bits)
+{
+	BUG_ON((bits < LFSR_MIN_BITS) || (bits > LFSR_MAX_BITS));
+	BUILD_BUG_ON((LFSR_MIN_BITS < 4) || (LFSR_MAX_BITS > 30));
+
+#define _IF_BITS_EQ(x)	\
+	if (((x) >= LFSR_MIN_BITS) && ((x) <= LFSR_MAX_BITS) && ((x) == bits))
+
+   /*
+* Feedback terms copied from
+* http://users.ece.cmu.edu/~koopman/lfsr/index.html
+*/
+   _IF_BITS_EQ(4)  return 0x0009;
+   _IF_BITS_EQ(5)  return 0x0012;
+   _IF_BITS_EQ(6)  return 0x0021;
+   _IF_BITS_EQ(7)  return 0x0041;
+   _IF_BITS_EQ(8)  return 0x008E;
+   _IF_BITS_EQ(9)  return 0x0108;
+   _IF_BITS_EQ(10) return 0x0204;
+   _IF_BITS_EQ(11) return 0x0402;
+   _IF_BITS_EQ(12) return 0x0829;
+   _IF_BITS_EQ(13) return 0x100D;
+   _IF_BITS_EQ(14) return 0x2015;
+   _IF_BITS_EQ(15) return 0x4122;
+   _IF_BITS_EQ(16) return 0x8112;
+   _IF_BITS_EQ(17) return 0x102C9;
+   _IF_BITS_EQ(18) return 0x20195;
+   _IF_BITS_EQ(19) return 0x403FE;
+   _IF_BITS_EQ(20) return 0x80637;
+   _IF_BITS_EQ(21) return 0x100478;
+   _IF_BITS_EQ(22) return 0x20069E;
+   _IF_BITS_EQ(23) return 0x4004B2;
+   _IF_BITS_EQ(24) return 0x800B87;
+   _IF_BITS_EQ(25) return 0x10004F3;
+   _IF_BITS_EQ(26) return 0x200072D;
+   _IF_BITS_EQ(27) return 0x40006AE;
+   _IF_BITS_EQ(28) return 0x80009E3;
+   _IF_BITS_EQ(29) return 0x1583;
+   _IF_BITS_EQ(30) return 0x2C92;
+#undef _IF_BITS_EQ
+
+   /* Unreachable */
+   return 0;
+}
+
+/*
+ * Please note that LFSR doesn't work with a start state of 0.
+ */
+static inline u32 lfsr(u32 val, int bits, u32 taps)
+{
+	u32 bit = val & 1;
+
+	val >>= 1;
+   if (bit)
+   val ^= taps ? taps : lfsr_taps(bits);
+   return val;
+}
+
+#endif /* _LINUX_LFSR_H */
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 12/15] pvqspinlock, x86: Enable PV qspinlock for Xen

2015-04-06 Thread Waiman Long
This patch adds the necessary Xen specific code to allow Xen to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 arch/x86/xen/spinlock.c |   63 ---
 kernel/Kconfig.locks|2 +-
 2 files changed, 60 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 956374c..728b45b 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -17,6 +17,55 @@
 #include xen-ops.h
 #include debugfs.h
 
+static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
+static DEFINE_PER_CPU(char *, irq_name);
+static bool xen_pvspin = true;
+
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+#include asm/qspinlock.h
+
+static void xen_qlock_kick(int cpu)
+{
+   xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
+}
+
+/*
+ * Halt the current CPU & release it back to the host
+ */
+static void xen_qlock_wait(u8 *byte, u8 val)
+{
+   int irq = __this_cpu_read(lock_kicker_irq);
+
+   /* If kicker interrupts not initialized yet, just spin */
+   if (irq == -1)
+   return;
+
+   /* clear pending */
+   xen_clear_irq_pending(irq);
+
+   /*
+* We check the byte value after clearing pending IRQ to make sure
+* that we won't miss a wakeup event because of the clearing.
+*
+* The sync_clear_bit() call in xen_clear_irq_pending() is atomic.
+* So it is effectively a memory barrier for x86.
+*/
+   if (READ_ONCE(*byte) != val)
+   return;
+
+   /*
+* If an interrupt happens here, it will leave the wakeup irq
+* pending, which will cause xen_poll_irq() to return
+* immediately.
+*/
+
+   /* Block until irq becomes pending (or perhaps a spurious wakeup) */
+   xen_poll_irq(irq);
+}
+
+#else /* CONFIG_QUEUE_SPINLOCK */
+
 enum xen_contention_stat {
TAKEN_SLOW,
TAKEN_SLOW_PICKUP,
@@ -100,12 +149,9 @@ struct xen_lock_waiting {
__ticket_t want;
 };
 
-static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
-static DEFINE_PER_CPU(char *, irq_name);
 static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting);
 static cpumask_t waiting_cpus;
 
-static bool xen_pvspin = true;
 __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 {
int irq = __this_cpu_read(lock_kicker_irq);
@@ -217,6 +263,7 @@ static void xen_unlock_kick(struct arch_spinlock *lock, 
__ticket_t next)
}
}
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 static irqreturn_t dummy_handler(int irq, void *dev_id)
 {
@@ -280,8 +327,16 @@ void __init xen_init_spinlocks(void)
return;
}
	printk(KERN_DEBUG "xen: PV spinlocks enabled\n");
+#ifdef CONFIG_QUEUE_SPINLOCK
+   __pv_init_lock_hash();
+   pv_lock_ops.queue_spin_lock_slowpath = __pv_queue_spin_lock_slowpath;
+   pv_lock_ops.queue_spin_unlock = PV_CALLEE_SAVE(__pv_queue_spin_unlock);
+   pv_lock_ops.wait = xen_qlock_wait;
+   pv_lock_ops.kick = xen_qlock_kick;
+#else
pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning);
pv_lock_ops.unlock_kick = xen_unlock_kick;
+#endif
 }
 
 /*
@@ -310,7 +365,7 @@ static __init int xen_parse_nopvspin(char *arg)
 }
 early_param("xen_nopvspin", xen_parse_nopvspin);
 
-#ifdef CONFIG_XEN_DEBUG_FS
+#if defined(CONFIG_XEN_DEBUG_FS) && !defined(CONFIG_QUEUE_SPINLOCK)
 
 static struct dentry *d_spin_debug;
 
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 537b13e..0b42933 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -240,7 +240,7 @@ config ARCH_USE_QUEUE_SPINLOCK
 
 config QUEUE_SPINLOCK
def_bool y if ARCH_USE_QUEUE_SPINLOCK
-	depends on SMP && (!PARAVIRT_SPINLOCKS || !XEN)
+   depends on SMP
 
 config ARCH_USE_QUEUE_RWLOCK
bool
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


[Xen-devel] [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-06 Thread Waiman Long
Provide a separate (second) version of the spin_lock_slowpath for
paravirt along with a special unlock path.

The second slowpath is generated by adding a few pv hooks to the
normal slowpath, but where those will compile away for the native
case, they expand into special wait/wake code for the pv version.

The actual MCS queue can use extra storage in the mcs_nodes[] array to
keep track of state and therefore uses directed wakeups.

The head contender has no such storage directly visible to the
unlocker.  So the unlocker searches a hash table with open addressing
using a simple binary Galois linear feedback shift register.
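
In outline (a condensed sketch of the mechanism described above, not the
literal code; helper names such as pv_hash()/pv_hash_find() approximate
the ones added by this patch and the exact signatures may differ):

	/* queue head, in pv_wait_head(): sketch only */
	pv_hash(lock, pn);			/* publish (lock, node) in the hash */
	if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL))
		goto done;			/* lock already free, no need to wait */
	pv_wait(&l->locked, _Q_SLOW_VAL);	/* halt until kicked */

	/* unlocker, in __pv_queue_spin_unlock(): sketch only */
	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
		return;				/* fast path, nobody is halted */
	node = pv_hash_find(lock);		/* _Q_SLOW_VAL seen: look up the waiter */
	WRITE_ONCE(l->locked, 0);
	pv_kick(node->cpu);			/* wake the halted queue head */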

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock.c  |   69 -
 kernel/locking/qspinlock_paravirt.h |  321 +++
 2 files changed, 389 insertions(+), 1 deletions(-)
 create mode 100644 kernel/locking/qspinlock_paravirt.h

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc2e5ab..33b3f54 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -18,6 +18,9 @@
  * Authors: Waiman Long waiman.l...@hp.com
  *  Peter Zijlstra pet...@infradead.org
  */
+
+#ifndef _GEN_PV_LOCK_SLOWPATH
+
 #include <linux/smp.h>
 #include <linux/bug.h>
 #include <linux/cpumask.h>
@@ -65,13 +68,21 @@
 
 #include "mcs_spinlock.h"
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define MAX_NODES  8
+#else
+#define MAX_NODES  4
+#endif
+
 /*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ *
+ * PV doubles the storage and uses the second cacheline for PV state.
  */
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
 
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
@@ -220,6 +231,33 @@ static __always_inline void set_locked(struct qspinlock 
*lock)
WRITE_ONCE(l-locked, _Q_LOCKED_VAL);
 }
 
+
+/*
+ * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for
+ * all the PV callbacks.
+ */
+
+static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
+
+static __always_inline void __pv_wait_head(struct qspinlock *lock,
+  struct mcs_spinlock *node) { }
+
+#define pv_enabled()   false
+
+#define pv_init_node   __pv_init_node
+#define pv_wait_node   __pv_wait_node
+#define pv_kick_node   __pv_kick_node
+
+#define pv_wait_head   __pv_wait_head
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define queue_spin_lock_slowpath   native_queue_spin_lock_slowpath
+#endif
+
+#endif /* _GEN_PV_LOCK_SLOWPATH */
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
@@ -249,6 +287,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 
	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   if (pv_enabled())
+   goto queue;
+
if (virt_queue_spin_lock(lock))
return;
 
@@ -325,6 +366,7 @@ queue:
node += idx;
node-locked = 0;
node-next = NULL;
+   pv_init_node(node);
 
/*
 * We touched a (possibly) cold cacheline in the per-cpu queue node;
@@ -350,6 +392,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev-next, node);
 
+   pv_wait_node(node);
arch_mcs_spin_lock_contended(node-locked);
}
 
@@ -365,6 +408,7 @@ queue:
 * does not imply a full barrier.
 *
 */
+   pv_wait_head(lock, node);
	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
		cpu_relax();
 
@@ -397,6 +441,7 @@ queue:
cpu_relax();
 
arch_mcs_spin_unlock_contended(next-locked);
+   pv_kick_node(next);
 
 release:
/*
@@ -405,3 +450,25 @@ release:
this_cpu_dec(mcs_nodes[0].count);
 }
 EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+/*
+ * Generate the paravirt code for queue_spin_unlock_slowpath().
+ */
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#define _GEN_PV_LOCK_SLOWPATH
+
+#undef  pv_enabled
+#define pv_enabled()   true
+
+#undef pv_init_node
+#undef pv_wait_node
+#undef pv_kick_node
+#undef pv_wait_head
+
+#undef  queue_spin_lock_slowpath
+#define queue_spin_lock_slowpath   __pv_queue_spin_lock_slowpath
+
+#include "qspinlock_paravirt.h"
+#include "qspinlock.c"
+
+#endif
diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
new file mode 100644
index 000..49dbd39
--- /dev/null
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -0,0 +1,321

[Xen-devel] [PATCH v15 15/15] pvqspinlock: Add debug code to check for PV lock hash sanity

2015-04-06 Thread Waiman Long
The current code for PV lock hash table processing will panic the
system if pv_hash_find() can't find the desired hash bucket. However,
there is no check to see if there is more than one entry for a given
lock which should never happen.

This patch adds a pv_hash_check_duplicate() function to do that which
will only be enabled if CONFIG_DEBUG_SPINLOCK is defined because of
the performance overhead it introduces.

Signed-off-by: Waiman Long waiman.l...@hp.com
---
 kernel/locking/qspinlock_paravirt.h |   58 +++
 1 files changed, 58 insertions(+), 0 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h 
b/kernel/locking/qspinlock_paravirt.h
index a9fe10d..4d39c8b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -107,6 +107,63 @@ static inline u32 hash_align(u32 hash)
 }
 
 /*
+ * Hash table debugging code
+ */
+#ifdef CONFIG_DEBUG_SPINLOCK
+
+#define _NODE_IDX(pn)	((((unsigned long)pn) & (SMP_CACHE_BYTES - 1)) /\
+			sizeof(struct mcs_spinlock))
+/*
+ * Check if there is additional hash buckets with the same lock which
+ * should not happen.
+ */
+static inline void pv_hash_check_duplicate(struct qspinlock *lock)
+{
+   struct pv_hash_bucket *hb, *end, *hb1 = NULL;
+   int count = 0, used = 0;
+
+	end = &pv_lock_hash[1 << pv_lock_hash_bits];
+	for (hb = pv_lock_hash; hb < end; hb++) {
+		struct qspinlock *l = READ_ONCE(hb->lock);
+		struct pv_node *pn;
+
+		if (l)
+			used++;
+		if (l != lock)
+			continue;
+		if (++count == 1) {
+			hb1 = hb;
+			continue;
+		}
+		WARN_ON(count == 2);
+		if (hb1) {
+			pn = READ_ONCE(hb1->node);
+			printk(KERN_ERR "PV lock hash error: duplicated entry "
+			       "#%d - hash %ld, node %ld, cpu %d\n", 1,
+			       hb1 - pv_lock_hash, _NODE_IDX(pn),
+			       pn ? pn->cpu : -1);
+			hb1 = NULL;
+		}
+		pn = READ_ONCE(hb->node);
+		printk(KERN_ERR "PV lock hash error: duplicated entry #%d - "
+		       "hash %ld, node %ld, cpu %d\n", count, hb - pv_lock_hash,
+		       _NODE_IDX(pn), pn ? pn->cpu : -1);
+	}
+	/*
+	 * Warn if more than half of the buckets are used
+	 */
+	if (used > (1 << (pv_lock_hash_bits - 1)))
+		printk(KERN_WARNING "PV lock hash warning: "
+		       "%d hash entries used!\n", used);
+}
+
+#else /* CONFIG_DEBUG_SPINLOCK */
+
+static inline void pv_hash_check_duplicate(struct qspinlock *lock) {}
+
+#endif /* CONFIG_DEBUG_SPINLOCK */
+
+/*
  * Set up an entry in the lock hash table
  * This is not inlined to reduce size of generated code as it is included
  * twice and is used only in the slowest path of handling CPU halting.
@@ -141,6 +198,7 @@ pv_hash(struct qspinlock *lock, struct pv_node *node)
}
 
 done:
+   pv_hash_check_duplicate(lock);
	return &hb->lock;
 }
 
-- 
1.7.1


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-02 Thread Waiman Long

On 04/02/2015 03:48 PM, Peter Zijlstra wrote:

On Thu, Apr 02, 2015 at 07:20:57PM +0200, Peter Zijlstra wrote:

pv_wait_head():

pv_hash()
/* MB as per cmpxchg */
cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);

VS

__pv_queue_spin_unlock():

if (xchg(&l->locked, 0) != _Q_SLOW_VAL)
return;

/* MB as per xchg */
pv_hash_find(lock);




Something like so.. compile tested only.

I took out the LFSR because that was likely over engineering from my
side :-)

--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -2,6 +2,8 @@
  #error do not include this file
  #endif

+#include <linux/hash.h>
+
  /*
   * Implement paravirt qspinlocks; the general idea is to halt the vcpus 
instead
   * of spinning them.
@@ -107,7 +109,84 @@ static void pv_kick_node(struct mcs_spin
pv_kick(pn-cpu);
  }

-static DEFINE_PER_CPU(struct qspinlock *, __pv_lock_wait);
+/*
+ * Hash table using open addressing with an linear probe sequence.
+ *
+ * Since we should not be holding locks from NMI context (very rare indeed) the
+ * max load factor is 0.75, which is around the point where open addressing
+ * breaks down.
+ *
+ * Instead of probing just the immediate bucket we probe all buckets in the
+ * same cacheline.
+ *
+ * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
+ *
+ */
+
+struct pv_hash_bucket {
+   struct qspinlock *lock;
+   int cpu;
+};
+
+/*
+ * XXX dynamic allocate using nr_cpu_ids instead...
+ */
+#define PV_LOCK_HASH_BITS  (2 + NR_CPUS_BITS)
+
+#if PV_LOCK_HASH_BITS < 6
+#undef PV_LOCK_HASH_BITS
+#define PB_LOCK_HASH_BITS	6
+#endif
+
+#define PV_LOCK_HASH_SIZE	(1 << PV_LOCK_HASH_BITS)
+
+static struct pv_hash_bucket __pv_lock_hash[PV_LOCK_HASH_SIZE] 
cacheline_aligned;
+
+#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct 
pv_hash_bucket))
+
+static inline u32 hash_align(u32 hash)
+{
+	return hash & ~(PV_HB_PER_LINE - 1);
+}
+
+#define for_each_hash_bucket(hb, off, hash)				\
+	for (hash = hash_align(hash), off = 0, hb = &__pv_lock_hash[hash + off];\
+	     off < PV_LOCK_HASH_SIZE;					\
+	     off++, hb = &__pv_lock_hash[(hash + off) % PV_LOCK_HASH_SIZE])
+
+static struct pv_hash_bucket *pv_hash_insert(struct qspinlock *lock)
+{
+   u32 offset, hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
+   struct pv_hash_bucket *hb;
+
+   for_each_hash_bucket(hb, offset, hash) {
+		if (!cmpxchg(&hb->lock, NULL, lock)) {
+			WRITE_ONCE(hb->cpu, smp_processor_id());
+   return hb;
+   }
+   }
+
+   /*
+* Hard assumes there is an empty bucket somewhere.
+*/
+   BUG();
+}
+
+static struct pv_hash_bucket *pv_hash_find(struct qspinlock *lock)
+{
+   u32 offset, hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
+   struct pv_hash_bucket *hb;
+
+   for_each_hash_bucket(hb, offset, hash) {
+		if (READ_ONCE(hb->lock) == lock)
+   return hb;
+   }
+
+   /*
+* Hard assumes we _WILL_ find a match.
+*/
+   BUG();
+}

  /*
   * Wait for l-locked to become clear; halt the vcpu after a short spin.
@@ -116,7 +195,9 @@ static DEFINE_PER_CPU(struct qspinlock *
  static void pv_wait_head(struct qspinlock *lock)
  {
struct __qspinlock *l = (void *)lock;
+   struct pv_hash_bucket *hb = NULL;
int loop;
+   u8 o;

for (;;) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
@@ -126,29 +207,47 @@ static void pv_wait_head(struct qspinloc
cpu_relax();
}

-   this_cpu_write(__pv_lock_wait, lock);
-   /*
-* __pv_lock_wait must be set before setting _Q_SLOW_VAL
-*
-	 * [S] __pv_lock_wait = lock    [RmW] l = l->locked = 0
-	 *     MB                             MB
-	 * [S] l->locked = _Q_SLOW_VAL  [L]   __pv_lock_wait
-	 *
-	 * Matches the xchg() in pv_queue_spin_unlock().
-	 */
-	if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL))
-		goto done;
+		if (!hb) {
+			hb = pv_hash_insert(lock);
+			/*
+			 * hb must be set before setting _Q_SLOW_VAL
+			 *
+			 * [S]   hb->lock                 [RmW] l = l->locked = 0
+			 *       MB                             MB
+			 * [RmW] l->locked ?= _Q_SLOW_VAL [L]   hb
+			 *
+			 * Matches the xchg() in pv_queue_spin_unlock().
+			 */
+			o = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
+   if (!o) {
+   /*
+

Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-02 Thread Waiman Long

On 04/01/2015 05:03 PM, Peter Zijlstra wrote:

On Wed, Apr 01, 2015 at 03:58:58PM -0400, Waiman Long wrote:

On 04/01/2015 02:48 PM, Peter Zijlstra wrote:
I am sorry that I don't quite get what you mean here. My point is that in
the hashing step, a cpu will need to scan an empty bucket to put the lock
in. In the interim, an previously used bucket before the empty one may get
freed. In the lookup step for that lock, the scanning will stop because of
an empty bucket in front of the target one.

Right, that's broken. So we need to do something else to limit the
lookup, because without that break, a lookup that needs to iterate the
entire array in order to determine -ENOENT, which is expensive.

So my alternative proposal is that IFF we can guarantee that every
lookup will succeed -- the entry we're looking for is always there, we
don't need the break on empty but can probe until we find the entry.
This will be bound in cost to the same number if probes we required for
insertion and avoids the full array scan.

Now I think we can indeed do this, if as said earlier we do not clear
the bucket on insert if the cmpxchg succeeds, in that case the unlock
will observe _Q_SLOW_VAL and do the lookup, the lookup will then find
the entry. And we then need the unlock to clear the entry.
Does that explain this? Or should I try again with code?


OK, I got your proposal now. However, there is still the issue that 
setting the _Q_SLOW_VAL flag and the hash bucket are not atomic wrt each 
other. It is possible a CPU has set the _Q_SLOW_VAL flag but not yet 
filling in the hash bucket while another one is trying to look for it. 
So we need to have some kind of synchronization mechanism to let the 
lookup CPU know when is a good time to look up.


One possibility is to delay setting _Q_SLOW_VAL until the hash bucket is 
set up. Maybe we can make that work.
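
As an illustration of that ordering (a sketch only, not code from the
posted patches; the hash helper names are placeholders):

	/* waiter (queue head): fill in the hash bucket first */
	hb = pv_hash_insert(lock);
	/* only now advertise the slow path; cmpxchg is a full barrier */
	if (cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) != _Q_LOCKED_VAL) {
		pv_hash_remove(hb);	/* lock was released meanwhile, undo */
		goto done;
	}
	pv_wait(&l->locked, _Q_SLOW_VAL);

	/* unlocker: it can only observe _Q_SLOW_VAL after the bucket is
	 * visible, so the lookup below is guaranteed to find an entry */
	if (xchg(&l->locked, 0) == _Q_SLOW_VAL)
		pv_kick(pv_hash_find(lock)->cpu);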


Cheers,
Longman



___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-01 Thread Waiman Long

On 03/19/2015 08:25 AM, Peter Zijlstra wrote:

On Thu, Mar 19, 2015 at 11:12:42AM +0100, Peter Zijlstra wrote:

So I was now thinking of hashing the lock pointer; let me go and quickly
put something together.

A little something like so; ideally we'd allocate the hashtable since
NR_CPUS is kinda bloated, but it shows the idea I think.

And while this has loops in (the rehashing thing) their fwd progress
does not depend on other CPUs.

And I suspect that for the typical lock contention scenarios its
unlikely we ever really get into long rehashing chains.

---
  include/linux/lfsr.h|   49 
  kernel/locking/qspinlock_paravirt.h |  143 

  2 files changed, 178 insertions(+), 14 deletions(-)

--- /dev/null

+
+static int pv_hash_find(struct qspinlock *lock)
+{
+   u64 hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
+   struct pv_hash_bucket *hb, *end;
+   int cpu = -1;
+
+   if (!hash)
+   hash = 1;
+
+	hb = &__pv_lock_hash[hash_align(hash)];
+	for (;;) {
+		for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
+			struct qspinlock *l = READ_ONCE(hb->lock);
+
+   /*
+* If we hit an unused bucket, there is no match.
+*/
+   if (!l)
+   goto done;


After more careful reading, I think the assumption that the presence of 
an unused bucket means there is no match is not true. Consider the scenario:


1. cpu 0 puts lock1 into hb[0]
2. cpu 1 puts lock2 into hb[1]
3. cpu 2 clears hb[0]
4. cpu 3 looks for lock2 and doesn't find it

I was thinking about putting some USED flag in the buckets, but then we 
will eventually fill them all up as used. If we put the entries into a 
hashed linked list, we have to deal with the complicated synchronization 
issues with link list update.


At this point, I am thinking using back your previous idea of passing 
the queue head information down the queue. I am now convinced that the 
unlock call site patching should work. So I will incorporate that in my 
next update.


Please let me know if you think my reasoning is not correct.

Thanks,
Longman


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-01 Thread Waiman Long

On 04/01/2015 02:17 PM, Peter Zijlstra wrote:

On Wed, Apr 01, 2015 at 07:42:39PM +0200, Peter Zijlstra wrote:

Hohumm.. time to think more I think ;-)

So bear with me, I've not really pondered this well so it could be full
of holes (again).

After the cmpxchg(l-locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds the
spin_unlock() must do the hash lookup, right? We can make the lookup
unhash.

If the cmpxchg() fails the unlock will not do the lookup and we must
unhash.

The idea being that the result is that any lookup is guaranteed to find
an entry, which reduces our worst case lookup cost to whatever the worst
case insertion cost was.



I think it doesn't matter who did the unhashing. Multiple independent 
locks can be hashed to the same value. Since they can be unhashed 
independently, there is no way to know whether you have checked all the 
possible buckets.


-Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-01 Thread Waiman Long

On 04/01/2015 02:48 PM, Peter Zijlstra wrote:

On Wed, Apr 01, 2015 at 02:54:45PM -0400, Waiman Long wrote:

On 04/01/2015 02:17 PM, Peter Zijlstra wrote:

On Wed, Apr 01, 2015 at 07:42:39PM +0200, Peter Zijlstra wrote:

Hohumm.. time to think more I think ;-)

So bear with me, I've not really pondered this well so it could be full
of holes (again).

After the cmpxchg(l-locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds the
spin_unlock() must do the hash lookup, right? We can make the lookup
unhash.

If the cmpxchg() fails the unlock will not do the lookup and we must
unhash.

The idea being that the result is that any lookup is guaranteed to find
an entry, which reduces our worst case lookup cost to whatever the worst
case insertion cost was.


I think it doesn't matter who did the unhashing. Multiple independent locks
can be hashed to the same value. Since they can be unhashed independently,
there is no way to know whether you have checked all the possible buckets.

oh but the crux is that you guarantee a lookup will find an entry. it will
never need to iterate the entire array.


I am sorry that I don't quite get what you mean here. My point is that 
in the hashing step, a cpu will need to scan for an empty bucket to put 
the lock in. In the interim, a previously used bucket before the empty 
one may get freed. In the lookup step for that lock, the scanning will 
stop because of an empty bucket in front of the target one.


-Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-01 Thread Waiman Long

On 04/01/2015 01:12 PM, Peter Zijlstra wrote:

On Wed, Apr 01, 2015 at 12:20:30PM -0400, Waiman Long wrote:

After more careful reading, I think the assumption that the presence of an
unused bucket means there is no match is not true. Consider the scenario:

1. cpu 0 puts lock1 into hb[0]
2. cpu 1 puts lock2 into hb[1]
3. cpu 2 clears hb[0]
4. cpu 3 looks for lock2 and doesn't find it

Hmm, yes. The only way I can see that being true is if we assume entries
are never taken out again.

The wikipedia page could use some clarification here, this is not clear.


At this point, I am thinking using back your previous idea of passing the
queue head information down the queue.

Having to scan the entire array for a lookup sure sucks, but the wait
loops involved in the other idea can get us in the exact predicament we
were trying to get out, because their forward progress depends on other
CPUs.


For the waiting loop, the worst case is when a new CPU gets queued right 
before we write the head value to the previous tail node. In that case, 
the maximum number of retries is equal to the total number of CPUs - 2. 
But that should rarely happen.


I did find a way to guarantee forward progress in a few steps. I will try 
the normal way once. If that fails, I will insert the head node at the 
tail once again after saving the next pointer. After modifying the 
previous tail node, cmpxchg will be used to restore the previous tail. 
If that fails, we just have to wait until the next pointer is updated 
and write it out to the previous tail node. We can now restore the next 
pointer and move forward.


Let me know if that looks reasonable to you.

-Longman


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15

2015-03-30 Thread Waiman Long

On 03/25/2015 03:47 PM, Konrad Rzeszutek Wilk wrote:

On Mon, Mar 16, 2015 at 02:16:13PM +0100, Peter Zijlstra wrote:

Hi Waiman,

As promised; here is the paravirt stuff I did during the trip to BOS last week.

All the !paravirt patches are more or less the same as before (the only real
change is the copyright lines in the first patch).

The paravirt stuff is 'simple' and KVM only -- the Xen code was a little more
convoluted and I've no real way to test that but it should be stright fwd to
make work.

I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has) and
it both booted and survived a hackbench run (perf bench sched messaging -g 20
-l 5000).

So while the paravirt code isn't the most optimal code ever conceived it does 
work.

Also, the paravirt patching includes replacing the call with movb $0, %arg1
for the native case, which should greatly reduce the cost of having
CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.

Ah nice. That could be spun out as a seperate patch to optimize the existing
ticket locks I presume.


The goal is to replace ticket spinlock by queue spinlock. We may not 
want to support 2 different spinlock implementations in the kernel.




Now with the old pv ticketlock code an vCPU would only go to sleep once and
be woken up when it was its turn. With this new code it is woken up twice
(and twice it goes to sleep). With an overcommit scenario this would imply
that we will have at least twice as many VMEXIT as with the previous code.


I did it differently in my PV portion of the qspinlock patch. Instead of 
just waking up the CPU, the new lock holder will check if the new queue 
head has been halted. If so, it will set the slowpath flag for the 
halted queue head in the lock so as to wake it up at unlock time. This 
should eliminate your concern of doing twice as many VMEXITs in an 
overcommitted scenario.

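As a rough illustration of that scheme (a sketch, not the code that was
actually posted; names only approximate the series):

	/* new lock holder; pn is the pv state of the next (new queue head) node */
	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) == vcpu_halted) {
		/*
		 * The new queue head is halted: hash the lock and set the
		 * slowpath flag so that the *unlock* path kicks it directly,
		 * instead of waking it here only to have it sleep again.
		 */
		(void)pv_hash(lock, pn);
		WRITE_ONCE(l->locked, _Q_SLOW_VAL);
	}

	/* unlock: only a lock with the flag set pays for a lookup + kick */
	if (xchg(&l->locked, 0) == _Q_SLOW_VAL)
		pv_kick(pv_hash_find(lock)->cpu);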

BTW, I did some qspinlock vs. ticketspinlock benchmarks using AIM7 
high_systime workload on a 4-socket IvyBridge-EX system (60 cores, 120 
threads) with some interesting results.


In term of the performance benefit of this patch, I ran the
high_systime workload (which does a lot of fork() and exit())
at various load levels (500, 1000, 1500 and 2000 users) on a
4-socket IvyBridge-EX bare-metal system (60 cores, 120 threads)
with intel_pstate driver and performance scaling governor. The JPM
(jobs/minutes) and execution time results were as follows:

Kernel            JPM         Execution Time
------            ---         --------------
At 500 users:
 3.19             118857.14   26.25s
 3.19-qspinlock   134889.75   23.13s
 % change         +13.5%      -11.9%

At 1000 users:
 3.19             204255.32   30.55s
 3.19-qspinlock   239631.34   26.04s
 % change         +17.3%      -14.8%

At 1500 users:
 3.19             177272.73   52.80s
 3.19-qspinlock   326132.40   28.70s
 % change         +84.0%      -45.6%

At 2000 users:
 3.19             196690.31   63.45s
 3.19-qspinlock   341730.56   36.52s
 % change         +73.7%      -42.4%

It turns out that this workload was causing quite a lot of spinlock
contention in the vanilla 3.19 kernel. The performance advantage of
this patch increases with heavier loads.

With the powersave governor, the JPM data were as follows:

Users    3.19        3.19-qspinlock   % Change
-----    ----        --------------   --------
 500     112635.38   132596.69         +17.7%
1000     171240.40   240369.80         +40.4%
1500     130507.53   324436.74        +148.6%
2000     175972.93   341637.01         +94.1%

With the qspinlock patch, there wasn't too much difference in
performance between the 2 scaling governors. Without this patch,
the powersave governor was much slower than the performance governor.

By disabling the intel_pstate driver and use acpi_cpufreq instead,
the benchmark performance (JPM) at 1000 users level for the performance
and ondemand governors were:

  Governor      3.19        3.19-qspinlock   % Change
  --------      ----        --------------   --------
  performance   124949.94   219950.65        +76.0%
  ondemand        4838.90   206690.96        +4171%

The performance was just horrible when there was significant spinlock
contention with the ondemand governor. There was also significant
run-to-run variation.  A second run of the same benchmark gave a result
of 22115 JPMs. With the qspinlock patch, however, the performance was
much more stable under different cpufreq drivers and governors. That
is not the case with the default ticket spinlock implementation.

The %CPU times spent on spinlock contention (from perf) with the
performance governor and the intel_pstate driver were:

  Kernel Function    3.19 kernel    3.19-qspinlock kernel
  ---------------    -----------    ---------------------

Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15

2015-03-30 Thread Waiman Long

On 03/30/2015 12:29 PM, Peter Zijlstra wrote:

On Mon, Mar 30, 2015 at 12:25:12PM -0400, Waiman Long wrote:

I did it differently in my PV portion of the qspinlock patch. Instead of
just waking up the CPU, the new lock holder will check if the new queue head
has been halted. If so, it will set the slowpath flag for the halted queue
head in the lock so as to wake it up at unlock time. This should eliminate
your concern of dong twice as many VMEXIT in an overcommitted scenario.

We can still do that on top of all this right? As you might have
realized I'm a fan of gradual complexity :-)


Of course. I am just saying that the concern can be addressed with some 
additional code change.


-Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15

2015-03-30 Thread Waiman Long

On 03/27/2015 10:07 AM, Konrad Rzeszutek Wilk wrote:

On Thu, Mar 26, 2015 at 09:21:53PM +0100, Peter Zijlstra wrote:

On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote:

Ah nice. That could be spun out as a seperate patch to optimize the existing
ticket locks I presume.

Yes I suppose we can do something similar for the ticket and patch in
the right increment. We'd need to restructure the code a bit, but
its not fundamentally impossible.

We could equally apply the head hashing to the current ticket
implementation and avoid the current bitmap iteration.


Now with the old pv ticketlock code an vCPU would only go to sleep once and
be woken up when it was its turn. With this new code it is woken up twice
(and twice it goes to sleep). With an overcommit scenario this would imply
that we will have at least twice as many VMEXIT as with the previous code.

An astute observation, I had not considered that.

Thank you.

I presume when you did benchmarking this did not even register? Thought
I wonder if it would if you ran the benchmark for a week or so.

You presume I benchmarked :-) I managed to boot something virt and run
hackbench in it. I wouldn't know a representative virt setup if I ran
into it.

The thing is, we want this qspinlock for real hardware because its
faster and I really want to avoid having to carry two spinlock
implementations -- although I suppose that if we really really have to
we could.

In some way you already have that - for virtualized environments where you
don't have an PV mechanism you just use the byte spinlock - which is good.

And switching to PV ticketlock implementation after boot.. ugh. I feel your 
pain.

What if you used an PV bytelock implemenation? The code you posted already
'sprays' all the vCPUS to wake up. And that is exactly what you need for PV
bytelocks - well, you only need to wake up the vCPUS that have gone to sleep
waiting on an specific 'struct spinlock' and just stash those in an per-cpu
area. The old Xen spinlock code (Before 3.11?) had this.

Just an idea thought.


The current code should have just woken up one sleeping vCPU. We 
wouldn't want to wake up all of them and have all except one go 
back to sleep. I think the PV bytelock you suggest is workable. It 
should also simplify the implementation. It is just a matter of how much 
we value the fairness attribute of the PV ticket or queue spinlock 
implementation that we have.


-Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support

2015-03-19 Thread Waiman Long

On 03/19/2015 08:25 AM, Peter Zijlstra wrote:

On Thu, Mar 19, 2015 at 11:12:42AM +0100, Peter Zijlstra wrote:

So I was now thinking of hashing the lock pointer; let me go and quickly
put something together.

A little something like so; ideally we'd allocate the hashtable since
NR_CPUS is kinda bloated, but it shows the idea I think.

And while this has loops in (the rehashing thing) their fwd progress
does not depend on other CPUs.

And I suspect that for the typical lock contention scenarios its
unlikely we ever really get into long rehashing chains.

---
  include/linux/lfsr.h|   49 
  kernel/locking/qspinlock_paravirt.h |  143 

  2 files changed, 178 insertions(+), 14 deletions(-)


This is a much better alternative.


--- /dev/null
+++ b/include/linux/lfsr.h
@@ -0,0 +1,49 @@
+#ifndef _LINUX_LFSR_H
+#define _LINUX_LFSR_H
+
+/*
+ * Simple Binary Galois Linear Feedback Shift Register
+ *
+ * http://en.wikipedia.org/wiki/Linear_feedback_shift_register
+ *
+ */
+
+extern void __lfsr_needs_more_taps(void);
+
+static __always_inline u32 lfsr_taps(int bits)
+{
+   if (bits ==  1) return 0x0001;
+   if (bits ==  2) return 0x0001;
+   if (bits ==  3) return 0x0003;
+   if (bits ==  4) return 0x0009;
+   if (bits ==  5) return 0x0012;
+   if (bits ==  6) return 0x0021;
+   if (bits ==  7) return 0x0041;
+   if (bits ==  8) return 0x008E;
+   if (bits ==  9) return 0x0108;
+   if (bits == 10) return 0x0204;
+   if (bits == 11) return 0x0402;
+   if (bits == 12) return 0x0829;
+   if (bits == 13) return 0x100D;
+   if (bits == 14) return 0x2015;
+
+   /*
+* For more taps see:
+*   http://users.ece.cmu.edu/~koopman/lfsr/index.html
+*/
+   __lfsr_needs_more_taps();
+
+   return 0;
+}
+
+static inline u32 lfsr(u32 val, int bits)
+{
+   u32 bit = val  1;
+
+   val= 1;
+   if (bit)
+   val ^= lfsr_taps(bits);
+   return val;
+}
+
+#endif /* _LINUX_LFSR_H */
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -2,6 +2,9 @@
  #error do not include this file
  #endif

+#include <linux/hash.h>
+#include <linux/lfsr.h>
+
  /*
   * Implement paravirt qspinlocks; the general idea is to halt the vcpus 
instead
   * of spinning them.
@@ -107,7 +110,120 @@ static void pv_kick_node(struct mcs_spin
pv_kick(pn-cpu);
  }

-static DEFINE_PER_CPU(struct qspinlock *, __pv_lock_wait);
+/*
+ * Hash table using open addressing with an LFSR probe sequence.
+ *
+ * Since we should not be holding locks from NMI context (very rare indeed) the
+ * max load factor is 0.75, which is around the point where open addressing
+ * breaks down.
+ *
+ * Instead of probing just the immediate bucket we probe all buckets in the
+ * same cacheline.
+ *
+ * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
+ *
+ */
+
+#define HB_RESERVED((struct qspinlock *)1)
+
+struct pv_hash_bucket {
+   struct qspinlock *lock;
+   int cpu;
+};
+
+/*
+ * XXX dynamic allocate using nr_cpu_ids instead...
+ */
+#define PV_LOCK_HASH_BITS  (2 + NR_CPUS_BITS)
+


As said here, we should make it dynamically allocated depending on 
num_possible_cpus().



+#if PV_LOCK_HASH_BITS < 6
+#undef PV_LOCK_HASH_BITS
+#define PB_LOCK_HASH_BITS	6
+#endif
+
+#define PV_LOCK_HASH_SIZE	(1 << PV_LOCK_HASH_BITS)
+
+static struct pv_hash_bucket __pv_lock_hash[PV_LOCK_HASH_SIZE] 
cacheline_aligned;
+
+#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct 
pv_hash_bucket))
+
+static inline u32 hash_align(u32 hash)
+{
+	return hash & ~(PV_HB_PER_LINE - 1);
+}
+
+static struct qspinlock **pv_hash(struct qspinlock *lock)
+{
+   u32 hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
+   struct pv_hash_bucket *hb, *end;
+
+   if (!hash)
+   hash = 1;
+
+	hb = &__pv_lock_hash[hash_align(hash)];
+	for (;;) {
+		for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
+			if (cmpxchg(&hb->lock, NULL, HB_RESERVED)) {
+				WRITE_ONCE(hb->cpu, smp_processor_id());
+   /*
+* Since we must read lock first and cpu
+* second, we must write cpu first and lock
+* second, therefore use HB_RESERVE to mark an
+* entry in use before writing the values.
+*
+* This can cause hb_hash_find() to not find a
+* cpu even though _Q_SLOW_VAL, this is not a
+			 * problem since we re-check l->locked before
+			 * going to sleep and the unlock will have
+			 * cleared l->locked already.
+*/
+   

Re: [Xen-devel] [PATCH 9/9] qspinlock, x86, kvm: Implement KVM support for paravirt qspinlock

2015-03-19 Thread Waiman Long

On 03/19/2015 06:01 AM, Peter Zijlstra wrote:

On Wed, Mar 18, 2015 at 10:45:55PM -0400, Waiman Long wrote:

On 03/16/2015 09:16 AM, Peter Zijlstra wrote:
I do have some concern about this call site patching mechanism as the
modification is not atomic. The spin_unlock() calls are in many places in
the kernel. There is a possibility that a thread is calling a certain
spin_unlock call site while it is being patched by another one with the
alternative() function call.

So far, I don't see any problem with bare metal where paravirt_patch_insns()
is used to patch it to the move instruction. However, in a virtual guest
enivornment where paravirt_patch_call() was used, there were situations
where the system panic because of page fault on some invalid memory in the
kthread. If you look at the paravirt_patch_call(), you will see:

 :
b->opcode = 0xe8; /* call */
b->delta = delta;

If another CPU reads the instruction at the call site at the right moment,
it will get the modified call instruction, but not the new delta value. It
will then jump to a random location. I believe that was causing the system
panic that I saw.

So I think it is kind of risky to use it here unless we can guarantee that
call site patching is atomic wrt other CPUs.

Just look at where the patching is done:

init/main.c:start_kernel()
   check_bugs()
 alternative_instructions()
   apply_paravirt()

We're UP and not holding any locks, disable IRQs (see text_poke_early())
and have NMIs 'disabled'.


You are probably right. The initial apply_paravirt() was done before the 
SMP boot. Subsequent ones were at kernel module load time. I put a 
counter in the __native_queue_spin_unlock() and it registered 26949 
unlock calls in a 16-cpu guest before it got patched out.


The panic that I observed before might be due to some coding error of my 
own.


-Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15

2015-03-18 Thread Waiman Long

On 03/16/2015 09:16 AM, Peter Zijlstra wrote:

Hi Waiman,

As promised; here is the paravirt stuff I did during the trip to BOS last week.

All the !paravirt patches are more or less the same as before (the only real
change is the copyright lines in the first patch).

The paravirt stuff is 'simple' and KVM only -- the Xen code was a little more
convoluted and I've no real way to test that but it should be stright fwd to
make work.

I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has) and
it both booted and survived a hackbench run (perf bench sched messaging -g 20
-l 5000).

So while the paravirt code isn't the most optimal code ever conceived it does 
work.

Also, the paravirt patching includes replacing the call with movb $0, %arg1
for the native case, which should greatly reduce the cost of having
CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.

I feel that if someone were to do a Xen patch we can go ahead and merge this
stuff (finally!).

These patches do not implement the paravirt spinlock debug stats currently
implemented (separately) by KVM and Xen, but that should not be too hard to do
on top and in the 'generic' code -- no reason to duplicate all that.

Of course; once this lands people can look at improving the paravirt nonsense.



Thanks for sending this out. I have no problem with the !paravirt patch. 
I do have some comments on the paravirt one which I will reply individually.


Cheers,
Longman

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 8/9] qspinlock: Generic paravirt support

2015-03-18 Thread Waiman Long

On 03/16/2015 09:16 AM, Peter Zijlstra wrote:

Implement simple paravirt support for the qspinlock.

Provide a separate (second) version of the spin_lock_slowpath for
paravirt along with a special unlock path.

The second slowpath is generated by adding a few pv hooks to the
normal slowpath, but where those will compile away for the native
case, they expand into special wait/wake code for the pv version.

The actual MCS queue can use extra storage in the mcs_nodes[] array to
keep track of state and therefore uses directed wakeups.

The head contender has no such storage available and reverts to the
per-cpu lock entry similar to the current kvm code. We can do a single
enrty because any nesting will wake the vcpu and cause the lower loop
to retry.

Signed-off-by: Peter Zijlstra (Intel)pet...@infradead.org
---
  include/asm-generic/qspinlock.h |3
  kernel/locking/qspinlock.c  |   69 +-
  kernel/locking/qspinlock_paravirt.h |  177 

  3 files changed, 248 insertions(+), 1 deletion(-)

--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -118,6 +118,9 @@ static __always_inline bool virt_queue_s
  }
  #endif

+extern void __pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_queue_spin_unlock(struct qspinlock *lock);
+
  /*
   * Initializier
   */
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -18,6 +18,9 @@
   * Authors: Waiman Longwaiman.l...@hp.com
   *  Peter Zijlstrapet...@infradead.org
   */
+
+#ifndef _GEN_PV_LOCK_SLOWPATH
+
  #includelinux/smp.h
  #includelinux/bug.h
  #includelinux/cpumask.h
@@ -65,13 +68,21 @@

  #include mcs_spinlock.h

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define MAX_NODES  8
+#else
+#define MAX_NODES  4
+#endif
+
  /*
   * Per-CPU queue node structures; we can never have more than 4 nested
   * contexts: task, softirq, hardirq, nmi.
   *
   * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ *
+ * PV doubles the storage and uses the second cacheline for PV state.
   */
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);

  /*
   * We must be able to distinguish between no-tail and the tail at 0:0,
@@ -230,6 +241,32 @@ static __always_inline void set_locked(s
	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
  }

+
+/*
+ * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for
+ * all the PV callbacks.
+ */
+
+static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
+
+static __always_inline void __pv_wait_head(struct qspinlock *lock) { }
+
+#define pv_enabled()   false
+
+#define pv_init_node   __pv_init_node
+#define pv_wait_node   __pv_wait_node
+#define pv_kick_node   __pv_kick_node
+
+#define pv_wait_head   __pv_wait_head
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define queue_spin_lock_slowpath   native_queue_spin_lock_slowpath
+#endif
+
+#endif /* _GEN_PV_LOCK_SLOWPATH */
+
  /**
   * queue_spin_lock_slowpath - acquire the queue spinlock
   * @lock: Pointer to queue spinlock structure
@@ -259,6 +296,9 @@ void queue_spin_lock_slowpath(struct qsp

	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));

+   if (pv_enabled())
+   goto queue;
+
if (virt_queue_spin_lock(lock))
return;

@@ -335,6 +375,7 @@ void queue_spin_lock_slowpath(struct qsp
node += idx;
	node->locked = 0;
	node->next = NULL;
+   pv_init_node(node);

/*
 * We touched a (possibly) cold cacheline in the per-cpu queue node;
@@ -360,6 +401,7 @@ void queue_spin_lock_slowpath(struct qsp
prev = decode_tail(old);
	WRITE_ONCE(prev->next, node);

+   pv_wait_node(node);
	arch_mcs_spin_lock_contended(&node->locked);
}

@@ -374,6 +416,7 @@ void queue_spin_lock_slowpath(struct qsp
 * sequentiality; this is because the set_locked() function below
 * does not imply a full barrier.
 */
+   pv_wait_head(lock);
	while ((val = smp_load_acquire(&lock->val.counter)) &
	       _Q_LOCKED_PENDING_MASK)
cpu_relax();

@@ -406,6 +449,7 @@ void queue_spin_lock_slowpath(struct qsp
cpu_relax();

	arch_mcs_spin_unlock_contended(&next->locked);
+   pv_kick_node(next);

  release:
/*
@@ -414,3 +458,26 @@ void queue_spin_lock_slowpath(struct qsp
this_cpu_dec(mcs_nodes[0].count);
  }
  EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+/*
+ * Generate the paravirt code for queue_spin_unlock_slowpath().
+ */
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#define _GEN_PV_LOCK_SLOWPATH
+
+#undef pv_enabled
+#define pv_enabled()   

Re: [Xen-devel] [PATCH 9/9] qspinlock, x86, kvm: Implement KVM support for paravirt qspinlock

2015-03-18 Thread Waiman Long

On 03/16/2015 09:16 AM, Peter Zijlstra wrote:

Implement the paravirt qspinlock for x86-kvm.

We use the regular paravirt call patching to switch between:

   native_queue_spin_lock_slowpath()   __pv_queue_spin_lock_slowpath()
   native_queue_spin_unlock()   __pv_queue_spin_unlock()

We use a callee saved call for the unlock function which reduces the
i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions
again.

We further optimize the unlock path by patching the direct call with a
"movb $0,%arg1" if we are indeed using the native unlock code. This
makes the unlock code almost as fast as the !PARAVIRT case.

This significantly lowers the overhead of having
CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code.


Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>


I do have some concern about this call site patching mechanism as the 
modification is not atomic. The spin_unlock() calls are in many places 
in the kernel. There is a possibility that a thread is calling a certain 
spin_unlock call site while it is being patched by another CPU with the 
alternative() function call.


So far, I don't see any problem with bare metal where 
paravirt_patch_insns() is used to patch it to the move instruction. 
However, in a virtual guest environment where paravirt_patch_call() was 
used, there were situations where the system panicked because of a page 
fault on some invalid memory in the kthread. If you look at 
paravirt_patch_call(), you will see:


:
b->opcode = 0xe8; /* call */
b->delta = delta;

If another CPU reads the instruction at the call site at the right 
moment, it will get the modified call instruction, but not the new delta 
value. It will then jump to a random location. I believe that was 
causing the system panic that I saw.
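
Spelled out, the window looks roughly like the sketch below (illustrative
struct and function, not the kernel's code):

#include <stdint.h>

/*
 * Illustrative sketch: a 5-byte x86 near call is an 0xe8 opcode followed
 * by a 32-bit displacement relative to the end of the instruction.
 */
struct call_insn {
	uint8_t opcode;			/* 0xe8 == call rel32 */
	int32_t delta;			/* target - (site + 5) */
} __attribute__((packed));

/*
 * The two stores below are separate, not one atomic 5-byte write.  A CPU
 * that fetches the call site between them sees the new opcode paired with
 * the stale displacement bytes and branches to a bogus address -- the
 * failure mode described above.
 */
static void patch_call(struct call_insn *site, uintptr_t target)
{
	site->opcode = 0xe8;					/* store #1 */
	site->delta  = (int32_t)(target - ((uintptr_t)site + 5));	/* store #2 */
}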


So I think it is kind of risky to use it here unless we can guarantee 
that call site patching is atomic wrt other CPUs.


-Longman



[Xen-devel] [PATCH v14 01/11] qspinlock: A simple generic 4-byte queue spinlock

2015-01-20 Thread Waiman Long
This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in the single-thread
case, and it can be much faster in high-contention situations, especially
when the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities.  By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.
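
For illustration, the encode/decode step could look like the sketch below;
the exact bit offsets are an assumption based on the description above and
are not copied from the patch:

#include <stdint.h>

/*
 * Illustrative layout only: one byte for the lock itself, a 2-bit per-cpu
 * node index, then the CPU number + 1, so that a tail of 0 can mean
 * "no waiters".
 */
#define Q_TAIL_IDX_OFFSET	8
#define Q_TAIL_IDX_BITS		2
#define Q_TAIL_IDX_MASK		(((1U << Q_TAIL_IDX_BITS) - 1) << Q_TAIL_IDX_OFFSET)
#define Q_TAIL_CPU_OFFSET	(Q_TAIL_IDX_OFFSET + Q_TAIL_IDX_BITS)

static inline uint32_t encode_tail(int cpu, int idx)
{
	return ((uint32_t)(cpu + 1) << Q_TAIL_CPU_OFFSET) |
	       ((uint32_t)idx << Q_TAIL_IDX_OFFSET);
}

static inline void decode_tail(uint32_t tail, int *cpu, int *idx)
{
	*cpu = (int)(tail >> Q_TAIL_CPU_OFFSET) - 1;
	*idx = (int)((tail & Q_TAIL_IDX_MASK) >> Q_TAIL_IDX_OFFSET);
}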

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.

Signed-off-by: Waiman Long <waiman.l...@hp.com>
Signed-off-by: Peter Zijlstra <pet...@infradead.org>
---
 include/asm-generic/qspinlock.h   |  132 +
 include/asm-generic/qspinlock_types.h |   58 +
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 +
 kernel/locking/mcs_spinlock.h |1 +
 kernel/locking/qspinlock.c|  207 +
 6 files changed, 406 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 000..aafb7b4
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,132 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.l...@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *  locked wrt the lockref code to avoid lock stealing by the lockref
+ *  code and change things underneath the lock. This also allows some
+ *  optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+   return !atomic_read(&lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+   return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+   if (!atomic_read(&lock->val) &&
+  (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+extern void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+   u32 val
