Re: [PATCH v11 06/16] qspinlock: prolong the stay in the pending bit path

2014-06-11 Thread Peter Zijlstra
On Fri, May 30, 2014 at 11:43:52AM -0400, Waiman Long wrote:
 ---
  kernel/locking/qspinlock.c |   18 --
  1 files changed, 16 insertions(+), 2 deletions(-)
 
 diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
 index fc7fd8c..7f10758 100644
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -233,11 +233,25 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
*/
   for (;;) {
   /*
 -  * If we observe any contention; queue.
 +  * If we observe that the queue is not empty or both
 +  * the pending and lock bits are set, queue
*/
  - if (val & ~_Q_LOCKED_MASK)
  + if ((val & _Q_TAIL_MASK) ||
 + (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)))
   goto queue;
  
 + if (val == _Q_PENDING_VAL) {
 + /*
 +  * Pending bit is set, but not the lock bit.
 +  * Assuming that the pending bit holder is going to
 +  * set the lock bit and clear the pending bit soon,
 +  * it is better to wait than to exit at this point.
 +  */
 + cpu_relax();
  + val = atomic_read(&lock->val);
 + continue;
 + }
 +
   new = _Q_LOCKED_VAL;
   if (val == new)
   new |= _Q_PENDING_VAL;


So, again, you just posted a new version without replying to the
previous discussion; so let me try again, what's wrong with the proposal
here:

  lkml.kernel.org/r/20140417163640.gt11...@twins.programming.kicks-ass.net





Re: [PATCH v11 09/16] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-06-11 Thread Peter Zijlstra
On Fri, May 30, 2014 at 11:43:55AM -0400, Waiman Long wrote:
 Enabling this configuration feature causes a slight decrease in the
 performance of an uncontended lock-unlock operation, by about 1-2%,
 mainly due to the use of a static key. However, uncontended lock-unlock
 operations are really just a tiny percentage of a real workload, so
 there should be no noticeable change in application performance.

No, entirely unacceptable.

 +#ifdef CONFIG_VIRT_UNFAIR_LOCKS
 +/**
 + * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
 + * @lock : Pointer to queue spinlock structure
 + * Return: 1 if lock acquired, 0 if failed
 + */
 +static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
 +{
 + union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 +
  + if (!qlock->locked && (cmpxchg(&qlock->locked, 0, _Q_LOCKED_VAL) == 0))
 + return 1;
 + return 0;
 +}
 +
 +/**
 + * queue_spin_lock_unfair - acquire a queue spinlock unfairly
 + * @lock: Pointer to queue spinlock structure
 + */
 +static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
 +{
 + union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 +
  + if (likely(cmpxchg(&qlock->locked, 0, _Q_LOCKED_VAL) == 0))
 + return;
 + /*
 +  * Since the lock is now unfair, we should not activate the 2-task
 +  * pending bit spinning code path which disallows lock stealing.
 +  */
 + queue_spin_lock_slowpath(lock, -1);
 +}

Why is this needed?

 +/*
  + * Redefine arch_spin_lock and arch_spin_trylock as inline functions that will
 + * jump to the unfair versions if the static key virt_unfairlocks_enabled
 + * is true.
 + */
 +#undef arch_spin_lock
 +#undef arch_spin_trylock
 +#undef arch_spin_lock_flags
 +
 +/**
 + * arch_spin_lock - acquire a queue spinlock
 + * @lock: Pointer to queue spinlock structure
 + */
 +static inline void arch_spin_lock(struct qspinlock *lock)
 +{
  + if (static_key_false(&virt_unfairlocks_enabled))
 + queue_spin_lock_unfair(lock);
 + else
 + queue_spin_lock(lock);
 +}
 +
 +/**
 + * arch_spin_trylock - try to acquire the queue spinlock
 + * @lock : Pointer to queue spinlock structure
 + * Return: 1 if lock acquired, 0 if failed
 + */
 +static inline int arch_spin_trylock(struct qspinlock *lock)
 +{
  + if (static_key_false(&virt_unfairlocks_enabled))
 + return queue_spin_trylock_unfair(lock);
 + else
 + return queue_spin_trylock(lock);
 +}

So I really don't see the point of all this? Why do you need special
{try,}lock paths for this case? Are you worried about the upper 24bits?

 diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
 index ae1b19d..3723c83 100644
 --- a/kernel/locking/qspinlock.c
 +++ b/kernel/locking/qspinlock.c
 @@ -217,6 +217,14 @@ static __always_inline int try_set_locked(struct qspinlock *lock)
  {
   struct __qspinlock *l = (void *)lock;
  
 +#ifdef CONFIG_VIRT_UNFAIR_LOCKS
 + /*
 +  * Need to use atomic operation to grab the lock when lock stealing
 +  * can happen.
 +  */
  + if (static_key_false(&virt_unfairlocks_enabled))
  + return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
 +#endif
   barrier();
    ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
   barrier();

Why? If we have a simple test-and-set lock like below, we'll never get
here at all.

 @@ -252,6 +260,18 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
  
    BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
  
 +#ifdef CONFIG_VIRT_UNFAIR_LOCKS
 + /*
 +  * A simple test and set unfair lock
 +  */
  + if (static_key_false(&virt_unfairlocks_enabled)) {
  + cpu_relax();	/* Relax after a failed lock attempt */

Meh, I don't think anybody can tell the difference if you put that in or
not, therefore don't.

 + while (!queue_spin_trylock(lock))
 + cpu_relax();
 + return;
 + }
 +#endif /* CONFIG_VIRT_UNFAIR_LOCKS */

If you're really worried about those upper 24bits, you can always clear
them when you get here.



Re: [PATCH v11 09/16] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-06-11 Thread Peter Zijlstra
On Wed, Jun 11, 2014 at 12:54:02PM +0200, Peter Zijlstra wrote:
  @@ -252,6 +260,18 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
   
   BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
   
  +#ifdef CONFIG_VIRT_UNFAIR_LOCKS
  +   /*
  +* A simple test and set unfair lock
  +*/
  +   if (static_key_false(&virt_unfairlocks_enabled)) {
  +   cpu_relax();/* Relax after a failed lock attempt */
 
 Meh, I don't think anybody can tell the difference if you put that in or
 not, therefore don't.
 
  +   while (!queue_spin_trylock(lock))
  +   cpu_relax();
  +   return;
  +   }
  +#endif /* CONFIG_VIRT_UNFAIR_LOCKS */
 
 If you're really worried about those upper 24bits, you can always clear
 them when you get here.

I don't think its a problem at all; flipping the static_key requires
stop_machine, which guarantees us that there are no spinlocks held. So I
think you can actually BUG_ON() the upper 24bits.
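
Concretely, something like the below in the test-and-set branch would do
(untested sketch, reusing the names from your patch):

	if (static_key_false(&virt_unfairlocks_enabled)) {
		/*
		 * The key is only flipped under stop_machine(), at which
		 * point no spinlock can be held or queued, so nothing but
		 * the locked byte should ever be set here.
		 */
		BUG_ON(val & ~_Q_LOCKED_MASK);
		while (!queue_spin_trylock(lock))
			cpu_relax();
		return;
	}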



[PULL] vhost: infrastructure changes for 3.16

2014-06-11 Thread Michael S. Tsirkin
Hi Linus,
Please pull the following.
Please note this needs to be merged before merging
target-pending PULL which Nicholas will be sending
out shortly.

Thanks!

The following changes since commit 1860e379875dfe7271c649058aeddffe5afd9d0d:

  Linux 3.15 (2014-06-08 11:19:54 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 47283bef7ed356629467d1fac61687756e48f254:

  vhost: move memory pointer to VQs (2014-06-09 16:21:07 +0300)


vhost: infrastructure changes for 3.16

This reworks the vhost core, dropping unnecessary RCU uses in favor of the VQ
mutexes which are taken on the fast path anyway.  This fixes worst-case latency
for users which change the memory mappings a lot.
Memory allocation for vhost-net now supports falling back to vmalloc (same as
for vhost-scsi); this makes it possible to create the device on systems where
memory is very fragmented, at slightly lower performance.
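
For reference, the allocation side of the vhost-net change boils down to the
usual kmalloc-then-vmalloc fallback; a rough sketch from memory (not the exact
patch):

	static struct vhost_net *vhost_net_alloc(void)
	{
		/* Try a physically contiguous allocation first, quietly. */
		struct vhost_net *n = kmalloc(sizeof(*n), GFP_KERNEL | __GFP_NOWARN);

		if (!n)
			n = vmalloc(sizeof(*n));	/* fragmented memory fallback */
		return n;
	}

	static void vhost_net_free(struct vhost_net *n)
	{
		if (is_vmalloc_addr(n))
			vfree(n);
		else
			kfree(n);
	}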

Signed-off-by: Michael S. Tsirkin m...@redhat.com


Michael S. Tsirkin (4):
  vhost-net: extend device allocation to vmalloc
  vhost: replace rcu with mutex
  vhost: move acked_features to VQs
  vhost: move memory pointer to VQs

 drivers/vhost/vhost.h | 19 --
 drivers/vhost/net.c   | 35 ---
 drivers/vhost/scsi.c  | 26 --
 drivers/vhost/test.c  | 11 +++---
 drivers/vhost/vhost.c | 97 ++-
 5 files changed, 101 insertions(+), 87 deletions(-)


Re: [PATCH] block: virtio_blk: don't hold spin lock during world switch

2014-06-11 Thread Paolo Bonzini

On 02/06/2014 15:06, Ming Lei wrote:


 If you're running SMP under an emulator where exits are expensive, then
 this wins.  Under KVM it's marginal at best.

Both my tests on arm64 and x86 are under KVM, and it looks like the
patch can improve performance a lot. IMO, even under KVM, virtio-blk
performance still depends on how well the hypervisor (qemu, ...)
emulates the device, and generally speaking, it is expensive to
switch from guest to host and let the host handle the notification.


The difference is that virtio-pci supports ioeventfd and virtio-mmio 
doesn't.


With ioeventfd you can tell KVM "I don't care about the value that is
written to a memory location, only that it is accessed".  Then when the
write happens, KVM doesn't do an expensive userspace exit; it just
writes 1 to an eventfd.


It then returns to the guest, userspace picks up the eventfd via its 
poll() loop and services the device.
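
For the curious, registering such a doorbell with KVM is a single ioctl; a
minimal userspace sketch (vm_fd and notify_addr are placeholders here, and
virtio-pci actually uses the PIO flag plus a datamatch on the queue index):

	#include <linux/kvm.h>
	#include <string.h>
	#include <sys/eventfd.h>
	#include <sys/ioctl.h>

	static int wire_up_doorbell(int vm_fd, __u64 notify_addr)
	{
		struct kvm_ioeventfd ioev;
		int efd = eventfd(0, 0);

		memset(&ioev, 0, sizeof(ioev));
		ioev.addr  = notify_addr;	/* guest address of the notify register */
		ioev.len   = 2;			/* 16-bit queue-notify write */
		ioev.fd    = efd;
		ioev.flags = 0;			/* no datamatch: any write fires the eventfd */
		if (ioctl(vm_fd, KVM_IOEVENTFD, &ioev) < 0)
			return -1;
		return efd;	/* add this fd to the I/O thread's poll() loop */
	}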


This is already useful for throughput on UP, and the small latency cost 
(because of the cost of the event loop in the I/O thread, and possibly 
the cost of waking up the thread) is usually offset by the benefit.


But on SMP you get double benefit.  Obviously, the kernel doesn't have
to spin while userspace does its stuff.  On top of this, there is also a
latency improvement from ioeventfd, because QEMU processes
virtqueue_notify under its big QEMU lock.  With ioeventfd, serialized
virtqueue processing can still be a throughput bottleneck, but it doesn't
affect latency.  Without ioeventfd it affects the VCPUs' latency and
negates a lot of the benefit of Ming Lei's patch.


You can try disabling ioeventfd with -global 
virtio-blk-pci.ioeventfd=off on the QEMU command line.  Performance 
will plummet. :)


Paolo


Re: [PATCH v11 06/16] qspinlock: prolong the stay in the pending bit path

2014-06-11 Thread Long, Wai Man


On 6/11/2014 6:26 AM, Peter Zijlstra wrote:

On Fri, May 30, 2014 at 11:43:52AM -0400, Waiman Long wrote:

---
  kernel/locking/qspinlock.c |   18 --
  1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc7fd8c..7f10758 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -233,11 +233,25 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 */
for (;;) {
/*
-* If we observe any contention; queue.
+* If we observe that the queue is not empty or both
+* the pending and lock bits are set, queue
 */
-   if (val & ~_Q_LOCKED_MASK)
+   if ((val & _Q_TAIL_MASK) ||
+   (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)))
goto queue;
  
+		if (val == _Q_PENDING_VAL) {

+   /*
+* Pending bit is set, but not the lock bit.
+* Assuming that the pending bit holder is going to
+* set the lock bit and clear the pending bit soon,
+* it is better to wait than to exit at this point.
+*/
+   cpu_relax();
+   val = atomic_read(&lock->val);
+   continue;
+   }
+
new = _Q_LOCKED_VAL;
if (val == new)
new |= _Q_PENDING_VAL;


So, again, you just posted a new version without replying to the
previous discussion; so let me try again, what's wrong with the proposal
here:

   lkml.kernel.org/r/20140417163640.gt11...@twins.programming.kicks-ass.net




I thought I had answered you before; maybe the message was lost or the
answer was not complete. Anyway, I will try to respond to your question
again here.



Wouldn't something like:

while (atomic_read(&lock->val) == _Q_PENDING_VAL)
cpu_relax();

before the cmpxchg loop have gotten you all this?


That is not exactly the same. That loop will exit if other bits are set or
the pending bit is cleared. In that case, we would still need to do the same
check at the beginning of the for loop in order to avoid doing an extra
cmpxchg that is not necessary.
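
To illustrate the point, a rough sketch of what the proposal turns into
(illustration only, not the posted patch):

	while (atomic_read(&lock->val) == _Q_PENDING_VAL)
		cpu_relax();		/* wait out the pending-only window */

	val = atomic_read(&lock->val);
	for (;;) {
		/*
		 * The pending-only case can show up again here, e.g. after
		 * a failed cmpxchg re-reads val, so the same check is still
		 * needed inside the loop to avoid a cmpxchg that cannot
		 * succeed.
		 */
		...
	}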



I just tried this on my code and I cannot see a difference.


As I said before, I did see a difference with that change. I think it
depends on the CPU chip that we used for testing. I ran my test on a
10-core Westmere-EX chip, running my microbenchmark on different pairs of
cores within the same chip. It produces results that vary from 779.5ms up
to 1192ms. Without that patch, the lowest value I can get is still close
to 800ms, but the highest can be up to 1800ms or so. So I believe it is
just a matter of timing that you did not observe on your test machine.


-Longman



Re: [PATCH v11 09/16] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-06-11 Thread Long, Wai Man


On 6/11/2014 6:54 AM, Peter Zijlstra wrote:

On Fri, May 30, 2014 at 11:43:55AM -0400, Waiman Long wrote:

Enabling this configuration feature causes a slight decrease in the
performance of an uncontended lock-unlock operation, by about 1-2%,
mainly due to the use of a static key. However, uncontended lock-unlock
operations are really just a tiny percentage of a real workload, so
there should be no noticeable change in application performance.

No, entirely unacceptable.


+#ifdef CONFIG_VIRT_UNFAIR_LOCKS
+/**
+ * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+   union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+   if (!qlock->locked && (cmpxchg(&qlock->locked, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+/**
+ * queue_spin_lock_unfair - acquire a queue spinlock unfairly
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
+{
+   union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+   if (likely(cmpxchg(&qlock->locked, 0, _Q_LOCKED_VAL) == 0))
+   return;
+   /*
+* Since the lock is now unfair, we should not activate the 2-task
+* pending bit spinning code path which disallows lock stealing.
+*/
+   queue_spin_lock_slowpath(lock, -1);
+}

Why is this needed?


I added the unfair versions of lock and trylock because my original version
isn't a simple test-and-set lock. I have now changed the core part to use
the simple test-and-set lock. However, I still think that an unfair version
in the fast path can help performance when both the unfair lock and the
paravirt spinlock are enabled. In that case, the paravirt spinlock code
will disable the unfair lock code in the slowpath, but still allow the
unfair version in the fast path to get the best possible performance in a
virtual guest.


Yes, I could take that out to allow either unfair or paravirt spinlock, 
but not both. I do think that a little bit of unfairness will help in 
the virtual environment.



+/*
+ * Redefine arch_spin_lock and arch_spin_trylock as inline functions that will
+ * jump to the unfair versions if the static key virt_unfairlocks_enabled
+ * is true.
+ */
+#undef arch_spin_lock
+#undef arch_spin_trylock
+#undef arch_spin_lock_flags
+
+/**
+ * arch_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static inline void arch_spin_lock(struct qspinlock *lock)
+{
+   if (static_key_false(&virt_unfairlocks_enabled))
+   queue_spin_lock_unfair(lock);
+   else
+   queue_spin_lock(lock);
+}
+
+/**
+ * arch_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int arch_spin_trylock(struct qspinlock *lock)
+{
+   if (static_key_false(&virt_unfairlocks_enabled))
+   return queue_spin_trylock_unfair(lock);
+   else
+   return queue_spin_trylock(lock);
+}

So I really don't see the point of all this? Why do you need special
{try,}lock paths for this case? Are you worried about the upper 24bits?


No, as I said above. I was planning for the coexistence of unfair lock 
in the fast path and paravirt spinlock in the slowpath.



diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ae1b19d..3723c83 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -217,6 +217,14 @@ static __always_inline int try_set_locked(struct qspinlock *lock)
  {
struct __qspinlock *l = (void *)lock;
  
+#ifdef CONFIG_VIRT_UNFAIR_LOCKS

+   /*
+* Need to use atomic operation to grab the lock when lock stealing
+* can happen.
+*/
+   if (static_key_false(&virt_unfairlocks_enabled))
+   return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
+#endif
barrier();
	ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
barrier();

Why? If we have a simple test-and-set lock like below, we'll never get
here at all.


Again, it is due to the coexistence of the unfair lock in the fast path
and the paravirt spinlock in the slowpath.



@@ -252,6 +260,18 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
  
   	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
  
+#ifdef CONFIG_VIRT_UNFAIR_LOCKS

+   /*
+* A simple test and set unfair lock
+*/
+   if (static_key_false(&virt_unfairlocks_enabled)) {
+   cpu_relax();/* Relax after a failed lock attempt */

Meh, I don't think anybody can tell the difference if you put that in or
not, therefore don't.


Yes, I can take out the cpu_relax() here.

-Longman

Re: Using virtio for inter-VM communication

2014-06-11 Thread Rusty Russell
Henning Schild henning.sch...@siemens.com writes:
 Hi,

 i am working on the jailhouse[1] project and am currently looking at
 inter-VM communication. We want to connect guests directly with virtual
 consoles based on shared memory. The code complexity in the hypervisor
 should be minimal, it should just make the shared memory discoverable
 and provide a signaling mechanism.

Hi Henning,

The virtio assumption was that the host can see all of guest
memory.  This simplifies things significantly, and makes it efficient.

If you don't have this, *someone* needs to do a copy.  Usually the guest
OS does a bounce buffer into your shared region.  Goodbye performance.
Or you can play remapping tricks.  Goodbye performance again.

My preferred model is to have a trusted helper (ie. host) which
understands how to copy between virtio rings.  The backend guest (to
steal Xen vocab) R/O maps the descriptor, avail ring and used rings in
the guest.  It then asks the trusted helper to do various operation
(copy into writable descriptor, copy out of readable descriptor, mark
used).  The virtio ring itself acts as a grant table.

Note: that helper mechanism is completely protocol agnostic.  It was
also explicitly designed into the virtio mechanism (with its 4k
boundaries for data structures and its 'len' field to indicate how much
was written into the descriptor). 

It was also never implemented, and remains a thought experiment.
However, implementing it in lguest should be fairly easy.
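
To make the thought experiment a little more concrete, the helper's interface
could be as small as something like this (names invented, purely illustrative,
nothing of the sort exists today):

	struct vring_xfer {
		u32 queue;	/* which virtqueue */
		u32 head;	/* descriptor index, acting as the grant */
		u64 offset;	/* offset within that descriptor chain */
		u64 len;	/* bytes to copy */
		void *local;	/* backend-local buffer */
	};

	int helper_copy_from_desc(struct vring_xfer *x);	/* readable descs */
	int helper_copy_to_desc(struct vring_xfer *x);		/* writable descs */
	int helper_mark_used(u32 queue, u32 head, u32 written);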

Cheers,
Rusty.


Re: Using virtio for inter-VM communication

2014-06-11 Thread Jan Kiszka
On 2014-06-12 04:27, Rusty Russell wrote:
 Henning Schild henning.sch...@siemens.com writes:
 Hi,

 i am working on the jailhouse[1] project and am currently looking at
 inter-VM communication. We want to connect guests directly with virtual
 consoles based on shared memory. The code complexity in the hypervisor
 should be minimal, it should just make the shared memory discoverable
 and provide a signaling mechanism.
 
 Hi Henning,
 
 The virtio assumption was that the host can see all of guest
 memory.  This simplifies things significantly, and makes it efficient.
 
 If you don't have this, *someone* needs to do a copy.  Usually the guest
 OS does a bounce buffer into your shared region.  Goodbye performance.
 Or you can play remapping tricks.  Goodbye performance again.
 
 My preferred model is to have a trusted helper (ie. host) which
 understands how to copy between virtio rings.  The backend guest (to
 steal Xen vocab) R/O maps the descriptor, avail ring and used rings in
 the guest.  It then asks the trusted helper to do various operation
 (copy into writable descriptor, copy out of readable descriptor, mark
 used).  The virtio ring itself acts as a grant table.
 
 Note: that helper mechanism is completely protocol agnostic.  It was
 also explicitly designed into the virtio mechanism (with its 4k
 boundaries for data structures and its 'len' field to indicate how much
 was written into the descriptor). 
 
 It was also never implemented, and remains a thought experiment.
 However, implementing it in lguest should be fairly easy.

The reason why a trusted helper, i.e. additional logic in the
hypervisor, is not our favorite solution is that we'd like to keep the
hypervisor as small as possible. I wouldn't exclude such an approach
categorically, but we have to weigh the costs (lines of code, additional
hypervisor interface) carefully against the gain (existing
specifications and guest driver infrastructure).

Back to VIRTIO_F_RING_SHMEM_ADDR (which you once brought up in an MCA
working group discussion): What speaks against introducing an
alternative encoding of addresses inside virtio data structures? The
idea of this flag was to replace guest-physical addresses with offsets
into a shared memory region associated with or part of a virtio device.
That would preserve zero-copy capabilities (as long as you can work
against the shared mem directly, e.g. doing DMA from a physical NIC or
storage device into it) and keep the hypervisor out of the loop. Is it
too invasive to existing infrastructure or does it have some other pitfalls?
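
To illustrate what I mean (purely hypothetical code, the flag only exists as
an idea): with such a feature negotiated, the backend would stop treating
desc->addr as a guest-physical address and simply bounds-check it against the
shared window, shmem_base/shmem_size being whatever describes the region
associated with the device:

	static void *desc_to_ptr(const struct vring_desc *desc,
				 void *shmem_base, u64 shmem_size)
	{
		if (desc->addr > shmem_size || desc->len > shmem_size - desc->addr)
			return NULL;			/* outside the shared window */
		return shmem_base + desc->addr;		/* addr is an offset, not a GPA */
	}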

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux


Re: [PATCH v11 09/16] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-06-11 Thread Peter Zijlstra
On Wed, Jun 11, 2014 at 09:37:55PM -0400, Long, Wai Man wrote:
 
 On 6/11/2014 6:54 AM, Peter Zijlstra wrote:
 On Fri, May 30, 2014 at 11:43:55AM -0400, Waiman Long wrote:
 Enabling this configuration feature causes a slight decrease in the
 performance of an uncontended lock-unlock operation, by about 1-2%,
 mainly due to the use of a static key. However, uncontended lock-unlock
 operations are really just a tiny percentage of a real workload, so
 there should be no noticeable change in application performance.
 No, entirely unacceptable.
 
 +#ifdef CONFIG_VIRT_UNFAIR_LOCKS
 +/**
 + * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
 + * @lock : Pointer to queue spinlock structure
 + * Return: 1 if lock acquired, 0 if failed
 + */
 +static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
 +{
 +   union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 +
 +   if (!qlock->locked && (cmpxchg(&qlock->locked, 0, _Q_LOCKED_VAL) == 0))
 +   return 1;
 +   return 0;
 +}
 +
 +/**
 + * queue_spin_lock_unfair - acquire a queue spinlock unfairly
 + * @lock: Pointer to queue spinlock structure
 + */
 +static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
 +{
 +   union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 +
 +   if (likely(cmpxchg(&qlock->locked, 0, _Q_LOCKED_VAL) == 0))
 +   return;
 +   /*
 +* Since the lock is now unfair, we should not activate the 2-task
 +* pending bit spinning code path which disallows lock stealing.
 +*/
 +   queue_spin_lock_slowpath(lock, -1);
 +}
 Why is this needed?
 
 I added the unfair versions of lock and trylock because my original version
 isn't a simple test-and-set lock. I have now changed the core part to use the
 simple test-and-set lock. However, I still think that an unfair version in the
 fast path can help performance when both the unfair lock and the paravirt
 spinlock are enabled. In that case, the paravirt spinlock code will disable
 the unfair lock code in the slowpath, but still allow the unfair version in
 the fast path to get the best possible performance in a virtual guest.
 
 Yes, I could take that out to allow either unfair or paravirt spinlock, but
 not both. I do think that a little bit of unfairness will help in the
 virtual environment.

When will you learn to like simplicity and stop this massive
over-engineering effort?

There's no sane reason to have the test-and-set virt and paravirt locks
enabled at the same bloody time.

There's 3 distinct cases:

 - native
 - virt
 - paravirt

And they do not overlap. Furthermore, if there is any possibility at all
of not polluting the native code, grab it with both hands.

Native performance is king, try your very utmost bestest to preserve
that, paravirt is a distant second and nobody sane should care about the
virt case at all.

If you want extra lock stealing in the paravirt case, put it in the
slowpath code before you start queueing.
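
That is, something along these lines at the top of the paravirt slowpath
(just a sketch; pv_enabled() and STEAL_SPIN_COUNT are made-up names):

	if (pv_enabled()) {
		int count = STEAL_SPIN_COUNT;

		/* Opportunistic lock stealing before we join the queue. */
		while (count--) {
			if (queue_spin_trylock(lock))
				return;
			cpu_relax();
		}
		/* no luck; fall through and queue like everybody else */
	}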

