Re: [PATCH v11 09/16] qspinlock, x86: Allow unfair spinlock in a virtual guest

2014-06-11 Thread Long, Wai Man


On 6/11/2014 6:54 AM, Peter Zijlstra wrote:

On Fri, May 30, 2014 at 11:43:55AM -0400, Waiman Long wrote:

Enabling this configuration feature causes a slight decrease in the
performance of an uncontended lock-unlock operation, by about 1-2%,
mainly due to the use of a static key. However, uncontended lock-unlock
operations are really just a tiny percentage of a real workload, so
there should be no noticeable change in application performance.

No, entirely unacceptable.


+#ifdef CONFIG_VIRT_UNFAIR_LOCKS
+/**
+ * queue_spin_trylock_unfair - try to acquire the queue spinlock unfairly
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock_unfair(struct qspinlock *lock)
+{
+   union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+   if (!qlock->locked && (cmpxchg(&qlock->locked, 0, _Q_LOCKED_VAL) == 0))
+   return 1;
+   return 0;
+}
+
+/**
+ * queue_spin_lock_unfair - acquire a queue spinlock unfairly
+ * @lock: Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock_unfair(struct qspinlock *lock)
+{
+   union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+
+   if (likely(cmpxchg(&qlock->locked, 0, _Q_LOCKED_VAL) == 0))
+   return;
+   /*
+* Since the lock is now unfair, we should not activate the 2-task
+* pending bit spinning code path which disallows lock stealing.
+*/
+   queue_spin_lock_slowpath(lock, -1);
+}

Why is this needed?


I added the unfair versions of lock and trylock because my original
version wasn't a simple test-and-set lock. I have now changed the core
part to use a simple test-and-set lock. However, I still think that an
unfair version in the fast path can help performance when both the
unfair lock and the paravirt spinlock are enabled. In that case, the
paravirt spinlock code will disable the unfair lock code in the
slowpath, but still allow the unfair version in the fast path to get
the best possible performance in a virtual guest.
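
For reference, a "simple test-and-set lock" in this discussion means
roughly the following minimal user-space sketch, written here with C11
atomics. It is not the kernel code; the names and the user-space
setting are illustrative only.

#include <stdatomic.h>
#include <stdbool.h>

struct ts_lock {
	atomic_int locked;		/* 0 = free, 1 = held */
};

static inline bool ts_trylock(struct ts_lock *l)
{
	int expected = 0;

	/* One compare-and-swap; any caller can grab a free lock, so
	 * "stealing" ahead of queued waiters is inherently allowed. */
	return atomic_compare_exchange_strong(&l->locked, &expected, 1);
}

static inline void ts_lock(struct ts_lock *l)
{
	/* No queueing and no fairness: spin until the CAS succeeds. */
	while (!ts_trylock(l))
		;			/* a real lock would cpu_relax() here */
}

static inline void ts_unlock(struct ts_lock *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}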


Yes, I could take that out to allow either unfair or paravirt spinlock, 
but not both. I do think that a little bit of unfairness will help in 
the virtual environment.



+/*
+ * Redefine arch_spin_lock and arch_spin_trylock as inline functions that will
+ * jump to the unfair versions if the static key virt_unfairlocks_enabled
+ * is true.
+ */
+#undef arch_spin_lock
+#undef arch_spin_trylock
+#undef arch_spin_lock_flags
+
+/**
+ * arch_spin_lock - acquire a queue spinlock
+ * @lock: Pointer to queue spinlock structure
+ */
+static inline void arch_spin_lock(struct qspinlock *lock)
+{
+   if (static_key_false(&virt_unfairlocks_enabled))
+   queue_spin_lock_unfair(lock);
+   else
+   queue_spin_lock(lock);
+}
+
+/**
+ * arch_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int arch_spin_trylock(struct qspinlock *lock)
+{
+   if (static_key_false(&virt_unfairlocks_enabled))
+   return queue_spin_trylock_unfair(lock);
+   else
+   return queue_spin_trylock(lock);
+}

So I really don't see the point of all this? Why do you need special
{try,}lock paths for this case? Are you worried about the upper 24bits?


No. As I said above, I was planning for the coexistence of the unfair
lock in the fast path and the paravirt spinlock in the slowpath.



diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index ae1b19d..3723c83 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -217,6 +217,14 @@ static __always_inline int try_set_locked(struct qspinlock *lock)
  {
struct __qspinlock *l = (void *)lock;
  
+#ifdef CONFIG_VIRT_UNFAIR_LOCKS
+   /*
+* Need to use atomic operation to grab the lock when lock stealing
+* can happen.
+*/
+   if (static_key_false(&virt_unfairlocks_enabled))
+   return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
+#endif
barrier();
ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;
barrier();

Why? If we have a simple test-and-set lock like below, we'll never get
here at all.


Again, it is due to the coexistence of the unfair lock in the fast path
and the paravirt spinlock in the slowpath.
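
To spell out the interaction: without lock stealing, the queue head is
the only CPU that can become the next owner, so an unconditional store
of the locked byte is safe. Once the unfair fast path can steal the
lock at any time, the claim has to be an atomic cmpxchg that can fail.
A rough user-space sketch of that distinction (C11 atomics; the names
are illustrative, not the kernel API):

#include <stdatomic.h>
#include <stdbool.h>

#define LOCKED_VAL	1

/* Fair queueing only: the queue head is the sole candidate owner, so an
 * unconditional store of the locked byte cannot race with another
 * acquirer. */
static inline void claim_exclusive(atomic_int *locked)
{
	atomic_store(locked, LOCKED_VAL);
}

/* Stealing enabled: a fast-path locker may set the byte at any moment,
 * so the queue head must claim the lock with a CAS that can fail. */
static inline bool claim_with_stealers(atomic_int *locked)
{
	int expected = 0;

	return atomic_compare_exchange_strong(locked, &expected, LOCKED_VAL);
}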



@@ -252,6 +260,18 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
  
  	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
  
+#ifdef CONFIG_VIRT_UNFAIR_LOCKS
+   /*
+* A simple test and set unfair lock
+*/
+   if (static_key_false(&virt_unfairlocks_enabled)) {
+   cpu_relax();/* Relax after a failed lock attempt */

Meh, I don't think anybody can tell the difference if you put that in or
not, therefore don't.


Yes, I can take out the cpu_relax() here.

-Longman

Re: [PATCH v11 06/16] qspinlock: prolong the stay in the pending bit path

2014-06-11 Thread Long, Wai Man


On 6/11/2014 6:26 AM, Peter Zijlstra wrote:

On Fri, May 30, 2014 at 11:43:52AM -0400, Waiman Long wrote:

---
  kernel/locking/qspinlock.c |   18 ++++++++++++++++--
  1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index fc7fd8c..7f10758 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -233,11 +233,25 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 */
for (;;) {
/*
-* If we observe any contention; queue.
+* If we observe that the queue is not empty or both
+* the pending and lock bits are set, queue
 */
-   if (val & ~_Q_LOCKED_MASK)
+   if ((val & _Q_TAIL_MASK) ||
+   (val == (_Q_LOCKED_VAL|_Q_PENDING_VAL)))
goto queue;
  
+		if (val == _Q_PENDING_VAL) {
+   /*
+* Pending bit is set, but not the lock bit.
+* Assuming that the pending bit holder is going to
+* set the lock bit and clear the pending bit soon,
+* it is better to wait than to exit at this point.
+*/
+   cpu_relax();
+   val = atomic_read(&lock->val);
+   continue;
+   }
+
new = _Q_LOCKED_VAL;
if (val == new)
new |= _Q_PENDING_VAL;


So, again, you just posted a new version without replying to the
previous discussion; so let me try again, what's wrong with the proposal
here:

   lkml.kernel.org/r/20140417163640.gt11...@twins.programming.kicks-ass.net




I thought I had answered you before; maybe the message was lost or the
answer was not complete. Anyway, I will try to respond to your question
again here.



Wouldn't something like:

while (atomic_read(&lock->val) == _Q_PENDING_VAL)
cpu_relax();

before the cmpxchg loop have gotten you all this?


That is not exactly the same. That loop will exit if other bits are set
or the pending bit is cleared. In that case, we would still need to do
the same check at the beginning of the for loop in order to avoid doing
an extra cmpxchg that is not necessary.
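
Concretely, here is a user-space model (C11 atomics) of the loop in the
hunk above, with the queueing path stubbed out. The constants mirror
the kernel's bit layout, but the function and its simplifications are
illustrative only; the comment marks where the placement of the check
matters.

#include <stdatomic.h>

#define _Q_LOCKED_VAL	1U
#define _Q_PENDING_VAL	(1U << 8)
#define _Q_TAIL_MASK	(~0U << 16)

static void pending_path_model(atomic_uint *lockval)
{
	unsigned int val, old, new;

	val = atomic_load(lockval);
	for (;;) {
		/* Queue if waiters exist or both pending and lock are set. */
		if ((val & _Q_TAIL_MASK) ||
		    (val == (_Q_LOCKED_VAL | _Q_PENDING_VAL)))
			return;			/* queueing path, not modelled */

		/*
		 * Pending set but lock free: keep spinning here instead of
		 * queueing.  Having this check inside the loop (rather than
		 * only once before it) means a transition back to this
		 * state never reaches the cmpxchg below.
		 */
		if (val == _Q_PENDING_VAL) {
			val = atomic_load(lockval);	/* cpu_relax() + re-read */
			continue;
		}

		new = _Q_LOCKED_VAL;
		if (val == new)
			new |= _Q_PENDING_VAL;

		old = val;
		if (atomic_compare_exchange_strong(lockval, &old, new))
			break;	/* claimed lock or became the pending waiter */
		val = old;			/* retry with the observed value */
	}
}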



I just tried this on my code and I cannot see a difference.


As I said before, I did see a difference with that change. I think it
depends on the CPU chip used for testing. I ran my test on a 10-core
Westmere-EX chip, running my microbenchmark on different pairs of cores
within the same chip. It produces results that vary from 779.5ms up to
1192ms. Without that patch, the lowest value I can get is still close
to 800ms, but the highest can be up to 1800ms or so. So I believe it is
just a matter of timing that you did not observe on your test machine.
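
For context, the kind of pairwise lock ping-pong microbenchmark
described here might look roughly like the sketch below. A pthread
spinlock stands in for the kernel qspinlock, and the core numbers,
iteration count and names are assumptions for illustration, not the
benchmark that was actually run.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS	(5 * 1000 * 1000L)

static pthread_spinlock_t lock;
static volatile unsigned long shared_counter;

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg)
{
	pin_to_cpu((int)(long)arg);

	for (long i = 0; i < ITERATIONS; i++) {
		pthread_spin_lock(&lock);
		shared_counter++;		/* tiny critical section */
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	struct timespec start, end;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);

	clock_gettime(CLOCK_MONOTONIC, &start);
	pthread_create(&t1, NULL, worker, (void *)0L);	/* first core of the pair */
	pthread_create(&t2, NULL, worker, (void *)2L);	/* second core of the pair */
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("elapsed: %.1f ms\n",
	       (end.tv_sec - start.tv_sec) * 1e3 +
	       (end.tv_nsec - start.tv_nsec) / 1e6);
	return 0;
}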


-Longman
