Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-05 Thread Peter Zijlstra
On Tue, Mar 04, 2014 at 11:40:43PM +0100, Peter Zijlstra wrote:
 On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote:
  Peter,
  
  I was trying to implement the generic queue code exchange code using
  cmpxchg as suggested by you. However, when I gathered the performance
  data, the code performed worse than I expected at a higher contention
  level. Below were the execution time of the benchmark tool that I sent
  you:
 
 I'm just not seeing that; with test-4 modified to take the AMD compute
 units into account:

OK; I tried on a few larger machines and I can indeed see it there.

That said; our code doesn't differ that much. I see why you're not doing
too well on the 2 CPU contention: you've got one atomic op too many in
that path. But given that you see a benefit even with 2 atomic ops (I had
mixed results on that) we can do the pending/waiter thing unconditionally
for NR_CPUS < 16k.
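For reference, a sketch of the 32-bit lock word layout that constraint allows
(this is just a reading of the NR_CPUS < 16k remark, not the posted patch; the
field names and the little-endian split are assumptions):

/*
 * Hypothetical layout sketch, not the posted patch: with NR_CPUS < 16k
 * a (cpu + 1) tail value fits in 14 bits, plus 2 bits of per-cpu node
 * index, so the whole tail fits in 16 bits and the locked and pending
 * fields can each keep a byte of their own.
 */
union qspinlock_sketch {
        unsigned int val;
        struct {
                unsigned char  locked;  /* bits  0- 7 */
                unsigned char  pending; /* bits  8-15 */
                unsigned short tail;    /* bits 16-17: idx, 18-31: cpu + 1 */
        };
};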

I also think I can do your full xchg thing without allowing lock steals.

I'll try and do a full series tomorrow that starts with simple code and
builds on that, doing each optimization one by one.


Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-04 Thread Waiman Long

On 03/02/2014 08:16 AM, Oleg Nesterov wrote:

On 02/26, Waiman Long wrote:

@@ -144,7 +317,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
        int qlcode = atomic_read(&lock->qlcode);

        if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
-                       qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
+                       qlcode, code|_QSPINLOCK_LOCKED) == qlcode))

Hmm. didn't read the patch, but this change looks like accidental typo...

Oleg.



Thanks for catching that typo. That generic code wasn't compiled in,
so I missed it. I will fix it in the next version.


-Longman


Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-04 Thread Waiman Long

On 03/03/2014 12:43 PM, Peter Zijlstra wrote:

Hi,

Here are some numbers for my version -- also attached is the test code.
I found that booting big machines is tediously slow so I lifted the
whole lot to userspace.

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
(2 socket) Intel Westmere-EP.

AMD (ticket)                            AMD (qspinlock + pending + opt)

Local:                                  Local:

 1:    324.425530                        1:    324.102142
 2:  17141.324050                        2:    620.185930
 3:  52212.232343                        3:  25242.574661
 4:  93136.458314                        4:  47982.037866
 6: 167967.455965                        6:  95345.011864
 8: 245402.534869                        8: 142412.451438

2 - nodes:                              2 - nodes:

 2:  12763.640956                        2:   1879.460823
 4:  94423.027123                        4:  48278.719130
 6: 167903.698361                        6:  96747.767310
 8: 257243.508294                        8: 144672.846317

4 - nodes:                              4 - nodes:

 4:  82408.853603                        4:  49820.323075
 8: 260492.952355                        8: 143538.264724
16: 630099.031148                       16: 337796.553795



Intel (ticket)                          Intel (qspinlock + pending + opt)

Local:                                  Local:

 1:     19.002249                        1:     29.002844
 2:   5093.275530                        2:   1282.209519
 3:  22300.859761                        3:  22127.477388
 4:  44929.922325                        4:  44493.881832
 6:  86338.755247                        6:  86360.083940

2 - nodes:                              2 - nodes:

 2:   1509.193824                        2:   1209.090219
 4:  48154.495998                        4:  48547.242379
 8: 137946.787244                        8: 141381.498125

---

There are a few curious facts I found (assuming my test code is sane).

  - Intel seems to be an order of magnitude faster on uncontended LOCKed
ops compared to AMD

  - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
than the uncontended ticket xadd -- although both are plenty fast
when compared to AMD.

  - In general, replacing cmpxchg loops with unconditional atomic ops
doesn't seem to matter a whole lot when the thing is contended.

Below is the (rather messy) qspinlock slow path code (the only thing
that really differs between our versions).

I'll try and slot your version in tomorrow.

---



It is curious to see that the qspinlock code offers a big benefit on AMD 
machines, but not so much on Intel. Anyway, I am working on a revised 
version of the patch that includes some of your comments. I will also 
try to see if I can get an AMD machine to run tests on.


-Longman


Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-04 Thread Peter Zijlstra

Updated version, this includes numbers for my SNB desktop and Waiman's
variant.

Curiously, Waiman's version seems consistently slower on 2 cross-node
CPUs, whereas my version seems to have a problem on SNB with 2 CPUs.

There's something weird with the ticket lock numbers; when I compile
the code with:

  gcc (Debian 4.7.2-5) 4.7.2

I get the first set; when I compile with:

  gcc (Ubuntu/Linaro 4.7.3-2ubuntu4) 4.7.3

I get the second set; afaict the other locks don't seem to have this
problem, but I only just noticed.

---

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, a 2 node
(2 socket) Intel Westmere-EP and my i7-2600K (SNB) desktop.


                  (ticket)                (qspinlock + all)         (waiman)


AMD Interlagos

Local:

 1:    324.425530        1:    324.102142        1:    323.857834
 2:  17141.324050        2:    620.185930        2:    618.737681
 3:  52212.232343        3:  25242.574661        3:  24888.154161
 4:  93136.458314        4:  47982.037866        4:  48227.610976
 6: 167967.455965        6:  95345.011864        6:  94372.276116
 8: 245402.534869        8: 142412.451438        8: 140516.525791

 1:    324.071515
 2:    981.778516
 3:  24414.144262
 4:  50868.376667
 6:  99536.890504
 8: 151616.395779

2 - nodes:

 2:  12763.640956        2:   1879.460823        2:   2023.594014
 4:  94423.027123        4:  48278.719130        4:  48621.724929
 6: 167903.698361        6:  96747.767310        6:  95815.242461
 8: 257243.508294        8: 144672.846317        8: 143282.222038

 2:   1875.637053
 4:  50082.796058
 6: 107780.361523
 8: 163166.728218

4 - nodes:

 4:  82408.853603        4:  49820.323075        4:  50566.036473
 8: 260492.952355        8: 143538.264724        8: 143485.584624
16: 630099.031148       16: 337796.553795       16: 333419.421338

 4:  55409.896671
 8: 167340.905593
16: 443195.057052



Intel WSM-EP

Local:

 1:     19.002249        1:     29.002844        1:     28.979917
 2:   5093.275530        2:   1282.209519        2:   1236.624785
 3:  22300.859761        3:  22127.477388        3:  22336.241120
 4:  44929.922325        4:  44493.881832        4:  44832.450598
 6:  86338.755247        6:  86360.083940        6:  85808.603491

 1:     20.009974
 2:   1206.419074
 3:  22071.535000
 4:  44606.831373
 6:  86498.760774

2 - nodes:

 2:   1527.466159        2:   1227.051993        2:   1434.204666
 4:  46004.232179        4:  46450.787234        4:  46999.356429
 6:  89226.472057        6:  90124.984324        6:  90110.423115
 8: 137225.472406        8: 137909.184358        8: 137988.290401


Intel SNB

Local:

 1:     15.276759        1:     25.336807        1:     25.353041
 2:    714.621152        2:    843.240641        2:    711.281211
 3:  11339.104267        3:  11751.159167        3:  11684.286334
 4:  22648.387376        4:  23454.798068        4:  22903.498910

spinlocks.tar.bz2
Description: Binary data

Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-04 Thread Waiman Long

Peter,

I was trying to implement the generic queue code exchange code using
cmpxchg as suggested by you. However, when I gathered the performance
data, the code performed worse than I expected at a higher contention
level. Below were the execution time of the benchmark tool that I sent
you:

                                   [xchg]      [cmpxchg]
  # of tasks    Ticket lock     Queue lock     Queue lock
  ----------    -----------     ----------     ----------
       1            135             135            135
       2            732            1315           1102
       3           1827            2372           2681
       4           2689            2934           5392
       5           3736            3658           7696
       6           4942            4434           9876
       7           6304            5176          11901
       8           7736            5955          14551

Below is the code that I used:

static inline u32 queue_code_xchg(struct qspinlock *lock, u32 *ocode,
                                  u32 ncode)
{
        while (true) {
                u32 qlcode = atomic_read(&lock->qlcode);

                if (qlcode == 0) {
                        /*
                         * Try to get the lock
                         */
                        if (atomic_cmpxchg(&lock->qlcode, 0,
                                           _QSPINLOCK_LOCKED) == 0)
                                return 1;
                } else if (qlcode & _QSPINLOCK_LOCKED) {
                        *ocode = atomic_cmpxchg(&lock->qlcode, qlcode,
                                                ncode | _QSPINLOCK_LOCKED);
                        if (*ocode == qlcode) {
                                /* Clear lock bit before return */
                                *ocode &= ~_QSPINLOCK_LOCKED;
                                return 0;
                        }
                }
                /*
                 * Wait if atomic_cmpxchg() fails or lock is temporarily free.
                 */
                arch_mutex_cpu_relax();
        }
}

My cmpxchg code is not optimal, and I can probably tune the code to
make it perform better. Given the trend that I was seeing, however,
I think I will keep the current xchg code, but I will package it in
an inline function.
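To make the comparison concrete, here is a small userspace illustration of the
two atomic patterns being compared -- one unconditional exchange versus a
read-then-compare-exchange loop that has to retry when it races. It only
illustrates the pattern and the LOCKed-op count, not the real lock semantics;
the names and the use of GCC builtins are illustrative only:

#include <stdint.h>

/* xchg pattern: exactly one LOCKed instruction, never retries. */
static inline uint32_t code_xchg(uint32_t *word, uint32_t ncode)
{
        return __atomic_exchange_n(word, ncode, __ATOMIC_ACQ_REL);
}

/* cmpxchg pattern: read, then compare-and-swap; it must retry whenever
 * another CPU changed the word in between, which is exactly the
 * higher-contention case measured above. */
static inline uint32_t code_cmpxchg(uint32_t *word, uint32_t ncode)
{
        uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);

        while (!__atomic_compare_exchange_n(word, &old, ncode, 0,
                                            __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
                ;       /* 'old' was refreshed by the failed CAS; try again */
        return old;
}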

-Longman



Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-04 Thread Peter Zijlstra
On Tue, Mar 04, 2014 at 05:58:00PM +0100, Peter Zijlstra wrote:
  2:  17141.324050        2:    620.185930        2:    618.737681

So I forgot that AMD has compute units that share L2:

root@interlagos:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
982.938839
1325.702905
root@interlagos:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
630.684313
999.119087
root@interlagos:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
620.562791
1644.700639


Doing the same for Intel SMT, which shares L1:


root@westmere:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
45.765302
1292.721827
root@westmere:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
54.536890
1260.467527
root@westmere:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
65.944794
1230.522895






Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-04 Thread Peter Zijlstra
On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote:
 Peter,
 
 I was trying to implement the generic queue code exchange code using
 cmpxchg as suggested by you. However, when I gathered the performance
 data, the code performed worse than I expected at a higher contention
 level. Below were the execution time of the benchmark tool that I sent
 you:
 
                                    [xchg]      [cmpxchg]
   # of tasks    Ticket lock     Queue lock     Queue lock
   ----------    -----------     ----------     ----------
        1            135             135            135
        2            732            1315           1102
        3           1827            2372           2681
        4           2689            2934           5392
        5           3736            3658           7696
        6           4942            4434           9876
        7           6304            5176          11901
        8           7736            5955          14551
 

I'm just not seeing that; with test-4 modified to take the AMD compute
units into account:

root@interlagos:~/spinlocks# LOCK=./qspinlock-pending-opt ./test-4.sh ; LOCK=./qspinlock-pending-opt2 ./test-4.sh
 4:  50783.509653
 8: 146295.875715
16: 332942.964709
 4:  51033.341441
 8: 146320.656285
16: 332586.355194

And the difference between opt and opt2 is that opt2 replaces 2 cmpxchg
loops with unconditional ops (xchg8 and xchg16).
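Roughly what that replacement looks like for the tail half-word, as a sketch
with hypothetical names rather than the attached code: instead of a cmpxchg
loop over the whole lock word, the tail is swapped in with one unconditional
16-bit exchange (assuming a little-endian 32-bit word with the tail in the
upper 16 bits).

#include <stdint.h>

/* Sketch only, not the attached code: one LOCKed instruction, no retry loop. */
static inline uint32_t xchg_tail_sketch(uint32_t *lockword, uint32_t tail)
{
        uint16_t *tailp = (uint16_t *)lockword + 1;     /* upper half-word */

        return (uint32_t)__atomic_exchange_n(tailp, (uint16_t)(tail >> 16),
                                             __ATOMIC_ACQ_REL) << 16;
}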

And I'd think that 4 CPUs x 4 Nodes would be heavy contention.

I'll have another poke tomorrow, including verifying the asm; need to
go sleep now.


Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-03 Thread Peter Zijlstra
Hi,

Here are some numbers for my version -- also attached is the test code.
I found that booting big machines is tediously slow so I lifted the
whole lot to userspace.

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
(2 socket) Intel Westmere-EP.

AMD (ticket)                            AMD (qspinlock + pending + opt)

Local:                                  Local:

 1:    324.425530                        1:    324.102142
 2:  17141.324050                        2:    620.185930
 3:  52212.232343                        3:  25242.574661
 4:  93136.458314                        4:  47982.037866
 6: 167967.455965                        6:  95345.011864
 8: 245402.534869                        8: 142412.451438

2 - nodes:                              2 - nodes:

 2:  12763.640956                        2:   1879.460823
 4:  94423.027123                        4:  48278.719130
 6: 167903.698361                        6:  96747.767310
 8: 257243.508294                        8: 144672.846317

4 - nodes:                              4 - nodes:

 4:  82408.853603                        4:  49820.323075
 8: 260492.952355                        8: 143538.264724
16: 630099.031148                       16: 337796.553795



Intel (ticket)                          Intel (qspinlock + pending + opt)

Local:                                  Local:

 1:     19.002249                        1:     29.002844
 2:   5093.275530                        2:   1282.209519
 3:  22300.859761                        3:  22127.477388
 4:  44929.922325                        4:  44493.881832
 6:  86338.755247                        6:  86360.083940

2 - nodes:                              2 - nodes:

 2:   1509.193824                        2:   1209.090219
 4:  48154.495998                        4:  48547.242379
 8: 137946.787244                        8: 141381.498125

---

There are a few curious facts I found (assuming my test code is sane).

 - Intel seems to be an order of magnitude faster on uncontended LOCKed
   ops compared to AMD

 - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
   than the uncontended ticket xadd -- although both are plenty fast
   when compared to AMD (the two fast paths are sketched below).

 - In general, replacing cmpxchg loops with unconditional atomic ops
   doesn't seem to matter a whole lot when the thing is contended.
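The two uncontended fast paths compared in the second point above are,
roughly, the following; a userspace-flavoured sketch with hypothetical names,
not the attached test code:

#include <stdint.h>

/* Ticket fast path: a single unconditional xadd grabs a ticket, then we
 * spin (not at all when uncontended) until the owner field matches it.
 * Layout assumption: low 16 bits = owner, high 16 bits = next ticket. */
static inline void ticket_lock_sketch(uint32_t *tickets)
{
        uint32_t me = __atomic_fetch_add(tickets, 1U << 16,
                                         __ATOMIC_ACQUIRE) >> 16;

        while ((uint16_t)__atomic_load_n(tickets, __ATOMIC_ACQUIRE) != me)
                ;       /* wait for our ticket to come up */
}

/* qspinlock fast path: a single cmpxchg from 0 to LOCKED; anything else
 * drops into a slow path like the one below. */
static inline int qspin_trylock_sketch(uint32_t *val)
{
        uint32_t expect = 0;

        return __atomic_compare_exchange_n(val, &expect, 1 /* LOCKED */, 0,
                                           __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}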

Below is the (rather messy) qspinlock slow path code (the only thing
that really differs between our versions).

I'll try and slot your version in tomorrow.

---


/*
 * Exactly fills one cacheline on 64bit.
 */
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);

static inline u32 encode_tail(int cpu, int idx)
{
        u32 code;

        code  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
        code |= idx << _Q_TAIL_IDX_OFFSET; /* assume < 4 */

        return code;
}

static inline struct mcs_spinlock *decode_tail(u32 code)
{
        int cpu = (code >> _Q_TAIL_CPU_OFFSET) - 1;
        int idx = (code >> _Q_TAIL_IDX_OFFSET) & _Q_TAIL_IDX_MASK;

        return per_cpu_ptr(&mcs_nodes[idx], cpu);
}

#define _QSPINLOCK_PENDING  (1U << _Q_PENDING_OFFSET)
#define _QSPINLOCK_MASK     (_QSPINLOCK_LOCKED | _QSPINLOCK_PENDING)

// PENDING - enables the pending bit logic
// OPT     - removes one atomic op at the cost of making pending a byte
// OPT2    - replaces some cmpxchg loops with unconditional atomic ops
//
// PENDING looks to be a win, even with 2 atomic ops on Intel, and a loss on AMD
// OPT is a full win
// OPT2 somehow doesn't seem to make much difference !?
//

/**
 * queue_spin_lock_slowpath - acquire the queue spinlock
 * @lock: Pointer to queue spinlock structure
 *
 *              fast     :    slow                                  :    unlock
 *                       :                                          :
 * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
 *                       :       | ^--------.------.             /  :
 *                       :       v           \      \            |  :
 * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
 *                       :       | ^--'              |           |  :
 *                       :       v                   |           |  :
 * uncontended           :    (n,x,y) +--> (n,0,0) --'            |  :
 *   queue               :       | ^--'                           |  :
 *                       :       v                                |  :
 * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'   :
 *   queue               :                                          :
 *
 */
void queue_spin_lock_slowpath(struct qspinlock *lock)
{
        struct mcs_spinlock *prev, *next, *node;
        u32 val, new, old, code;
        int idx;

#if PENDING
        /*
         * trylock || pending
         *
         * 0,0,0 -> 0,0,1 ; trylock
         * 0,0,1 -> 0,1,1 ; pending
         */
        val = atomic_read(&lock->val);
#if !OPT2
        for (;;) {
                /*
                 * If we observe any contention; queue.
                 */
                if (val & ~_Q_LOCKED_MASK)
                        goto queue;

                new = _QSPINLOCK_LOCKED;
                if (val == new)
                        new |= _QSPINLOCK_PENDING;

                old = atomic_cmpxchg(&lock->val, val, new);
                if (old == val)
                        break;

   

Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-03-02 Thread Oleg Nesterov
On 02/26, Waiman Long wrote:

 @@ -144,7 +317,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
        int qlcode = atomic_read(&lock->qlcode);

        if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
 -                      qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
 +                      qlcode, code|_QSPINLOCK_LOCKED) == qlcode))

Hmm. didn't read the patch, but this change looks like accidental typo...

Oleg.



Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-02-28 Thread Peter Zijlstra
On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote:
 +   old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
 +
 +   if (old == 0) {
 +       /*
 +        * Got the lock, can clear the waiting bit now
 +        */
 +       smp_u8_store_release(&qlock->wait, 0);
 
 So we just did an atomic op, and now you're trying to optimize this
 write. Why do you need a whole byte for that?
 
 Surely a cmpxchg loop with the right atomic op can't be _that_ much
 slower? Its far more readable and likely avoids that steal fail below as
 well.
 
 At low contention level, atomic operations that require a lock prefix are
 the major contributor to the total execution times. I saw estimate online
 that the time to execute a lock prefix instruction can easily be 50X longer
 than a regular instruction that can be pipelined. That is why I try to do it
 with as few lock prefix instructions as possible. If I have to do an atomic
 cmpxchg, it probably won't be faster than the regular qspinlock slowpath.

At low contention the cmpxchg won't have to be retried (much) so using
it won't be a problem and you get to have arbitrary atomic ops.

 Given that speed at low contention level which is the common case is
 important to get this patch accepted, I have to do what I can to make it run
 as fast as possible for this 2 contending task case.

What I'm saying is that you can do the whole thing with a single
cmpxchg. No extra ops needed. And at that point you don't need a whole
byte, you can use a single bit.

that removes the whole NR_CPUS dependent logic.

 +   /*
 +* Someone has steal the lock, so wait again
 +*/
 +   goto try_again;

 That's just a fail.. steals should not ever be allowed. It's a fair lock
 after all.
 
 The code is unfair, but this unfairness helps it to run faster than ticket
 spinlock in this particular case. And the regular qspinlock slowpath is
 fair. A little bit of unfairness in this particular case helps its speed.

*groan*, no, unfairness not cool. ticket lock is absolutely fair; we
should preserve this.

BTW; can you share your benchmark thingy? 


Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-02-28 Thread Waiman Long

On 02/28/2014 04:29 AM, Peter Zijlstra wrote:

On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote:

+   old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
+
+   if (old == 0) {
+       /*
+        * Got the lock, can clear the waiting bit now
+        */
+       smp_u8_store_release(&qlock->wait, 0);

So we just did an atomic op, and now you're trying to optimize this
write. Why do you need a whole byte for that?

Surely a cmpxchg loop with the right atomic op can't be _that_ much
slower? It's far more readable and likely avoids that steal fail below as
well.

At low contention level, atomic operations that require a lock prefix are
the major contributor to the total execution times. I saw estimate online
that the time to execute a lock prefix instruction can easily be 50X longer
than a regular instruction that can be pipelined. That is why I try to do it
with as few lock prefix instructions as possible. If I have to do an atomic
cmpxchg, it probably won't be faster than the regular qspinlock slowpath.

At low contention the cmpxchg won't have to be retried (much) so using
it won't be a problem and you get to have arbitrary atomic ops.


Given that speed at low contention level which is the common case is
important to get this patch accepted, I have to do what I can to make it run
as fast as possible for this 2 contending task case.

What I'm saying is that you can do the whole thing with a single
cmpxchg. No extra ops needed. And at that point you don't need a whole
byte, you can use a single bit.

that removes the whole NR_CPUS dependent logic.


After modifying it to do a deterministic cmpxchg, the test run time of 2 
contending tasks jumps up from 600ms (best case) to about 1700ms which 
was worse than the original qspinlock's 1300-1500ms. It is the 
opportunistic nature of the xchg() code that can potentially combine 
multiple steps in the deterministic atomic sequence which can save 
time. Without that, I would rather prefer going back to the basic 
qspinlock queuing sequence for 2 contending tasks.


Please take a look at the performance data in my patch 3 to see if the 
slowdown at 2 and 3 contending tasks are acceptable or not.


The reason why I need a whole byte for the lock bit is the simple unlock 
code of assigning 0 to the lock byte by the lock holder. Utilizing other 
bits in the low byte for other purposes will complicate the unlock path 
and slow down the no-contention case.
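For illustration, the unlock path being referred to is essentially one plain
byte store with no LOCK prefix, which only works because the whole low byte
belongs to the lock holder. A userspace-flavoured sketch, assuming the lock
byte is byte 0 of the word; this is not the patch itself:

#include <stdint.h>

/* Sketch of the byte-wide unlock: the holder owns the entire lock byte,
 * so releasing it is a single ordinary release store of 0 -- no atomic
 * read-modify-write.  If other state shared that byte, the unlock would
 * need a (slower) atomic op to avoid clobbering it. */
static inline void queue_spin_unlock_sketch(uint8_t *lock_byte)
{
        __atomic_store_n(lock_byte, 0, __ATOMIC_RELEASE);
}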



+   /*
+* Someone has steal the lock, so wait again
+*/
+   goto try_again;

That's just a fail.. steals should not ever be allowed. It's a fair lock
after all.

The code is unfair, but this unfairness helps it to run faster than ticket
spinlock in this particular case. And the regular qspinlock slowpath is
fair. A little bit of unfairness in this particular case helps its speed.

*groan*, no, unfairness not cool. ticket lock is absolutely fair; we
should preserve this.


We can preserve that by removing patch 3.


BTW; can you share your benchmark thingy?


I have attached the test program that I used to generate the timing data 
for patch 3.


-Longman




locktest.tar.gz
Description: GNU Zip compressed data

Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-02-28 Thread Peter Zijlstra
On Fri, Feb 28, 2014 at 08:25:24AM -0800, Linus Torvalds wrote:
 On Feb 28, 2014 1:30 AM, Peter Zijlstra pet...@infradead.org wrote:
 
  At low contention the cmpxchg won't have to be retried (much) so using
  it won't be a problem and you get to have arbitrary atomic ops.
 
 Peter, the difference between an atomic op and *no* atomic op is huge.

I know, I'm just asking what the difference is between the xchg() -- an
atomic op -- and a cmpxchg(), also an atomic op.

The xchg() makes the entire thing somewhat difficult. Needing to fixup
all kinds of states if we guessed wrong about what was in the variables.

 And Waiman posted numbers for the optimization. Why do you argue with
 handwaving and against numbers?

I've asked for his benchmark.. 


Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-02-28 Thread Peter Zijlstra


 After modifying it to do a deterministic cmpxchg, the test run time of 2
 contending tasks jumps up from 600ms (best case) to about 1700ms which was
 worse than the original qspinlock's 1300-1500ms. It is the opportunistic
 nature of the xchg() code that can potentially combine multiple steps in the
 deterministic atomic sequence which can saves time. Without that, I would
 rather prefer going back to the basic qspinlock queuing sequence for 2
 contending tasks.
 
 Please take a look at the performance data in my patch 3 to see if the
 slowdown at 2 and 3 contending tasks are acceptable or not.

Right; so I've gone back to a simple version (~200 lines) that's fairly
easy to comprehend (to me, since I wrote it). And will now try to see if
I can find the same state transitions in your code.

I find your code somewhat hard to follow; mostly due to those xchg() +
fixup thingies. But I'll get there.

 The reason why I need a whole byte for the lock bit is the simple unlock
 code of assigning 0 to the lock byte by the lock holder. Utilizing other
 bits in the low byte for other purposes will complicate the unlock path
 and slow down the no-contention case.

Yeah, I get why you need a whole byte for the lock part, I was asking if
we really need another whole byte for the pending part.

So in my version I think I see an optimization case where this is indeed
useful and I can trade an atomic op for a write barrier, which should be
a big win.

It just wasn't at all clear (to me) from your code.

(I also think the optimization isn't x86 specific)
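A sketch of the trade being described, as read here rather than taken from the
posted code: once pending has a byte of its own, the single pending waiter can
claim the lock with one plain 16-bit store (locked = 1, pending = 0) after
seeing the lock byte go clear, instead of another LOCK-prefixed operation.
The field names and the little-endian layout are assumptions:

#include <stdint.h>

/* Hypothetical little-endian layout: byte 0 = locked, byte 1 = pending. */
union qsl_sketch {
        uint32_t val;
        struct {
                uint8_t  locked;
                uint8_t  pending;
                uint16_t tail;
        };
        uint16_t locked_pending;        /* overlays the locked + pending bytes */
};

static inline void pending_to_locked_sketch(union qsl_sketch *l)
{
        /* Wait for the current owner; the acquire load also keeps the
         * store below from being reordered ahead of it. */
        while (__atomic_load_n(&l->locked, __ATOMIC_ACQUIRE))
                ;

        /* We are the only pending waiter, so we own both bytes: set
         * locked = 1 and clear pending in one plain store, no LOCK prefix. */
        __atomic_store_n(&l->locked_pending, 1, __ATOMIC_RELAXED);
}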

 The code is unfair, but this unfairness helps it to run faster than ticket
 spinlock in this particular case. And the regular qspinlock slowpath is
 fair. A little bit of unfairness in this particular case helps its speed.

 *groan*, no, unfairness not cool. ticket lock is absolutely fair; we
 should preserve this.
 
 We can preserve that by removing patch 3.

I've got a version that does the pending thing and still is entirely
fair.

I don't think the concept of the extra spinner is incompatible with
fairness.

 BTW; can you share your benchmark thingy?
 
 I have attached the test program that I used to generate the timing data for
 patch 3.

Thanks, I'll have a play.


Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-02-27 Thread Waiman Long

On 02/26/2014 11:20 AM, Peter Zijlstra wrote:

You don't happen to have a proper state diagram for this thing do you?

I suppose I'm going to have to make one; this is all getting a bit
unwieldy, and those xchg() + fixup things are hard to read.


I don't have a state diagram on hand, but I will add more comments to 
describe the 4 possible cases and how to handle them.
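For reference, the four cases are the possible prior values returned by the
xchg() in the quoted code below; this is only a restatement of that logic,
not new code:

/*
 * old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED)
 * can return one of four values:
 *
 * 0:
 *      the lock was free and nobody was waiting -- we now hold it;
 *      clear the waiting bit we also set.
 * _QSPINLOCK_LOCKED:
 *      lock held, nobody waiting -- we become the waiter; spin until the
 *      lock byte clears, then flip WAITING into LOCKED with a cmpxchg.
 * _QSPINLOCK_WAITING:
 *      the lock was free but another task was already waiting -- our xchg
 *      set the lock bit, so we take the lock (the unfair case) and return.
 * _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED:
 *      both a holder and a waiter already exist -- nothing to do here;
 *      return 0 and fall back to the regular slow path.
 */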




On Wed, Feb 26, 2014 at 10:14:23AM -0500, Waiman Long wrote:

+static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
+{
+   union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
+   u16  old;
+
+   /*
+* Fall into the quick spinning code path only if no one is waiting
+* or the lock is available.
+*/
+   if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
+                (qsval != _QSPINLOCK_WAITING)))
+   return 0;
+
+   old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
+
+   if (old == 0) {
+   /*
+* Got the lock, can clear the waiting bit now
+*/
+   smp_u8_store_release(&qlock->wait, 0);


So we just did an atomic op, and now you're trying to optimize this
write. Why do you need a whole byte for that?

Surely a cmpxchg loop with the right atomic op can't be _that_ much
slower? It's far more readable and likely avoids that steal fail below as 
well.


At low contention level, atomic operations that require a lock prefix 
are the major contributor to the total execution times. I saw estimate 
online that the time to execute a lock prefix instruction can easily be 
50X longer than a regular instruction that can be pipelined. That is why 
I try to do it with as few lock prefix instructions as possible. If I 
have to do an atomic cmpxchg, it probably won't be faster than the 
regular qspinlock slowpath.


Given that speed at low contention level which is the common case is 
important to get this patch accepted, I have to do what I can to make it 
run as fast as possible for this 2 contending task case.



+   return 1;
+   } else if (old == _QSPINLOCK_LOCKED) {
+try_again:
+   /*
+* Wait until the lock byte is cleared to get the lock
+*/
+   do {
+   cpu_relax();
+   } while (ACCESS_ONCE(qlock->lock));
+   /*
+    * Set the lock bit & clear the waiting bit
+    */
+   if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
+  _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
+   return 1;
+   /*
+* Someone has steal the lock, so wait again
+*/
+   goto try_again;

That's just a fail.. steals should not ever be allowed. It's a fair lock
after all.


The code is unfair, but this unfairness helps it to run faster than 
ticket spinlock in this particular case. And the regular qspinlock 
slowpath is fair. A little bit of unfairness in this particular case 
helps its speed.



+   } else if (old == _QSPINLOCK_WAITING) {
+   /*
+* Another task is already waiting while it steals the lock.
+* A bit of unfairness here won't change the big picture.
+* So just take the lock and return.
+*/
+   return 1;
+   }
+   /*
+* Nothing need to be done if the old value is
+* (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
+*/
+   return 0;
+}





@@ -296,6 +478,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int 
qsval)
return;
}

+#ifdef queue_code_xchg
+   prev_qcode = queue_code_xchg(lock, my_qcode);
+#else
/*
 * Exchange current copy of the queue node code
 */
@@ -329,6 +514,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int 
qsval)
} else
prev_qcode &= ~_QSPINLOCK_LOCKED;   /* Clear the lock bit */
my_qcode &= ~_QSPINLOCK_LOCKED;
+#endif /* queue_code_xchg */

if (prev_qcode) {
/*

That's just horrible.. please just make the entire #else branch another
version of that same queue_code_xchg() function.


OK, I will wrap it in another function.

Regards,
Longman


Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks

2014-02-26 Thread Peter Zijlstra

You don't happen to have a proper state diagram for this thing do you?

I suppose I'm going to have to make one; this is all getting a bit
unwieldy, and those xchg() + fixup things are hard to read.

On Wed, Feb 26, 2014 at 10:14:23AM -0500, Waiman Long wrote:
 +static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
 +{
 + union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
 + u16  old;
 +
 + /*
 +  * Fall into the quick spinning code path only if no one is waiting
 +  * or the lock is available.
 +  */
 +	if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
 +	             (qsval != _QSPINLOCK_WAITING)))
 + return 0;
 +
 +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
 +
 + if (old == 0) {
 + /*
 +  * Got the lock, can clear the waiting bit now
 +  */
 +	smp_u8_store_release(&qlock->wait, 0);


So we just did an atomic op, and now you're trying to optimize this
write. Why do you need a whole byte for that?

Surely a cmpxchg loop with the right atomic op can't be _that_ much
slower? It's far more readable and likely avoids that steal fail below as
well.

 + return 1;
 + } else if (old == _QSPINLOCK_LOCKED) {
 +try_again:
 + /*
 +  * Wait until the lock byte is cleared to get the lock
 +  */
 + do {
 + cpu_relax();
 +	} while (ACCESS_ONCE(qlock->lock));
 +	/*
 +	 * Set the lock bit & clear the waiting bit
 +	 */
 +	if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
 +_QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
 + return 1;
 + /*
 +  * Someone has steal the lock, so wait again
 +  */
 + goto try_again;

That's just a fail.. steals should not ever be allowed. It's a fair lock
after all.

 + } else if (old == _QSPINLOCK_WAITING) {
 + /*
 +  * Another task is already waiting while it steals the lock.
 +  * A bit of unfairness here won't change the big picture.
 +  * So just take the lock and return.
 +  */
 + return 1;
 + }
 + /*
 +  * Nothing need to be done if the old value is
 +  * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
 +  */
 + return 0;
 +}




 @@ -296,6 +478,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int 
 qsval)
   return;
   }
  
 +#ifdef queue_code_xchg
 + prev_qcode = queue_code_xchg(lock, my_qcode);
 +#else
   /*
* Exchange current copy of the queue node code
*/
 @@ -329,6 +514,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int 
 qsval)
   } else
  prev_qcode &= ~_QSPINLOCK_LOCKED;   /* Clear the lock bit */
  my_qcode &= ~_QSPINLOCK_LOCKED;
 +#endif /* queue_code_xchg */
  
   if (prev_qcode) {
   /*

That's just horrible.. please just make the entire #else branch another
version of that same queue_code_xchg() function.