Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Tue, Mar 04, 2014 at 11:40:43PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote:
> > Peter, I was trying to implement the generic queue code exchange
> > code using cmpxchg as suggested by you. However, when I gathered
> > the performance data, the code performed worse than I expected at a
> > higher contention level. Below were the execution times of the
> > benchmark tool that I sent you:
>
> I'm just not seeing that; with test-4 modified to take the AMD
> compute units into account:

OK; I tried on a few larger machines and I can indeed see it there.

That said; our code doesn't differ that much. I can see why you're not
doing too well on the 2 CPU contention case: you've got one atomic op
too many in that path.

But given that you see a benefit even with 2 atomic ops (I had mixed
results on that) we can do the pending/waiter thing unconditionally
for NR_CPUS < 16k.

I also think I can do your full xchg thing without allowing lock
steals.

I'll try and do a full series tomorrow that starts with simple code
and builds on that, doing each optimization one by one.
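For reference, the 16k bound falls out of the 4-byte lock word: with a
byte for the lock, a byte (or bit) for pending and 2 bits selecting one
of 4 per-CPU queue nodes, 14 bits remain for the tail CPU, i.e. 2^14 =
16384 CPUs. A sketch of one such layout (the offsets here are
illustrative assumptions, not taken from either patch series):

/*
 * Possible 32-bit qspinlock word layout behind "NR_CPUS < 16k"
 * (offsets are an assumption for illustration):
 *
 *  bits  0- 7: locked byte
 *  bits  8-15: pending
 *  bits 16-17: tail index (which of the 4 per-CPU MCS nodes)
 *  bits 18-31: tail CPU + 1 (0 means no queue) => 2^14 CPUs max
 */
#define _Q_LOCKED_OFFSET	0
#define _Q_PENDING_OFFSET	8
#define _Q_TAIL_IDX_OFFSET	16
#define _Q_TAIL_IDX_MASK	0x3
#define _Q_TAIL_CPU_OFFSET	18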
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On 03/02/2014 08:16 AM, Oleg Nesterov wrote:
> On 02/26, Waiman Long wrote:
> > @@ -144,7 +317,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
> >  	int qlcode = atomic_read(&lock->qlcode);
> >
> >  	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
> > -		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
> > +		qlcode, code|_QSPINLOCK_LOCKED) == qlcode))
>
> Hmm. I didn't read the patch, but this change looks like an accidental
> typo...
>
> Oleg.

Thank you for catching that typo. That generic code wasn't compiled in,
so I missed it. I will fix that in the next version.

-Longman
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On 03/03/2014 12:43 PM, Peter Zijlstra wrote:
> Hi,
>
> Here are some numbers for my version -- also attached is the test
> code. I found that booting big machines is tediously slow so I lifted
> the whole lot to userspace.
>
> I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().
>
> The machines used are a 4 node (2 socket) AMD Interlagos, and a
> 2 node (2 socket) Intel Westmere-EP.
>
> AMD (ticket)             AMD (qspinlock + pending + opt)
>
> Local:                   Local:
>
>  1:    324.425530         1:    324.102142
>  2:  17141.324050         2:    620.185930
>  3:  52212.232343         3:  25242.574661
>  4:  93136.458314         4:  47982.037866
>  6: 167967.455965         6:  95345.011864
>  8: 245402.534869         8: 142412.451438
>
> 2 - nodes:               2 - nodes:
>
>  2:  12763.640956         2:   1879.460823
>  4:  94423.027123         4:  48278.719130
>  6: 167903.698361         6:  96747.767310
>  8: 257243.508294         8: 144672.846317
>
> 4 - nodes:               4 - nodes:
>
>  4:  82408.853603         4:  49820.323075
>  8: 260492.952355         8: 143538.264724
> 16: 630099.031148        16: 337796.553795
>
> Intel (ticket)           Intel (qspinlock + pending + opt)
>
> Local:                   Local:
>
>  1:     19.002249         1:     29.002844
>  2:   5093.275530         2:   1282.209519
>  3:  22300.859761         3:  22127.477388
>  4:  44929.922325         4:  44493.881832
>  6:  86338.755247         6:  86360.083940
>
> 2 - nodes:               2 - nodes:
>
>  2:   1509.193824         2:   1209.090219
>  4:  48154.495998         4:  48547.242379
>  8: 137946.787244         8: 141381.498125
>
> ---
>
> There are a few curious facts I found (assuming my test code is sane):
>
>  - Intel seems to be an order of magnitude faster on uncontended
>    LOCKed ops compared to AMD.
>
>  - On Intel the uncontended qspinlock fast path (cmpxchg) seems
>    slower than the uncontended ticket xadd -- although both are
>    plenty fast when compared to AMD.
>
>  - In general, replacing cmpxchg loops with unconditional atomic ops
>    doesn't seem to matter a whole lot when the thing is contended.
>
> Below is the (rather messy) qspinlock slow path code (the only thing
> that really differs between our versions). I'll try and slot your
> version in tomorrow.

It is curious to see that the qspinlock code offers a big benefit on
AMD machines, but not so much on Intel.

Anyway, I am working on a revised version of the patch that
incorporates some of your comments. I will also try to see if I can
get an AMD machine to run tests on.

-Longman
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Updated version; this includes numbers for my SNB desktop and Waiman's
variant.

Curiously Waiman's version seems consistently slower on 2 cross-node
CPUs, whereas my version seems to have a problem on SNB with 2 CPUs.

There's something weird with the ticket lock numbers; when I compile
the code with:

  gcc (Debian 4.7.2-5) 4.7.2

I get the first set; when I compile with:

  gcc (Ubuntu/Linaro 4.7.3-2ubuntu4) 4.7.3

I get the second set; afaict the other locks don't seem to have this
problem, but I only just noticed.

---

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, a 2 node
(2 socket) Intel Westmere-EP and my i7-2600K (SNB) desktop.

   (ticket)            (qspinlock + all)    (waiman)

AMD Interlagos

Local:

 1:    324.425530       1:    324.102142     1:    323.857834
 2:  17141.324050       2:    620.185930     2:    618.737681
 3:  52212.232343       3:  25242.574661     3:  24888.154161
 4:  93136.458314       4:  47982.037866     4:  48227.610976
 6: 167967.455965       6:  95345.011864     6:  94372.276116
 8: 245402.534869       8: 142412.451438     8: 140516.525791

 1:    324.071515
 2:    981.778516
 3:  24414.144262
 4:  50868.376667
 6:  99536.890504
 8: 151616.395779

2 - nodes:

 2:  12763.640956       2:   1879.460823     2:   2023.594014
 4:  94423.027123       4:  48278.719130     4:  48621.724929
 6: 167903.698361       6:  96747.767310     6:  95815.242461
 8: 257243.508294       8: 144672.846317     8: 143282.222038

 2:   1875.637053
 4:  50082.796058
 6: 107780.361523
 8: 163166.728218

4 - nodes:

 4:  82408.853603       4:  49820.323075     4:  50566.036473
 8: 260492.952355       8: 143538.264724     8: 143485.584624
16: 630099.031148      16: 337796.553795    16: 333419.421338

 4:  55409.896671
 8: 167340.905593
16: 443195.057052

Intel WSM-EP

Local:

 1:     19.002249       1:     29.002844     1:     28.979917
 2:   5093.275530       2:   1282.209519     2:   1236.624785
 3:  22300.859761       3:  22127.477388     3:  22336.241120
 4:  44929.922325       4:  44493.881832     4:  44832.450598
 6:  86338.755247       6:  86360.083940     6:  85808.603491

 1:     20.009974
 2:   1206.419074
 3:  22071.535000
 4:  44606.831373
 6:  86498.760774

2 - nodes:

 2:   1527.466159       2:   1227.051993     2:   1434.204666
 4:  46004.232179       4:  46450.787234     4:  46999.356429
 6:  89226.472057       6:  90124.984324     6:  90110.423115
 8: 137225.472406       8: 137909.184358     8: 137988.290401

Intel SNB

Local:

 1:     15.276759       1:     25.336807     1:     25.353041
 2:    714.621152       2:    843.240641     2:    711.281211
 3:  11339.104267       3:  11751.159167     3:  11684.286334
 4:  22648.387376       4:  23454.798068     4:  22903.498910

[Attachment: spinlocks.tar.bz2]
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Peter, I was trying to implement the generic queue code exchange code
using cmpxchg as suggested by you. However, when I gathered the
performance data, the code performed worse than I expected at a higher
contention level. Below were the execution times of the benchmark tool
that I sent you:

                             [xchg]        [cmpxchg]
  # of tasks  Ticket lock   Queue lock    Queue lock
  ----------  -----------   ----------    ----------
       1          135           135           135
       2          732          1315          1102
       3         1827          2372          2681
       4         2689          2934          5392
       5         3736          3658          7696
       6         4942          4434          9876
       7         6304          5176         11901
       8         7736          5955         14551

Below is the code that I used:

static inline u32
queue_code_xchg(struct qspinlock *lock, u32 *ocode, u32 ncode)
{
	while (true) {
		u32 qlcode = atomic_read(&lock->qlcode);

		if (qlcode == 0) {
			/*
			 * Try to get the lock
			 */
			if (atomic_cmpxchg(&lock->qlcode, 0,
					   _QSPINLOCK_LOCKED) == 0)
				return 1;
		} else if (qlcode & _QSPINLOCK_LOCKED) {
			*ocode = atomic_cmpxchg(&lock->qlcode, qlcode,
						ncode | _QSPINLOCK_LOCKED);
			if (*ocode == qlcode) {
				/* Clear lock bit before return */
				*ocode &= ~_QSPINLOCK_LOCKED;
				return 0;
			}
		}
		/*
		 * Wait if atomic_cmpxchg() fails or lock is temporarily
		 * free.
		 */
		arch_mutex_cpu_relax();
	}
}

My cmpxchg code is not optimal, and I can probably tune it to perform
better. Given the trend that I was seeing, however, I think I will
keep the current xchg code, but I will package it in an inline
function.

-Longman
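For contrast, a minimal sketch of the xchg()-based exchange Waiman
says he will keep (reconstructed from the discussion for illustration;
the actual patch differs and, in particular, the fixup is elided):

/*
 * Sketch only: plant our queue node code with one unconditional
 * xchg(). If the old value shows the lock was actually free, we now
 * own the lock but have also left a stale node code behind -- a real
 * version must fix that up, which is the "xchg() + fixup" Peter
 * complains about elsewhere in this thread.
 */
static inline u32 queue_code_xchg_sketch(struct qspinlock *lock,
					 u32 *ocode, u32 ncode)
{
	u32 old = atomic_xchg(&lock->qlcode, ncode | _QSPINLOCK_LOCKED);

	if (!(old & _QSPINLOCK_LOCKED))
		return 1;	/* lock was free: taken (fixup elided) */

	*ocode = old & ~_QSPINLOCK_LOCKED;	/* previous queue tail */
	return 0;
}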
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Tue, Mar 04, 2014 at 05:58:00PM +0100, Peter Zijlstra wrote:
> 2:  17141.324050       2:    620.185930     2:    618.737681

So I forgot that AMD has compute units that share L2:

root@interlagos:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
982.938839
1325.702905
root@interlagos:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
630.684313
999.119087
root@interlagos:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 1 ; $LOCK 0 2) | awk '/^total/ { print $2 }'
620.562791
1644.700639

Doing the same for Intel SMT, which shares L1:

root@westmere:~/spinlocks# export LOCK=./ticket ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
45.765302
1292.721827
root@westmere:~/spinlocks# export LOCK=./qspinlock-pending-opt2 ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
54.536890
1260.467527
root@westmere:~/spinlocks# export LOCK=./waiman ; ($LOCK 0 12 ; $LOCK 0 1) | awk '/^total/ { print $2 }'
65.944794
1230.522895
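Reading the runs above: each binary apparently takes two CPU numbers
and pins one spinning thread to each, so "0 1" lands on the two cores
of one Interlagos compute unit (shared L2) while "0 2" crosses units,
and on the Westmere "0 12" are the two SMT siblings of one core
(shared L1) while "0 1" are separate cores. A sketch of that pinning
(an assumption for illustration; not Peter's actual harness):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Pin the calling thread to a single CPU before hammering the lock. */
static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		exit(1);
	}
}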
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote:
> Peter, I was trying to implement the generic queue code exchange code
> using cmpxchg as suggested by you. However, when I gathered the
> performance data, the code performed worse than I expected at a
> higher contention level. Below were the execution times of the
> benchmark tool that I sent you:
>
>                              [xchg]        [cmpxchg]
>   # of tasks  Ticket lock   Queue lock    Queue lock
>   ----------  -----------   ----------    ----------
>        1          135           135           135
>        2          732          1315          1102
>        3         1827          2372          2681
>        4         2689          2934          5392
>        5         3736          3658          7696
>        6         4942          4434          9876
>        7         6304          5176         11901
>        8         7736          5955         14551

I'm just not seeing that; with test-4 modified to take the AMD compute
units into account:

root@interlagos:~/spinlocks# LOCK=./qspinlock-pending-opt ./test-4.sh ; LOCK=./qspinlock-pending-opt2 ./test-4.sh
 4:  50783.509653
 8: 146295.875715
16: 332942.964709
 4:  51033.341441
 8: 146320.656285
16: 332586.355194

And the difference between opt and opt2 is that opt2 replaces 2
cmpxchg loops with unconditional ops (xchg8 and xchg16).

And I'd think that 4 CPUs x 4 nodes would be heavy contention.

I'll have another poke tomorrow, including verifying the asm; need to
go sleep now.
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
Hi,

Here are some numbers for my version -- also attached is the test
code. I found that booting big machines is tediously slow so I lifted
the whole lot to userspace.

I measure the cycles spent in arch_spin_lock() + arch_spin_unlock().

The machines used are a 4 node (2 socket) AMD Interlagos, and a 2 node
(2 socket) Intel Westmere-EP.

AMD (ticket)             AMD (qspinlock + pending + opt)

Local:                   Local:

 1:    324.425530         1:    324.102142
 2:  17141.324050         2:    620.185930
 3:  52212.232343         3:  25242.574661
 4:  93136.458314         4:  47982.037866
 6: 167967.455965         6:  95345.011864
 8: 245402.534869         8: 142412.451438

2 - nodes:               2 - nodes:

 2:  12763.640956         2:   1879.460823
 4:  94423.027123         4:  48278.719130
 6: 167903.698361         6:  96747.767310
 8: 257243.508294         8: 144672.846317

4 - nodes:               4 - nodes:

 4:  82408.853603         4:  49820.323075
 8: 260492.952355         8: 143538.264724
16: 630099.031148        16: 337796.553795

Intel (ticket)           Intel (qspinlock + pending + opt)

Local:                   Local:

 1:     19.002249         1:     29.002844
 2:   5093.275530         2:   1282.209519
 3:  22300.859761         3:  22127.477388
 4:  44929.922325         4:  44493.881832
 6:  86338.755247         6:  86360.083940

2 - nodes:               2 - nodes:

 2:   1509.193824         2:   1209.090219
 4:  48154.495998         4:  48547.242379
 8: 137946.787244         8: 141381.498125

---

There are a few curious facts I found (assuming my test code is sane):

 - Intel seems to be an order of magnitude faster on uncontended
   LOCKed ops compared to AMD.

 - On Intel the uncontended qspinlock fast path (cmpxchg) seems slower
   than the uncontended ticket xadd -- although both are plenty fast
   when compared to AMD.

 - In general, replacing cmpxchg loops with unconditional atomic ops
   doesn't seem to matter a whole lot when the thing is contended.

Below is the (rather messy) qspinlock slow path code (the only thing
that really differs between our versions). I'll try and slot your
version in tomorrow.

---

/*
 * Exactly fills one cacheline on 64bit.
 */
static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);

static inline u32 encode_tail(int cpu, int idx)
{
	u32 code;

	code  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
	code |= idx << _Q_TAIL_IDX_OFFSET; /* assume idx < 4 */

	return code;
}

static inline struct mcs_spinlock *decode_tail(u32 code)
{
	int cpu = (code >> _Q_TAIL_CPU_OFFSET) - 1;
	int idx = (code >> _Q_TAIL_IDX_OFFSET) & _Q_TAIL_IDX_MASK;

	return per_cpu_ptr(&mcs_nodes[idx], cpu);
}

#define _QSPINLOCK_PENDING	(1U << _Q_PENDING_OFFSET)
#define _QSPINLOCK_MASK		(_QSPINLOCK_LOCKED | _QSPINLOCK_PENDING)

// PENDING - enables the pending bit logic
// OPT     - removes one atomic op at the cost of making pending a byte
// OPT2    - replaces some cmpxchg loops with unconditional atomic ops
//
// PENDING looks to be a win, even with 2 atomic ops on Intel, and a
// loss on AMD
// OPT is a full win
// OPT2 somehow doesn't seem to make much difference !?
//

/**
 * queue_spin_lock_slowpath - acquire the queue spinlock
 * @lock: Pointer to queue spinlock structure
 *
 * The states are (queue tail, pending bit, lock bit):
 *
 *              fast     :    slow                                  :    unlock
 *                       :                                          :
 * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
 *                       :       | ^--------.------.             /  :
 *                       :       v           \      \            |  :
 * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
 *                       :       | ^--'              |           |  :
 *                       :       v                   |           |  :
 * uncontended           :    (n,x,y) +--> (n,0,0) --'            |  :
 *   queue               :       | ^--'                           |  :
 *                       :       v                                |  :
 * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'   :
 *   queue               :         ^--'                             :
 */
void queue_spin_lock_slowpath(struct qspinlock *lock)
{
	struct mcs_spinlock *prev, *next, *node;
	u32 val, new, old, code;
	int idx;

#if PENDING
	/*
	 * trylock || pending
	 *
	 * 0,0,0 -> 0,0,1 ; trylock
	 * 0,0,1 -> 0,1,1 ; pending
	 */
	val = atomic_read(&lock->val);
#if !OPT2
	for (;;) {
		/*
		 * If we observe any contention; queue.
		 */
		if (val & ~_Q_LOCKED_MASK)
			goto queue;

		new = _QSPINLOCK_LOCKED;
		if (val == new)
			new |= _QSPINLOCK_PENDING;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;
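As a usage note for the tail encoding above: each CPU owns four MCS
nodes (presumably one per nesting context), the index selects among
them, and the CPU number is stored +1 so that a tail code of 0 can
mean "no queue". For example (illustrative values):

	u32 tail = encode_tail(3, 2);		/* CPU 3, node index 2 */
	struct mcs_spinlock *node = decode_tail(tail);
	/* node == per_cpu_ptr(&mcs_nodes[2], 3) */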
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On 02/26, Waiman Long wrote:
> @@ -144,7 +317,7 @@ static __always_inline int queue_spin_setlock(struct qspinlock *lock)
>  	int qlcode = atomic_read(&lock->qlcode);
>
>  	if (!(qlcode & _QSPINLOCK_LOCKED) && (atomic_cmpxchg(&lock->qlcode,
> -		qlcode, qlcode|_QSPINLOCK_LOCKED) == qlcode))
> +		qlcode, code|_QSPINLOCK_LOCKED) == qlcode))

Hmm. I didn't read the patch, but this change looks like an accidental
typo...

Oleg.
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote:
> > > +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
> > > +
> > > +	if (old == 0) {
> > > +		/*
> > > +		 * Got the lock, can clear the waiting bit now
> > > +		 */
> > > +		smp_u8_store_release(&qlock->wait, 0);
> >
> > So we just did an atomic op, and now you're trying to optimize this
> > write. Why do you need a whole byte for that?
> >
> > Surely a cmpxchg loop with the right atomic op can't be _that_ much
> > slower? It's far more readable and likely avoids that steal fail
> > below as well.
>
> At low contention level, atomic operations that require a lock prefix
> are the major contributor to the total execution times. I saw an
> estimate online that the time to execute a lock prefix instruction
> can easily be 50X longer than a regular instruction that can be
> pipelined. That is why I try to do it with as few lock prefix
> instructions as possible. If I have to do an atomic cmpxchg, it
> probably won't be faster than the regular qspinlock slowpath.

At low contention the cmpxchg won't have to be retried (much) so using
it won't be a problem and you get to have arbitrary atomic ops.

> Given that speed at low contention level, which is the common case,
> is important to get this patch accepted, I have to do what I can to
> make it run as fast as possible for this 2 contending task case.

What I'm saying is that you can do the whole thing with a single
cmpxchg. No extra ops needed. And at that point you don't need a whole
byte, you can use a single bit. That removes the whole NR_CPUS
dependent logic.

> > > +		/*
> > > +		 * Someone has steal the lock, so wait again
> > > +		 */
> > > +		goto try_again;
> >
> > That's just a fail.. steals should not ever be allowed. It's a fair
> > lock after all.
>
> The code is unfair, but this unfairness helps it to run faster than
> the ticket spinlock in this particular case. And the regular
> qspinlock slowpath is fair. A little bit of unfairness in this
> particular case helps its speed.

*groan*, no, unfairness not cool. ticket lock is absolutely fair; we
should preserve this.

BTW; can you share your benchmark thingy?
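A sketch of the "single cmpxchg, single bit" suggestion (illustrative
only; it assumes a pending bit next to the lock bit and is not code
from either series):

/*
 * One cmpxchg covers both transitions: free -> locked, and
 * locked -> locked+pending. On success with *old == 0 we own the
 * lock; with *old == _QSPINLOCK_LOCKED we are the pending waiter and
 * spin until the lock bit clears. No xchg() + fixup required.
 */
static inline int trylock_or_set_pending(struct qspinlock *lock, u32 *old)
{
	u32 val = atomic_read(&lock->val);
	u32 new;

	/* Pending already taken or queue tail set: go queue instead. */
	if (val & ~_QSPINLOCK_LOCKED)
		return 0;

	new = val ? (val | _QSPINLOCK_PENDING) : _QSPINLOCK_LOCKED;

	*old = atomic_cmpxchg(&lock->val, val, new);
	return *old == val;
}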
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On 02/28/2014 04:29 AM, Peter Zijlstra wrote:
> On Thu, Feb 27, 2014 at 03:42:19PM -0500, Waiman Long wrote:
> > > > +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
> > > > +
> > > > +	if (old == 0) {
> > > > +		/*
> > > > +		 * Got the lock, can clear the waiting bit now
> > > > +		 */
> > > > +		smp_u8_store_release(&qlock->wait, 0);
> > >
> > > So we just did an atomic op, and now you're trying to optimize
> > > this write. Why do you need a whole byte for that?
> > >
> > > Surely a cmpxchg loop with the right atomic op can't be _that_
> > > much slower? It's far more readable and likely avoids that steal
> > > fail below as well.
> >
> > At low contention level, atomic operations that require a lock
> > prefix are the major contributor to the total execution times. I
> > saw an estimate online that the time to execute a lock prefix
> > instruction can easily be 50X longer than a regular instruction
> > that can be pipelined. That is why I try to do it with as few lock
> > prefix instructions as possible. If I have to do an atomic
> > cmpxchg, it probably won't be faster than the regular qspinlock
> > slowpath.
>
> At low contention the cmpxchg won't have to be retried (much) so
> using it won't be a problem and you get to have arbitrary atomic ops.
>
> > Given that speed at low contention level, which is the common case,
> > is important to get this patch accepted, I have to do what I can to
> > make it run as fast as possible for this 2 contending task case.
>
> What I'm saying is that you can do the whole thing with a single
> cmpxchg. No extra ops needed. And at that point you don't need a
> whole byte, you can use a single bit. That removes the whole NR_CPUS
> dependent logic.

After modifying it to do a deterministic cmpxchg, the test run time of
2 contending tasks jumps up from 600ms (best case) to about 1700ms,
which is worse than the original qspinlock's 1300-1500ms. It is the
opportunistic nature of the xchg() code, which can combine multiple
steps of the deterministic atomic sequence into one, that saves time.
Without that, I would rather go back to the basic qspinlock queuing
sequence for 2 contending tasks. Please take a look at the performance
data in my patch 3 to see if the slowdowns at 2 and 3 contending tasks
are acceptable or not.

The reason why I need a whole byte for the lock bit is the simple
unlock code of assigning 0 to the lock byte by the lock holder.
Utilizing other bits in the low byte for other purposes will
complicate the unlock path and slow down the no-contention case.

> > > > +		/*
> > > > +		 * Someone has steal the lock, so wait again
> > > > +		 */
> > > > +		goto try_again;
> > >
> > > That's just a fail.. steals should not ever be allowed. It's a
> > > fair lock after all.
> >
> > The code is unfair, but this unfairness helps it to run faster
> > than the ticket spinlock in this particular case. And the regular
> > qspinlock slowpath is fair. A little bit of unfairness in this
> > particular case helps its speed.
>
> *groan*, no, unfairness not cool. ticket lock is absolutely fair; we
> should preserve this.

We can preserve that by removing patch 3.

> BTW; can you share your benchmark thingy?

I have attached the test program that I used to generate the timing
data for patch 3.

-Longman

[Attachment: locktest.tar.gz]
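The unlock path Waiman refers to, for illustration: with a whole byte
holding the lock, release is a single plain store (a sketch relying on
x86's ordered stores; the field names follow the quoted patch):

static inline void queue_spin_unlock(struct qspinlock *lock)
{
	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;

	/*
	 * No lock-prefixed instruction needed: on x86 stores have
	 * release semantics, so a compiler barrier plus a byte store
	 * suffices.
	 */
	barrier();
	ACCESS_ONCE(qlock->lock) = 0;
}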
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On Fri, Feb 28, 2014 at 08:25:24AM -0800, Linus Torvalds wrote:
> On Feb 28, 2014 1:30 AM, "Peter Zijlstra" <pet...@infradead.org> wrote:
> > At low contention the cmpxchg won't have to be retried (much) so
> > using it won't be a problem and you get to have arbitrary atomic
> > ops.
>
> Peter, the difference between an atomic op and *no* atomic op is
> huge.

I know, I'm just asking what the difference is between the xchg() --
an atomic op -- and a cmpxchg(), also an atomic op.

The xchg() makes the entire thing somewhat difficult. Needing to fix
up all kinds of states if we guessed wrong about what was in the
variables.

> And Waiman posted numbers for the optimization. Why do you argue with
> handwaving and against numbers?

I've asked for his benchmark..
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
> After modifying it to do a deterministic cmpxchg, the test run time
> of 2 contending tasks jumps up from 600ms (best case) to about
> 1700ms, which is worse than the original qspinlock's 1300-1500ms. It
> is the opportunistic nature of the xchg() code, which can combine
> multiple steps of the deterministic atomic sequence into one, that
> saves time. Without that, I would rather go back to the basic
> qspinlock queuing sequence for 2 contending tasks. Please take a
> look at the performance data in my patch 3 to see if the slowdowns
> at 2 and 3 contending tasks are acceptable or not.

Right; so I've gone back to a simple version (~200 lines) that's
fairly easy to comprehend (to me, since I wrote it). And I will now
try to see if I can find the same state transitions in your code.

I find your code somewhat hard to follow; mostly due to those xchg() +
fixup thingies. But I'll get there.

> The reason why I need a whole byte for the lock bit is the simple
> unlock code of assigning 0 to the lock byte by the lock holder.
> Utilizing other bits in the low byte for other purposes will
> complicate the unlock path and slow down the no-contention case.

Yeah, I get why you need a whole byte for the lock part; I was asking
if we really need another whole byte for the pending part.

So in my version I think I see an optimization case where this is
indeed useful and I can trade an atomic op for a write barrier, which
should be a big win.

It just wasn't at all clear (to me) from your code. (I also think the
optimization isn't x86 specific.)

> > > The code is unfair, but this unfairness helps it to run faster
> > > than the ticket spinlock in this particular case. And the
> > > regular qspinlock slowpath is fair. A little bit of unfairness
> > > in this particular case helps its speed.
> >
> > *groan*, no, unfairness not cool. ticket lock is absolutely fair;
> > we should preserve this.
>
> We can preserve that by removing patch 3.

I've got a version that does the pending thing and is still entirely
fair. I don't think the concept of the extra spinner is incompatible
with fairness.

> > BTW; can you share your benchmark thingy?
>
> I have attached the test program that I used to generate the timing
> data for patch 3.

Thanks, I'll have a play.
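What "trade an atomic op for a write barrier" could look like: if
pending occupies its own byte next to the lock byte, the pending
waiter's pending->locked handoff can be one plain 16-bit store instead
of a cmpxchg loop. A sketch (the union layout is an assumption for
little-endian x86, not code from either series):

union qsl_x86 {				/* assumed layout, little-endian */
	atomic_t val;
	struct {
		u8 locked;
		u8 pending;
	};
	u16 locked_pending;
};

/*
 * Called by the pending waiter once it observes the lock byte clear;
 * nobody else may modify these two bytes at that point, so a plain
 * (ordered on x86) halfword store both clears pending and sets locked.
 */
static inline void clear_pending_set_locked(union qsl_x86 *l)
{
	ACCESS_ONCE(l->locked_pending) = _QSPINLOCK_LOCKED;
}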
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
On 02/26/2014 11:20 AM, Peter Zijlstra wrote:
> You don't happen to have a proper state diagram for this thing do
> you?
>
> I suppose I'm going to have to make one; this is all getting a bit
> unwieldy, and those xchg() + fixup things are hard to read.

I don't have a state diagram on hand, but I will add more comments to
describe the 4 possible cases and how to handle them.

> On Wed, Feb 26, 2014 at 10:14:23AM -0500, Waiman Long wrote:
> > +static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
> > +{
> > +	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
> > +	u16 old;
> > +
> > +	/*
> > +	 * Fall into the quick spinning code path only if no one is waiting
> > +	 * or the lock is available.
> > +	 */
> > +	if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
> > +		     (qsval != _QSPINLOCK_WAITING)))
> > +		return 0;
> > +
> > +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
> > +
> > +	if (old == 0) {
> > +		/*
> > +		 * Got the lock, can clear the waiting bit now
> > +		 */
> > +		smp_u8_store_release(&qlock->wait, 0);
>
> So we just did an atomic op, and now you're trying to optimize this
> write. Why do you need a whole byte for that?
>
> Surely a cmpxchg loop with the right atomic op can't be _that_ much
> slower? It's far more readable and likely avoids that steal fail
> below as well.

At low contention level, atomic operations that require a lock prefix
are the major contributor to the total execution times. I saw an
estimate online that the time to execute a lock prefix instruction can
easily be 50X longer than a regular instruction that can be pipelined.
That is why I try to do it with as few lock prefix instructions as
possible. If I have to do an atomic cmpxchg, it probably won't be
faster than the regular qspinlock slowpath.

Given that speed at low contention level, which is the common case, is
important to get this patch accepted, I have to do what I can to make
it run as fast as possible for this 2 contending task case.

> > +		return 1;
> > +	} else if (old == _QSPINLOCK_LOCKED) {
> > +try_again:
> > +		/*
> > +		 * Wait until the lock byte is cleared to get the lock
> > +		 */
> > +		do {
> > +			cpu_relax();
> > +		} while (ACCESS_ONCE(qlock->lock));
> > +		/*
> > +		 * Set the lock bit & clear the waiting bit
> > +		 */
> > +		if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
> > +			    _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
> > +			return 1;
> > +		/*
> > +		 * Someone has steal the lock, so wait again
> > +		 */
> > +		goto try_again;
>
> That's just a fail.. steals should not ever be allowed. It's a fair
> lock after all.

The code is unfair, but this unfairness helps it to run faster than
the ticket spinlock in this particular case. And the regular qspinlock
slowpath is fair. A little bit of unfairness in this particular case
helps its speed.

> > +	} else if (old == _QSPINLOCK_WAITING) {
> > +		/*
> > +		 * Another task is already waiting while it steals the lock.
> > +		 * A bit of unfairness here won't change the big picture.
> > +		 * So just take the lock and return.
> > +		 */
> > +		return 1;
> > +	}
> > +	/*
> > +	 * Nothing need to be done if the old value is
> > +	 * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
> > +	 */
> > +	return 0;
> > +}

> > @@ -296,6 +478,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
> >  		return;
> >  	}
> >
> > +#ifdef queue_code_xchg
> > +	prev_qcode = queue_code_xchg(lock, my_qcode);
> > +#else
> >  	/*
> >  	 * Exchange current copy of the queue node code
> >  	 */
> > @@ -329,6 +514,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
> >  	} else
> >  		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
> >  	my_qcode &= ~_QSPINLOCK_LOCKED;
> > +#endif /* queue_code_xchg */
> >
> >  	if (prev_qcode) {
> >  		/*
>
> That's just horrible.. please just make the entire #else branch
> another version of that same queue_code_xchg() function.

OK, I will wrap it in another function.

Regards,
Longman
Re: [PATCH v5 3/8] qspinlock, x86: Add x86 specific optimization for 2 contending tasks
You don't happen to have a proper state diagram for this thing do you?

I suppose I'm going to have to make one; this is all getting a bit
unwieldy, and those xchg() + fixup things are hard to read.

On Wed, Feb 26, 2014 at 10:14:23AM -0500, Waiman Long wrote:
> +static inline int queue_spin_trylock_quick(struct qspinlock *lock, int qsval)
> +{
> +	union arch_qspinlock *qlock = (union arch_qspinlock *)lock;
> +	u16 old;
> +
> +	/*
> +	 * Fall into the quick spinning code path only if no one is waiting
> +	 * or the lock is available.
> +	 */
> +	if (unlikely((qsval != _QSPINLOCK_LOCKED) &&
> +		     (qsval != _QSPINLOCK_WAITING)))
> +		return 0;
> +
> +	old = xchg(&qlock->lock_wait, _QSPINLOCK_WAITING|_QSPINLOCK_LOCKED);
> +
> +	if (old == 0) {
> +		/*
> +		 * Got the lock, can clear the waiting bit now
> +		 */
> +		smp_u8_store_release(&qlock->wait, 0);

So we just did an atomic op, and now you're trying to optimize this
write. Why do you need a whole byte for that?

Surely a cmpxchg loop with the right atomic op can't be _that_ much
slower? It's far more readable and likely avoids that steal fail below
as well.

> +		return 1;
> +	} else if (old == _QSPINLOCK_LOCKED) {
> +try_again:
> +		/*
> +		 * Wait until the lock byte is cleared to get the lock
> +		 */
> +		do {
> +			cpu_relax();
> +		} while (ACCESS_ONCE(qlock->lock));
> +		/*
> +		 * Set the lock bit & clear the waiting bit
> +		 */
> +		if (cmpxchg(&qlock->lock_wait, _QSPINLOCK_WAITING,
> +			    _QSPINLOCK_LOCKED) == _QSPINLOCK_WAITING)
> +			return 1;
> +		/*
> +		 * Someone has steal the lock, so wait again
> +		 */
> +		goto try_again;

That's just a fail.. steals should not ever be allowed. It's a fair
lock after all.

> +	} else if (old == _QSPINLOCK_WAITING) {
> +		/*
> +		 * Another task is already waiting while it steals the lock.
> +		 * A bit of unfairness here won't change the big picture.
> +		 * So just take the lock and return.
> +		 */
> +		return 1;
> +	}
> +	/*
> +	 * Nothing need to be done if the old value is
> +	 * (_QSPINLOCK_WAITING | _QSPINLOCK_LOCKED).
> +	 */
> +	return 0;
> +}

> @@ -296,6 +478,9 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>  		return;
>  	}
>
> +#ifdef queue_code_xchg
> +	prev_qcode = queue_code_xchg(lock, my_qcode);
> +#else
>  	/*
>  	 * Exchange current copy of the queue node code
>  	 */
> @@ -329,6 +514,7 @@ void queue_spin_lock_slowpath(struct qspinlock *lock, int qsval)
>  	} else
>  		prev_qcode &= ~_QSPINLOCK_LOCKED;	/* Clear the lock bit */
>  	my_qcode &= ~_QSPINLOCK_LOCKED;
> +#endif /* queue_code_xchg */
>
>  	if (prev_qcode) {
>  		/*

That's just horrible.. please just make the entire #else branch
another version of that same queue_code_xchg() function.