Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Mon, Jul 01, 2013 at 02:49:34PM +0530, Raghavendra KT wrote: > On Sun, Jun 23, 2013 at 11:23 PM, Raghavendra KT > wrote: > > > > > > On Wed, Jun 12, 2013 at 9:10 PM, Paul E. McKenney > > wrote: > >> > >> Breaking up locks is better than implementing high-contention locks, but > >> if we must have high-contention locks, why not make them automatically > >> switch between light-weight ticket locks at low contention and queued > >> locks at high contention? After all, this would remove the need for > >> the developer to predict which locks will be highly contended. > >> > >> This commit allows ticket locks to automatically switch between pure > >> ticketlock and queued-lock operation as needed. If too many CPUs are > >> spinning on a given ticket lock, a queue structure will be allocated > >> and the lock will switch to queued-lock operation. When the lock becomes > >> free, it will switch back into ticketlock operation. The low-order bit > >> of the head counter is used to indicate that the lock is in queued mode, > >> which forces an unconditional mismatch between the head and tail counters. > >> This approach means that the common-case code path under conditions of > >> low contention is very nearly that of a plain ticket lock. > >> > >> A fixed number of queueing structures is statically allocated in an > >> array. The ticket-lock address is used to hash into an initial element, > >> but if that element is already in use, it moves to the next element. If > >> the entire array is already in use, continue to spin in ticket mode. > >> > >> Signed-off-by: Paul E. McKenney > >> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). > >> ] > >> [ paulmck: Address Eric Dumazet review feedback. ] > >> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > >> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > >> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > >> [ paulmck: Reduce queue-switch contention (Waiman Long). ] > >> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). > >> ] > >> [ paulmck: Type safety fixes (Steven Rostedt). ] > >> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > >> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] > >> > > [...] > > > > I did test this on 32 core machine with 32 vcpu guests. > > > > This version gave me around 20% improvement fro sysbench and 36% improvement > > for ebizzy, for 1x commit though other overcommited results showed > > degradation. I have not tested Lai Jiangshan's patches on top of this yet. > > Will report any findings. > > Sorry for late report. Not a problem, thank you for running these numbers! > With Lai's patch I see few percentage of improvement in ebizzy 1x and > reduction in degradation in dbench 1x. OK, good! But my guess is that even pushing the lock-acquisition slowpath out of line, we still would not reach parity for the less-good results. Still seems like I should add Lai Jiangshan's patches and post them somewhere in case they are helpful in some other context. Thanx, Paul > But over-commit degradation seem to still persist. 
> seeing this, I feel it is more of qmode overhead itself for large guests,
>
> +----+-------------+-----------+-------------+------------+---------------+
>                  ebizzy (rec/sec higher is better)
> +----+-------------+-----------+-------------+------------+---------------+
>          base          stdev       patched       stdev      %improvement
> +----+-------------+-----------+-------------+------------+---------------+
>  1x     5574.9000     237.4997    7851.9000     148.6737      40.84378
>  2x     2741.5000     561.3090    1620.9000     410.8299     -40.87543
>  3x     2146.2500     216.7718    1751.8333      96.5023     -18.37702
> +----+-------------+-----------+-------------+------------+---------------+
>
> +----+-------------+-----------+-------------+------------+---------------+
>                  dbench (throughput higher is better)
> +----+-------------+-----------+-------------+------------+---------------+
>          base          stdev       patched       stdev      %improvement
> +----+-------------+-----------+-------------+------------+---------------+
>  1x    14111.5600     754.4525   13826.5700    1458.0744      -2.01955
>  2x     2481.6270      71.2665    1549.3740     245.3777     -37.56620
>  3x     1510.2483      31.8634    1116.0158      26.4882     -26.10382
> +----+-------------+-----------+-------------+------------+---------------+
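The switching scheme described in the commit message quoted above can be modelled in ordinary user-space C roughly as follows. This is an illustrative sketch under stated assumptions, not the kernel patch: the lock here uses two separate atomics (the real code keeps head and tail in a single word updated by one xadd()), tkt_spin_pass() is the patch's name for the queued-mode path but is stubbed out, and the increment value is assumed. The point it shows is the core trick: tickets advance by 2, so while bit 0 of head is set, head can never equal any ticket and waiters are diverted toward queued mode.

/*
 * Illustrative user-space model of the auto-queued ticketlock idea; not
 * the kernel patch.  Tickets advance by 2, so a ticket is always even and
 * bit 0 of head is free to mean "this lock is in queued mode".
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TKT_INC 2U

struct tkt_lock {
	_Atomic uint32_t head;	/* now-serving ticket; bit 0 set => queued mode */
	_Atomic uint32_t tail;	/* next ticket to hand out */
};

/*
 * Queued-mode slow path, stubbed out.  In the patch this hashes the lock
 * address into a fixed array of queue structures; returning false means
 * "keep spinning in plain ticket mode".
 */
static bool tkt_spin_pass(struct tkt_lock *lock, uint32_t me)
{
	(void)lock; (void)me;
	return false;
}

static void tkt_lock_acquire(struct tkt_lock *lock)
{
	uint32_t me = atomic_fetch_add(&lock->tail, TKT_INC);	/* take a ticket */

	for (;;) {
		uint32_t head = atomic_load_explicit(&lock->head,
						     memory_order_acquire);
		if (head == me)
			return;		/* common case: behaves like a plain ticket lock */
		if ((head & 1) && tkt_spin_pass(lock, me))
			return;		/* queued mode handed us the lock */
		/* spin: an odd head can never match our even ticket */
	}
}

static void tkt_lock_release(struct tkt_lock *lock)
{
	/* Ticket-mode unlock: pass the lock to the next ticket.  The real
	 * patch also has to handle the case where bit 0 of head is set. */
	atomic_fetch_add_explicit(&lock->head, TKT_INC, memory_order_release);
}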
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Sun, Jun 23, 2013 at 11:23 PM, Raghavendra KT wrote: > > > On Wed, Jun 12, 2013 at 9:10 PM, Paul E. McKenney > wrote: >> >> Breaking up locks is better than implementing high-contention locks, but >> if we must have high-contention locks, why not make them automatically >> switch between light-weight ticket locks at low contention and queued >> locks at high contention? After all, this would remove the need for >> the developer to predict which locks will be highly contended. >> >> This commit allows ticket locks to automatically switch between pure >> ticketlock and queued-lock operation as needed. If too many CPUs are >> spinning on a given ticket lock, a queue structure will be allocated >> and the lock will switch to queued-lock operation. When the lock becomes >> free, it will switch back into ticketlock operation. The low-order bit >> of the head counter is used to indicate that the lock is in queued mode, >> which forces an unconditional mismatch between the head and tail counters. >> This approach means that the common-case code path under conditions of >> low contention is very nearly that of a plain ticket lock. >> >> A fixed number of queueing structures is statically allocated in an >> array. The ticket-lock address is used to hash into an initial element, >> but if that element is already in use, it moves to the next element. If >> the entire array is already in use, continue to spin in ticket mode. >> >> Signed-off-by: Paul E. McKenney >> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). >> ] >> [ paulmck: Address Eric Dumazet review feedback. ] >> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] >> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] >> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] >> [ paulmck: Reduce queue-switch contention (Waiman Long). ] >> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). >> ] >> [ paulmck: Type safety fixes (Steven Rostedt). ] >> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] >> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] >> > [...] > > I did test this on 32 core machine with 32 vcpu guests. > > This version gave me around 20% improvement fro sysbench and 36% improvement > for ebizzy, for 1x commit though other overcommited results showed > degradation. I have not tested Lai Jiangshan's patches on top of this yet. > Will report any findings. Sorry for late report. With Lai's patch I see few percentage of improvement in ebizzy 1x and reduction in degradation in dbench 1x. But over-commit degradation seem to still persist. 
seeing this, I feel it is more of qmode overhead itself for large guests,

+----+-------------+-----------+-------------+------------+---------------+
                 ebizzy (rec/sec higher is better)
+----+-------------+-----------+-------------+------------+---------------+
         base          stdev       patched       stdev      %improvement
+----+-------------+-----------+-------------+------------+---------------+
 1x     5574.9000     237.4997    7851.9000     148.6737      40.84378
 2x     2741.5000     561.3090    1620.9000     410.8299     -40.87543
 3x     2146.2500     216.7718    1751.8333      96.5023     -18.37702
+----+-------------+-----------+-------------+------------+---------------+

+----+-------------+-----------+-------------+------------+---------------+
                 dbench (throughput higher is better)
+----+-------------+-----------+-------------+------------+---------------+
         base          stdev       patched       stdev      %improvement
+----+-------------+-----------+-------------+------------+---------------+
 1x    14111.5600     754.4525   13826.5700    1458.0744      -2.01955
 2x     2481.6270      71.2665    1549.3740     245.3777     -37.56620
 3x     1510.2483      31.8634    1116.0158      26.4882     -26.10382
+----+-------------+-----------+-------------+------------+---------------+
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Fri, Jun 14, 2013 at 09:28:16AM +0800, Lai Jiangshan wrote: > On 06/14/2013 07:57 AM, Paul E. McKenney wrote: > > On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote: > >> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney > >> wrote: > >>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote: > On 06/12/2013 11:40 PM, Paul E. McKenney wrote: > > Breaking up locks is better than implementing high-contention locks, but > > if we must have high-contention locks, why not make them automatically > > switch between light-weight ticket locks at low contention and queued > > locks at high contention? After all, this would remove the need for > > the developer to predict which locks will be highly contended. > > > > This commit allows ticket locks to automatically switch between pure > > ticketlock and queued-lock operation as needed. If too many CPUs are > > spinning on a given ticket lock, a queue structure will be allocated > > and the lock will switch to queued-lock operation. When the lock > > becomes > > free, it will switch back into ticketlock operation. The low-order bit > > of the head counter is used to indicate that the lock is in queued mode, > > which forces an unconditional mismatch between the head and tail > > counters. > > This approach means that the common-case code path under conditions of > > low contention is very nearly that of a plain ticket lock. > > > > A fixed number of queueing structures is statically allocated in an > > array. The ticket-lock address is used to hash into an initial element, > > but if that element is already in use, it moves to the next element. If > > the entire array is already in use, continue to spin in ticket mode. > > > > Signed-off-by: Paul E. McKenney > > [ paulmck: Eliminate duplicate code and update comments (Steven > > Rostedt). ] > > [ paulmck: Address Eric Dumazet review feedback. ] > > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > > [ paulmck: Reduce queue-switch contention (Waiman Long). ] > > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen > > Persvold). ] > > [ paulmck: Type safety fixes (Steven Rostedt). ] > > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] > > > Hi, Paul, > > I simplify the code and remove the search by encoding the index of > struct tkt_q_head > into lock->tickets.head. > > A) lock->tickets.head(when queued-lock): > - > index of struct tkt_q_head |0|1| > - > >>> > >>> Interesting approach! It might reduce queued-mode overhead a bit in > >>> some cases, though I bet that in the common case the first queue element > >>> examined is the right one. More on this below. > >>> > The bit0 = 1 for queued, and the bit1 = 0 is reserved for > __ticket_spin_unlock(), > thus __ticket_spin_unlock() will not change the higher bits of > lock->tickets.head. > > B) tqhp->head is for the real value of lock->tickets.head. > if the last bit of tqhp->head is 1, it means it is the head ticket when > started queuing. > >>> > >>> But don't you also need the xadd() in __ticket_spin_unlock() to become > >>> a cmpxchg() for this to work? Or is your patch missing your changes to > >>> arch/x86/include/asm/spinlock.h? Either way, this is likely to increase > >>> the no-contention overhead, which might be counterproductive. Wouldn't > >>> hurt to get measurements, though. 
> >> No, don't need to change __ticket_spin_unlock() in my idea. > >> bit1 in the tickets.head is reserved for __ticket_spin_unlock(), > >> __ticket_spin_unlock() only changes the bit1, it will not change > >> the higher bits. tkt_q_do_wake() will restore the tickets.head. > >> > >> This approach avoids cmpxchg in __ticket_spin_unlock(). > > > > Ah, I did miss that. But doesn't the adjustment in __ticket_spin_lock() > > need to be atomic in order to handle concurrent invocations of > > __ticket_spin_lock()? > > I don't understand, do we just discuss about __ticket_spin_unlock() only? > Or does my suggestion hurt __ticket_spin_lock()? On many architectures, it is harmless. But my concern is that __ticket_spin_lock() is atomically incrementing the full value (both head and tail), but in such a way as to never change the value of head. So in theory, a concurrent non-atomic store to head should be OK, but it does make me quite nervous. At the very least, it needs a comment saying why it is safe. Thanx, Paul > > Either way, it would be good to see the performance effects of this.
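To make the layout behind that reasoning concrete: on x86 the head and tail tickets share a single lock word, and the xadd() in __ticket_spin_lock() adds a value whose bits all lie in the tail half, so the atomic increment can never alter head. The sketch below is simplified and illustrative rather than the actual arch/x86 definitions -- the integer widths depend on the ticket size, and the increment of 2 follows the auto-queued patch's convention of keeping bit 0 of head free.

/*
 * Simplified sketch of the x86-style ticket-lock layout being discussed;
 * widths and the increment value are illustrative assumptions.
 */
#include <stdint.h>

typedef uint16_t __ticket_t;
typedef uint32_t __ticketpair_t;

typedef struct arch_spinlock {
	union {
		__ticketpair_t head_tail;	/* both tickets in one word */
		struct __raw_tickets {
			__ticket_t head;	/* low half (little endian) */
			__ticket_t tail;	/* high half */
		} tickets;
	};
} arch_spinlock_t;

#define TKT_INC		2
#define TICKET_SHIFT	(sizeof(__ticket_t) * 8)

/*
 * Lock-side ticket grab: one atomic add on the whole word.  The addend has
 * no bits in the head half, and addition only carries upward, so this can
 * never change head -- which is why a plain store to head elsewhere (e.g.
 * the restore in tkt_q_do_wake() that Lai describes) cannot be disturbed
 * by it at the level of the head field.
 */
static inline struct __raw_tickets take_ticket(arch_spinlock_t *lock)
{
	__ticketpair_t old = __atomic_fetch_add(&lock->head_tail,
						(__ticketpair_t)TKT_INC << TICKET_SHIFT,
						__ATOMIC_SEQ_CST);

	return (struct __raw_tickets){
		.head = (__ticket_t)old,
		.tail = (__ticket_t)(old >> TICKET_SHIFT),
	};
}

The nervousness Paul mentions is about mixing this word-sized atomic with a plain store to the head half; the sketch only illustrates why the two cannot produce a torn head value, not that the overall protocol is safe.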
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Fri, Jun 14, 2013 at 03:12:43PM +0800, Lai Jiangshan wrote: > On 06/14/2013 07:57 AM, Paul E. McKenney wrote: > > On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote: > >> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney > >> wrote: > >>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote: > On 06/12/2013 11:40 PM, Paul E. McKenney wrote: > > Breaking up locks is better than implementing high-contention locks, but > > if we must have high-contention locks, why not make them automatically > > switch between light-weight ticket locks at low contention and queued > > locks at high contention? After all, this would remove the need for > > the developer to predict which locks will be highly contended. > > > > This commit allows ticket locks to automatically switch between pure > > ticketlock and queued-lock operation as needed. If too many CPUs are > > spinning on a given ticket lock, a queue structure will be allocated > > and the lock will switch to queued-lock operation. When the lock > > becomes > > free, it will switch back into ticketlock operation. The low-order bit > > of the head counter is used to indicate that the lock is in queued mode, > > which forces an unconditional mismatch between the head and tail > > counters. > > This approach means that the common-case code path under conditions of > > low contention is very nearly that of a plain ticket lock. > > > > A fixed number of queueing structures is statically allocated in an > > array. The ticket-lock address is used to hash into an initial element, > > but if that element is already in use, it moves to the next element. If > > the entire array is already in use, continue to spin in ticket mode. > > > > Signed-off-by: Paul E. McKenney > > [ paulmck: Eliminate duplicate code and update comments (Steven > > Rostedt). ] > > [ paulmck: Address Eric Dumazet review feedback. ] > > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > > [ paulmck: Reduce queue-switch contention (Waiman Long). ] > > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen > > Persvold). ] > > [ paulmck: Type safety fixes (Steven Rostedt). ] > > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] > > > Hi, Paul, > > I simplify the code and remove the search by encoding the index of > struct tkt_q_head > into lock->tickets.head. > > A) lock->tickets.head(when queued-lock): > - > index of struct tkt_q_head |0|1| > - > >>> > >>> Interesting approach! It might reduce queued-mode overhead a bit in > >>> some cases, though I bet that in the common case the first queue element > >>> examined is the right one. More on this below. > >>> > The bit0 = 1 for queued, and the bit1 = 0 is reserved for > __ticket_spin_unlock(), > thus __ticket_spin_unlock() will not change the higher bits of > lock->tickets.head. > > B) tqhp->head is for the real value of lock->tickets.head. > if the last bit of tqhp->head is 1, it means it is the head ticket when > started queuing. > >>> > >>> But don't you also need the xadd() in __ticket_spin_unlock() to become > >>> a cmpxchg() for this to work? Or is your patch missing your changes to > >>> arch/x86/include/asm/spinlock.h? Either way, this is likely to increase > >>> the no-contention overhead, which might be counterproductive. Wouldn't > >>> hurt to get measurements, though. 
> >> > >> No, don't need to change __ticket_spin_unlock() in my idea. > >> bit1 in the tickets.head is reserved for __ticket_spin_unlock(), > >> __ticket_spin_unlock() only changes the bit1, it will not change > >> the higher bits. tkt_q_do_wake() will restore the tickets.head. > >> > >> This approach avoids cmpxchg in __ticket_spin_unlock(). > > > > Ah, I did miss that. But doesn't the adjustment in __ticket_spin_lock() > > need to be atomic in order to handle concurrent invocations of > > __ticket_spin_lock()? > > > > Either way, it would be good to see the performance effects of this. > > > > Thanx, Paul > > > >>> Given the results that Davidlohr posted, I believe that the following > >>> optimizations would also provide some improvement: > >>> > >>> 1. Move the call to tkt_spin_pass() from __ticket_spin_lock() > >>> to a separate linker section in order to reduce the icache > >>> penalty exacted by the spinloop. This is likely to be causing > >>> some of the performance reductions in the cases where ticket > >>>
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On 06/14/2013 07:57 AM, Paul E. McKenney wrote: > On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote: >> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney >> wrote: >>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote: On 06/12/2013 11:40 PM, Paul E. McKenney wrote: > Breaking up locks is better than implementing high-contention locks, but > if we must have high-contention locks, why not make them automatically > switch between light-weight ticket locks at low contention and queued > locks at high contention? After all, this would remove the need for > the developer to predict which locks will be highly contended. > > This commit allows ticket locks to automatically switch between pure > ticketlock and queued-lock operation as needed. If too many CPUs are > spinning on a given ticket lock, a queue structure will be allocated > and the lock will switch to queued-lock operation. When the lock becomes > free, it will switch back into ticketlock operation. The low-order bit > of the head counter is used to indicate that the lock is in queued mode, > which forces an unconditional mismatch between the head and tail counters. > This approach means that the common-case code path under conditions of > low contention is very nearly that of a plain ticket lock. > > A fixed number of queueing structures is statically allocated in an > array. The ticket-lock address is used to hash into an initial element, > but if that element is already in use, it moves to the next element. If > the entire array is already in use, continue to spin in ticket mode. > > Signed-off-by: Paul E. McKenney > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). > ] > [ paulmck: Address Eric Dumazet review feedback. ] > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > [ paulmck: Reduce queue-switch contention (Waiman Long). ] > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). > ] > [ paulmck: Type safety fixes (Steven Rostedt). ] > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] Hi, Paul, I simplify the code and remove the search by encoding the index of struct tkt_q_head into lock->tickets.head. A) lock->tickets.head(when queued-lock): - index of struct tkt_q_head |0|1| - >>> >>> Interesting approach! It might reduce queued-mode overhead a bit in >>> some cases, though I bet that in the common case the first queue element >>> examined is the right one. More on this below. >>> The bit0 = 1 for queued, and the bit1 = 0 is reserved for __ticket_spin_unlock(), thus __ticket_spin_unlock() will not change the higher bits of lock->tickets.head. B) tqhp->head is for the real value of lock->tickets.head. if the last bit of tqhp->head is 1, it means it is the head ticket when started queuing. >>> >>> But don't you also need the xadd() in __ticket_spin_unlock() to become >>> a cmpxchg() for this to work? Or is your patch missing your changes to >>> arch/x86/include/asm/spinlock.h? Either way, this is likely to increase >>> the no-contention overhead, which might be counterproductive. Wouldn't >>> hurt to get measurements, though. >> >> No, don't need to change __ticket_spin_unlock() in my idea. >> bit1 in the tickets.head is reserved for __ticket_spin_unlock(), >> __ticket_spin_unlock() only changes the bit1, it will not change >> the higher bits. 
tkt_q_do_wake() will restore the tickets.head. >> >> This approach avoids cmpxchg in __ticket_spin_unlock(). > > Ah, I did miss that. But doesn't the adjustment in __ticket_spin_lock() > need to be atomic in order to handle concurrent invocations of > __ticket_spin_lock()? > > Either way, it would be good to see the performance effects of this. > > Thanx, Paul > >>> Given the results that Davidlohr posted, I believe that the following >>> optimizations would also provide some improvement: >>> >>> 1. Move the call to tkt_spin_pass() from __ticket_spin_lock() >>> to a separate linker section in order to reduce the icache >>> penalty exacted by the spinloop. This is likely to be causing >>> some of the performance reductions in the cases where ticket >>> locks are not highly contended. >>> >>> 2. Limit the number of elements searched for in the array of >>> queues. However, this would help only if a number of ticket >>> locks were in queued mode at the same time. >>> >>> 3. Dynamically
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On 06/14/2013 07:57 AM, Paul E. McKenney wrote: > On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote: >> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney >> wrote: >>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote: On 06/12/2013 11:40 PM, Paul E. McKenney wrote: > Breaking up locks is better than implementing high-contention locks, but > if we must have high-contention locks, why not make them automatically > switch between light-weight ticket locks at low contention and queued > locks at high contention? After all, this would remove the need for > the developer to predict which locks will be highly contended. > > This commit allows ticket locks to automatically switch between pure > ticketlock and queued-lock operation as needed. If too many CPUs are > spinning on a given ticket lock, a queue structure will be allocated > and the lock will switch to queued-lock operation. When the lock becomes > free, it will switch back into ticketlock operation. The low-order bit > of the head counter is used to indicate that the lock is in queued mode, > which forces an unconditional mismatch between the head and tail counters. > This approach means that the common-case code path under conditions of > low contention is very nearly that of a plain ticket lock. > > A fixed number of queueing structures is statically allocated in an > array. The ticket-lock address is used to hash into an initial element, > but if that element is already in use, it moves to the next element. If > the entire array is already in use, continue to spin in ticket mode. > > Signed-off-by: Paul E. McKenney > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). > ] > [ paulmck: Address Eric Dumazet review feedback. ] > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > [ paulmck: Reduce queue-switch contention (Waiman Long). ] > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). > ] > [ paulmck: Type safety fixes (Steven Rostedt). ] > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] Hi, Paul, I simplify the code and remove the search by encoding the index of struct tkt_q_head into lock->tickets.head. A) lock->tickets.head(when queued-lock): - index of struct tkt_q_head |0|1| - >>> >>> Interesting approach! It might reduce queued-mode overhead a bit in >>> some cases, though I bet that in the common case the first queue element >>> examined is the right one. More on this below. >>> The bit0 = 1 for queued, and the bit1 = 0 is reserved for __ticket_spin_unlock(), thus __ticket_spin_unlock() will not change the higher bits of lock->tickets.head. B) tqhp->head is for the real value of lock->tickets.head. if the last bit of tqhp->head is 1, it means it is the head ticket when started queuing. >>> >>> But don't you also need the xadd() in __ticket_spin_unlock() to become >>> a cmpxchg() for this to work? Or is your patch missing your changes to >>> arch/x86/include/asm/spinlock.h? Either way, this is likely to increase >>> the no-contention overhead, which might be counterproductive. Wouldn't >>> hurt to get measurements, though. >> >> No, don't need to change __ticket_spin_unlock() in my idea. >> bit1 in the tickets.head is reserved for __ticket_spin_unlock(), >> __ticket_spin_unlock() only changes the bit1, it will not change >> the higher bits. 
tkt_q_do_wake() will restore the tickets.head. >> >> This approach avoids cmpxchg in __ticket_spin_unlock(). > > Ah, I did miss that. But doesn't the adjustment in __ticket_spin_lock() > need to be atomic in order to handle concurrent invocations of > __ticket_spin_lock()? I don't understand, do we just discuss about __ticket_spin_unlock() only? Or does my suggestion hurt __ticket_spin_lock()? > > Either way, it would be good to see the performance effects of this. > > Thanx, Paul
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote: > On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney > wrote: > > On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote: > >> On 06/12/2013 11:40 PM, Paul E. McKenney wrote: > >> > Breaking up locks is better than implementing high-contention locks, but > >> > if we must have high-contention locks, why not make them automatically > >> > switch between light-weight ticket locks at low contention and queued > >> > locks at high contention? After all, this would remove the need for > >> > the developer to predict which locks will be highly contended. > >> > > >> > This commit allows ticket locks to automatically switch between pure > >> > ticketlock and queued-lock operation as needed. If too many CPUs are > >> > spinning on a given ticket lock, a queue structure will be allocated > >> > and the lock will switch to queued-lock operation. When the lock becomes > >> > free, it will switch back into ticketlock operation. The low-order bit > >> > of the head counter is used to indicate that the lock is in queued mode, > >> > which forces an unconditional mismatch between the head and tail > >> > counters. > >> > This approach means that the common-case code path under conditions of > >> > low contention is very nearly that of a plain ticket lock. > >> > > >> > A fixed number of queueing structures is statically allocated in an > >> > array. The ticket-lock address is used to hash into an initial element, > >> > but if that element is already in use, it moves to the next element. If > >> > the entire array is already in use, continue to spin in ticket mode. > >> > > >> > Signed-off-by: Paul E. McKenney > >> > [ paulmck: Eliminate duplicate code and update comments (Steven > >> > Rostedt). ] > >> > [ paulmck: Address Eric Dumazet review feedback. ] > >> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > >> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > >> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > >> > [ paulmck: Reduce queue-switch contention (Waiman Long). ] > >> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen > >> > Persvold). ] > >> > [ paulmck: Type safety fixes (Steven Rostedt). ] > >> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > >> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] > >> > >> > >> Hi, Paul, > >> > >> I simplify the code and remove the search by encoding the index of struct > >> tkt_q_head > >> into lock->tickets.head. > >> > >> A) lock->tickets.head(when queued-lock): > >> - > >> index of struct tkt_q_head |0|1| > >> - > > > > Interesting approach! It might reduce queued-mode overhead a bit in > > some cases, though I bet that in the common case the first queue element > > examined is the right one. More on this below. > > > >> The bit0 = 1 for queued, and the bit1 = 0 is reserved for > >> __ticket_spin_unlock(), > >> thus __ticket_spin_unlock() will not change the higher bits of > >> lock->tickets.head. > >> > >> B) tqhp->head is for the real value of lock->tickets.head. > >> if the last bit of tqhp->head is 1, it means it is the head ticket when > >> started queuing. > > > > But don't you also need the xadd() in __ticket_spin_unlock() to become > > a cmpxchg() for this to work? Or is your patch missing your changes to > > arch/x86/include/asm/spinlock.h? Either way, this is likely to increase > > the no-contention overhead, which might be counterproductive. Wouldn't > > hurt to get measurements, though. 
> > No, don't need to change __ticket_spin_unlock() in my idea. > bit1 in the tickets.head is reserved for __ticket_spin_unlock(), > __ticket_spin_unlock() only changes the bit1, it will not change > the higher bits. tkt_q_do_wake() will restore the tickets.head. > > This approach avoids cmpxchg in __ticket_spin_unlock(). Ah, I did miss that. But doesn't the adjustment in __ticket_spin_lock() need to be atomic in order to handle concurrent invocations of __ticket_spin_lock()? Either way, it would be good to see the performance effects of this. Thanx, Paul > > Given the results that Davidlohr posted, I believe that the following > > optimizations would also provide some improvement: > > > > 1. Move the call to tkt_spin_pass() from __ticket_spin_lock() > > to a separate linker section in order to reduce the icache > > penalty exacted by the spinloop. This is likely to be causing > > some of the performance reductions in the cases where ticket > > locks are not highly contended. > > > > 2. Limit the number of elements searched for in the array of > > queues. However, this would help only if a number of ticket > > locks were in queued mode at the same time. > > > > 3. Dynamically allocate the queue array at boot.
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney wrote: > On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote: >> On 06/12/2013 11:40 PM, Paul E. McKenney wrote: >> > Breaking up locks is better than implementing high-contention locks, but >> > if we must have high-contention locks, why not make them automatically >> > switch between light-weight ticket locks at low contention and queued >> > locks at high contention? After all, this would remove the need for >> > the developer to predict which locks will be highly contended. >> > >> > This commit allows ticket locks to automatically switch between pure >> > ticketlock and queued-lock operation as needed. If too many CPUs are >> > spinning on a given ticket lock, a queue structure will be allocated >> > and the lock will switch to queued-lock operation. When the lock becomes >> > free, it will switch back into ticketlock operation. The low-order bit >> > of the head counter is used to indicate that the lock is in queued mode, >> > which forces an unconditional mismatch between the head and tail counters. >> > This approach means that the common-case code path under conditions of >> > low contention is very nearly that of a plain ticket lock. >> > >> > A fixed number of queueing structures is statically allocated in an >> > array. The ticket-lock address is used to hash into an initial element, >> > but if that element is already in use, it moves to the next element. If >> > the entire array is already in use, continue to spin in ticket mode. >> > >> > Signed-off-by: Paul E. McKenney >> > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ] >> > [ paulmck: Address Eric Dumazet review feedback. ] >> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] >> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] >> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] >> > [ paulmck: Reduce queue-switch contention (Waiman Long). ] >> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ] >> > [ paulmck: Type safety fixes (Steven Rostedt). ] >> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] >> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] >> >> >> Hi, Paul, >> >> I simplify the code and remove the search by encoding the index of struct >> tkt_q_head >> into lock->tickets.head. >> >> A) lock->tickets.head(when queued-lock): >> - >> index of struct tkt_q_head |0|1| >> - > > Interesting approach! It might reduce queued-mode overhead a bit in > some cases, though I bet that in the common case the first queue element > examined is the right one. More on this below. > >> The bit0 = 1 for queued, and the bit1 = 0 is reserved for >> __ticket_spin_unlock(), >> thus __ticket_spin_unlock() will not change the higher bits of >> lock->tickets.head. >> >> B) tqhp->head is for the real value of lock->tickets.head. >> if the last bit of tqhp->head is 1, it means it is the head ticket when >> started queuing. > > But don't you also need the xadd() in __ticket_spin_unlock() to become > a cmpxchg() for this to work? Or is your patch missing your changes to > arch/x86/include/asm/spinlock.h? Either way, this is likely to increase > the no-contention overhead, which might be counterproductive. Wouldn't > hurt to get measurements, though. No, don't need to change __ticket_spin_unlock() in my idea. bit1 in the tickets.head is reserved for __ticket_spin_unlock(), __ticket_spin_unlock() only changes the bit1, it will not change the higher bits. 
tkt_q_do_wake() will restore the tickets.head. This approach avoids cmpxchg in __ticket_spin_unlock(). > > Given the results that Davidlohr posted, I believe that the following > optimizations would also provide some improvement: > > 1. Move the call to tkt_spin_pass() from __ticket_spin_lock() > to a separate linker section in order to reduce the icache > penalty exacted by the spinloop. This is likely to be causing > some of the performance reductions in the cases where ticket > locks are not highly contended. > > 2. Limit the number of elements searched for in the array of > queues. However, this would help only if a number of ticket > locks were in queued mode at the same time. > > 3. Dynamically allocate the queue array at boot. This might > also reduce cache pressure, again, at least in cases where > there are a number of ticket locks in queued mode at the > same time. > > Frederic just reminded me that I owe him some energy-efficiency improvements > for adaptive ticks, so I won't get to these very quickly. Please feel free > to take these on -- the patch clearly does well under high contention, so > reducing the no-contention penalty could really help. > > Thanx, Paul > >> Thanks, >> Lai >> >>
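For anyone trying to visualize the encoding Lai describes above, the arithmetic works out as in the small self-contained example below. The names TKT_QUEUED and TKT_UNLOCK_INC are invented for the sketch; only the bit positions follow Lai's description. With bit 1 kept clear while queued, the unlock path's unconditional add of 2 can only flip bit 1 and never carries into the queue index stored in the upper bits, which tkt_q_do_wake() later overwrites with the real now-serving value.

#include <assert.h>
#include <stdint.h>

#define TKT_QUEUED	0x1u	/* bit 0: lock is in queued mode */
#define TKT_UNLOCK_INC	0x2u	/* bit 1: the only bit unlock ever adds to */

/* Encode the index of a struct tkt_q_head into the head counter. */
static inline uint32_t tkt_q_encode(uint32_t qindex)
{
	return (qindex << 2) | TKT_QUEUED;	/* bit 1 deliberately left zero */
}

static inline uint32_t tkt_q_decode(uint32_t head)
{
	return head >> 2;
}

int main(void)
{
	uint32_t head = tkt_q_encode(7);	/* lock queued on slot 7 */

	/*
	 * __ticket_spin_unlock() adds its usual increment, exactly as in
	 * ticket mode.  Because bit 1 was zero, no carry reaches the index
	 * bits, so the queue can still be found afterwards.
	 */
	head += TKT_UNLOCK_INC;

	assert(head & TKT_QUEUED);		/* still marked queued */
	assert(tkt_q_decode(head) == 7);	/* index undisturbed */
	return 0;
}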
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote: > On 06/12/2013 11:40 PM, Paul E. McKenney wrote: > > Breaking up locks is better than implementing high-contention locks, but > > if we must have high-contention locks, why not make them automatically > > switch between light-weight ticket locks at low contention and queued > > locks at high contention? After all, this would remove the need for > > the developer to predict which locks will be highly contended. > > > > This commit allows ticket locks to automatically switch between pure > > ticketlock and queued-lock operation as needed. If too many CPUs are > > spinning on a given ticket lock, a queue structure will be allocated > > and the lock will switch to queued-lock operation. When the lock becomes > > free, it will switch back into ticketlock operation. The low-order bit > > of the head counter is used to indicate that the lock is in queued mode, > > which forces an unconditional mismatch between the head and tail counters. > > This approach means that the common-case code path under conditions of > > low contention is very nearly that of a plain ticket lock. > > > > A fixed number of queueing structures is statically allocated in an > > array. The ticket-lock address is used to hash into an initial element, > > but if that element is already in use, it moves to the next element. If > > the entire array is already in use, continue to spin in ticket mode. > > > > Signed-off-by: Paul E. McKenney > > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ] > > [ paulmck: Address Eric Dumazet review feedback. ] > > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > > [ paulmck: Reduce queue-switch contention (Waiman Long). ] > > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ] > > [ paulmck: Type safety fixes (Steven Rostedt). ] > > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] > > > Hi, Paul, > > I simplify the code and remove the search by encoding the index of struct > tkt_q_head > into lock->tickets.head. > > A) lock->tickets.head(when queued-lock): > - > index of struct tkt_q_head |0|1| > - Interesting approach! It might reduce queued-mode overhead a bit in some cases, though I bet that in the common case the first queue element examined is the right one. More on this below. > The bit0 = 1 for queued, and the bit1 = 0 is reserved for > __ticket_spin_unlock(), > thus __ticket_spin_unlock() will not change the higher bits of > lock->tickets.head. > > B) tqhp->head is for the real value of lock->tickets.head. > if the last bit of tqhp->head is 1, it means it is the head ticket when > started queuing. But don't you also need the xadd() in __ticket_spin_unlock() to become a cmpxchg() for this to work? Or is your patch missing your changes to arch/x86/include/asm/spinlock.h? Either way, this is likely to increase the no-contention overhead, which might be counterproductive. Wouldn't hurt to get measurements, though. Given the results that Davidlohr posted, I believe that the following optimizations would also provide some improvement: 1. Move the call to tkt_spin_pass() from __ticket_spin_lock() to a separate linker section in order to reduce the icache penalty exacted by the spinloop. 
This is likely to be causing some of the performance reductions in the cases where ticket locks are not highly contended. 2. Limit the number of elements searched for in the array of queues. However, this would help only if a number of ticket locks were in queued mode at the same time. 3. Dynamically allocate the queue array at boot. This might also reduce cache pressure, again, at least in cases where there are a number of ticket locks in queued mode at the same time. Frederic just reminded me that I owe him some energy-efficiency improvements for adaptive ticks, so I won't get to these very quickly. Please feel free to take these on -- the patch clearly does well under high contention, so reducing the no-contention penalty could really help. Thanx, Paul > Thanks, > Lai > > kernel/tktqlock.c | 51 +-- > 1 files changed, 13 insertions(+), 38 deletions(-) > > diff --git a/kernel/tktqlock.c b/kernel/tktqlock.c > index 912817c..1329d0f 100644 > --- a/kernel/tktqlock.c > +++ b/kernel/tktqlock.c > @@ -33,7 +33,7 @@ struct tkt_q { > > struct tkt_q_head { > arch_spinlock_t *ref; /* Pointer to spinlock if in use. */ > - s64 head_tkt; /* Head ticket when started queuing. */
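Optimization 1 above is about code placement rather than locking semantics: keep only the uncontended check inline and let the contended loop live in a cold text section. The following self-contained C11 sketch shows that layout using a plain ticket lock rather than the auto-queued lock itself; the names and the use of the noinline/cold attributes (which GCC and Clang typically honor by placing the function in .text.unlikely) are illustrative, not from the patch.

#include <stdatomic.h>

struct tkt_lock {
        _Atomic unsigned short head;
        _Atomic unsigned short tail;
};

/* Cold, out-of-line contended path, kept away from the hot caller. */
__attribute__((noinline, cold))
static void tkt_lock_contended(struct tkt_lock *lock, unsigned short my_ticket)
{
        while (atomic_load_explicit(&lock->head, memory_order_acquire) != my_ticket)
                ;       /* spin; the kernel would cpu_relax() here */
}

/* Hot fast path: only the ticket grab and one compare stay inline. */
static inline void tkt_lock(struct tkt_lock *lock)
{
        unsigned short my_ticket =
                atomic_fetch_add_explicit(&lock->tail, 1, memory_order_relaxed);

        if (atomic_load_explicit(&lock->head, memory_order_acquire) != my_ticket)
                tkt_lock_contended(lock, my_ticket);    /* jump out of line */
}

static inline void tkt_unlock(struct tkt_lock *lock)
{
        atomic_fetch_add_explicit(&lock->head, 1, memory_order_release);
}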
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote: On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote: On 06/12/2013 11:40 PM, Paul E. McKenney wrote: Breaking up locks is better than implementing high-contention locks, but if we must have high-contention locks, why not make them automatically switch between light-weight ticket locks at low contention and queued locks at high contention? After all, this would remove the need for the developer to predict which locks will be highly contended. This commit allows ticket locks to automatically switch between pure ticketlock and queued-lock operation as needed. If too many CPUs are spinning on a given ticket lock, a queue structure will be allocated and the lock will switch to queued-lock operation. When the lock becomes free, it will switch back into ticketlock operation. The low-order bit of the head counter is used to indicate that the lock is in queued mode, which forces an unconditional mismatch between the head and tail counters. This approach means that the common-case code path under conditions of low contention is very nearly that of a plain ticket lock. A fixed number of queueing structures is statically allocated in an array. The ticket-lock address is used to hash into an initial element, but if that element is already in use, it moves to the next element. If the entire array is already in use, continue to spin in ticket mode. Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ] [ paulmck: Address Eric Dumazet review feedback. ] [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] [ paulmck: Expand -head_tkt from s32 to s64 (Waiman Long). ] [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] [ paulmck: Reduce queue-switch contention (Waiman Long). ] [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ] [ paulmck: Type safety fixes (Steven Rostedt). ] [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] Hi, Paul, I simplify the code and remove the search by encoding the index of struct tkt_q_head into lock-tickets.head. A) lock-tickets.head(when queued-lock): - index of struct tkt_q_head |0|1| - Interesting approach! It might reduce queued-mode overhead a bit in some cases, though I bet that in the common case the first queue element examined is the right one. More on this below. The bit0 = 1 for queued, and the bit1 = 0 is reserved for __ticket_spin_unlock(), thus __ticket_spin_unlock() will not change the higher bits of lock-tickets.head. B) tqhp-head is for the real value of lock-tickets.head. if the last bit of tqhp-head is 1, it means it is the head ticket when started queuing. But don't you also need the xadd() in __ticket_spin_unlock() to become a cmpxchg() for this to work? Or is your patch missing your changes to arch/x86/include/asm/spinlock.h? Either way, this is likely to increase the no-contention overhead, which might be counterproductive. Wouldn't hurt to get measurements, though. No, don't need to change __ticket_spin_unlock() in my idea. bit1 in the tickets.head is reserved for __ticket_spin_unlock(), __ticket_spin_unlock() only changes the bit1, it will not change the higher bits. tkt_q_do_wake() will restore the tickets.head. This approach avoids cmpxchg in __ticket_spin_unlock(). Ah, I did miss that. 
But doesn't the adjustment in __ticket_spin_lock() need to be atomic in order to handle concurrent invocations of __ticket_spin_lock()? Either way, it would be good to see the performance effects of this. Thanx, Paul Given the results that Davidlohr posted, I believe that the following optimizations would also provide some improvement: 1. Move the call to tkt_spin_pass() from __ticket_spin_lock() to a separate linker section in order to reduce the icache penalty exacted by the spinloop. This is likely to be causing some of the performance reductions in the cases where ticket locks are not highly contended. 2. Limit the number of elements searched for in the array of queues. However, this would help only if a number of ticket locks were in queued mode at the same time. 3. Dynamically allocate the queue array at boot. This might also reduce cache pressure, again, at least in cases where there are a number of ticket locks in queued mode at the same time. Frederic just reminded me that I owe him some
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On 06/14/2013 07:57 AM, Paul E. McKenney wrote: On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote: On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote: On 06/12/2013 11:40 PM, Paul E. McKenney wrote: Breaking up locks is better than implementing high-contention locks, but if we must have high-contention locks, why not make them automatically switch between light-weight ticket locks at low contention and queued locks at high contention? After all, this would remove the need for the developer to predict which locks will be highly contended. This commit allows ticket locks to automatically switch between pure ticketlock and queued-lock operation as needed. If too many CPUs are spinning on a given ticket lock, a queue structure will be allocated and the lock will switch to queued-lock operation. When the lock becomes free, it will switch back into ticketlock operation. The low-order bit of the head counter is used to indicate that the lock is in queued mode, which forces an unconditional mismatch between the head and tail counters. This approach means that the common-case code path under conditions of low contention is very nearly that of a plain ticket lock. A fixed number of queueing structures is statically allocated in an array. The ticket-lock address is used to hash into an initial element, but if that element is already in use, it moves to the next element. If the entire array is already in use, continue to spin in ticket mode. Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ] [ paulmck: Address Eric Dumazet review feedback. ] [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] [ paulmck: Expand -head_tkt from s32 to s64 (Waiman Long). ] [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] [ paulmck: Reduce queue-switch contention (Waiman Long). ] [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ] [ paulmck: Type safety fixes (Steven Rostedt). ] [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] Hi, Paul, I simplify the code and remove the search by encoding the index of struct tkt_q_head into lock-tickets.head. A) lock-tickets.head(when queued-lock): - index of struct tkt_q_head |0|1| - Interesting approach! It might reduce queued-mode overhead a bit in some cases, though I bet that in the common case the first queue element examined is the right one. More on this below. The bit0 = 1 for queued, and the bit1 = 0 is reserved for __ticket_spin_unlock(), thus __ticket_spin_unlock() will not change the higher bits of lock-tickets.head. B) tqhp-head is for the real value of lock-tickets.head. if the last bit of tqhp-head is 1, it means it is the head ticket when started queuing. But don't you also need the xadd() in __ticket_spin_unlock() to become a cmpxchg() for this to work? Or is your patch missing your changes to arch/x86/include/asm/spinlock.h? Either way, this is likely to increase the no-contention overhead, which might be counterproductive. Wouldn't hurt to get measurements, though. No, don't need to change __ticket_spin_unlock() in my idea. bit1 in the tickets.head is reserved for __ticket_spin_unlock(), __ticket_spin_unlock() only changes the bit1, it will not change the higher bits. tkt_q_do_wake() will restore the tickets.head. 
This approach avoids cmpxchg in __ticket_spin_unlock(). Ah, I did miss that. But doesn't the adjustment in __ticket_spin_lock() need to be atomic in order to handle concurrent invocations of __ticket_spin_lock()? I don't understand: are we only discussing __ticket_spin_unlock(), or does my suggestion hurt __ticket_spin_lock() as well? Either way, it would be good to see the performance effects of this. Thanx, Paul
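As background for the atomicity question being debated above, the following standalone C11 program shows the generic hazard with a non-atomic read-modify-write on a counter that another thread updates with an atomic add: the separate load and store can overwrite and lose the concurrent increment. Whether this exact window exists in Lai's patch is precisely what Paul and Lai are trying to pin down; everything here (names, counts) is invented purely for illustration.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned int head;

static void *unlocker(void *arg)
{
        (void)arg;
        for (int i = 0; i < 100000; i++)
                atomic_fetch_add(&head, 2);             /* like an unlock xadd */
        return NULL;
}

int main(void)
{
        pthread_t t;

        pthread_create(&t, NULL, unlocker, NULL);
        for (int i = 0; i < 100000; i++) {
                /* Non-atomic read-modify-write, mimicking a plain "head -= 2". */
                unsigned int tmp = atomic_load_explicit(&head, memory_order_relaxed);
                atomic_store_explicit(&head, tmp - 2, memory_order_relaxed);
        }
        pthread_join(t, NULL);
        /* With both sides atomic this would print 0; any other value means
         * an increment or decrement was overwritten and lost. */
        printf("final head = %u\n", atomic_load(&head));
        return 0;
}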
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On 06/12/2013 11:40 PM, Paul E. McKenney wrote: > Breaking up locks is better than implementing high-contention locks, but > if we must have high-contention locks, why not make them automatically > switch between light-weight ticket locks at low contention and queued > locks at high contention? After all, this would remove the need for > the developer to predict which locks will be highly contended. > > This commit allows ticket locks to automatically switch between pure > ticketlock and queued-lock operation as needed. If too many CPUs are > spinning on a given ticket lock, a queue structure will be allocated > and the lock will switch to queued-lock operation. When the lock becomes > free, it will switch back into ticketlock operation. The low-order bit > of the head counter is used to indicate that the lock is in queued mode, > which forces an unconditional mismatch between the head and tail counters. > This approach means that the common-case code path under conditions of > low contention is very nearly that of a plain ticket lock. > > A fixed number of queueing structures is statically allocated in an > array. The ticket-lock address is used to hash into an initial element, > but if that element is already in use, it moves to the next element. If > the entire array is already in use, continue to spin in ticket mode. > > Signed-off-by: Paul E. McKenney > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ] > [ paulmck: Address Eric Dumazet review feedback. ] > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > [ paulmck: Reduce queue-switch contention (Waiman Long). ] > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ] > [ paulmck: Type safety fixes (Steven Rostedt). ] > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] Hi, Paul, I simplify the code and remove the search by encoding the index of struct tkt_q_head into lock->tickets.head. A) lock->tickets.head(when queued-lock): - index of struct tkt_q_head |0|1| - The bit0 = 1 for queued, and the bit1 = 0 is reserved for __ticket_spin_unlock(), thus __ticket_spin_unlock() will not change the higher bits of lock->tickets.head. B) tqhp->head is for the real value of lock->tickets.head. if the last bit of tqhp->head is 1, it means it is the head ticket when started queuing. Thanks, Lai kernel/tktqlock.c | 51 +-- 1 files changed, 13 insertions(+), 38 deletions(-) diff --git a/kernel/tktqlock.c b/kernel/tktqlock.c index 912817c..1329d0f 100644 --- a/kernel/tktqlock.c +++ b/kernel/tktqlock.c @@ -33,7 +33,7 @@ struct tkt_q { struct tkt_q_head { arch_spinlock_t *ref; /* Pointer to spinlock if in use. */ - s64 head_tkt; /* Head ticket when started queuing. */ + __ticket_t head;/* Real head when queued. */ struct tkt_q *spin; /* Head of queue. */ struct tkt_q **spin_tail; /* Tail of queue. 
*/ }; @@ -77,15 +77,8 @@ static unsigned long tkt_q_hash(arch_spinlock_t *lock) */ static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *lock) { - int i; - int start; - - start = i = tkt_q_hash(lock); - do - if (ACCESS_ONCE(tkt_q_heads[i].ref) == lock) - return &tkt_q_heads[i]; - while ((i = tkt_q_next_slot(i)) != start); - return NULL; + BUILD_BUG_ON(TKT_Q_NQUEUES > (1 << (TICKET_SHIFT - 2))); + return &tkt_q_heads[ACCESS_ONCE(lock->tickets.head) >> 2]; } /* @@ -101,11 +94,11 @@ static bool tkt_q_try_unqueue(arch_spinlock_t *lock, struct tkt_q_head *tqhp) /* Pick up the ticket values. */ asold = ACCESS_ONCE(*lock); - if ((asold.tickets.head & ~0x1) == asold.tickets.tail) { + if (tqhp->head == asold.tickets.tail) { /* Attempt to mark the lock as not having a queue. */ asnew = asold; - asnew.tickets.head &= ~0x1; + asnew.tickets.head = tqhp->head; if (cmpxchg(&lock->head_tail, asold.head_tail, asnew.head_tail) == asold.head_tail) { @@ -128,12 +121,9 @@ void tkt_q_do_wake(arch_spinlock_t *lock) struct tkt_q_head *tqhp; struct tkt_q *tqp; - /* - * If the queue is still being set up, wait for it. Note that - * the caller's xadd() provides the needed memory ordering. - */ - while ((tqhp = tkt_q_find_head(lock)) == NULL) - cpu_relax(); + tqhp = tkt_q_find_head(lock); + ACCESS_ONCE(lock->tickets.head) -= __TKT_SPIN_INC; + ACCESS_ONCE(tqhp->head) = (tqhp->head &
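For context on what Lai's change removes: the original scheme described in the changelog hashes the lock address to a starting slot and probes linearly for a free queue element, falling back to pure ticket mode when the array is full. A self-contained approximation of that allocation step is below; the hash function, array size, and names are guesses for illustration only, not the patch's code.

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define TKT_Q_NQUEUES 64                /* assumed array size */

struct tkt_q_head_sketch {
        _Atomic(void *) ref;            /* lock owning this slot, or NULL */
        /* ... queue head/tail pointers would live here ... */
};

static struct tkt_q_head_sketch tkt_q_heads_sketch[TKT_Q_NQUEUES];

static size_t tkt_q_hash_sketch(void *lock)
{
        /* Drop low alignment bits, then fold into the table size. */
        return ((uintptr_t)lock >> 8) % TKT_Q_NQUEUES;
}

/* Try to claim a slot for @lock; NULL means the array is full, so the
 * caller keeps spinning in plain ticket mode. */
static struct tkt_q_head_sketch *tkt_q_claim_sketch(void *lock)
{
        size_t start = tkt_q_hash_sketch(lock), i = start;

        do {
                void *expected = NULL;

                if (atomic_compare_exchange_strong(&tkt_q_heads_sketch[i].ref,
                                                   &expected, lock))
                        return &tkt_q_heads_sketch[i];
                i = (i + 1) % TKT_Q_NQUEUES;
        } while (i != start);
        return NULL;
}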
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Thu, Jun 13, 2013 at 12:13:47AM +0800, Lai Jiangshan wrote: > On Wed, Jun 12, 2013 at 11:40 PM, Paul E. McKenney > wrote: > > Breaking up locks is better than implementing high-contention locks, but > > if we must have high-contention locks, why not make them automatically > > switch between light-weight ticket locks at low contention and queued > > locks at high contention? After all, this would remove the need for > > the developer to predict which locks will be highly contended. > > > > This commit allows ticket locks to automatically switch between pure > > ticketlock and queued-lock operation as needed. If too many CPUs are > > spinning on a given ticket lock, a queue structure will be allocated > > and the lock will switch to queued-lock operation. When the lock becomes > > free, it will switch back into ticketlock operation. The low-order bit > > of the head counter is used to indicate that the lock is in queued mode, > > which forces an unconditional mismatch between the head and tail counters. > > This approach means that the common-case code path under conditions of > > low contention is very nearly that of a plain ticket lock. > > > > A fixed number of queueing structures is statically allocated in an > > array. The ticket-lock address is used to hash into an initial element, > > but if that element is already in use, it moves to the next element. If > > the entire array is already in use, continue to spin in ticket mode. > > > > Signed-off-by: Paul E. McKenney > > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ] > > [ paulmck: Address Eric Dumazet review feedback. ] > > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > > [ paulmck: Reduce queue-switch contention (Waiman Long). ] > > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ] > > [ paulmck: Type safety fixes (Steven Rostedt). ] > > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] > > > > diff --git a/arch/x86/include/asm/spinlock.h > > b/arch/x86/include/asm/spinlock.h > > index 33692ea..5aa0177 100644 > > --- a/arch/x86/include/asm/spinlock.h > > +++ b/arch/x86/include/asm/spinlock.h > > @@ -34,6 +34,21 @@ > > # define UNLOCK_LOCK_PREFIX > > #endif > > > > +#ifdef CONFIG_TICKET_LOCK_QUEUED > > + > > +#define __TKT_SPIN_INC 2 > > +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc); > > + > > +#else /* #ifdef CONFIG_TICKET_LOCK_QUEUED */ > > + > > +#define __TKT_SPIN_INC 1 > > +static inline bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets > > inc) > > +{ > > + return false; > > +} > > + > > +#endif /* #else #ifdef CONFIG_TICKET_LOCK_QUEUED */ > > + > > /* > > * Ticket locks are conceptually two parts, one indicating the current > > head of > > * the queue, and the other indicating the current tail.
The lock is > > acquired > > @@ -49,17 +64,16 @@ > > */ > > static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) > > { > > - register struct __raw_tickets inc = { .tail = 1 }; > > + register struct __raw_tickets inc = { .tail = __TKT_SPIN_INC }; > > > > inc = xadd(&lock->tickets, inc); > > - > > for (;;) { > > - if (inc.head == inc.tail) > > + if (inc.head == inc.tail || tkt_spin_pass(lock, inc)) > > break; > > cpu_relax(); > > inc.head = ACCESS_ONCE(lock->tickets.head); > > } > > - barrier(); /* make sure nothing creeps before the lock > > is taken */ > > + barrier(); /* Make sure nothing creeps in before the lock is taken. > > */ > > } > > > > static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) > > @@ -70,17 +84,33 @@ static __always_inline int > > __ticket_spin_trylock(arch_spinlock_t *lock) > > if (old.tickets.head != old.tickets.tail) > > return 0; > > > > - new.head_tail = old.head_tail + (1 << TICKET_SHIFT); > > + new.head_tail = old.head_tail + (__TKT_SPIN_INC << TICKET_SHIFT); > > > > /* cmpxchg is a full barrier, so nothing can move before it */ > > return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == > > old.head_tail; > > } > > > > +#ifndef CONFIG_TICKET_LOCK_QUEUED > > + > > static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) > > { > > __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX); > > } > > > > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */ > > + > > +extern void tkt_q_do_wake(arch_spinlock_t *lock); > > + > > +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) > > +{ > > + __ticket_t head = 2; > > + > > + head = xadd(&lock->tickets.head, head); > > + if (head & 0x1) > > + tkt_q_do_wake(lock);
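The queued-mode unlock quoted above leans on one small arithmetic invariant: with __TKT_SPIN_INC equal to 2, head and tail only ever move in steps of 2, so the xadd in __ticket_spin_unlock() can never set or clear bit0, and an odd head can never equal an even tail (the "unconditional mismatch" the changelog mentions). A throwaway check of that property, purely illustrative and not part of the patch:

#include <assert.h>
#include <stdio.h>

#define __TKT_SPIN_INC 2

int main(void)
{
        unsigned short head = 0x0007;   /* queued mode: bit0 already set */

        for (int i = 0; i < 1000; i++) {
                unsigned short old = head;

                head += __TKT_SPIN_INC;                 /* what the unlock xadd does */
                assert((head & 0x1) == (old & 0x1));    /* bit0 is preserved */
        }
        printf("bit0 survived 1000 unlock increments: head=%#x\n", head);
        return 0;
}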
Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock
On Wed, Jun 12, 2013 at 11:40 PM, Paul E. McKenney wrote: > Breaking up locks is better than implementing high-contention locks, but > if we must have high-contention locks, why not make them automatically > switch between light-weight ticket locks at low contention and queued > locks at high contention? After all, this would remove the need for > the developer to predict which locks will be highly contended. > > This commit allows ticket locks to automatically switch between pure > ticketlock and queued-lock operation as needed. If too many CPUs are > spinning on a given ticket lock, a queue structure will be allocated > and the lock will switch to queued-lock operation. When the lock becomes > free, it will switch back into ticketlock operation. The low-order bit > of the head counter is used to indicate that the lock is in queued mode, > which forces an unconditional mismatch between the head and tail counters. > This approach means that the common-case code path under conditions of > low contention is very nearly that of a plain ticket lock. > > A fixed number of queueing structures is statically allocated in an > array. The ticket-lock address is used to hash into an initial element, > but if that element is already in use, it moves to the next element. If > the entire array is already in use, continue to spin in ticket mode. > > Signed-off-by: Paul E. McKenney > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ] > [ paulmck: Address Eric Dumazet review feedback. ] > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ] > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ] > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ] > [ paulmck: Reduce queue-switch contention (Waiman Long). ] > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ] > [ paulmck: Type safety fixes (Steven Rostedt). ] > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ] > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ] > > diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h > index 33692ea..5aa0177 100644 > --- a/arch/x86/include/asm/spinlock.h > +++ b/arch/x86/include/asm/spinlock.h > @@ -34,6 +34,21 @@ > # define UNLOCK_LOCK_PREFIX > #endif > > +#ifdef CONFIG_TICKET_LOCK_QUEUED > + > +#define __TKT_SPIN_INC 2 > +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc); > + > +#else /* #ifdef CONFIG_TICKET_LOCK_QUEUED */ > + > +#define __TKT_SPIN_INC 1 > +static inline bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets > inc) > +{ > + return false; > +} > + > +#endif /* #else #ifdef CONFIG_TICKET_LOCK_QUEUED */ > + > /* > * Ticket locks are conceptually two parts, one indicating the current head > of > * the queue, and the other indicating the current tail. The lock is acquired > @@ -49,17 +64,16 @@ > */ > static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) > { > - register struct __raw_tickets inc = { .tail = 1 }; > + register struct __raw_tickets inc = { .tail = __TKT_SPIN_INC }; > > inc = xadd(>tickets, inc); > - > for (;;) { > - if (inc.head == inc.tail) > + if (inc.head == inc.tail || tkt_spin_pass(lock, inc)) > break; > cpu_relax(); > inc.head = ACCESS_ONCE(lock->tickets.head); > } > - barrier(); /* make sure nothing creeps before the lock > is taken */ > + barrier(); /* Make sure nothing creeps in before the lock is taken. 
*/ > } > > static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) > @@ -70,17 +84,33 @@ static __always_inline int > __ticket_spin_trylock(arch_spinlock_t *lock) > if (old.tickets.head != old.tickets.tail) > return 0; > > - new.head_tail = old.head_tail + (1 << TICKET_SHIFT); > + new.head_tail = old.head_tail + (__TKT_SPIN_INC << TICKET_SHIFT); > > /* cmpxchg is a full barrier, so nothing can move before it */ > return cmpxchg(>head_tail, old.head_tail, new.head_tail) == > old.head_tail; > } > > +#ifndef CONFIG_TICKET_LOCK_QUEUED > + > static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) > { > __add(>tickets.head, 1, UNLOCK_LOCK_PREFIX); > } > > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */ > + > +extern void tkt_q_do_wake(arch_spinlock_t *lock); > + > +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) > +{ > + __ticket_t head = 2; > + > + head = xadd(>tickets.head, head); > + if (head & 0x1) > + tkt_q_do_wake(lock); > +} > +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */ > + > static inline int __ticket_spin_is_locked(arch_spinlock_t *lock) > { > struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets); > diff --git a/arch/x86/include/asm/spinlock_types.h > b/arch/x86/include/asm/spinlock_types.h > index