Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-07-01 Thread Paul E. McKenney
On Mon, Jul 01, 2013 at 02:49:34PM +0530, Raghavendra KT wrote:
> On Sun, Jun 23, 2013 at 11:23 PM, Raghavendra KT
>  wrote:
> >
> >
> > On Wed, Jun 12, 2013 at 9:10 PM, Paul E. McKenney
> >  wrote:
> >>
> >> Breaking up locks is better than implementing high-contention locks, but
> >> if we must have high-contention locks, why not make them automatically
> >> switch between light-weight ticket locks at low contention and queued
> >> locks at high contention?  After all, this would remove the need for
> >> the developer to predict which locks will be highly contended.
> >>
> >> This commit allows ticket locks to automatically switch between pure
> >> ticketlock and queued-lock operation as needed.  If too many CPUs are
> >> spinning on a given ticket lock, a queue structure will be allocated
> >> and the lock will switch to queued-lock operation.  When the lock becomes
> >> free, it will switch back into ticketlock operation.  The low-order bit
> >> of the head counter is used to indicate that the lock is in queued mode,
> >> which forces an unconditional mismatch between the head and tail counters.
> >> This approach means that the common-case code path under conditions of
> >> low contention is very nearly that of a plain ticket lock.
> >>
> >> A fixed number of queueing structures is statically allocated in an
> >> array.  The ticket-lock address is used to hash into an initial element,
> >> but if that element is already in use, it moves to the next element.  If
> >> the entire array is already in use, continue to spin in ticket mode.
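For illustration, a minimal stand-alone sketch of the two mechanisms described above: the queued-mode flag in the head counter's low-order bit, and the hash-and-probe into a fixed array of queue heads.  The C11 atomics and the names TKT_Q_NQUEUES and tkt_q_find_slot are assumptions of the sketch, not the patch's actual code:

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define TKT_Q_NQUEUES 64        /* illustrative size of the fixed array */

struct tkt_q_head {
        _Atomic(void *) ref;    /* ticket lock this queue serves, or NULL */
        /* ... per-lock queue of spinning CPUs would live here ... */
};

static struct tkt_q_head tkt_q_heads[TKT_Q_NQUEUES];

/* Low-order bit of ->head set means "queued mode": head can then never
 * equal tail, so the normal ticket hand-off path is suppressed. */
static inline int tkt_q_is_queued(uint32_t head)
{
        return head & 0x1;
}

/* Hash the lock address to a starting slot, then probe linearly.  If a
 * slot already serves this lock, join it; if a free slot is found, claim
 * it; if every slot is busy, return NULL and keep spinning in ticket mode. */
static struct tkt_q_head *tkt_q_find_slot(void *lock)
{
        size_t start = ((uintptr_t)lock >> 6) % TKT_Q_NQUEUES;
        size_t i;

        for (i = 0; i < TKT_Q_NQUEUES; i++) {
                struct tkt_q_head *tqhp = &tkt_q_heads[(start + i) % TKT_Q_NQUEUES];
                void *expected = NULL;

                if (atomic_load(&tqhp->ref) == lock)
                        return tqhp;    /* another CPU already queued this lock */
                if (atomic_compare_exchange_strong(&tqhp->ref, &expected, lock))
                        return tqhp;    /* claimed a free slot for this lock */
        }
        return NULL;                    /* array full: stay in ticket mode */
}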
> >>
> >> Signed-off-by: Paul E. McKenney 
> >> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt).
> >> ]
> >> [ paulmck: Address Eric Dumazet review feedback. ]
> >> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> >> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> >> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> >> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> >> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold).
> >> ]
> >> [ paulmck: Type safety fixes (Steven Rostedt). ]
> >> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> >> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
> >>
> > [...]
> >
> > I did test this on 32 core machine with 32 vcpu guests.
> >
> > This version gave me around 20% improvement for sysbench and 36% improvement
> > for ebizzy at 1x overcommit, though the other overcommitted results showed
> > degradation. I have not tested Lai Jiangshan's patches on top of this yet.
> > Will report any findings.
> 
> Sorry for the late report.

Not a problem, thank you for running these numbers!

> With Lai's patch I see a few percent improvement in ebizzy 1x and
> reduced degradation in dbench 1x.

OK, good!  But my guess is that even after pushing the lock-acquisition
slowpath out of line, we would still not reach parity on the less-good
results.  Still, it seems like I should pick up Lai Jiangshan's patches
and post them somewhere in case they are helpful in some other context.

Thanx, Paul

> But the over-commit degradation seems to persist.  Seeing this, I feel
> it is more the queued-mode (qmode) overhead itself for large guests:
> 
> +---+---+---+---++---+
>   ebizzy (rec/sec higher is better)
> +---+---+---+---+---++---+
> base  stdev patched   stdev %improvement
> +---+---+---+---++---+
> 1x  5574.9000   237.4997  7851.9000   148.673740.84378
> 2x  2741.5000   561.3090  1620.9000   410.8299   -40.87543
> 3x  2146.2500   216.7718  1751.833396.5023   -18.37702
> +---+---+---+---++---+
> +---+---+---+---++---+
>   dbench (throughput higher is better)
> +---+---+---+---++---+
> base  stdev patched   stdev %improvement
> +---+---+---+---++---+
> 1x 14111.5600   754.4525 13826.5700  1458.0744-2.01955
> 2x  2481.627071.2665  1549.3740   245.3777   -37.56620
> 3x  1510.248331.8634  1116.015826.4882   -26.10382
> +---+---+---+---++---+
> 



Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-07-01 Thread Raghavendra KT
On Sun, Jun 23, 2013 at 11:23 PM, Raghavendra KT
 wrote:
>
>
> On Wed, Jun 12, 2013 at 9:10 PM, Paul E. McKenney
>  wrote:
>>
>> Breaking up locks is better than implementing high-contention locks, but
>> if we must have high-contention locks, why not make them automatically
>> switch between light-weight ticket locks at low contention and queued
>> locks at high contention?  After all, this would remove the need for
>> the developer to predict which locks will be highly contended.
>>
>> This commit allows ticket locks to automatically switch between pure
>> ticketlock and queued-lock operation as needed.  If too many CPUs are
>> spinning on a given ticket lock, a queue structure will be allocated
>> and the lock will switch to queued-lock operation.  When the lock becomes
>> free, it will switch back into ticketlock operation.  The low-order bit
>> of the head counter is used to indicate that the lock is in queued mode,
>> which forces an unconditional mismatch between the head and tail counters.
>> This approach means that the common-case code path under conditions of
>> low contention is very nearly that of a plain ticket lock.
>>
>> A fixed number of queueing structures is statically allocated in an
>> array.  The ticket-lock address is used to hash into an initial element,
>> but if that element is already in use, it moves to the next element.  If
>> the entire array is already in use, continue to spin in ticket mode.
>>
>> Signed-off-by: Paul E. McKenney 
>> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt).
>> ]
>> [ paulmck: Address Eric Dumazet review feedback. ]
>> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
>> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
>> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
>> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
>> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold).
>> ]
>> [ paulmck: Type safety fixes (Steven Rostedt). ]
>> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
>> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
>>
> [...]
>
> I did test this on 32 core machine with 32 vcpu guests.
>
> This version gave me around 20% improvement for sysbench and 36% improvement
> for ebizzy at 1x overcommit, though the other overcommitted results showed
> degradation. I have not tested Lai Jiangshan's patches on top of this yet.
> Will report any findings.

Sorry for the late report.

With Lai's patch I see a few percent improvement in ebizzy 1x and
reduced degradation in dbench 1x.

But the over-commit degradation seems to persist.  Seeing this, I feel
it is more the queued-mode (qmode) overhead itself for large guests:

+----+------------+-----------+------------+-----------+--------------+
                 ebizzy (rec/sec, higher is better)
+----+------------+-----------+------------+-----------+--------------+
       base         stdev       patched      stdev      %improvement
+----+------------+-----------+------------+-----------+--------------+
1x     5574.9000    237.4997    7851.9000    148.6737      40.84378
2x     2741.5000    561.3090    1620.9000    410.8299     -40.87543
3x     2146.2500    216.7718    1751.8333     96.5023     -18.37702
+----+------------+-----------+------------+-----------+--------------+

+----+------------+-----------+------------+-----------+--------------+
                 dbench (throughput, higher is better)
+----+------------+-----------+------------+-----------+--------------+
       base         stdev       patched      stdev      %improvement
+----+------------+-----------+------------+-----------+--------------+
1x    14111.5600    754.4525   13826.5700   1458.0744      -2.01955
2x     2481.6270     71.2665    1549.3740    245.3777     -37.56620
3x     1510.2483     31.8634    1116.0158     26.4882     -26.10382
+----+------------+-----------+------------+-----------+--------------+


Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-14 Thread Paul E. McKenney
On Fri, Jun 14, 2013 at 09:28:16AM +0800, Lai Jiangshan wrote:
> On 06/14/2013 07:57 AM, Paul E. McKenney wrote:
> > On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote:
> >> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
> >>  wrote:
> >>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
>  On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> > Breaking up locks is better than implementing high-contention locks, but
> > if we must have high-contention locks, why not make them automatically
> > switch between light-weight ticket locks at low contention and queued
> > locks at high contention?  After all, this would remove the need for
> > the developer to predict which locks will be highly contended.
> >
> > This commit allows ticket locks to automatically switch between pure
> > ticketlock and queued-lock operation as needed.  If too many CPUs are
> > spinning on a given ticket lock, a queue structure will be allocated
> > and the lock will switch to queued-lock operation.  When the lock 
> > becomes
> > free, it will switch back into ticketlock operation.  The low-order bit
> > of the head counter is used to indicate that the lock is in queued mode,
> > which forces an unconditional mismatch between the head and tail 
> > counters.
> > This approach means that the common-case code path under conditions of
> > low contention is very nearly that of a plain ticket lock.
> >
> > A fixed number of queueing structures is statically allocated in an
> > array.  The ticket-lock address is used to hash into an initial element,
> > but if that element is already in use, it moves to the next element.  If
> > the entire array is already in use, continue to spin in ticket mode.
> >
> > Signed-off-by: Paul E. McKenney 
> > [ paulmck: Eliminate duplicate code and update comments (Steven 
> > Rostedt). ]
> > [ paulmck: Address Eric Dumazet review feedback. ]
> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> > [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen 
> > Persvold). ]
> > [ paulmck: Type safety fixes (Steven Rostedt). ]
> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
> 
> 
>  Hi, Paul,
> 
>  I simplify the code and remove the search by encoding the index of 
>  struct tkt_q_head
>  into lock->tickets.head.
> 
>  A) lock->tickets.head(when queued-lock):
>  ---------------------------------
>   index of struct tkt_q_head |0|1|
>  ---------------------------------
> >>>
> >>> Interesting approach!  It might reduce queued-mode overhead a bit in
> >>> some cases, though I bet that in the common case the first queue element
> >>> examined is the right one.  More on this below.
> >>>
>  The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
>  __ticket_spin_unlock(),
>  thus __ticket_spin_unlock() will not change the higher bits of 
>  lock->tickets.head.
> 
>  B) tqhp->head is for the real value of lock->tickets.head.
>  if the last bit of tqhp->head is 1, it means it is the head ticket when 
>  started queuing.
> >>>
> >>> But don't you also need the xadd() in __ticket_spin_unlock() to become
> >>> a cmpxchg() for this to work?  Or is your patch missing your changes to
> >>> arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
> >>> the no-contention overhead, which might be counterproductive.  Wouldn't
> >>> hurt to get measurements, though.
> >>
> >> No, don't need to change __ticket_spin_unlock() in my idea.
> >> bit1 in the  tickets.head is reserved for __ticket_spin_unlock(),
> >> __ticket_spin_unlock() only changes the bit1, it will not change
> >> the higher bits. tkt_q_do_wake() will restore the tickets.head.
> >>
> >> This approach avoids cmpxchg in __ticket_spin_unlock().
> > 
> > Ah, I did miss that.  But doesn't the adjustment in __ticket_spin_lock()
> > need to be atomic in order to handle concurrent invocations of
> > __ticket_spin_lock()?
> 
> I don't understand: are we discussing only __ticket_spin_unlock() here?
> Or does my suggestion hurt __ticket_spin_lock()?

On many architectures, it is harmless.  But my concern is that
__ticket_spin_lock() is atomically incrementing the full value
(both head and tail), but in such a way as to never change the
value of head.  So in theory, a concurrent non-atomic store to
head should be OK, but it does make me quite nervous.

At the very least, it needs a comment saying why it is safe.

Thanx, Paul
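
To make the concern concrete, a stand-alone C11 sketch (not the kernel's actual spinlock.h) of the interaction being discussed: the lock path performs an atomic add on the whole head/tail word, while the queue-wake path stores head non-atomically and relies on that add never changing the head bits:

#include <stdatomic.h>
#include <stdint.h>

/* Simplified x86-style ticket pair: low 16 bits = head, high 16 = tail. */
union tkt {
        _Atomic uint32_t head_tail;
        struct {
                uint16_t head;
                uint16_t tail;
        } t;
};

/* Lock side: one atomic add on the *whole* word.  The increment only
 * touches the tail half, so the head half is read and written back
 * unchanged as part of the same read-modify-write. */
static inline uint16_t tkt_take_ticket(union tkt *lock)
{
        uint32_t old = atomic_fetch_add_explicit(&lock->head_tail, 1u << 16,
                                                 memory_order_acquire);
        return (uint16_t)(old >> 16);   /* my ticket is the old tail */
}

/* Wake side (the part needing the safety comment): a plain store to the
 * head half, concurrent with the read-modify-write above.  Whether such
 * a mixed access can ever lose the store depends on how the architecture
 * implements the atomic add, which is exactly the point of the concern. */
static inline void tkt_q_restore_head(union tkt *lock, uint16_t new_head)
{
        lock->t.head = new_head;        /* not atomic w.r.t. tkt_take_ticket() */
}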

> > Either way, it would be good to 

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-14 Thread Paul E. McKenney
On Fri, Jun 14, 2013 at 03:12:43PM +0800, Lai Jiangshan wrote:
> On 06/14/2013 07:57 AM, Paul E. McKenney wrote:
> > On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote:
> >> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
> >>  wrote:
> >>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
>  On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> > Breaking up locks is better than implementing high-contention locks, but
> > if we must have high-contention locks, why not make them automatically
> > switch between light-weight ticket locks at low contention and queued
> > locks at high contention?  After all, this would remove the need for
> > the developer to predict which locks will be highly contended.
> >
> > This commit allows ticket locks to automatically switch between pure
> > ticketlock and queued-lock operation as needed.  If too many CPUs are
> > spinning on a given ticket lock, a queue structure will be allocated
> > and the lock will switch to queued-lock operation.  When the lock 
> > becomes
> > free, it will switch back into ticketlock operation.  The low-order bit
> > of the head counter is used to indicate that the lock is in queued mode,
> > which forces an unconditional mismatch between the head and tail 
> > counters.
> > This approach means that the common-case code path under conditions of
> > low contention is very nearly that of a plain ticket lock.
> >
> > A fixed number of queueing structures is statically allocated in an
> > array.  The ticket-lock address is used to hash into an initial element,
> > but if that element is already in use, it moves to the next element.  If
> > the entire array is already in use, continue to spin in ticket mode.
> >
> > Signed-off-by: Paul E. McKenney 
> > [ paulmck: Eliminate duplicate code and update comments (Steven 
> > Rostedt). ]
> > [ paulmck: Address Eric Dumazet review feedback. ]
> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> > [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen 
> > Persvold). ]
> > [ paulmck: Type safety fixes (Steven Rostedt). ]
> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
> 
> 
>  Hi, Paul,
> 
>  I simplify the code and remove the search by encoding the index of 
>  struct tkt_q_head
>  into lock->tickets.head.
> 
>  A) lock->tickets.head(when queued-lock):
>  ---------------------------------
>   index of struct tkt_q_head |0|1|
>  ---------------------------------
> >>>
> >>> Interesting approach!  It might reduce queued-mode overhead a bit in
> >>> some cases, though I bet that in the common case the first queue element
> >>> examined is the right one.  More on this below.
> >>>
>  The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
>  __ticket_spin_unlock(),
>  thus __ticket_spin_unlock() will not change the higher bits of 
>  lock->tickets.head.
> 
>  B) tqhp->head is for the real value of lock->tickets.head.
>  if the last bit of tqhp->head is 1, it means it is the head ticket when 
>  started queuing.
> >>>
> >>> But don't you also need the xadd() in __ticket_spin_unlock() to become
> >>> a cmpxchg() for this to work?  Or is your patch missing your changes to
> >>> arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
> >>> the no-contention overhead, which might be counterproductive.  Wouldn't
> >>> hurt to get measurements, though.
> >>
> >> No, don't need to change __ticket_spin_unlock() in my idea.
> >> bit1 in the  tickets.head is reserved for __ticket_spin_unlock(),
> >> __ticket_spin_unlock() only changes the bit1, it will not change
> >> the higher bits. tkt_q_do_wake() will restore the tickets.head.
> >>
> >> This approach avoids cmpxchg in __ticket_spin_unlock().
> > 
> > Ah, I did miss that.  But doesn't the adjustment in __ticket_spin_lock()
> > need to be atomic in order to handle concurrent invocations of
> > __ticket_spin_lock()?
> > 
> > Either way, it would be good to see the performance effects of this.
> > 
> > Thanx, Paul
> > 
> >>> Given the results that Davidlohr posted, I believe that the following
> >>> optimizations would also provide some improvement:
> >>>
> >>> 1.  Move the call to tkt_spin_pass() from __ticket_spin_lock()
> >>> to a separate linker section in order to reduce the icache
> >>> penalty exacted by the spinloop.  This is likely to be causing
> >>> some of the performance reductions in the cases where ticket
> >>> 

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-14 Thread Lai Jiangshan
On 06/14/2013 07:57 AM, Paul E. McKenney wrote:
> On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote:
>> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
>>  wrote:
>>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
 On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?  After all, this would remove the need for
> the developer to predict which locks will be highly contended.
>
> This commit allows ticket locks to automatically switch between pure
> ticketlock and queued-lock operation as needed.  If too many CPUs are
> spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
>
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
>
> Signed-off-by: Paul E. McKenney 
> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). 
> ]
> [ paulmck: Address Eric Dumazet review feedback. ]
> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). 
> ]
> [ paulmck: Type safety fixes (Steven Rostedt). ]
> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]


 Hi, Paul,

 I simplify the code and remove the search by encoding the index of struct 
 tkt_q_head
 into lock->tickets.head.

 A) lock->tickets.head(when queued-lock):
 ---------------------------------
  index of struct tkt_q_head |0|1|
 ---------------------------------
>>>
>>> Interesting approach!  It might reduce queued-mode overhead a bit in
>>> some cases, though I bet that in the common case the first queue element
>>> examined is the right one.  More on this below.
>>>
 The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
 __ticket_spin_unlock(),
 thus __ticket_spin_unlock() will not change the higher bits of 
 lock->tickets.head.

 B) tqhp->head is for the real value of lock->tickets.head.
 if the last bit of tqhp->head is 1, it means it is the head ticket when 
 started queuing.
>>>
>>> But don't you also need the xadd() in __ticket_spin_unlock() to become
>>> a cmpxchg() for this to work?  Or is your patch missing your changes to
>>> arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
>>> the no-contention overhead, which might be counterproductive.  Wouldn't
>>> hurt to get measurements, though.
>>
>> No, don't need to change __ticket_spin_unlock() in my idea.
>> bit1 in the  tickets.head is reserved for __ticket_spin_unlock(),
>> __ticket_spin_unlock() only changes the bit1, it will not change
>> the higher bits. tkt_q_do_wake() will restore the tickets.head.
>>
>> This approach avoids cmpxchg in __ticket_spin_unlock().
> 
> Ah, I did miss that.  But doesn't the adjustment in __ticket_spin_lock()
> need to be atomic in order to handle concurrent invocations of
> __ticket_spin_lock()?
> 
> Either way, it would be good to see the performance effects of this.
> 
>   Thanx, Paul
> 
>>> Given the results that Davidlohr posted, I believe that the following
>>> optimizations would also provide some improvement:
>>>
>>> 1.  Move the call to tkt_spin_pass() from __ticket_spin_lock()
>>> to a separate linker section in order to reduce the icache
>>> penalty exacted by the spinloop.  This is likely to be causing
>>> some of the performance reductions in the cases where ticket
>>> locks are not highly contended.
>>>
>>> 2.  Limit the number of elements searched for in the array of
>>> queues.  However, this would help only if a number of ticket
>>> locks were in queued mode at the same time.
>>>
>>> 3.  Dynamically 


Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-13 Thread Lai Jiangshan
On 06/14/2013 07:57 AM, Paul E. McKenney wrote:
> On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote:
>> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
>>  wrote:
>>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
 On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?  After all, this would remove the need for
> the developer to predict which locks will be highly contended.
>
> This commit allows ticket locks to automatically switch between pure
> ticketlock and queued-lock operation as needed.  If too many CPUs are
> spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
>
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
>
> Signed-off-by: Paul E. McKenney 
> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). 
> ]
> [ paulmck: Address Eric Dumazet review feedback. ]
> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). 
> ]
> [ paulmck: Type safety fixes (Steven Rostedt). ]
> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]


 Hi, Paul,

 I simplify the code and remove the search by encoding the index of struct 
 tkt_q_head
 into lock->tickets.head.

 A) lock->tickets.head(when queued-lock):
 ---------------------------------
  index of struct tkt_q_head |0|1|
 ---------------------------------
>>>
>>> Interesting approach!  It might reduce queued-mode overhead a bit in
>>> some cases, though I bet that in the common case the first queue element
>>> examined is the right one.  More on this below.
>>>
 The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
 __ticket_spin_unlock(),
 thus __ticket_spin_unlock() will not change the higher bits of 
 lock->tickets.head.

 B) tqhp->head is for the real value of lock->tickets.head.
 if the last bit of tqhp->head is 1, it means it is the head ticket when 
 started queuing.
>>>
>>> But don't you also need the xadd() in __ticket_spin_unlock() to become
>>> a cmpxchg() for this to work?  Or is your patch missing your changes to
>>> arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
>>> the no-contention overhead, which might be counterproductive.  Wouldn't
>>> hurt to get measurements, though.
>>
>> No, don't need to change __ticket_spin_unlock() in my idea.
>> bit1 in the  tickets.head is reserved for __ticket_spin_unlock(),
>> __ticket_spin_unlock() only changes the bit1, it will not change
>> the higher bits. tkt_q_do_wake() will restore the tickets.head.
>>
>> This approach avoids cmpxchg in __ticket_spin_unlock().
> 
> Ah, I did miss that.  But doesn't the adjustment in __ticket_spin_lock()
> need to be atomic in order to handle concurrent invocations of
> __ticket_spin_lock()?

I don't understand: are we discussing only __ticket_spin_unlock() here?
Or does my suggestion hurt __ticket_spin_lock()?

> 
> Either way, it would be good to see the performance effects of this.
> 
>   Thanx, Paul


Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-13 Thread Paul E. McKenney
On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote:
> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
>  wrote:
> > On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
> >> On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> >> > Breaking up locks is better than implementing high-contention locks, but
> >> > if we must have high-contention locks, why not make them automatically
> >> > switch between light-weight ticket locks at low contention and queued
> >> > locks at high contention?  After all, this would remove the need for
> >> > the developer to predict which locks will be highly contended.
> >> >
> >> > This commit allows ticket locks to automatically switch between pure
> >> > ticketlock and queued-lock operation as needed.  If too many CPUs are
> >> > spinning on a given ticket lock, a queue structure will be allocated
> >> > and the lock will switch to queued-lock operation.  When the lock becomes
> >> > free, it will switch back into ticketlock operation.  The low-order bit
> >> > of the head counter is used to indicate that the lock is in queued mode,
> >> > which forces an unconditional mismatch between the head and tail 
> >> > counters.
> >> > This approach means that the common-case code path under conditions of
> >> > low contention is very nearly that of a plain ticket lock.
> >> >
> >> > A fixed number of queueing structures is statically allocated in an
> >> > array.  The ticket-lock address is used to hash into an initial element,
> >> > but if that element is already in use, it moves to the next element.  If
> >> > the entire array is already in use, continue to spin in ticket mode.
> >> >
> >> > Signed-off-by: Paul E. McKenney 
> >> > [ paulmck: Eliminate duplicate code and update comments (Steven 
> >> > Rostedt). ]
> >> > [ paulmck: Address Eric Dumazet review feedback. ]
> >> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> >> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> >> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> >> > [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> >> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen 
> >> > Persvold). ]
> >> > [ paulmck: Type safety fixes (Steven Rostedt). ]
> >> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> >> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
> >>
> >>
> >> Hi, Paul,
> >>
> >> I simplify the code and remove the search by encoding the index of struct 
> >> tkt_q_head
> >> into lock->tickets.head.
> >>
> >> A) lock->tickets.head(when queued-lock):
> >> ---------------------------------
> >>  index of struct tkt_q_head |0|1|
> >> ---------------------------------
> >
> > Interesting approach!  It might reduce queued-mode overhead a bit in
> > some cases, though I bet that in the common case the first queue element
> > examined is the right one.  More on this below.
> >
> >> The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
> >> __ticket_spin_unlock(),
> >> thus __ticket_spin_unlock() will not change the higher bits of 
> >> lock->tickets.head.
> >>
> >> B) tqhp->head is for the real value of lock->tickets.head.
> >> if the last bit of tqhp->head is 1, it means it is the head ticket when 
> >> started queuing.
> >
> > But don't you also need the xadd() in __ticket_spin_unlock() to become
> > a cmpxchg() for this to work?  Or is your patch missing your changes to
> > arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
> > the no-contention overhead, which might be counterproductive.  Wouldn't
> > hurt to get measurements, though.
> 
> No, don't need to change __ticket_spin_unlock() in my idea.
> bit1 in the  tickets.head is reserved for __ticket_spin_unlock(),
> __ticket_spin_unlock() only changes the bit1, it will not change
> the higher bits. tkt_q_do_wake() will restore the tickets.head.
> 
> This approach avoids cmpxchg in __ticket_spin_unlock().

Ah, I did miss that.  But doesn't the adjustment in __ticket_spin_lock()
need to be atomic in order to handle concurrent invocations of
__ticket_spin_lock()?

Either way, it would be good to see the performance effects of this.

Thanx, Paul

> > Given the results that Davidlohr posted, I believe that the following
> > optimizations would also provide some improvement:
> >
> > 1.  Move the call to tkt_spin_pass() from __ticket_spin_lock()
> > to a separate linker section in order to reduce the icache
> > penalty exacted by the spinloop.  This is likely to be causing
> > some of the performance reductions in the cases where ticket
> > locks are not highly contended.
> >
> > 2.  Limit the number of elements searched for in the array of
> > queues.  However, this would help only if a number of ticket
> > locks were in queued mode at the same time.
> >
> > 3.  Dynamically allocate the queue array at boot.  

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-13 Thread Lai Jiangshan
On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
 wrote:
> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
>> On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
>> > Breaking up locks is better than implementing high-contention locks, but
>> > if we must have high-contention locks, why not make them automatically
>> > switch between light-weight ticket locks at low contention and queued
>> > locks at high contention?  After all, this would remove the need for
>> > the developer to predict which locks will be highly contended.
>> >
>> > This commit allows ticket locks to automatically switch between pure
>> > ticketlock and queued-lock operation as needed.  If too many CPUs are
>> > spinning on a given ticket lock, a queue structure will be allocated
>> > and the lock will switch to queued-lock operation.  When the lock becomes
>> > free, it will switch back into ticketlock operation.  The low-order bit
>> > of the head counter is used to indicate that the lock is in queued mode,
>> > which forces an unconditional mismatch between the head and tail counters.
>> > This approach means that the common-case code path under conditions of
>> > low contention is very nearly that of a plain ticket lock.
>> >
>> > A fixed number of queueing structures is statically allocated in an
>> > array.  The ticket-lock address is used to hash into an initial element,
>> > but if that element is already in use, it moves to the next element.  If
>> > the entire array is already in use, continue to spin in ticket mode.
>> >
>> > Signed-off-by: Paul E. McKenney 
>> > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
>> > [ paulmck: Address Eric Dumazet review feedback. ]
>> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
>> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
>> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
>> > [ paulmck: Reduce queue-switch contention (Waiman Long). ]
>> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
>> > [ paulmck: Type safety fixes (Steven Rostedt). ]
>> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
>> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
>>
>>
>> Hi, Paul,
>>
>> I simplify the code and remove the search by encoding the index of struct 
>> tkt_q_head
>> into lock->tickets.head.
>>
>> A) lock->tickets.head(when queued-lock):
>> ---------------------------------
>>  index of struct tkt_q_head |0|1|
>> ---------------------------------
>
> Interesting approach!  It might reduce queued-mode overhead a bit in
> some cases, though I bet that in the common case the first queue element
> examined is the right one.  More on this below.
>
>> The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
>> __ticket_spin_unlock(),
>> thus __ticket_spin_unlock() will not change the higher bits of 
>> lock->tickets.head.
>>
>> B) tqhp->head is for the real value of lock->tickets.head.
>> if the last bit of tqhp->head is 1, it means it is the head ticket when 
>> started queuing.
>
> But don't you also need the xadd() in __ticket_spin_unlock() to become
> a cmpxchg() for this to work?  Or is your patch missing your changes to
> arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
> the no-contention overhead, which might be counterproductive.  Wouldn't
> hurt to get measurements, though.

No, don't need to change __ticket_spin_unlock() in my idea.
bit1 in the  tickets.head is reserved for __ticket_spin_unlock(),
__ticket_spin_unlock() only changes the bit1, it will not change
the higher bits. tkt_q_do_wake() will restore the tickets.head.

This approach avoids cmpxchg in __ticket_spin_unlock().
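
To make the layout concrete, here is a small standalone C model of the encoding Lai describes (the helper names and the 16-bit type are invented for this sketch; it is not code from the patch). It shows why an unlock-time add of 2 can only flip bit1 and therefore leaves both the queued flag and the queue index untouched:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint16_t ticket_t;                      /* stand-in for __ticket_t */

/* Queued-mode head layout:  | queue index ... | bit1 | bit0 |
 *   bit0 = 1: lock is in queued mode
 *   bit1 = 0: scratch bit that the unlock-side "head += 2" flips;
 *             tkt_q_do_wake() subtracts 2 again, so the index bits
 *             above it are never disturbed.
 */
static ticket_t qhead_encode(unsigned int index)
{
        return (ticket_t)((index << 2) | 0x1);  /* bit1 clear, bit0 set */
}

static unsigned int qhead_index(ticket_t head)
{
        return head >> 2;                       /* what tkt_q_find_head() would use */
}

int main(void)
{
        ticket_t head = qhead_encode(7);        /* pretend queue slot 7 */

        head += 2;                              /* the unlock-side xadd */
        assert(qhead_index(head) == 7);         /* index bits untouched */
        assert(head & 0x1);                     /* still marked queued */
        head -= 2;                              /* tkt_q_do_wake() restores bit1 */
        printf("index %u, low bits 0x%x\n", qhead_index(head), head & 0x3);
        return 0;
}

This is presumably also why tkt_q_do_wake() subtracts __TKT_SPIN_INC in the patch later in the thread, and why its BUILD_BUG_ON bounds TKT_Q_NQUEUES so the index fits above the two low flag bits.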

>
> Given the results that Davidlohr posted, I believe that the following
> optimizations would also provide some improvement:
>
> 1.  Move the call to tkt_spin_pass() from __ticket_spin_lock()
> to a separate linker section in order to reduce the icache
> penalty exacted by the spinloop.  This is likely to be causing
> some of the performance reductions in the cases where ticket
> locks are not highly contended.
>
> 2.  Limit the number of elements searched for in the array of
> queues.  However, this would help only if a number of ticket
> locks were in queued mode at the same time.
>
> 3.  Dynamically allocate the queue array at boot.  This might
> also reduce cache pressure, again, at least in cases where
> there are a number of ticket locks in queued mode at the
> same time.
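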
>
> Frederic just reminded me that I owe him some energy-efficiency improvements
> for adaptive ticks, so I won't get to these very quickly.  Please feel free
> to take these on -- the patch clearly does well under high contention, so
> reducing the no-contention penalty could really help.
>
> Thanx, Paul
>
>> Thanks,
>> Lai
>>
>>  

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-13 Thread Paul E. McKenney
On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
> On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> > Breaking up locks is better than implementing high-contention locks, but
> > if we must have high-contention locks, why not make them automatically
> > switch between light-weight ticket locks at low contention and queued
> > locks at high contention?  After all, this would remove the need for
> > the developer to predict which locks will be highly contended.
> > 
> > This commit allows ticket locks to automatically switch between pure
> > ticketlock and queued-lock operation as needed.  If too many CPUs are
> > spinning on a given ticket lock, a queue structure will be allocated
> > and the lock will switch to queued-lock operation.  When the lock becomes
> > free, it will switch back into ticketlock operation.  The low-order bit
> > of the head counter is used to indicate that the lock is in queued mode,
> > which forces an unconditional mismatch between the head and tail counters.
> > This approach means that the common-case code path under conditions of
> > low contention is very nearly that of a plain ticket lock.
> > 
> > A fixed number of queueing structures is statically allocated in an
> > array.  The ticket-lock address is used to hash into an initial element,
> > but if that element is already in use, it moves to the next element.  If
> > the entire array is already in use, continue to spin in ticket mode.
> > 
> > Signed-off-by: Paul E. McKenney 
> > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> > [ paulmck: Address Eric Dumazet review feedback. ]
> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> > [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
> > [ paulmck: Type safety fixes (Steven Rostedt). ]
> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
> 
> 
> Hi, Paul,
> 
> I simplify the code and remove the search by encoding the index of struct 
> tkt_q_head
> into lock->tickets.head.
> 
> A) lock->tickets.head(when queued-lock):
> -
>  index of struct tkt_q_head |0|1|
> -

Interesting approach!  It might reduce queued-mode overhead a bit in
some cases, though I bet that in the common case the first queue element
examined is the right one.  More on this below.

> The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
> __ticket_spin_unlock(),
> thus __ticket_spin_unlock() will not change the higher bits of 
> lock->tickets.head.
> 
> B) tqhp->head is for the real value of lock->tickets.head.
> if the last bit of tqhp->head is 1, it means it is the head ticket when 
> started queuing.

But don't you also need the xadd() in __ticket_spin_unlock() to become
a cmpxchg() for this to work?  Or is your patch missing your changes to
arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
the no-contention overhead, which might be counterproductive.  Wouldn't
hurt to get measurements, though.
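
For a sense of where the extra no-contention cost would come from, here is a sketch in C11 atomics on a toy 16-bit counter (illustrative only, not the kernel code): the xadd-style release is a single unconditional atomic add, while a cmpxchg-style release needs a load plus a compare-and-swap that can retry.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint16_t head;

static void unlock_xadd_style(void)
{
        /* one unconditional RMW, like xadd(&lock->tickets.head, 2) */
        atomic_fetch_add_explicit(&head, 2, memory_order_release);
}

static void unlock_cmpxchg_style(void)
{
        uint16_t old = atomic_load_explicit(&head, memory_order_relaxed);

        /* read-modify-write that may have to retry */
        while (!atomic_compare_exchange_weak_explicit(&head, &old,
                                                      (uint16_t)(old + 2),
                                                      memory_order_release,
                                                      memory_order_relaxed))
                ;
}

int main(void)
{
        unlock_xadd_style();
        unlock_cmpxchg_style();
        printf("head is now %u\n", (unsigned)atomic_load(&head));
        return 0;
}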

Given the results that Davidlohr posted, I believe that the following
optimizations would also provide some improvement:

1.  Move the call to tkt_spin_pass() from __ticket_spin_lock()
to a separate linker section in order to reduce the icache
penalty exacted by the spinloop.  This is likely to be causing
some of the performance reductions in the cases where ticket
locks are not highly contended.

2.  Limit the number of elements searched for in the array of
queues.  However, this would help only if a number of ticket
locks were in queued mode at the same time.

3.  Dynamically allocate the queue array at boot.  This might
also reduce cache pressure, again, at least in cases where
there are a number of ticket locks in queued mode at the
same time.
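
As an illustration of item 1 above: the usual way to keep a rarely taken slow path out of the spinloop's icache footprint is to make it a noinline function in its own text section (roughly what the kernel's noinline and __cold annotations arrange). The toy program below is only a sketch with invented names, not a proposed patch.

#include <stdbool.h>
#include <stdio.h>

static volatile int lock_head;          /* toy stand-in, not a real ticket lock */

/* Slow path lives out of line, in its own text section. */
__attribute__((noinline, section(".text.unlikely.demo")))
static bool queue_and_wait(int my_ticket)
{
        fprintf(stderr, "slow path for ticket %d\n", my_ticket);
        return lock_head == my_ticket;
}

static void toy_lock(int my_ticket)
{
        for (;;) {
                if (lock_head == my_ticket)     /* compact fast path */
                        break;
                if (queue_and_wait(my_ticket))  /* cold, out-of-line slow path */
                        break;
        }
}

int main(void)
{
        toy_lock(0);
        puts("acquired on the fast path; the slow path sits in its own section");
        return 0;
}

Building with something like "gcc -O2 demo.c" and disassembling should show queue_and_wait() emitted in the separate section while the loop in toy_lock() stays a handful of instructions.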

Frederic just reminded me that I owe him some energy-efficiency improvements
for adaptive ticks, so I won't get to these very quickly.  Please feel free
to take these on -- the patch clearly does well under high contention, so
reducing the no-contention penalty could really help.

Thanx, Paul

> Thanks,
> Lai
> 
>  kernel/tktqlock.c |   51 +--
>  1 files changed, 13 insertions(+), 38 deletions(-)
> 
> diff --git a/kernel/tktqlock.c b/kernel/tktqlock.c
> index 912817c..1329d0f 100644
> --- a/kernel/tktqlock.c
> +++ b/kernel/tktqlock.c
> @@ -33,7 +33,7 @@ struct tkt_q {
> 
>  struct tkt_q_head {
>   arch_spinlock_t *ref;   /* Pointer to spinlock if in use. */
> - s64 head_tkt;   /* Head ticket when started queuing. */

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-13 Thread Paul E. McKenney
On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote:
> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
> paul...@linux.vnet.ibm.com wrote:
> > On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
> > > On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> > > > Breaking up locks is better than implementing high-contention locks, but
> > > > if we must have high-contention locks, why not make them automatically
> > > > switch between light-weight ticket locks at low contention and queued
> > > > locks at high contention?  After all, this would remove the need for
> > > > the developer to predict which locks will be highly contended.
> > > >
> > > > This commit allows ticket locks to automatically switch between pure
> > > > ticketlock and queued-lock operation as needed.  If too many CPUs are
> > > > spinning on a given ticket lock, a queue structure will be allocated
> > > > and the lock will switch to queued-lock operation.  When the lock becomes
> > > > free, it will switch back into ticketlock operation.  The low-order bit
> > > > of the head counter is used to indicate that the lock is in queued mode,
> > > > which forces an unconditional mismatch between the head and tail counters.
> > > > This approach means that the common-case code path under conditions of
> > > > low contention is very nearly that of a plain ticket lock.
> > > >
> > > > A fixed number of queueing structures is statically allocated in an
> > > > array.  The ticket-lock address is used to hash into an initial element,
> > > > but if that element is already in use, it moves to the next element.  If
> > > > the entire array is already in use, continue to spin in ticket mode.
> > > >
> > > > Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
> > > > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> > > > [ paulmck: Address Eric Dumazet review feedback. ]
> > > > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> > > > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> > > > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> > > > [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> > > > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
> > > > [ paulmck: Type safety fixes (Steven Rostedt). ]
> > > > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> > > > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
> > >
> > >
> > > Hi, Paul,
> > >
> > > I simplify the code and remove the search by encoding the index of struct
> > > tkt_q_head
> > > into lock->tickets.head.
> > >
> > > A) lock->tickets.head(when queued-lock):
> > > -
> > >  index of struct tkt_q_head |0|1|
> > > -
> >
> > Interesting approach!  It might reduce queued-mode overhead a bit in
> > some cases, though I bet that in the common case the first queue element
> > examined is the right one.  More on this below.
> >
> > > The bit0 = 1 for queued, and the bit1 = 0 is reserved for
> > > __ticket_spin_unlock(),
> > > thus __ticket_spin_unlock() will not change the higher bits of
> > > lock->tickets.head.
> > >
> > > B) tqhp->head is for the real value of lock->tickets.head.
> > > if the last bit of tqhp->head is 1, it means it is the head ticket when
> > > started queuing.
> >
> > But don't you also need the xadd() in __ticket_spin_unlock() to become
> > a cmpxchg() for this to work?  Or is your patch missing your changes to
> > arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
> > the no-contention overhead, which might be counterproductive.  Wouldn't
> > hurt to get measurements, though.
>
> No, don't need to change __ticket_spin_unlock() in my idea.
> bit1 in the tickets.head is reserved for __ticket_spin_unlock(),
> __ticket_spin_unlock() only changes the bit1, it will not change
> the higher bits. tkt_q_do_wake() will restore the tickets.head.
>
> This approach avoids cmpxchg in __ticket_spin_unlock().

Ah, I did miss that.  But doesn't the adjustment in __ticket_spin_lock()
need to be atomic in order to handle concurrent invocations of
__ticket_spin_lock()?

Either way, it would be good to see the performance effects of this.

Thanx, Paul

> > Given the results that Davidlohr posted, I believe that the following
> > optimizations would also provide some improvement:
> >
> > 1.  Move the call to tkt_spin_pass() from __ticket_spin_lock()
> > to a separate linker section in order to reduce the icache
> > penalty exacted by the spinloop.  This is likely to be causing
> > some of the performance reductions in the cases where ticket
> > locks are not highly contended.
> >
> > 2.  Limit the number of elements searched for in the array of
> > queues.  However, this would help only if a number of ticket
> > locks were in queued mode at the same time.
> >
> > 3.  Dynamically allocate the queue array at boot.  This might
> > also reduce cache pressure, again, at least in cases where
> > there are a number of ticket locks in queued mode at the
> > same time.
> >
> > Frederic just reminded me that I owe him some

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-13 Thread Lai Jiangshan
On 06/14/2013 07:57 AM, Paul E. McKenney wrote:
> On Fri, Jun 14, 2013 at 07:25:57AM +0800, Lai Jiangshan wrote:
>> On Thu, Jun 13, 2013 at 11:22 PM, Paul E. McKenney
>> paul...@linux.vnet.ibm.com wrote:
>>> On Thu, Jun 13, 2013 at 10:55:41AM +0800, Lai Jiangshan wrote:
>>>> On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
>>>>> Breaking up locks is better than implementing high-contention locks, but
>>>>> if we must have high-contention locks, why not make them automatically
>>>>> switch between light-weight ticket locks at low contention and queued
>>>>> locks at high contention?  After all, this would remove the need for
>>>>> the developer to predict which locks will be highly contended.
>>>>>
>>>>> This commit allows ticket locks to automatically switch between pure
>>>>> ticketlock and queued-lock operation as needed.  If too many CPUs are
>>>>> spinning on a given ticket lock, a queue structure will be allocated
>>>>> and the lock will switch to queued-lock operation.  When the lock becomes
>>>>> free, it will switch back into ticketlock operation.  The low-order bit
>>>>> of the head counter is used to indicate that the lock is in queued mode,
>>>>> which forces an unconditional mismatch between the head and tail counters.
>>>>> This approach means that the common-case code path under conditions of
>>>>> low contention is very nearly that of a plain ticket lock.
>>>>>
>>>>> A fixed number of queueing structures is statically allocated in an
>>>>> array.  The ticket-lock address is used to hash into an initial element,
>>>>> but if that element is already in use, it moves to the next element.  If
>>>>> the entire array is already in use, continue to spin in ticket mode.
>>>>>
>>>>> Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com
>>>>> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
>>>>> [ paulmck: Address Eric Dumazet review feedback. ]
>>>>> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
>>>>> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
>>>>> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
>>>>> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
>>>>> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
>>>>> [ paulmck: Type safety fixes (Steven Rostedt). ]
>>>>> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
>>>>> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
>>>>
>>>>
>>>> Hi, Paul,
>>>>
>>>> I simplify the code and remove the search by encoding the index of struct
>>>> tkt_q_head
>>>> into lock->tickets.head.
>>>>
>>>> A) lock->tickets.head(when queued-lock):
>>>> -
>>>>  index of struct tkt_q_head |0|1|
>>>> -
>>>
>>> Interesting approach!  It might reduce queued-mode overhead a bit in
>>> some cases, though I bet that in the common case the first queue element
>>> examined is the right one.  More on this below.
>>>
>>>> The bit0 = 1 for queued, and the bit1 = 0 is reserved for
>>>> __ticket_spin_unlock(),
>>>> thus __ticket_spin_unlock() will not change the higher bits of
>>>> lock->tickets.head.
>>>>
>>>> B) tqhp->head is for the real value of lock->tickets.head.
>>>> if the last bit of tqhp->head is 1, it means it is the head ticket when
>>>> started queuing.
>>>
>>> But don't you also need the xadd() in __ticket_spin_unlock() to become
>>> a cmpxchg() for this to work?  Or is your patch missing your changes to
>>> arch/x86/include/asm/spinlock.h?  Either way, this is likely to increase
>>> the no-contention overhead, which might be counterproductive.  Wouldn't
>>> hurt to get measurements, though.
>>
>> No, don't need to change __ticket_spin_unlock() in my idea.
>> bit1 in the tickets.head is reserved for __ticket_spin_unlock(),
>> __ticket_spin_unlock() only changes the bit1, it will not change
>> the higher bits. tkt_q_do_wake() will restore the tickets.head.
>>
>> This approach avoids cmpxchg in __ticket_spin_unlock().
>
> Ah, I did miss that.  But doesn't the adjustment in __ticket_spin_lock()
> need to be atomic in order to handle concurrent invocations of
> __ticket_spin_lock()?

I don't understand, do we just discuss about __ticket_spin_unlock() only?
Or does my suggestion hurt __ticket_spin_lock()?

>
> Either way, it would be good to see the performance effects of this.
>
>   Thanx, Paul


Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-12 Thread Lai Jiangshan
On 06/12/2013 11:40 PM, Paul E. McKenney wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?  After all, this would remove the need for
> the developer to predict which locks will be highly contended.
> 
> This commit allows ticket locks to automatically switch between pure
> ticketlock and queued-lock operation as needed.  If too many CPUs are
> spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
> 
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
> 
> Signed-off-by: Paul E. McKenney 
> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> [ paulmck: Address Eric Dumazet review feedback. ]
> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
> [ paulmck: Type safety fixes (Steven Rostedt). ]
> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]


Hi, Paul,

I simplify the code and remove the search by encoding the index of struct 
tkt_q_head
into lock->tickets.head.

A) lock->tickets.head(when queued-lock):
-
 index of struct tkt_q_head |0|1|
-

The bit0 = 1 for queued, and the bit1 = 0 is reserved for 
__ticket_spin_unlock(),
thus __ticket_spin_unlock() will not change the higher bits of 
lock->tickets.head.

B) tqhp->head is for the real value of lock->tickets.head.
if the last bit of tqhp->head is 1, it means it is the head ticket when started 
queuing.

Thanks,
Lai

 kernel/tktqlock.c |   51 +--
 1 files changed, 13 insertions(+), 38 deletions(-)

diff --git a/kernel/tktqlock.c b/kernel/tktqlock.c
index 912817c..1329d0f 100644
--- a/kernel/tktqlock.c
+++ b/kernel/tktqlock.c
@@ -33,7 +33,7 @@ struct tkt_q {
 
 struct tkt_q_head {
arch_spinlock_t *ref;   /* Pointer to spinlock if in use. */
-   s64 head_tkt;   /* Head ticket when started queuing. */
+   __ticket_t head;/* Real head when queued. */
struct tkt_q *spin; /* Head of queue. */
struct tkt_q **spin_tail;   /* Tail of queue. */
 };
@@ -77,15 +77,8 @@ static unsigned long tkt_q_hash(arch_spinlock_t *lock)
  */
 static struct tkt_q_head *tkt_q_find_head(arch_spinlock_t *lock)
 {
-   int i;
-   int start;
-
-   start = i = tkt_q_hash(lock);
-   do
-   if (ACCESS_ONCE(tkt_q_heads[i].ref) == lock)
-   return &tkt_q_heads[i];
-   while ((i = tkt_q_next_slot(i)) != start);
-   return NULL;
+   BUILD_BUG_ON(TKT_Q_NQUEUES > (1 << (TICKET_SHIFT - 2)));
+   return &tkt_q_heads[ACCESS_ONCE(lock->tickets.head) >> 2];
 }
 
 /*
@@ -101,11 +94,11 @@ static bool tkt_q_try_unqueue(arch_spinlock_t *lock, 
struct tkt_q_head *tqhp)
 
/* Pick up the ticket values. */
asold = ACCESS_ONCE(*lock);
-   if ((asold.tickets.head & ~0x1) == asold.tickets.tail) {
+   if (tqhp->head == asold.tickets.tail) {
 
/* Attempt to mark the lock as not having a queue. */
asnew = asold;
-   asnew.tickets.head &= ~0x1;
+   asnew.tickets.head = tqhp->head;
if (cmpxchg(&lock->head_tail,
asold.head_tail,
asnew.head_tail) == asold.head_tail) {
@@ -128,12 +121,9 @@ void tkt_q_do_wake(arch_spinlock_t *lock)
struct tkt_q_head *tqhp;
struct tkt_q *tqp;
 
-   /*
-* If the queue is still being set up, wait for it.  Note that
-* the caller's xadd() provides the needed memory ordering.
-*/
-   while ((tqhp = tkt_q_find_head(lock)) == NULL)
-   cpu_relax();
+   tqhp = tkt_q_find_head(lock);
+   ACCESS_ONCE(lock->tickets.head) -= __TKT_SPIN_INC;
+   ACCESS_ONCE(tqhp->head) = (tqhp->head & 

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-12 Thread Paul E. McKenney
On Thu, Jun 13, 2013 at 12:13:47AM +0800, Lai Jiangshan wrote:
> On Wed, Jun 12, 2013 at 11:40 PM, Paul E. McKenney
>  wrote:
> > Breaking up locks is better than implementing high-contention locks, but
> > if we must have high-contention locks, why not make them automatically
> > switch between light-weight ticket locks at low contention and queued
> > locks at high contention?  After all, this would remove the need for
> > the developer to predict which locks will be highly contended.
> >
> > This commit allows ticket locks to automatically switch between pure
> > ticketlock and queued-lock operation as needed.  If too many CPUs are
> > spinning on a given ticket lock, a queue structure will be allocated
> > and the lock will switch to queued-lock operation.  When the lock becomes
> > free, it will switch back into ticketlock operation.  The low-order bit
> > of the head counter is used to indicate that the lock is in queued mode,
> > which forces an unconditional mismatch between the head and tail counters.
> > This approach means that the common-case code path under conditions of
> > low contention is very nearly that of a plain ticket lock.
> >
> > A fixed number of queueing structures is statically allocated in an
> > array.  The ticket-lock address is used to hash into an initial element,
> > but if that element is already in use, it moves to the next element.  If
> > the entire array is already in use, continue to spin in ticket mode.
> >
> > Signed-off-by: Paul E. McKenney 
> > [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> > [ paulmck: Address Eric Dumazet review feedback. ]
> > [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> > [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> > [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> > [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> > [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
> > [ paulmck: Type safety fixes (Steven Rostedt). ]
> > [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> > [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
> >
> > diff --git a/arch/x86/include/asm/spinlock.h 
> > b/arch/x86/include/asm/spinlock.h
> > index 33692ea..5aa0177 100644
> > --- a/arch/x86/include/asm/spinlock.h
> > +++ b/arch/x86/include/asm/spinlock.h
> > @@ -34,6 +34,21 @@
> >  # define UNLOCK_LOCK_PREFIX
> >  #endif
> >
> > +#ifdef CONFIG_TICKET_LOCK_QUEUED
> > +
> > +#define __TKT_SPIN_INC 2
> > +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> > +
> > +#else /* #ifdef CONFIG_TICKET_LOCK_QUEUED */
> > +
> > +#define __TKT_SPIN_INC 1
> > +static inline bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets 
> > inc)
> > +{
> > +   return false;
> > +}
> > +
> > +#endif /* #else #ifdef CONFIG_TICKET_LOCK_QUEUED */
> > +
> >  /*
> >   * Ticket locks are conceptually two parts, one indicating the current 
> > head of
> >   * the queue, and the other indicating the current tail. The lock is 
> > acquired
> > @@ -49,17 +64,16 @@
> >   */
> >  static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> >  {
> > -   register struct __raw_tickets inc = { .tail = 1 };
> > +   register struct __raw_tickets inc = { .tail = __TKT_SPIN_INC };
> >
> > inc = xadd(&lock->tickets, inc);
> > -
> > for (;;) {
> > -   if (inc.head == inc.tail)
> > +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> > break;
> > cpu_relax();
> > inc.head = ACCESS_ONCE(lock->tickets.head);
> > }
> > -   barrier();  /* make sure nothing creeps before the lock 
> > is taken */
> > +   barrier(); /* Make sure nothing creeps in before the lock is taken. 
> > */
> >  }
> >
> >  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
> > @@ -70,17 +84,33 @@ static __always_inline int 
> > __ticket_spin_trylock(arch_spinlock_t *lock)
> > if (old.tickets.head != old.tickets.tail)
> > return 0;
> >
> > -   new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> > +   new.head_tail = old.head_tail + (__TKT_SPIN_INC << TICKET_SHIFT);
> >
> > /* cmpxchg is a full barrier, so nothing can move before it */
> > return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) ==
> > old.head_tail;
> >  }
> >
> > +#ifndef CONFIG_TICKET_LOCK_QUEUED
> > +
> >  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> >  {
> > __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
> >  }
> >
> > +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> > +
> > +extern void tkt_q_do_wake(arch_spinlock_t *lock);
> > +
> > +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> > +{
> > +   __ticket_t head = 2;
> > +
> > +   head = xadd(&lock->tickets.head, head);
> > +   if (head & 0x1)
> > +   tkt_q_do_wake(lock);
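
As a cross-check on the hunk above (a standalone illustration, not kernel code): tickets advance by __TKT_SPIN_INC == 2, so the tail is always even; setting bit0 of the head marks queued mode and guarantees no spinner's ticket can ever equal the head; and because the unlock path adds 2, the flag survives until the old value is inspected and tkt_q_do_wake() is called.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint16_t head = 0, tail = 0;
        uint16_t my_ticket, old_head;

        /* CPU A acquires: xadd of 2 on the tail */
        my_ticket = tail;
        tail += 2;
        assert(my_ticket == head);              /* uncontended: lock taken */

        /* heavy contention detected: switch to queued mode by setting bit0 */
        head |= 0x1;
        assert(head % 2 == 1 && tail % 2 == 0); /* head can never match a ticket */

        /* CPU A releases: xadd of 2 on the head; the old value reveals the mode */
        old_head = head;
        head += 2;
        if (old_head & 0x1)
                puts("queued mode: the unlocker would call tkt_q_do_wake()");

        printf("head=%u tail=%u\n", head, tail);
        return 0;
}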

Re: [PATCH RFC ticketlock] v3 Auto-queued ticketlock

2013-06-12 Thread Lai Jiangshan
On Wed, Jun 12, 2013 at 11:40 PM, Paul E. McKenney
 wrote:
> Breaking up locks is better than implementing high-contention locks, but
> if we must have high-contention locks, why not make them automatically
> switch between light-weight ticket locks at low contention and queued
> locks at high contention?  After all, this would remove the need for
> the developer to predict which locks will be highly contended.
>
> This commit allows ticket locks to automatically switch between pure
> ticketlock and queued-lock operation as needed.  If too many CPUs are
> spinning on a given ticket lock, a queue structure will be allocated
> and the lock will switch to queued-lock operation.  When the lock becomes
> free, it will switch back into ticketlock operation.  The low-order bit
> of the head counter is used to indicate that the lock is in queued mode,
> which forces an unconditional mismatch between the head and tail counters.
> This approach means that the common-case code path under conditions of
> low contention is very nearly that of a plain ticket lock.
>
> A fixed number of queueing structures is statically allocated in an
> array.  The ticket-lock address is used to hash into an initial element,
> but if that element is already in use, it moves to the next element.  If
> the entire array is already in use, continue to spin in ticket mode.
>
> Signed-off-by: Paul E. McKenney 
> [ paulmck: Eliminate duplicate code and update comments (Steven Rostedt). ]
> [ paulmck: Address Eric Dumazet review feedback. ]
> [ paulmck: Use Lai Jiangshan idea to eliminate smp_mb(). ]
> [ paulmck: Expand ->head_tkt from s32 to s64 (Waiman Long). ]
> [ paulmck: Move cpu_relax() to main spin loop (Steven Rostedt). ]
> [ paulmck: Reduce queue-switch contention (Waiman Long). ]
> [ paulmck: __TKT_SPIN_INC for __ticket_spin_trylock() (Steffen Persvold). ]
> [ paulmck: Type safety fixes (Steven Rostedt). ]
> [ paulmck: Pre-check cmpxchg() value (Waiman Long). ]
> [ paulmck: smp_mb() downgrade to smp_wmb() (Lai Jiangshan). ]
>
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index 33692ea..5aa0177 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -34,6 +34,21 @@
>  # define UNLOCK_LOCK_PREFIX
>  #endif
>
> +#ifdef CONFIG_TICKET_LOCK_QUEUED
> +
> +#define __TKT_SPIN_INC 2
> +bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets inc);
> +
> +#else /* #ifdef CONFIG_TICKET_LOCK_QUEUED */
> +
> +#define __TKT_SPIN_INC 1
> +static inline bool tkt_spin_pass(arch_spinlock_t *ap, struct __raw_tickets 
> inc)
> +{
> +   return false;
> +}
> +
> +#endif /* #else #ifdef CONFIG_TICKET_LOCK_QUEUED */
> +
>  /*
>   * Ticket locks are conceptually two parts, one indicating the current head 
> of
>   * the queue, and the other indicating the current tail. The lock is acquired
> @@ -49,17 +64,16 @@
>   */
>  static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>  {
> -   register struct __raw_tickets inc = { .tail = 1 };
> +   register struct __raw_tickets inc = { .tail = __TKT_SPIN_INC };
>
> inc = xadd(&lock->tickets, inc);
> -
> for (;;) {
> -   if (inc.head == inc.tail)
> +   if (inc.head == inc.tail || tkt_spin_pass(lock, inc))
> break;
> cpu_relax();
> inc.head = ACCESS_ONCE(lock->tickets.head);
> }
> -   barrier();  /* make sure nothing creeps before the lock 
> is taken */
> +   barrier(); /* Make sure nothing creeps in before the lock is taken. */
>  }
>
>  static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)
> @@ -70,17 +84,33 @@ static __always_inline int 
> __ticket_spin_trylock(arch_spinlock_t *lock)
> if (old.tickets.head != old.tickets.tail)
> return 0;
>
> -   new.head_tail = old.head_tail + (1 << TICKET_SHIFT);
> +   new.head_tail = old.head_tail + (__TKT_SPIN_INC << TICKET_SHIFT);
>
> /* cmpxchg is a full barrier, so nothing can move before it */
> return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) ==
> old.head_tail;
>  }
>
> +#ifndef CONFIG_TICKET_LOCK_QUEUED
> +
>  static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
>  {
> __add(&lock->tickets.head, 1, UNLOCK_LOCK_PREFIX);
>  }
>
> +#else /* #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
> +extern void tkt_q_do_wake(arch_spinlock_t *lock);
> +
> +static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
> +{
> +   __ticket_t head = 2;
> +
> +   head = xadd(&lock->tickets.head, head);
> +   if (head & 0x1)
> +   tkt_q_do_wake(lock);
> +}
> +#endif /* #else #ifndef CONFIG_TICKET_LOCK_QUEUED */
> +
>  static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
>  {
> struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
> diff --git a/arch/x86/include/asm/spinlock_types.h 
> b/arch/x86/include/asm/spinlock_types.h
> index 
