Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-10 Thread Manfred Spraul

Hi Alan,

On 07/08/2017 06:21 PM, Alan Stern wrote:
> Pardon me for barging in, but I found this whole interchange extremely
> confusing...
> 
> On Sat, 8 Jul 2017, Ingo Molnar wrote:
> 
> > * Paul E. McKenney  wrote:
> > 
> > > On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Manfred Spraul  wrote:
> > > > 
> > > > > Hi Ingo,
> > > > > 
> > > > > On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> > > > > > 
> > > > > > There's another, probably just as significant advantage:
> > > > > > queued_spin_unlock_wait() is 'read-only', while
> > > > > > spin_lock()+spin_unlock() dirties the lock cache line. On any
> > > > > > bigger system this should make a very measurable difference - if
> > > > > > spin_unlock_wait() is ever used in a performance critical code path.
> > > > > At least for ipc/sem:
> > > > > Dirtying the cacheline (in the slow path) allows to remove a smp_mb()
> > > > > in the hot path.
> > > > > So for sem_lock(), I either need a primitive that dirties the
> > > > > cacheline or sem_lock() must continue to use
> > > > > spin_lock()/spin_unlock().
> 
> This statement doesn't seem to make sense.  Did Manfred mean to write
> "smp_mb()" instead of "spin_lock()/spin_unlock()"?

Option 1:
    fast path:
        spin_lock(&local_lock);
        smp_mb(); [[1]]
        smp_load_acquire(&global_flag);
    slow path:
        global_flag = 1;
        smp_mb();

Option 2:
    fast path:
        spin_lock(&local_lock);
        smp_load_acquire(&global_flag);
    slow path:
        global_flag = 1;
        spin_lock(&local_lock); spin_unlock(&local_lock);

Rationale:
The ACQUIRE from spin_lock() is at the read of local_lock, not at the write.
I.e.: without the smp_mb() at [[1]], the CPU can do:
    read local_lock;
    read global_flag;
    write local_lock;
For Option 2, the smp_mb() is not required, because fast path and slow
path acquire the same lock.
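
To make the two options concrete, here is a minimal sketch of the
sem_lock()-style pattern being discussed. All names (local_lock,
global_flag, take_global_lock()) are placeholders, and the real
ipc/sem.c code is more involved:

	/* Sketch only: placeholder names, not the actual ipc/sem.c code. */
	static DEFINE_SPINLOCK(local_lock);
	static int global_flag;

	static void fast_path_lock(void)
	{
		spin_lock(&local_lock);
		/* Option 1 needs smp_mb() here ([[1]]); Option 2 does not */
		if (smp_load_acquire(&global_flag)) {
			/* slow path is active: fall back to the global lock */
			spin_unlock(&local_lock);
			take_global_lock();	/* placeholder */
		}
	}

	static void slow_path_enter(void)	/* Option 2 */
	{
		global_flag = 1;
		/* No smp_mb() needed: the unlock below is a RELEASE that
		 * orders the flag store, and the fast path's spin_lock()
		 * on the very same lock is the matching ACQUIRE.  Taking
		 * the lock also waits out any fast-path owner that missed
		 * the flag. */
		spin_lock(&local_lock);
		spin_unlock(&local_lock);
	}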



> > > > Technically you could use spin_trylock()+spin_unlock() and avoid the
> > > > lock acquire spinning on spin_unlock() and get very close to the slow
> > > > path performance of a pure cacheline-dirtying behavior.
> 
> This is even more confusing.  Did Ingo mean to suggest using
> "spin_trylock()+spin_unlock()" in place of "spin_lock()+spin_unlock()"
> could provide the desired ordering guarantee without delaying other
> CPUs that may try to acquire the lock?  That seems highly questionable.

I agree :-)

--
Manfred


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Alan Stern
Pardon me for barging in, but I found this whole interchange extremely 
confusing...

On Sat, 8 Jul 2017, Ingo Molnar wrote:

> * Paul E. McKenney  wrote:
> 
> > On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:
> > > 
> > > * Manfred Spraul  wrote:
> > > 
> > > > Hi Ingo,
> > > > 
> > > > On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> > > > > 
> > > > > There's another, probably just as significant advantage: 
> > > > > queued_spin_unlock_wait()
> > > > > is 'read-only', while spin_lock()+spin_unlock() dirties the lock 
> > > > > cache line. On
> > > > > any bigger system this should make a very measurable difference - if
> > > > > spin_unlock_wait() is ever used in a performance critical code path.
> > > > At least for ipc/sem:
> > > > Dirtying the cacheline (in the slow path) allows to remove a smp_mb() 
> > > > in the
> > > > hot path.
> > > > So for sem_lock(), I either need a primitive that dirties the cacheline 
> > > > or
> > > > sem_lock() must continue to use spin_lock()/spin_unlock().

This statement doesn't seem to make sense.  Did Manfred mean to write 
"smp_mb()" instead of "spin_lock()/spin_unlock()"?

> > > Technically you could use spin_trylock()+spin_unlock() and avoid the lock 
> > > acquire 
> > > spinning on spin_unlock() and get very close to the slow path performance 
> > > of a 
> > > pure cacheline-dirtying behavior.

This is even more confusing.  Did Ingo mean to suggest using 
"spin_trylock()+spin_unlock()" in place of "spin_lock()+spin_unlock()" 
could provide the desired ordering guarantee without delaying other 
CPUs that may try to acquire the lock?  That seems highly questionable.

> > > But adding something like spin_barrier(), which purely dirties the lock 
> > > cacheline, 
> > > would be even faster, right?
> > 
> > Interestingly enough, the arm64 and powerpc implementations of
> > spin_unlock_wait() were very close to what it sounds like you are
> > describing.
> 
> So could we perhaps solve all our problems by defining the generic version 
> thusly:
> 
> void spin_unlock_wait(spinlock_t *lock)
> {
> 	if (spin_trylock(lock))
> 		spin_unlock(lock);
> }

How could this possibly be a generic version of spin_unlock_wait()?  
It does nothing at all (with no ordering properties) if some other CPU
currently holds the lock, whereas the real spin_unlock_wait() would
wait until the other CPU released the lock (or possibly longer).

And if no other CPU currently holds the lock, this has exactly the same
performance properties as spin_lock()+spin_unlock(), so what's the
advantage?

Alan Stern

> ... and perhaps rename it to spin_barrier() [or whatever proper name there 
> would 
> be]?
> 
> Architectures can still optimize it, to remove the small window where the 
> lock is 
> held locally - as long as the ordering is at least as strong as the generic 
> version.
> 
> This would have various advantages:
> 
>  - semantics are well-defined
> 
>  - the generic implementation is already pretty well optimized (no spinning)
> 
>  - it would make it usable for the IPC performance optimization
> 
>  - architectures could still optimize it to eliminate the window where the 
> lock is
>held locally - if there's such instructions available.
> 
> Was this proposed before, or am I missing something?
> 
> Thanks,
> 
>   Ingo



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Paul E. McKenney
On Sat, Jul 08, 2017 at 02:30:19PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney  wrote:
> 
> > On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:
> > > 
> > > * Manfred Spraul  wrote:
> > > 
> > > > Hi Ingo,
> > > > 
> > > > On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> > > > > 
> > > > > There's another, probably just as significant advantage: 
> > > > > queued_spin_unlock_wait()
> > > > > is 'read-only', while spin_lock()+spin_unlock() dirties the lock 
> > > > > cache line. On
> > > > > any bigger system this should make a very measurable difference - if
> > > > > spin_unlock_wait() is ever used in a performance critical code path.
> > > > At least for ipc/sem:
> > > > Dirtying the cacheline (in the slow path) allows to remove a smp_mb() 
> > > > in the
> > > > hot path.
> > > > So for sem_lock(), I either need a primitive that dirties the cacheline 
> > > > or
> > > > sem_lock() must continue to use spin_lock()/spin_unlock().
> > > 
> > > Technically you could use spin_trylock()+spin_unlock() and avoid the lock 
> > > acquire 
> > > spinning on spin_unlock() and get very close to the slow path performance 
> > > of a 
> > > pure cacheline-dirtying behavior.
> > > 
> > > But adding something like spin_barrier(), which purely dirties the lock 
> > > cacheline, 
> > > would be even faster, right?
> > 
> > Interestingly enough, the arm64 and powerpc implementations of
> > spin_unlock_wait() were very close to what it sounds like you are
> > describing.
> 
> So could we perhaps solve all our problems by defining the generic version 
> thusly:
> 
> void spin_unlock_wait(spinlock_t *lock)
> {
> 	if (spin_trylock(lock))
> 		spin_unlock(lock);
> }
> 
> ... and perhaps rename it to spin_barrier() [or whatever proper name there 
> would 
> be]?

As lockdep, 0day Test Robot, Linus Torvalds, and several others let me
know in response to my original (thankfully RFC!) patch series, this needs
to disable irqs to work in the general case.  For example, if the lock
in question is an irq-disabling lock, you take an interrupt just after
a successful spin_trylock(), and that interrupt acquires the same lock,
the actuarial statistics of your kernel degrade sharply and suddenly.

What I get for sending out untested patches!  :-/
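
For completeness, an irq-safe variant along those lines might look like
the sketch below (illustrative only, not the actual posted fix):

	void spin_unlock_wait(spinlock_t *lock)
	{
		unsigned long flags;

		local_irq_save(flags);		/* an irq arriving here can no longer */
		if (spin_trylock(lock))		/* deadlock on the lock we just took  */
			spin_unlock(lock);
		local_irq_restore(flags);
	}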

> Architectures can still optimize it, to remove the small window where the 
> lock is 
> held locally - as long as the ordering is at least as strong as the generic 
> version.
> 
> This would have various advantages:
> 
>  - semantics are well-defined
> 
>  - the generic implementation is already pretty well optimized (no spinning)
> 
>  - it would make it usable for the IPC performance optimization
> 
>  - architectures could still optimize it to eliminate the window where the 
> lock is
>held locally - if there's such instructions available.
> 
> Was this proposed before, or am I missing something?

It was sort of proposed...

https://marc.info/?l=linux-arch&m=149912878628355&w=2

But do we have a situation where normal usage of spin_lock() and
spin_unlock() is causing performance or scalability trouble?

(We do have at least one situation in fnic that appears to be buggy use of
spin_is_locked(), and proposing a patch for that case is on my todo list.)

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Ingo Molnar

* Paul E. McKenney  wrote:

> On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:
> > 
> > * Manfred Spraul  wrote:
> > 
> > > Hi Ingo,
> > > 
> > > On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> > > > 
> > > > There's another, probably just as significant advantage: 
> > > > queued_spin_unlock_wait()
> > > > is 'read-only', while spin_lock()+spin_unlock() dirties the lock cache 
> > > > line. On
> > > > any bigger system this should make a very measurable difference - if
> > > > spin_unlock_wait() is ever used in a performance critical code path.
> > > At least for ipc/sem:
> > > Dirtying the cacheline (in the slow path) allows to remove a smp_mb() in 
> > > the
> > > hot path.
> > > So for sem_lock(), I either need a primitive that dirties the cacheline or
> > > sem_lock() must continue to use spin_lock()/spin_unlock().
> > 
> > Technically you could use spin_trylock()+spin_unlock() and avoid the lock 
> > acquire 
> > spinning on spin_unlock() and get very close to the slow path performance 
> > of a 
> > pure cacheline-dirtying behavior.
> > 
> > But adding something like spin_barrier(), which purely dirties the lock 
> > cacheline, 
> > would be even faster, right?
> 
> Interestingly enough, the arm64 and powerpc implementations of
> spin_unlock_wait() were very close to what it sounds like you are
> describing.

So could we perhaps solve all our problems by defining the generic version 
thusly:

void spin_unlock_wait(spinlock_t *lock)
{
	if (spin_trylock(lock))
		spin_unlock(lock);
}

... and perhaps rename it to spin_barrier() [or whatever proper name there
would be]?

Architectures can still optimize it, to remove the small window where the
lock is held locally - as long as the ordering is at least as strong as the
generic version.

This would have various advantages:

 - semantics are well-defined

 - the generic implementation is already pretty well optimized (no spinning)

 - it would make it usable for the IPC performance optimization

 - architectures could still optimize it to eliminate the window where the
   lock is held locally - if there are such instructions available.

Was this proposed before, or am I missing something?

Thanks,

Ingo


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Paul E. McKenney
On Sat, Jul 08, 2017 at 10:43:24AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney  wrote:
> 
> > On Fri, Jul 07, 2017 at 10:31:28AM +0200, Ingo Molnar wrote:
> > 
> > [ . . . ]
> > 
> > > In fact I'd argue that any future high performance spin_unlock_wait() 
> > > user is 
> > > probably better off open coding the unlock-wait poll loop (and possibly 
> > > thinking 
> > > hard about eliminating it altogether). If such patterns pop up in the 
> > > kernel we 
> > > can think about consolidating them into a single read-only primitive 
> > > again.
> > 
> > I would like any reintroduction to include a header comment saying exactly
> > what the consolidated primitive actually does and does not do.  ;-)
> > 
> > > I.e. I think the proposed changes are doing no harm, and the 
> > > unavailability of a 
> > > generic primitive does not hinder future optimizations either in any 
> > > significant 
> > > fashion.
> > 
> > I will have a v3 with updated comments from Manfred.  Thoughts on when/where
> > to push this?
> 
> Once everyone agrees I can apply it to the locking tree. I think PeterZ's was 
> the 
> only objection?

Oleg wasn't all that happy, either, but he did supply the relevant patch.

> > The reason I ask is if this does not go in during this merge window, I need
> > to fix the header comment on spin_unlock_wait().
> 
> Can try it next week after some testing - let's see how busy things get for 
> Linus 
> in the merge window?

Sounds good!  Either way is fine with me.

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Paul E. McKenney
On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:
> 
> * Manfred Spraul  wrote:
> 
> > Hi Ingo,
> > 
> > On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> > > 
> > > There's another, probably just as significant advantage: 
> > > queued_spin_unlock_wait()
> > > is 'read-only', while spin_lock()+spin_unlock() dirties the lock cache 
> > > line. On
> > > any bigger system this should make a very measurable difference - if
> > > spin_unlock_wait() is ever used in a performance critical code path.
> > At least for ipc/sem:
> > Dirtying the cacheline (in the slow path) allows to remove a smp_mb() in the
> > hot path.
> > So for sem_lock(), I either need a primitive that dirties the cacheline or
> > sem_lock() must continue to use spin_lock()/spin_unlock().
> 
> Technically you could use spin_trylock()+spin_unlock() and avoid the lock 
> acquire 
> spinning on spin_unlock() and get very close to the slow path performance of 
> a 
> pure cacheline-dirtying behavior.
> 
> But adding something like spin_barrier(), which purely dirties the lock 
> cacheline, 
> would be even faster, right?

Interestingly enough, the arm64 and powerpc implementations of
spin_unlock_wait() were very close to what it sounds like you are
describing.

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Ingo Molnar

* Paul E. McKenney  wrote:

> On Fri, Jul 07, 2017 at 10:31:28AM +0200, Ingo Molnar wrote:
> 
> [ . . . ]
> 
> > In fact I'd argue that any future high performance spin_unlock_wait() user 
> > is 
> > probably better off open coding the unlock-wait poll loop (and possibly 
> > thinking 
> > hard about eliminating it altogether). If such patterns pop up in the 
> > kernel we 
> > can think about consolidating them into a single read-only primitive again.
> 
> I would like any reintroduction to include a header comment saying exactly
> what the consolidated primitive actually does and does not do.  ;-)
> 
> > I.e. I think the proposed changes are doing no harm, and the unavailability 
> > of a 
> > generic primitive does not hinder future optimizations either in any 
> > significant 
> > fashion.
> 
> I will have a v3 with updated comments from Manfred.  Thoughts on when/where
> to push this?

Once everyone agrees I can apply it to the locking tree. I think PeterZ's
was the only objection?

> The reason I ask is if this does not go in during this merge window, I need
> to fix the header comment on spin_unlock_wait().

Can try it next week after some testing - let's see how busy things get
for Linus in the merge window?

Thanks,

Ingo


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Ingo Molnar

* Manfred Spraul  wrote:

> Hi Ingo,
> 
> On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> > 
> > There's another, probably just as significant advantage: 
> > queued_spin_unlock_wait()
> > is 'read-only', while spin_lock()+spin_unlock() dirties the lock cache 
> > line. On
> > any bigger system this should make a very measurable difference - if
> > spin_unlock_wait() is ever used in a performance critical code path.
> At least for ipc/sem:
> Dirtying the cacheline (in the slow path) allows to remove a smp_mb() in the
> hot path.
> So for sem_lock(), I either need a primitive that dirties the cacheline or
> sem_lock() must continue to use spin_lock()/spin_unlock().

Technically you could use spin_trylock()+spin_unlock() and avoid the lock
acquire spinning on spin_unlock() and get very close to the slow path
performance of a pure cacheline-dirtying behavior.

But adding something like spin_barrier(), which purely dirties the lock
cacheline, would be even faster, right?

Thanks,

Ingo


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Manfred Spraul

Hi Ingo,

On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> 
> There's another, probably just as significant advantage:
> queued_spin_unlock_wait() is 'read-only', while spin_lock()+spin_unlock()
> dirties the lock cache line. On any bigger system this should make a very
> measurable difference - if spin_unlock_wait() is ever used in a
> performance critical code path.

At least for ipc/sem:
Dirtying the cacheline (in the slow path) allows to remove a smp_mb() in
the hot path.
So for sem_lock(), I either need a primitive that dirties the cacheline
or sem_lock() must continue to use spin_lock()/spin_unlock().


--
Manfred


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Paul E. McKenney
On Fri, Jul 07, 2017 at 10:31:28AM +0200, Ingo Molnar wrote:

[ . . . ]

> In fact I'd argue that any future high performance spin_unlock_wait() user is 
> probably better off open coding the unlock-wait poll loop (and possibly 
> thinking 
> hard about eliminating it altogether). If such patterns pop up in the kernel 
> we 
> can think about consolidating them into a single read-only primitive again.

I would like any reintroduction to include a header comment saying exactly
what the consolidated primitive actually does and does not do.  ;-)

> I.e. I think the proposed changes are doing no harm, and the unavailability 
> of a 
> generic primitive does not hinder future optimizations either in any 
> significant 
> fashion.

I will have a v3 with updated comments from Manfred.  Thoughts on when/where
to push this?

The reason I ask is if this does not go in during this merge window, I need
to fix the header comment on spin_unlock_wait().

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Peter Zijlstra
On Fri, Jul 07, 2017 at 12:33:49PM +0200, Ingo Molnar wrote:

>  [1997/04] v2.1.36:
> 
>   the spin_unlock_wait() primitive gets introduced as part of release()


Whee, that goes _way_ further back than I thought it did :-)

>  [2017/07] v4.12:
> 
>   wait_task_inactive() is still alive and kicking. Its poll loop has
>   increased in complexity, but it still does not use spin_unlock_wait()
> 

I've tried 'fixing' that wait_task_inactive() thing a number of times,
but always failed :/ That is very nasty code indeed.


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> You missed the one in do_exit(), which I thought was the original one.

Indeed, it's raw_spin_unlock_wait() which my git grep pattern missed.

But it's not the original spin_unlock_wait(): the pi_lock and priority 
inheritance 
is a newfangled invention that Linus (rightfully) resisted for years.

The original spin_unlock_wait() was for the global scheduler_lock, long gone.

Here's the full history of the original spin_unlock_wait() usecase in the 
do_exit() path, for the historically interested:

 [1997/04] v2.1.36:

  the spin_unlock_wait() primitive gets introduced as part of release()

 [1998/08] v2.1.114:

  the release() usecase gets converted to an open coded spin_lock()+unlock()
  poll loop over scheduler_lock

 [1999/05] v2.3.11pre3:

  open coded loop is changed over to poll p->has_cpu

 [1999/07] v2.3.12pre6:

  the ->has_cpu poll loop is converted to a spin_lock()+unlock()
  poll loop over runqueue_lock

 [2000/06] 2.4.0-test6pre4:

  combined open coded p->has_cpu poll loop is added back, in addition to the
  lock()+unlock() loop

 [2000/11] 2.4.0-test12pre4:

  lock+unlock loop is changed from scheduler_lock to task_lock

 [2001/11] v2.4.14.9:

   ->has_cpu gets renamed to ->cpus_runnable

 [2001/12] v2.5.1.10:

   poll loop is factored out from exit()'s release() function 
   to the scheduler's new wait_task_inactive() function

 ...

 [2017/07] v4.12:

  wait_task_inactive() is still alive and kicking. Its poll loop has
  increased in complexity, but it still does not use spin_unlock_wait()

So it was always a mess, and we relatively early flipped from the clever 
spin_unlock_wait() implementation to an open coded lock+unlock poll loop.

TL;DR: The original do_exit() usecase is gone; it has not used
   spin_unlock_wait() since 1998.

Thanks,

Ingo


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Ingo Molnar

* Ingo Molnar  wrote:

> 
> * Peter Zijlstra  wrote:
> 
> > > It might even be that this is the defined semantics of spin_unlock_wait().
> > 
> > As is, spin_unlock_wait() is somewhat ill defined. IIRC it grew from an 
> > optimization by Oleg and subsequently got used elsewhere. And it being the 
> > subtle bugger it is, there were bugs.
> 
> I believe the historical, original spin_unlock_wait() came from early SMP 
> optimizations of the networking code - and then spread elsewhere, step by 
> step. 
> All but one of the networking uses went away since then - so I don't think 
> there's 
> any original usecase left.

No - the original usecase was task teardown: I still remembered that but
didn't find the commit - but it's there in very old Linux kernel patches,
done by DaveM originally in v2.1.36 (!):

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -136,6 +136,12 @@ void release(struct task_struct * p)
 		}
 	for (i=1 ; i<NR_TASKS ; i++)
 		if (task[i] == p) {
+#ifdef __SMP__
+			/* FIXME! Cheesy, but kills the window... -DaveM */
+			while (p->processor != NO_PROC_ID)
+				barrier();
+			spin_unlock_wait(&scheduler_lock);
+#endif


Other code learned to use spin_unlock_wait(): the original version of
[hard]irq_enter() was the second user, net_family_read_lock() was the
third user, followed by more uses in networking. All but one of those are
not present in the current upstream kernel anymore.

This task-teardown FIXME was fixed in v2.1.114 (was replaced by an open
coded poll loop), but the spin_unlock_wait() primitive remained.

The rest is history! ;-)

Thanks,

Ingo


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Peter Zijlstra
On Fri, Jul 07, 2017 at 10:31:28AM +0200, Ingo Molnar wrote:
> Here's a quick list of all the use cases:
> 
>  net/netfilter/nf_conntrack_core.c:
> 
>- This is I believe the 'original', historic spin_unlock_wait() usecase 
> that
>  still exists in the kernel. spin_unlock_wait() is only used in a rare 
> case, 
>  when the netfilter hash is resized via nf_conntrack_hash_resize() - 
> which is 
>  a very heavy operation to begin with. It will no doubt get slower with 
> the 
>  proposed changes, but it probably does not matter. A networking person 
>  Acked-by would be nice though.
> 
>  drivers/ata/libata-eh.c:
> 
>- Locking of the ATA port in ata_scsi_cmd_error_handler(), presumably this 
> can
>  race with IRQs and ioctls() on other CPUs. Very likely not performance 
>  sensitive in any fashion, on IO errors things stop for many seconds 
> anyway.
> 
>  ipc/sem.c:
> 
>- A rare race condition branch in the SysV IPC semaphore freeing code in 
>  exit_sem() - where even the main code flow is not performance sensitive, 
>  because typical database workloads get their semaphore arrays during 
> startup 
>  and don't ever do heavy runtime allocation/freeing of them.
> 
>  kernel/sched/completion.c:
> 
>- completion_done(). This is actually a (comparatively) rarely used 
> completion 
>  API call - almost all the upstream usecases are in drivers, plus two in 
>  filesystems - neither usecase seems in a performance critical hot path. 
>  Completions typically involve scheduling and context switching, so in 
> the 
>  worst case the proposed change adds overhead to a scheduling slow path.
> 

You missed the one in do_exit(), which I thought was the original one.


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > > > From: Paul E. McKenney
> > 
> > [ . . . ]
> > 
> > > Now on the one hand I feel like Oleg that it would be a shame to loose
> > > the optimization, OTOH this thing is really really tricky to use,
> > > and has lead to a number of bugs already.
> > 
> > I do agree, it is a bit sad to see these optimizations go.  So, should
> > this make mainline, I will be tagging the commits that spin_unlock_wait()
> > so that they can be easily reverted should someone come up with good
> > semantics and a compelling use case with compelling performance benefits.
> 
> Ha!, but what would constitute 'good semantics' ?
> 
> The current thing is something along the lines of:
> 
>   "Waits for the currently observed critical section
>to complete with ACQUIRE ordering such that it will observe
>whatever state was left by said critical section."
> 
> With the 'obvious' benefit of limited interference on those actually
> wanting to acquire the lock, and a shorter wait time on our side too,
> since we only need to wait for completion of the current section, and
> not for however many contender are before us.

There's another, probably just as significant advantage:
queued_spin_unlock_wait() is 'read-only', while spin_lock()+spin_unlock()
dirties the lock cache line. On any bigger system this should make a very
measurable difference - if spin_unlock_wait() is ever used in a
performance critical code path.

> Not sure I have an actual (micro) benchmark that shows a difference
> though.

It should be pretty obvious from pretty much any profile: the actual
lock+unlock sequence that modifies the lock cache line is essentially a
global cacheline bounce.

> Is this all good enough to retain the thing, I dunno. Like I said, I'm 
> conflicted on the whole thing. On the one hand its a nice optimization, on 
> the 
> other hand I don't want to have to keep fixing these bugs.

So on one hand it's _obvious_ that spin_unlock_wait() is both faster on
the local _and_ the remote CPUs for any sort of use case where performance
matters - I don't even understand how that can be argued otherwise.

The real question is: does any use-case (we care about) exist?

Here's a quick list of all the use cases:

 net/netfilter/nf_conntrack_core.c:

   - This is I believe the 'original', historic spin_unlock_wait() usecase
     that still exists in the kernel. spin_unlock_wait() is only used in a
     rare case, when the netfilter hash is resized via
     nf_conntrack_hash_resize() - which is a very heavy operation to begin
     with. It will no doubt get slower with the proposed changes, but it
     probably does not matter. A networking person Acked-by would be nice
     though.

 drivers/ata/libata-eh.c:

   - Locking of the ATA port in ata_scsi_cmd_error_handler(), presumably
     this can race with IRQs and ioctls() on other CPUs. Very likely not
     performance sensitive in any fashion; on IO errors things stop for
     many seconds anyway.

 ipc/sem.c:

   - A rare race condition branch in the SysV IPC semaphore freeing code
     in exit_sem() - where even the main code flow is not performance
     sensitive, because typical database workloads get their semaphore
     arrays during startup and don't ever do heavy runtime
     allocation/freeing of them.

 kernel/sched/completion.c:

   - completion_done(). This is actually a (comparatively) rarely used
     completion API call - almost all the upstream usecases are in
     drivers, plus two in filesystems - neither usecase seems in a
     performance critical hot path. Completions typically involve
     scheduling and context switching, so in the worst case the proposed
     change adds overhead to a scheduling slow path.

So I'd argue that unless there's some surprising performance aspect of a
completion_done() user, the proposed changes should not cause any
performance trouble.

In fact I'd argue that any future high performance spin_unlock_wait()
user is probably better off open coding the unlock-wait poll loop (and
possibly thinking hard about eliminating it altogether). If such patterns
pop up in the kernel we can think about consolidating them into a single
read-only primitive again.
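
For reference, such an open coded unlock-wait poll loop might look like
the sketch below (illustrative only; the function name is made up, and as
discussed elsewhere in this thread, weakly ordered architectures needed
considerably more care than this):

	static inline void foo_unlock_wait(spinlock_t *lock)	/* made-up name */
	{
		smp_mb();			/* order prior accesses against the probe  */
		while (spin_is_locked(lock))	/* read-only: does not dirty the cacheline */
			cpu_relax();
		smp_acquire__after_ctrl_dep();	/* ACQUIRE: observe the waited-out section */
	}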

I.e. I think the proposed changes are doing no harm, and the
unavailability of a generic primitive does not hinder future
optimizations either in any significant fashion.

Thanks,

Ingo


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Ingo Molnar

* Peter Zijlstra  wrote:

> > It might even be that this is the defined semantics of spin_unlock_wait().
> 
> As is, spin_unlock_wait() is somewhat ill defined. IIRC it grew from an 
> optimization by Oleg and subsequently got used elsewhere. And it being the 
> subtle bugger it is, there were bugs.

I believe the historical, original spin_unlock_wait() came from early SMP
optimizations of the networking code - and then spread elsewhere, step by
step. All but one of the networking uses went away since then - so I don't
think there's any original usecase left.

Thanks,

Ingo


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Alan Stern
On Thu, 6 Jul 2017, Peter Zijlstra wrote:

> On Thu, Jul 06, 2017 at 12:49:12PM -0400, Alan Stern wrote:
> > On Thu, 6 Jul 2017, Paul E. McKenney wrote:
> > 
> > > On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> > > > On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > > > > And yes, there are architecture-specific optimizations for an
> > > > > empty spin_lock()/spin_unlock() critical section, and the current
> > > > > arch_spin_unlock_wait() implementations show some of these 
> > > > > optimizations.
> > > > > But I expect that performance benefits would need to be demonstrated 
> > > > > at
> > > > > the system level.
> > > > 
> > > > I do in fact contended there are any optimizations for the exact
> > > > lock+unlock semantics.
> > > 
> > > You lost me on this one.
> > > 
> > > > The current spin_unlock_wait() is weaker. Most notably it will not (with
> > > > exception of ARM64/PPC for other reasons) cause waits on other CPUs.
> > > 
> > > Agreed, weaker semantics allow more optimizations.  So use cases needing
> > > only the weaker semantics should more readily show performance benefits.
> > > But either way, we need compelling use cases, and I do not believe that
> > > any of the existing spin_unlock_wait() calls are compelling.  Perhaps I
> > > am confused, but I am not seeing it for any of them.
> > 
> > If somebody really wants the full spin_unlock_wait semantics and
> > doesn't want to interfere with other CPUs, wouldn't synchronize_sched()
> > or something similar do the job?  It wouldn't be as efficient as
> > lock+unlock, but it also wouldn't affect other CPUs.
> 
> So please don't do that. That'll create massive pain for RT. Also I
> don't think it works. The whole point was that spin_unlock_wait() is
> _cheaper_ than lock()+unlock(). If it gets to be more expensive there is
> absolutely no point in using it.

Of course; that is obvious.

I was making a rhetorical point: You should not try to justify
spin_unlock_wait() on the basis that it doesn't cause waits on other
CPUs.

Alan Stern



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:08:50PM +0100, Will Deacon wrote:
> On Thu, Jul 06, 2017 at 06:50:36PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > > > On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > > > > From: Paul E. McKenney
> > > 
> > > [ . . . ]
> > > 
> > > > Now on the one hand I feel like Oleg that it would be a shame to loose
> > > > the optimization, OTOH this thing is really really tricky to use,
> > > > and has lead to a number of bugs already.
> > > 
> > > I do agree, it is a bit sad to see these optimizations go.  So, should
> > > this make mainline, I will be tagging the commits that spin_unlock_wait()
> > > so that they can be easily reverted should someone come up with good
> > > semantics and a compelling use case with compelling performance benefits.
> > 
> > Ha!, but what would constitute 'good semantics' ?
> > 
> > The current thing is something along the lines of:
> > 
> >   "Waits for the currently observed critical section
> >to complete with ACQUIRE ordering such that it will observe
> >whatever state was left by said critical section."
> > 
> > With the 'obvious' benefit of limited interference on those actually
> > wanting to acquire the lock, and a shorter wait time on our side too,
> > since we only need to wait for completion of the current section, and
> > not for however many contender are before us.
> > 
> > Not sure I have an actual (micro) benchmark that shows a difference
> > though.
> > 
> > 
> > 
> > Is this all good enough to retain the thing, I dunno. Like I said, I'm
> > conflicted on the whole thing. On the one hand its a nice optimization,
> > on the other hand I don't want to have to keep fixing these bugs.
> 
> As I've said, I'd be keen to see us drop this and bring it back if/when we
> get a compelling use-case along with performance numbers. At that point,
> we'd be in a better position to define the semantics anyway, knowing what
> exactly is expected by the use-case.

Hear, hear!!!  ;-)

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:50:36PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > > > From: Paul E. McKenney
> > 
> > [ . . . ]
> > 
> > > Now on the one hand I feel like Oleg that it would be a shame to loose
> > > the optimization, OTOH this thing is really really tricky to use,
> > > and has lead to a number of bugs already.
> > 
> > I do agree, it is a bit sad to see these optimizations go.  So, should
> > this make mainline, I will be tagging the commits that spin_unlock_wait()
> > so that they can be easily reverted should someone come up with good
> > semantics and a compelling use case with compelling performance benefits.
> 
> Ha!, but what would constitute 'good semantics' ?

At this point, it beats the heck out of me!  ;-)

> The current thing is something along the lines of:
> 
>   "Waits for the currently observed critical section
>to complete with ACQUIRE ordering such that it will observe
>whatever state was left by said critical section."
> 
> With the 'obvious' benefit of limited interference on those actually
> wanting to acquire the lock, and a shorter wait time on our side too,
> since we only need to wait for completion of the current section, and
> not for however many contender are before us.
> 
> Not sure I have an actual (micro) benchmark that shows a difference
> though.
> 
> 
> 
> Is this all good enough to retain the thing, I dunno. Like I said, I'm
> conflicted on the whole thing. On the one hand its a nice optimization,
> on the other hand I don't want to have to keep fixing these bugs.

Yeah, if I had seen a compelling use case...  Oleg's task_work case was
closest, but given that it involved a task-local lock that shouldn't
be all -that- heavily contended, it is hard to see there being all that
much difference.

But maybe I am missing something here?  Wouldn't be the first time...

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Will Deacon
On Thu, Jul 06, 2017 at 06:50:36PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > > > From: Paul E. McKenney
> > 
> > [ . . . ]
> > 
> > > Now on the one hand I feel like Oleg that it would be a shame to loose
> > > the optimization, OTOH this thing is really really tricky to use,
> > > and has lead to a number of bugs already.
> > 
> > I do agree, it is a bit sad to see these optimizations go.  So, should
> > this make mainline, I will be tagging the commits that spin_unlock_wait()
> > so that they can be easily reverted should someone come up with good
> > semantics and a compelling use case with compelling performance benefits.
> 
> Ha!, but what would constitute 'good semantics' ?
> 
> The current thing is something along the lines of:
> 
>   "Waits for the currently observed critical section
>to complete with ACQUIRE ordering such that it will observe
>whatever state was left by said critical section."
> 
> With the 'obvious' benefit of limited interference on those actually
> wanting to acquire the lock, and a shorter wait time on our side too,
> since we only need to wait for completion of the current section, and
> not for however many contender are before us.
> 
> Not sure I have an actual (micro) benchmark that shows a difference
> though.
> 
> 
> 
> Is this all good enough to retain the thing, I dunno. Like I said, I'm
> conflicted on the whole thing. On the one hand its a nice optimization,
> on the other hand I don't want to have to keep fixing these bugs.

As I've said, I'd be keen to see us drop this and bring it back if/when we
get a compelling use-case along with performance numbers. At that point,
we'd be in a better position to define the semantics anyway, knowing what
exactly is expected by the use-case.

Will


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:41:34PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 09:24:12AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > > > And yes, there are architecture-specific optimizations for an
> > > > empty spin_lock()/spin_unlock() critical section, and the current
> > > > arch_spin_unlock_wait() implementations show some of these 
> > > > optimizations.
> > > > But I expect that performance benefits would need to be demonstrated at
> > > > the system level.
> > > 
> > > I do in fact contended there are any optimizations for the exact
> > > lock+unlock semantics.
> > 
> > You lost me on this one.
> 
> For the exact semantics you'd have to fully participate in the fairness
> protocol. You have to in fact acquire the lock in order to have the
> other contending CPUs wait (otherwise my earlier case 3 will fail).
> 
> At that point I'm not sure there is much actual code you can leave out.
> 
> What actual optimization is there left at that point?

Got it.  It was just that I was having a hard time parsing your sentence.
You were contending that there are no optimizations for all implementations
for the full semantics.

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 12:49:12PM -0400, Alan Stern wrote:
> On Thu, 6 Jul 2017, Paul E. McKenney wrote:
> 
> > On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > > > And yes, there are architecture-specific optimizations for an
> > > > empty spin_lock()/spin_unlock() critical section, and the current
> > > > arch_spin_unlock_wait() implementations show some of these 
> > > > optimizations.
> > > > But I expect that performance benefits would need to be demonstrated at
> > > > the system level.
> > > 
> > > I do in fact contended there are any optimizations for the exact
> > > lock+unlock semantics.
> > 
> > You lost me on this one.
> > 
> > > The current spin_unlock_wait() is weaker. Most notably it will not (with
> > > exception of ARM64/PPC for other reasons) cause waits on other CPUs.
> > 
> > Agreed, weaker semantics allow more optimizations.  So use cases needing
> > only the weaker semantics should more readily show performance benefits.
> > But either way, we need compelling use cases, and I do not believe that
> > any of the existing spin_unlock_wait() calls are compelling.  Perhaps I
> > am confused, but I am not seeing it for any of them.
> 
> If somebody really wants the full spin_unlock_wait semantics and
> doesn't want to interfere with other CPUs, wouldn't synchronize_sched()
> or something similar do the job?  It wouldn't be as efficient as
> lock+unlock, but it also wouldn't affect other CPUs.

So please don't do that. That'll create massive pain for RT. Also I
don't think it works. The whole point was that spin_unlock_wait() is
_cheaper_ than lock()+unlock(). If it gets to be more expensive there is
absolutely no point in using it.




Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > > From: Paul E. McKenney
> 
> [ . . . ]
> 
> > Now on the one hand I feel like Oleg that it would be a shame to loose
> > the optimization, OTOH this thing is really really tricky to use,
> > and has lead to a number of bugs already.
> 
> I do agree, it is a bit sad to see these optimizations go.  So, should
> this make mainline, I will be tagging the commits that spin_unlock_wait()
> so that they can be easily reverted should someone come up with good
> semantics and a compelling use case with compelling performance benefits.

Ha!, but what would constitute 'good semantics' ?

The current thing is something along the lines of:

  "Waits for the currently observed critical section
   to complete with ACQUIRE ordering such that it will observe
   whatever state was left by said critical section."

With the 'obvious' benefit of limited interference on those actually
wanting to acquire the lock, and a shorter wait time on our side too,
since we only need to wait for completion of the current section, and
not for however many contenders are before us.

Not sure I have an actual (micro) benchmark that shows a difference
though.



Is this all good enough to retain the thing, I dunno. Like I said, I'm
conflicted on the whole thing. On the one hand it's a nice optimization,
on the other hand I don't want to have to keep fixing these bugs.


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Alan Stern
On Thu, 6 Jul 2017, Paul E. McKenney wrote:

> On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > > And yes, there are architecture-specific optimizations for an
> > > empty spin_lock()/spin_unlock() critical section, and the current
> > > arch_spin_unlock_wait() implementations show some of these optimizations.
> > > But I expect that performance benefits would need to be demonstrated at
> > > the system level.
> > 
> > I do in fact contended there are any optimizations for the exact
> > lock+unlock semantics.
> 
> You lost me on this one.
> 
> > The current spin_unlock_wait() is weaker. Most notably it will not (with
> > exception of ARM64/PPC for other reasons) cause waits on other CPUs.
> 
> Agreed, weaker semantics allow more optimizations.  So use cases needing
> only the weaker semantics should more readily show performance benefits.
> But either way, we need compelling use cases, and I do not believe that
> any of the existing spin_unlock_wait() calls are compelling.  Perhaps I
> am confused, but I am not seeing it for any of them.

If somebody really wants the full spin_unlock_wait semantics and
doesn't want to interfere with other CPUs, wouldn't synchronize_sched()
or something similar do the job?  It wouldn't be as efficient as
lock+unlock, but it also wouldn't affect other CPUs.

Alan Stern



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 09:24:12AM -0700, Paul E. McKenney wrote:
> On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > > And yes, there are architecture-specific optimizations for an
> > > empty spin_lock()/spin_unlock() critical section, and the current
> > > arch_spin_unlock_wait() implementations show some of these optimizations.
> > > But I expect that performance benefits would need to be demonstrated at
> > > the system level.
> > 
> > I do in fact contended there are any optimizations for the exact
> > lock+unlock semantics.
> 
> You lost me on this one.

For the exact semantics you'd have to fully participate in the fairness
protocol. You have to in fact acquire the lock in order to have the
other contending CPUs wait (otherwise my earlier case 3 will fail).

At that point I'm not sure there is much actual code you can leave out.

What actual optimization is there left at that point?


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > And yes, there are architecture-specific optimizations for an
> > empty spin_lock()/spin_unlock() critical section, and the current
> > arch_spin_unlock_wait() implementations show some of these optimizations.
> > But I expect that performance benefits would need to be demonstrated at
> > the system level.
> 
> I do in fact contended there are any optimizations for the exact
> lock+unlock semantics.

You lost me on this one.

> The current spin_unlock_wait() is weaker. Most notably it will not (with
> exception of ARM64/PPC for other reasons) cause waits on other CPUs.

Agreed, weaker semantics allow more optimizations.  So use cases needing
only the weaker semantics should more readily show performance benefits.
But either way, we need compelling use cases, and I do not believe that
any of the existing spin_unlock_wait() calls are compelling.  Perhaps I
am confused, but I am not seeing it for any of them.

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > From: Paul E. McKenney

[ . . . ]

> Now on the one hand I feel like Oleg that it would be a shame to loose
> the optimization, OTOH this thing is really really tricky to use,
> and has lead to a number of bugs already.

I do agree, it is a bit sad to see these optimizations go.  So, should
this make mainline, I will be tagging the commits that remove
spin_unlock_wait() so that they can be easily reverted should someone
come up with good semantics and a compelling use case with compelling
performance benefits.

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> And yes, there are architecture-specific optimizations for an
> empty spin_lock()/spin_unlock() critical section, and the current
> arch_spin_unlock_wait() implementations show some of these optimizations.
> But I expect that performance benefits would need to be demonstrated at
> the system level.

I do in fact contended there are any optimizations for the exact
lock+unlock semantics.

The current spin_unlock_wait() is weaker. Most notably it will not (with
exception of ARM64/PPC for other reasons) cause waits on other CPUs.



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Peter Zijlstra
On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 06 July 2017 00:30
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair.  This series therefore removes spin_unlock_wait() and changes
> > its users to instead use a lock/unlock pair.  The commits are as follows,
> > in three groups:
> > 
> > 1-7.Change uses of spin_unlock_wait() and raw_spin_unlock_wait()
> > to instead use a spin_lock/spin_unlock pair.  These may be
> > applied in any order, but must be applied before any later
> > commits in this series.  The commit logs state why I believe
> > that these commits won't noticeably degrade performance.
> 
> I can't help feeling that it might be better to have a spin_lock_sync()
> call that is equivalent to a spin_lock/spin_unlock pair.
> The default implementation being an inline function that does exactly that.
> This would let an architecture implement a more efficient version.

So that has the IRQ inversion issue found earlier in this patch-set. Not
actually doing the acquire though breaks other guarantees. See later.

> It might even be that this is the defined semantics of spin_unlock_wait().

As is, spin_unlock_wait() is somewhat ill defined. IIRC it grew from an
optimization by Oleg and subsequently got used elsewhere. And it being
the subtle bugger it is, there were bugs.

But part of the problem with that definition is fairness. For fair locks,
spin_lock()+spin_unlock() partakes in the fairness protocol. Many of the
things that would make spin_lock_sync() cheaper preclude it doing that.

(with the exception of ticket locks, those could actually do this).

But I think we can argue we don't in fact want that; all we really need
is to ensure the completion of the _current_ lock. But then you've
violated that equivalent thing.

As is, this is all a bit up in the air -- some of the ticket lock
variants are fair or minimal, the qspinlock one is prone to starvation
(although I think I can in fact fix that for qspinlock).

> Note that it can only be useful to do a spin_lock/unlock pair if it is
> impossible for another code path to try to acquire the lock.
> (Or, at least, the code can't care if the lock is acquired just after.)
> So if it can de determined that the lock isn't held (a READ_ONCE()
> might be enough) the lock itself need not be acquired (with the
> associated slow bus cycles).

Now look at the ARM64/PPC implementations that do explicit stores.

READ_ONCE() can only ever be sufficient on strongly ordered
architectures, but given many performance critical architectures are in
fact weakly ordered, you've opened up the exact can of worms we want to
get rid of the thing for.


So given:

(1)

  spin_lock()
  spin_unlock()

does not in fact provide memory ordering. Any prior/subsequent
load/stores can leak into the section and cross there.
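
For instance (illustrative):

	X = 1;			/* can sink past the ACQUIRE into the section  */
	spin_lock(&l);
	spin_unlock(&l);
	r = Y;			/* can rise above the RELEASE into the section */

	/* once both are inside the section nothing orders them, so the
	   store to X and the load of Y can still be reordered */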

What the thing does do however is serialize against other critical
sections in that:

(2)

	CPU0				CPU1

					spin_lock()
	spin_lock()
					X = 5
					spin_unlock()
	spin_unlock()
	r = X;

we must have r == 5 (due to address dependency on the lock; CPU1 does
the store to X, then a store-release to the lock. CPU0 then does a
load-acquire on the lock and that fully orders the subsequent load of X
to the prior store of X).


The other way around is more difficult though:

(3)

	CPU0				CPU1

	X=5
	spin_lock()
					spin_lock()
	spin_unlock()
					r = X;
					spin_unlock()

Where the above will in fact observe r == 5, this will be very difficult
to achieve with anything that will not let CPU1 wait. Which was the
entire premise of the original optimization by Oleg.

One of the later fixes we did to spin_unlock_wait() is to give it
ACQUIRE semantics to deal with (2) but even that got massively tricky,
see for example the ARM64 / PPC implementations.



In short, I doubt there is any real optimization possible if you want to
retain exact lock+unlock semantics.

Now on the one hand I feel like Oleg that it would be a shame to lose
the optimization, OTOH this thing is really really tricky to use,
and has led to a number of bugs already.


Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 06 July 2017 00:30
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair.  This series therefore removes spin_unlock_wait() and changes
> > its users to instead use a lock/unlock pair.  The commits are as follows,
> > in three groups:
> > 
> > 1-7.Change uses of spin_unlock_wait() and raw_spin_unlock_wait()
> > to instead use a spin_lock/spin_unlock pair.  These may be
> > applied in any order, but must be applied before any later
> > commits in this series.  The commit logs state why I believe
> > that these commits won't noticeably degrade performance.
> 
> I can't help feeling that it might be better to have a spin_lock_sync()
> call that is equivalent to a spin_lock/spin_unlock pair.
> The default implementation being an inline function that does exactly that.
> This would let an architecture implement a more efficient version.
> 
> It might even be that this is the defined semantics of spin_unlock_wait().

That was in fact my first proposal, see the comment header in current
mainline for spin_unlock_wait() in include/linux/spinlock.h.  But Linus
quite rightly pointed out that if spin_unlock_wait() was to be defined in
this way, we should get rid of spin_unlock_wait() entirely, especially
given that there are not very many calls to spin_unlock_wait() and
also given that none of them are particularly performance critical.
Hence the current patch set, which does just that.

> Note that it can only be useful to do a spin_lock/unlock pair if it is
> impossible for another code path to try to acquire the lock.
> (Or, at least, the code can't care if the lock is acquired just after.)

Indeed!  As Oleg Nesterov pointed out, a spin_lock()/spin_unlock() pair
is sort of like synchronize_rcu() for a given lock, where that lock's
critical sections play the role of RCU read-side critical sections.
So anything before the pair is visible to later critical sections, and
anything in prior critical sections is visible to anything after the pair.
But again, as Linus pointed out, if we are going to have these semantics,
just do spin_lock() immediately followed by spin_unlock().
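
To illustrate the analogy with a sketch (the names are made up):

	/* updater */
	WRITE_ONCE(going_away, true);
	spin_lock(&obj_lock);		/* "synchronize" against current holders */
	spin_unlock(&obj_lock);
	/* all prior obj_lock critical sections have completed and their
	 * stores are visible here; all later ones will see going_away */

	/* reader-side critical section */
	spin_lock(&obj_lock);
	if (!READ_ONCE(going_away))
		use_object();		/* made-up helper */
	spin_unlock(&obj_lock);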

> So if it can de determined that the lock isn't held (a READ_ONCE()
> might be enough) the lock itself need not be acquired (with the
> associated slow bus cycles).

If you want the full semantics of a spin_lock()/spin_unlock() pair,
you need a full memory barrier before the READ_ONCE(), even on x86.
Without that memory barrier, you don't get the effect of the release
implicit in spin_unlock().

For weaker architectures, such as PowerPC and ARM, a READ_ONCE()
does -not- suffice, not at all, even with smp_mb() before and after.
I encourage you to take a look at arch_spin_unlock_wait() in arm64
and powerpc if you are interested.  There were also some lengthy LKML
threads discussing this about 18 months ago that could be illuminating.

And yes, there are architecture-specific optimizations for an
empty spin_lock()/spin_unlock() critical section, and the current
arch_spin_unlock_wait() implementations show some of these optimizations.
But I expect that performance benefits would need to be demonstrated at
the system level.

Thanx, Paul



RE: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread David Laight
From: Paul E. McKenney
> Sent: 06 July 2017 00:30
> There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> and it appears that all callers could do just as well with a lock/unlock
> pair.  This series therefore removes spin_unlock_wait() and changes
> its users to instead use a lock/unlock pair.  The commits are as follows,
> in three groups:
> 
> 1-7.  Change uses of spin_unlock_wait() and raw_spin_unlock_wait()
>   to instead use a spin_lock/spin_unlock pair.  These may be
>   applied in any order, but must be applied before any later
>   commits in this series.  The commit logs state why I believe
>   that these commits won't noticeably degrade performance.

I can't help feeling that it might be better to have a spin_lock_sync()
call that is equivalent to a spin_lock/spin_unlock pair.
The default implementation being an inline function that does exactly that.
This would let an architecture implement a more efficient version.

It might even be that this is the defined semantics of spin_unlock_wait().
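
That is, something like the sketch below as the default:

	static inline void spin_lock_sync(spinlock_t *lock)
	{
		spin_lock(lock);
		spin_unlock(lock);
	}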

Note that it can only be useful to do a spin_lock/unlock pair if it is
impossible for another code path to try to acquire the lock.
(Or, at least, the code can't care if the lock is acquired just after.)
So if it can be determined that the lock isn't held (a READ_ONCE()
might be enough) the lock itself need not be acquired (with the
associated slow bus cycles).

David