Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}

2018-10-19 Thread Paul E. McKenney
On Fri, Oct 19, 2018 at 12:02:43PM +0100, Will Deacon wrote:
> On Thu, Oct 18, 2018 at 08:53:42PM -0700, Alexei Starovoitov wrote:
> > On Thu, Oct 18, 2018 at 09:00:46PM +0200, Daniel Borkmann wrote:
> > > On 10/18/2018 05:33 PM, Alexei Starovoitov wrote:
> > > > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> > > >>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> > > >> diff --git a/tools/arch/powerpc/include/asm/barrier.h 
> > > >> b/tools/arch/powerpc/include/asm/barrier.h
> > > >> index a634da0..905a2c6 100644
> > > >> --- a/tools/arch/powerpc/include/asm/barrier.h
> > > >> +++ b/tools/arch/powerpc/include/asm/barrier.h
> > > >> @@ -27,4 +27,20 @@
> > > >>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > > >>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > > >>
> > > >> +#if defined(__powerpc64__)
> > > >> +#define smp_lwsync()  __asm__ __volatile__ ("lwsync" : : : "memory")
> > > >> +
> > > >> +#define smp_store_release(p, v)   \
> > > >> +do {  \
> > > >> +  smp_lwsync();   \
> > > >> +  WRITE_ONCE(*p, v);  \
> > > >> +} while (0)
> > > >> +
> > > >> +#define smp_load_acquire(p)   \
> > > >> +({\
> > > >> +  typeof(*p) ___p1 = READ_ONCE(*p);   \
> > > >> +  smp_lwsync();   \
> > > >> +  ___p1;  \
> > > > 
> > > > I don't like this proliferation of asm.
> > > > Why do we think that we can do better job than compiler?
> > > > can we please use gcc builtins instead?
> > > > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> > > > __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> > > > __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> > > > are done specifically for this use case if I'm not mistaken.
> > > > I think it pays to learn what compiler provides.
> > > 
> > > But are you sure the C11 memory model matches exact same model as kernel?
> > > Seems like last time Will looked into it [0] it wasn't the case ...
> > 
> > I'm only suggesting equivalence of __atomic_load_n(ptr, __ATOMIC_ACQUIRE)
> > with kernel's smp_load_acquire().
> > I've seen a bunch of user space ring buffer implementations implemented
> > with __atomic_load_n() primitives.
> > But let's ask experts who live in both worlds.
> 
> One thing to be wary of is if there is an implementation choice between
> how to implement load-acquire and store-release for a given architecture.
> In these situations, it's often important that concurrent software agrees
> on the "mapping", so we'd need to be sure that (a) All userspace compilers
> that we care about have compatible mappings and (b) These mappings are
> compatible with the kernel code.

Agreed!  Mixing and matching can be done, but it does require quite a
bit of care.
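
For concreteness, the userspace side of that mapping might look like the
sketch below (illustrative only: GCC/Clang __atomic builtins and a made-up
single-producer/single-consumer ring index; the kernel side would use
smp_store_release()/smp_load_acquire()):

struct ring {
	unsigned long head;			/* written by the producer only */
	unsigned long data[64];
};

/* Producer: fill the slot, then publish the new head with release semantics. */
static void ring_publish(struct ring *r, unsigned long v)
{
	unsigned long head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);

	r->data[head % 64] = v;
	/* Pairs with the __ATOMIC_ACQUIRE load in ring_consume(). */
	__atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
}

/* Consumer: acquire-load the head, after which the slot is safe to read. */
static int ring_consume(struct ring *r, unsigned long tail, unsigned long *v)
{
	/* Pairs with the __ATOMIC_RELEASE store in ring_publish(). */
	unsigned long head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);

	if (tail == head)
		return 0;			/* nothing new yet */
	*v = r->data[tail % 64];
	return 1;
}

Whether this userspace mapping and the kernel's smp_load_acquire() mapping
may safely be mixed on a given architecture is exactly the compatibility
question raised above.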

Thanx, Paul



Re: [PATCH 2/5] rhashtable: don't hold lock on first table throughout insertion.

2018-07-25 Thread Paul E. McKenney
On Wed, Jul 25, 2018 at 02:53:36PM +1000, NeilBrown wrote:
> On Tue, Jul 24 2018, Paul E. McKenney wrote:
> 
> > On Tue, Jul 24, 2018 at 07:52:03AM +1000, NeilBrown wrote:
> >> On Mon, Jul 23 2018, Paul E. McKenney wrote:
> >> 
> >> > On Mon, Jul 23, 2018 at 09:13:43AM +1000, NeilBrown wrote:
> >> >> On Sun, Jul 22 2018, Paul E. McKenney wrote:
> >> >> >
> >> >> > One issue is that the ->func pointer can legitimately be NULL while on
> >> >> > RCU's callback lists.  This happens when someone invokes kfree_rcu()
> >> >> > with the rcu_head structure at the beginning of the enclosing 
> >> >> > structure.
> >> >> > I could add an offset to avoid this, or perhaps the kmalloc() folks
> >> >> > could be persuaded Rao Shoaib's patch moving kfree_rcu() handling to
> >> >> > the slab allocators, so that RCU only ever sees function pointers in
> >> >> > the ->func field.
> >> >> >
> >> >> > Either way, this should be hidden behind an API to allow adjustments
> >> >> > to be made if needed.  Maybe something like is_after_call_rcu()?
> >> >> > This would (for example) allow debug-object checks to be used to catch
> >> >> > check-after-free bugs.
> >> >> >
> >> >> > Would something of that sort work for you?
> >> >> 
> >> >> Yes, if you could provide an is_after_call_rcu() API, that would
> >> >> perfectly suit my use-case.
> >> >
> >> > After beating my head against the object-debug code a bit, I have to ask
> >> > if it would be OK for you if the is_after_call_rcu() API also takes the
> >> > function that was passed to RCU.
> >> 
> >> Sure.  It feels a bit clumsy, but I can see it could be easier to make
> >> robust.
> >> So yes: I'm fine with passing the same function and rcu_head to both
> >> call_rcu() and is_after_call_rcu().  Actually, when I say it like that,
> >> it seems less clumsy :-)
> >
> > How about like this?  (It needs refinements, like lockdep, but should
> > get the gist.)
> >
> 
> Looks good ... except ... naming is hard.
> 
>  is_after_call_rcu_init()  asserts where in the lifecycle we are,
>  is_after_call_rcu() tests where in the lifecycle we are.
> 
>  The names are similar but the purpose is quite different.
>  Maybe s/is_after_call_rcu_init/call_rcu_init/ ??

How about rcu_head_init() and rcu_head_after_call_rcu()?
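
Roughly, usage would look something like the sketch below (the structure
and helpers are invented for illustration):

struct foo {
	struct rcu_head rh;
	/* ... payload ... */
};

static void foo_free_rcu(struct rcu_head *rhp)
{
	kfree(container_of(rhp, struct foo, rh));
}

static struct foo *foo_alloc(gfp_t gfp)
{
	struct foo *fp = kzalloc(sizeof(*fp), gfp);

	/* Initialize before the structure becomes visible to readers. */
	if (fp)
		rcu_head_init(&fp->rh);
	return fp;
}

/*
 * Invoked within an RCU read-side critical section on an object found
 * by traversal:  has it already been handed (with foo_free_rcu()) to
 * call_rcu()?
 */
static bool foo_is_dying(struct foo *fp)
{
	return rcu_head_after_call_rcu(&fp->rh, foo_free_rcu);
}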

Thanx, Paul

> Thanks,
> NeilBrown
> 
> 
> > Thanx, Paul
> >
> > 
> >
> > commit 5aa0ebf4799b8bddbbd0124db1c008526e99fc7c
> > Author: Paul E. McKenney 
> > Date:   Tue Jul 24 15:28:09 2018 -0700
> >
> > rcu: Provide functions for determining if call_rcu() has been invoked
> > 
> > This commit adds is_after_call_rcu() and is_after_call_rcu_init()
> > functions to help RCU users detect when another CPU has passed
> > the specified rcu_head structure and function to call_rcu().
> > The is_after_call_rcu_init() should be invoked before making the
> > structure visible to RCU readers, and then the is_after_call_rcu() may
> >     be invoked from within an RCU read-side critical section on an rcu_head
> > structure that was obtained during a traversal of the data structure
> > in question.  The is_after_call_rcu() function will return true if the
> > rcu_head structure has already been passed (with the specified function)
> > to call_rcu(), otherwise it will return false.
> > 
> > If is_after_call_rcu_init() has not been invoked on the rcu_head
> > structure or if the rcu_head (AKA callback) has already been invoked,
> > then is_after_call_rcu() will do WARN_ON_ONCE().
> > 
> > Reported-by: NeilBrown 
> > Signed-off-by: Paul E. McKenney 
> >
> > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> > index e4f821165d0b..82e5a91539b5 100644
> > --- a/include/linux/rcupdate.h
> > +++ b/include/linux/rcupdate.h
> > @@ -857,6 +857,45 @@ static inline notrace void 
> > rcu_read_unlock_sched_notrace(void)
> >  #endif /* #else #ifdef CONFIG_ARCH_WEAK_RELEASE_ACQUIRE */
> >  
> >  
> > +/* Has the specified rcu_head structure been handed to call_rcu()? */
> 

[tip:core/rcu] netfilter: Remove now-redundant smp_read_barrier_depends()

2018-01-03 Thread tip-bot for Paul E. McKenney
Commit-ID:  4be2b04e43fd3d8164d7aeb1808e47fbeb0c0de0
Gitweb: https://git.kernel.org/tip/4be2b04e43fd3d8164d7aeb1808e47fbeb0c0de0
Author: Paul E. McKenney <paul...@linux.vnet.ibm.com>
AuthorDate: Mon, 9 Oct 2017 12:09:04 -0700
Committer:  Paul E. McKenney <paul...@linux.vnet.ibm.com>
CommitDate: Mon, 4 Dec 2017 10:52:57 -0800

netfilter: Remove now-redundant smp_read_barrier_depends()

READ_ONCE() now implies smp_read_barrier_depends(), which means that
the instances in arpt_do_table(), ipt_do_table(), and ip6t_do_table()
are now redundant.  This commit removes them and adjusts the comments.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: Jozsef Kadlecsik <kad...@blackhole.kfki.hu>
Cc: Florian Westphal <f...@strlen.de>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: <netfilter-de...@vger.kernel.org>
Cc: <coret...@netfilter.org>
Cc: <netdev@vger.kernel.org>
---
 net/ipv4/netfilter/arp_tables.c | 7 +--
 net/ipv4/netfilter/ip_tables.c  | 7 +--
 net/ipv6/netfilter/ip6_tables.c | 7 +--
 3 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index f88221a..d242c2d 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -202,13 +202,8 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
table_base = private->entries;
jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
 
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 4cbe5e8..46866cc 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -260,13 +260,8 @@ ipt_do_table(struct sk_buff *skb,
WARN_ON(!(table->valid_hooks & (1 << hook)));
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu= smp_processor_id();
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
table_base = private->entries;
jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
 
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index f06e250..ac1db84 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -282,12 +282,7 @@ ip6t_do_table(struct sk_buff *skb,
 
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu= smp_processor_id();
table_base = private->entries;
jumpstack  = (struct ip6t_entry **)private->jumpstack[cpu];


[tip:core/rcu] drivers/vhost: Remove now-redundant read_barrier_depends()

2018-01-03 Thread tip-bot for Paul E. McKenney
Commit-ID:  3a5db0b108e0a40f08c2bcff6a675dbf632b91e0
Gitweb: https://git.kernel.org/tip/3a5db0b108e0a40f08c2bcff6a675dbf632b91e0
Author: Paul E. McKenney <paul...@linux.vnet.ibm.com>
AuthorDate: Mon, 27 Nov 2017 09:45:10 -0800
Committer:  Paul E. McKenney <paul...@linux.vnet.ibm.com>
CommitDate: Tue, 5 Dec 2017 11:57:55 -0800

drivers/vhost: Remove now-redundant read_barrier_depends()

Because READ_ONCE() now implies read_barrier_depends(), the
read_barrier_depends() in next_desc() is now redundant.  This commit
therefore removes it and the related comments.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: <k...@vger.kernel.org>
Cc: <virtualizat...@lists.linux-foundation.org>
Cc: <netdev@vger.kernel.org>
---
 drivers/vhost/vhost.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 33ac2b1..78b5940 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1877,12 +1877,7 @@ static unsigned next_desc(struct vhost_virtqueue *vq, 
struct vring_desc *desc)
return -1U;
 
/* Check they're not leading us off end of descriptors. */
-   next = vhost16_to_cpu(vq, desc->next);
-   /* Make sure compiler knows to grab that: we don't want it changing! */
-   /* We will use the result as an index in an array, so most
-* architectures only need a compiler barrier here. */
-   read_barrier_depends();
-
+   next = vhost16_to_cpu(vq, READ_ONCE(desc->next));
return next;
 }
 


[tip:core/rcu] drivers/net/ethernet/qlogic/qed: Fix __qed_spq_block() ordering

2018-01-03 Thread tip-bot for Paul E. McKenney
Commit-ID:  cb7e125e03274cffa97d74433c876765efffaf6a
Gitweb: https://git.kernel.org/tip/cb7e125e03274cffa97d74433c876765efffaf6a
Author: Paul E. McKenney <paul...@linux.vnet.ibm.com>
AuthorDate: Mon, 9 Oct 2017 09:26:25 -0700
Committer:  Paul E. McKenney <paul...@linux.vnet.ibm.com>
CommitDate: Mon, 4 Dec 2017 10:52:52 -0800

drivers/net/ethernet/qlogic/qed: Fix __qed_spq_block() ordering

The __qed_spq_block() function expects an smp_read_barrier_depends()
to order a prior READ_ONCE() against a later load that does not depend
on the prior READ_ONCE(), an expectation that can fail to be met.
This commit therefore replaces the READ_ONCE() with smp_load_acquire()
and removes the smp_read_barrier_depends().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ariel Elior <ariel.el...@cavium.com>
Cc: <everest-linux...@cavium.com>
Cc: <netdev@vger.kernel.org>
---
 drivers/net/ethernet/qlogic/qed/qed_spq.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c 
b/drivers/net/ethernet/qlogic/qed/qed_spq.c
index be48d9a..c1237ec 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_spq.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c
@@ -97,9 +97,7 @@ static int __qed_spq_block(struct qed_hwfn *p_hwfn,
 
while (iter_cnt--) {
/* Validate we receive completion update */
-   if (READ_ONCE(comp_done->done) == 1) {
-   /* Read updated FW return value */
-   smp_read_barrier_depends();
+   if (smp_load_acquire(&comp_done->done) == 1) { /* ^^^ */
if (p_fw_ret)
*p_fw_ret = comp_done->fw_return_code;
return 0;


Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Paul E. McKenney
On Wed, Dec 06, 2017 at 12:09:36AM +0200, Michael S. Tsirkin wrote:
> On Tue, Dec 05, 2017 at 10:57:00PM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 05, 2017 at 11:24:49PM +0200, Michael S. Tsirkin wrote:
> > > READ_ONCE is really all over the place (some code literally replaced all
> > > memory accesses with READ/WRITE ONCE).
> > 
> > Yeah, so?
> 
> Oh my point was I can't just look for READ_ONCE and go
> *that's the pair*. there are too many of these.
> At Paul's suggestion I will document the pairing *this read once has a
> barrier that is paired with that barrier*.
> 
> > Complain to the compiler people for forcing us into that.
> 
> In some cases when you end up with all accesses
> going through read/write once, volatile just might be better.

That is in fact what the jiffies counter does.  But you lose READ_ONCE()'s
automatic handling of DEC Alpha when you take that approach.

> > > Would an API like WRITE_POINTER()/smp_store_pointer make sense,
> > > and READ_POINTER for symmetry?
> > 
> > No, the whole point of the exercise was to get away from the fact that
> > dependent loads are special.
> 
> It's a pity that dependent stores are still special.

We can make READ_ONCE() not be special at zero cost on non-Alpha
systems, but both smp_wmb() and smp_store_release() are decidedly
not free of added overhead.

Thanx, Paul



Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Paul E. McKenney
On Tue, Dec 05, 2017 at 11:24:49PM +0200, Michael S. Tsirkin wrote:
> On Tue, Dec 05, 2017 at 12:08:01PM -0800, Paul E. McKenney wrote:
> > On Tue, Dec 05, 2017 at 09:51:48PM +0200, Michael S. Tsirkin wrote:
> > > On Tue, Dec 05, 2017 at 11:33:39AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Dec 05, 2017 at 09:24:21PM +0200, Michael S. Tsirkin wrote:
> > 
> > [ . . . ]
> > 
> > > > > and this barrier is no longer paired with anything until
> > > > > you realize there's a dependency barrier within READ_ONCE.
> > > > > 
> > > > > Barrier pairing was a useful tool to check code validity,
> > > > > maybe there are other, better tools now.
> > > > 
> > > > There are quite a few people who say that smp_store_release() is
> > > > easier for the tools to analyze than is smp_wmb().  My experience with
> > > > smp_read_barrier_depends() and rcu_dereference() leads me to believe
> > > > that they are correct.
> > > 
> > > OK, but smp_store_release is still not paired with anything since we
> > > rely on READ_ONCE to include the implicit dependency barrier.
> > 
> > Why wouldn't you consider the smp_store_release() to be paired with
> > the new improved READ_ONCE()?
> 
> READ_ONCE is really all over the place (some code literally replaced all
> memory accesses with READ/WRITE ONCE).
> 
> And I also prefer smp_wmb as it seems to be cheaper on ARM.
> 
> Would an API like WRITE_POINTER()/smp_store_pointer make sense,
> and READ_POINTER for symmetry?

What we do in some code is to comment the pairings, allowing the other
side of the pairing to be easily located.  Would that work for you?
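
Something like the following, with names invented purely for illustration:

struct obj {
	int data;
};

static struct obj *gp;

static void publisher(struct obj *p)
{
	p->data = 1;
	/* Pairs with the READ_ONCE() in consumer(). */
	smp_store_release(&gp, p);
}

static int consumer(void)
{
	/* Pairs with the smp_store_release() in publisher(). */
	struct obj *p = READ_ONCE(gp);

	return p ? READ_ONCE(p->data) : -1;
}

A reader who stumbles on either side can then find the other by searching
for the function named in the comment.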

Thanx, Paul



Re: [PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-05 Thread Paul E. McKenney
On Tue, Dec 05, 2017 at 09:51:48PM +0200, Michael S. Tsirkin wrote:
> On Tue, Dec 05, 2017 at 11:33:39AM -0800, Paul E. McKenney wrote:
> > On Tue, Dec 05, 2017 at 09:24:21PM +0200, Michael S. Tsirkin wrote:

[ . . . ]

> > > and this barrier is no longer paired with anything until
> > > you realize there's a dependency barrier within READ_ONCE.
> > > 
> > > Barrier pairing was a useful tool to check code validity,
> > > maybe there are other, better tools now.
> > 
> > There are quite a few people who say that smp_store_release() is
> > easier for the tools to analyze than is smp_wmb().  My experience with
> > smp_read_barrier_depends() and rcu_dereference() leads me to believe
> > that they are correct.
> 
> OK, but smp_store_release is still not paired with anything since we
> rely on READ_ONCE to include the implicit dependency barrier.

Why wouldn't you consider the smp_store_release() to be paired with
the new improved READ_ONCE()?

Thanx, Paul



[PATCH tip/core/rcu 14/21] netfilter: Remove now-redundant smp_read_barrier_depends()

2017-12-01 Thread Paul E. McKenney
READ_ONCE() now implies smp_read_barrier_depends(), which means that
the instances in arpt_do_table(), ipt_do_table(), and ip6t_do_table()
are now redundant.  This commit removes them and adjusts the comments.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: Jozsef Kadlecsik <kad...@blackhole.kfki.hu>
Cc: Florian Westphal <f...@strlen.de>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: <netfilter-de...@vger.kernel.org>
Cc: <coret...@netfilter.org>
Cc: <netdev@vger.kernel.org>
---
 net/ipv4/netfilter/arp_tables.c | 7 +--
 net/ipv4/netfilter/ip_tables.c  | 7 +--
 net/ipv6/netfilter/ip6_tables.c | 7 +--
 3 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index f88221aebc9d..d242c2d29161 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -202,13 +202,8 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
table_base = private->entries;
jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
 
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 4cbe5e80f3bf..46866cc24a84 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -260,13 +260,8 @@ ipt_do_table(struct sk_buff *skb,
WARN_ON(!(table->valid_hooks & (1 << hook)));
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu= smp_processor_id();
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
table_base = private->entries;
jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
 
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index f06e25065a34..ac1db84722a7 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -282,12 +282,7 @@ ip6t_do_table(struct sk_buff *skb,
 
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu= smp_processor_id();
table_base = private->entries;
jumpstack  = (struct ip6t_entry **)private->jumpstack[cpu];
-- 
2.5.2



[PATCH tip/core/rcu 21/21] drivers/vhost: Remove now-redundant read_barrier_depends()

2017-12-01 Thread Paul E. McKenney
Because READ_ONCE() now implies read_barrier_depends(), the
read_barrier_depends() in next_desc() is now redundant.  This commit
therefore removes it and the related comments.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: <k...@vger.kernel.org>
Cc: <virtualizat...@lists.linux-foundation.org>
Cc: <netdev@vger.kernel.org>
---
 drivers/vhost/vhost.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 33ac2b186b85..78b5940a415a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1877,12 +1877,7 @@ static unsigned next_desc(struct vhost_virtqueue *vq, 
struct vring_desc *desc)
return -1U;
 
/* Check they're not leading us off end of descriptors. */
-   next = vhost16_to_cpu(vq, desc->next);
-   /* Make sure compiler knows to grab that: we don't want it changing! */
-   /* We will use the result as an index in an array, so most
-* architectures only need a compiler barrier here. */
-   read_barrier_depends();
-
+   next = vhost16_to_cpu(vq, READ_ONCE(desc->next));
return next;
 }
 
-- 
2.5.2



[PATCH tip/core/rcu 03/21] drivers/net/ethernet/qlogic/qed: Fix __qed_spq_block() ordering

2017-12-01 Thread Paul E. McKenney
The __qed_spq_block() function expects an smp_read_barrier_depends()
to order a prior READ_ONCE() against a later load that does not depend
on the prior READ_ONCE(), an expectation that can fail to be met.
This commit therefore replaces the READ_ONCE() with smp_load_acquire()
and removes the smp_read_barrier_depends().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ariel Elior <ariel.el...@cavium.com>
Cc: <everest-linux...@cavium.com>
Cc: <netdev@vger.kernel.org>
---
 drivers/net/ethernet/qlogic/qed/qed_spq.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c 
b/drivers/net/ethernet/qlogic/qed/qed_spq.c
index be48d9abd001..c1237ec58b6c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_spq.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c
@@ -97,9 +97,7 @@ static int __qed_spq_block(struct qed_hwfn *p_hwfn,
 
while (iter_cnt--) {
/* Validate we receive completion update */
-   if (READ_ONCE(comp_done->done) == 1) {
-   /* Read updated FW return value */
-   smp_read_barrier_depends();
+   if (smp_load_acquire(&comp_done->done) == 1) { /* ^^^ */
if (p_fw_ret)
*p_fw_ret = comp_done->fw_return_code;
return 0;
-- 
2.5.2



Re: [Patch net 01/16] net_sched: introduce a workqueue for RCU callbacks of tc filter

2017-10-27 Thread Paul E. McKenney
On Thu, Oct 26, 2017 at 09:39:26PM -0700, Eric Dumazet wrote:
> On Thu, 2017-10-26 at 21:28 -0700, Cong Wang wrote:
> > On Thu, Oct 26, 2017 at 9:05 PM, Eric Dumazet  
> > wrote:
> > > On Thu, 2017-10-26 at 18:24 -0700, Cong Wang wrote:
> > >> ...
> > >
> > >> On the other hand, this makes tcf_block_put() ugly and
> > >> harder to understand. Since David and Eric strongly dislike
> > >> adding synchronize_rcu(), this is probably the only
> > >> solution that could make everyone happy.
> > >
> > >
> > > ...
> > >
> > >> +static void tcf_block_put_deferred(struct work_struct *work)
> > >> +{
> > >> + struct tcf_block *block = container_of(work, struct tcf_block, 
> > >> work);
> > >> + struct tcf_chain *chain;
> > >>
> > >> + rtnl_lock();
> > >>   /* Hold a refcnt for all chains, except 0, in case they are gone. 
> > >> */
> > >>   list_for_each_entry(chain, &block->chain_list, list)
> > >>   if (chain->index)
> > >> @@ -292,13 +308,27 @@ void tcf_block_put(struct tcf_block *block)
> > >>   list_for_each_entry(chain, &block->chain_list, list)
> > >>   tcf_chain_flush(chain);
> > >>
> > >> - /* Wait for RCU callbacks to release the reference count. */
> > >> + INIT_WORK(&block->work, tcf_block_put_final);
> > >> + /* Wait for RCU callbacks to release the reference count and make
> > >> +  * sure their works have been queued before this.
> > >> +  */
> > >>   rcu_barrier();
> > >> + tcf_queue_work(&block->work);
> > >> + rtnl_unlock();
> > >> +}
> > >
> > >
> > > On a loaded server, rcu_barrier() typically takes 4 ms.
> > >
> > > Way better than synchronize_rcu() (about 90 ms) but still an issue when
> > > holding RTNL.
> > >
> > > We have thousands of filters, and management daemon restarts and rebuild
> > > TC hierarchy from scratch.
> > >
> > > Simply getting rid of 1000 old filters might block RTNL for a while, or
> > > maybe I misunderstood your patches.
> > >
> > 
> > Paul pointed out the same.
> > 
> > As I replied, this rcu_barrier() is NOT added by this patchset, it is 
> > already
> > there in current master branch.
> 
> You added the rtnl_lock()  rtnl_unlock()...
> 
> I really do not care if hundreds of tasks (not owning rtnl) call
> rcu_barrier()...
> 
> Also we are still using a 4.3 based kernel, and no rcu_barrier() is used
> in filters dismantle ( unregister_tcf_proto_ops() is not used in our
> workloads )
> 
> Somehow something went very wrong in net/sched in recent kernels.

Would this be a good time for me to repeat my suggestion that timers
be used to aggregate the work done in the workqueue handlers, thus
decreasing the number of rcu_barrier() calls done under RTNL?
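
Very roughly, that aggregation might look like the sketch below (names are
hypothetical and the real filter teardown is elided): RCU callbacks push
items onto a lockless list and arm a short timer, and when the timer fires
it schedules a single work item that takes RTNL once for the whole batch.

struct deferred_destroy {			/* hypothetical per-filter cookie */
	struct llist_node node;
	struct rcu_head rcu;
};

static LLIST_HEAD(destroy_list);
static struct timer_list destroy_timer;
static struct work_struct destroy_work;

/* RCU callback: queue the item and arm the batching timer if needed. */
static void destroy_rcu(struct rcu_head *head)
{
	struct deferred_destroy *d;

	d = container_of(head, struct deferred_destroy, rcu);
	llist_add(&d->node, &destroy_list);
	if (!timer_pending(&destroy_timer))
		mod_timer(&destroy_timer, jiffies + HZ / 10);	/* batching window */
}

/* Timer: hand the accumulated batch off to process context. */
static void destroy_timer_fn(struct timer_list *unused)
{
	schedule_work(&destroy_work);
}

/* Work: one RTNL acquisition for however many items have accumulated. */
static void destroy_work_fn(struct work_struct *unused)
{
	struct llist_node *batch = llist_del_all(&destroy_list);
	struct deferred_destroy *d, *tmp;

	rtnl_lock();
	llist_for_each_entry_safe(d, tmp, batch, node)
		kfree(d);			/* stand-in for the real teardown */
	rtnl_unlock();
}

static void destroy_batch_init(void)
{
	timer_setup(&destroy_timer, destroy_timer_fn, 0);
	INIT_WORK(&destroy_work, destroy_work_fn);
}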

Thanx, Paul



Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC

2017-10-26 Thread Paul E. McKenney
On Wed, Oct 25, 2017 at 09:49:09PM -0700, Cong Wang wrote:
> On Wed, Oct 25, 2017 at 5:19 PM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
> > On Wed, Oct 25, 2017 at 03:37:40PM -0700, Cong Wang wrote:
> >> My solution is introducing a workqueue for tc filters
> >> and let each RCU callback defer the work to this
> >> workqueue. I solve the flush_workqueue() deadlock
> >> by queuing another work in the same workqueue
> >> at the end, so the execution order should be as same
> >> as it is now. The ugly part is now tcf_block_put() which
> >> looks like below:
> >>
> >>
> >> static void tcf_block_put_final(struct work_struct *work)
> >> {
> >> struct tcf_block *block = container_of(work, struct tcf_block, 
> >> work);
> >> struct tcf_chain *chain, *tmp;
> >>
> >> /* At this point, all the chains should have refcnt == 1. */
> >> rtnl_lock();
> >> list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
> >> tcf_chain_put(chain);
> >> rtnl_unlock();
> >> kfree(block);
> >> }
> >
> > I am guessing that tcf_chain_put() sometimes does a call_rcu(),
> > and the callback function in turn calls schedule_work(), and that
> > tcf_block_put_deferred() is the workqueue handler function.
> 
> Yes, tcf_chain_put() triggers call_rcu() indirectly during flush,
> this is why we have rcu_barrier()'s in current code base.

OK, got it.

> >> static void tcf_block_put_deferred(struct work_struct *work)
> >> {
> >> struct tcf_block *block = container_of(work, struct tcf_block, 
> >> work);
> >> struct tcf_chain *chain;
> >>
> >> rtnl_lock();
> >> /* Hold a refcnt for all chains, except 0, in case they are gone. 
> >> */
> >> list_for_each_entry(chain, &block->chain_list, list)
> >> if (chain->index)
> >> tcf_chain_hold(chain);
> >>
> >> /* No race on the list, because no chain could be destroyed. */
> >> list_for_each_entry(chain, &block->chain_list, list)
> >> tcf_chain_flush(chain);
> >>
> >> INIT_WORK(&block->work, tcf_block_put_final);
> >> /* Wait for RCU callbacks to release the reference count and make
> >>  * sure their works have been queued before this.
> >>  */
> >> rcu_barrier();
> >
> > This one can take awhile...  Though in recent kernels it will often
> > be a bit faster than synchronize_rcu().
> 
> It is already in current code base, so it is not introduced here.

Very good, then no problems with added overhead from rcu_barrier().  ;-)

> > Note that rcu_barrier() only waits for callbacks posted via call_rcu()
> > before the rcu_barrier() starts, if that matters.
> 
> Yes, this is exactly what I expect.

Good.

> >> tcf_queue_work(&block->work);
> >> rtnl_unlock();
> >> }
> >
> > And it looks like tcf_block_put_deferred() queues itself as work as well.
> > Or maybe instead?
> 
> Yes, it queues itself after all the works queued via call_rcu(),
> to ensure it is the last.

OK.

> >> void tcf_block_put(struct tcf_block *block)
> >> {
> >> if (!block)
> >> return;
> >>
> >> INIT_WORK(&block->work, tcf_block_put_deferred);
> >> /* Wait for existing RCU callbacks to cool down, make sure their 
> >> works
> >>  * have been queued before this. We can not flush pending works 
> >> here
> >>  * because we are holding the RTNL lock.
> >>  */
> >> rcu_barrier();
> >> tcf_queue_work(&block->work);
> >> }
> >>
> >>
> >> Paul, does this make any sense to you? ;)
> >
> > I would be surprised if I fully understand the problem to be solved,
> > but my current guess is that the constraints are as follows:
> >
> > 1.  Things removed must be processed in order.
> 
> Sort of, RCU callbacks themselves don't have any order, they only
> need to be serialized with RTNL lock, so we have to defer it to a
> workqeueue.

OK, got it.

> What needs a strict order is tcf_block_put() vs. flying works. Before
> tcf_block_put() finishes, all the flying works must be done otherwise
> use-after-free. :-/

So when removing an entire chain, you flush any queued workqueue handlers
to make sure that 

Re: [Patch net 00/15] net_sched: remove RCU callbacks from TC

2017-10-25 Thread Paul E. McKenney
On Wed, Oct 25, 2017 at 03:37:40PM -0700, Cong Wang wrote:
> On Tue, Oct 24, 2017 at 6:43 PM, David Miller  wrote:
> > From: Cong Wang 
> > Date: Mon, 23 Oct 2017 15:02:49 -0700
> >
> >> Recently, the RCU callbacks used in TC filters and TC actions keep
> >> drawing my attention, they introduce at least 4 race condition bugs:
> >
> > Like Eric, I think doing a full RCU sync on every delete is too big
> > a pill to swallow.  This is a major control plane performance
> > regression.
> >
> > Please find another reasonable way to fix this.
> >
> 
> Alright... I finally find a way to make everyone happy.
> 
> My solution is introducing a workqueue for tc filters
> and let each RCU callback defer the work to this
> workqueue. I solve the flush_workqueue() deadlock
> by queuing another work in the same workqueue
> at the end, so the execution order should be as same
> as it is now. The ugly part is now tcf_block_put() which
> looks like below:
> 
> 
> static void tcf_block_put_final(struct work_struct *work)
> {
> struct tcf_block *block = container_of(work, struct tcf_block, work);
> struct tcf_chain *chain, *tmp;
> 
> /* At this point, all the chains should have refcnt == 1. */
> rtnl_lock();
> list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
> tcf_chain_put(chain);
> rtnl_unlock();
> kfree(block);
> }

I am guessing that tcf_chain_put() sometimes does a call_rcu(),
and the callback function in turn calls schedule_work(), and that
tcf_block_put_deferred() is the workqueue handler function.

> static void tcf_block_put_deferred(struct work_struct *work)
> {
> struct tcf_block *block = container_of(work, struct tcf_block, work);
> struct tcf_chain *chain;
> 
> rtnl_lock();
> /* Hold a refcnt for all chains, except 0, in case they are gone. */
> list_for_each_entry(chain, &block->chain_list, list)
> if (chain->index)
> tcf_chain_hold(chain);
> 
> /* No race on the list, because no chain could be destroyed. */
> list_for_each_entry(chain, &block->chain_list, list)
> tcf_chain_flush(chain);
> 
> INIT_WORK(&block->work, tcf_block_put_final);
> /* Wait for RCU callbacks to release the reference count and make
>  * sure their works have been queued before this.
>  */
> rcu_barrier();

This one can take awhile...  Though in recent kernels it will often
be a bit faster than synchronize_rcu().

Note that rcu_barrier() only waits for callbacks posted via call_rcu()
before the rcu_barrier() starts, if that matters.

> tcf_queue_work(&block->work);
> rtnl_unlock();
> }

And it looks like tcf_block_put_deferred() queues itself as work as well.
Or maybe instead?

> void tcf_block_put(struct tcf_block *block)
> {
> if (!block)
> return;
> 
> INIT_WORK(&block->work, tcf_block_put_deferred);
> /* Wait for existing RCU callbacks to cool down, make sure their works
>  * have been queued before this. We can not flush pending works here
>  * because we are holding the RTNL lock.
>  */
> rcu_barrier();
> tcf_queue_work(&block->work);
> }
> 
> 
> Paul, does this make any sense to you? ;)

I would be surprised if I fully understand the problem to be solved,
but my current guess is that the constraints are as follows:

1.  Things removed must be processed in order.

2.  Things removed must not be processed until a grace period
has elapsed.

3.  Things being processed after a grace period should not be
processed concurrently with each other or with subsequent
removals.

4.  A given removal is not finalized until its reference count
reaches zero.

5.  RTNL might not be held when the reference count reaches zero.

Or did I lose the thread somewhere?

Thanx, Paul



Re: Get rid of RCU callbacks in TC filters?

2017-10-20 Thread Paul E. McKenney
On Thu, Oct 19, 2017 at 08:26:01PM -0700, Cong Wang wrote:
> On Wed, Oct 18, 2017 at 12:35 PM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
> > On Wed, Oct 18, 2017 at 10:36:28AM -0700, Cong Wang wrote:
> >> Hi, all
> >>
> >> Recently, the RCU callbacks used in TC filters and TC actions keep
> >> drawing my attention, they introduce at least 4 race condition bugs:
> >>
> >> 1. A simple one fixed by Daniel:
> >>
> >> commit c78e1746d3ad7d548bdf3fe491898cc453911a49
> >> Author: Daniel Borkmann <dan...@iogearbox.net>
> >> Date:   Wed May 20 17:13:33 2015 +0200
> >>
> >> net: sched: fix call_rcu() race on classifier module unloads
> >>
> >> 2. A very nasty one fixed by me:
> >>
> >> commit 1697c4bb5245649a23f06a144cc38c06715e1b65
> >> Author: Cong Wang <xiyou.wangc...@gmail.com>
> >> Date:   Mon Sep 11 16:33:32 2017 -0700
> >>
> >> net_sched: carefully handle tcf_block_put()
> >>
> >> 3. Two more bugs found by Chris:
> >> https://patchwork.ozlabs.org/patch/826696/
> >> https://patchwork.ozlabs.org/patch/826695/
> >>
> >>
> >> Usually RCU callbacks are simple, however for TC filters and actions,
> >> they are complex because at least TC actions could be destroyed
> >> together with the TC filter in one callback. And RCU callbacks are
> >> invoked in BH context, without locking they are parallel too. All of
> >> these contribute to the cause of these nasty bugs. It looks like they
> >> bring us more problems than benefits.
> >>
> >> Therefore, I have been thinking about getting rid of these callbacks,
> >> because they are not strictly necessary, callers of these call_rcu()
> >> are all on slow path and have RTNL lock, so blocking is permitted in
> >> their contexts, and _I think_ it does not harm to use
> >> synchronize_rcu() on slow paths, at least I can argue RTNL lock is
> >> already there and is a bottleneck if we really care. :)
> >>
> >> There are 3 solutions here:
> >>
> >> 1) Get rid of these RCU callbacks and use synchronize_rcu(). The
> >> downside is this could hurt the performance of deleting TC filters,
> >> but again it is slow path comparing to skb classification path. Note,
> >> it is _not_ merely replacing call_rcu() with synchronize_rcu(),
> >> because many call_rcu()'s are actually in list iterations, we have to
> >> use a local list and call list_del_rcu()+list_add() before
> >> synchronize_rcu() (Or is there any other API I am not aware of?). If
> >> people really hate synchronize_rcu() because of performance, we could
> >> also defer the work to a workqueue and callers could keep their
> >> performance as they are.
> >>
> >> 2) Introduce a spinlock to serialize these RCU callbacks. But as I
> >> said in commit 1697c4bb5245 ("net_sched: carefully handle
> >> tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
> >> Potentially we need to do a lot of work to make it possible, if not
> >> impossible.
> >>
> >> 3) Keep these RCU callbacks and fix all race conditions. Like what
> >> Chris tries to do in his patchset, but my argument is that we can not
> >> prove we are really race-free even with Chris' patches and his patches
> >> are already large enough.
> >>
> >>
> >> What do you think? Any other ideas?
> >
> > 4) Move from call_rcu() to synchronize_rcu(), but if feasible use one
> > synchronize_rcu() for multiple deletions/iterations.
> 
> This is what I meant by using a local list, perhaps I didn't make it clear.

Ah, got it.

> > 5) Keep call_rcu(), but have the RCU callback schedule a workqueue.
> > The workqueue could then use blocking primitives, for example, acquiring
> > RTNL.
> 
> Yeah, this could work too but we would get one more async...
> 
> filter delete -> call_rcu() -> schedule_work() -> action destroy

True, but on the other hand you get to hold RTNL.

> > 6) As with #5, have the RCU callback schedule a workqueue, but aggregate
> > workqueue scheduling using a timer.  This would reduce the number of
> > RTNL acquisitions.
> 
> Ouch, sounds like even one more async:
> 
> filter delete -> call_rcu() -> schedule_work() -> timer -> flush_work()
> -> action destroy
> 
> :-(

Indeed, the price of scalability and performance is often added
asynchronous action at a distance.  But sometimes you can have
scalabil

Re: Get rid of RCU callbacks in TC filters?

2017-10-18 Thread Paul E. McKenney
On Wed, Oct 18, 2017 at 10:36:28AM -0700, Cong Wang wrote:
> Hi, all
> 
> Recently, the RCU callbacks used in TC filters and TC actions keep
> drawing my attention, they introduce at least 4 race condition bugs:
> 
> 1. A simple one fixed by Daniel:
> 
> commit c78e1746d3ad7d548bdf3fe491898cc453911a49
> Author: Daniel Borkmann 
> Date:   Wed May 20 17:13:33 2015 +0200
> 
> net: sched: fix call_rcu() race on classifier module unloads
> 
> 2. A very nasty one fixed by me:
> 
> commit 1697c4bb5245649a23f06a144cc38c06715e1b65
> Author: Cong Wang 
> Date:   Mon Sep 11 16:33:32 2017 -0700
> 
> net_sched: carefully handle tcf_block_put()
> 
> 3. Two more bugs found by Chris:
> https://patchwork.ozlabs.org/patch/826696/
> https://patchwork.ozlabs.org/patch/826695/
> 
> 
> Usually RCU callbacks are simple, however for TC filters and actions,
> they are complex because at least TC actions could be destroyed
> together with the TC filter in one callback. And RCU callbacks are
> invoked in BH context, without locking they are parallel too. All of
> these contribute to the cause of these nasty bugs. It looks like they
> bring us more problems than benefits.
> 
> Therefore, I have been thinking about getting rid of these callbacks,
> because they are not strictly necessary, callers of these call_rcu()
> are all on slow path and have RTNL lock, so blocking is permitted in
> their contexts, and _I think_ it does not harm to use
> synchronize_rcu() on slow paths, at least I can argue RTNL lock is
> already there and is a bottleneck if we really care. :)
> 
> There are 3 solutions here:
> 
> 1) Get rid of these RCU callbacks and use synchronize_rcu(). The
> downside is this could hurt the performance of deleting TC filters,
> but again it is slow path comparing to skb classification path. Note,
> it is _not_ merely replacing call_rcu() with synchronize_rcu(),
> because many call_rcu()'s are actually in list iterations, we have to
> use a local list and call list_del_rcu()+list_add() before
> synchronize_rcu() (Or is there any other API I am not aware of?). If
> people really hate synchronize_rcu() because of performance, we could
> also defer the work to a workqueue and callers could keep their
> performance as they are.
> 
> 2) Introduce a spinlock to serialize these RCU callbacks. But as I
> said in commit 1697c4bb5245 ("net_sched: carefully handle
> tcf_block_put()"), it is very hard to do because of tcf_chain_dump().
> Potentially we need to do a lot of work to make it possible, if not
> impossible.
> 
> 3) Keep these RCU callbacks and fix all race conditions. Like what
> Chris tries to do in his patchset, but my argument is that we can not
> prove we are really race-free even with Chris' patches and his patches
> are already large enough.
> 
> 
> What do you think? Any other ideas?

4) Move from call_rcu() to synchronize_rcu(), but if feasible use one
synchronize_rcu() for multiple deletions/iterations.

5) Keep call_rcu(), but have the RCU callback schedule a workqueue.
The workqueue could then use blocking primitives, for example, acquiring
RTNL.

6) As with #5, have the RCU callback schedule a workqueue, but aggregate
workqueue scheduling using a timer.  This would reduce the number of
RTNL acquisitions.

7) As with #5, have the RCU callback schedule a workqueue, but have each
iterator accumulate a list of things removed and do call_rcu() on the
list.  This is an alternative way of aggregating to reduce the number
of RTNL acquisitions.

There are many other ways to skin this cat.
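
To make #5/#7 a bit more concrete, a rough sketch (hypothetical names, real
teardown elided): the deletion path, already holding RTNL, moves everything
it removes onto a private list and issues one call_rcu() for the whole
batch; the RCU callback then punts to a workqueue, which may take RTNL once
per batch.

struct some_filter {				/* hypothetical */
	struct list_head list;
	/* ... */
};

struct destroy_batch {
	struct list_head items;			/* filters unlinked under RTNL */
	struct rcu_head rcu;
	struct work_struct work;
};

static void batch_work_fn(struct work_struct *w)
{
	struct destroy_batch *b = container_of(w, struct destroy_batch, work);
	struct some_filter *f, *tmp;

	rtnl_lock();				/* one acquisition per batch */
	list_for_each_entry_safe(f, tmp, &b->items, list) {
		list_del(&f->list);
		kfree(f);			/* stand-in for the real teardown */
	}
	rtnl_unlock();
	kfree(b);
}

static void batch_rcu_cb(struct rcu_head *head)
{
	struct destroy_batch *b = container_of(head, struct destroy_batch, rcu);

	/* Softirq context: cannot take RTNL here, so defer to a workqueue. */
	INIT_WORK(&b->work, batch_work_fn);
	schedule_work(&b->work);
}

/*
 * Caller holds RTNL and has already moved the removed filters onto
 * b->items (list_del_rcu() from the shared structure, list_add() here).
 */
static void batch_submit(struct destroy_batch *b)
{
	call_rcu(&b->rcu, batch_rcu_cb);
}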

Thanx, Paul



Re: [PATCH 0/4] RCU: introduce noref debug

2017-10-11 Thread Paul E. McKenney
On Wed, Oct 11, 2017 at 04:50:36PM +0200, Paolo Abeni wrote:
> On Tue, 2017-10-10 at 21:02 -0700, Paul E. McKenney wrote:
> > Linus and Ingo will ask me how users decide how they should set that
> > additional build flag.  Especially given that if there is code that
> > requires non-strict checking, isn't everyone required to set up non-strict
> > checking to avoid false positives?  Unless you can also configure out
> > all the code that requires non-strict checking, I suppose, but how
> > would you keep track of what to configure out?
> 
> I'm working to a new version using a single compile flag - without
> additional strict option.
> 
> I don't know of any other subsytem that stores rcu pointer in
> datastructures for a longer amount of time. That having said, I wonder
> if the tests should completely move to the networking subsystem for the
> time being. The Kconfig option would thus be called NET_DEBUG or
> something along the lines. For abstraction it would be possible to add
> an atomic_notifier_chain to the rcu_read/unlock methods, where multiple
> users or checkers could register for. That way we keep the users
> separate from the implementation at the cost of a bit more layering
> and more indirect calls. But given that this will anyway slow down
> execution a lot, it will anyway only be suitable in
> testing/verification/debugging environments.

I like this approach.  And if it does a good job of finding networking
bugs over the next year or so, I would be quite happy to consider
something for all RCU users.

> > OK.  There will probably be some discussion about the API in that case.
> 
> I'll drop noref parameter, the key will became mandatory - the exact
> position of where the reference of RCU managed object is stored. In the
> case of noref dst it is &skb->_skb_refdst. With this kind of API it
> should suite more subsystems.

Interesting.  Do you intend to allow rcu_track_noref() to refuse to
record a pointer?  Other than for the array-full case.

> > True enough.  Except that if people were good about always clearing the
> > pointer, then the pointer couldn't leak, right?  Or am I missing something
> > in your use cases?
> 
> This is correct. The dst_entry checking in skb, which this patch series
> implements there are strict brackets in the sense of skb_dst_set,
> skb_dst_set_noref, skb_dst_force, etc., which form brackets around the
> safe uses of those dst_entries. This patch series validates that the
> correct skb_dst_* functions are being called before the sk_buff leaves
> the rcu protected section. Thus we don't need to modify and review a
> lot of code but we can just patch into those helpers already.

Makes sense.  Those willing to strictly bracket uses gain some debug
assist.

> > Or to put it another way -- have you been able to catch any real
> > pointer-leak bugs with this?  The other RCU
> > debug options have had pretty long found-bug lists before we accepted
> > them.
> 
> There have been two problems found so far, one is a rather minor one
> while the other one seems like a normal bug. The patches for those are
> part of this series (3/4 and 4/4).

I agree that you have started gathering evidence and that the early
indications are positive, if quite mildly so.  If this time next year
there are a few tens of such bugs found, preferably including some
serious ones, I would very likely look favorably on pulling this in to
allow others to use it.

Thanx, Paul



Re: [PATCH 0/4] RCU: introduce noref debug

2017-10-10 Thread Paul E. McKenney
On Mon, Oct 09, 2017 at 06:53:12PM +0200, Paolo Abeni wrote:
> On Fri, 2017-10-06 at 09:34 -0700, Paul E. McKenney wrote:
> > On Fri, Oct 06, 2017 at 05:10:09PM +0200, Paolo Abeni wrote:
> > > Hi,
> > > 
> > > On Fri, 2017-10-06 at 06:34 -0700, Paul E. McKenney wrote:
> > > > On Fri, Oct 06, 2017 at 02:57:45PM +0200, Paolo Abeni wrote:
> > > > > The networking subsystem is currently using some kind of long-lived
> > > > > RCU-protected, references to avoid the overhead of full book-keeping.
> > > > > 
> > > > > Such references - skb_dst() noref - are stored inside the skbs and 
> > > > > can be
> > > > > moved across relevant slices of the network stack, with the users
> > > > > being in charge of properly clearing the relevant skb - or properly 
> > > > > refcount
> > > > > the related dst references - before the skb escapes the RCU section.
> > > > > 
> > > > > We currently don't have any deterministic debug infrastructure to 
> > > > > check
> > > > > the dst noref usages - and the introduction of others noref artifact 
> > > > > is
> > > > > currently under discussion.
> > > > > 
> > > > > This series tries to tackle the above introducing an RCU debug 
> > > > > infrastructure
> > > > > aimed at spotting incorrect noref pointer usage, in patch one. The
> > > > > infrastructure is small and must be explicitly enabled via a newly 
> > > > > introduced
> > > > > build option.
> > > > > 
> > > > > Patch two uses such infrastructure to track dst noref usage in the 
> > > > > networking
> > > > > stack.
> > > > > 
> > > > > Patch 3 and 4 are bugfixes for small buglet found running this 
> > > > > infrastructure
> > > > > on basic scenarios.
> > > 
> > > Thank you for the prompt reply!
> > > > 
> > > > This patchset does not look like it handles rcu_read_lock() nesting.
> > > > For example, given code like this:
> > > > 
> > > > void foo(void)
> > > > {
> > > > rcu_read_lock();
> > > > rcu_track_noref(, , true);
> > > > do_something();
> > > > rcu_track_noref(, , false);
> > > > rcu_read_unlock();
> > > > }
> > > > 
> > > > void bar(void)
> > > > {
> > > > rcu_read_lock();
> > > > rcu_track_noref(, , true);
> > > > do_something_more();
> > > > foo();
> > > > do_something_else();
> > > > rcu_track_noref(, , false);
> > > > rcu_read_unlock();
> > > > }
> > > > 
> > > > void grill(void)
> > > > {
> > > > foo();
> > > > }
> > > > 
> > > > It looks like foo()'s rcu_read_unlock() will complain about key1.
> > > > You could remove foo()'s rcu_read_lock() and rcu_read_unlock(), but
> > > > that will break the call from grill().
> > > 
> > > Actually the code should cope correctly with your example; when foo()'s
> > > rcu_read_unlock() is called, 'cache' contains:
> > > 
> > > { { , , 1},  // ...
> > > 
> > > and when the related __rcu_check_noref() is invoked preempt_count() is
> > > 2 - because the check is called before decreasing the preempt counter.
> > > 
> > > In the main loop inside __rcu_check_noref() we will hit always the
> > > 'continue' statement because 'cache->store[i].nesting != nesting', so
> > > no warn will be triggered.
> > 
> > You are right, it was too early, and my example wasn't correct.  How
> > about this one?
> > 
> > void foo(void (*f)(struct s *sp), struct s **spp)
> > {
> > rcu_read_lock();
> > rcu_track_noref(, , true);
> > f(spp);
> > rcu_track_noref(, , false);
> > rcu_read_unlock();
> > }
> > 
> > void barcb(struct s **spp)
> > {
> > *spp = 
> > rcu_track_noref(, *spp, true);
> > }
> > 
> >

Re: [PATCH RFC tip/core/rcu 14/15] netfilter: Remove now-redundant smp_read_barrier_depends()

2017-10-10 Thread Paul E. McKenney
On Tue, Oct 10, 2017 at 10:43:34AM +0200, Peter Zijlstra wrote:
> On Mon, Oct 09, 2017 at 05:22:48PM -0700, Paul E. McKenney wrote:
> > READ_ONCE() now implies smp_read_barrier_depends(), which means that
> > the instances in arpt_do_table(), ipt_do_table(), and ip6t_do_table()
> > are now redundant.  This commit removes them and adjusts the comments.
> 
> Similar to the previous patch, the lack of READ_ONCE() in the original
> code is a pre-existing bug. It would allow the compiler to tear the load
> and observe a composite of two difference pointer values, or reload the
> private pointer and result in table_base and jumpstacl being part of
> different objects.
> 
> It would be good to point out this actually fixes a bug in the code.

Assuming that these changes actually fixed something, agreed.  ;-)

Thanx, Paul



[PATCH RFC tip/core/rcu 03/15] drivers/net/ethernet/qlogic/qed: Fix __qed_spq_block() ordering

2017-10-09 Thread Paul E. McKenney
The __qed_spq_block() function expects an smp_read_barrier_depends()
to order a prior READ_ONCE() against a later load that does not depend
on the prior READ_ONCE(), an expectation that can fail to be met.
This commit therefore replaces the READ_ONCE() with smp_load_acquire()
and removes the smp_read_barrier_depends().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ariel Elior <ariel.el...@cavium.com>
Cc: <everest-linux...@cavium.com>
Cc: <netdev@vger.kernel.org>
---
 drivers/net/ethernet/qlogic/qed/qed_spq.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_spq.c 
b/drivers/net/ethernet/qlogic/qed/qed_spq.c
index be48d9abd001..c1237ec58b6c 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_spq.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_spq.c
@@ -97,9 +97,7 @@ static int __qed_spq_block(struct qed_hwfn *p_hwfn,
 
while (iter_cnt--) {
/* Validate we receive completion update */
-   if (READ_ONCE(comp_done->done) == 1) {
-   /* Read updated FW return value */
-   smp_read_barrier_depends();
+   if (smp_load_acquire(&comp_done->done) == 1) { /* ^^^ */
if (p_fw_ret)
*p_fw_ret = comp_done->fw_return_code;
return 0;
-- 
2.5.2



[PATCH RFC tip/core/rcu 14/15] netfilter: Remove now-redundant smp_read_barrier_depends()

2017-10-09 Thread Paul E. McKenney
READ_ONCE() now implies smp_read_barrier_depends(), which means that
the instances in arpt_do_table(), ipt_do_table(), and ip6t_do_table()
are now redundant.  This commit removes them and adjusts the comments.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: Jozsef Kadlecsik <kad...@blackhole.kfki.hu>
Cc: Florian Westphal <f...@strlen.de>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: <netfilter-de...@vger.kernel.org>
Cc: <coret...@netfilter.org>
Cc: <netdev@vger.kernel.org>
---
 net/ipv4/netfilter/arp_tables.c | 7 +--
 net/ipv4/netfilter/ip_tables.c  | 7 +--
 net/ipv6/netfilter/ip6_tables.c | 7 +--
 3 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 9e2770fd00be..d555b3b31c49 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -202,13 +202,8 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
table_base = private->entries;
jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
 
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 39286e543ee6..f63752bec442 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -260,13 +260,8 @@ ipt_do_table(struct sk_buff *skb,
WARN_ON(!(table->valid_hooks & (1 << hook)));
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu= smp_processor_id();
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
table_base = private->entries;
jumpstack  = (struct ipt_entry **)private->jumpstack[cpu];
 
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 01bd3ee5ebc6..52afcab9b0d6 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -282,12 +282,7 @@ ip6t_do_table(struct sk_buff *skb,
 
local_bh_disable();
addend = xt_write_recseq_begin();
-   private = table->private;
-   /*
-* Ensure we load private-> members after we've fetched the base
-* pointer.
-*/
-   smp_read_barrier_depends();
+   private = READ_ONCE(table->private); /* Address dependency. */
cpu= smp_processor_id();
table_base = private->entries;
jumpstack  = (struct ip6t_entry **)private->jumpstack[cpu];
-- 
2.5.2



Re: [PATCH 0/4] RCU: introduce noref debug

2017-10-06 Thread Paul E. McKenney
On Fri, Oct 06, 2017 at 05:10:09PM +0200, Paolo Abeni wrote:
> Hi,
> 
> On Fri, 2017-10-06 at 06:34 -0700, Paul E. McKenney wrote:
> > On Fri, Oct 06, 2017 at 02:57:45PM +0200, Paolo Abeni wrote:
> > > The networking subsystem is currently using some kind of long-lived
> > > RCU-protected, references to avoid the overhead of full book-keeping.
> > > 
> > > Such references - skb_dst() noref - are stored inside the skbs and can be
> > > moved across relevant slices of the network stack, with the users
> > > being in charge of properly clearing the relevant skb - or properly 
> > > refcount
> > > the related dst references - before the skb escapes the RCU section.
> > > 
> > > We currently don't have any deterministic debug infrastructure to check
> > > the dst noref usages - and the introduction of others noref artifact is
> > > currently under discussion.
> > > 
> > > This series tries to tackle the above introducing an RCU debug 
> > > infrastructure
> > > aimed at spotting incorrect noref pointer usage, in patch one. The
> > > infrastructure is small and must be explicitly enabled via a newly 
> > > introduced
> > > build option.
> > > 
> > > Patch two uses such infrastructure to track dst noref usage in the 
> > > networking
> > > stack.
> > > 
> > > Patch 3 and 4 are bugfixes for small buglet found running this 
> > > infrastructure
> > > on basic scenarios.
> 
> Thank you for the prompt reply!
> > 
> > This patchset does not look like it handles rcu_read_lock() nesting.
> > For example, given code like this:
> > 
> > void foo(void)
> > {
> > rcu_read_lock();
> > rcu_track_noref(, , true);
> > do_something();
> > rcu_track_noref(, , false);
> > rcu_read_unlock();
> > }
> > 
> > void bar(void)
> > {
> > rcu_read_lock();
> > rcu_track_noref(, , true);
> > do_something_more();
> > foo();
> > do_something_else();
> > rcu_track_noref(, , false);
> > rcu_read_unlock();
> > }
> > 
> > void grill(void)
> > {
> > foo();
> > }
> > 
> > It looks like foo()'s rcu_read_unlock() will complain about key1.
> > You could remove foo()'s rcu_read_lock() and rcu_read_unlock(), but
> > that will break the call from grill().
> 
> Actually the code should cope correctly with your example; when foo()'s
> rcu_read_unlock() is called, 'cache' contains:
> 
> { { , , 1},  // ...
> 
> and when the related __rcu_check_noref() is invoked preempt_count() is
> 2 - because the check is called before decreasing the preempt counter.
> 
> In the main loop inside __rcu_check_noref() we will hit always the
> 'continue' statement because 'cache->store[i].nesting != nesting', so
> no warn will be triggered.

You are right, it was too early, and my example wasn't correct.  How
about this one?

void foo(void (*f)(struct s *sp), struct s **spp)
{
rcu_read_lock();
rcu_track_noref(, , true);
f(spp);
rcu_track_noref(, , false);
rcu_read_unlock();
}

void barcb(struct s **spp)
{
*spp = 
rcu_track_noref(, *spp, true);
}

void bar(void)
{
struct s *sp;

rcu_read_lock();
rcu_track_noref(, , true);
do_something_more();
foo(barcb, );
do_something_else(sp);
rcu_track_noref(, sp, false);
rcu_track_noref(, , false);
rcu_read_unlock();
}

void grillcb(struct s **spp)
{
*spp
}

void grill(void)
{
foo();
}

How does the user select the key argument?  It looks like someone
can choose to just pass in NULL -- is that the intent, or am I confused
about this as well?

> > Or am I missing something subtle here?  Given patch 3/4, I suspect not...
> 
> The problem with the code in patch 3/4 is different; currently
> ip_route_input_noref() is basically doing:
> 
> rcu_read_lock();
> 
> rcu_track_noref(, , true);
> 
> rcu_read_unlock();
> 
> So the rcu lock there silence any RCU based check inside
> ip_route_input_noref() but does not really protect the noref dst.
> 
> Please let me know if the above clarify the scenario.

OK.

I am also concerned about false negatives when the user invokes
rcu_track_noref(..., false) but then leaks the pointer anyway.  Or is
there something you are doing that catches this that I am missing?
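
For concreteness, a sketch of the sort of leak I have in mind; only
rcu_track_noref() is from patch 1, while the key argument, get_noref_dst(),
and global_dst are made-up placeholders:

        extern struct dst_entry *get_noref_dst(void);   /* placeholder */
        static struct dst_entry *global_dst;            /* placeholder */

        void leaky(void *key)
        {
                struct dst_entry *dst;

                rcu_read_lock();
                dst = get_noref_dst();
                rcu_track_noref(key, dst, true);
                /* ... */
                rcu_track_noref(key, dst, false);  /* tracker sees a clean exit */
                global_dst = dst;                  /* but the pointer escapes anyway */
                rcu_read_unlock();
                /*
                 * global_dst now outlives the RCU read-side critical section
                 * with no reference held, and nothing above will warn.
                 */
        }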

Thanx, Paul



Re: [PATCH 0/4] RCU: introduce noref debug

2017-10-06 Thread Paul E. McKenney
On Fri, Oct 06, 2017 at 02:57:45PM +0200, Paolo Abeni wrote:
> The networking subsystem is currently using some kind of long-lived
> RCU-protected, references to avoid the overhead of full book-keeping.
> 
> Such references - skb_dst() noref - are stored inside the skbs and can be
> moved across relevant slices of the network stack, with the users
> being in charge of properly clearing the relevant skb - or properly refcount
> the related dst references - before the skb escapes the RCU section.
> 
> We currently don't have any deterministic debug infrastructure to check
> the dst noref usages - and the introduction of others noref artifact is
> currently under discussion.
> 
> This series tries to tackle the above introducing an RCU debug infrastructure
> aimed at spotting incorrect noref pointer usage, in patch one. The
> infrastructure is small and must be explicitly enabled via a newly introduced
> build option.
> 
> Patch two uses such infrastructure to track dst noref usage in the networking
> stack.
> 
> Patch 3 and 4 are bugfixes for small buglet found running this infrastructure
> on basic scenarios.

This patchset does not look like it handles rcu_read_lock() nesting.
For example, given code like this:

void foo(void)
{
rcu_read_lock();
rcu_track_noref(, , true);
do_something();
rcu_track_noref(, , false);
rcu_read_unlock();
}

void bar(void)
{
rcu_read_lock();
rcu_track_noref(, , true);
do_something_more();
foo();
do_something_else();
rcu_track_noref(, , false);
rcu_read_unlock();
}

void grill(void)
{
foo();
}

It looks like foo()'s rcu_read_unlock() will complain about key1.
You could remove foo()'s rcu_read_lock() and rcu_read_unlock(), but
that will break the call from grill().

Or am I missing something subtle here?  Given patch 3/4, I suspect not...

Thanx, Paul

> Paolo Abeni (4):
>   rcu: introduce noref debug
>   net: use RCU noref infrastructure to track dst noref
>   ipv4: drop unneeded and misleading RCU lock in ip_route_input_noref()
>   tcp: avoid noref dst leak on input path
> 
>  include/linux/rcupdate.h | 11 ++
>  include/linux/skbuff.h   |  1 +
>  include/net/dst.h|  5 +++
>  kernel/rcu/Kconfig.debug | 15 
>  kernel/rcu/Makefile  |  1 +
>  kernel/rcu/noref_debug.c | 89 
> 
>  net/ipv4/route.c |  7 +---
>  net/ipv4/tcp_input.c |  5 ++-
>  8 files changed, 127 insertions(+), 7 deletions(-)
>  create mode 100644 kernel/rcu/noref_debug.c
> 
> -- 
> 2.13.6
> 



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Paul E. McKenney
On Sat, Jul 08, 2017 at 02:30:19PM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:
> 
> > On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:
> > > 
> > > * Manfred Spraul <manf...@colorfullife.com> wrote:
> > > 
> > > > Hi Ingo,
> > > > 
> > > > On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> > > > > 
> > > > > There's another, probably just as significant advantage: 
> > > > > queued_spin_unlock_wait()
> > > > > is 'read-only', while spin_lock()+spin_unlock() dirties the lock 
> > > > > cache line. On
> > > > > any bigger system this should make a very measurable difference - if
> > > > > spin_unlock_wait() is ever used in a performance critical code path.
> > > > At least for ipc/sem:
> > > > Dirtying the cacheline (in the slow path) allows to remove a smp_mb() 
> > > > in the
> > > > hot path.
> > > > So for sem_lock(), I either need a primitive that dirties the cacheline 
> > > > or
> > > > sem_lock() must continue to use spin_lock()/spin_unlock().
> > > 
> > > Technically you could use spin_trylock()+spin_unlock() and avoid the lock 
> > > acquire 
> > > spinning on spin_unlock() and get very close to the slow path performance 
> > > of a 
> > > pure cacheline-dirtying behavior.
> > > 
> > > But adding something like spin_barrier(), which purely dirties the lock 
> > > cacheline, 
> > > would be even faster, right?
> > 
> > Interestingly enough, the arm64 and powerpc implementations of
> > spin_unlock_wait() were very close to what it sounds like you are
> > describing.
> 
> So could we perhaps solve all our problems by defining the generic version 
> thusly:
> 
> void spin_unlock_wait(spinlock_t *lock)
> {
>   if (spin_trylock(lock))
>   spin_unlock(lock);
> }
> 
> ... and perhaps rename it to spin_barrier() [or whatever proper name there 
> would 
> be]?

As lockdep, 0day Test Robot, Linus Torvalds, and several others let me
know in response to my original (thankfully RFC!) patch series, this needs
to disable irqs to work in the general case.  For example, if the lock
in question is an irq-disabling lock, you take an interrupt just after
a successful spin_trylock(), and that interrupt acquires the same lock,
the actuarial statistics of your kernel degrade sharply and suddenly.

What I get for sending out untested patches!  :-/
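
For the record, what the series falls back on is an irq-disabling
lock/unlock pair, roughly as below; the spin_barrier() name is hypothetical:

        static inline void spin_barrier(spinlock_t *lock)
        {
                unsigned long flags;

                /*
                 * Disabling interrupts closes the self-deadlock described
                 * above when the lock can also be taken from irq context.
                 */
                spin_lock_irqsave(lock, flags);
                spin_unlock_irqrestore(lock, flags);
        }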

> Architectures can still optimize it, to remove the small window where the 
> lock is 
> held locally - as long as the ordering is at least as strong as the generic 
> version.
> 
> This would have various advantages:
> 
>  - semantics are well-defined
> 
>  - the generic implementation is already pretty well optimized (no spinning)
> 
>  - it would make it usable for the IPC performance optimization
> 
>  - architectures could still optimize it to eliminate the window where the 
> lock is
>held locally - if there's such instructions available.
> 
> Was this proposed before, or am I missing something?

It was sort of proposed...

https://marc.info/?l=linux-arch=149912878628355=2

But do we have a situation where normal usage of spin_lock() and
spin_unlock() is causing performance or scalability trouble?

(We do have at least one situation in fnic that appears to be buggy use of
spin_is_locked(), and proposing a patch for that case in on my todo list.)

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Paul E. McKenney
On Sat, Jul 08, 2017 at 10:43:24AM +0200, Ingo Molnar wrote:
> 
> * Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote:
> 
> > On Fri, Jul 07, 2017 at 10:31:28AM +0200, Ingo Molnar wrote:
> > 
> > [ . . . ]
> > 
> > > In fact I'd argue that any future high performance spin_unlock_wait() 
> > > user is 
> > > probably better off open coding the unlock-wait poll loop (and possibly 
> > > thinking 
> > > hard about eliminating it altogether). If such patterns pop up in the 
> > > kernel we 
> > > can think about consolidating them into a single read-only primitive 
> > > again.
> > 
> > I would like any reintroduction to include a header comment saying exactly
> > what the consolidated primitive actually does and does not do.  ;-)
> > 
> > > I.e. I think the proposed changes are doing no harm, and the 
> > > unavailability of a 
> > > generic primitive does not hinder future optimizations either in any 
> > > significant 
> > > fashion.
> > 
> > I will have a v3 with updated comments from Manfred.  Thoughts on when/where
> > to push this?
> 
> Once everyone agrees I can apply it to the locking tree. I think PeterZ's was 
> the 
> only objection?

Oleg wasn't all that happy, either, but he did supply the relevant patch.

> > The reason I ask is if this does not go in during this merge window, I need
> > to fix the header comment on spin_unlock_wait().
> 
> Can try it next week after some testing - let's see how busy things get for 
> Linus 
> in the merge window?

Sounds good!  Either way is fine with me.

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-08 Thread Paul E. McKenney
On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:
> 
> * Manfred Spraul  wrote:
> 
> > Hi Ingo,
> > 
> > On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> > > 
> > > There's another, probably just as significant advantage: 
> > > queued_spin_unlock_wait()
> > > is 'read-only', while spin_lock()+spin_unlock() dirties the lock cache 
> > > line. On
> > > any bigger system this should make a very measurable difference - if
> > > spin_unlock_wait() is ever used in a performance critical code path.
> > At least for ipc/sem:
> > Dirtying the cacheline (in the slow path) allows to remove a smp_mb() in the
> > hot path.
> > So for sem_lock(), I either need a primitive that dirties the cacheline or
> > sem_lock() must continue to use spin_lock()/spin_unlock().
> 
> Technically you could use spin_trylock()+spin_unlock() and avoid the lock 
> acquire 
> spinning on spin_unlock() and get very close to the slow path performance of 
> a 
> pure cacheline-dirtying behavior.
> 
> But adding something like spin_barrier(), which purely dirties the lock 
> cacheline, 
> would be even faster, right?

Interestingly enough, the arm64 and powerpc implementations of
spin_unlock_wait() were very close to what it sounds like you are
describing.

Thanx, Paul



[PATCH v3 3/9] sched: Replace spin_unlock_wait() with lock/unlock pair

2017-07-07 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
do_task_dead() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because the lock is
this task's ->pi_lock, and this is called only after the task exits.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
[ paulmck: Replace leading smp_mb() with smp_mb__before_spinlock(),
  courtesy of Arnd Bergmann's noting its odd location. ]
---
 kernel/sched/core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e91138fcde86..48a8760fedf4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3460,8 +3460,9 @@ void __noreturn do_task_dead(void)
 * To avoid it, we have to wait for releasing tsk->pi_lock which
 * is held by try_to_wake_up()
 */
-   smp_mb();
-   raw_spin_unlock_wait(&current->pi_lock);
+   smp_mb__before_spinlock();
+   raw_spin_lock_irq(&current->pi_lock);
+   raw_spin_unlock_irq(&current->pi_lock);
 
/* Causes final put_task_struct in finish_task_switch(): */
__set_current_state(TASK_DEAD);
-- 
2.5.2



[PATCH v3 7/9] drivers/ata: Replace spin_unlock_wait() with lock/unlock pair

2017-07-07 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore eliminates the spin_unlock_wait() call and
associated else-clause and hoists the then-clause's lock and unlock out of
the "if" statement.  This should be safe from a performance perspective
because according to Tejun there should be few if any drivers that don't
set their own error handler.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Acked-by: Tejun Heo <t...@kernel.org>
Cc: <linux-...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 drivers/ata/libata-eh.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index ef68232b5222..779f6f18c1f4 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -645,12 +645,11 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, 
struct ata_port *ap,
 * completions are honored.  A scmd is determined to have
 * timed out iff its associated qc is active and not failed.
 */
+   spin_lock_irqsave(ap->lock, flags);
if (ap->ops->error_handler) {
struct scsi_cmnd *scmd, *tmp;
int nr_timedout = 0;
 
-   spin_lock_irqsave(ap->lock, flags);
-
/* This must occur under the ap->lock as we don't want
   a polled recovery to race the real interrupt handler
 
@@ -700,12 +699,11 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, 
struct ata_port *ap,
if (nr_timedout)
__ata_port_freeze(ap);
 
-   spin_unlock_irqrestore(ap->lock, flags);
 
/* initialize eh_tries */
ap->eh_tries = ATA_EH_MAX_TRIES;
-   } else
-   spin_unlock_wait(ap->lock);
+   }
+   spin_unlock_irqrestore(ap->lock, flags);
 
 }
 EXPORT_SYMBOL(ata_scsi_cmd_error_handler);
-- 
2.5.2



[PATCH v3 2/9] task_work: Replace spin_unlock_wait() with lock/unlock pair

2017-07-07 Thread Paul E. McKenney
From: Oleg Nesterov <o...@redhat.com>

There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
task_work_run() with a spin_lock_irq() and a spin_unlock_irq() around
the cmpxchg() dequeue loop.  This should be safe from a performance
perspective because ->pi_lock is local to the task and because calls to
the other side of the race, task_work_cancel(), should be rare.

Signed-off-by: Oleg Nesterov <o...@redhat.com>
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
---
 kernel/task_work.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index d513051fcca2..836a72a66fba 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -96,20 +96,16 @@ void task_work_run(void)
 * work->func() can do task_work_add(), do not set
 * work_exited unless the list is empty.
 */
+   raw_spin_lock_irq(&task->pi_lock);
do {
work = READ_ONCE(task->task_works);
head = !work && (task->flags & PF_EXITING) ?
&work_exited : NULL;
} while (cmpxchg(&task->task_works, work, head) != work);
+   raw_spin_unlock_irq(&task->pi_lock);
 
if (!work)
break;
-   /*
-* Synchronize with task_work_cancel(). It can't remove
-* the first entry == work, cmpxchg(task_works) should
-* fail, but it can play with *work and other entries.
-*/
-   raw_spin_unlock_wait(&task->pi_lock);
 
do {
next = work->next;
-- 
2.5.2



[PATCH v3 4/9] completion: Replace spin_unlock_wait() with lock/unlock pair

2017-07-07 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
completion_done() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because the lock
will be held only briefly, since the wakeup happens really quickly.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 kernel/sched/completion.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 53f9558fa925..3e66712e1964 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -307,14 +307,9 @@ bool completion_done(struct completion *x)
 * If ->done, we need to wait for complete() to release ->wait.lock
 * otherwise we can end up freeing the completion before complete()
 * is done referencing it.
-*
-* The RMB pairs with complete()'s RELEASE of ->wait.lock and orders
-* the loads of ->done and ->wait.lock such that we cannot observe
-* the lock before complete() acquires it while observing the ->done
-* after it's acquired the lock.
 */
-   smp_rmb();
-   spin_unlock_wait(&x->wait.lock);
+   spin_lock_irq(&x->wait.lock);
+   spin_unlock_irq(&x->wait.lock);
return true;
 }
 EXPORT_SYMBOL(completion_done);
-- 
2.5.2



[PATCH v3 5/9] exit: Replace spin_unlock_wait() with lock/unlock pair

2017-07-07 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and
it appears that all callers could do just as well with a lock/unlock pair.
This commit therefore replaces the spin_unlock_wait() call in do_exit()
with spin_lock() followed immediately by spin_unlock().  This should be
safe from a performance perspective because the lock is a per-task lock,
and this is happening only at task-exit time.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 kernel/exit.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 516acdb0e0ec..6d19c9090d43 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -832,7 +832,8 @@ void __noreturn do_exit(long code)
 * Ensure that we must observe the pi_state in exit_mm() ->
 * mm_release() -> exit_pi_state_list().
 */
-   raw_spin_unlock_wait(&tsk->pi_lock);
+   raw_spin_lock_irq(&tsk->pi_lock);
+   raw_spin_unlock_irq(&tsk->pi_lock);
 
if (unlikely(in_atomic())) {
pr_info("note: %s[%d] exited with preempt_count %d\n",
-- 
2.5.2



[PATCH v3 1/9] net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()

2017-07-07 Thread Paul E. McKenney
From: Manfred Spraul <manf...@colorfullife.com>

As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().

V2: With improved comments, to clearly show how the barriers
pair.

Fixes: b16c29191dc8 ("netfilter: nf_conntrack: use safer way to lock all 
buckets")
Signed-off-by: Manfred Spraul <manf...@colorfullife.com>
Cc: <sta...@vger.kernel.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Sasha Levin <sasha.le...@oracle.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: netfilter-de...@vger.kernel.org
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
---
 net/netfilter/nf_conntrack_core.c | 52 ++-
 1 file changed, 29 insertions(+), 23 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index e847dbaa0c6b..e0560c41e371 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -96,19 +96,26 @@ static struct conntrack_gc_work conntrack_gc_work;
 
 void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
 {
+   /* 1) Acquire the lock */
spin_lock(lock);
-   while (unlikely(nf_conntrack_locks_all)) {
-   spin_unlock(lock);
 
-   /*
-* Order the 'nf_conntrack_locks_all' load vs. the
-* spin_unlock_wait() loads below, to ensure
-* that 'nf_conntrack_locks_all_lock' is indeed held:
-*/
-   smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
-   spin_unlock_wait(&nf_conntrack_locks_all_lock);
-   spin_lock(lock);
-   }
+   /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics
+* It pairs with the smp_store_release() in nf_conntrack_all_unlock()
+*/
+   if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
+   return;
+
+   /* fast path failed, unlock */
+   spin_unlock(lock);
+
+   /* Slow path 1) get global lock */
+   spin_lock(&nf_conntrack_locks_all_lock);
+
+   /* Slow path 2) get the lock we want */
+   spin_lock(lock);
+
+   /* Slow path 3) release the global lock */
+   spin_unlock(&nf_conntrack_locks_all_lock);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_lock);
 
@@ -149,28 +156,27 @@ static void nf_conntrack_all_lock(void)
int i;
 
spin_lock(&nf_conntrack_locks_all_lock);
-   nf_conntrack_locks_all = true;
 
-   /*
-* Order the above store of 'nf_conntrack_locks_all' against
-* the spin_unlock_wait() loads below, such that if
-* nf_conntrack_lock() observes 'nf_conntrack_locks_all'
-* we must observe nf_conntrack_locks[] held:
-*/
-   smp_mb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
+   nf_conntrack_locks_all = true;
 
for (i = 0; i < CONNTRACK_LOCKS; i++) {
-   spin_unlock_wait(&nf_conntrack_locks[i]);
+   spin_lock(&nf_conntrack_locks[i]);
+
+   /* This spin_unlock provides the "release" to ensure that
+* nf_conntrack_locks_all==true is visible to everyone that
+* acquired spin_lock(&nf_conntrack_locks[]).
+*/
+   spin_unlock(&nf_conntrack_locks[i]);
}
 }
 
 static void nf_conntrack_all_unlock(void)
 {
-   /*
-* All prior stores must be complete before we clear
+   /* All prior stores must be complete before we clear
 * 'nf_conntrack_locks_all'. Otherwise nf_conntrack_lock()
 * might observe the false value but not the entire
-* critical section:
+* critical section.
+* It pairs with the smp_load_acquire() in nf_conntrack_lock()
 */
smp_store_release(&nf_conntrack_locks_all, false);
spin_unlock(&nf_conntrack_locks_all_lock);
-- 
2.5.2



[PATCH v3 9/9] arch: Remove spin_unlock_wait() arch-specific definitions

2017-07-07 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait() for all architectures providing them.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: <linux-a...@vger.kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Acked-by: Will Deacon <will.dea...@arm.com>
Acked-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/alpha/include/asm/spinlock.h|  5 
 arch/arc/include/asm/spinlock.h  |  5 
 arch/arm/include/asm/spinlock.h  | 16 --
 arch/arm64/include/asm/spinlock.h| 58 
 arch/blackfin/include/asm/spinlock.h |  5 
 arch/hexagon/include/asm/spinlock.h  |  5 
 arch/ia64/include/asm/spinlock.h | 21 -
 arch/m32r/include/asm/spinlock.h |  5 
 arch/metag/include/asm/spinlock.h|  5 
 arch/mips/include/asm/spinlock.h | 16 --
 arch/mn10300/include/asm/spinlock.h  |  5 
 arch/parisc/include/asm/spinlock.h   |  7 -
 arch/powerpc/include/asm/spinlock.h  | 33 
 arch/s390/include/asm/spinlock.h |  7 -
 arch/sh/include/asm/spinlock-cas.h   |  5 
 arch/sh/include/asm/spinlock-llsc.h  |  5 
 arch/sparc/include/asm/spinlock_32.h |  5 
 arch/sparc/include/asm/spinlock_64.h |  5 
 arch/tile/include/asm/spinlock_32.h  |  2 --
 arch/tile/include/asm/spinlock_64.h  |  2 --
 arch/tile/lib/spinlock_32.c  | 23 --
 arch/tile/lib/spinlock_64.c  | 22 --
 arch/xtensa/include/asm/spinlock.h   |  5 
 23 files changed, 5 insertions(+), 262 deletions(-)

diff --git a/arch/alpha/include/asm/spinlock.h 
b/arch/alpha/include/asm/spinlock.h
index a40b9fc0c6c3..718ac0b64adf 100644
--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 #define arch_spin_is_locked(x) ((x)->lock != 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 return lock.lock == 0;
diff --git a/arch/arc/include/asm/spinlock.h b/arch/arc/include/asm/spinlock.h
index 233d5ffe6ec7..a325e6a36523 100644
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_is_locked(x) ((x)->slock != __ARCH_SPIN_LOCK_UNLOCKED__)
 #define arch_spin_lock_flags(lock, flags)  arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->slock, !VAL);
-}
-
 #ifdef CONFIG_ARC_HAS_LLSC
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 4bec45442072..c030143c18c6 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -52,22 +52,6 @@ static inline void dsb_sev(void)
  * memory.
  */
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   u16 owner = READ_ONCE(lock->tickets.owner);
-
-   for (;;) {
-   arch_spinlock_t tmp = READ_ONCE(*lock);
-
-   if (tmp.tickets.owner == tmp.tickets.next ||
-   tmp.tickets.owner != owner)
-   break;
-
-   wfe();
-   }
-   smp_acquire__after_ctrl_dep();
-}
-
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
diff --git a/arch/arm64/include/asm/spinlock.h 
b/arch/arm64/include/asm/spinlock.h
index cae331d553f8..f445bd7f2b9f 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,58 +26,6 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   unsigned int tmp;
-   arch_spinlock_t lockval;
-   u32 owner;
-
-   /*
-* Ensure prior spin_lock operations to other locks have completed
-* on this CPU before we test whether "lock" is locked.
-*/
-   smp_mb();
-   owner = READ_ONCE(lock->owner) << 16;
-
-   asm volatile(
-"  sevl\n"
-"1:wfe\n"
-"2:ldaxr   %w0, %2\n"
-   /* Is the lock free? */
-"  eor %w1, %w0, %w0, ror #16\n"
-"  cbz %w1, 3f\n"
-   /* Lock taken -- has there been a subsequent unlock->lock transition? */
-"  eor %w1, %w3, %w0, lsl #16\n"
-"  cbz 

[PATCH v3 6/9] ipc: Replace spin_unlock_wait() with lock/unlock pair

2017-07-07 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
exit_sem() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because exit_sem()
is rarely invoked in production.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Davidlohr Bueso <d...@stgolabs.net>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Acked-by: Manfred Spraul <manf...@colorfullife.com>
---
 ipc/sem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 947dc2348271..e88d0749a929 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2096,7 +2096,8 @@ void exit_sem(struct task_struct *tsk)
 * possibility where we exit while freeary() didn't
 * finish unlocking sem_undo_list.
 */
-   spin_unlock_wait(&ulp->lock);
+   spin_lock(&ulp->lock);
+   spin_unlock(&ulp->lock);
rcu_read_unlock();
break;
}
-- 
2.5.2



[PATCH v3 8/9] locking: Remove spin_unlock_wait() generic definitions

2017-07-07 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes spin_unlock_wait() and related
definitions from core code.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Arnd Bergmann <a...@arndb.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 include/asm-generic/qspinlock.h |  14 -
 include/linux/spinlock.h|  31 ---
 include/linux/spinlock_up.h |   6 ---
 kernel/locking/qspinlock.c  | 117 
 4 files changed, 168 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 9f0681bf1e87..66260777d644 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -22,17 +22,6 @@
 #include 
 
 /**
- * queued_spin_unlock_wait - wait until the _current_ lock holder releases the 
lock
- * @lock : Pointer to queued spinlock structure
- *
- * There is a very slight possibility of live-lock if the lockers keep coming
- * and the waiter is just unfortunate enough to not see any unlock state.
- */
-#ifndef queued_spin_unlock_wait
-extern void queued_spin_unlock_wait(struct qspinlock *lock);
-#endif
-
-/**
  * queued_spin_is_locked - is the spinlock locked?
  * @lock: Pointer to queued spinlock structure
  * Return: 1 if it is locked, 0 otherwise
@@ -41,8 +30,6 @@ extern void queued_spin_unlock_wait(struct qspinlock *lock);
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
/*
-* See queued_spin_unlock_wait().
-*
 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
 * isn't immediately observable.
 */
@@ -135,6 +122,5 @@ static __always_inline bool virt_spin_lock(struct qspinlock 
*lock)
 #define arch_spin_trylock(l)   queued_spin_trylock(l)
 #define arch_spin_unlock(l)queued_spin_unlock(l)
 #define arch_spin_lock_flags(l, f) queued_spin_lock(l)
-#define arch_spin_unlock_wait(l)   queued_spin_unlock_wait(l)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index d9510e8522d4..ef018a6e4985 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -130,12 +130,6 @@ do {   
\
 #define smp_mb__before_spinlock()  smp_wmb()
 #endif
 
-/**
- * raw_spin_unlock_wait - wait until the spinlock gets unlocked
- * @lock: the spinlock in question.
- */
-#define raw_spin_unlock_wait(lock) arch_spin_unlock_wait(&(lock)->raw_lock)
-
 #ifdef CONFIG_DEBUG_SPINLOCK
  extern void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock);
 #define do_raw_spin_lock_flags(lock, flags) do_raw_spin_lock(lock)
@@ -369,31 +363,6 @@ static __always_inline int spin_trylock_irq(spinlock_t 
*lock)
raw_spin_trylock_irqsave(spinlock_check(lock), flags); \
 })
 
-/**
- * spin_unlock_wait - Interpose between successive critical sections
- * @lock: the spinlock whose critical sections are to be interposed.
- *
- * Semantically this is equivalent to a spin_lock() immediately
- * followed by a spin_unlock().  However, most architectures have
- * more efficient implementations in which the spin_unlock_wait()
- * cannot block concurrent lock acquisition, and in some cases
- * where spin_unlock_wait() does not write to the lock variable.
- * Nevertheless, spin_unlock_wait() can have high overhead, so if
- * you feel the need to use it, please check to see if there is
- * a better way to get your job done.
- *
- * The ordering guarantees provided by spin_unlock_wait() are:
- *
- * 1.  All accesses preceding the spin_unlock_wait() happen before
- * any accesses in later critical sections for this same lock.
- * 2.  All accesses following the spin_unlock_wait() happen after
- * any accesses in earlier critical sections for this same lock.
- */
-static __always_inline void spin_unlock_wait(spinlock_t *lock)
-{
-   raw_spin_unlock_wait(&lock->rlock);
-}
-
 static __always_inline int spin_is_locked(spinlock_t *lock)
 {
return raw_spin_is_locked(&lock->rlock);
diff --git a/include/linux/spinlock_up.h b/include/linux/spinlock_up.h
index 0d9848de677d..612fb530af41 100644
--- a/include/linux/spinlock_up.h
+++ b/include/linux/spinlock_up.h
@@ -26,11 +26,6 @@
 #ifdef CONFIG_DEBUG_SPINLOCK
 #define arch_spin_is_locked(x) ((x)->slock == 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->slock, VAL);
-}
-
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
lock->slock = 0;
@@ -73,7 +68,6 

[PATCH v3 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Paul E. McKenney
Hello!

There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This series therefore removes spin_unlock_wait() and changes
its users to instead use a lock/unlock pair.  The commits are as follows,
in three groups:

1-7.Change uses of spin_unlock_wait() and raw_spin_unlock_wait()
to instead use a spin_lock/spin_unlock pair.  These may be
applied in any order, but must be applied before any later
commits in this series.  The commit logs state why I believe
that these commits won't noticeably degrade performance.

8.  Remove core-kernel definitions for spin_unlock_wait() and
raw_spin_unlock_wait().

9.  Remove arch-specific definitions of arch_spin_unlock_wait().

Changes since v2:

o   Add comment-only update from Manfred.

Changes since v1:

o   Disable interrupts where needed, thus avoiding embarrassing
interrupt-based self-deadlocks.

o   Substitute Manfred's patch for contrack_lock (#1 above).

o   Substitute Oleg's patch for task_work (#2 above).

o   Use more efficient barrier based on Arnd Bergmann feedback.

o   Merge the arch commits.

Thanx, Paul



 arch/alpha/include/asm/spinlock.h|5 -
 arch/arc/include/asm/spinlock.h  |5 -
 arch/arm/include/asm/spinlock.h  |   16 
 arch/arm64/include/asm/spinlock.h|   58 +
 arch/blackfin/include/asm/spinlock.h |5 -
 arch/hexagon/include/asm/spinlock.h  |5 -
 arch/ia64/include/asm/spinlock.h |   21 --
 arch/m32r/include/asm/spinlock.h |5 -
 arch/metag/include/asm/spinlock.h|5 -
 arch/mips/include/asm/spinlock.h |   16 
 arch/mn10300/include/asm/spinlock.h  |5 -
 arch/parisc/include/asm/spinlock.h   |7 --
 arch/powerpc/include/asm/spinlock.h  |   33 -
 arch/s390/include/asm/spinlock.h |7 --
 arch/sh/include/asm/spinlock-cas.h   |5 -
 arch/sh/include/asm/spinlock-llsc.h  |5 -
 arch/sparc/include/asm/spinlock_32.h |5 -
 arch/sparc/include/asm/spinlock_64.h |5 -
 arch/tile/include/asm/spinlock_32.h  |2 
 arch/tile/include/asm/spinlock_64.h  |2 
 arch/tile/lib/spinlock_32.c  |   23 --
 arch/tile/lib/spinlock_64.c  |   22 --
 arch/xtensa/include/asm/spinlock.h   |5 -
 drivers/ata/libata-eh.c  |8 --
 include/asm-generic/qspinlock.h  |   14 
 include/linux/spinlock.h |   31 -
 include/linux/spinlock_up.h  |6 -
 ipc/sem.c|3 
 kernel/exit.c|3 
 kernel/locking/qspinlock.c   |  117 ---
 kernel/sched/completion.c|9 --
 kernel/sched/core.c  |5 -
 kernel/task_work.c   |8 --
 net/netfilter/nf_conntrack_core.c|   52 ---
 34 files changed, 48 insertions(+), 475 deletions(-)



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-07 Thread Paul E. McKenney
On Fri, Jul 07, 2017 at 10:31:28AM +0200, Ingo Molnar wrote:

[ . . . ]

> In fact I'd argue that any future high performance spin_unlock_wait() user is 
> probably better off open coding the unlock-wait poll loop (and possibly 
> thinking 
> hard about eliminating it altogether). If such patterns pop up in the kernel 
> we 
> can think about consolidating them into a single read-only primitive again.

I would like any reintroduction to include a header comment saying exactly
what the consolidated primitive actually does and does not do.  ;-)

> I.e. I think the proposed changes are doing no harm, and the unavailability 
> of a 
> generic primitive does not hinder future optimizations either in any 
> significant 
> fashion.

I will have a v3 with updated comments from Manfred.  Thoughts on when/where
to push this?

The reason I ask is if this does not go in during this merge window, I need
to fix the header comment on spin_unlock_wait().

Thanx, Paul



Re: [PATCH v2 1/9] net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 08:45:59PM +0200, Manfred Spraul wrote:
> Hi Paul,
> 
> On 07/06/2017 01:31 AM, Paul E. McKenney wrote:
> >From: Manfred Spraul <manf...@colorfullife.com>
> >
> >As we want to remove spin_unlock_wait() and replace it with explicit
> >spin_lock()/spin_unlock() calls, we can use this to simplify the
> >locking.
> >
> >In addition:
> >- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> >- The new code avoids the backwards loop.
> >
> >Only slightly tested, I did not manage to trigger calls to
> >nf_conntrack_all_lock().
> 
> If you want:
> Attached would be V2, with adapted comments.

I do like the improved comments, thank you!  Queued, and will be part
of a later v3 of the series.

Thanx, Paul

> --
> Manfred

> >From e3562faa1bc96e883108505e05deecaf38c87a26 Mon Sep 17 00:00:00 2001
> From: Manfred Spraul <manf...@colorfullife.com>
> Date: Sun, 21 Aug 2016 07:17:55 +0200
> Subject: [PATCH 1/2] net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()
> 
> As we want to remove spin_unlock_wait() and replace it with explicit
> spin_lock()/spin_unlock() calls, we can use this to simplify the
> locking.
> 
> In addition:
> - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> - The new code avoids the backwards loop.
> 
> Only slightly tested, I did not manage to trigger calls to
> nf_conntrack_all_lock().
> 
> V2: With improved comments, to clearly show how the barriers
> pair.
> 
> Fixes: b16c29191dc8
> Signed-off-by: Manfred Spraul <manf...@colorfullife.com>
> Cc: <sta...@vger.kernel.org>
> Cc: Alan Stern <st...@rowland.harvard.edu>
> Cc: Sasha Levin <sasha.le...@oracle.com>
> Cc: Pablo Neira Ayuso <pa...@netfilter.org>
> Cc: netfilter-de...@vger.kernel.org
> ---
>  net/netfilter/nf_conntrack_core.c | 52 
> ++-
>  1 file changed, 29 insertions(+), 23 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_core.c 
> b/net/netfilter/nf_conntrack_core.c
> index 9979f46..51390fe 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -96,19 +96,26 @@ static struct conntrack_gc_work conntrack_gc_work;
> 
>  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
>  {
> + /* 1) Acquire the lock */
>   spin_lock(lock);
> - while (unlikely(nf_conntrack_locks_all)) {
> - spin_unlock(lock);
> 
> - /*
> -  * Order the 'nf_conntrack_locks_all' load vs. the
> -  * spin_unlock_wait() loads below, to ensure
> -  * that 'nf_conntrack_locks_all_lock' is indeed held:
> -  */
> - smp_rmb(); /* spin_lock(_conntrack_locks_all_lock) */
> - spin_unlock_wait(_conntrack_locks_all_lock);
> - spin_lock(lock);
> - }
> + /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics
> +  * It pairs with the smp_store_release() in nf_conntrack_all_unlock()
> +  */
> + if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
> + return;
> +
> + /* fast path failed, unlock */
> + spin_unlock(lock);
> +
> + /* Slow path 1) get global lock */
> + spin_lock(&nf_conntrack_locks_all_lock);
> +
> + /* Slow path 2) get the lock we want */
> + spin_lock(lock);
> +
> + /* Slow path 3) release the global lock */
> + spin_unlock(&nf_conntrack_locks_all_lock);
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_lock);
> 
> @@ -149,28 +156,27 @@ static void nf_conntrack_all_lock(void)
>   int i;
> 
>   spin_lock(&nf_conntrack_locks_all_lock);
> - nf_conntrack_locks_all = true;
> 
> - /*
> -  * Order the above store of 'nf_conntrack_locks_all' against
> -  * the spin_unlock_wait() loads below, such that if
> -  * nf_conntrack_lock() observes 'nf_conntrack_locks_all'
> -  * we must observe nf_conntrack_locks[] held:
> -  */
> - smp_mb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> + nf_conntrack_locks_all = true;
> 
>   for (i = 0; i < CONNTRACK_LOCKS; i++) {
> - spin_unlock_wait(&nf_conntrack_locks[i]);
> + spin_lock(&nf_conntrack_locks[i]);
> +
> + /* This spin_unlock provides the "release" to ensure that
> +  * nf_conntrack_locks_all==true is visible to everyone that
> +  * acquired spin_lock(&nf_conntrack_locks[]).
> +  */
> + spin_unlock(&nf_conntrack_locks[i]);
>   }
>  }
> 
>  static void nf_conntrack_all_unlock(void)

Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:08:50PM +0100, Will Deacon wrote:
> On Thu, Jul 06, 2017 at 06:50:36PM +0200, Peter Zijlstra wrote:
> > On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > > > On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > > > > From: Paul E. McKenney
> > > 
> > > [ . . . ]
> > > 
> > > > Now on the one hand I feel like Oleg that it would be a shame to loose
> > > > the optimization, OTOH this thing is really really tricky to use,
> > > > and has lead to a number of bugs already.
> > > 
> > > I do agree, it is a bit sad to see these optimizations go.  So, should
> > > this make mainline, I will be tagging the commits that spin_unlock_wait()
> > > so that they can be easily reverted should someone come up with good
> > > semantics and a compelling use case with compelling performance benefits.
> > 
> > Ha!, but what would constitute 'good semantics' ?
> > 
> > The current thing is something along the lines of:
> > 
> >   "Waits for the currently observed critical section
> >to complete with ACQUIRE ordering such that it will observe
> >whatever state was left by said critical section."
> > 
> > With the 'obvious' benefit of limited interference on those actually
> > wanting to acquire the lock, and a shorter wait time on our side too,
> > since we only need to wait for completion of the current section, and
> > not for however many contender are before us.
> > 
> > Not sure I have an actual (micro) benchmark that shows a difference
> > though.
> > 
> > 
> > 
> > Is this all good enough to retain the thing, I dunno. Like I said, I'm
> > conflicted on the whole thing. On the one hand its a nice optimization,
> > on the other hand I don't want to have to keep fixing these bugs.
> 
> As I've said, I'd be keen to see us drop this and bring it back if/when we
> get a compelling use-case along with performance numbers. At that point,
> we'd be in a better position to define the semantics anyway, knowing what
> exactly is expected by the use-case.

Hear, hear!!!  ;-)

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:50:36PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > > > From: Paul E. McKenney
> > 
> > [ . . . ]
> > 
> > > Now on the one hand I feel like Oleg that it would be a shame to loose
> > > the optimization, OTOH this thing is really really tricky to use,
> > > and has lead to a number of bugs already.
> > 
> > I do agree, it is a bit sad to see these optimizations go.  So, should
> > this make mainline, I will be tagging the commits that spin_unlock_wait()
> > so that they can be easily reverted should someone come up with good
> > semantics and a compelling use case with compelling performance benefits.
> 
> Ha!, but what would constitute 'good semantics' ?

At this point, it beats the heck out of me!  ;-)

> The current thing is something along the lines of:
> 
>   "Waits for the currently observed critical section
>to complete with ACQUIRE ordering such that it will observe
>whatever state was left by said critical section."
> 
> With the 'obvious' benefit of limited interference on those actually
> wanting to acquire the lock, and a shorter wait time on our side too,
> since we only need to wait for completion of the current section, and
> not for however many contender are before us.
> 
> Not sure I have an actual (micro) benchmark that shows a difference
> though.
> 
> 
> 
> Is this all good enough to retain the thing, I dunno. Like I said, I'm
> conflicted on the whole thing. On the one hand its a nice optimization,
> on the other hand I don't want to have to keep fixing these bugs.

Yeah, if I had seen a compelling use case...  Oleg's task_work case was
closest, but given that it involved a task-local lock that shouldn't
be all -that- heavily contended, it is hard to see there being all that
much difference.

But maybe I am missing something here?  Wouldn't be the first time...

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:41:34PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 09:24:12AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > > > And yes, there are architecture-specific optimizations for an
> > > > empty spin_lock()/spin_unlock() critical section, and the current
> > > > arch_spin_unlock_wait() implementations show some of these 
> > > > optimizations.
> > > > But I expect that performance benefits would need to be demonstrated at
> > > > the system level.
> > > 
> > > I do in fact contended there are any optimizations for the exact
> > > lock+unlock semantics.
> > 
> > You lost me on this one.
> 
> For the exact semantics you'd have to fully participate in the fairness
> protocol. You have to in fact acquire the lock in order to have the
> other contending CPUs wait (otherwise my earlier case 3 will fail).
> 
> At that point I'm not sure there is much actual code you can leave out.
> 
> What actual optimization is there left at that point?

Got it.  It was just that I was having a hard time parsing your sentence.
You were contending that there are no optimizations for all implementations
for the full semantics.

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:10:47PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 08:21:10AM -0700, Paul E. McKenney wrote:
> > And yes, there are architecture-specific optimizations for an
> > empty spin_lock()/spin_unlock() critical section, and the current
> > arch_spin_unlock_wait() implementations show some of these optimizations.
> > But I expect that performance benefits would need to be demonstrated at
> > the system level.
> 
> I do in fact contended there are any optimizations for the exact
> lock+unlock semantics.

You lost me on this one.

> The current spin_unlock_wait() is weaker. Most notably it will not (with
> exception of ARM64/PPC for other reasons) cause waits on other CPUs.

Agreed, weaker semantics allow more optimizations.  So use cases needing
only the weaker semantics should more readily show performance benefits.
But either way, we need compelling use cases, and I do not believe that
any of the existing spin_unlock_wait() calls are compelling.  Perhaps I
am confused, but I am not seeing it for any of them.

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > From: Paul E. McKenney

[ . . . ]

> Now on the one hand I feel like Oleg that it would be a shame to loose
> the optimization, OTOH this thing is really really tricky to use,
> and has lead to a number of bugs already.

I do agree, it is a bit sad to see these optimizations go.  So, should
this make mainline, I will be tagging the commits that spin_unlock_wait()
so that they can be easily reverted should someone come up with good
semantics and a compelling use case with compelling performance benefits.

Thanx, Paul



Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Paul E. McKenney
On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> From: Paul E. McKenney
> > Sent: 06 July 2017 00:30
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair.  This series therefore removes spin_unlock_wait() and changes
> > its users to instead use a lock/unlock pair.  The commits are as follows,
> > in three groups:
> > 
> > 1-7.Change uses of spin_unlock_wait() and raw_spin_unlock_wait()
> > to instead use a spin_lock/spin_unlock pair.  These may be
> > applied in any order, but must be applied before any later
> > commits in this series.  The commit logs state why I believe
> > that these commits won't noticeably degrade performance.
> 
> I can't help feeling that it might be better to have a spin_lock_sync()
> call that is equivalent to a spin_lock/spin_unlock pair.
> The default implementation being an inline function that does exactly that.
> This would let an architecture implement a more efficient version.
> 
> It might even be that this is the defined semantics of spin_unlock_wait().

That was in fact my first proposal, see the comment header in current
mainline for spin_unlock_wait() in include/linux/spinlock.h.  But Linus
quite rightly pointed out that if spin_unlock_wait() was to be defined in
this way, we should get rid of spin_unlock_wait() entirely, especially
given that there are not very many calls to spin_unlock_wait() and
also given that none of them are particularly performance critical.
Hence the current patch set, which does just that.
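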

> Note that it can only be useful to do a spin_lock/unlock pair if it is
> impossible for another code path to try to acquire the lock.
> (Or, at least, the code can't care if the lock is acquired just after.)

Indeed!  As Oleg Nesterov pointed out, a spin_lock()/spin_unlock() pair
is sort of like synchronize_rcu() for a given lock, where that lock's
critical sections play the role of RCU read-side critical sections.
So anything before the pair is visible to later critical sections, and
anything in prior critical sections is visible to anything after the pair.
But again, as Linus pointed out, if we are going to have these semantics,
just do spin_lock() immediately followed by spin_unlock().
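
As a contrived illustration of that guarantee, with made-up flag and data
names:

        static DEFINE_SPINLOCK(gp_lock);
        static bool stop_requested;
        static int shared_data;

        void reader(void)               /* an ordinary critical section */
        {
                spin_lock(&gp_lock);
                if (!READ_ONCE(stop_requested))
                        shared_data++;
                spin_unlock(&gp_lock);
        }

        void updater(void)
        {
                WRITE_ONCE(stop_requested, true);
                spin_lock(&gp_lock);    /* wait out any current critical section */
                spin_unlock(&gp_lock);
                /*
                 * Critical sections that begin after this point see
                 * stop_requested == true, and writes to shared_data made
                 * by earlier critical sections are visible here.
                 */
        }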

> So if it can de determined that the lock isn't held (a READ_ONCE()
> might be enough) the lock itself need not be acquired (with the
> associated slow bus cycles).

If you want the full semantics of a spin_lock()/spin_unlock() pair,
you need a full memory barrier before the READ_ONCE(), even on x86.
Without that memory barrier, you don't get the effect of the release
implicit in spin_unlock().
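
Roughly, the best you can do along those lines is something like the
following, and even then it is only a hint; lock_observed_free() is a
made-up name, not a proposal:

        static inline bool lock_observed_free(spinlock_t *lock)
        {
                /*
                 * The full barrier stands in for the release normally
                 * provided by spin_unlock(): it orders this CPU's prior
                 * accesses before the load of the lock word.
                 */
                smp_mb();
                return !spin_is_locked(lock);
        }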

For weaker architectures, such as PowerPC and ARM, a READ_ONCE()
does -not- suffice, not at all, even with smp_mb() before and after.
I encourage you to take a look at arch_spin_unlock_wait() in arm64
and powerpc if you are interested.  There were also some lengthy LKML
threads discussing this about 18 months ago that could be illuminating.

And yes, there are architecture-specific optimizations for an
empty spin_lock()/spin_unlock() critical section, and the current
arch_spin_unlock_wait() implementations show some of these optimizations.
But I expect that performance benefits would need to be demonstrated at
the system level.

Thanx, Paul



Re: [PATCH RFC 21/26] powerpc: Remove spin_unlock_wait() arch-specific definitions

2017-07-05 Thread Paul E. McKenney
On Sun, Jul 02, 2017 at 11:58:07AM +0800, Boqun Feng wrote:
> On Thu, Jun 29, 2017 at 05:01:29PM -0700, Paul E. McKenney wrote:
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair.  This commit therefore removes the underlying arch-specific
> > arch_spin_unlock_wait().
> > 
> > Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> > Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> > Cc: Paul Mackerras <pau...@samba.org>
> > Cc: Michael Ellerman <m...@ellerman.id.au>
> > Cc: <linuxppc-...@lists.ozlabs.org>
> > Cc: Will Deacon <will.dea...@arm.com>
> > Cc: Peter Zijlstra <pet...@infradead.org>
> > Cc: Alan Stern <st...@rowland.harvard.edu>
> > Cc: Andrea Parri <parri.and...@gmail.com>
> > Cc: Linus Torvalds <torva...@linux-foundation.org>
> 
> Acked-by: Boqun Feng <boqun.f...@gmail.com>

And finally applied in preparation for v2 of the patch series.

Thank you!!!

Thanx, Paul

> Regards,
> Boqun
> 
> > ---
> >  arch/powerpc/include/asm/spinlock.h | 33 -
> >  1 file changed, 33 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/spinlock.h 
> > b/arch/powerpc/include/asm/spinlock.h
> > index 8c1b913de6d7..d256e448ea49 100644
> > --- a/arch/powerpc/include/asm/spinlock.h
> > +++ b/arch/powerpc/include/asm/spinlock.h
> > @@ -170,39 +170,6 @@ static inline void arch_spin_unlock(arch_spinlock_t 
> > *lock)
> > lock->slock = 0;
> >  }
> >  
> > -static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
> > -{
> > -   arch_spinlock_t lock_val;
> > -
> > -   smp_mb();
> > -
> > -   /*
> > -* Atomically load and store back the lock value (unchanged). This
> > -* ensures that our observation of the lock value is ordered with
> > -* respect to other lock operations.
> > -*/
> > -   __asm__ __volatile__(
> > -"1:" PPC_LWARX(%0, 0, %2, 0) "\n"
> > -"  stwcx. %0, 0, %2\n"
> > -"  bne- 1b\n"
> > -   : "=" (lock_val), "+m" (*lock)
> > -   : "r" (lock)
> > -   : "cr0", "xer");
> > -
> > -   if (arch_spin_value_unlocked(lock_val))
> > -   goto out;
> > -
> > -   while (lock->slock) {
> > -   HMT_low();
> > -   if (SHARED_PROCESSOR)
> > -   __spin_yield(lock);
> > -   }
> > -   HMT_medium();
> > -
> > -out:
> > -   smp_mb();
> > -}
> > -
> >  /*
> >   * Read-write spinlocks, allowing multiple readers
> >   * but only one writer.
> > -- 
> > 2.5.2
> > 




[PATCH v2 8/9] locking: Remove spin_unlock_wait() generic definitions

2017-07-05 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes spin_unlock_wait() and related
definitions from core code.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Arnd Bergmann <a...@arndb.de>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 include/asm-generic/qspinlock.h |  14 -
 include/linux/spinlock.h|  31 ---
 include/linux/spinlock_up.h |   6 ---
 kernel/locking/qspinlock.c  | 117 
 4 files changed, 168 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 9f0681bf1e87..66260777d644 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -22,17 +22,6 @@
 #include 
 
 /**
- * queued_spin_unlock_wait - wait until the _current_ lock holder releases the 
lock
- * @lock : Pointer to queued spinlock structure
- *
- * There is a very slight possibility of live-lock if the lockers keep coming
- * and the waiter is just unfortunate enough to not see any unlock state.
- */
-#ifndef queued_spin_unlock_wait
-extern void queued_spin_unlock_wait(struct qspinlock *lock);
-#endif
-
-/**
  * queued_spin_is_locked - is the spinlock locked?
  * @lock: Pointer to queued spinlock structure
  * Return: 1 if it is locked, 0 otherwise
@@ -41,8 +30,6 @@ extern void queued_spin_unlock_wait(struct qspinlock *lock);
 static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
/*
-* See queued_spin_unlock_wait().
-*
 * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
 * isn't immediately observable.
 */
@@ -135,6 +122,5 @@ static __always_inline bool virt_spin_lock(struct qspinlock 
*lock)
 #define arch_spin_trylock(l)   queued_spin_trylock(l)
 #define arch_spin_unlock(l)queued_spin_unlock(l)
 #define arch_spin_lock_flags(l, f) queued_spin_lock(l)
-#define arch_spin_unlock_wait(l)   queued_spin_unlock_wait(l)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index d9510e8522d4..ef018a6e4985 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -130,12 +130,6 @@ do {   
\
 #define smp_mb__before_spinlock()  smp_wmb()
 #endif
 
-/**
- * raw_spin_unlock_wait - wait until the spinlock gets unlocked
- * @lock: the spinlock in question.
- */
-#define raw_spin_unlock_wait(lock) arch_spin_unlock_wait(&(lock)->raw_lock)
-
 #ifdef CONFIG_DEBUG_SPINLOCK
  extern void do_raw_spin_lock(raw_spinlock_t *lock) __acquires(lock);
 #define do_raw_spin_lock_flags(lock, flags) do_raw_spin_lock(lock)
@@ -369,31 +363,6 @@ static __always_inline int spin_trylock_irq(spinlock_t 
*lock)
raw_spin_trylock_irqsave(spinlock_check(lock), flags); \
 })
 
-/**
- * spin_unlock_wait - Interpose between successive critical sections
- * @lock: the spinlock whose critical sections are to be interposed.
- *
- * Semantically this is equivalent to a spin_lock() immediately
- * followed by a spin_unlock().  However, most architectures have
- * more efficient implementations in which the spin_unlock_wait()
- * cannot block concurrent lock acquisition, and in some cases
- * where spin_unlock_wait() does not write to the lock variable.
- * Nevertheless, spin_unlock_wait() can have high overhead, so if
- * you feel the need to use it, please check to see if there is
- * a better way to get your job done.
- *
- * The ordering guarantees provided by spin_unlock_wait() are:
- *
- * 1.  All accesses preceding the spin_unlock_wait() happen before
- * any accesses in later critical sections for this same lock.
- * 2.  All accesses following the spin_unlock_wait() happen after
- * any accesses in earlier critical sections for this same lock.
- */
-static __always_inline void spin_unlock_wait(spinlock_t *lock)
-{
-   raw_spin_unlock_wait(&lock->rlock);
-}
-
 static __always_inline int spin_is_locked(spinlock_t *lock)
 {
return raw_spin_is_locked(&lock->rlock);
diff --git a/include/linux/spinlock_up.h b/include/linux/spinlock_up.h
index 0d9848de677d..612fb530af41 100644
--- a/include/linux/spinlock_up.h
+++ b/include/linux/spinlock_up.h
@@ -26,11 +26,6 @@
 #ifdef CONFIG_DEBUG_SPINLOCK
 #define arch_spin_is_locked(x) ((x)->slock == 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->slock, VAL);
-}
-
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
lock->slock = 0;
@@ -73,7 +68,6 

[PATCH v2 9/9] arch: Remove spin_unlock_wait() arch-specific definitions

2017-07-05 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait() for all architectures providing them.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: <linux-a...@vger.kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Acked-by: Will Deacon <will.dea...@arm.com>
Acked-by: Boqun Feng <boqun.f...@gmail.com>
---
 arch/alpha/include/asm/spinlock.h|  5 
 arch/arc/include/asm/spinlock.h  |  5 
 arch/arm/include/asm/spinlock.h  | 16 --
 arch/arm64/include/asm/spinlock.h| 58 
 arch/blackfin/include/asm/spinlock.h |  5 
 arch/hexagon/include/asm/spinlock.h  |  5 
 arch/ia64/include/asm/spinlock.h | 21 -
 arch/m32r/include/asm/spinlock.h |  5 
 arch/metag/include/asm/spinlock.h|  5 
 arch/mips/include/asm/spinlock.h | 16 --
 arch/mn10300/include/asm/spinlock.h  |  5 
 arch/parisc/include/asm/spinlock.h   |  7 -
 arch/powerpc/include/asm/spinlock.h  | 33 
 arch/s390/include/asm/spinlock.h |  7 -
 arch/sh/include/asm/spinlock-cas.h   |  5 
 arch/sh/include/asm/spinlock-llsc.h  |  5 
 arch/sparc/include/asm/spinlock_32.h |  5 
 arch/sparc/include/asm/spinlock_64.h |  5 
 arch/tile/include/asm/spinlock_32.h  |  2 --
 arch/tile/include/asm/spinlock_64.h  |  2 --
 arch/tile/lib/spinlock_32.c  | 23 --
 arch/tile/lib/spinlock_64.c  | 22 --
 arch/xtensa/include/asm/spinlock.h   |  5 
 23 files changed, 5 insertions(+), 262 deletions(-)

diff --git a/arch/alpha/include/asm/spinlock.h 
b/arch/alpha/include/asm/spinlock.h
index a40b9fc0c6c3..718ac0b64adf 100644
--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 #define arch_spin_is_locked(x) ((x)->lock != 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 return lock.lock == 0;
diff --git a/arch/arc/include/asm/spinlock.h b/arch/arc/include/asm/spinlock.h
index 233d5ffe6ec7..a325e6a36523 100644
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_is_locked(x) ((x)->slock != __ARCH_SPIN_LOCK_UNLOCKED__)
 #define arch_spin_lock_flags(lock, flags)  arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->slock, !VAL);
-}
-
 #ifdef CONFIG_ARC_HAS_LLSC
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 4bec45442072..c030143c18c6 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -52,22 +52,6 @@ static inline void dsb_sev(void)
  * memory.
  */
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   u16 owner = READ_ONCE(lock->tickets.owner);
-
-   for (;;) {
-   arch_spinlock_t tmp = READ_ONCE(*lock);
-
-   if (tmp.tickets.owner == tmp.tickets.next ||
-   tmp.tickets.owner != owner)
-   break;
-
-   wfe();
-   }
-   smp_acquire__after_ctrl_dep();
-}
-
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
diff --git a/arch/arm64/include/asm/spinlock.h 
b/arch/arm64/include/asm/spinlock.h
index cae331d553f8..f445bd7f2b9f 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,58 +26,6 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   unsigned int tmp;
-   arch_spinlock_t lockval;
-   u32 owner;
-
-   /*
-* Ensure prior spin_lock operations to other locks have completed
-* on this CPU before we test whether "lock" is locked.
-*/
-   smp_mb();
-   owner = READ_ONCE(lock->owner) << 16;
-
-   asm volatile(
-"  sevl\n"
-"1:wfe\n"
-"2:ldaxr   %w0, %2\n"
-   /* Is the lock free? */
-"  eor %w1, %w0, %w0, ror #16\n"
-"  cbz %w1, 3f\n"
-   /* Lock taken -- has there been a subsequent unlock->lock transition? */
-"  eor %w1, %w3, %w0, lsl #16\n"
-"  cbz 

[PATCH v2 7/9] drivers/ata: Replace spin_unlock_wait() with lock/unlock pair

2017-07-05 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore eliminates the spin_unlock_wait() call and
associated else-clause and hoists the then-clause's lock and unlock out of
the "if" statement.  This should be safe from a performance perspective
because according to Tejun there should be few if any drivers that don't
set their own error handler.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Acked-by: Tejun Heo <t...@kernel.org>
Cc: <linux-...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 drivers/ata/libata-eh.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index ef68232b5222..779f6f18c1f4 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -645,12 +645,11 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, 
struct ata_port *ap,
 * completions are honored.  A scmd is determined to have
 * timed out iff its associated qc is active and not failed.
 */
+   spin_lock_irqsave(ap->lock, flags);
if (ap->ops->error_handler) {
struct scsi_cmnd *scmd, *tmp;
int nr_timedout = 0;
 
-   spin_lock_irqsave(ap->lock, flags);
-
/* This must occur under the ap->lock as we don't want
   a polled recovery to race the real interrupt handler
 
@@ -700,12 +699,11 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, 
struct ata_port *ap,
if (nr_timedout)
__ata_port_freeze(ap);
 
-   spin_unlock_irqrestore(ap->lock, flags);
 
/* initialize eh_tries */
ap->eh_tries = ATA_EH_MAX_TRIES;
-   } else
-   spin_unlock_wait(ap->lock);
+   }
+   spin_unlock_irqrestore(ap->lock, flags);
 
 }
 EXPORT_SYMBOL(ata_scsi_cmd_error_handler);
-- 
2.5.2



[PATCH v2 6/9] ipc: Replace spin_unlock_wait() with lock/unlock pair

2017-07-05 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
exit_sem() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because exit_sem()
is rarely invoked in production.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Davidlohr Bueso <d...@stgolabs.net>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Acked-by: Manfred Spraul <manf...@colorfullife.com>
---
 ipc/sem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index 947dc2348271..e88d0749a929 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2096,7 +2096,8 @@ void exit_sem(struct task_struct *tsk)
 * possibility where we exit while freeary() didn't
 * finish unlocking sem_undo_list.
 */
-   spin_unlock_wait(&ulp->lock);
+   spin_lock(&ulp->lock);
+   spin_unlock(&ulp->lock);
rcu_read_unlock();
break;
}
-- 
2.5.2



[PATCH v2 4/9] completion: Replace spin_unlock_wait() with lock/unlock pair

2017-07-05 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
completion_done() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because the lock
will be held only briefly, since the wakeup happens really quickly.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 kernel/sched/completion.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 53f9558fa925..3e66712e1964 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -307,14 +307,9 @@ bool completion_done(struct completion *x)
 * If ->done, we need to wait for complete() to release ->wait.lock
 * otherwise we can end up freeing the completion before complete()
 * is done referencing it.
-*
-* The RMB pairs with complete()'s RELEASE of ->wait.lock and orders
-* the loads of ->done and ->wait.lock such that we cannot observe
-* the lock before complete() acquires it while observing the ->done
-* after it's acquired the lock.
 */
-   smp_rmb();
-   spin_unlock_wait(&x->wait.lock);
+   spin_lock_irq(&x->wait.lock);
+   spin_unlock_irq(&x->wait.lock);
return true;
 }
 EXPORT_SYMBOL(completion_done);
-- 
2.5.2



[PATCH v2 5/9] exit: Replace spin_unlock_wait() with lock/unlock pair

2017-07-05 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and
it appears that all callers could do just as well with a lock/unlock pair.
This commit therefore replaces the spin_unlock_wait() call in do_exit()
with spin_lock() followed immediately by spin_unlock().  This should be
safe from a performance perspective because the lock is a per-task lock,
and this is happening only at task-exit time.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 kernel/exit.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 516acdb0e0ec..6d19c9090d43 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -832,7 +832,8 @@ void __noreturn do_exit(long code)
 * Ensure that we must observe the pi_state in exit_mm() ->
 * mm_release() -> exit_pi_state_list().
 */
-   raw_spin_unlock_wait(&tsk->pi_lock);
+   raw_spin_lock_irq(&tsk->pi_lock);
+   raw_spin_unlock_irq(&tsk->pi_lock);
 
if (unlikely(in_atomic())) {
pr_info("note: %s[%d] exited with preempt_count %d\n",
-- 
2.5.2



[PATCH v2 3/9] sched: Replace spin_unlock_wait() with lock/unlock pair

2017-07-05 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
do_task_dead() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because the lock is
this task's ->pi_lock, and this is called only after the task exits.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
[ paulmck: Replace leading smp_mb() with smp_mb__before_spinlock(),
  courtesy of Arnd Bergmann's noting its odd location. ]
---
 kernel/sched/core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e91138fcde86..48a8760fedf4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3460,8 +3460,9 @@ void __noreturn do_task_dead(void)
 * To avoid it, we have to wait for releasing tsk->pi_lock which
 * is held by try_to_wake_up()
 */
-   smp_mb();
-   raw_spin_unlock_wait(&current->pi_lock);
+   smp_mb__before_spinlock();
+   raw_spin_lock_irq(&current->pi_lock);
+   raw_spin_unlock_irq(&current->pi_lock);
 
/* Causes final put_task_struct in finish_task_switch(): */
__set_current_state(TASK_DEAD);
-- 
2.5.2



[PATCH v2 1/9] net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()

2017-07-05 Thread Paul E. McKenney
From: Manfred Spraul <manf...@colorfullife.com>

As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().

Fixes: b16c29191dc8
Signed-off-by: Manfred Spraul <manf...@colorfullife.com>
Cc: <sta...@vger.kernel.org>
Cc: Sasha Levin <sasha.le...@oracle.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: netfilter-de...@vger.kernel.org
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
---
 net/netfilter/nf_conntrack_core.c | 44 +--
 1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index e847dbaa0c6b..1193565c38ae 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
 
 void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
 {
+   /* 1) Acquire the lock */
spin_lock(lock);
-   while (unlikely(nf_conntrack_locks_all)) {
-   spin_unlock(lock);
 
-   /*
-* Order the 'nf_conntrack_locks_all' load vs. the
-* spin_unlock_wait() loads below, to ensure
-* that 'nf_conntrack_locks_all_lock' is indeed held:
-*/
-   smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
-   spin_unlock_wait(&nf_conntrack_locks_all_lock);
-   spin_lock(lock);
-   }
+   /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
+   if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
+   return;
+
+   /* fast path failed, unlock */
+   spin_unlock(lock);
+
+   /* Slow path 1) get global lock */
+   spin_lock(&nf_conntrack_locks_all_lock);
+
+   /* Slow path 2) get the lock we want */
+   spin_lock(lock);
+
+   /* Slow path 3) release the global lock */
+   spin_unlock(&nf_conntrack_locks_all_lock);
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_lock);
 
@@ -149,18 +154,17 @@ static void nf_conntrack_all_lock(void)
int i;
 
spin_lock(&nf_conntrack_locks_all_lock);
-   nf_conntrack_locks_all = true;
 
-   /*
-* Order the above store of 'nf_conntrack_locks_all' against
-* the spin_unlock_wait() loads below, such that if
-* nf_conntrack_lock() observes 'nf_conntrack_locks_all'
-* we must observe nf_conntrack_locks[] held:
-*/
-   smp_mb(); /* spin_lock(_conntrack_locks_all_lock) */
+   nf_conntrack_locks_all = true;
 
for (i = 0; i < CONNTRACK_LOCKS; i++) {
-   spin_unlock_wait(&nf_conntrack_locks[i]);
+   spin_lock(&nf_conntrack_locks[i]);
+
+   /* This spin_unlock provides the "release" to ensure that
+* nf_conntrack_locks_all==true is visible to everyone that
+* acquired spin_lock(&nf_conntrack_locks[]).
+*/
+   spin_unlock(&nf_conntrack_locks[i]);
}
 }
 
-- 
2.5.2



[PATCH v2 2/9] task_work: Replace spin_unlock_wait() with lock/unlock pair

2017-07-05 Thread Paul E. McKenney
From: Oleg Nesterov <o...@redhat.com>

There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
task_work_run() with a spin_lock_irq() and a spin_unlock_irq() around
the cmpxchg() dequeue loop.  This should be safe from a performance
perspective because ->pi_lock is local to the task and because calls to
the other side of the race, task_work_cancel(), should be rare.

Signed-off-by: Oleg Nesterov <o...@redhat.com>
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
---
 kernel/task_work.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index d513051fcca2..836a72a66fba 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -96,20 +96,16 @@ void task_work_run(void)
 * work->func() can do task_work_add(), do not set
 * work_exited unless the list is empty.
 */
+   raw_spin_lock_irq(&task->pi_lock);
do {
work = READ_ONCE(task->task_works);
head = !work && (task->flags & PF_EXITING) ?
&work_exited : NULL;
} while (cmpxchg(&task->task_works, work, head) != work);
+   raw_spin_unlock_irq(&task->pi_lock);
 
if (!work)
break;
-   /*
-* Synchronize with task_work_cancel(). It can't remove
-* the first entry == work, cmpxchg(task_works) should
-* fail, but it can play with *work and other entries.
-*/
-   raw_spin_unlock_wait(&task->pi_lock);
 
do {
next = work->next;
-- 
2.5.2



[PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-05 Thread Paul E. McKenney
Hello!

There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This series therefore removes spin_unlock_wait() and changes
its users to instead use a lock/unlock pair.  The commits are as follows,
in three groups:

1-7.Change uses of spin_unlock_wait() and raw_spin_unlock_wait()
to instead use a spin_lock/spin_unlock pair.  These may be
applied in any order, but must be applied before any later
commits in this series.  The commit logs state why I believe
that these commits won't noticeably degrade performance.

8.  Remove core-kernel definitions for spin_unlock_wait() and
raw_spin_unlock_wait().

9.  Remove arch-specific definitions of arch_spin_unlock_wait().

Changes since v1:

o   Disable interrupts where needed, thus avoiding embarrassing
interrupt-based self-deadlocks (sketched just after this list).

o   Substitute Manfred's patch for contrack_lock (#1 above).

o   Substitute Oleg's patch for task_work (#2 above).

o   Use more efficient barrier based on Arnd Bergmann feedback.
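
To illustrate the interrupt-based self-deadlock being avoided by the
first item above, here is a hypothetical sketch ("l" stands for any lock,
such as ->pi_lock, that can also be acquired from interrupt context):

	spin_lock(&l);		/* process context, interrupts still enabled */
	/*
	 * An interrupt arriving here whose handler also does spin_lock(&l)
	 * spins forever on this CPU: the handler cannot make progress until
	 * the lock is released, and the lock cannot be released until the
	 * handler returns.
	 */
	spin_unlock(&l);

	/* Hence the ->pi_lock conversions in this series use the
	 * IRQ-disabling forms instead:
	 */
	raw_spin_lock_irq(&task->pi_lock);
	raw_spin_unlock_irq(&task->pi_lock);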

Thanx, Paul



 arch/alpha/include/asm/spinlock.h|5 -
 arch/arc/include/asm/spinlock.h  |5 -
 arch/arm/include/asm/spinlock.h  |   16 
 arch/arm64/include/asm/spinlock.h|   58 +
 arch/blackfin/include/asm/spinlock.h |5 -
 arch/hexagon/include/asm/spinlock.h  |5 -
 arch/ia64/include/asm/spinlock.h |   21 --
 arch/m32r/include/asm/spinlock.h |5 -
 arch/metag/include/asm/spinlock.h|5 -
 arch/mips/include/asm/spinlock.h |   16 
 arch/mn10300/include/asm/spinlock.h  |5 -
 arch/parisc/include/asm/spinlock.h   |7 --
 arch/powerpc/include/asm/spinlock.h  |   33 -
 arch/s390/include/asm/spinlock.h |7 --
 arch/sh/include/asm/spinlock-cas.h   |5 -
 arch/sh/include/asm/spinlock-llsc.h  |5 -
 arch/sparc/include/asm/spinlock_32.h |5 -
 arch/sparc/include/asm/spinlock_64.h |5 -
 arch/tile/include/asm/spinlock_32.h  |2 
 arch/tile/include/asm/spinlock_64.h  |2 
 arch/tile/lib/spinlock_32.c  |   23 --
 arch/tile/lib/spinlock_64.c  |   22 --
 arch/xtensa/include/asm/spinlock.h   |5 -
 drivers/ata/libata-eh.c  |8 --
 include/asm-generic/qspinlock.h  |   14 
 include/linux/spinlock.h |   31 -
 include/linux/spinlock_up.h  |6 -
 ipc/sem.c|3 
 kernel/exit.c|3 
 kernel/locking/qspinlock.c   |  117 ---
 kernel/sched/completion.c|9 --
 kernel/sched/core.c  |5 -
 kernel/task_work.c   |8 --
 net/netfilter/nf_conntrack_core.c|   44 +++--
 34 files changed, 43 insertions(+), 472 deletions(-)



Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 05:39:36PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 03, 2017 at 03:49:42PM -0700, Linus Torvalds wrote:
> > On Mon, Jul 3, 2017 at 3:30 PM, Paul E. McKenney
> > <paul...@linux.vnet.ibm.com> wrote:
> > >
> > > That certainly is one interesting function, isn't it?  I wonder what
> > > happens if you replace the raw_spin_is_locked() calls with an
> > > unlock under a trylock check?  ;-)
> > 
> > Deadlock due to interrupts again?
> 
> Unless I am missing something subtle, the kgdb_cpu_enter() function in
> question has a local_irq_save() over the "interesting" portion of its
> workings, so interrupt-handler self-deadlock should not happen.
> 
> > Didn't your spin_unlock_wait() patches teach you anything? Checking
> > state is fundamentally different from taking the lock. Even a trylock.
> 
> That was an embarrassing bug, no two ways about it.  :-/
> 
> > I guess you could try with the irqsave versions. But no, we're not doing 
> > that.
> 
> Again, no need in this case.
> 
> But I agree with Will's assessment of this function...
> 
> The raw_spin_is_locked() looks to be asking if -any- CPU holds the
> dbg_slave_lock, and the answer could of course change immediately
> on return from raw_spin_is_locked().  Perhaps the theory is that
> if other CPU holds the lock, this CPU is supposed to be subjected to
> kgdb_roundup_cpus().  Except that the CPU that held dbg_slave_lock might
> be just about to release that lock.  Odd.
> 
> Seems like there should be a get_online_cpus() somewhere, but maybe
> that constraint is to be manually enforced.

Except that invoking get_online_cpus() from an exception handler would
of course be a spectacularly bad idea.  I would feel better if the
num_online_cpus() was under the local_irq_save(), but perhaps this code
is relying on the stop_machine().  Except that it appears we could
deadlock with offline waiting for stop_machine() to complete and kgdb
waiting for all CPUs to report, including those in stop_machine().

Looks like the current situation is "Don't use kgdb if there is any
possibility of CPU-hotplug operations."  Not necessarily an unreasonable
restriction.

But I need to let my eyes heal a bit before looking at this more.

Thanx, Paul



Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 03:49:42PM -0700, Linus Torvalds wrote:
> On Mon, Jul 3, 2017 at 3:30 PM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
> >
> > That certainly is one interesting function, isn't it?  I wonder what
> > happens if you replace the raw_spin_is_locked() calls with an
> > unlock under a trylock check?  ;-)
> 
> Deadlock due to interrupts again?

Unless I am missing something subtle, the kgdb_cpu_enter() function in
question has a local_irq_save() over the "interesting" portion of its
workings, so interrupt-handler self-deadlock should not happen.

> Didn't your spin_unlock_wait() patches teach you anything? Checking
> state is fundamentally different from taking the lock. Even a trylock.

That was an embarrassing bug, no two ways about it.  :-/

> I guess you could try with the irqsave versions. But no, we're not doing that.

Again, no need in this case.

But I agree with Will's assessment of this function...

The raw_spin_is_locked() looks to be asking if -any- CPU holds the
dbg_slave_lock, and the answer could of course change immediately
on return from raw_spin_is_locked().  Perhaps the theory is that
if other CPU holds the lock, this CPU is supposed to be subjected to
kgdb_roundup_cpus().  Except that the CPU that held dbg_slave_lock might
be just about to release that lock.  Odd.

Seems like there should be a get_online_cpus() somewhere, but maybe
that constraint is to be manually enforced.

Thanx, Paul



Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 06:13:38PM +0100, Will Deacon wrote:
> On Mon, Jul 03, 2017 at 09:40:22AM -0700, Linus Torvalds wrote:
> > On Mon, Jul 3, 2017 at 9:18 AM, Paul E. McKenney
> > <paul...@linux.vnet.ibm.com> wrote:
> > >
> > > Agreed, and my next step is to look at spin_lock() followed by
> > > spin_is_locked(), not necessarily the same lock.
> > 
> > Hmm. Most (all?) "spin_is_locked()" really should be about the same
> > thread that took the lock (ie it's about asserts and lock debugging).
> > 
> > The optimistic ABBA avoidance pattern for spinlocks *should* be
> > 
> > spin_lock(inner)
> > ...
> > if (!try_lock(outer)) {
> >spin_unlock(inner);
> >.. do them in the right order ..
> > 
> > so I don't think spin_is_locked() should have any memory barriers.
> > 
> > In fact, the core function for spin_is_locked() is arguably
> > arch_spin_value_unlocked() which doesn't even do the access itself.
> 
> Yeah, but there's some spaced-out stuff going on in kgdb_cpu_enter where
> it looks to me like raw_spin_is_locked is used for synchronization. My
> eyes are hurting looking at it, though.

That certainly is one interesting function, isn't it?  I wonder what
happens if you replace the raw_spin_is_locked() calls with an
unlock under a trylock check?  ;-)
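
Concretely, "an unlock under a trylock check" could look something like
the following (illustrative only, not a patch against the actual kgdb
code, whose locking is more involved):

	if (raw_spin_trylock(&dbg_slave_lock)) {
		/* The lock was free and we now hold it, so drop it again. */
		raw_spin_unlock(&dbg_slave_lock);
		/* ... take the "not locked" path ... */
	} else {
		/* Some other CPU holds the lock. */
		/* ... take the "locked" path ... */
	}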

Thanx, Paul



Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 09:40:22AM -0700, Linus Torvalds wrote:
> On Mon, Jul 3, 2017 at 9:18 AM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
> >
> > Agreed, and my next step is to look at spin_lock() followed by
> > spin_is_locked(), not necessarily the same lock.
> 
> Hmm. Most (all?) "spin_is_locked()" really should be about the same
> thread that took the lock (ie it's about asserts and lock debugging).

Good to know, that does make things easier.  ;-)

I am not certain that it is feasible to automatically recognize
non-assert/non-debugging use cases of spin_is_locked(), but there is
always manual inspection.

> The optimistic ABBA avoidance pattern for spinlocks *should* be
> 
> spin_lock(inner)
> ...
> if (!try_lock(outer)) {
>spin_unlock(inner);
>.. do them in the right order ..
> 
> so I don't think spin_is_locked() should have any memory barriers.
> 
> In fact, the core function for spin_is_locked() is arguably
> arch_spin_value_unlocked() which doesn't even do the access itself.

OK, so we should rework any cases where people are relying on acquisition
of one spin_lock() being ordered with a later spin_is_locked() on some
other lock by that same thread.
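
A minimal sketch of the pattern in question, with hypothetical locks
(not taken from any current caller):

	/* CPU 0 */
	spin_lock(&lock_a);
	r0 = spin_is_locked(&lock_b);

	/* CPU 1 */
	spin_lock(&lock_b);
	r1 = spin_is_locked(&lock_a);

On architectures where the lock-acquisition ACQUIRE applies to the load
rather than to the store of the lock word, nothing forbids both r0 and
r1 ending up zero, so any caller depending on at least one CPU seeing
the other's lock held would need an explicit barrier after spin_lock()
(or a lock/unlock of the other lock) rather than spin_is_locked() alone.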

Thanx, Paul



Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 04:04:14PM -0400, Alan Stern wrote:
> On Mon, 3 Jul 2017, Paul E. McKenney wrote:
> 
> > On Mon, Jul 03, 2017 at 10:39:49AM -0400, Alan Stern wrote:
> > > On Sat, 1 Jul 2017, Manfred Spraul wrote:
> > > 
> > > > As we want to remove spin_unlock_wait() and replace it with explicit
> > > > spin_lock()/spin_unlock() calls, we can use this to simplify the
> > > > locking.
> > > > 
> > > > In addition:
> > > > - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> > > > - The new code avoids the backwards loop.
> > > > 
> > > > Only slightly tested, I did not manage to trigger calls to
> > > > nf_conntrack_all_lock().
> > > > 
> > > > Fixes: b16c29191dc8
> > > > Signed-off-by: Manfred Spraul <manf...@colorfullife.com>
> > > > Cc: <sta...@vger.kernel.org>
> > > > Cc: Sasha Levin <sasha.le...@oracle.com>
> > > > Cc: Pablo Neira Ayuso <pa...@netfilter.org>
> > > > Cc: netfilter-de...@vger.kernel.org
> > > > ---
> > > >  net/netfilter/nf_conntrack_core.c | 44 
> > > > +--
> > > >  1 file changed, 24 insertions(+), 20 deletions(-)
> > > > 
> > > > diff --git a/net/netfilter/nf_conntrack_core.c 
> > > > b/net/netfilter/nf_conntrack_core.c
> > > > index e847dba..1193565 100644
> > > > --- a/net/netfilter/nf_conntrack_core.c
> > > > +++ b/net/netfilter/nf_conntrack_core.c
> > > > @@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
> > > >  
> > > >  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
> > > >  {
> > > > +   /* 1) Acquire the lock */
> > > > spin_lock(lock);
> > > > -   while (unlikely(nf_conntrack_locks_all)) {
> > > > -   spin_unlock(lock);
> > > >  
> > > > -   /*
> > > > -* Order the 'nf_conntrack_locks_all' load vs. the
> > > > -* spin_unlock_wait() loads below, to ensure
> > > > -* that 'nf_conntrack_locks_all_lock' is indeed held:
> > > > -*/
> > > > -   smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> > > > -   spin_unlock_wait(&nf_conntrack_locks_all_lock);
> > > > -   spin_lock(lock);
> > > > -   }
> > > > +   /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
> > > > +   if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
> > > > +   return;
> > > 
> > > As far as I can tell, this read does not need to have ACQUIRE
> > > semantics.
> > > 
> > > You need to guarantee that two things can never happen:
> > > 
> > > (1) We read nf_conntrack_locks_all == false, and this routine's
> > >   critical section for nf_conntrack_locks[i] runs after the
> > >   (empty) critical section for that lock in 
> > >   nf_conntrack_all_lock().
> > > 
> > > (2) We read nf_conntrack_locks_all == true, and this routine's 
> > >   critical section for nf_conntrack_locks_all_lock runs before 
> > >   the critical section in nf_conntrack_all_lock().
> > > 
> > > In fact, neither one can happen even if smp_load_acquire() is replaced
> > > with READ_ONCE().  The reason is simple enough, using this property of
> > > spinlocks:
> > > 
> > >   If critical section CS1 runs before critical section CS2 (for 
> > >   the same lock) then: (a) every write coming before CS1's
> > >   spin_unlock() will be visible to any read coming after CS2's
> > >   spin_lock(), and (b) no write coming after CS2's spin_lock()
> > >   will be visible to any read coming before CS1's spin_unlock().
> > > 
> > > Thus for (1), assuming the critical sections run in the order mentioned
> > > above, since nf_conntrack_all_lock() writes to nf_conntrack_locks_all
> > > before releasing nf_conntrack_locks[i], and since nf_conntrack_lock()
> > > acquires nf_conntrack_locks[i] before reading nf_conntrack_locks_all,
> > > by (a) the read will always see the write.
> > > 
> > > Similarly for (2), since nf_conntrack_all_lock() acquires 
> > > nf_conntrack_locks_all_lock before writing to nf_conntrack_locks_all, 
> > > and since nf

Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 10:39:49AM -0400, Alan Stern wrote:
> On Sat, 1 Jul 2017, Manfred Spraul wrote:
> 
> > As we want to remove spin_unlock_wait() and replace it with explicit
> > spin_lock()/spin_unlock() calls, we can use this to simplify the
> > locking.
> > 
> > In addition:
> > - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> > - The new code avoids the backwards loop.
> > 
> > Only slightly tested, I did not manage to trigger calls to
> > nf_conntrack_all_lock().
> > 
> > Fixes: b16c29191dc8
> > Signed-off-by: Manfred Spraul 
> > Cc: 
> > Cc: Sasha Levin 
> > Cc: Pablo Neira Ayuso 
> > Cc: netfilter-de...@vger.kernel.org
> > ---
> >  net/netfilter/nf_conntrack_core.c | 44 
> > +--
> >  1 file changed, 24 insertions(+), 20 deletions(-)
> > 
> > diff --git a/net/netfilter/nf_conntrack_core.c 
> > b/net/netfilter/nf_conntrack_core.c
> > index e847dba..1193565 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
> >  
> >  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
> >  {
> > +   /* 1) Acquire the lock */
> > spin_lock(lock);
> > -   while (unlikely(nf_conntrack_locks_all)) {
> > -   spin_unlock(lock);
> >  
> > -   /*
> > -* Order the 'nf_conntrack_locks_all' load vs. the
> > -* spin_unlock_wait() loads below, to ensure
> > -* that 'nf_conntrack_locks_all_lock' is indeed held:
> > -*/
> > -   smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> > -   spin_unlock_wait(&nf_conntrack_locks_all_lock);
> > -   spin_lock(lock);
> > -   }
> > +   /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
> > +   if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
> > +   return;
> 
> As far as I can tell, this read does not need to have ACQUIRE
> semantics.
> 
> You need to guarantee that two things can never happen:
> 
> (1) We read nf_conntrack_locks_all == false, and this routine's
>   critical section for nf_conntrack_locks[i] runs after the
>   (empty) critical section for that lock in 
>   nf_conntrack_all_lock().
> 
> (2) We read nf_conntrack_locks_all == true, and this routine's 
>   critical section for nf_conntrack_locks_all_lock runs before 
>   the critical section in nf_conntrack_all_lock().
> 
> In fact, neither one can happen even if smp_load_acquire() is replaced
> with READ_ONCE().  The reason is simple enough, using this property of
> spinlocks:
> 
>   If critical section CS1 runs before critical section CS2 (for 
>   the same lock) then: (a) every write coming before CS1's
>   spin_unlock() will be visible to any read coming after CS2's
>   spin_lock(), and (b) no write coming after CS2's spin_lock()
>   will be visible to any read coming before CS1's spin_unlock().
> 
> Thus for (1), assuming the critical sections run in the order mentioned
> above, since nf_conntrack_all_lock() writes to nf_conntrack_locks_all
> before releasing nf_conntrack_locks[i], and since nf_conntrack_lock()
> acquires nf_conntrack_locks[i] before reading nf_conntrack_locks_all,
> by (a) the read will always see the write.
> 
> Similarly for (2), since nf_conntrack_all_lock() acquires 
> nf_conntrack_locks_all_lock before writing to nf_conntrack_locks_all, 
> and since nf_conntrack_lock() reads nf_conntrack_locks_all before 
> releasing nf_conntrack_locks_all_lock, by (b) the read cannot see the 
> write.

And the Linux kernel memory model (https://lwn.net/Articles/718628/
and https://lwn.net/Articles/720550/) agrees with Alan.  Here is
a litmus test, which emulates spin_lock() with xchg_acquire() and
spin_unlock() with smp_store_release():



C C-ManfredSpraul-L1G1xchgnr.litmus

(* Expected result: Never.  *)

{
}

P0(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
{
/* Acquire local lock. */
r10 = xchg_acquire(lcl, 1);
r1 = READ_ONCE(*nfcla);
if (r1) {
smp_store_release(lcl, 0);
r11 = xchg_acquire(gbl, 1);
r12 = xchg_acquire(lcl, 1);
smp_store_release(gbl, 0);
}
r2 = READ_ONCE(*gbl_held);
WRITE_ONCE(*lcl_held, 1);
WRITE_ONCE(*lcl_held, 0);
smp_store_release(lcl, 0);
}

P1(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
{
/* Acquire global lock. */
r10 = xchg_acquire(gbl, 1);
WRITE_ONCE(*nfcla, 1);
r11 = xchg_acquire(lcl, 1);
smp_store_release(lcl, 0);
r2 = READ_ONCE(*lcl_held);
WRITE_ONCE(*gbl_held, 1);
WRITE_ONCE(*gbl_held, 0);

Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 02:15:14PM +0100, Will Deacon wrote:
> On Fri, Jun 30, 2017 at 03:18:40PM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 30, 2017 at 02:13:39PM +0100, Will Deacon wrote:
> > > On Fri, Jun 30, 2017 at 05:38:15AM -0700, Paul E. McKenney wrote:
> > > > I also need to check all uses of spin_is_locked().  There might no
> > > > longer be any that rely on any particular ordering...
> > > 
> > > Right. I think we're looking for the "insane case" as per 38b850a73034
> > > (which was apparently used by ipc/sem.c at the time, but no longer).
> > > 
> > > There's a usage in kernel/debug/debug_core.c, but it doesn't fill me with
> > > joy.
> > 
> > That is indeed an interesting one...  But my first round will be what
> > semantics the implementations seem to provide:
> > 
> > Acquire courtesy of TSO: s390, sparc, x86.
> > Acquire: ia64 (in reality fully ordered).
> > Control dependency: alpha, arc, arm, blackfin, hexagon, m32r, mn10300, tile,
> > xtensa.
> > Control dependency plus leading full barrier: arm64, powerpc.
UP-only: c6x, cris, frv, h8300, m68k, microblaze, nios2, openrisc, um, 
> > unicore32.
> > 
> > Special cases:
> > metag: Acquire if !CONFIG_METAG_SMP_WRITE_REORDERING.
> >Otherwise control dependency?
> > mips: Control dependency, acquire if CONFIG_CPU_CAVIUM_OCTEON.
> > parisc: Acquire courtesy of TSO, but why barrier in smp_load_acquire?
> > sh: Acquire if one of SH4A, SH5, or J2, otherwise acquire?  UP-only?
> > 
> > Are these correct, or am I missing something with any of them?
> 
> That looks about right but, at least on ARM, I think we have to consider
> the semantics of spin_is_locked with respect to the other spin_* functions,
> rather than in isolation.
> 
> For example, ARM only has a control dependency, but spin_lock has a trailing
> smp_mb() and spin_unlock has both leading and trailing smp_mb().

Agreed, and my next step is to look at spin_lock() followed by
spin_is_locked(), not necessarily the same lock.

Thanx, Paul



Re: [PATCH RFC 06/26] ipc: Replace spin_unlock_wait() with lock/unlock pair

2017-07-01 Thread Paul E. McKenney
On Sat, Jul 01, 2017 at 09:23:03PM +0200, Manfred Spraul wrote:
> On 06/30/2017 02:01 AM, Paul E. McKenney wrote:
> >There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> >and it appears that all callers could do just as well with a lock/unlock
> >pair.  This commit therefore replaces the spin_unlock_wait() call in
> >exit_sem() with spin_lock() followed immediately by spin_unlock().
> >This should be safe from a performance perspective because exit_sem()
> >is rarely invoked in production.
> >
> >Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> >Cc: Andrew Morton <a...@linux-foundation.org>
> >Cc: Davidlohr Bueso <d...@stgolabs.net>
> >Cc: Manfred Spraul <manf...@colorfullife.com>
> >Cc: Will Deacon <will.dea...@arm.com>
> >Cc: Peter Zijlstra <pet...@infradead.org>
> >Cc: Alan Stern <st...@rowland.harvard.edu>
> >Cc: Andrea Parri <parri.and...@gmail.com>
> >Cc: Linus Torvalds <torva...@linux-foundation.org>
> Acked-by: Manfred Spraul <manf...@colorfullife.com>

Applied, thank you!

Thanx, Paul

> >---
> >  ipc/sem.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> >diff --git a/ipc/sem.c b/ipc/sem.c
> >index 947dc2348271..e88d0749a929 100644
> >--- a/ipc/sem.c
> >+++ b/ipc/sem.c
> >@@ -2096,7 +2096,8 @@ void exit_sem(struct task_struct *tsk)
> >  * possibility where we exit while freeary() didn't
> >  * finish unlocking sem_undo_list.
> >  */
> >-spin_unlock_wait(&ulp->lock);
> >+spin_lock(&ulp->lock);
> >+spin_unlock(&ulp->lock);
> > rcu_read_unlock();
> > break;
> > }
> 
> 



Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-01 Thread Paul E. McKenney
On Sat, Jul 01, 2017 at 09:44:12PM +0200, Manfred Spraul wrote:
> Hi Paul,
> 
> On 06/30/2017 02:01 AM, Paul E. McKenney wrote:
> >There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> >and it appears that all callers could do just as well with a lock/unlock
> >pair.  This commit therefore replaces the spin_unlock_wait() calls
> >in nf_conntrack_lock() and nf_conntrack_all_lock() with spin_lock()
> >followed immediately by spin_unlock().  These functions do not appear
> >to be invoked on any fastpaths.
> >
> >Signed-off-by: Paul E. McKenney<paul...@linux.vnet.ibm.com>
> >Cc: Pablo Neira Ayuso<pa...@netfilter.org>
> >Cc: Jozsef Kadlecsik<kad...@blackhole.kfki.hu>
> >Cc: Florian Westphal<f...@strlen.de>
> >Cc: "David S. Miller"<da...@davemloft.net>
> >Cc:<netfilter-de...@vger.kernel.org>
> >Cc:<coret...@netfilter.org>
> >Cc:<netdev@vger.kernel.org>
> >Cc: Will Deacon<will.dea...@arm.com>
> >Cc: Peter Zijlstra<pet...@infradead.org>
> >Cc: Alan Stern<st...@rowland.harvard.edu>
> >Cc: Andrea Parri<parri.and...@gmail.com>
> >Cc: Linus Torvalds<torva...@linux-foundation.org>
> >---
> >  net/netfilter/nf_conntrack_core.c | 26 --
> >  1 file changed, 8 insertions(+), 18 deletions(-)
> >
> >diff --git a/net/netfilter/nf_conntrack_core.c 
> >b/net/netfilter/nf_conntrack_core.c
> >index e847dbaa0c6b..9f997859d160 100644
> >--- a/net/netfilter/nf_conntrack_core.c
> >+++ b/net/netfilter/nf_conntrack_core.c
> >@@ -99,15 +99,11 @@ void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
> > spin_lock(lock);
> > while (unlikely(nf_conntrack_locks_all)) {
> I think here an ACQUIRE is missing.
> > spin_unlock(lock);
> >-
> >-/*
> >- * Order the 'nf_conntrack_locks_all' load vs. the
> >- * spin_unlock_wait() loads below, to ensure
> >- * that 'nf_conntrack_locks_all_lock' is indeed held:
> >- */
> >-smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> >-spin_unlock_wait(&nf_conntrack_locks_all_lock);
> >+/* Wait for nf_conntrack_locks_all_lock holder to release ... */
> >+spin_lock(&nf_conntrack_locks_all_lock);
> >+spin_unlock(&nf_conntrack_locks_all_lock);
> > spin_lock(lock);
> >+/* ... and retry. */
> > }
> >  }
> As far as I see, nf_conntrack_locks[] nests inside
> nf_conntrack_lock_all_lock.
> So
>spin_lock(&nf_conntrack_locks_all_lock);
>spin_lock(lock);
>spin_unlock(&nf_conntrack_locks_all_lock);
> 
> can replace the retry logic.
> 
> Correct? Then what about the attached patch?

At first glance, it looks correct to me, thank you!  I have replaced my
patch with this one for testing and further review.

Thanx, Paul

> --
> Manfred
> 
> 

> >From 453e7a77f3756d939c754031b092cbdfbd149559 Mon Sep 17 00:00:00 2001
> From: Manfred Spraul <manf...@colorfullife.com>
> Date: Sun, 21 Aug 2016 07:17:55 +0200
> Subject: [PATCH] net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()
> 
> As we want to remove spin_unlock_wait() and replace it with explicit
> spin_lock()/spin_unlock() calls, we can use this to simplify the
> locking.
> 
> In addition:
> - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> - The new code avoids the backwards loop.
> 
> Only slightly tested, I did not manage to trigger calls to
> nf_conntrack_all_lock().
> 
> Fixes: b16c29191dc8
> Signed-off-by: Manfred Spraul <manf...@colorfullife.com>
> Cc: <sta...@vger.kernel.org>
> Cc: Sasha Levin <sasha.le...@oracle.com>
> Cc: Pablo Neira Ayuso <pa...@netfilter.org>
> Cc: netfilter-de...@vger.kernel.org
> ---
>  net/netfilter/nf_conntrack_core.c | 44 
> +--
>  1 file changed, 24 insertions(+), 20 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_core.c 
> b/net/netfilter/nf_conntrack_core.c
> index e847dba..1193565 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
> 
>  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
>  {
> + /* 1) Acquire the lock */
>   spin_lock(lock);
> - while (unlikely(nf_conntrack_locks_all)) {
> - spin_unlock(lock);
> 
> - /*
> -  * Order the 

Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 02:13:39PM +0100, Will Deacon wrote:
> On Fri, Jun 30, 2017 at 05:38:15AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 30, 2017 at 10:19:29AM +0100, Will Deacon wrote:
> > > On Thu, Jun 29, 2017 at 05:01:16PM -0700, Paul E. McKenney wrote:
> > > > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > > > and it appears that all callers could do just as well with a lock/unlock
> > > > pair.  This commit therefore removes spin_unlock_wait() and related
> > > > definitions from core code.
> > > > 
> > > > Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> > > > Cc: Arnd Bergmann <a...@arndb.de>
> > > > Cc: Ingo Molnar <mi...@redhat.com>
> > > > Cc: Will Deacon <will.dea...@arm.com>
> > > > Cc: Peter Zijlstra <pet...@infradead.org>
> > > > Cc: Alan Stern <st...@rowland.harvard.edu>
> > > > Cc: Andrea Parri <parri.and...@gmail.com>
> > > > Cc: Linus Torvalds <torva...@linux-foundation.org>
> > > > ---
> > > >  include/asm-generic/qspinlock.h |  14 -
> > > >  include/linux/spinlock.h|  31 ---
> > > >  include/linux/spinlock_up.h |   6 ---
> > > >  kernel/locking/qspinlock.c  | 117 
> > > > 
> > > >  4 files changed, 168 deletions(-)
> > > 
> > > [...]
> > > 
> > > > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > > > index b2caec7315af..64a9051e4c2c 100644
> > > > --- a/kernel/locking/qspinlock.c
> > > > +++ b/kernel/locking/qspinlock.c
> > > > @@ -267,123 +267,6 @@ static __always_inline u32  
> > > > __pv_wait_head_or_lock(struct qspinlock *lock,
> > > >  #define queued_spin_lock_slowpath  native_queued_spin_lock_slowpath
> > > >  #endif
> > > >  
> > > > -/*
> > > > - * Various notes on spin_is_locked() and spin_unlock_wait(), which are
> > > > - * 'interesting' functions:
> > > > - *
> > > > - * PROBLEM: some architectures have an interesting issue with atomic 
> > > > ACQUIRE
> > > > - * operations in that the ACQUIRE applies to the LOAD _not_ the STORE 
> > > > (ARM64,
> > > > - * PPC). Also qspinlock has a similar issue per construction, the 
> > > > setting of
> > > > - * the locked byte can be unordered acquiring the lock proper.
> > > > - *
> > > > - * This gets to be 'interesting' in the following cases, where the 
> > > > /should/s
> > > > - * end up false because of this issue.
> > > > - *
> > > > - *
> > > > - * CASE 1:
> > > > - *
> > > > - * So the spin_is_locked() correctness issue comes from something like:
> > > > - *
> > > > - *   CPU0  CPU1
> > > > - *
> > > > - *   global_lock();local_lock(i)
> > > > - * spin_lock() spin_lock([i])
> > > > - * for (i)   if (!spin_is_locked()) {
> > > > - *   spin_unlock_wait([i]);  
> > > > smp_acquire__after_ctrl_dep();
> > > > - * return;
> > > > - *   }
> > > > - *   // deal with fail
> > > > - *
> > > > - * Where it is important CPU1 sees G locked or CPU0 sees L[i] locked 
> > > > such
> > > > - * that there is exclusion between the two critical sections.
> > > > - *
> > > > - * The load from spin_is_locked() /should/ be constrained by the 
> > > > ACQUIRE from
> > > > - * spin_lock([i]), and similarly the load(s) from 
> > > > spin_unlock_wait([i])
> > > > - * /should/ be constrained by the ACQUIRE from spin_lock().
> > > > - *
> > > > - * Similarly, later stuff is constrained by the ACQUIRE from CTRL+RMB.
> > > 
> > > Might be worth keeping this comment about spin_is_locked, since we're not
> > > removing that guy just yet!
> > 
> > Ah, all the examples had spin_unlock_wait() in them.  So what I need to
> > do is to create a spin_unlock_wait()-free example to illustrate the
> > text starting with "The load from spin_is_locked(", correct?
> 
> Yeah, I think so.
> 
> >

Re: [PATCH RFC 02/26] task_work: Replace spin_unlock_wait() with lock/unlock pair

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 01:02:48PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 30, 2017 at 09:21:23PM +0200, Oleg Nesterov wrote:
> > On 06/30, Paul E. McKenney wrote:
> > >
> > > On Fri, Jun 30, 2017 at 05:20:10PM +0200, Oleg Nesterov wrote:
> > > >
> > > > I do not think the overhead will be noticeable in this particular case.
> > > >
> > > > But I am not sure I understand why do we want to unlock_wait. Yes I 
> > > > agree,
> >^
> > 
> > if it was not clear, I tried to say "why do we want to _remove_ 
> > unlock_wait".
> > 
> > > > it has some problems, but still...
> > > >
> > > > The code above looks strange for me. If we are going to repeat this 
> > > > pattern
> > > > the perhaps we should add a helper for lock+unlock and name it 
> > > > unlock_wait2 ;)
> > > >
> > > > If not, we should probably change this code more:
> > >
> > > This looks -much- better than my patch!  May I have your Signed-off-by?
> > 
> > Only if you promise to replace all RCU flavors with a single simple 
> > implementation
> > based on rwlock ;)
> 
> ;-) ;-) ;-)
> 
> Here you go:
> 
> https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/poormanurcu-2015.pdf
> 
> > Seriously, of course I won't argue, and it seems that nobody except me likes
> > this primitive, but to me spin_unlock_wait() looks like synchronize_rcu()
> > and
> > sometimes it makes sense.
> 
> Well, that analogy was what led me to propose that its semantics be
> defined as spin_lock() immediately followed by spin_unlock().  But that
> didn't go over well.
> 
> > Including this particular case. task_work_run() is going to flush/destroy 
> > the
> > ->task_works list, so it needs to wait until all currently executing 
> > "readers"
> > (task_work_cancel()'s which have started before ->task_works was updated) 
> > have
> > completed.
> 
> Understood!

And please see below for the resulting patch and commit log.  Please let
me know if I broke something.

Thanx, Paul



commit 6c0801c9ab19fc2f4c1e2436eb1b72e0af9a317b
Author: Oleg Nesterov <o...@redhat.com>
Date:   Fri Jun 30 13:13:59 2017 -0700

task_work: Replace spin_unlock_wait() with lock/unlock pair

There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
task_work_run() with a spin_lock_irq() and a spin_unlock_irq() around
the cmpxchg() dequeue loop.  This should be safe from a performance
perspective because ->pi_lock is local to the task and because calls to
the other side of the race, task_work_cancel(), should be rare.

Signed-off-by: Oleg Nesterov <o...@redhat.com>
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>

diff --git a/kernel/task_work.c b/kernel/task_work.c
index d513051fcca2..836a72a66fba 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -96,20 +96,16 @@ void task_work_run(void)
 * work->func() can do task_work_add(), do not set
 * work_exited unless the list is empty.
 */
+   raw_spin_lock_irq(&task->pi_lock);
do {
work = READ_ONCE(task->task_works);
head = !work && (task->flags & PF_EXITING) ?
&work_exited : NULL;
} while (cmpxchg(&task->task_works, work, head) != work);
+   raw_spin_unlock_irq(&task->pi_lock);
 
if (!work)
break;
-   /*
-* Synchronize with task_work_cancel(). It can't remove
-* the first entry == work, cmpxchg(task_works) should
-* fail, but it can play with *work and other entries.
-*/
-   raw_spin_unlock_wait(&task->pi_lock);
 
do {
next = work->next;



Re: [PATCH RFC 02/26] task_work: Replace spin_unlock_wait() with lock/unlock pair

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 03:50:33PM -0400, Alan Stern wrote:
> On Fri, 30 Jun 2017, Oleg Nesterov wrote:
> 
> > On 06/30, Paul E. McKenney wrote:
> > >
> > > On Fri, Jun 30, 2017 at 05:20:10PM +0200, Oleg Nesterov wrote:
> > > >
> > > > I do not think the overhead will be noticeable in this particular case.
> > > >
> > > > But I am not sure I understand why do we want to unlock_wait. Yes I 
> > > > agree,
> >^
> > 
> > if it was not clear, I tried to say "why do we want to _remove_ 
> > unlock_wait".
> > 
> > > > it has some problems, but still...
> > > >
> > > > The code above looks strange for me. If we are going to repeat this 
> > > > pattern
> > > > the perhaps we should add a helper for lock+unlock and name it 
> > > > unlock_wait2 ;)
> > > >
> > > > If not, we should probably change this code more:
> > >
> > > This looks -much- better than my patch!  May I have your Signed-off-by?
> > 
> > Only if you promise to replace all RCU flavors with a single simple 
> > implementation
> > based on rwlock ;)
> > 
> > Seriously, of course I won't argue, and it seems that nobody except me likes
> > this primitive, but to me spin_unlock_wait() looks like synchronize_rcu()
> > and
> > sometimes it makes sense.
> 
> If it looks like synchronize_rcu(), why not actually use 
> synchronize_rcu()?

My guess is that the latencies of synchronize_rcu() don't suit his needs.
When the lock is not held, spin_unlock_wait() is quite fast, even
compared to expedited grace periods.

Thanx, Paul

> Alan Stern
> 
> > Including this particular case. task_work_run() is going to flush/destroy 
> > the
> > ->task_works list, so it needs to wait until all currently executing 
> > "readers"
> > (task_work_cancel()'s which have started before ->task_works was updated) 
> > have
> > completed.
> 



Re: [PATCH RFC 02/26] task_work: Replace spin_unlock_wait() with lock/unlock pair

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 09:21:23PM +0200, Oleg Nesterov wrote:
> On 06/30, Paul E. McKenney wrote:
> >
> > On Fri, Jun 30, 2017 at 05:20:10PM +0200, Oleg Nesterov wrote:
> > >
> > > I do not think the overhead will be noticeable in this particular case.
> > >
> > > But I am not sure I understand why do we want to unlock_wait. Yes I agree,
>^
> 
> if it was not clear, I tried to say "why do we want to _remove_ unlock_wait".
> 
> > > it has some problems, but still...
> > >
> > > The code above looks strange for me. If we are going to repeat this 
> > > pattern
> > > the perhaps we should add a helper for lock+unlock and name it 
> > > unlock_wait2 ;)
> > >
> > > If not, we should probably change this code more:
> >
> > This looks -much- better than my patch!  May I have your Signed-off-by?
> 
> Only if you promise to replace all RCU flavors with a single simple 
> implementation
> based on rwlock ;)

;-) ;-) ;-)

Here you go:

https://github.com/pramalhe/ConcurrencyFreaks/blob/master/papers/poormanurcu-2015.pdf

> Seriously, of course I won't argue, and it seems that nobody except me likes
> this primitive, but to me spin_unlock_wait() looks like synchronize_rcu() and
> sometimes it makes sense.

Well, that analogy was what led me to propose that its semantics be
defined as spin_lock() immediately followed by spin_unlock().  But that
didn't go over well.

> Including this particular case. task_work_run() is going to flush/destroy the
> ->task_works list, so it needs to wait until all currently executing "readers"
> (task_work_cancel()'s which have started before ->task_works was updated) have
> completed.

Understood!

Thanx, Paul



Re: [PATCH RFC 12/26] arm64: Remove spin_unlock_wait() arch-specific definitions

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 10:20:57AM +0100, Will Deacon wrote:
> On Thu, Jun 29, 2017 at 05:01:20PM -0700, Paul E. McKenney wrote:
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair.  This commit therefore removes the underlying arch-specific
> > arch_spin_unlock_wait().
> > 
> > Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> > Cc: Catalin Marinas <catalin.mari...@arm.com>
> > Cc: Will Deacon <will.dea...@arm.com>
> > Cc: <linux-arm-ker...@lists.infradead.org>
> > Cc: Peter Zijlstra <pet...@infradead.org>
> > Cc: Alan Stern <st...@rowland.harvard.edu>
> > Cc: Andrea Parri <parri.and...@gmail.com>
> > Cc: Linus Torvalds <torva...@linux-foundation.org>
> > ---
> >  arch/arm64/include/asm/spinlock.h | 58 
> > ---
> >  1 file changed, 5 insertions(+), 53 deletions(-)
> 
> I'm going to miss this code.

;-) ;-) ;-)

> Acked-by: Will Deacon <will.dea...@arm.com>

Applied, thank you!

Thanx, Paul



Re: [PATCH RFC 02/26] task_work: Replace spin_unlock_wait() with lock/unlock pair

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 09:16:07AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 30, 2017 at 05:20:10PM +0200, Oleg Nesterov wrote:
> > On 06/30, Paul E. McKenney wrote:
> > >
> > > > > + raw_spin_lock_irq(&task->pi_lock);
> > > > > + raw_spin_unlock_irq(&task->pi_lock);
> > >
> > > I agree that the spin_unlock_wait() implementations would avoid the
> > > deadlock with an acquisition from an interrupt handler, while also
> > > avoiding the need to momentarily disable interrupts.  The ->pi_lock is
> > > a per-task lock, so I am assuming (perhaps naively) that contention is
> > > not a problem.  So is the overhead of interrupt disabling likely to be
> > > noticeable here?
> > 
> > I do not think the overhead will be noticeable in this particular case.
> > 
> > But I am not sure I understand why do we want to unlock_wait. Yes I agree,
> > it has some problems, but still...

Well, I tried documenting exactly what it did and did not do, which got
an ack from Peter.

https://marc.info/?l=linux-kernel&m=149575078313105

However, my later pull request spawned a bit of discussion:

https://marc.info/?l=linux-kernel&m=149730349001044

This discussion led me to propose strengthening spin_unlock_wait()
to act as a lock/unlock pair.  This can be implemented on x86 as
an smp_mb() followed by a read-only spinloop, as shown on branch
spin_unlock_wait.2017.06.23a on my -rcu tree.
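
Very roughly, and assuming a queued lock whose word reads as zero when free,
that strengthened variant looks something like the following sketch (not the
actual contents of that branch):

	static inline void strengthened_unlock_wait(struct qspinlock *lock)
	{
		smp_mb();			/* order prior accesses before sampling the lock */
		while (atomic_read(&lock->val))	/* read-only spin: never stores to the lock word */
			cpu_relax();
		smp_acquire__after_ctrl_dep();	/* give the loop exit ACQUIRE semantics */
	}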

Linus was not amused, and said that if we were going to make
spin_unlock_wait() have the semantics of lock+unlock, we should just
open-code that, especially given that there are way more definitions
of spin_unlock_wait() than there are uses.  He also suggested making
spin_unlock_wait() have only acquire semantics (x86 spin loop with
no memory-barrier instructions) and add explicit barriers where
required.

https://marc.info/?l=linux-kernel&m=149860012913036

I did a series for this which may be found on branch
spin_unlock_wait.2017.06.27a on my -rcu tree.

This approach was not loved by others (see later on the above thread), and
Linus's reply (which reiterated his opposition to lock+unlock semantics)
suggested the possibility of removing spin_unlock_wait() entirely.

https://marc.info/?l=linux-kernel&m=149869476911620

So I figured, in for a penny, in for a pound, and therefore did the series
that includes this patch.  The most recent update (which does not yet
include your improved version) is on branch spin_unlock_wait.2017.06.30b
of my -rcu tree.

Hey, you asked!  ;-)

Thanx, Paul

> > The code above looks strange for me. If we are going to repeat this pattern
> > then perhaps we should add a helper for lock+unlock and name it unlock_wait2 ;)
> > 
> > If not, we should probably change this code more:
> 
> This looks -much- better than my patch!  May I have your Signed-off-by?
> 
>   Thanx, Paul
> 
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -96,20 +96,16 @@ void task_work_run(void)
> >  * work->func() can do task_work_add(), do not set
> >  * work_exited unless the list is empty.
> >  */
> > +   raw_spin_lock_irq(&task->pi_lock);
> > do {
> > work = READ_ONCE(task->task_works);
> > head = !work && (task->flags & PF_EXITING) ?
> > &work_exited : NULL;
> > } while (cmpxchg(&task->task_works, work, head) != work);
> > +   raw_spin_unlock_irq(&task->pi_lock);
> > 
> > if (!work)
> > break;
> > -   /*
> > -* Synchronize with task_work_cancel(). It can't remove
> > -* the first entry == work, cmpxchg(task_works) should
> > -* fail, but it can play with *work and other entries.
> > -*/
> > -   raw_spin_unlock_wait(&task->pi_lock);
> > 
> > do {
> > next = work->next;
> > 
> > performance-wise this is almost the same, and if we do not really care about
> > overhead we can simplify the code: this way it is obvious that we can't race
> > with task_work_cancel().
> > 
> > Oleg.
> > 



Re: [PATCH RFC 02/26] task_work: Replace spin_unlock_wait() with lock/unlock pair

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 05:20:10PM +0200, Oleg Nesterov wrote:
> On 06/30, Paul E. McKenney wrote:
> >
> > > > +   raw_spin_lock_irq(&task->pi_lock);
> > > > +   raw_spin_unlock_irq(&task->pi_lock);
> >
> > I agree that the spin_unlock_wait() implementations would avoid the
> > deadlock with an acquisition from an interrupt handler, while also
> > avoiding the need to momentarily disable interrupts.  The ->pi_lock is
> > a per-task lock, so I am assuming (perhaps naively) that contention is
> > not a problem.  So is the overhead of interrupt disabling likely to be
> > noticeable here?
> 
> I do not think the overhead will be noticeable in this particular case.
> 
> But I am not sure I understand why do we want to unlock_wait. Yes I agree,
> it has some problems, but still...
> 
> The code above looks strange for me. If we are going to repeat this pattern
> then perhaps we should add a helper for lock+unlock and name it unlock_wait2 ;)
> 
> If not, we should probably change this code more:

This looks -much- better than my patch!  May I have your Signed-off-by?

Thanx, Paul

> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -96,20 +96,16 @@ void task_work_run(void)
>* work->func() can do task_work_add(), do not set
>* work_exited unless the list is empty.
>*/
> + raw_spin_lock_irq(&task->pi_lock);
>   do {
>   work = READ_ONCE(task->task_works);
>   head = !work && (task->flags & PF_EXITING) ?
>   &work_exited : NULL;
>   } while (cmpxchg(&task->task_works, work, head) != work);
> + raw_spin_unlock_irq(&task->pi_lock);
> 
>   if (!work)
>   break;
> - /*
> -  * Synchronize with task_work_cancel(). It can't remove
> -  * the first entry == work, cmpxchg(task_works) should
> -  * fail, but it can play with *work and other entries.
> -  */
> - raw_spin_unlock_wait(&task->pi_lock);
> 
>   do {
>   next = work->next;
> 
> performance-wise this is almost the same, and if we do not really care about
> overhead we can simplify the code: this way it is obvious that we can't race
> with task_work_cancel().
> 
> Oleg.
> 



Re: [PATCH RFC 02/26] task_work: Replace spin_unlock_wait() with lock/unlock pair

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 01:04:45PM +0200, Oleg Nesterov wrote:
> On 06/29, Paul E. McKenney wrote:
> >
> > --- a/kernel/task_work.c
> > +++ b/kernel/task_work.c
> > @@ -109,7 +109,8 @@ void task_work_run(void)
> >  * the first entry == work, cmpxchg(task_works) should
> >  * fail, but it can play with *work and other entries.
> >  */
> > -   raw_spin_unlock_wait(&task->pi_lock);
> > +   raw_spin_lock(&task->pi_lock);
> > +   raw_spin_unlock(&task->pi_lock);
> 
> Well, but then you need spin_lock_irq(). And this is one of the reasons
> why I personally think unlock_wait have some sense...

Good catch, and I clearly need to double-check the other commits for
any need to disable interrupts.  Anyway, like this, with the addition
of a flags variable, correct?

> > +   raw_spin_lock_irq(&task->pi_lock);
> > +   raw_spin_unlock_irq(&task->pi_lock);

I agree that the spin_unlock_wait() implementations would avoid the
deadlock with an acquisition from an interrupt handler, while also
avoiding the need to momentarily disable interrupts.  The ->pi_lock is
a per-task lock, so I am assuming (perhaps naively) that contention is
not a problem.  So is the overhead of interrupt disabling likely to be
noticeable here?
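
For concreteness, one way to read "the addition of a flags variable" is the
irqsave form, sketched here with an illustrative helper name:

	#include <linux/sched.h>
	#include <linux/spinlock.h>

	static void wait_for_pi_lock_holder(struct task_struct *task)
	{
		unsigned long flags;

		/* Acquiring and immediately releasing ->pi_lock waits for any
		 * current holder; saving and restoring the interrupt state
		 * avoids deadlock against an acquisition from an interrupt
		 * handler without assuming interrupts were enabled on entry. */
		raw_spin_lock_irqsave(&task->pi_lock, flags);
		raw_spin_unlock_irqrestore(&task->pi_lock, flags);
	}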

Thanx, Paul



Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 10:19:29AM +0100, Will Deacon wrote:
> On Thu, Jun 29, 2017 at 05:01:16PM -0700, Paul E. McKenney wrote:
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair.  This commit therefore removes spin_unlock_wait() and related
> > definitions from core code.
> > 
> > Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> > Cc: Arnd Bergmann <a...@arndb.de>
> > Cc: Ingo Molnar <mi...@redhat.com>
> > Cc: Will Deacon <will.dea...@arm.com>
> > Cc: Peter Zijlstra <pet...@infradead.org>
> > Cc: Alan Stern <st...@rowland.harvard.edu>
> > Cc: Andrea Parri <parri.and...@gmail.com>
> > Cc: Linus Torvalds <torva...@linux-foundation.org>
> > ---
> >  include/asm-generic/qspinlock.h |  14 -
> >  include/linux/spinlock.h|  31 ---
> >  include/linux/spinlock_up.h |   6 ---
> >  kernel/locking/qspinlock.c  | 117 
> > 
> >  4 files changed, 168 deletions(-)
> 
> [...]
> 
> > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > index b2caec7315af..64a9051e4c2c 100644
> > --- a/kernel/locking/qspinlock.c
> > +++ b/kernel/locking/qspinlock.c
> > @@ -267,123 +267,6 @@ static __always_inline u32  
> > __pv_wait_head_or_lock(struct qspinlock *lock,
> >  #define queued_spin_lock_slowpath  native_queued_spin_lock_slowpath
> >  #endif
> >  
> > -/*
> > - * Various notes on spin_is_locked() and spin_unlock_wait(), which are
> > - * 'interesting' functions:
> > - *
> > - * PROBLEM: some architectures have an interesting issue with atomic 
> > ACQUIRE
> > - * operations in that the ACQUIRE applies to the LOAD _not_ the STORE 
> > (ARM64,
> > - * PPC). Also qspinlock has a similar issue per construction, the setting 
> > of
> > - * the locked byte can be unordered acquiring the lock proper.
> > - *
> > - * This gets to be 'interesting' in the following cases, where the 
> > /should/s
> > - * end up false because of this issue.
> > - *
> > - *
> > - * CASE 1:
> > - *
> > - * So the spin_is_locked() correctness issue comes from something like:
> > - *
> > - *   CPU0                            CPU1
> > - *
> > - *   global_lock();                  local_lock(i)
> > - *     spin_lock(&G)                   spin_lock(&L[i])
> > - *     for (i)                         if (!spin_is_locked(&G)) {
> > - *       spin_unlock_wait(&L[i]);        smp_acquire__after_ctrl_dep();
> > - *                                       return;
> > - *                                     }
> > - *                                     // deal with fail
> > - *
> > - * Where it is important CPU1 sees G locked or CPU0 sees L[i] locked such
> > - * that there is exclusion between the two critical sections.
> > - *
> > - * The load from spin_is_locked(&G) /should/ be constrained by the ACQUIRE from
> > - * spin_lock(&L[i]), and similarly the load(s) from spin_unlock_wait(&L[i])
> > - * /should/ be constrained by the ACQUIRE from spin_lock(&G).
> > - *
> > - * Similarly, later stuff is constrained by the ACQUIRE from CTRL+RMB.
> 
> Might be worth keeping this comment about spin_is_locked, since we're not
> removing that guy just yet!

Ah, all the examples had spin_unlock_wait() in them.  So what I need to
do is to create a spin_unlock_wait()-free example to illustrate the
text starting with "The load from spin_is_locked(", correct?
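
One possible spin_unlock_wait()-free rendering of that example, purely as a
sketch of what the updated comment might show (not necessarily what was
committed):

	/*
	 *   CPU0                            CPU1
	 *
	 *   global_lock();                  local_lock(i)
	 *     spin_lock(&G)                   spin_lock(&L[i])
	 *     for (i) {                       if (!spin_is_locked(&G)) {
	 *       spin_lock(&L[i]);               smp_acquire__after_ctrl_dep();
	 *       spin_unlock(&L[i]);             return;
	 *     }                               }
	 *                                     // deal with fail
	 */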

I also need to check all uses of spin_is_locked().  There might no
longer be any that rely on any particular ordering...

Thanx, Paul



Re: [PATCH RFC 03/26] sched: Replace spin_unlock_wait() with lock/unlock pair

2017-06-30 Thread Paul E. McKenney
On Fri, Jun 30, 2017 at 12:31:50PM +0200, Arnd Bergmann wrote:
> On Fri, Jun 30, 2017 at 2:01 AM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair.  This commit therefore replaces the spin_unlock_wait() call in
> > do_task_dead() with spin_lock() followed immediately by spin_unlock().
> > This should be safe from a performance perspective because the lock is
> > this task's ->pi_lock, and this is called only after the task exits.
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e91138fcde86..6dea3d9728c8 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3461,7 +3461,8 @@ void __noreturn do_task_dead(void)
> >  * is held by try_to_wake_up()
> >  */
> > smp_mb();
> > -   raw_spin_unlock_wait(&current->pi_lock);
> > +   raw_spin_lock(&current->pi_lock);
> > +   raw_spin_unlock(&current->pi_lock);
> 
> Does the raw_spin_lock()/raw_spin_unlock() imply an smp_mb() or stronger?
> Maybe it would be clearer to remove the extra barrier if so.

No, it does not in general, but it does on most architectures, and
there are ways to allow those architectures to gain the benefit of their
stronger locks.  For example, would this work?  

> >  * is held by try_to_wake_up()
> >  */
> > -   smp_mb();
> > -   raw_spin_unlock_wait(&current->pi_lock);
> > +   smp_mb__before_spinlock();
> > +   raw_spin_lock(&current->pi_lock);
> > +   raw_spin_unlock(&current->pi_lock);

Thanx, Paul



Re: [PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
On Thu, Jun 29, 2017 at 05:10:41PM -0700, Linus Torvalds wrote:
> On Thu, Jun 29, 2017 at 5:06 PM, Linus Torvalds
>  wrote:
> >
> > Please don't make this one commit for every architecture.
> >
> > Once something gets removed, it gets removed. There's no point in
> > "remove it from architecture X". If there are no more users, we're
> > done with it, and making it be 25 patches with the same commit message
> > instead of just one doesn't help anybody.
> 
> Just to clarify: I think the actual *users* are worth doing one by
> one, particularly if there are user-specific explanations of what that
> particular code wanted, and why spin_unlock_wait() doesn't really
> help.

Got it, and I did merge -only- the arch-specific definition removals
into one commit.  Should I also merge the core-code definition removals
into that same commit, or is it OK to remove the core-code definitions
with one commit and the arch-specific definitions with another.

(My guess is that you would prefer I removed -all- definitions with one
commit, including the core-kernel definitions, but at this point I figure
I should just ask.)

> And I think that you actually have those per-user insights by now,
> after looking at the long thread.

One Acked-by thus far, so some progress!

> So I'm not saying "do one patch for the whole series". One patch per
> removal of use is fine - in fact preferred.

Got it.  It allows the developers and maintainers to tell me where my
analysis is wrong, for one thing.  ;-)

> But once there are no actual more users, just remove all the
> architecture definitions in one go, because explaining the same thing
> several times doesn't actually help anything.
> 
> In fact, *if* we end up ever resurrecting that thing, it's good if we
> can resurrect it in one go. Then we can resurrect the one or two users
> that turned out to matter after all and could come up with why some
> particular ordering was ok too.

Understood!

Thanx, Paul



Re: [PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
On Thu, Jun 29, 2017 at 05:09:56PM -0700, Paul E. McKenney wrote:
> On Thu, Jun 29, 2017 at 05:06:16PM -0700, Linus Torvalds wrote:
> > On Thu, Jun 29, 2017 at 5:01 PM, Paul E. McKenney
> > <paul...@linux.vnet.ibm.com> wrote:
> > > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > > and it appears that all callers could do just as well with a lock/unlock
> > > pair.  This commit therefore removes the underlying arch-specific
> > > arch_spin_unlock_wait().
> > 
> > Please don't make this one commit for every architecture.
> > 
> > Once something gets removed, it gets removed. There's no point in
> > "remove it from architecture X". If there are no more users, we're
> > done with it, and making it be 25 patches with the same commit message
> > instead of just one doesn't help anybody.
> 
> Apologies, I will merge them.

I suppose that I could have sent the merged patch...  Please see below.

Thanx, Paul

--------

commit e81efa4219e9b07e476448572aafe8f9c4ad28c8
Author: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Date:   Thu Jun 29 15:53:02 2017 -0700

arch: Remove spin_unlock_wait() arch-specific definitions

There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait() for all architectures providing them.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: <linux-a...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>

diff --git a/arch/alpha/include/asm/spinlock.h 
b/arch/alpha/include/asm/spinlock.h
index a40b9fc0c6c3..718ac0b64adf 100644
--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 #define arch_spin_is_locked(x) ((x)->lock != 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 return lock.lock == 0;
diff --git a/arch/arc/include/asm/spinlock.h b/arch/arc/include/asm/spinlock.h
index 233d5ffe6ec7..a325e6a36523 100644
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_is_locked(x) ((x)->slock != __ARCH_SPIN_LOCK_UNLOCKED__)
 #define arch_spin_lock_flags(lock, flags)  arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->slock, !VAL);
-}
-
 #ifdef CONFIG_ARC_HAS_LLSC
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
diff --git a/arch/arm/include/asm/spinlock.h b/arch/arm/include/asm/spinlock.h
index 4bec45442072..c030143c18c6 100644
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -52,22 +52,6 @@ static inline void dsb_sev(void)
  * memory.
  */
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   u16 owner = READ_ONCE(lock->tickets.owner);
-
-   for (;;) {
-   arch_spinlock_t tmp = READ_ONCE(*lock);
-
-   if (tmp.tickets.owner == tmp.tickets.next ||
-   tmp.tickets.owner != owner)
-   break;
-
-   wfe();
-   }
-   smp_acquire__after_ctrl_dep();
-}
-
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
diff --git a/arch/arm64/include/asm/spinlock.h 
b/arch/arm64/include/asm/spinlock.h
index cae331d553f8..f445bd7f2b9f 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,58 +26,6 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   unsigned int tmp;
-   arch_spinlock_t lockval;
-   u32 owner;
-
-   /*
-* Ensure prior spin_lock operations to other locks have completed
-* on this CPU before we test whether "lock" is locked.
-*/
-   smp_mb();
-   owner = READ_ONCE(lock->owner) << 16;
-
-   asm volatile(
-"  sevl\n"
-"1:wfe\n"
-"2:ldaxr   %w0, %2\n"
-   /* Is the lock free? */
-"  eor %w1, %w0, %w0, ror #16\n"
-"  cbz %w1, 3f\n"
-   /* Lock ta

[PATCH RFC 05/26] exit: Replace spin_unlock_wait() with lock/unlock pair

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and
it appears that all callers could do just as well with a lock/unlock pair.
This commit therefore replaces the spin_unlock_wait() call in do_exit()
with spin_lock() followed immediately by spin_unlock().  This should be
safe from a performance perspective because the lock is a per-task lock,
and this is happening only at task-exit time.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 kernel/exit.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 516acdb0e0ec..1a976e47ddd1 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -832,7 +832,8 @@ void __noreturn do_exit(long code)
 * Ensure that we must observe the pi_state in exit_mm() ->
 * mm_release() -> exit_pi_state_list().
 */
-   raw_spin_unlock_wait(&tsk->pi_lock);
+   raw_spin_lock(&tsk->pi_lock);
+   raw_spin_unlock(&tsk->pi_lock);
 
if (unlikely(in_atomic())) {
pr_info("note: %s[%d] exited with preempt_count %d\n",
-- 
2.5.2



[PATCH RFC 17/26] metag: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: James Hogan <james.ho...@imgtec.com>
Cc: <linux-me...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/metag/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/metag/include/asm/spinlock.h 
b/arch/metag/include/asm/spinlock.h
index c0c7a22be1ae..ddf7fe5708a6 100644
--- a/arch/metag/include/asm/spinlock.h
+++ b/arch/metag/include/asm/spinlock.h
@@ -15,11 +15,6 @@
  * locked.
  */
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
 #definearch_read_lock_flags(lock, flags) arch_read_lock(lock)
-- 
2.5.2



[PATCH RFC 10/26] arc: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Vineet Gupta <vgu...@synopsys.com>
Cc: <linux-snps-...@lists.infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/arc/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/arc/include/asm/spinlock.h b/arch/arc/include/asm/spinlock.h
index 233d5ffe6ec7..a325e6a36523 100644
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_is_locked(x) ((x)->slock != __ARCH_SPIN_LOCK_UNLOCKED__)
 #define arch_spin_lock_flags(lock, flags)  arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(>slock, !VAL);
-}
-
 #ifdef CONFIG_ARC_HAS_LLSC
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
-- 
2.5.2



[PATCH RFC 02/26] task_work: Replace spin_unlock_wait() with lock/unlock pair

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
task_work_run() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because calls to the
other side of the race, task_work_cancel(), should be rare.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Oleg Nesterov <o...@redhat.com>
Cc: Andrew Morton <a...@linux-foundation.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 kernel/task_work.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/task_work.c b/kernel/task_work.c
index d513051fcca2..b9b428832229 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -109,7 +109,8 @@ void task_work_run(void)
 * the first entry == work, cmpxchg(task_works) should
 * fail, but it can play with *work and other entries.
 */
-   raw_spin_unlock_wait(&task->pi_lock);
+   raw_spin_lock(&task->pi_lock);
+   raw_spin_unlock(&task->pi_lock);
 
do {
next = work->next;
-- 
2.5.2



[PATCH RFC 16/26] m32r: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/m32r/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/m32r/include/asm/spinlock.h b/arch/m32r/include/asm/spinlock.h
index 323c7fc953cd..a56825592b90 100644
--- a/arch/m32r/include/asm/spinlock.h
+++ b/arch/m32r/include/asm/spinlock.h
@@ -30,11 +30,6 @@
 #define arch_spin_is_locked(x) (*(volatile int *)(&(x)->slock) <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->slock, VAL > 0);
-}
-
 /**
  * arch_spin_trylock - Try spin lock and return a result
  * @lock: Pointer to the lock variable
-- 
2.5.2



[PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Chris Metcalf <cmetc...@mellanox.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/tile/include/asm/spinlock_32.h |  2 --
 arch/tile/include/asm/spinlock_64.h |  2 --
 arch/tile/lib/spinlock_32.c | 23 ---
 arch/tile/lib/spinlock_64.c | 22 --
 4 files changed, 49 deletions(-)

diff --git a/arch/tile/include/asm/spinlock_32.h 
b/arch/tile/include/asm/spinlock_32.h
index b14b1ba5bf9c..cba8ba9b8da6 100644
--- a/arch/tile/include/asm/spinlock_32.h
+++ b/arch/tile/include/asm/spinlock_32.h
@@ -64,8 +64,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
lock->current_ticket = old_ticket + TICKET_QUANTUM;
 }
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock);
-
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
diff --git a/arch/tile/include/asm/spinlock_64.h 
b/arch/tile/include/asm/spinlock_64.h
index b9718fb4e74a..9a2c2d605752 100644
--- a/arch/tile/include/asm/spinlock_64.h
+++ b/arch/tile/include/asm/spinlock_64.h
@@ -58,8 +58,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
__insn_fetchadd4(&lock->lock, 1U << __ARCH_SPIN_CURRENT_SHIFT);
 }
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock);
-
 void arch_spin_lock_slow(arch_spinlock_t *lock, u32 val);
 
 /* Grab the "next" ticket number and bump it atomically.
diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
index 076c6cc43113..db9333f2447c 100644
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -62,29 +62,6 @@ int arch_spin_trylock(arch_spinlock_t *lock)
 }
 EXPORT_SYMBOL(arch_spin_trylock);
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   u32 iterations = 0;
-   int curr = READ_ONCE(lock->current_ticket);
-   int next = READ_ONCE(lock->next_ticket);
-
-   /* Return immediately if unlocked. */
-   if (next == curr)
-   return;
-
-   /* Wait until the current locker has released the lock. */
-   do {
-   delay_backoff(iterations++);
-   } while (READ_ONCE(lock->current_ticket) == curr);
-
-   /*
-* The TILE architecture doesn't do read speculation; therefore
-* a control dependency guarantees a LOAD->{LOAD,STORE} order.
-*/
-   barrier();
-}
-EXPORT_SYMBOL(arch_spin_unlock_wait);
-
 /*
  * The low byte is always reserved to be the marker for a "tns" operation
  * since the low bit is set to "1" by a tns.  The next seven bits are
diff --git a/arch/tile/lib/spinlock_64.c b/arch/tile/lib/spinlock_64.c
index a4b5b2cbce93..de414c22892f 100644
--- a/arch/tile/lib/spinlock_64.c
+++ b/arch/tile/lib/spinlock_64.c
@@ -62,28 +62,6 @@ int arch_spin_trylock(arch_spinlock_t *lock)
 }
 EXPORT_SYMBOL(arch_spin_trylock);
 
-void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   u32 iterations = 0;
-   u32 val = READ_ONCE(lock->lock);
-   u32 curr = arch_spin_current(val);
-
-   /* Return immediately if unlocked. */
-   if (arch_spin_next(val) == curr)
-   return;
-
-   /* Wait until the current locker has released the lock. */
-   do {
-   delay_backoff(iterations++);
-   } while (arch_spin_current(READ_ONCE(lock->lock)) == curr);
-
-   /*
-* The TILE architecture doesn't do read speculation; therefore
-* a control dependency guarantees a LOAD->{LOAD,STORE} order.
-*/
-   barrier();
-}
-EXPORT_SYMBOL(arch_spin_unlock_wait);
 
 /*
  * If the read lock fails due to a writer, we retry periodically
-- 
2.5.2



[PATCH RFC 03/26] sched: Replace spin_unlock_wait() with lock/unlock pair

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
do_task_dead() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because the lock is
this task's ->pi_lock, and this is called only after the task exits.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e91138fcde86..6dea3d9728c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3461,7 +3461,8 @@ void __noreturn do_task_dead(void)
 * is held by try_to_wake_up()
 */
smp_mb();
-   raw_spin_unlock_wait(&current->pi_lock);
+   raw_spin_lock(&current->pi_lock);
+   raw_spin_unlock(&current->pi_lock);
 
/* Causes final put_task_struct in finish_task_switch(): */
__set_current_state(TASK_DEAD);
-- 
2.5.2



Re: [PATCH RFC 25/26] tile: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
On Thu, Jun 29, 2017 at 05:06:16PM -0700, Linus Torvalds wrote:
> On Thu, Jun 29, 2017 at 5:01 PM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
> > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > and it appears that all callers could do just as well with a lock/unlock
> > pair.  This commit therefore removes the underlying arch-specific
> > arch_spin_unlock_wait().
> 
> Please don't make this one commit for every architecture.
> 
> Once something gets removed, it gets removed. There's no point in
> "remove it from architecture X". If there are no more users, we're
> done with it, and making it be 25 patches with the same commit message
> instead of just one doesn't help anybody.

Apologies, I will merge them.

Thanx, Paul



[PATCH RFC 04/26] completion: Replace spin_unlock_wait() with lock/unlock pair

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() call in
completion_done() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because the lock
will be held only if the wakeup happens really quickly.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 kernel/sched/completion.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index 53f9558fa925..d4b89a59629f 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -307,14 +307,9 @@ bool completion_done(struct completion *x)
 * If ->done, we need to wait for complete() to release ->wait.lock
 * otherwise we can end up freeing the completion before complete()
 * is done referencing it.
-*
-* The RMB pairs with complete()'s RELEASE of ->wait.lock and orders
-* the loads of ->done and ->wait.lock such that we cannot observe
-* the lock before complete() acquires it while observing the ->done
-* after it's acquired the lock.
 */
-   smp_rmb();
-   spin_unlock_wait(&x->wait.lock);
+   spin_lock(&x->wait.lock);
+   spin_unlock(&x->wait.lock);
return true;
 }
 EXPORT_SYMBOL(completion_done);
-- 
2.5.2



[PATCH RFC 18/26] mips: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Ralf Baechle <r...@linux-mips.org>
Cc: <linux-m...@linux-mips.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/mips/include/asm/spinlock.h | 16 
 1 file changed, 16 deletions(-)

diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index a8df44d60607..81b4945031ee 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -50,22 +50,6 @@ static inline int arch_spin_value_unlocked(arch_spinlock_t 
lock)
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   u16 owner = READ_ONCE(lock->h.serving_now);
-   smp_rmb();
-   for (;;) {
-   arch_spinlock_t tmp = READ_ONCE(*lock);
-
-   if (tmp.h.serving_now == tmp.h.ticket ||
-   tmp.h.serving_now != owner)
-   break;
-
-   cpu_relax();
-   }
-   smp_acquire__after_ctrl_dep();
-}
-
 static inline int arch_spin_is_contended(arch_spinlock_t *lock)
 {
u32 counters = ACCESS_ONCE(lock->lock);
-- 
2.5.2



[PATCH RFC 13/26] blackfin: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Steven Miao <real...@gmail.com>
Cc: <adi-buildroot-de...@lists.sourceforge.net>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/blackfin/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/blackfin/include/asm/spinlock.h 
b/arch/blackfin/include/asm/spinlock.h
index c58f4a83ed6f..f6431439d15d 100644
--- a/arch/blackfin/include/asm/spinlock.h
+++ b/arch/blackfin/include/asm/spinlock.h
@@ -48,11 +48,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
__raw_spin_unlock_asm(&lock->lock);
 }
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 static inline int arch_read_can_lock(arch_rwlock_t *rw)
 {
return __raw_uncached_fetch_asm(&rw->lock) > 0;
-- 
2.5.2



[PATCH RFC 07/26] drivers/ata: Replace spin_unlock_wait() with lock/unlock pair

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore eliminates the spin_unlock_wait() call and
associated else-clause and hoists the then-clause's lock and unlock out of
the "if" statement.  This should be safe from a performance perspective
because according to Tejun there should be few if any drivers that don't
set their own error handler.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Acked-by: Tejun Heo <t...@kernel.org>
Cc: <linux-...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 drivers/ata/libata-eh.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index ef68232b5222..779f6f18c1f4 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -645,12 +645,11 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, 
struct ata_port *ap,
 * completions are honored.  A scmd is determined to have
 * timed out iff its associated qc is active and not failed.
 */
+   spin_lock_irqsave(ap->lock, flags);
if (ap->ops->error_handler) {
struct scsi_cmnd *scmd, *tmp;
int nr_timedout = 0;
 
-   spin_lock_irqsave(ap->lock, flags);
-
/* This must occur under the ap->lock as we don't want
   a polled recovery to race the real interrupt handler
 
@@ -700,12 +699,11 @@ void ata_scsi_cmd_error_handler(struct Scsi_Host *host, 
struct ata_port *ap,
if (nr_timedout)
__ata_port_freeze(ap);
 
-   spin_unlock_irqrestore(ap->lock, flags);
 
/* initialize eh_tries */
ap->eh_tries = ATA_EH_MAX_TRIES;
-   } else
-   spin_unlock_wait(ap->lock);
+   }
+   spin_unlock_irqrestore(ap->lock, flags);
 
 }
 EXPORT_SYMBOL(ata_scsi_cmd_error_handler);
-- 
2.5.2



[PATCH RFC 15/26] ia64: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Tony Luck <tony.l...@intel.com>
Cc: Fenghua Yu <fenghua...@intel.com>
Cc: <linux-i...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/ia64/include/asm/spinlock.h | 21 -
 1 file changed, 21 deletions(-)

diff --git a/arch/ia64/include/asm/spinlock.h b/arch/ia64/include/asm/spinlock.h
index ca9e76149a4a..df2c121164b8 100644
--- a/arch/ia64/include/asm/spinlock.h
+++ b/arch/ia64/include/asm/spinlock.h
@@ -76,22 +76,6 @@ static __always_inline void 
__ticket_spin_unlock(arch_spinlock_t *lock)
ACCESS_ONCE(*p) = (tmp + 2) & ~1;
 }
 
-static __always_inline void __ticket_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   int *p = (int *)&lock->lock, ticket;
-
-   ia64_invala();
-
-   for (;;) {
-   asm volatile ("ld4.c.nc %0=[%1]" : "=r"(ticket) : "r"(p) : 
"memory");
-   if (!(((ticket >> TICKET_SHIFT) ^ ticket) & TICKET_MASK))
-   return;
-   cpu_relax();
-   }
-
-   smp_acquire__after_ctrl_dep();
-}
-
 static inline int __ticket_spin_is_locked(arch_spinlock_t *lock)
 {
long tmp = ACCESS_ONCE(lock->lock);
@@ -143,11 +127,6 @@ static __always_inline void 
arch_spin_lock_flags(arch_spinlock_t *lock,
arch_spin_lock(lock);
 }
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   __ticket_spin_unlock_wait(lock);
-}
-
 #define arch_read_can_lock(rw) (*(volatile int *)(rw) >= 0)
 #define arch_write_can_lock(rw)(*(volatile int *)(rw) == 0)
 
-- 
2.5.2



[PATCH RFC 14/26] hexagon: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Richard Kuo <r...@codeaurora.org>
Cc: <linux-hexa...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/hexagon/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/hexagon/include/asm/spinlock.h 
b/arch/hexagon/include/asm/spinlock.h
index a1c55788c5d6..53a8d5885887 100644
--- a/arch/hexagon/include/asm/spinlock.h
+++ b/arch/hexagon/include/asm/spinlock.h
@@ -179,11 +179,6 @@ static inline unsigned int 
arch_spin_trylock(arch_spinlock_t *lock)
  */
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 #define arch_spin_is_locked(x) ((x)->lock != 0)
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
-- 
2.5.2



[PATCH RFC 12/26] arm64: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.mari...@arm.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: <linux-arm-ker...@lists.infradead.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/arm64/include/asm/spinlock.h | 58 ---
 1 file changed, 5 insertions(+), 53 deletions(-)

diff --git a/arch/arm64/include/asm/spinlock.h 
b/arch/arm64/include/asm/spinlock.h
index cae331d553f8..f445bd7f2b9f 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -26,58 +26,6 @@
  * The memory barriers are implicit with the load-acquire and store-release
  * instructions.
  */
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   unsigned int tmp;
-   arch_spinlock_t lockval;
-   u32 owner;
-
-   /*
-* Ensure prior spin_lock operations to other locks have completed
-* on this CPU before we test whether "lock" is locked.
-*/
-   smp_mb();
-   owner = READ_ONCE(lock->owner) << 16;
-
-   asm volatile(
-"  sevl\n"
-"1:wfe\n"
-"2:ldaxr   %w0, %2\n"
-   /* Is the lock free? */
-"  eor %w1, %w0, %w0, ror #16\n"
-"  cbz %w1, 3f\n"
-   /* Lock taken -- has there been a subsequent unlock->lock transition? */
-"  eor %w1, %w3, %w0, lsl #16\n"
-"  cbz %w1, 1b\n"
-   /*
-* The owner has been updated, so there was an unlock->lock
-* transition that we missed. That means we can rely on the
-* store-release of the unlock operation paired with the
-* load-acquire of the lock operation to publish any of our
-* previous stores to the new lock owner and therefore don't
-* need to bother with the writeback below.
-*/
-"  b   4f\n"
-"3:\n"
-   /*
-* Serialise against any concurrent lockers by writing back the
-* unlocked lock value
-*/
-   ARM64_LSE_ATOMIC_INSN(
-   /* LL/SC */
-"  stxr%w1, %w0, %2\n"
-   __nops(2),
-   /* LSE atomics */
-"  mov %w1, %w0\n"
-"  cas %w0, %w0, %2\n"
-"  eor %w1, %w1, %w0\n")
-   /* Somebody else wrote to the lock, GOTO 10 and reload the value */
-"  cbnz%w1, 2b\n"
-"4:"
-   : "=" (lockval), "=" (tmp), "+Q" (*lock)
-   : "r" (owner)
-   : "memory");
-}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
@@ -176,7 +124,11 @@ static inline int arch_spin_value_unlocked(arch_spinlock_t 
lock)
 
 static inline int arch_spin_is_locked(arch_spinlock_t *lock)
 {
-   smp_mb(); /* See arch_spin_unlock_wait */
+   /*
+* Ensure prior spin_lock operations to other locks have completed
+* on this CPU before we test whether "lock" is locked.
+*/
+   smp_mb(); /* ^^^ */
return !arch_spin_value_unlocked(READ_ONCE(*lock));
 }
 
-- 
2.5.2



[PATCH RFC 19/26] mn10300: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: David Howells <dhowe...@redhat.com>
Cc: <linux-am33-l...@redhat.com>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/mn10300/include/asm/spinlock.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/mn10300/include/asm/spinlock.h 
b/arch/mn10300/include/asm/spinlock.h
index 9c7b8f7942d8..fe413b41df6c 100644
--- a/arch/mn10300/include/asm/spinlock.h
+++ b/arch/mn10300/include/asm/spinlock.h
@@ -26,11 +26,6 @@
 
 #define arch_spin_is_locked(x) (*(volatile signed char *)(&(x)->slock) != 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->slock, !VAL);
-}
-
 static inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
asm volatile(
-- 
2.5.2



[PATCH RFC 22/26] s390: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidef...@de.ibm.com>
Cc: Heiko Carstens <heiko.carst...@de.ibm.com>
Cc: <linux-s...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/s390/include/asm/spinlock.h | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index f7838ecd83c6..217ee5210c32 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -98,13 +98,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lp)
: "cc", "memory");
 }
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   while (arch_spin_is_locked(lock))
-   arch_spin_relax(lock);
-   smp_acquire__after_ctrl_dep();
-}
-
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
-- 
2.5.2



[PATCH RFC 21/26] powerpc: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
Cc: Paul Mackerras <pau...@samba.org>
Cc: Michael Ellerman <m...@ellerman.id.au>
Cc: <linuxppc-...@lists.ozlabs.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/powerpc/include/asm/spinlock.h | 33 -
 1 file changed, 33 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 8c1b913de6d7..d256e448ea49 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -170,39 +170,6 @@ static inline void arch_spin_unlock(arch_spinlock_t *lock)
lock->slock = 0;
 }
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   arch_spinlock_t lock_val;
-
-   smp_mb();
-
-   /*
-* Atomically load and store back the lock value (unchanged). This
-* ensures that our observation of the lock value is ordered with
-* respect to other lock operations.
-*/
-   __asm__ __volatile__(
-"1:" PPC_LWARX(%0, 0, %2, 0) "\n"
-"  stwcx. %0, 0, %2\n"
-"  bne- 1b\n"
-   : "=" (lock_val), "+m" (*lock)
-   : "r" (lock)
-   : "cr0", "xer");
-
-   if (arch_spin_value_unlocked(lock_val))
-   goto out;
-
-   while (lock->slock) {
-   HMT_low();
-   if (SHARED_PROCESSOR)
-   __spin_yield(lock);
-   }
-   HMT_medium();
-
-out:
-   smp_mb();
-}
-
 /*
  * Read-write spinlocks, allowing multiple readers
  * but only one writer.
-- 
2.5.2



[PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore replaces the spin_unlock_wait() calls
in nf_conntrack_lock() and nf_conntrack_all_lock() with spin_lock()
followed immediately by spin_unlock().  These functions do not appear
to be invoked on any fastpaths.

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: Jozsef Kadlecsik <kad...@blackhole.kfki.hu>
Cc: Florian Westphal <f...@strlen.de>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: <netfilter-de...@vger.kernel.org>
Cc: <coret...@netfilter.org>
Cc: <netdev@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 net/netfilter/nf_conntrack_core.c | 26 --
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index e847dbaa0c6b..9f997859d160 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -99,15 +99,11 @@ void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
spin_lock(lock);
while (unlikely(nf_conntrack_locks_all)) {
spin_unlock(lock);
-
-   /*
-* Order the 'nf_conntrack_locks_all' load vs. the
-* spin_unlock_wait() loads below, to ensure
-* that 'nf_conntrack_locks_all_lock' is indeed held:
-*/
-   smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
-   spin_unlock_wait(&nf_conntrack_locks_all_lock);
+   /* Wait for nf_conntrack_locks_all_lock holder to release ... */
+   spin_lock(&nf_conntrack_locks_all_lock);
+   spin_unlock(&nf_conntrack_locks_all_lock);
spin_lock(lock);
+   /* ... and retry. */
}
 }
 EXPORT_SYMBOL_GPL(nf_conntrack_lock);
@@ -150,17 +146,11 @@ static void nf_conntrack_all_lock(void)
 
spin_lock(&nf_conntrack_locks_all_lock);
nf_conntrack_locks_all = true;
-
-   /*
-* Order the above store of 'nf_conntrack_locks_all' against
-* the spin_unlock_wait() loads below, such that if
-* nf_conntrack_lock() observes 'nf_conntrack_locks_all'
-* we must observe nf_conntrack_locks[] held:
-*/
-   smp_mb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
-
for (i = 0; i < CONNTRACK_LOCKS; i++) {
-   spin_unlock_wait(&nf_conntrack_locks[i]);
+   /* Wait for any current holder to release lock. */
+   spin_lock(&nf_conntrack_locks[i]);
+   spin_unlock(&nf_conntrack_locks[i]);
+   /* Next acquisition will see nf_conntrack_locks_all == true. */
}
 }
 
-- 
2.5.2



[PATCH RFC 24/26] sparc: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: <sparcli...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/sparc/include/asm/spinlock_32.h | 5 -
 arch/sparc/include/asm/spinlock_64.h | 5 -
 2 files changed, 10 deletions(-)

diff --git a/arch/sparc/include/asm/spinlock_32.h 
b/arch/sparc/include/asm/spinlock_32.h
index 8011e79f59c9..67345b2dc408 100644
--- a/arch/sparc/include/asm/spinlock_32.h
+++ b/arch/sparc/include/asm/spinlock_32.h
@@ -14,11 +14,6 @@
 
 #define arch_spin_is_locked(lock) (*((volatile unsigned char *)(lock)) != 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
__asm__ __volatile__(
diff --git a/arch/sparc/include/asm/spinlock_64.h 
b/arch/sparc/include/asm/spinlock_64.h
index 07c9f2e9bf57..923d57f9b79d 100644
--- a/arch/sparc/include/asm/spinlock_64.h
+++ b/arch/sparc/include/asm/spinlock_64.h
@@ -26,11 +26,6 @@
 
 #define arch_spin_is_locked(lp)((lp)->lock != 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
unsigned long tmp;
-- 
2.5.2



[PATCH RFC 23/26] sh: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Yoshinori Sato <ys...@users.sourceforge.jp>
Cc: Rich Felker <dal...@libc.org>
Cc: <linux...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/sh/include/asm/spinlock-cas.h  | 5 -
 arch/sh/include/asm/spinlock-llsc.h | 5 -
 2 files changed, 10 deletions(-)

diff --git a/arch/sh/include/asm/spinlock-cas.h 
b/arch/sh/include/asm/spinlock-cas.h
index c46e8cc7b515..5ed7dbbd94ff 100644
--- a/arch/sh/include/asm/spinlock-cas.h
+++ b/arch/sh/include/asm/spinlock-cas.h
@@ -29,11 +29,6 @@ static inline unsigned __sl_cas(volatile unsigned *p, 
unsigned old, unsigned new
 #define arch_spin_is_locked(x) ((x)->lock <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, VAL > 0);
-}
-
 static inline void arch_spin_lock(arch_spinlock_t *lock)
 {
while (!__sl_cas(&lock->lock, 1, 0));
diff --git a/arch/sh/include/asm/spinlock-llsc.h 
b/arch/sh/include/asm/spinlock-llsc.h
index cec78143fa83..f77263aae760 100644
--- a/arch/sh/include/asm/spinlock-llsc.h
+++ b/arch/sh/include/asm/spinlock-llsc.h
@@ -21,11 +21,6 @@
 #define arch_spin_is_locked(x) ((x)->lock <= 0)
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-   smp_cond_load_acquire(&lock->lock, VAL > 0);
-}
-
 /*
  * Simple spin lock operations.  There are two variants, one clears IRQ's
  * on the local processor, one does not.
-- 
2.5.2



[PATCH RFC 26/26] xtensa: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Chris Zankel <ch...@zankel.net>
Cc: Max Filippov <jcmvb...@gmail.com>
Cc: <linux-xte...@linux-xtensa.org>
---
 arch/xtensa/include/asm/spinlock.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/xtensa/include/asm/spinlock.h b/arch/xtensa/include/asm/spinlock.h
index a36221cf6363..3bb49681ee24 100644
--- a/arch/xtensa/include/asm/spinlock.h
+++ b/arch/xtensa/include/asm/spinlock.h
@@ -33,11 +33,6 @@
 
 #define arch_spin_is_locked(x) ((x)->slock != 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->slock, !VAL);
-}
-
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
 static inline void arch_spin_lock(arch_spinlock_t *lock)
-- 
2.5.2



[PATCH RFC 20/26] parisc: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: "James E.J. Bottomley" <j...@parisc-linux.org>
Cc: Helge Deller <del...@gmx.de>
Cc: <linux-par...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/parisc/include/asm/spinlock.h | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/arch/parisc/include/asm/spinlock.h b/arch/parisc/include/asm/spinlock.h
index e32936cd7f10..55bfe4affca3 100644
--- a/arch/parisc/include/asm/spinlock.h
+++ b/arch/parisc/include/asm/spinlock.h
@@ -14,13 +14,6 @@ static inline int arch_spin_is_locked(arch_spinlock_t *x)
 
 #define arch_spin_lock(lock) arch_spin_lock_flags(lock, 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *x)
-{
-   volatile unsigned int *a = __ldcw_align(x);
-
-   smp_cond_load_acquire(a, VAL);
-}
-
 static inline void arch_spin_lock_flags(arch_spinlock_t *x,
 unsigned long flags)
 {
-- 
2.5.2



[PATCH RFC 09/26] alpha: Remove spin_unlock_wait() arch-specific definitions

2017-06-29 Thread Paul E. McKenney
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair.  This commit therefore removes the underlying arch-specific
arch_spin_unlock_wait().

Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Richard Henderson <r...@twiddle.net>
Cc: Ivan Kokshaysky <i...@jurassic.park.msu.ru>
Cc: Matt Turner <matts...@gmail.com>
Cc: <linux-al...@vger.kernel.org>
Cc: Will Deacon <will.dea...@arm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Alan Stern <st...@rowland.harvard.edu>
Cc: Andrea Parri <parri.and...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
---
 arch/alpha/include/asm/spinlock.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/arch/alpha/include/asm/spinlock.h b/arch/alpha/include/asm/spinlock.h
index a40b9fc0c6c3..718ac0b64adf 100644
--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -16,11 +16,6 @@
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 #define arch_spin_is_locked(x) ((x)->lock != 0)
 
-static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
-{
-	smp_cond_load_acquire(&lock->lock, !VAL);
-}
-
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
 return lock.lock == 0;
-- 
2.5.2


