Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-29 Thread Maciej W. Rozycki
On Thu, 28 Jan 2016, Leonid Yegoshin wrote:

> In http://patchwork.linux-mips.org/patch/10505/ the very last mesg exchange
> is:
[...]
> ... and that stops forever...

 Thanks for the reminder -- last June was very hectic, I travelled a lot 
and I lost the discussion from my radar.  Apologies for that.  I replied 
in that thread now with my results.  I hope this helps.

  Maciej


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-29 Thread Maciej W. Rozycki
On Thu, 28 Jan 2016, Leonid Yegoshin wrote:

> In http://patchwork.linux-mips.org/patch/10505/ the very last mesg exchange
> is:
[...]
> ... and that stops forever...

 Thanks for the reminder -- last June was very hectic, I travelled a lot 
and I lost the discussion from my radar.  Apologies for that.  I replied 
in that thread now with my results.  I hope this helps.

  Maciej


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Leonid Yegoshin

On 01/27/2016 03:26 AM, Maciej W. Rozycki wrote:

On Fri, 15 Jan 2016, Leonid Yegoshin wrote:


So you need to build a different kernel for some types of MIPS systems?
Or do you do boot-time rewriting, like a number of other arches do?

I don't know. I would like to have responses. Ralf asked Maciej about old
systems and that came nowhere. Even rewrite - don't know what to do with that:
no lightweight SYNC or no SYNC at all - yes, it is still possible that SYNC on
some systems can be too heavy or even harmful, nobody tested that.

  I don't recall being asked; mind that I might not get to messages I have
not been cc-ed in a timely manner and I may miss some altogether.  With
the amount of mailing list traffic that passes by me my scanner may fail
to trigger.  Sorry if this causes anybody trouble, but such is life.

  Coincidentally, I have just posted some notes on SYNC in a different
thread, see .
There's a reference to an older message of mine there too.  I hope this
answers your questions.

   Maciej
In http://patchwork.linux-mips.org/patch/10505/the very last mesg 
exchange is:


Maciej,

do you have an R4000 / R4600 / R5000 / R7000 / SiByte system at hand to
test this?
...
  Ralf

Maciej W. Rozycki- June 5, 2015, 9:18 p.m.

On Fri, 5 Jun 2015, Ralf Baechle wrote:


do you have an R4000 / R4600 / R5000 / R7000 / SiByte system at hand to
test this?


 I should be able to check R4400 (that is virtually the same as R4000)
next week or so.  As to SiByte -- not before next month I'm afraid.  I
don't have access to any of the other processors you named.  You may
want to find a better person if you want to accept this change soon.

  Maciej

... and that stops forever...

- Leonid.


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Paul E. McKenney
On Wed, Jan 27, 2016 at 10:04:47AM +0800, Boqun Feng wrote:
> On Tue, Jan 26, 2016 at 03:29:21PM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:
> > > On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
> > >  wrote:
> > > >
> > > > You might as well just write it as
> > > >
> > > > struct foo x = READ_ONCE(*ptr);
> > > > x->bar = 5;
> > > >
> > > > because that "smp_read_barrier_depends()" does NOTHING wrt the second 
> > > > write.
> > > 
> > > Just to clarify: on alpha it adds a memory barrier, but that memory
> > > barrier is useless.
> > 
> > No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> > needed.  That said, I believe that we should encourage rcu_dereference*()
> > or lockless_dereference() instead of READ_ONCE() for documentation
> > reasons, though.
> > 
> > > On non-alpha, it is a no-op, and obviously does nothing simply because
> > > it generates no code.
> > > 
> > > So if anybody believes that the "smp_read_barrier_depends()" does
> > > something, they are *wrong*.
> > 
> > The other problem with smp_read_barrier_depends() is that it is often
> > a pain figuring out which prior load it is supposed to apply to.
> > Hence my preference for rcu_dereference*() and lockless_dereference().
> > 
> 
> Because semantically speaking, rcu_derefence*() and
> lockless_dereference() are CONSUME(i.e. data/address dependent
> read->read and read->write pairs are ordered), whereas
> smp_read_barrier_depends() only guarantees read->read pairs with data
> dependency are ordered, right?
> 
> If so, maybe we need to call it out in memory-barriers.txt, for example:
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index 904ee42..6b262c2 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1703,8 +1703,8 @@ There are some more advanced barrier functions:
>  
>  
>   (*) lockless_dereference();
> - This can be thought of as a pointer-fetch wrapper around the
> - smp_read_barrier_depends() data-dependency barrier.
> + This is a load, and any load or store that has a data dependency on the
> + value returned by this load won't be reordered before this load.

This is a good start, but more is needed to warn people off of
smp_read_barrier_depends().  But yes, better explanation would be good.

Thanx, Paul

>   This is also similar to rcu_dereference(), but in cases where
>   object lifetime is handled by some mechanism other than RCU, for
> 
> 
> Regards,
> Boqun
> 
> > > And if anybody sends out an email with that smp_read_barrier_depends()
> > > in an example, they are actively just confusing other people, which is
> > > even worse than just being wrong. Which is why I jumped in.
> > > 
> > > So stop perpetuating the myth that smp_read_barrier_depends() does
> > > something here. It does not. It's a bug, and it has become this "mind
> > > virus" for some people that seem to believe that it does something.
> > 
> > It looks like I should add words to memory-barriers.txt de-emphasizing
> > smp_read_barrier_depends().  I will take a look at that.
> > 
> > > I had to remove this crap once from the kernel already, see commit
> > > 105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
> > > atomic*_read_ctrl()").
> > > 
> > > I don't want to ever see that broken construct again. And I want to
> > > make sure that everybody is educated about how broken it was. I'm
> > > extremely unhappy that it came up again.
> > 
> > Well, if it makes you feel better, that was control dependencies and this
> > was data dependencies.  So it was not -exactly- the same.  ;-)
> > 
> > (Sorry, couldn't resist...)
> > 
> > > If it turns out that some architecture does actually need a barrier
> > > between a read and a dependent write, then that will mean that
> > > 
> > >  (a) we'll have to make up a _new_ barrier, because
> > > "smp_read_barrier_depends()" is not that barrier. We'll presumably
> > > then have to make that new barrier part of "rcu_derefence()" and
> > > friends.
> > 
> > Agreed.  We can worry about whether or not we replace the current
> > smp_read_barrier_depends() with that new barrier when and if such
> > hardware appears.
> > 
> > >  (b) we will have found an architecture with even worse memory
> > > ordering semantics than alpha, and we'll have to stop castigating
> > > alpha for being the worst memory ordering ever.
> > 
> > ;-) ;-) ;-)
> > 
> > > but I sincerely hope that we'll never find that kind of broken 
> > > architecture.
> > 
> > Apparently at least some hardware vendors are reading memory-barriers.txt,
> > so perhaps the odds of that kind of breakage have reduced.
> > 
> > Thanx, Paul
> > 




Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Paul E. McKenney
On Wed, Jan 27, 2016 at 10:25:46AM +, Will Deacon wrote:
> On Tue, Jan 26, 2016 at 11:58:20AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 26, 2016 at 12:16:09PM +, Will Deacon wrote:
> > > On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > > > PPC Overlapping Group-B sets version 4
> > > > > > ""
> > > > > > (* When the Group-B sets from two different barriers involve 
> > > > > > instructions in
> > > > > >the same thread, within that thread one set must contain the 
> > > > > > other.
> > > > > > 
> > > > > > P0  P1  P2
> > > > > > Rx=1Wy=1Wz=2
> > > > > > dep.lwsync  lwsync
> > > > > > Ry=0Wz=1Wx=1
> > > > > > Rz=1
> > > > > > 
> > > > > > assert(!(z=2))
> > > > > > 
> > > > > >Forbidden by ppcmem, allowed by herd.
> > > > > > *)
> > > > > > {
> > > > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > > > }
> > > > > >  P0 | P1| P2;
> > > > > >  lwz r6,0(r1)   | stw r4,0(r2)  | stw r5,0(r3)  ;
> > > > > >  xor r7,r6,r6   | lwsync| lwsync;
> > > > > >  lwzx r7,r7,r2  | stw r4,0(r3)  | stw r4,0(r1)  ;
> > > > > >  lwz r8,0(r3)   |   |   ;
> > > > > > 
> > > > > > exists
> > > > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > > > 
> > > > > That really hurts. Assuming that the "assert(!(z=2))" is actually 
> > > > > there
> > > > > to constrain the coherence order of z to be {0->1->2}, then I think 
> > > > > that
> > > > > this test is forbidden on arm using dmb instead of lwsync. That said, 
> > > > > I
> > > > > also don't think the Rz=1 in P0 changes anything.
> > > > 
> > > > What about the smp_wmb() variant of dmb that orders only stores?
> > > 
> > > Tricky, but I think it still works out if the coherence order of z is as
> > > I described above. The line of reasoning is weird though -- I ended up
> > > considering the two cases where P0 reads z before and after it reads x
> > > and what that means for the read of y.
> > 
> > By "works out" you mean that ARM prohibits the outcome?
> 
> Yes, that's my understanding.

Very good, we have agreement between the two architectures, then.  ;-)

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Maciej W. Rozycki
On Wed, 27 Jan 2016, Ralf Baechle wrote:

> > So you need to build a different kernel for some types of MIPS systems?
> 
> Yes.  We can't really do without.  Classic MIPS code is not relocatable
> without the complexity of PIC code as used by ELF DSOs - and their
> performanc penalty.  Plus we have a number of architecture revisions
> ovr the decades, big and little endian, 32 and 64 bit as the major
> stumbling stones.  There however are groups of similar systems that
> can share kernel binaries.

 Matt (cc-ed) has recently posted patches to add support for a relocatable 
kernel, implemented without the usual overhead of PIC code.  It works by 
retaining relocations in a fully-linked binary and then simply replaying 
the work the static linker does when assigning addresses, as the image 
loaded is copied to its intended destination at an early bootstrap stage.  
See: 

 
for details.

 I think this framework can be reused by carefully choosing instructions 
used in early bootstrap code, up to the relocation stage, so that it is 
runnable anywhere (not the same as PIC!) like early ld.so initialisation 
and then loading the whole attached image starting from an address where 
RAM does exist on target hardware.

 Endianness is a different matter, obviously we can't build a single image 
for both, although for distributions' sake an approach similar to one used 
with bi-endian firmware (for hardware which has an easy way to switch the 
endianness, e.g. a physical jumper or a configuration bit stored in flash 
memory; not to be confused with the reverse user endianness mode) might be 
feasible, by glueing two kernel images together and then selecting the 
right one early in bootstrap, perhaps again reusing Matt's framework.  
I'm not sure if this is worth the effort though, I suspect the usage level 
of this feature would be minimal.

 All in all I think making a generic MIPS kernel just might be feasible, 
but with the diversity of options available the effort required would be 
enormous.  NetBSD for example I believe supports building a kernel that 
correctly runs on both R3000 (MIPS I, 32-bit) and R4000 (MIPS III, 64-bit) 
DEC hardware (as did DEC Ultrix, the vendor OS for these systems).  These 
processors are different enough from each other that you cannot use the 
same code for cache, memory and exception management in an OS kernel -- 
backward compatibility is only provided for user software.  That proves 
the concept, however in a very limited way only, not even covering SMP, 
and their R4000 kernel does not support 64-bit userland I believe.  They 
still have completely separate ports for other MIPS hardware, such as for 
Broadcom SiByte SB-1 (MIPS64r1) processors.

  Maciej


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Maciej W. Rozycki
On Fri, 15 Jan 2016, Leonid Yegoshin wrote:

> > So you need to build a different kernel for some types of MIPS systems?
> > Or do you do boot-time rewriting, like a number of other arches do?
> 
> I don't know. I would like to have responses. Ralf asked Maciej about old
> systems and that came nowhere. Even rewrite - don't know what to do with that:
> no lightweight SYNC or no SYNC at all - yes, it is still possible that SYNC on
> some systems can be too heavy or even harmful, nobody tested that.

 I don't recall being asked; mind that I might not get to messages I have 
not been cc-ed in a timely manner and I may miss some altogether.  With 
the amount of mailing list traffic that passes by me my scanner may fail 
to trigger.  Sorry if this causes anybody trouble, but such is life.

 Coincidentally, I have just posted some notes on SYNC in a different 
thread, see .  
There's a reference to an older message of mine there too.  I hope this 
answers your questions.

  Maciej


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Ralf Baechle
On Thu, Jan 14, 2016 at 04:47:53PM -0800, Paul E. McKenney wrote:

> So you need to build a different kernel for some types of MIPS systems?

Yes.  We can't really do without.  Classic MIPS code is not relocatable
without the complexity of PIC code as used by ELF DSOs - and their
performanc penalty.  Plus we have a number of architecture revisions
ovr the decades, big and little endian, 32 and 64 bit as the major
stumbling stones.  There however are groups of similar systems that
can share kernel binaries.

> Or do you do boot-time rewriting, like a number of other arches do?

We don't rewrite the code (as in the .text of the vmlinux binary) but we
do runtime code generation for a few highly performance sensitive area
of the kernel code such as copy_page() or TLB exception handlers.  This
allows more flexibility than just inserting templates into the kernel
code.  Downside - it means we have some of the complexity of as and ld
in the kernel.

  Ralf


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Will Deacon
On Tue, Jan 26, 2016 at 11:58:20AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 12:16:09PM +, Will Deacon wrote:
> > On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > > PPC Overlapping Group-B sets version 4
> > > > > ""
> > > > > (* When the Group-B sets from two different barriers involve 
> > > > > instructions in
> > > > >the same thread, within that thread one set must contain the other.
> > > > > 
> > > > >   P0  P1  P2
> > > > >   Rx=1Wy=1Wz=2
> > > > >   dep.lwsync  lwsync
> > > > >   Ry=0Wz=1Wx=1
> > > > >   Rz=1
> > > > > 
> > > > >   assert(!(z=2))
> > > > > 
> > > > >Forbidden by ppcmem, allowed by herd.
> > > > > *)
> > > > > {
> > > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > > }
> > > > >  P0   | P1| P2;
> > > > >  lwz r6,0(r1) | stw r4,0(r2)  | stw r5,0(r3)  ;
> > > > >  xor r7,r6,r6 | lwsync| lwsync;
> > > > >  lwzx r7,r7,r2| stw r4,0(r3)  | stw r4,0(r1)  ;
> > > > >  lwz r8,0(r3) |   |   ;
> > > > > 
> > > > > exists
> > > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > > 
> > > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > > also don't think the Rz=1 in P0 changes anything.
> > > 
> > > What about the smp_wmb() variant of dmb that orders only stores?
> > 
> > Tricky, but I think it still works out if the coherence order of z is as
> > I described above. The line of reasoning is weird though -- I ended up
> > considering the two cases where P0 reads z before and after it reads x
> > and what that means for the read of y.
> 
> By "works out" you mean that ARM prohibits the outcome?

Yes, that's my understanding.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Will Deacon
On Tue, Jan 26, 2016 at 03:37:33PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 12:10:10PM +, Will Deacon wrote:
> > On Mon, Jan 25, 2016 at 05:06:46PM -0800, Paul E. McKenney wrote:
> > > PPC WRCnf+addrs
> > > ""
> > > {
> > > 0:r2=x; 0:r3=y;
> > > 1:r2=x; 1:r3=y;
> > > 2:r2=x; 2:r3=y;
> > > c=a; d=b; x=c; y=d;
> > > }
> > >  P0   | P1| P2;
> > >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> > >   | stw r2,0(r8)  | lwz r9,0(r8)  ;
> > > exists
> > > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> > 
> > Agreed.
> 
> OK, thank you!  Would you agree that it would be good to replace the
> current xor-based fake-dependency litmus tests with tests having real
> dependencies?

Yes, because it would look a lot more like real (kernel) code.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Peter Zijlstra
On Tue, Jan 26, 2016 at 12:13:39PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 11:19:27AM +0100, Peter Zijlstra wrote:

> > So isn't smp_mb__after_unlock_lock() exactly such a scenario? And would
> > not someone trying to implement RCsc locks using locally transitive
> > RELEASE/ACQUIRE operations need exactly this stuff?
> > 
> > That is, I am afraid we need to cover the mix of local and global
> > transitive operations at least in overview.
> 
> True, but we haven't gotten to locking yet.

The mythical smp_mb__after_release_acquire() then ;-)

(and yes, I know you're going to say we don't have that)

> That said, I would argue
> that smp_mb__after_unlock_lock() upgrades locks to transitive, and
> thus would not be an exception to the "no combining transitive and
> non-transitive steps in cycles" rule.

But But But ;-) It does that exactly by combining. I suspect this is
(partly) the source of your SC chains with one PC link example.




Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Paul E. McKenney
On Wed, Jan 27, 2016 at 10:04:47AM +0800, Boqun Feng wrote:
> On Tue, Jan 26, 2016 at 03:29:21PM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:
> > > On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
> > >  wrote:
> > > >
> > > > You might as well just write it as
> > > >
> > > > struct foo x = READ_ONCE(*ptr);
> > > > x->bar = 5;
> > > >
> > > > because that "smp_read_barrier_depends()" does NOTHING wrt the second 
> > > > write.
> > > 
> > > Just to clarify: on alpha it adds a memory barrier, but that memory
> > > barrier is useless.
> > 
> > No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> > needed.  That said, I believe that we should encourage rcu_dereference*()
> > or lockless_dereference() instead of READ_ONCE() for documentation
> > reasons, though.
> > 
> > > On non-alpha, it is a no-op, and obviously does nothing simply because
> > > it generates no code.
> > > 
> > > So if anybody believes that the "smp_read_barrier_depends()" does
> > > something, they are *wrong*.
> > 
> > The other problem with smp_read_barrier_depends() is that it is often
> > a pain figuring out which prior load it is supposed to apply to.
> > Hence my preference for rcu_dereference*() and lockless_dereference().
> > 
> 
> Because semantically speaking, rcu_derefence*() and
> lockless_dereference() are CONSUME(i.e. data/address dependent
> read->read and read->write pairs are ordered), whereas
> smp_read_barrier_depends() only guarantees read->read pairs with data
> dependency are ordered, right?
> 
> If so, maybe we need to call it out in memory-barriers.txt, for example:
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index 904ee42..6b262c2 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1703,8 +1703,8 @@ There are some more advanced barrier functions:
>  
>  
>   (*) lockless_dereference();
> - This can be thought of as a pointer-fetch wrapper around the
> - smp_read_barrier_depends() data-dependency barrier.
> + This is a load, and any load or store that has a data dependency on the
> + value returned by this load won't be reordered before this load.

This is a good start, but more is needed to warn people off of
smp_read_barrier_depends().  But yes, better explanation would be good.

Thanx, Paul

>   This is also similar to rcu_dereference(), but in cases where
>   object lifetime is handled by some mechanism other than RCU, for
> 
> 
> Regards,
> Boqun
> 
> > > And if anybody sends out an email with that smp_read_barrier_depends()
> > > in an example, they are actively just confusing other people, which is
> > > even worse than just being wrong. Which is why I jumped in.
> > > 
> > > So stop perpetuating the myth that smp_read_barrier_depends() does
> > > something here. It does not. It's a bug, and it has become this "mind
> > > virus" for some people that seem to believe that it does something.
> > 
> > It looks like I should add words to memory-barriers.txt de-emphasizing
> > smp_read_barrier_depends().  I will take a look at that.
> > 
> > > I had to remove this crap once from the kernel already, see commit
> > > 105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
> > > atomic*_read_ctrl()").
> > > 
> > > I don't want to ever see that broken construct again. And I want to
> > > make sure that everybody is educated about how broken it was. I'm
> > > extremely unhappy that it came up again.
> > 
> > Well, if it makes you feel better, that was control dependencies and this
> > was data dependencies.  So it was not -exactly- the same.  ;-)
> > 
> > (Sorry, couldn't resist...)
> > 
> > > If it turns out that some architecture does actually need a barrier
> > > between a read and a dependent write, then that will mean that
> > > 
> > >  (a) we'll have to make up a _new_ barrier, because
> > > "smp_read_barrier_depends()" is not that barrier. We'll presumably
> > > then have to make that new barrier part of "rcu_derefence()" and
> > > friends.
> > 
> > Agreed.  We can worry about whether or not we replace the current
> > smp_read_barrier_depends() with that new barrier when and if such
> > hardware appears.
> > 
> > >  (b) we will have found an architecture with even worse memory
> > > ordering semantics than alpha, and we'll have to stop castigating
> > > alpha for being the worst memory ordering ever.
> > 
> > ;-) ;-) ;-)
> > 
> > > but I sincerely hope that we'll never find that kind of broken 
> > > architecture.
> > 
> > Apparently at least some hardware vendors are reading memory-barriers.txt,
> > so perhaps the odds of that kind of breakage have reduced.
> > 
> > Thanx, Paul
> > 




Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Paul E. McKenney
On Wed, Jan 27, 2016 at 10:25:46AM +, Will Deacon wrote:
> On Tue, Jan 26, 2016 at 11:58:20AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 26, 2016 at 12:16:09PM +, Will Deacon wrote:
> > > On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > > > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > > > PPC Overlapping Group-B sets version 4
> > > > > > ""
> > > > > > (* When the Group-B sets from two different barriers involve 
> > > > > > instructions in
> > > > > >the same thread, within that thread one set must contain the 
> > > > > > other.
> > > > > > 
> > > > > > P0  P1  P2
> > > > > > Rx=1Wy=1Wz=2
> > > > > > dep.lwsync  lwsync
> > > > > > Ry=0Wz=1Wx=1
> > > > > > Rz=1
> > > > > > 
> > > > > > assert(!(z=2))
> > > > > > 
> > > > > >Forbidden by ppcmem, allowed by herd.
> > > > > > *)
> > > > > > {
> > > > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > > > }
> > > > > >  P0 | P1| P2;
> > > > > >  lwz r6,0(r1)   | stw r4,0(r2)  | stw r5,0(r3)  ;
> > > > > >  xor r7,r6,r6   | lwsync| lwsync;
> > > > > >  lwzx r7,r7,r2  | stw r4,0(r3)  | stw r4,0(r1)  ;
> > > > > >  lwz r8,0(r3)   |   |   ;
> > > > > > 
> > > > > > exists
> > > > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > > > 
> > > > > That really hurts. Assuming that the "assert(!(z=2))" is actually 
> > > > > there
> > > > > to constrain the coherence order of z to be {0->1->2}, then I think 
> > > > > that
> > > > > this test is forbidden on arm using dmb instead of lwsync. That said, 
> > > > > I
> > > > > also don't think the Rz=1 in P0 changes anything.
> > > > 
> > > > What about the smp_wmb() variant of dmb that orders only stores?
> > > 
> > > Tricky, but I think it still works out if the coherence order of z is as
> > > I described above. The line of reasoning is weird though -- I ended up
> > > considering the two cases where P0 reads z before and after it reads x
> > > and what that means for the read of y.
> > 
> > By "works out" you mean that ARM prohibits the outcome?
> 
> Yes, that's my understanding.

Very good, we have agreement between the two architectures, then.  ;-)

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Leonid Yegoshin

On 01/27/2016 03:26 AM, Maciej W. Rozycki wrote:

On Fri, 15 Jan 2016, Leonid Yegoshin wrote:


So you need to build a different kernel for some types of MIPS systems?
Or do you do boot-time rewriting, like a number of other arches do?

I don't know. I would like to have responses. Ralf asked Maciej about old
systems and that came nowhere. Even rewrite - don't know what to do with that:
no lightweight SYNC or no SYNC at all - yes, it is still possible that SYNC on
some systems can be too heavy or even harmful, nobody tested that.

  I don't recall being asked; mind that I might not get to messages I have
not been cc-ed in a timely manner and I may miss some altogether.  With
the amount of mailing list traffic that passes by me my scanner may fail
to trigger.  Sorry if this causes anybody trouble, but such is life.

  Coincidentally, I have just posted some notes on SYNC in a different
thread, see .
There's a reference to an older message of mine there too.  I hope this
answers your questions.

   Maciej
In http://patchwork.linux-mips.org/patch/10505/the very last mesg 
exchange is:


Maciej,

do you have an R4000 / R4600 / R5000 / R7000 / SiByte system at hand to
test this?
...
  Ralf

Maciej W. Rozycki- June 5, 2015, 9:18 p.m.

On Fri, 5 Jun 2015, Ralf Baechle wrote:


do you have an R4000 / R4600 / R5000 / R7000 / SiByte system at hand to
test this?


 I should be able to check R4400 (that is virtually the same as R4000)
next week or so.  As to SiByte -- not before next month I'm afraid.  I
don't have access to any of the other processors you named.  You may
want to find a better person if you want to accept this change soon.

  Maciej

... and that stops forever...

- Leonid.


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Maciej W. Rozycki
On Fri, 15 Jan 2016, Leonid Yegoshin wrote:

> > So you need to build a different kernel for some types of MIPS systems?
> > Or do you do boot-time rewriting, like a number of other arches do?
> 
> I don't know. I would like to have responses. Ralf asked Maciej about old
> systems and that came nowhere. Even rewrite - don't know what to do with that:
> no lightweight SYNC or no SYNC at all - yes, it is still possible that SYNC on
> some systems can be too heavy or even harmful, nobody tested that.

 I don't recall being asked; mind that I might not get to messages I have 
not been cc-ed in a timely manner and I may miss some altogether.  With 
the amount of mailing list traffic that passes by me my scanner may fail 
to trigger.  Sorry if this causes anybody trouble, but such is life.

 Coincidentally, I have just posted some notes on SYNC in a different 
thread, see .  
There's a reference to an older message of mine there too.  I hope this 
answers your questions.

  Maciej


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Maciej W. Rozycki
On Wed, 27 Jan 2016, Ralf Baechle wrote:

> > So you need to build a different kernel for some types of MIPS systems?
> 
> Yes.  We can't really do without.  Classic MIPS code is not relocatable
> without the complexity of PIC code as used by ELF DSOs - and their
> performanc penalty.  Plus we have a number of architecture revisions
> ovr the decades, big and little endian, 32 and 64 bit as the major
> stumbling stones.  There however are groups of similar systems that
> can share kernel binaries.

 Matt (cc-ed) has recently posted patches to add support for a relocatable 
kernel, implemented without the usual overhead of PIC code.  It works by 
retaining relocations in a fully-linked binary and then simply replaying 
the work the static linker does when assigning addresses, as the image 
loaded is copied to its intended destination at an early bootstrap stage.  
See: 

 
for details.

 I think this framework can be reused by carefully choosing instructions 
used in early bootstrap code, up to the relocation stage, so that it is 
runnable anywhere (not the same as PIC!) like early ld.so initialisation 
and then loading the whole attached image starting from an address where 
RAM does exist on target hardware.

 Endianness is a different matter, obviously we can't build a single image 
for both, although for distributions' sake an approach similar to one used 
with bi-endian firmware (for hardware which has an easy way to switch the 
endianness, e.g. a physical jumper or a configuration bit stored in flash 
memory; not to be confused with the reverse user endianness mode) might be 
feasible, by glueing two kernel images together and then selecting the 
right one early in bootstrap, perhaps again reusing Matt's framework.  
I'm not sure if this is worth the effort though, I suspect the usage level 
of this feature would be minimal.

 All in all I think making a generic MIPS kernel just might be feasible, 
but with the diversity of options available the effort required would be 
enormous.  NetBSD for example I believe supports building a kernel that 
correctly runs on both R3000 (MIPS I, 32-bit) and R4000 (MIPS III, 64-bit) 
DEC hardware (as did DEC Ultrix, the vendor OS for these systems).  These 
processors are different enough from each other that you cannot use the 
same code for cache, memory and exception management in an OS kernel -- 
backward compatibility is only provided for user software.  That proves 
the concept, however in a very limited way only, not even covering SMP, 
and their R4000 kernel does not support 64-bit userland I believe.  They 
still have completely separate ports for other MIPS hardware, such as for 
Broadcom SiByte SB-1 (MIPS64r1) processors.

  Maciej


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Will Deacon
On Tue, Jan 26, 2016 at 11:58:20AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 12:16:09PM +, Will Deacon wrote:
> > On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > > PPC Overlapping Group-B sets version 4
> > > > > ""
> > > > > (* When the Group-B sets from two different barriers involve 
> > > > > instructions in
> > > > >the same thread, within that thread one set must contain the other.
> > > > > 
> > > > >   P0  P1  P2
> > > > >   Rx=1Wy=1Wz=2
> > > > >   dep.lwsync  lwsync
> > > > >   Ry=0Wz=1Wx=1
> > > > >   Rz=1
> > > > > 
> > > > >   assert(!(z=2))
> > > > > 
> > > > >Forbidden by ppcmem, allowed by herd.
> > > > > *)
> > > > > {
> > > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > > }
> > > > >  P0   | P1| P2;
> > > > >  lwz r6,0(r1) | stw r4,0(r2)  | stw r5,0(r3)  ;
> > > > >  xor r7,r6,r6 | lwsync| lwsync;
> > > > >  lwzx r7,r7,r2| stw r4,0(r3)  | stw r4,0(r1)  ;
> > > > >  lwz r8,0(r3) |   |   ;
> > > > > 
> > > > > exists
> > > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > > 
> > > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > > also don't think the Rz=1 in P0 changes anything.
> > > 
> > > What about the smp_wmb() variant of dmb that orders only stores?
> > 
> > Tricky, but I think it still works out if the coherence order of z is as
> > I described above. The line of reasoning is weird though -- I ended up
> > considering the two cases where P0 reads z before and after it reads x
> > and what that means for the read of y.
> 
> By "works out" you mean that ARM prohibits the outcome?

Yes, that's my understanding.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Peter Zijlstra
On Tue, Jan 26, 2016 at 12:13:39PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 11:19:27AM +0100, Peter Zijlstra wrote:

> > So isn't smp_mb__after_unlock_lock() exactly such a scenario? And would
> > not someone trying to implement RCsc locks using locally transitive
> > RELEASE/ACQUIRE operations need exactly this stuff?
> > 
> > That is, I am afraid we need to cover the mix of local and global
> > transitive operations at least in overview.
> 
> True, but we haven't gotten to locking yet.

The mythical smp_mb__after_release_acquire() then ;-)

(and yes, I know you're going to say we don't have that)

> That said, I would argue
> that smp_mb__after_unlock_lock() upgrades locks to transitive, and
> thus would not be an exception to the "no combining transitive and
> non-transitive steps in cycles" rule.

But But But ;-) It does that exactly by combining. I suspect this is
(partly) the source of your SC chains with one PC link example.




Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Will Deacon
On Tue, Jan 26, 2016 at 03:37:33PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 12:10:10PM +, Will Deacon wrote:
> > On Mon, Jan 25, 2016 at 05:06:46PM -0800, Paul E. McKenney wrote:
> > > PPC WRCnf+addrs
> > > ""
> > > {
> > > 0:r2=x; 0:r3=y;
> > > 1:r2=x; 1:r3=y;
> > > 2:r2=x; 2:r3=y;
> > > c=a; d=b; x=c; y=d;
> > > }
> > >  P0   | P1| P2;
> > >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> > >   | stw r2,0(r8)  | lwz r9,0(r8)  ;
> > > exists
> > > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> > 
> > Agreed.
> 
> OK, thank you!  Would you agree that it would be good to replace the
> current xor-based fake-dependency litmus tests with tests having real
> dependencies?

Yes, because it would look a lot more like real (kernel) code.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Ralf Baechle
On Thu, Jan 14, 2016 at 04:47:53PM -0800, Paul E. McKenney wrote:

> So you need to build a different kernel for some types of MIPS systems?

Yes.  We can't really do without.  Classic MIPS code is not relocatable
without the complexity of PIC code as used by ELF DSOs - and their
performanc penalty.  Plus we have a number of architecture revisions
ovr the decades, big and little endian, 32 and 64 bit as the major
stumbling stones.  There however are groups of similar systems that
can share kernel binaries.

> Or do you do boot-time rewriting, like a number of other arches do?

We don't rewrite the code (as in the .text of the vmlinux binary) but we
do runtime code generation for a few highly performance sensitive area
of the kernel code such as copy_page() or TLB exception handlers.  This
allows more flexibility than just inserting templates into the kernel
code.  Downside - it means we have some of the complexity of as and ld
in the kernel.

  Ralf


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:

> If it turns out that some architecture does actually need a barrier
> between a read and a dependent write, then that will mean that
> 
>  (a) we'll have to make up a _new_ barrier, because
> "smp_read_barrier_depends()" is not that barrier. We'll presumably
> then have to make that new barrier part of "rcu_derefence()" and
> friends.
> 
>  (b) we will have found an architecture with even worse memory
> ordering semantics than alpha, and we'll have to stop castigating
> alpha for being the worst memory ordering ever.
> 
> but I sincerely hope that we'll never find that kind of broken architecture.

So for a moment it looked like MIPS wanted to equal or surpass Alpha in
this respect.

And Paul made the point that smp_read_barrier_depends() really should
be smp_aquire_barrier_depends() in that we rely on both dependent reads
and writes to be ordered against the initial pointer load.

Now, as you've made abundantly clear, Alpha does this, although it needs
the little extra help in the dependent read department.

The 'problem' is that someone seemed to have used our
Documentation/memory-barriers.txt as a specification for what hardware
is permitted and we require. And in that light Paul noted that
read_barrier_depends really should be considered an
acquire_barrier_depends and order both dependent reads and writes
against the (prior) read (if nothing else already does).

Now clearly, any sane architecture doesn't need anything like this, but
again our document doesn't seem to judge. That is, from reading the
document one can get the impression is a perfectly fine thing to do.
Nowhere does our disdain for this thing show.



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Boqun Feng
On Tue, Jan 26, 2016 at 03:29:21PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:
> > On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
> >  wrote:
> > >
> > > You might as well just write it as
> > >
> > > struct foo x = READ_ONCE(*ptr);
> > > x->bar = 5;
> > >
> > > because that "smp_read_barrier_depends()" does NOTHING wrt the second 
> > > write.
> > 
> > Just to clarify: on alpha it adds a memory barrier, but that memory
> > barrier is useless.
> 
> No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> needed.  That said, I believe that we should encourage rcu_dereference*()
> or lockless_dereference() instead of READ_ONCE() for documentation
> reasons, though.
> 
> > On non-alpha, it is a no-op, and obviously does nothing simply because
> > it generates no code.
> > 
> > So if anybody believes that the "smp_read_barrier_depends()" does
> > something, they are *wrong*.
> 
> The other problem with smp_read_barrier_depends() is that it is often
> a pain figuring out which prior load it is supposed to apply to.
> Hence my preference for rcu_dereference*() and lockless_dereference().
> 

Because semantically speaking, rcu_derefence*() and
lockless_dereference() are CONSUME(i.e. data/address dependent
read->read and read->write pairs are ordered), whereas
smp_read_barrier_depends() only guarantees read->read pairs with data
dependency are ordered, right?

If so, maybe we need to call it out in memory-barriers.txt, for example:

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 904ee42..6b262c2 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1703,8 +1703,8 @@ There are some more advanced barrier functions:
 
 
  (*) lockless_dereference();
- This can be thought of as a pointer-fetch wrapper around the
- smp_read_barrier_depends() data-dependency barrier.
+ This is a load, and any load or store that has a data dependency on the
+ value returned by this load won't be reordered before this load.
 
  This is also similar to rcu_dereference(), but in cases where
  object lifetime is handled by some mechanism other than RCU, for


Regards,
Boqun

> > And if anybody sends out an email with that smp_read_barrier_depends()
> > in an example, they are actively just confusing other people, which is
> > even worse than just being wrong. Which is why I jumped in.
> > 
> > So stop perpetuating the myth that smp_read_barrier_depends() does
> > something here. It does not. It's a bug, and it has become this "mind
> > virus" for some people that seem to believe that it does something.
> 
> It looks like I should add words to memory-barriers.txt de-emphasizing
> smp_read_barrier_depends().  I will take a look at that.
> 
> > I had to remove this crap once from the kernel already, see commit
> > 105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
> > atomic*_read_ctrl()").
> > 
> > I don't want to ever see that broken construct again. And I want to
> > make sure that everybody is educated about how broken it was. I'm
> > extremely unhappy that it came up again.
> 
> Well, if it makes you feel better, that was control dependencies and this
> was data dependencies.  So it was not -exactly- the same.  ;-)
> 
> (Sorry, couldn't resist...)
> 
> > If it turns out that some architecture does actually need a barrier
> > between a read and a dependent write, then that will mean that
> > 
> >  (a) we'll have to make up a _new_ barrier, because
> > "smp_read_barrier_depends()" is not that barrier. We'll presumably
> > then have to make that new barrier part of "rcu_derefence()" and
> > friends.
> 
> Agreed.  We can worry about whether or not we replace the current
> smp_read_barrier_depends() with that new barrier when and if such
> hardware appears.
> 
> >  (b) we will have found an architecture with even worse memory
> > ordering semantics than alpha, and we'll have to stop castigating
> > alpha for being the worst memory ordering ever.
> 
> ;-) ;-) ;-)
> 
> > but I sincerely hope that we'll never find that kind of broken architecture.
> 
> Apparently at least some hardware vendors are reading memory-barriers.txt,
> so perhaps the odds of that kind of breakage have reduced.
> 
>   Thanx, Paul
> 


signature.asc
Description: PGP signature


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 03:45:23PM -0800, Linus Torvalds wrote:
> On Tue, Jan 26, 2016 at 3:29 PM, Paul E. McKenney
>  wrote:
> >
> > No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> > needed.  That said, I believe that we should encourage rcu_dereference*()
> > or lockless_dereference() instead of READ_ONCE() for documentation
> > reasons, though.
> 
> I agree that that is likely the right thing to do in pretty much all 
> situations.
> 
> In theory, there might be performance situations where we'd want to
> actively avoid the smp_read_barrier_depends() inherent in those, but
> considering that it's only a performance issue on alpha, and we
> probably have all of two or three people using Linux on alpha, it's a
> pretty theoretical performance worry.

Agreed!

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 3:29 PM, Paul E. McKenney
 wrote:
>
> No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> needed.  That said, I believe that we should encourage rcu_dereference*()
> or lockless_dereference() instead of READ_ONCE() for documentation
> reasons, though.

I agree that that is likely the right thing to do in pretty much all situations.

In theory, there might be performance situations where we'd want to
actively avoid the smp_read_barrier_depends() inherent in those, but
considering that it's only a performance issue on alpha, and we
probably have all of two or three people using Linux on alpha, it's a
pretty theoretical performance worry.

  Linus


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 12:10:10PM +, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 05:06:46PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 02:41:34PM +, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Jan 15, 2016 at 10:24:32AM +, Will Deacon wrote:
> > > > > > See my earlier reply [1] (but also, your WRC Linux example looks 
> > > > > > more
> > > > > > like a variant on WWC and I couldn't really follow it).
> > > > > 
> > > > > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > > > > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > > > > I am sure that it will seem more natural with time and experience...
> > > > 
> > > > Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> > > > last access from a store to a load to get WRC.  Plus the levels of
> > > > indirection definitely didn't match up, did they?
> > > 
> > > Nope, it was pretty baffling!
> > 
> > "It is a service that I provide."  ;-)
> > 
> > > > struct foo {
> > > > struct foo *next;
> > > > };
> > > > struct foo a;
> > > > struct foo b;
> > > > struct foo c = {  };
> > > > struct foo d = {  };
> > > > struct foo x = {  };
> > > > struct foo y = {  };
> > > > struct foo *r1, *r2, *r3;
> > > > 
> > > > void cpu0(void)
> > > > {
> > > > WRITE_ONCE(x.next, );
> > > > }
> > > > 
> > > > void cpu1(void)
> > > > {
> > > > r1 = lockless_dereference(x.next);
> > > > WRITE_ONCE(r1->next, );
> > > > }
> > > > 
> > > > void cpu2(void)
> > > > {
> > > > r2 = lockless_dereference(y.next);
> > > > r3 = READ_ONCE(r2->next);
> > > > }
> > > > 
> > > > In this case, it is legal to end the run with:
> > > > 
> > > > r1 ==  && r2 ==  && r3 == 
> > > > 
> > > > Please see below for a ppcmem litmus test.
> > > > 
> > > > So, did I get it right this time?  ;-)
> > > 
> > > The code above looks correct to me (in that it matches WRC+addrs),
> > > but your litmus test:
> > > 
> > > > PPC WRCnf+addrs
> > > > ""
> > > > {
> > > > 0:r2=x; 0:r3=y;
> > > > 1:r2=x; 1:r3=y;
> > > > 2:r2=x; 2:r3=y;
> > > > c=a; d=b; x=c; y=d;
> > > > }
> > > >  P0   | P1| P2;
> > > >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> > > >   | stw r2,0(r3)  | lwz r9,0(r8)  ;
> > > > exists
> > > > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> > > 
> > > Seems to be missing the address dependency on P1.
> > 
> > You are quite correct!  How about the following?
> 
> I think that's it!
> 
> > As before, both herd and ppcmem say that the cycle is allowed, as
> > expected, given non-transitive ordering.  To prohibit the cycle, P1
> > needs a suitable memory-barrier instruction.
> > 
> > 
> > 
> > PPC WRCnf+addrs
> > ""
> > {
> > 0:r2=x; 0:r3=y;
> > 1:r2=x; 1:r3=y;
> > 2:r2=x; 2:r3=y;
> > c=a; d=b; x=c; y=d;
> > }
> >  P0   | P1| P2;
> >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> >   | stw r2,0(r8)  | lwz r9,0(r8)  ;
> > exists
> > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> 
> Agreed.

OK, thank you!  Would you agree that it would be good to replace the
current xor-based fake-dependency litmus tests with tests having real
dependencies?

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:
> On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
>  wrote:
> >
> > You might as well just write it as
> >
> > struct foo x = READ_ONCE(*ptr);
> > x->bar = 5;
> >
> > because that "smp_read_barrier_depends()" does NOTHING wrt the second write.
> 
> Just to clarify: on alpha it adds a memory barrier, but that memory
> barrier is useless.

No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
needed.  That said, I believe that we should encourage rcu_dereference*()
or lockless_dereference() instead of READ_ONCE() for documentation
reasons, though.

> On non-alpha, it is a no-op, and obviously does nothing simply because
> it generates no code.
> 
> So if anybody believes that the "smp_read_barrier_depends()" does
> something, they are *wrong*.

The other problem with smp_read_barrier_depends() is that it is often
a pain figuring out which prior load it is supposed to apply to.
Hence my preference for rcu_dereference*() and lockless_dereference().

> And if anybody sends out an email with that smp_read_barrier_depends()
> in an example, they are actively just confusing other people, which is
> even worse than just being wrong. Which is why I jumped in.
> 
> So stop perpetuating the myth that smp_read_barrier_depends() does
> something here. It does not. It's a bug, and it has become this "mind
> virus" for some people that seem to believe that it does something.

It looks like I should add words to memory-barriers.txt de-emphasizing
smp_read_barrier_depends().  I will take a look at that.

> I had to remove this crap once from the kernel already, see commit
> 105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
> atomic*_read_ctrl()").
> 
> I don't want to ever see that broken construct again. And I want to
> make sure that everybody is educated about how broken it was. I'm
> extremely unhappy that it came up again.

Well, if it makes you feel better, that was control dependencies and this
was data dependencies.  So it was not -exactly- the same.  ;-)

(Sorry, couldn't resist...)

> If it turns out that some architecture does actually need a barrier
> between a read and a dependent write, then that will mean that
> 
>  (a) we'll have to make up a _new_ barrier, because
> "smp_read_barrier_depends()" is not that barrier. We'll presumably
> then have to make that new barrier part of "rcu_derefence()" and
> friends.

Agreed.  We can worry about whether or not we replace the current
smp_read_barrier_depends() with that new barrier when and if such
hardware appears.

>  (b) we will have found an architecture with even worse memory
> ordering semantics than alpha, and we'll have to stop castigating
> alpha for being the worst memory ordering ever.

;-) ;-) ;-)

> but I sincerely hope that we'll never find that kind of broken architecture.

Apparently at least some hardware vendors are reading memory-barriers.txt,
so perhaps the odds of that kind of breakage have reduced.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
 wrote:
>
> You might as well just write it as
>
> struct foo x = READ_ONCE(*ptr);
> x->bar = 5;
>
> because that "smp_read_barrier_depends()" does NOTHING wrt the second write.

Just to clarify: on alpha it adds a memory barrier, but that memory
barrier is useless.

On non-alpha, it is a no-op, and obviously does nothing simply because
it generates no code.

So if anybody believes that the "smp_read_barrier_depends()" does
something, they are *wrong*.

And if anybody sends out an email with that smp_read_barrier_depends()
in an example, they are actively just confusing other people, which is
even worse than just being wrong. Which is why I jumped in.

So stop perpetuating the myth that smp_read_barrier_depends() does
something here. It does not. It's a bug, and it has become this "mind
virus" for some people that seem to believe that it does something.

I had to remove this crap once from the kernel already, see commit
105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
atomic*_read_ctrl()").

I don't want to ever see that broken construct again. And I want to
make sure that everybody is educated about how broken it was. I'm
extremely unhappy that it came up again.

If it turns out that some architecture does actually need a barrier
between a read and a dependent write, then that will mean that

 (a) we'll have to make up a _new_ barrier, because
"smp_read_barrier_depends()" is not that barrier. We'll presumably
then have to make that new barrier part of "rcu_derefence()" and
friends.

 (b) we will have found an architecture with even worse memory
ordering semantics than alpha, and we'll have to stop castigating
alpha for being the worst memory ordering ever.

but I sincerely hope that we'll never find that kind of broken architecture.

   Linus


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 12:10 PM, Paul E. McKenney
 wrote:
> On Tue, Jan 26, 2016 at 11:44:46AM -0800, Linus Torvalds wrote:
>>
>> > struct foo *x = READ_ONCE(*ptr);
>> > smp_read_barrier_depends();
>> > x->bar = 5;
>>
>> This case is complete BS. Stop perpetuating it. I already removed a
>> number of bogus cases of it, and I removed the incorrect documentation
>> that had this crap.
>
> If I understand your objection correctly, you want the above pattern
> expressed either like this:
>
> struct foo *x = rcu_dereference(*ptr);
> x->bar = 5;
>
> Or like this:
>
> struct foo *x = lockless_dereference(*ptr);
> x->bar = 5;
>
> Or am I missing your point?

You are entirely missing the point.

You might as well just write it as

struct foo x = READ_ONCE(*ptr);
x->bar = 5;

because that "smp_read_barrier_depends()" does NOTHING wrt the second write.

So what I am saying is simple: anybody who writes that
"smp_read_barrier_depends()" in there is just ttoally and completely
WRONG, and the fact that Peter wrote it out after I removed several
instances of that bloody f*cking idiocy is disturbing.

Don't do it. It's BS. It's wrong. Don't make excuses for it.

 Linus


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Wed, Jan 27, 2016 at 12:52:07AM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Mon, Jan 18, 2016 at 07:46:29AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 18, 2016 at 04:19:29PM +0800, Herbert Xu wrote:
> > > Paul E. McKenney  wrote:
> > > >
> > > > You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > > > smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > > > The reason for this is that smp_read_barrier_depends() must order the
> > > > pointer load against any subsequent read or write through a dereference
> > > > of that pointer.  For example:
> > > > 
> > > >p = READ_ONCE(gp);
> > > >smp_rmb();
> > > >r1 = p->a; /* ordered by smp_rmb(). */
> > > >p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
> > > >r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
> > > > 
> > > > In contrast:
> > > > 
> > > >p = READ_ONCE(gp);
> > > >smp_read_barrier_depends();
> > > >r1 = p->a; /* ordered by smp_read_barrier_depends(). */
> > > >p->b = 42; /* ordered by smp_read_barrier_depends(). */
> > > >r2 = x; /* not ordered by smp_read_barrier_depends(), which is 
> > > > OK. */
> > > > 
> > > > Again, if your hardware maintains local ordering for address
> > > > and data dependencies, you can have read_barrier_depends() and
> > > > smp_read_barrier_depends() be no-ops like they are for most
> > > > architectures.
> > > > 
> > > > Does that help?
> > > 
> > > This is crazy! smp_rmb started out being strictly stronger than
> > > smp_read_barrier_depends, when did this stop being the case?
> > 
> > Hello, Herbert!
> > 
> > It is true that most Linux kernel code relies only on the read-read
> > properties of dependencies, but the read-write properties are useful.
> > Admittedly relatively rarely, but useful.
> > 
> > The better comparison for smp_read_barrier_depends(), especially in
> > its rcu_dereference*() form, is smp_load_acquire().
> 
> Confused..
> 
> I recall that last time you and Linus came into a conclusion that even
> on Alpha, a barrier for read->write with data dependency is unnecessary:
> 
> http://article.gmane.org/gmane.linux.kernel/2077661
> 
> And in an earlier mail of that thread, Linus made his point that
> smp_read_barrier_depends() should only be used to order read->read.

Those examples involved read-to-write with conditionals, as in:

if (READ_ONCE(a))
WRITE_ONCE(b, 1);

Without the "if", no ordering is guaranteed on weakly ordered CPUs.
(The volatile accesses keep ordering within the compiler for once...

> So right now, are we going to extend the semantics of
> smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
> still only work for read->read, and assume all the architectures won't
> reorder read->write with data dependency, so that the code above having
> a smp_rmb() also works?

The semantics of smp_read_barrier_depends() has been both read-to-write
and read-to-read for some time now, this patch just catches the
documentation up with reality.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 11:09:27AM +, Will Deacon wrote:
> On Tue, Jan 26, 2016 at 11:32:00AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 26, 2016 at 11:24:02AM +0100, Peter Zijlstra wrote:
> > 
> > > Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
> > > specification (I seem to keep repeating this).
> > 
> > Do we want this ?

Seems likely to me.  ;-)

> > ---
> >  Documentation/memory-barriers.txt | 17 +
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/Documentation/memory-barriers.txt 
> > b/Documentation/memory-barriers.txt
> > index a61be39c7b51..433326ebdc26 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1,3 +1,4 @@
> > +
> >  
> >  LINUX KERNEL MEMORY BARRIERS
> >  
> > @@ -5,6 +6,22 @@
> >  By: David Howells 
> >  Paul E. McKenney 
> >  
> > +==
> > +DISCLAIMER
> > +==
> > +
> > +This document is not a specification; it is intentionally (for the sake of
> > +brevity) and unintentionally (due to being human) incomplete. This 
> > document is
> > +meant as a guide to using the various memory barriers provided by Linux, 
> > but
> > +in case of any doubt (and there are many) please ask.
> 
> It might be worth adding you and me to the top of the file, to save Paul
> Cc'ing us on questions (get_maintainer.pl points at poor old Corbet for
> this file).
> 
> But yes, it seems that something like this is required.

So Peter, would you like to update your patch to include yourself
and Will as authors?

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 11:19:27AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> 
> > > > > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > > > > of two globally ordered sequences connected by a single local link.
> > > > 
> > > > The conclusion that I am slowly coming to is that litmus tests should
> > > > not be thought of as linear chains, but rather as cycles.  If you think
> > > > of it as a cycle, then it doesn't matter where the local link is, just
> > > > how many of them and how they are connected.
> > > 
> > > Do you have some examples of this? I'm struggling to make it work in my
> > > mind, or are you talking specifically in the context of the kernel
> > > memory model?
> > 
> > Now that you mention it, maybe it would be best to keep the transitive
> > and non-transitive separate for the time being anyway.  Just because it
> > might be possible to deal with does not necessarily mean that we should
> > be encouraging it.  ;-)
> 
> So isn't smp_mb__after_unlock_lock() exactly such a scenario? And would
> not someone trying to implement RCsc locks using locally transitive
> RELEASE/ACQUIRE operations need exactly this stuff?
> 
> That is, I am afraid we need to cover the mix of local and global
> transitive operations at least in overview.

True, but we haven't gotten to locking yet.  That said, I would argue
that smp_mb__after_unlock_lock() upgrades locks to transitive, and
thus would not be an exception to the "no combining transitive and
non-transitive steps in cycles" rule.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 11:24:02AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 02:20:46PM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 14, 2016 at 01:24:34PM -0800, Leonid Yegoshin wrote:
> > > On 01/14/2016 12:48 PM, Paul E. McKenney wrote:
> > > >
> > > >So SYNC_RMB is intended to implement smp_rmb(), correct?
> > > Yes.
> > > >
> > > >You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > > >smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > > 
> > > If smp_read_barrier_depends() is used to separate not only two reads
> > > but read pointer and WRITE basing on that pointer (example below) -
> > > yes. I just doesn't see any example of this in famous
> > > Documentation/memory-barriers.txt and had no chance to know what you
> > > use it in this way too.
> > 
> > Well, Documentation/memory-barriers.txt was intended as a guide for Linux
> > kernel hackers, and not for hardware architects.
> 
> Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
> specification (I seem to keep repeating this).
> 
> > 
> > 
> > commit 955720966e216b00613fcf60188d507c103f0e80
> > Author: Paul E. McKenney 
> > Date:   Thu Jan 14 14:17:04 2016 -0800
> > 
> > documentation: Subsequent writes ordered by rcu_dereference()
> > 
> > The current memory-barriers.txt does not address the possibility of
> > a write to a dereferenced pointer.  This should be rare, 
> 
> How are these rare? Isn't:
> 
>   rcu_read_lock()
>   obj = rcu_dereference(ptr);
>   if (!atomic_inc_not_zero(>ref))
>   obj = NULL;
>   rcu_read_unlock();
> 
> a _very_ common thing to do?

It is, but it provides its own barriers, so does not need to rely on
dependency ordering.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 11:44:46AM -0800, Linus Torvalds wrote:
> On Tue, Jan 26, 2016 at 9:22 AM, Peter Zijlstra  wrote:
> >
> > This is distinct from:
> 
> That may be distinct, but:
> 
> > struct foo *x = READ_ONCE(*ptr);
> > smp_read_barrier_depends();
> > x->bar = 5;
> 
> This case is complete BS. Stop perpetuating it. I already removed a
> number of bogus cases of it, and I removed the incorrect documentation
> that had this crap.

If I understand your objection correctly, you want the above pattern
expressed either like this:

struct foo *x = rcu_dereference(*ptr);
x->bar = 5;

Or like this:

struct foo *x = lockless_dereference(*ptr);
x->bar = 5;

Or am I missing your point?

> It's called "smp_READ_barrier_depends()" for a reason.
> 
> Alpha is the only one that needs it, and alpha needs it only for
> dependent READS.
> 
> It's not called smp_read_write_barrier_depends(). It's not called
> "smp_mb_depends()". It's a weaker form of "smp_rmb()", nothing else.
> 
> So alpha does have an implied dependency chain from a read to a
> subsequent dependent write, and does not need any extra barriers.
> 
> Alpha does *not* have a dependency chain from a read to a subsequent
> read, which is why we need that horrible crappy
> smp_read_barrier_depends(). But it's the only reason.
> 
> This is the alpha reference manual wrt read-to-write dependency:
> 
>   5.6.1.7 Definition of Dependence Constraint
> 
> The depends relation (DP) is defined as follows. Given u and v
> issued by processor Pi, where u
> is a read or an instruction fetch and v is a write, u precedes v
> in DP order (written u DP v, that
> is, v depends on u) in either of the following situations:
> 
>  • u determines the execution of v, the location accessed by v, or
> the value written by v.
>  • u determines the execution or address or value of another
> memory access z that precedes
> 
> v or might precede v (that is, would precede v in some execution
> path depending
> on the value read by u) by processor issue constraint (see Section 
> 5.6.1.3).
> 
> Note that the dependence barrier honors not only control flow, but
> address and data values too.  This is a different syntax than we use,
> but 'u' is the READ_ONCE, and 'v' is the write. Any data, address or
> conditional dependency between the two implies an ordering.
> 
> So no, "smp_read_barrier_depends()" is *ONLY* about two reads, where
> the second read is data-dependent on the first. Nothing else.
> 
> So if you _ever_ see a "smp_read_barrier_depends()" that isn't about a
> barrier between two reads, then that is a bug.

And the smp_read_barrier_depends() in both rcu_dereference() and
in lockless_dereference() is ordering the read-to-read case and the
underlying hardware is ordering the read-to-write case on weakly ordered
hardware.

Or, again, am I missing your point?

Thanx, Paul

> The above code is crap.  It's exactly as much crap as
> 
>a = READ_ONCE(x);
>smp_rmb();
>WRITE_ONCE(b, y);
> 
> because a "rmb()" simply doesn't have anything to do with
> read-vs-subsequent-write ordering.
> 
>  Linus
> 



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 12:16:09PM +, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > PPC Overlapping Group-B sets version 4
> > > > ""
> > > > (* When the Group-B sets from two different barriers involve 
> > > > instructions in
> > > >the same thread, within that thread one set must contain the other.
> > > > 
> > > > P0  P1  P2
> > > > Rx=1Wy=1Wz=2
> > > > dep.lwsync  lwsync
> > > > Ry=0Wz=1Wx=1
> > > > Rz=1
> > > > 
> > > > assert(!(z=2))
> > > > 
> > > >Forbidden by ppcmem, allowed by herd.
> > > > *)
> > > > {
> > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > }
> > > >  P0 | P1| P2;
> > > >  lwz r6,0(r1)   | stw r4,0(r2)  | stw r5,0(r3)  ;
> > > >  xor r7,r6,r6   | lwsync| lwsync;
> > > >  lwzx r7,r7,r2  | stw r4,0(r3)  | stw r4,0(r1)  ;
> > > >  lwz r8,0(r3)   |   |   ;
> > > > 
> > > > exists
> > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > 
> > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > also don't think the Rz=1 in P0 changes anything.
> > 
> > What about the smp_wmb() variant of dmb that orders only stores?
> 
> Tricky, but I think it still works out if the coherence order of z is as
> I described above. The line of reasoning is weird though -- I ended up
> considering the two cases where P0 reads z before and after it reads x
> and what that means for the read of y.

By "works out" you mean that ARM prohibits the outcome?

BTW, I never have seen a real-world use for this case.  At the moment
it is mostly a cautionary tale about memory-model corner cases and
tools.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 9:22 AM, Peter Zijlstra  wrote:
>
> This is distinct from:

That may be distinct, but:

> struct foo *x = READ_ONCE(*ptr);
> smp_read_barrier_depends();
> x->bar = 5;

This case is complete BS. Stop perpetuating it. I already removed a
number of bogus cases of it, and I removed the incorrect documentation
that had this crap.

It's called "smp_READ_barrier_depends()" for a reason.

Alpha is the only one that needs it, and alpha needs it only for
dependent READS.

It's not called smp_read_write_barrier_depends(). It's not called
"smp_mb_depends()". It's a weaker form of "smp_rmb()", nothing else.

So alpha does have an implied dependency chain from a read to a
subsequent dependent write, and does not need any extra barriers.

Alpha does *not* have a dependency chain from a read to a subsequent
read, which is why we need that horrible crappy
smp_read_barrier_depends(). But it's the only reason.

This is the alpha reference manual wrt read-to-write dependency:

  5.6.1.7 Definition of Dependence Constraint

The depends relation (DP) is defined as follows. Given u and v
issued by processor Pi, where u
is a read or an instruction fetch and v is a write, u precedes v
in DP order (written u DP v, that
is, v depends on u) in either of the following situations:

 • u determines the execution of v, the location accessed by v, or
the value written by v.
 • u determines the execution or address or value of another
memory access z that precedes

v or might precede v (that is, would precede v in some execution
path depending
on the value read by u) by processor issue constraint (see Section 5.6.1.3).

Note that the dependence barrier honors not only control flow, but
address and data values too.  This is a different syntax than we use,
but 'u' is the READ_ONCE, and 'v' is the write. Any data, address or
conditional dependency between the two implies an ordering.

So no, "smp_read_barrier_depends()" is *ONLY* about two reads, where
the second read is data-dependent on the first. Nothing else.

So if you _ever_ see a "smp_read_barrier_depends()" that isn't about a
barrier between two reads, then that is a bug.

The above code is crap.  It's exactly as much crap as

   a = READ_ONCE(x);
   smp_rmb();
   WRITE_ONCE(b, y);

because a "rmb()" simply doesn't have anything to do with
read-vs-subsequent-write ordering.

 Linus


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Wed, Jan 27, 2016 at 12:52:07AM +0800, Boqun Feng wrote:
> I recall that last time you and Linus came into a conclusion that even
> on Alpha, a barrier for read->write with data dependency is unnecessary:
> 
> http://article.gmane.org/gmane.linux.kernel/2077661
> 
> And in an earlier mail of that thread, Linus made his point that
> smp_read_barrier_depends() should only be used to order read->read.
> 
> So right now, are we going to extend the semantics of
> smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
> still only work for read->read, and assume all the architectures won't
> reorder read->write with data dependency, so that the code above having
> a smp_rmb() also works?

That discussions was about control dependencies. So writes that _depend_
on a prior read having an explicit value.

So something like:

struct foo *x = READ_ONCE(*ptr);
smp_read_barrier_depends()
if (x->val == 5)
x->bar = 5;

In that case, the load of x->val must be complete and its value
determined _before_ the store to x->bar can happen.

This is distinct from:

struct foo *x = READ_ONCE(*ptr);
smp_read_barrier_depends();
x->bar = 5;

And its the second case where smp_read_barrier_depends() read->write
order matters.


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Boqun Feng
Hi Paul,

On Mon, Jan 18, 2016 at 07:46:29AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 18, 2016 at 04:19:29PM +0800, Herbert Xu wrote:
> > Paul E. McKenney  wrote:
> > >
> > > You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > > smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > > The reason for this is that smp_read_barrier_depends() must order the
> > > pointer load against any subsequent read or write through a dereference
> > > of that pointer.  For example:
> > > 
> > >p = READ_ONCE(gp);
> > >smp_rmb();
> > >r1 = p->a; /* ordered by smp_rmb(). */
> > >p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
> > >r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
> > > 
> > > In contrast:
> > > 
> > >p = READ_ONCE(gp);
> > >smp_read_barrier_depends();
> > >r1 = p->a; /* ordered by smp_read_barrier_depends(). */
> > >p->b = 42; /* ordered by smp_read_barrier_depends(). */
> > >r2 = x; /* not ordered by smp_read_barrier_depends(), which is OK. 
> > > */
> > > 
> > > Again, if your hardware maintains local ordering for address
> > > and data dependencies, you can have read_barrier_depends() and
> > > smp_read_barrier_depends() be no-ops like they are for most
> > > architectures.
> > > 
> > > Does that help?
> > 
> > This is crazy! smp_rmb started out being strictly stronger than
> > smp_read_barrier_depends, when did this stop being the case?
> 
> Hello, Herbert!
> 
> It is true that most Linux kernel code relies only on the read-read
> properties of dependencies, but the read-write properties are useful.
> Admittedly relatively rarely, but useful.
> 
> The better comparison for smp_read_barrier_depends(), especially in
> its rcu_dereference*() form, is smp_load_acquire().
> 

Confused..

I recall that last time you and Linus came into a conclusion that even
on Alpha, a barrier for read->write with data dependency is unnecessary:

http://article.gmane.org/gmane.linux.kernel/2077661

And in an earlier mail of that thread, Linus made his point that
smp_read_barrier_depends() should only be used to order read->read.

So right now, are we going to extend the semantics of
smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
still only work for read->read, and assume all the architectures won't
reorder read->write with data dependency, so that the code above having
a smp_rmb() also works?

Regards,
Boqun


signature.asc
Description: PGP signature


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Boqun Feng
Hi Will,

On Tue, Jan 26, 2016 at 12:16:09PM +, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > PPC Overlapping Group-B sets version 4
> > > > ""
> > > > (* When the Group-B sets from two different barriers involve 
> > > > instructions in
> > > >the same thread, within that thread one set must contain the other.
> > > > 
> > > > P0  P1  P2
> > > > Rx=1Wy=1Wz=2
> > > > dep.lwsync  lwsync
> > > > Ry=0Wz=1Wx=1
> > > > Rz=1
> > > > 
> > > > assert(!(z=2))
> > > > 
> > > >Forbidden by ppcmem, allowed by herd.
> > > > *)
> > > > {
> > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > }
> > > >  P0 | P1| P2;
> > > >  lwz r6,0(r1)   | stw r4,0(r2)  | stw r5,0(r3)  ;
> > > >  xor r7,r6,r6   | lwsync| lwsync;
> > > >  lwzx r7,r7,r2  | stw r4,0(r3)  | stw r4,0(r1)  ;
> > > >  lwz r8,0(r3)   |   |   ;
> > > > 
> > > > exists
> > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > 
> > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > also don't think the Rz=1 in P0 changes anything.
> > 
> > What about the smp_wmb() variant of dmb that orders only stores?
> 
> Tricky, but I think it still works out if the coherence order of z is as
> I described above. The line of reasoning is weird though -- I ended up
> considering the two cases where P0 reads z before and after it reads x
 ^^^
Because of the fact that two reads on the same processors can't be
executed simultaneously? I feel like this is exactly something herd
missed.

> and what that means for the read of y.
> 

And the reasoning on PPC is similar, so looks like the read of z on P0
is a necessary condition for the exists clause to be forbidden.

Regards,
Boqun

> Will


signature.asc
Description: PGP signature


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Will Deacon
On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > PPC Overlapping Group-B sets version 4
> > > ""
> > > (* When the Group-B sets from two different barriers involve instructions 
> > > in
> > >the same thread, within that thread one set must contain the other.
> > > 
> > >   P0  P1  P2
> > >   Rx=1Wy=1Wz=2
> > >   dep.lwsync  lwsync
> > >   Ry=0Wz=1Wx=1
> > >   Rz=1
> > > 
> > >   assert(!(z=2))
> > > 
> > >Forbidden by ppcmem, allowed by herd.
> > > *)
> > > {
> > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > }
> > >  P0   | P1| P2;
> > >  lwz r6,0(r1) | stw r4,0(r2)  | stw r5,0(r3)  ;
> > >  xor r7,r6,r6 | lwsync| lwsync;
> > >  lwzx r7,r7,r2| stw r4,0(r3)  | stw r4,0(r1)  ;
> > >  lwz r8,0(r3) |   |   ;
> > > 
> > > exists
> > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > 
> > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > to constrain the coherence order of z to be {0->1->2}, then I think that
> > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > also don't think the Rz=1 in P0 changes anything.
> 
> What about the smp_wmb() variant of dmb that orders only stores?

Tricky, but I think it still works out if the coherence order of z is as
I described above. The line of reasoning is weird though -- I ended up
considering the two cases where P0 reads z before and after it reads x
and what that means for the read of y.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Will Deacon
On Mon, Jan 25, 2016 at 05:06:46PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 02:41:34PM +, Will Deacon wrote:
> > On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 10:24:32AM +, Will Deacon wrote:
> > > > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > > > like a variant on WWC and I couldn't really follow it).
> > > > 
> > > > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > > > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > > > I am sure that it will seem more natural with time and experience...
> > > 
> > > Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> > > last access from a store to a load to get WRC.  Plus the levels of
> > > indirection definitely didn't match up, did they?
> > 
> > Nope, it was pretty baffling!
> 
> "It is a service that I provide."  ;-)
> 
> > >   struct foo {
> > >   struct foo *next;
> > >   };
> > >   struct foo a;
> > >   struct foo b;
> > >   struct foo c = {  };
> > >   struct foo d = {  };
> > >   struct foo x = {  };
> > >   struct foo y = {  };
> > >   struct foo *r1, *r2, *r3;
> > > 
> > >   void cpu0(void)
> > >   {
> > >   WRITE_ONCE(x.next, );
> > >   }
> > > 
> > >   void cpu1(void)
> > >   {
> > >   r1 = lockless_dereference(x.next);
> > >   WRITE_ONCE(r1->next, );
> > >   }
> > > 
> > >   void cpu2(void)
> > >   {
> > >   r2 = lockless_dereference(y.next);
> > >   r3 = READ_ONCE(r2->next);
> > >   }
> > > 
> > > In this case, it is legal to end the run with:
> > > 
> > >   r1 ==  && r2 ==  && r3 == 
> > > 
> > > Please see below for a ppcmem litmus test.
> > > 
> > > So, did I get it right this time?  ;-)
> > 
> > The code above looks correct to me (in that it matches WRC+addrs),
> > but your litmus test:
> > 
> > > PPC WRCnf+addrs
> > > ""
> > > {
> > > 0:r2=x; 0:r3=y;
> > > 1:r2=x; 1:r3=y;
> > > 2:r2=x; 2:r3=y;
> > > c=a; d=b; x=c; y=d;
> > > }
> > >  P0   | P1| P2;
> > >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> > >   | stw r2,0(r3)  | lwz r9,0(r8)  ;
> > > exists
> > > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> > 
> > Seems to be missing the address dependency on P1.
> 
> You are quite correct!  How about the following?

I think that's it!

> As before, both herd and ppcmem say that the cycle is allowed, as
> expected, given non-transitive ordering.  To prohibit the cycle, P1
> needs a suitable memory-barrier instruction.
> 
> 
> 
> PPC WRCnf+addrs
> ""
> {
> 0:r2=x; 0:r3=y;
> 1:r2=x; 1:r3=y;
> 2:r2=x; 2:r3=y;
> c=a; d=b; x=c; y=d;
> }
>  P0   | P1| P2;
>  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
>   | stw r2,0(r8)  | lwz r9,0(r8)  ;
> exists
> (1:r8=y /\ 2:r8=x /\ 2:r9=c)

Agreed.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Will Deacon
On Tue, Jan 26, 2016 at 11:32:00AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 26, 2016 at 11:24:02AM +0100, Peter Zijlstra wrote:
> 
> > Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
> > specification (I seem to keep repeating this).
> 
> Do we want this ?
> 
> ---
>  Documentation/memory-barriers.txt | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index a61be39c7b51..433326ebdc26 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1,3 +1,4 @@
> +
>
>LINUX KERNEL MEMORY BARRIERS
>
> @@ -5,6 +6,22 @@
>  By: David Howells 
>  Paul E. McKenney 
>  
> +==
> +DISCLAIMER
> +==
> +
> +This document is not a specification; it is intentionally (for the sake of
> +brevity) and unintentionally (due to being human) incomplete. This document 
> is
> +meant as a guide to using the various memory barriers provided by Linux, but
> +in case of any doubt (and there are many) please ask.

It might be worth adding you and me to the top of the file, to save Paul
Cc'ing us on questions (get_maintainer.pl points at poor old Corbet for
this file).

But yes, it seems that something like this is required.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Tue, Jan 26, 2016 at 11:24:02AM +0100, Peter Zijlstra wrote:

> Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
> specification (I seem to keep repeating this).

Do we want this ?

---
 Documentation/memory-barriers.txt | 17 +
 1 file changed, 17 insertions(+)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index a61be39c7b51..433326ebdc26 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1,3 +1,4 @@
+
 
 LINUX KERNEL MEMORY BARRIERS
 
@@ -5,6 +6,22 @@
 By: David Howells 
 Paul E. McKenney 
 
+==
+DISCLAIMER
+==
+
+This document is not a specification; it is intentionally (for the sake of
+brevity) and unintentionally (due to being human) incomplete. This document is
+meant as a guide to using the various memory barriers provided by Linux, but
+in case of any doubt (and there are many) please ask.
+
+I repeat, this document is not a specification of what Linux expects from
+hardware.
+
+=
+INDEX
+=
+
 Contents:
 
  (*) Abstract memory access model.


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Thu, Jan 14, 2016 at 02:20:46PM -0800, Paul E. McKenney wrote:
> On Thu, Jan 14, 2016 at 01:24:34PM -0800, Leonid Yegoshin wrote:
> > On 01/14/2016 12:48 PM, Paul E. McKenney wrote:
> > >
> > >So SYNC_RMB is intended to implement smp_rmb(), correct?
> > Yes.
> > >
> > >You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > >smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > 
> > If smp_read_barrier_depends() is used to separate not only two reads
> > but read pointer and WRITE basing on that pointer (example below) -
> > yes. I just doesn't see any example of this in famous
> > Documentation/memory-barriers.txt and had no chance to know what you
> > use it in this way too.
> 
> Well, Documentation/memory-barriers.txt was intended as a guide for Linux
> kernel hackers, and not for hardware architects.

Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
specification (I seem to keep repeating this).

> 
> 
> commit 955720966e216b00613fcf60188d507c103f0e80
> Author: Paul E. McKenney 
> Date:   Thu Jan 14 14:17:04 2016 -0800
> 
> documentation: Subsequent writes ordered by rcu_dereference()
> 
> The current memory-barriers.txt does not address the possibility of
> a write to a dereferenced pointer.  This should be rare, 

How are these rare? Isn't:

rcu_read_lock()
obj = rcu_dereference(ptr);
if (!atomic_inc_not_zero(>ref))
obj = NULL;
rcu_read_unlock();

a _very_ common thing to do?


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:

> > > > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > > > of two globally ordered sequences connected by a single local link.
> > > 
> > > The conclusion that I am slowly coming to is that litmus tests should
> > > not be thought of as linear chains, but rather as cycles.  If you think
> > > of it as a cycle, then it doesn't matter where the local link is, just
> > > how many of them and how they are connected.
> > 
> > Do you have some examples of this? I'm struggling to make it work in my
> > mind, or are you talking specifically in the context of the kernel
> > memory model?
> 
> Now that you mention it, maybe it would be best to keep the transitive
> and non-transitive separate for the time being anyway.  Just because it
> might be possible to deal with does not necessarily mean that we should
> be encouraging it.  ;-)

So isn't smp_mb__after_unlock_lock() exactly such a scenario? And would
not someone trying to implement RCsc locks using locally transitive
RELEASE/ACQUIRE operations need exactly this stuff?

That is, I am afraid we need to cover the mix of local and global
transitive operations at least in overview.


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Mon, Jan 25, 2016 at 10:12:11PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 06:02:34PM +, Will Deacon wrote:

> > Thanks for having a go at this. I tried defining something axiomatically,
> > but got stuck pretty quickly. In my scheme, I used "data-directed
> > transitivity" instead of "local transitivity", since the latter seems to
> > be a bit of a misnomer.
> 
> I figured that "local" meant local to the CPUs participating in the
> release-acquire chain.  As opposed to smp_mb() chains where the ordering
> is "global" as in visible to all CPUs, whether on the chain or not.
> Does that help?

That is in fact how I read and understood it.




Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Will Deacon
On Tue, Jan 26, 2016 at 11:32:00AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 26, 2016 at 11:24:02AM +0100, Peter Zijlstra wrote:
> 
> > Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
> > specification (I seem to keep repeating this).
> 
> Do we want this ?
> 
> ---
>  Documentation/memory-barriers.txt | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index a61be39c7b51..433326ebdc26 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1,3 +1,4 @@
> +
>
>LINUX KERNEL MEMORY BARRIERS
>
> @@ -5,6 +6,22 @@
>  By: David Howells 
>  Paul E. McKenney 
>  
> +==
> +DISCLAIMER
> +==
> +
> +This document is not a specification; it is intentionally (for the sake of
> +brevity) and unintentionally (due to being human) incomplete. This document 
> is
> +meant as a guide to using the various memory barriers provided by Linux, but
> +in case of any doubt (and there are many) please ask.

It might be worth adding you and me to the top of the file, to save Paul
Cc'ing us on questions (get_maintainer.pl points at poor old Corbet for
this file).

But yes, it seems that something like this is required.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Will Deacon
On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > PPC Overlapping Group-B sets version 4
> > > ""
> > > (* When the Group-B sets from two different barriers involve instructions 
> > > in
> > >the same thread, within that thread one set must contain the other.
> > > 
> > >   P0  P1  P2
> > >   Rx=1Wy=1Wz=2
> > >   dep.lwsync  lwsync
> > >   Ry=0Wz=1Wx=1
> > >   Rz=1
> > > 
> > >   assert(!(z=2))
> > > 
> > >Forbidden by ppcmem, allowed by herd.
> > > *)
> > > {
> > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > }
> > >  P0   | P1| P2;
> > >  lwz r6,0(r1) | stw r4,0(r2)  | stw r5,0(r3)  ;
> > >  xor r7,r6,r6 | lwsync| lwsync;
> > >  lwzx r7,r7,r2| stw r4,0(r3)  | stw r4,0(r1)  ;
> > >  lwz r8,0(r3) |   |   ;
> > > 
> > > exists
> > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > 
> > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > to constrain the coherence order of z to be {0->1->2}, then I think that
> > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > also don't think the Rz=1 in P0 changes anything.
> 
> What about the smp_wmb() variant of dmb that orders only stores?

Tricky, but I think it still works out if the coherence order of z is as
I described above. The line of reasoning is weird though -- I ended up
considering the two cases where P0 reads z before and after it reads x
and what that means for the read of y.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:

> > > > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > > > of two globally ordered sequences connected by a single local link.
> > > 
> > > The conclusion that I am slowly coming to is that litmus tests should
> > > not be thought of as linear chains, but rather as cycles.  If you think
> > > of it as a cycle, then it doesn't matter where the local link is, just
> > > how many of them and how they are connected.
> > 
> > Do you have some examples of this? I'm struggling to make it work in my
> > mind, or are you talking specifically in the context of the kernel
> > memory model?
> 
> Now that you mention it, maybe it would be best to keep the transitive
> and non-transitive separate for the time being anyway.  Just because it
> might be possible to deal with does not necessarily mean that we should
> be encouraging it.  ;-)

So isn't smp_mb__after_unlock_lock() exactly such a scenario? And would
not someone trying to implement RCsc locks using locally transitive
RELEASE/ACQUIRE operations need exactly this stuff?

That is, I am afraid we need to cover the mix of local and global
transitive operations at least in overview.


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Will Deacon
On Mon, Jan 25, 2016 at 05:06:46PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 02:41:34PM +, Will Deacon wrote:
> > On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 10:24:32AM +, Will Deacon wrote:
> > > > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > > > like a variant on WWC and I couldn't really follow it).
> > > > 
> > > > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > > > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > > > I am sure that it will seem more natural with time and experience...
> > > 
> > > Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> > > last access from a store to a load to get WRC.  Plus the levels of
> > > indirection definitely didn't match up, did they?
> > 
> > Nope, it was pretty baffling!
> 
> "It is a service that I provide."  ;-)
> 
> > >   struct foo {
> > >   struct foo *next;
> > >   };
> > >   struct foo a;
> > >   struct foo b;
> > >   struct foo c = {  };
> > >   struct foo d = {  };
> > >   struct foo x = {  };
> > >   struct foo y = {  };
> > >   struct foo *r1, *r2, *r3;
> > > 
> > >   void cpu0(void)
> > >   {
> > >   WRITE_ONCE(x.next, );
> > >   }
> > > 
> > >   void cpu1(void)
> > >   {
> > >   r1 = lockless_dereference(x.next);
> > >   WRITE_ONCE(r1->next, );
> > >   }
> > > 
> > >   void cpu2(void)
> > >   {
> > >   r2 = lockless_dereference(y.next);
> > >   r3 = READ_ONCE(r2->next);
> > >   }
> > > 
> > > In this case, it is legal to end the run with:
> > > 
> > >   r1 ==  && r2 ==  && r3 == 
> > > 
> > > Please see below for a ppcmem litmus test.
> > > 
> > > So, did I get it right this time?  ;-)
> > 
> > The code above looks correct to me (in that it matches WRC+addrs),
> > but your litmus test:
> > 
> > > PPC WRCnf+addrs
> > > ""
> > > {
> > > 0:r2=x; 0:r3=y;
> > > 1:r2=x; 1:r3=y;
> > > 2:r2=x; 2:r3=y;
> > > c=a; d=b; x=c; y=d;
> > > }
> > >  P0   | P1| P2;
> > >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> > >   | stw r2,0(r3)  | lwz r9,0(r8)  ;
> > > exists
> > > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> > 
> > Seems to be missing the address dependency on P1.
> 
> You are quite correct!  How about the following?

I think that's it!

> As before, both herd and ppcmem say that the cycle is allowed, as
> expected, given non-transitive ordering.  To prohibit the cycle, P1
> needs a suitable memory-barrier instruction.
> 
> 
> 
> PPC WRCnf+addrs
> ""
> {
> 0:r2=x; 0:r3=y;
> 1:r2=x; 1:r3=y;
> 2:r2=x; 2:r3=y;
> c=a; d=b; x=c; y=d;
> }
>  P0   | P1| P2;
>  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
>   | stw r2,0(r8)  | lwz r9,0(r8)  ;
> exists
> (1:r8=y /\ 2:r8=x /\ 2:r9=c)

Agreed.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Thu, Jan 14, 2016 at 02:20:46PM -0800, Paul E. McKenney wrote:
> On Thu, Jan 14, 2016 at 01:24:34PM -0800, Leonid Yegoshin wrote:
> > On 01/14/2016 12:48 PM, Paul E. McKenney wrote:
> > >
> > >So SYNC_RMB is intended to implement smp_rmb(), correct?
> > Yes.
> > >
> > >You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > >smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > 
> > If smp_read_barrier_depends() is used to separate not only two reads
> > but read pointer and WRITE basing on that pointer (example below) -
> > yes. I just doesn't see any example of this in famous
> > Documentation/memory-barriers.txt and had no chance to know what you
> > use it in this way too.
> 
> Well, Documentation/memory-barriers.txt was intended as a guide for Linux
> kernel hackers, and not for hardware architects.

Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
specification (I seem to keep repeating this).

> 
> 
> commit 955720966e216b00613fcf60188d507c103f0e80
> Author: Paul E. McKenney 
> Date:   Thu Jan 14 14:17:04 2016 -0800
> 
> documentation: Subsequent writes ordered by rcu_dereference()
> 
> The current memory-barriers.txt does not address the possibility of
> a write to a dereferenced pointer.  This should be rare, 

How are these rare? Isn't:

rcu_read_lock()
obj = rcu_dereference(ptr);
if (!atomic_inc_not_zero(>ref))
obj = NULL;
rcu_read_unlock();

a _very_ common thing to do?


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Tue, Jan 26, 2016 at 11:24:02AM +0100, Peter Zijlstra wrote:

> Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
> specification (I seem to keep repeating this).

Do we want this ?

---
 Documentation/memory-barriers.txt | 17 +
 1 file changed, 17 insertions(+)

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index a61be39c7b51..433326ebdc26 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1,3 +1,4 @@
+
 
 LINUX KERNEL MEMORY BARRIERS
 
@@ -5,6 +6,22 @@
 By: David Howells 
 Paul E. McKenney 
 
+==
+DISCLAIMER
+==
+
+This document is not a specification; it is intentionally (for the sake of
+brevity) and unintentionally (due to being human) incomplete. This document is
+meant as a guide to using the various memory barriers provided by Linux, but
+in case of any doubt (and there are many) please ask.
+
+I repeat, this document is not a specification of what Linux expects from
+hardware.
+
+=
+INDEX
+=
+
 Contents:
 
  (*) Abstract memory access model.


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:

> If it turns out that some architecture does actually need a barrier
> between a read and a dependent write, then that will mean that
> 
>  (a) we'll have to make up a _new_ barrier, because
> "smp_read_barrier_depends()" is not that barrier. We'll presumably
> then have to make that new barrier part of "rcu_derefence()" and
> friends.
> 
>  (b) we will have found an architecture with even worse memory
> ordering semantics than alpha, and we'll have to stop castigating
> alpha for being the worst memory ordering ever.
> 
> but I sincerely hope that we'll never find that kind of broken architecture.

So for a moment it looked like MIPS wanted to equal or surpass Alpha in
this respect.

And Paul made the point that smp_read_barrier_depends() really should
be smp_aquire_barrier_depends() in that we rely on both dependent reads
and writes to be ordered against the initial pointer load.

Now, as you've made abundantly clear, Alpha does this, although it needs
the little extra help in the dependent read department.

The 'problem' is that someone seemed to have used our
Documentation/memory-barriers.txt as a specification for what hardware
is permitted and we require. And in that light Paul noted that
read_barrier_depends really should be considered an
acquire_barrier_depends and order both dependent reads and writes
against the (prior) read (if nothing else already does).

Now clearly, any sane architecture doesn't need anything like this, but
again our document doesn't seem to judge. That is, from reading the
document one can get the impression is a perfectly fine thing to do.
Nowhere does our disdain for this thing show.



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Boqun Feng
Hi Will,

On Tue, Jan 26, 2016 at 12:16:09PM +, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > PPC Overlapping Group-B sets version 4
> > > > ""
> > > > (* When the Group-B sets from two different barriers involve 
> > > > instructions in
> > > >the same thread, within that thread one set must contain the other.
> > > > 
> > > > P0  P1  P2
> > > > Rx=1Wy=1Wz=2
> > > > dep.lwsync  lwsync
> > > > Ry=0Wz=1Wx=1
> > > > Rz=1
> > > > 
> > > > assert(!(z=2))
> > > > 
> > > >Forbidden by ppcmem, allowed by herd.
> > > > *)
> > > > {
> > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > }
> > > >  P0 | P1| P2;
> > > >  lwz r6,0(r1)   | stw r4,0(r2)  | stw r5,0(r3)  ;
> > > >  xor r7,r6,r6   | lwsync| lwsync;
> > > >  lwzx r7,r7,r2  | stw r4,0(r3)  | stw r4,0(r1)  ;
> > > >  lwz r8,0(r3)   |   |   ;
> > > > 
> > > > exists
> > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > 
> > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > also don't think the Rz=1 in P0 changes anything.
> > 
> > What about the smp_wmb() variant of dmb that orders only stores?
> 
> Tricky, but I think it still works out if the coherence order of z is as
> I described above. The line of reasoning is weird though -- I ended up
> considering the two cases where P0 reads z before and after it reads x
 ^^^
Because of the fact that two reads on the same processors can't be
executed simultaneously? I feel like this is exactly something herd
missed.

> and what that means for the read of y.
> 

And the reasoning on PPC is similar, so looks like the read of z on P0
is a necessary condition for the exists clause to be forbidden.

Regards,
Boqun

> Will


signature.asc
Description: PGP signature


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Boqun Feng
Hi Paul,

On Mon, Jan 18, 2016 at 07:46:29AM -0800, Paul E. McKenney wrote:
> On Mon, Jan 18, 2016 at 04:19:29PM +0800, Herbert Xu wrote:
> > Paul E. McKenney  wrote:
> > >
> > > You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > > smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > > The reason for this is that smp_read_barrier_depends() must order the
> > > pointer load against any subsequent read or write through a dereference
> > > of that pointer.  For example:
> > > 
> > >p = READ_ONCE(gp);
> > >smp_rmb();
> > >r1 = p->a; /* ordered by smp_rmb(). */
> > >p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
> > >r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
> > > 
> > > In contrast:
> > > 
> > >p = READ_ONCE(gp);
> > >smp_read_barrier_depends();
> > >r1 = p->a; /* ordered by smp_read_barrier_depends(). */
> > >p->b = 42; /* ordered by smp_read_barrier_depends(). */
> > >r2 = x; /* not ordered by smp_read_barrier_depends(), which is OK. 
> > > */
> > > 
> > > Again, if your hardware maintains local ordering for address
> > > and data dependencies, you can have read_barrier_depends() and
> > > smp_read_barrier_depends() be no-ops like they are for most
> > > architectures.
> > > 
> > > Does that help?
> > 
> > This is crazy! smp_rmb started out being strictly stronger than
> > smp_read_barrier_depends, when did this stop being the case?
> 
> Hello, Herbert!
> 
> It is true that most Linux kernel code relies only on the read-read
> properties of dependencies, but the read-write properties are useful.
> Admittedly relatively rarely, but useful.
> 
> The better comparison for smp_read_barrier_depends(), especially in
> its rcu_dereference*() form, is smp_load_acquire().
> 

Confused..

I recall that last time you and Linus came into a conclusion that even
on Alpha, a barrier for read->write with data dependency is unnecessary:

http://article.gmane.org/gmane.linux.kernel/2077661

And in an earlier mail of that thread, Linus made his point that
smp_read_barrier_depends() should only be used to order read->read.

So right now, are we going to extend the semantics of
smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
still only work for read->read, and assume all the architectures won't
reorder read->write with data dependency, so that the code above having
a smp_rmb() also works?

Regards,
Boqun


signature.asc
Description: PGP signature


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Wed, Jan 27, 2016 at 12:52:07AM +0800, Boqun Feng wrote:
> I recall that last time you and Linus came into a conclusion that even
> on Alpha, a barrier for read->write with data dependency is unnecessary:
> 
> http://article.gmane.org/gmane.linux.kernel/2077661
> 
> And in an earlier mail of that thread, Linus made his point that
> smp_read_barrier_depends() should only be used to order read->read.
> 
> So right now, are we going to extend the semantics of
> smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
> still only work for read->read, and assume all the architectures won't
> reorder read->write with data dependency, so that the code above having
> a smp_rmb() also works?

That discussions was about control dependencies. So writes that _depend_
on a prior read having an explicit value.

So something like:

struct foo *x = READ_ONCE(*ptr);
smp_read_barrier_depends()
if (x->val == 5)
x->bar = 5;

In that case, the load of x->val must be complete and its value
determined _before_ the store to x->bar can happen.

This is distinct from:

struct foo *x = READ_ONCE(*ptr);
smp_read_barrier_depends();
x->bar = 5;

And its the second case where smp_read_barrier_depends() read->write
order matters.


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 9:22 AM, Peter Zijlstra  wrote:
>
> This is distinct from:

That may be distinct, but:

> struct foo *x = READ_ONCE(*ptr);
> smp_read_barrier_depends();
> x->bar = 5;

This case is complete BS. Stop perpetuating it. I already removed a
number of bogus cases of it, and I removed the incorrect documentation
that had this crap.

It's called "smp_READ_barrier_depends()" for a reason.

Alpha is the only one that needs it, and alpha needs it only for
dependent READS.

It's not called smp_read_write_barrier_depends(). It's not called
"smp_mb_depends()". It's a weaker form of "smp_rmb()", nothing else.

So alpha does have an implied dependency chain from a read to a
subsequent dependent write, and does not need any extra barriers.

Alpha does *not* have a dependency chain from a read to a subsequent
read, which is why we need that horrible crappy
smp_read_barrier_depends(). But it's the only reason.

This is the alpha reference manual wrt read-to-write dependency:

  5.6.1.7 Definition of Dependence Constraint

The depends relation (DP) is defined as follows. Given u and v
issued by processor Pi, where u
is a read or an instruction fetch and v is a write, u precedes v
in DP order (written u DP v, that
is, v depends on u) in either of the following situations:

 • u determines the execution of v, the location accessed by v, or
the value written by v.
 • u determines the execution or address or value of another
memory access z that precedes

v or might precede v (that is, would precede v in some execution
path depending
on the value read by u) by processor issue constraint (see Section 5.6.1.3).

Note that the dependence barrier honors not only control flow, but
address and data values too.  This is a different syntax than we use,
but 'u' is the READ_ONCE, and 'v' is the write. Any data, address or
conditional dependency between the two implies an ordering.

So no, "smp_read_barrier_depends()" is *ONLY* about two reads, where
the second read is data-dependent on the first. Nothing else.

So if you _ever_ see a "smp_read_barrier_depends()" that isn't about a
barrier between two reads, then that is a bug.

The above code is crap.  It's exactly as much crap as

   a = READ_ONCE(x);
   smp_rmb();
   WRITE_ONCE(b, y);

because a "rmb()" simply doesn't have anything to do with
read-vs-subsequent-write ordering.

 Linus


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 12:16:09PM +, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > PPC Overlapping Group-B sets version 4
> > > > ""
> > > > (* When the Group-B sets from two different barriers involve 
> > > > instructions in
> > > >the same thread, within that thread one set must contain the other.
> > > > 
> > > > P0  P1  P2
> > > > Rx=1Wy=1Wz=2
> > > > dep.lwsync  lwsync
> > > > Ry=0Wz=1Wx=1
> > > > Rz=1
> > > > 
> > > > assert(!(z=2))
> > > > 
> > > >Forbidden by ppcmem, allowed by herd.
> > > > *)
> > > > {
> > > > 0:r1=x; 0:r2=y; 0:r3=z;
> > > > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > > > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > > > }
> > > >  P0 | P1| P2;
> > > >  lwz r6,0(r1)   | stw r4,0(r2)  | stw r5,0(r3)  ;
> > > >  xor r7,r6,r6   | lwsync| lwsync;
> > > >  lwzx r7,r7,r2  | stw r4,0(r3)  | stw r4,0(r1)  ;
> > > >  lwz r8,0(r3)   |   |   ;
> > > > 
> > > > exists
> > > > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> > > 
> > > That really hurts. Assuming that the "assert(!(z=2))" is actually there
> > > to constrain the coherence order of z to be {0->1->2}, then I think that
> > > this test is forbidden on arm using dmb instead of lwsync. That said, I
> > > also don't think the Rz=1 in P0 changes anything.
> > 
> > What about the smp_wmb() variant of dmb that orders only stores?
> 
> Tricky, but I think it still works out if the coherence order of z is as
> I described above. The line of reasoning is weird though -- I ended up
> considering the two cases where P0 reads z before and after it reads x
> and what that means for the read of y.

By "works out" you mean that ARM prohibits the outcome?

BTW, I never have seen a real-world use for this case.  At the moment
it is mostly a cautionary tale about memory-model corner cases and
tools.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 11:44:46AM -0800, Linus Torvalds wrote:
> On Tue, Jan 26, 2016 at 9:22 AM, Peter Zijlstra  wrote:
> >
> > This is distinct from:
> 
> That may be distinct, but:
> 
> > struct foo *x = READ_ONCE(*ptr);
> > smp_read_barrier_depends();
> > x->bar = 5;
> 
> This case is complete BS. Stop perpetuating it. I already removed a
> number of bogus cases of it, and I removed the incorrect documentation
> that had this crap.

If I understand your objection correctly, you want the above pattern
expressed either like this:

struct foo *x = rcu_dereference(*ptr);
x->bar = 5;

Or like this:

struct foo *x = lockless_dereference(*ptr);
x->bar = 5;

Or am I missing your point?

> It's called "smp_READ_barrier_depends()" for a reason.
> 
> Alpha is the only one that needs it, and alpha needs it only for
> dependent READS.
> 
> It's not called smp_read_write_barrier_depends(). It's not called
> "smp_mb_depends()". It's a weaker form of "smp_rmb()", nothing else.
> 
> So alpha does have an implied dependency chain from a read to a
> subsequent dependent write, and does not need any extra barriers.
> 
> Alpha does *not* have a dependency chain from a read to a subsequent
> read, which is why we need that horrible crappy
> smp_read_barrier_depends(). But it's the only reason.
> 
> This is the alpha reference manual wrt read-to-write dependency:
> 
>   5.6.1.7 Definition of Dependence Constraint
> 
> The depends relation (DP) is defined as follows. Given u and v
> issued by processor Pi, where u
> is a read or an instruction fetch and v is a write, u precedes v
> in DP order (written u DP v, that
> is, v depends on u) in either of the following situations:
> 
>  • u determines the execution of v, the location accessed by v, or
> the value written by v.
>  • u determines the execution or address or value of another
> memory access z that precedes
> 
> v or might precede v (that is, would precede v in some execution
> path depending
> on the value read by u) by processor issue constraint (see Section 
> 5.6.1.3).
> 
> Note that the dependence barrier honors not only control flow, but
> address and data values too.  This is a different syntax than we use,
> but 'u' is the READ_ONCE, and 'v' is the write. Any data, address or
> conditional dependency between the two implies an ordering.
> 
> So no, "smp_read_barrier_depends()" is *ONLY* about two reads, where
> the second read is data-dependent on the first. Nothing else.
> 
> So if you _ever_ see a "smp_read_barrier_depends()" that isn't about a
> barrier between two reads, then that is a bug.

And the smp_read_barrier_depends() in both rcu_dereference() and
in lockless_dereference() is ordering the read-to-read case and the
underlying hardware is ordering the read-to-write case on weakly ordered
hardware.

Or, again, am I missing your point?

Thanx, Paul

> The above code is crap.  It's exactly as much crap as
> 
>a = READ_ONCE(x);
>smp_rmb();
>WRITE_ONCE(b, y);
> 
> because a "rmb()" simply doesn't have anything to do with
> read-vs-subsequent-write ordering.
> 
>  Linus
> 



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 11:19:27AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 25, 2016 at 10:03:22PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> 
> > > > > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > > > > of two globally ordered sequences connected by a single local link.
> > > > 
> > > > The conclusion that I am slowly coming to is that litmus tests should
> > > > not be thought of as linear chains, but rather as cycles.  If you think
> > > > of it as a cycle, then it doesn't matter where the local link is, just
> > > > how many of them and how they are connected.
> > > 
> > > Do you have some examples of this? I'm struggling to make it work in my
> > > mind, or are you talking specifically in the context of the kernel
> > > memory model?
> > 
> > Now that you mention it, maybe it would be best to keep the transitive
> > and non-transitive separate for the time being anyway.  Just because it
> > might be possible to deal with does not necessarily mean that we should
> > be encouraging it.  ;-)
> 
> So isn't smp_mb__after_unlock_lock() exactly such a scenario? And would
> not someone trying to implement RCsc locks using locally transitive
> RELEASE/ACQUIRE operations need exactly this stuff?
> 
> That is, I am afraid we need to cover the mix of local and global
> transitive operations at least in overview.

True, but we haven't gotten to locking yet.  That said, I would argue
that smp_mb__after_unlock_lock() upgrades locks to transitive, and
thus would not be an exception to the "no combining transitive and
non-transitive steps in cycles" rule.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 11:24:02AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 14, 2016 at 02:20:46PM -0800, Paul E. McKenney wrote:
> > On Thu, Jan 14, 2016 at 01:24:34PM -0800, Leonid Yegoshin wrote:
> > > On 01/14/2016 12:48 PM, Paul E. McKenney wrote:
> > > >
> > > >So SYNC_RMB is intended to implement smp_rmb(), correct?
> > > Yes.
> > > >
> > > >You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > > >smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > > 
> > > If smp_read_barrier_depends() is used to separate not only two reads
> > > but read pointer and WRITE basing on that pointer (example below) -
> > > yes. I just doesn't see any example of this in famous
> > > Documentation/memory-barriers.txt and had no chance to know what you
> > > use it in this way too.
> > 
> > Well, Documentation/memory-barriers.txt was intended as a guide for Linux
> > kernel hackers, and not for hardware architects.
> 
> Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
> specification (I seem to keep repeating this).
> 
> > 
> > 
> > commit 955720966e216b00613fcf60188d507c103f0e80
> > Author: Paul E. McKenney 
> > Date:   Thu Jan 14 14:17:04 2016 -0800
> > 
> > documentation: Subsequent writes ordered by rcu_dereference()
> > 
> > The current memory-barriers.txt does not address the possibility of
> > a write to a dereferenced pointer.  This should be rare, 
> 
> How are these rare? Isn't:
> 
>   rcu_read_lock()
>   obj = rcu_dereference(ptr);
>   if (!atomic_inc_not_zero(>ref))
>   obj = NULL;
>   rcu_read_unlock();
> 
> a _very_ common thing to do?

It is, but it provides its own barriers, so does not need to rely on
dependency ordering.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 11:09:27AM +, Will Deacon wrote:
> On Tue, Jan 26, 2016 at 11:32:00AM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 26, 2016 at 11:24:02AM +0100, Peter Zijlstra wrote:
> > 
> > > Yeah, this goes under the header: memory-barriers.txt is _NOT_ a
> > > specification (I seem to keep repeating this).
> > 
> > Do we want this ?

Seems likely to me.  ;-)

> > ---
> >  Documentation/memory-barriers.txt | 17 +
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/Documentation/memory-barriers.txt 
> > b/Documentation/memory-barriers.txt
> > index a61be39c7b51..433326ebdc26 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1,3 +1,4 @@
> > +
> >  
> >  LINUX KERNEL MEMORY BARRIERS
> >  
> > @@ -5,6 +6,22 @@
> >  By: David Howells 
> >  Paul E. McKenney 
> >  
> > +==
> > +DISCLAIMER
> > +==
> > +
> > +This document is not a specification; it is intentionally (for the sake of
> > +brevity) and unintentionally (due to being human) incomplete. This 
> > document is
> > +meant as a guide to using the various memory barriers provided by Linux, 
> > but
> > +in case of any doubt (and there are many) please ask.
> 
> It might be worth adding you and me to the top of the file, to save Paul
> Cc'ing us on questions (get_maintainer.pl points at poor old Corbet for
> this file).
> 
> But yes, it seems that something like this is required.

So Peter, would you like to update your patch to include yourself
and Will as authors?

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Wed, Jan 27, 2016 at 12:52:07AM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Mon, Jan 18, 2016 at 07:46:29AM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 18, 2016 at 04:19:29PM +0800, Herbert Xu wrote:
> > > Paul E. McKenney  wrote:
> > > >
> > > > You could use SYNC_ACQUIRE() to implement read_barrier_depends() and
> > > > smp_read_barrier_depends(), but SYNC_RMB probably does not suffice.
> > > > The reason for this is that smp_read_barrier_depends() must order the
> > > > pointer load against any subsequent read or write through a dereference
> > > > of that pointer.  For example:
> > > > 
> > > >p = READ_ONCE(gp);
> > > >smp_rmb();
> > > >r1 = p->a; /* ordered by smp_rmb(). */
> > > >p->b = 42; /* NOT ordered by smp_rmb(), BUG!!! */
> > > >r2 = x; /* ordered by smp_rmb(), but doesn't need to be. */
> > > > 
> > > > In contrast:
> > > > 
> > > >p = READ_ONCE(gp);
> > > >smp_read_barrier_depends();
> > > >r1 = p->a; /* ordered by smp_read_barrier_depends(). */
> > > >p->b = 42; /* ordered by smp_read_barrier_depends(). */
> > > >r2 = x; /* not ordered by smp_read_barrier_depends(), which is 
> > > > OK. */
> > > > 
> > > > Again, if your hardware maintains local ordering for address
> > > > and data dependencies, you can have read_barrier_depends() and
> > > > smp_read_barrier_depends() be no-ops like they are for most
> > > > architectures.
> > > > 
> > > > Does that help?
> > > 
> > > This is crazy! smp_rmb started out being strictly stronger than
> > > smp_read_barrier_depends, when did this stop being the case?
> > 
> > Hello, Herbert!
> > 
> > It is true that most Linux kernel code relies only on the read-read
> > properties of dependencies, but the read-write properties are useful.
> > Admittedly relatively rarely, but useful.
> > 
> > The better comparison for smp_read_barrier_depends(), especially in
> > its rcu_dereference*() form, is smp_load_acquire().
> 
> Confused..
> 
> I recall that last time you and Linus came into a conclusion that even
> on Alpha, a barrier for read->write with data dependency is unnecessary:
> 
> http://article.gmane.org/gmane.linux.kernel/2077661
> 
> And in an earlier mail of that thread, Linus made his point that
> smp_read_barrier_depends() should only be used to order read->read.

Those examples involved read-to-write with conditionals, as in:

if (READ_ONCE(a))
WRITE_ONCE(b, 1);

Without the "if", no ordering is guaranteed on weakly ordered CPUs.
(The volatile accesses keep ordering within the compiler for once...

> So right now, are we going to extend the semantics of
> smp_read_barrier_depends()? Can we just make smp_read_barrier_depends()
> still only work for read->read, and assume all the architectures won't
> reorder read->write with data dependency, so that the code above having
> a smp_rmb() also works?

The semantics of smp_read_barrier_depends() has been both read-to-write
and read-to-read for some time now, this patch just catches the
documentation up with reality.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 12:10 PM, Paul E. McKenney
 wrote:
> On Tue, Jan 26, 2016 at 11:44:46AM -0800, Linus Torvalds wrote:
>>
>> > struct foo *x = READ_ONCE(*ptr);
>> > smp_read_barrier_depends();
>> > x->bar = 5;
>>
>> This case is complete BS. Stop perpetuating it. I already removed a
>> number of bogus cases of it, and I removed the incorrect documentation
>> that had this crap.
>
> If I understand your objection correctly, you want the above pattern
> expressed either like this:
>
> struct foo *x = rcu_dereference(*ptr);
> x->bar = 5;
>
> Or like this:
>
> struct foo *x = lockless_dereference(*ptr);
> x->bar = 5;
>
> Or am I missing your point?

You are entirely missing the point.

You might as well just write it as

struct foo x = READ_ONCE(*ptr);
x->bar = 5;

because that "smp_read_barrier_depends()" does NOTHING wrt the second write.

So what I am saying is simple: anybody who writes that
"smp_read_barrier_depends()" in there is just ttoally and completely
WRONG, and the fact that Peter wrote it out after I removed several
instances of that bloody f*cking idiocy is disturbing.

Don't do it. It's BS. It's wrong. Don't make excuses for it.

 Linus


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
 wrote:
>
> You might as well just write it as
>
> struct foo x = READ_ONCE(*ptr);
> x->bar = 5;
>
> because that "smp_read_barrier_depends()" does NOTHING wrt the second write.

Just to clarify: on alpha it adds a memory barrier, but that memory
barrier is useless.

On non-alpha, it is a no-op, and obviously does nothing simply because
it generates no code.

So if anybody believes that the "smp_read_barrier_depends()" does
something, they are *wrong*.

And if anybody sends out an email with that smp_read_barrier_depends()
in an example, they are actively just confusing other people, which is
even worse than just being wrong. Which is why I jumped in.

So stop perpetuating the myth that smp_read_barrier_depends() does
something here. It does not. It's a bug, and it has become this "mind
virus" for some people that seem to believe that it does something.

I had to remove this crap once from the kernel already, see commit
105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
atomic*_read_ctrl()").

I don't want to ever see that broken construct again. And I want to
make sure that everybody is educated about how broken it was. I'm
extremely unhappy that it came up again.

If it turns out that some architecture does actually need a barrier
between a read and a dependent write, then that will mean that

 (a) we'll have to make up a _new_ barrier, because
"smp_read_barrier_depends()" is not that barrier. We'll presumably
then have to make that new barrier part of "rcu_derefence()" and
friends.

 (b) we will have found an architecture with even worse memory
ordering semantics than alpha, and we'll have to stop castigating
alpha for being the worst memory ordering ever.

but I sincerely hope that we'll never find that kind of broken architecture.

   Linus


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:
> On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
>  wrote:
> >
> > You might as well just write it as
> >
> > struct foo x = READ_ONCE(*ptr);
> > x->bar = 5;
> >
> > because that "smp_read_barrier_depends()" does NOTHING wrt the second write.
> 
> Just to clarify: on alpha it adds a memory barrier, but that memory
> barrier is useless.

No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
needed.  That said, I believe that we should encourage rcu_dereference*()
or lockless_dereference() instead of READ_ONCE() for documentation
reasons, though.

> On non-alpha, it is a no-op, and obviously does nothing simply because
> it generates no code.
> 
> So if anybody believes that the "smp_read_barrier_depends()" does
> something, they are *wrong*.

The other problem with smp_read_barrier_depends() is that it is often
a pain figuring out which prior load it is supposed to apply to.
Hence my preference for rcu_dereference*() and lockless_dereference().

> And if anybody sends out an email with that smp_read_barrier_depends()
> in an example, they are actively just confusing other people, which is
> even worse than just being wrong. Which is why I jumped in.
> 
> So stop perpetuating the myth that smp_read_barrier_depends() does
> something here. It does not. It's a bug, and it has become this "mind
> virus" for some people that seem to believe that it does something.

It looks like I should add words to memory-barriers.txt de-emphasizing
smp_read_barrier_depends().  I will take a look at that.

> I had to remove this crap once from the kernel already, see commit
> 105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
> atomic*_read_ctrl()").
> 
> I don't want to ever see that broken construct again. And I want to
> make sure that everybody is educated about how broken it was. I'm
> extremely unhappy that it came up again.

Well, if it makes you feel better, that was control dependencies and this
was data dependencies.  So it was not -exactly- the same.  ;-)

(Sorry, couldn't resist...)

> If it turns out that some architecture does actually need a barrier
> between a read and a dependent write, then that will mean that
> 
>  (a) we'll have to make up a _new_ barrier, because
> "smp_read_barrier_depends()" is not that barrier. We'll presumably
> then have to make that new barrier part of "rcu_derefence()" and
> friends.

Agreed.  We can worry about whether or not we replace the current
smp_read_barrier_depends() with that new barrier when and if such
hardware appears.

>  (b) we will have found an architecture with even worse memory
> ordering semantics than alpha, and we'll have to stop castigating
> alpha for being the worst memory ordering ever.

;-) ;-) ;-)

> but I sincerely hope that we'll never find that kind of broken architecture.

Apparently at least some hardware vendors are reading memory-barriers.txt,
so perhaps the odds of that kind of breakage have reduced.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 12:10:10PM +, Will Deacon wrote:
> On Mon, Jan 25, 2016 at 05:06:46PM -0800, Paul E. McKenney wrote:
> > On Mon, Jan 25, 2016 at 02:41:34PM +, Will Deacon wrote:
> > > On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Jan 15, 2016 at 10:24:32AM +, Will Deacon wrote:
> > > > > > See my earlier reply [1] (but also, your WRC Linux example looks 
> > > > > > more
> > > > > > like a variant on WWC and I couldn't really follow it).
> > > > > 
> > > > > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > > > > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > > > > I am sure that it will seem more natural with time and experience...
> > > > 
> > > > Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> > > > last access from a store to a load to get WRC.  Plus the levels of
> > > > indirection definitely didn't match up, did they?
> > > 
> > > Nope, it was pretty baffling!
> > 
> > "It is a service that I provide."  ;-)
> > 
> > > > struct foo {
> > > > struct foo *next;
> > > > };
> > > > struct foo a;
> > > > struct foo b;
> > > > struct foo c = {  };
> > > > struct foo d = {  };
> > > > struct foo x = {  };
> > > > struct foo y = {  };
> > > > struct foo *r1, *r2, *r3;
> > > > 
> > > > void cpu0(void)
> > > > {
> > > > WRITE_ONCE(x.next, );
> > > > }
> > > > 
> > > > void cpu1(void)
> > > > {
> > > > r1 = lockless_dereference(x.next);
> > > > WRITE_ONCE(r1->next, );
> > > > }
> > > > 
> > > > void cpu2(void)
> > > > {
> > > > r2 = lockless_dereference(y.next);
> > > > r3 = READ_ONCE(r2->next);
> > > > }
> > > > 
> > > > In this case, it is legal to end the run with:
> > > > 
> > > > r1 ==  && r2 ==  && r3 == 
> > > > 
> > > > Please see below for a ppcmem litmus test.
> > > > 
> > > > So, did I get it right this time?  ;-)
> > > 
> > > The code above looks correct to me (in that it matches WRC+addrs),
> > > but your litmus test:
> > > 
> > > > PPC WRCnf+addrs
> > > > ""
> > > > {
> > > > 0:r2=x; 0:r3=y;
> > > > 1:r2=x; 1:r3=y;
> > > > 2:r2=x; 2:r3=y;
> > > > c=a; d=b; x=c; y=d;
> > > > }
> > > >  P0   | P1| P2;
> > > >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> > > >   | stw r2,0(r3)  | lwz r9,0(r8)  ;
> > > > exists
> > > > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> > > 
> > > Seems to be missing the address dependency on P1.
> > 
> > You are quite correct!  How about the following?
> 
> I think that's it!
> 
> > As before, both herd and ppcmem say that the cycle is allowed, as
> > expected, given non-transitive ordering.  To prohibit the cycle, P1
> > needs a suitable memory-barrier instruction.
> > 
> > 
> > 
> > PPC WRCnf+addrs
> > ""
> > {
> > 0:r2=x; 0:r3=y;
> > 1:r2=x; 1:r3=y;
> > 2:r2=x; 2:r3=y;
> > c=a; d=b; x=c; y=d;
> > }
> >  P0   | P1| P2;
> >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> >   | stw r2,0(r8)  | lwz r9,0(r8)  ;
> > exists
> > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> 
> Agreed.

OK, thank you!  Would you agree that it would be good to replace the
current xor-based fake-dependency litmus tests with tests having real
dependencies?

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 3:29 PM, Paul E. McKenney
 wrote:
>
> No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> needed.  That said, I believe that we should encourage rcu_dereference*()
> or lockless_dereference() instead of READ_ONCE() for documentation
> reasons, though.

I agree that that is likely the right thing to do in pretty much all situations.

In theory, there might be performance situations where we'd want to
actively avoid the smp_read_barrier_depends() inherent in those, but
considering that it's only a performance issue on alpha, and we
probably have all of two or three people using Linux on alpha, it's a
pretty theoretical performance worry.

  Linus


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Paul E. McKenney
On Tue, Jan 26, 2016 at 03:45:23PM -0800, Linus Torvalds wrote:
> On Tue, Jan 26, 2016 at 3:29 PM, Paul E. McKenney
>  wrote:
> >
> > No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> > needed.  That said, I believe that we should encourage rcu_dereference*()
> > or lockless_dereference() instead of READ_ONCE() for documentation
> > reasons, though.
> 
> I agree that that is likely the right thing to do in pretty much all 
> situations.
> 
> In theory, there might be performance situations where we'd want to
> actively avoid the smp_read_barrier_depends() inherent in those, but
> considering that it's only a performance issue on alpha, and we
> probably have all of two or three people using Linux on alpha, it's a
> pretty theoretical performance worry.

Agreed!

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Boqun Feng
On Tue, Jan 26, 2016 at 03:29:21PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 26, 2016 at 02:33:40PM -0800, Linus Torvalds wrote:
> > On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
> >  wrote:
> > >
> > > You might as well just write it as
> > >
> > > struct foo x = READ_ONCE(*ptr);
> > > x->bar = 5;
> > >
> > > because that "smp_read_barrier_depends()" does NOTHING wrt the second 
> > > write.
> > 
> > Just to clarify: on alpha it adds a memory barrier, but that memory
> > barrier is useless.
> 
> No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> needed.  That said, I believe that we should encourage rcu_dereference*()
> or lockless_dereference() instead of READ_ONCE() for documentation
> reasons, though.
> 
> > On non-alpha, it is a no-op, and obviously does nothing simply because
> > it generates no code.
> > 
> > So if anybody believes that the "smp_read_barrier_depends()" does
> > something, they are *wrong*.
> 
> The other problem with smp_read_barrier_depends() is that it is often
> a pain figuring out which prior load it is supposed to apply to.
> Hence my preference for rcu_dereference*() and lockless_dereference().
> 

Because semantically speaking, rcu_derefence*() and
lockless_dereference() are CONSUME(i.e. data/address dependent
read->read and read->write pairs are ordered), whereas
smp_read_barrier_depends() only guarantees read->read pairs with data
dependency are ordered, right?

If so, maybe we need to call it out in memory-barriers.txt, for example:

diff --git a/Documentation/memory-barriers.txt 
b/Documentation/memory-barriers.txt
index 904ee42..6b262c2 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1703,8 +1703,8 @@ There are some more advanced barrier functions:
 
 
  (*) lockless_dereference();
- This can be thought of as a pointer-fetch wrapper around the
- smp_read_barrier_depends() data-dependency barrier.
+ This is a load, and any load or store that has a data dependency on the
+ value returned by this load won't be reordered before this load.
 
  This is also similar to rcu_dereference(), but in cases where
  object lifetime is handled by some mechanism other than RCU, for


Regards,
Boqun

> > And if anybody sends out an email with that smp_read_barrier_depends()
> > in an example, they are actively just confusing other people, which is
> > even worse than just being wrong. Which is why I jumped in.
> > 
> > So stop perpetuating the myth that smp_read_barrier_depends() does
> > something here. It does not. It's a bug, and it has become this "mind
> > virus" for some people that seem to believe that it does something.
> 
> It looks like I should add words to memory-barriers.txt de-emphasizing
> smp_read_barrier_depends().  I will take a look at that.
> 
> > I had to remove this crap once from the kernel already, see commit
> > 105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
> > atomic*_read_ctrl()").
> > 
> > I don't want to ever see that broken construct again. And I want to
> > make sure that everybody is educated about how broken it was. I'm
> > extremely unhappy that it came up again.
> 
> Well, if it makes you feel better, that was control dependencies and this
> was data dependencies.  So it was not -exactly- the same.  ;-)
> 
> (Sorry, couldn't resist...)
> 
> > If it turns out that some architecture does actually need a barrier
> > between a read and a dependent write, then that will mean that
> > 
> >  (a) we'll have to make up a _new_ barrier, because
> > "smp_read_barrier_depends()" is not that barrier. We'll presumably
> > then have to make that new barrier part of "rcu_derefence()" and
> > friends.
> 
> Agreed.  We can worry about whether or not we replace the current
> smp_read_barrier_depends() with that new barrier when and if such
> hardware appears.
> 
> >  (b) we will have found an architecture with even worse memory
> > ordering semantics than alpha, and we'll have to stop castigating
> > alpha for being the worst memory ordering ever.
> 
> ;-) ;-) ;-)
> 
> > but I sincerely hope that we'll never find that kind of broken architecture.
> 
> Apparently at least some hardware vendors are reading memory-barriers.txt,
> so perhaps the odds of that kind of breakage have reduced.
> 
>   Thanx, Paul
> 


signature.asc
Description: PGP signature


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Peter Zijlstra
On Mon, Jan 25, 2016 at 10:12:11PM -0800, Paul E. McKenney wrote:
> On Mon, Jan 25, 2016 at 06:02:34PM +, Will Deacon wrote:

> > Thanks for having a go at this. I tried defining something axiomatically,
> > but got stuck pretty quickly. In my scheme, I used "data-directed
> > transitivity" instead of "local transitivity", since the latter seems to
> > be a bit of a misnomer.
> 
> I figured that "local" meant local to the CPUs participating in the
> release-acquire chain.  As opposed to smp_mb() chains where the ordering
> is "global" as in visible to all CPUs, whether on the chain or not.
> Does that help?

That is in fact how I read and understood it.




Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Paul E. McKenney
On Mon, Jan 25, 2016 at 06:02:34PM +, Will Deacon wrote:
> Hi Paul,
> 
> On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> > > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > > > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > > > and smp_read_acquire(), 
> > > 
> > > But they provide different grades of transitivity, which is where all
> > > the confusion lays.
> > > 
> > > smp_mb() is strongly/globally transitive, all CPUs will agree on the 
> > > order.
> > > 
> > > Whereas the RCpc release+acquire is weakly so, only the two cpus
> > > involved in the handover will agree on the order.
> > 
> > Good point!
> > 
> > Using grace periods in place of smp_mb() also provides strong/global
> > transitivity, but also insanely high latencies.  ;-)
> > 
> > The patch below updates Documentation/memory-barriers.txt to define
> > local vs. global transitivity.  The corresponding ppcmem litmus test
> > is included below as well.
> > 
> > Should we start putting litmus tests for the various examples
> > somewhere, perhaps in a litmus-tests directory within each participating
> > architecture?  I have a pile of powerpc-related litmus tests on my laptop,
> > but they probably aren't doing all that much good there.
> 
> I too would like to have the litmus tests in the kernel so that we can
> refer to them from memory-barriers.txt. Ideally they wouldn't be targetted
> to a particular arch, however.

Agreed.  Working on it...

> > PPC local-transitive
> > ""
> > {
> > 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z;
> > 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z;
> > 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z;
> > 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z;
> > }
> >  P0   | P1   | P2   | P3   ;
> >  lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ;
> >  lwsync   | lwsync   | lwsync   | sync ;
> >  stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ;
> >  lwsync   | lwz r7,0(r2) |  |  ;
> >  stw r1,0(r5) | lwsync   |  |  ;
> >   | stw r1,0(r6) |  |  ;
> > exists
> > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *)
> > (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *)
> > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *)
> > (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0)
> 
> i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc.

Yep!

> > 
> > 
> > commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41
> > Author: Paul E. McKenney 
> > Date:   Fri Jan 15 09:30:42 2016 -0800
> > 
> > documentation: Distinguish between local and global transitivity
> > 
> > The introduction of smp_load_acquire() and smp_store_release() had
> > the side effect of introducing a weaker notion of transitivity:
> > The transitivity of full smp_mb() barriers is global, but that
> > of smp_store_release()/smp_load_acquire() chains is local.  This
> > commit therefore introduces the notion of local transitivity and
> > gives an example.
> > 
> > Reported-by: Peter Zijlstra 
> > Reported-by: Will Deacon 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/Documentation/memory-barriers.txt 
> > b/Documentation/memory-barriers.txt
> > index c66ba46d8079..d8109ed99342 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access 
> > to CPU 1's writes.
> >  General barriers are therefore required to ensure that all CPUs agree
> >  on the combined order of CPU 1's and CPU 2's accesses.
> >  
> > -To reiterate, if your code requires transitivity, use general barriers
> > -throughout.
> > +General barriers provide "global transitivity", so that all CPUs will
> > +agree on the order of operations.  In contrast, a chain of release-acquire
> > +pairs provides only "local transitivity", so that only those CPUs on
> > +the chain are guaranteed to agree on the combined order of the accesses.
> 
> Thanks for having a go at this. I tried defining something axiomatically,
> but got stuck pretty quickly. In my scheme, I used "data-directed
> transitivity" instead of "local transitivity", since the latter seems to
> be a bit of a misnomer.

I figured that "local" meant local to the CPUs participating in the
release-acquire chain.  As opposed to smp_mb() chains where the ordering
is "global" as in visible to all CPUs, whether on the chain or not.
Does that help?

> > +For example, switching to C code in deference to Herman Hollerith:
> > +
> > +   int u, v, x, y, z;
> > +
> > +   void cpu0(void)
> > +   {
> > +   r0 = smp_load_acquire();
> > +   WRITE_ONCE(u, 1);
> > +   smp_store_release(, 1);
> > +   }
> > +
> > + 

Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Paul E. McKenney
On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 15, 2016 at 09:46:12AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:
> > > 
> > > > > And the stuff we're confused about is how best to express the 
> > > > > difference
> > > > > and guarantees of these two forms of transitivity and how exactly they
> > > > > interact.
> > > > 
> > > > Hoping my memory-barrier.txt patch helps here...
> > > 
> > > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > > of two globally ordered sequences connected by a single local link.
> > 
> > The conclusion that I am slowly coming to is that litmus tests should
> > not be thought of as linear chains, but rather as cycles.  If you think
> > of it as a cycle, then it doesn't matter where the local link is, just
> > how many of them and how they are connected.
> 
> Do you have some examples of this? I'm struggling to make it work in my
> mind, or are you talking specifically in the context of the kernel
> memory model?

Now that you mention it, maybe it would be best to keep the transitive
and non-transitive separate for the time being anyway.  Just because it
might be possible to deal with does not necessarily mean that we should
be encouraging it.  ;-)

> > But I will admit that there are some rather strange litmus tests that
> > challenge this cycle-centric view, for example, the one shown below.
> > It turns out that herd and ppcmem disagree on the outcome.  (The Power
> > architects side with ppcmem.)
> > 
> > > And I think I'm still confused on LWSYNC (in the smp_wmb case) when one
> > > of the stores looses a conflict, and if that scenario matters. If it
> > > does, we should inspect the same case for other barriers.
> > 
> > Indeed.  I am still working on how these should be described.  My
> > current thought is to be quite conservative on what ordering is
> > actually respected, however, the current task is formalizing how
> > RCU plays with the rest of the memory model.
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > PPC Overlapping Group-B sets version 4
> > ""
> > (* When the Group-B sets from two different barriers involve instructions in
> >the same thread, within that thread one set must contain the other.
> > 
> > P0  P1  P2
> > Rx=1Wy=1Wz=2
> > dep.lwsync  lwsync
> > Ry=0Wz=1Wx=1
> > Rz=1
> > 
> > assert(!(z=2))
> > 
> >Forbidden by ppcmem, allowed by herd.
> > *)
> > {
> > 0:r1=x; 0:r2=y; 0:r3=z;
> > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > }
> >  P0 | P1| P2;
> >  lwz r6,0(r1)   | stw r4,0(r2)  | stw r5,0(r3)  ;
> >  xor r7,r6,r6   | lwsync| lwsync;
> >  lwzx r7,r7,r2  | stw r4,0(r3)  | stw r4,0(r1)  ;
> >  lwz r8,0(r3)   |   |   ;
> > 
> > exists
> > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> 
> That really hurts. Assuming that the "assert(!(z=2))" is actually there
> to constrain the coherence order of z to be {0->1->2}, then I think that
> this test is forbidden on arm using dmb instead of lwsync. That said, I
> also don't think the Rz=1 in P0 changes anything.

What about the smp_wmb() variant of dmb that orders only stores?

> The double negatives don't help here! (it is forbidden to guarantee that
> z is not always 2).

Yes, this is a weird one, and I don't know of any use of it.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Paul E. McKenney
On Mon, Jan 25, 2016 at 02:41:34PM +, Will Deacon wrote:
> On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 10:24:32AM +, Will Deacon wrote:
> > > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > > like a variant on WWC and I couldn't really follow it).
> > > 
> > > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > > I am sure that it will seem more natural with time and experience...
> > 
> > Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> > last access from a store to a load to get WRC.  Plus the levels of
> > indirection definitely didn't match up, did they?
> 
> Nope, it was pretty baffling!

"It is a service that I provide."  ;-)

> > struct foo {
> > struct foo *next;
> > };
> > struct foo a;
> > struct foo b;
> > struct foo c = {  };
> > struct foo d = {  };
> > struct foo x = {  };
> > struct foo y = {  };
> > struct foo *r1, *r2, *r3;
> > 
> > void cpu0(void)
> > {
> > WRITE_ONCE(x.next, );
> > }
> > 
> > void cpu1(void)
> > {
> > r1 = lockless_dereference(x.next);
> > WRITE_ONCE(r1->next, );
> > }
> > 
> > void cpu2(void)
> > {
> > r2 = lockless_dereference(y.next);
> > r3 = READ_ONCE(r2->next);
> > }
> > 
> > In this case, it is legal to end the run with:
> > 
> > r1 ==  && r2 ==  && r3 == 
> > 
> > Please see below for a ppcmem litmus test.
> > 
> > So, did I get it right this time?  ;-)
> 
> The code above looks correct to me (in that it matches WRC+addrs),
> but your litmus test:
> 
> > PPC WRCnf+addrs
> > ""
> > {
> > 0:r2=x; 0:r3=y;
> > 1:r2=x; 1:r3=y;
> > 2:r2=x; 2:r3=y;
> > c=a; d=b; x=c; y=d;
> > }
> >  P0   | P1| P2;
> >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> >   | stw r2,0(r3)  | lwz r9,0(r8)  ;
> > exists
> > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> 
> Seems to be missing the address dependency on P1.

You are quite correct!  How about the following?

As before, both herd and ppcmem say that the cycle is allowed, as
expected, given non-transitive ordering.  To prohibit the cycle, P1
needs a suitable memory-barrier instruction.

Thanx, Paul



PPC WRCnf+addrs
""
{
0:r2=x; 0:r3=y;
1:r2=x; 1:r3=y;
2:r2=x; 2:r3=y;
c=a; d=b; x=c; y=d;
}
 P0   | P1| P2;
 stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
  | stw r2,0(r8)  | lwz r9,0(r8)  ;
exists
(1:r8=y /\ 2:r8=x /\ 2:r9=c)

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Will Deacon
Hi Paul,

On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > > and smp_read_acquire(), 
> > 
> > But they provide different grades of transitivity, which is where all
> > the confusion lays.
> > 
> > smp_mb() is strongly/globally transitive, all CPUs will agree on the order.
> > 
> > Whereas the RCpc release+acquire is weakly so, only the two cpus
> > involved in the handover will agree on the order.
> 
> Good point!
> 
> Using grace periods in place of smp_mb() also provides strong/global
> transitivity, but also insanely high latencies.  ;-)
> 
> The patch below updates Documentation/memory-barriers.txt to define
> local vs. global transitivity.  The corresponding ppcmem litmus test
> is included below as well.
> 
> Should we start putting litmus tests for the various examples
> somewhere, perhaps in a litmus-tests directory within each participating
> architecture?  I have a pile of powerpc-related litmus tests on my laptop,
> but they probably aren't doing all that much good there.

I too would like to have the litmus tests in the kernel so that we can
refer to them from memory-barriers.txt. Ideally they wouldn't be targetted
to a particular arch, however.

> PPC local-transitive
> ""
> {
> 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z;
> 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z;
> 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z;
> 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z;
> }
>  P0   | P1   | P2   | P3   ;
>  lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ;
>  lwsync   | lwsync   | lwsync   | sync ;
>  stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ;
>  lwsync   | lwz r7,0(r2) |  |  ;
>  stw r1,0(r5) | lwsync   |  |  ;
>   | stw r1,0(r6) |  |  ;
> exists
> (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *)
> (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *)
> (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *)
> (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0)

i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc.

> 
> 
> commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41
> Author: Paul E. McKenney 
> Date:   Fri Jan 15 09:30:42 2016 -0800
> 
> documentation: Distinguish between local and global transitivity
> 
> The introduction of smp_load_acquire() and smp_store_release() had
> the side effect of introducing a weaker notion of transitivity:
> The transitivity of full smp_mb() barriers is global, but that
> of smp_store_release()/smp_load_acquire() chains is local.  This
> commit therefore introduces the notion of local transitivity and
> gives an example.
> 
> Reported-by: Peter Zijlstra 
> Reported-by: Will Deacon 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index c66ba46d8079..d8109ed99342 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to 
> CPU 1's writes.
>  General barriers are therefore required to ensure that all CPUs agree
>  on the combined order of CPU 1's and CPU 2's accesses.
>  
> -To reiterate, if your code requires transitivity, use general barriers
> -throughout.
> +General barriers provide "global transitivity", so that all CPUs will
> +agree on the order of operations.  In contrast, a chain of release-acquire
> +pairs provides only "local transitivity", so that only those CPUs on
> +the chain are guaranteed to agree on the combined order of the accesses.

Thanks for having a go at this. I tried defining something axiomatically,
but got stuck pretty quickly. In my scheme, I used "data-directed
transitivity" instead of "local transitivity", since the latter seems to
be a bit of a misnomer.

> +For example, switching to C code in deference to Herman Hollerith:
> +
> + int u, v, x, y, z;
> +
> + void cpu0(void)
> + {
> + r0 = smp_load_acquire();
> + WRITE_ONCE(u, 1);
> + smp_store_release(, 1);
> + }
> +
> + void cpu1(void)
> + {
> + r1 = smp_load_acquire();
> + r4 = READ_ONCE(v);
> + r5 = READ_ONCE(u);
> + smp_store_release(, 1);
> + }
> +
> + void cpu2(void)
> + {
> + r2 = smp_load_acquire();
> + smp_store_release(, 1);
> + }
> +
> + void cpu3(void)
> + {
> + WRITE_ONCE(v, 1);
> + smp_mb();
> + r3 = READ_ONCE(u);
> + }
> +
> +Because cpu0(), cpu1(), and cpu2() participate in a 

Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Will Deacon
On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 15, 2016 at 09:46:12AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:
> > 
> > > > And the stuff we're confused about is how best to express the difference
> > > > and guarantees of these two forms of transitivity and how exactly they
> > > > interact.
> > > 
> > > Hoping my memory-barrier.txt patch helps here...
> > 
> > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > of two globally ordered sequences connected by a single local link.
> 
> The conclusion that I am slowly coming to is that litmus tests should
> not be thought of as linear chains, but rather as cycles.  If you think
> of it as a cycle, then it doesn't matter where the local link is, just
> how many of them and how they are connected.

Do you have some examples of this? I'm struggling to make it work in my
mind, or are you talking specifically in the context of the kernel
memory model?

> But I will admit that there are some rather strange litmus tests that
> challenge this cycle-centric view, for example, the one shown below.
> It turns out that herd and ppcmem disagree on the outcome.  (The Power
> architects side with ppcmem.)
> 
> > And I think I'm still confused on LWSYNC (in the smp_wmb case) when one
> > of the stores looses a conflict, and if that scenario matters. If it
> > does, we should inspect the same case for other barriers.
> 
> Indeed.  I am still working on how these should be described.  My
> current thought is to be quite conservative on what ordering is
> actually respected, however, the current task is formalizing how
> RCU plays with the rest of the memory model.
> 
>   Thanx, Paul
> 
> 
> 
> PPC Overlapping Group-B sets version 4
> ""
> (* When the Group-B sets from two different barriers involve instructions in
>the same thread, within that thread one set must contain the other.
> 
>   P0  P1  P2
>   Rx=1Wy=1Wz=2
>   dep.lwsync  lwsync
>   Ry=0Wz=1Wx=1
>   Rz=1
> 
>   assert(!(z=2))
> 
>Forbidden by ppcmem, allowed by herd.
> *)
> {
> 0:r1=x; 0:r2=y; 0:r3=z;
> 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> }
>  P0   | P1| P2;
>  lwz r6,0(r1) | stw r4,0(r2)  | stw r5,0(r3)  ;
>  xor r7,r6,r6 | lwsync| lwsync;
>  lwzx r7,r7,r2| stw r4,0(r3)  | stw r4,0(r1)  ;
>  lwz r8,0(r3) |   |   ;
> 
> exists
> (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)

That really hurts. Assuming that the "assert(!(z=2))" is actually there
to constrain the coherence order of z to be {0->1->2}, then I think that
this test is forbidden on arm using dmb instead of lwsync. That said, I
also don't think the Rz=1 in P0 changes anything.

The double negatives don't help here! (it is forbidden to guarantee that
z is not always 2).

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Will Deacon
On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 10:24:32AM +, Will Deacon wrote:
> > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > like a variant on WWC and I couldn't really follow it).
> > 
> > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > I am sure that it will seem more natural with time and experience...
> 
> Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> last access from a store to a load to get WRC.  Plus the levels of
> indirection definitely didn't match up, did they?

Nope, it was pretty baffling!

>   struct foo {
>   struct foo *next;
>   };
>   struct foo a;
>   struct foo b;
>   struct foo c = {  };
>   struct foo d = {  };
>   struct foo x = {  };
>   struct foo y = {  };
>   struct foo *r1, *r2, *r3;
> 
>   void cpu0(void)
>   {
>   WRITE_ONCE(x.next, );
>   }
> 
>   void cpu1(void)
>   {
>   r1 = lockless_dereference(x.next);
>   WRITE_ONCE(r1->next, );
>   }
> 
>   void cpu2(void)
>   {
>   r2 = lockless_dereference(y.next);
>   r3 = READ_ONCE(r2->next);
>   }
> 
> In this case, it is legal to end the run with:
> 
>   r1 ==  && r2 ==  && r3 == 
> 
> Please see below for a ppcmem litmus test.
> 
> So, did I get it right this time?  ;-)

The code above looks correct to me (in that it matches WRC+addrs),
but your litmus test:

> PPC WRCnf+addrs
> ""
> {
> 0:r2=x; 0:r3=y;
> 1:r2=x; 1:r3=y;
> 2:r2=x; 2:r3=y;
> c=a; d=b; x=c; y=d;
> }
>  P0   | P1| P2;
>  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
>   | stw r2,0(r3)  | lwz r9,0(r8)  ;
> exists
> (1:r8=y /\ 2:r8=x /\ 2:r9=c)

Seems to be missing the address dependency on P1.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Paul E. McKenney
On Mon, Jan 25, 2016 at 06:02:34PM +, Will Deacon wrote:
> Hi Paul,
> 
> On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> > > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > > > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > > > and smp_read_acquire(), 
> > > 
> > > But they provide different grades of transitivity, which is where all
> > > the confusion lays.
> > > 
> > > smp_mb() is strongly/globally transitive, all CPUs will agree on the 
> > > order.
> > > 
> > > Whereas the RCpc release+acquire is weakly so, only the two cpus
> > > involved in the handover will agree on the order.
> > 
> > Good point!
> > 
> > Using grace periods in place of smp_mb() also provides strong/global
> > transitivity, but also insanely high latencies.  ;-)
> > 
> > The patch below updates Documentation/memory-barriers.txt to define
> > local vs. global transitivity.  The corresponding ppcmem litmus test
> > is included below as well.
> > 
> > Should we start putting litmus tests for the various examples
> > somewhere, perhaps in a litmus-tests directory within each participating
> > architecture?  I have a pile of powerpc-related litmus tests on my laptop,
> > but they probably aren't doing all that much good there.
> 
> I too would like to have the litmus tests in the kernel so that we can
> refer to them from memory-barriers.txt. Ideally they wouldn't be targetted
> to a particular arch, however.

Agreed.  Working on it...

> > PPC local-transitive
> > ""
> > {
> > 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z;
> > 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z;
> > 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z;
> > 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z;
> > }
> >  P0   | P1   | P2   | P3   ;
> >  lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ;
> >  lwsync   | lwsync   | lwsync   | sync ;
> >  stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ;
> >  lwsync   | lwz r7,0(r2) |  |  ;
> >  stw r1,0(r5) | lwsync   |  |  ;
> >   | stw r1,0(r6) |  |  ;
> > exists
> > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *)
> > (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *)
> > (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *)
> > (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0)
> 
> i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc.

Yep!

> > 
> > 
> > commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41
> > Author: Paul E. McKenney 
> > Date:   Fri Jan 15 09:30:42 2016 -0800
> > 
> > documentation: Distinguish between local and global transitivity
> > 
> > The introduction of smp_load_acquire() and smp_store_release() had
> > the side effect of introducing a weaker notion of transitivity:
> > The transitivity of full smp_mb() barriers is global, but that
> > of smp_store_release()/smp_load_acquire() chains is local.  This
> > commit therefore introduces the notion of local transitivity and
> > gives an example.
> > 
> > Reported-by: Peter Zijlstra 
> > Reported-by: Will Deacon 
> > Signed-off-by: Paul E. McKenney 
> > 
> > diff --git a/Documentation/memory-barriers.txt 
> > b/Documentation/memory-barriers.txt
> > index c66ba46d8079..d8109ed99342 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access 
> > to CPU 1's writes.
> >  General barriers are therefore required to ensure that all CPUs agree
> >  on the combined order of CPU 1's and CPU 2's accesses.
> >  
> > -To reiterate, if your code requires transitivity, use general barriers
> > -throughout.
> > +General barriers provide "global transitivity", so that all CPUs will
> > +agree on the order of operations.  In contrast, a chain of release-acquire
> > +pairs provides only "local transitivity", so that only those CPUs on
> > +the chain are guaranteed to agree on the combined order of the accesses.
> 
> Thanks for having a go at this. I tried defining something axiomatically,
> but got stuck pretty quickly. In my scheme, I used "data-directed
> transitivity" instead of "local transitivity", since the latter seems to
> be a bit of a misnomer.

I figured that "local" meant local to the CPUs participating in the
release-acquire chain.  As opposed to smp_mb() chains where the ordering
is "global" as in visible to all CPUs, whether on the chain or not.
Does that help?

> > +For example, switching to C code in deference to Herman Hollerith:
> > +
> > +   int u, v, x, y, z;
> > +
> > +   void cpu0(void)
> > +   {
> > +   r0 = 

Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Paul E. McKenney
On Mon, Jan 25, 2016 at 04:42:43PM +, Will Deacon wrote:
> On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> > > On Fri, Jan 15, 2016 at 09:46:12AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:
> > > 
> > > > > And the stuff we're confused about is how best to express the 
> > > > > difference
> > > > > and guarantees of these two forms of transitivity and how exactly they
> > > > > interact.
> > > > 
> > > > Hoping my memory-barrier.txt patch helps here...
> > > 
> > > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > > of two globally ordered sequences connected by a single local link.
> > 
> > The conclusion that I am slowly coming to is that litmus tests should
> > not be thought of as linear chains, but rather as cycles.  If you think
> > of it as a cycle, then it doesn't matter where the local link is, just
> > how many of them and how they are connected.
> 
> Do you have some examples of this? I'm struggling to make it work in my
> mind, or are you talking specifically in the context of the kernel
> memory model?

Now that you mention it, maybe it would be best to keep the transitive
and non-transitive separate for the time being anyway.  Just because it
might be possible to deal with does not necessarily mean that we should
be encouraging it.  ;-)

> > But I will admit that there are some rather strange litmus tests that
> > challenge this cycle-centric view, for example, the one shown below.
> > It turns out that herd and ppcmem disagree on the outcome.  (The Power
> > architects side with ppcmem.)
> > 
> > > And I think I'm still confused on LWSYNC (in the smp_wmb case) when one
> > > of the stores looses a conflict, and if that scenario matters. If it
> > > does, we should inspect the same case for other barriers.
> > 
> > Indeed.  I am still working on how these should be described.  My
> > current thought is to be quite conservative on what ordering is
> > actually respected, however, the current task is formalizing how
> > RCU plays with the rest of the memory model.
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > PPC Overlapping Group-B sets version 4
> > ""
> > (* When the Group-B sets from two different barriers involve instructions in
> >the same thread, within that thread one set must contain the other.
> > 
> > P0  P1  P2
> > Rx=1Wy=1Wz=2
> > dep.lwsync  lwsync
> > Ry=0Wz=1Wx=1
> > Rz=1
> > 
> > assert(!(z=2))
> > 
> >Forbidden by ppcmem, allowed by herd.
> > *)
> > {
> > 0:r1=x; 0:r2=y; 0:r3=z;
> > 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> > 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> > }
> >  P0 | P1| P2;
> >  lwz r6,0(r1)   | stw r4,0(r2)  | stw r5,0(r3)  ;
> >  xor r7,r6,r6   | lwsync| lwsync;
> >  lwzx r7,r7,r2  | stw r4,0(r3)  | stw r4,0(r1)  ;
> >  lwz r8,0(r3)   |   |   ;
> > 
> > exists
> > (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)
> 
> That really hurts. Assuming that the "assert(!(z=2))" is actually there
> to constrain the coherence order of z to be {0->1->2}, then I think that
> this test is forbidden on arm using dmb instead of lwsync. That said, I
> also don't think the Rz=1 in P0 changes anything.

What about the smp_wmb() variant of dmb that orders only stores?

> The double negatives don't help here! (it is forbidden to guarantee that
> z is not always 2).

Yes, this is a weird one, and I don't know of any use of it.

Thanx, Paul



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Paul E. McKenney
On Mon, Jan 25, 2016 at 02:41:34PM +, Will Deacon wrote:
> On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 10:24:32AM +, Will Deacon wrote:
> > > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > > like a variant on WWC and I couldn't really follow it).
> > > 
> > > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > > I am sure that it will seem more natural with time and experience...
> > 
> > Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> > last access from a store to a load to get WRC.  Plus the levels of
> > indirection definitely didn't match up, did they?
> 
> Nope, it was pretty baffling!

"It is a service that I provide."  ;-)

> > struct foo {
> > struct foo *next;
> > };
> > struct foo a;
> > struct foo b;
> > struct foo c = {  };
> > struct foo d = {  };
> > struct foo x = {  };
> > struct foo y = {  };
> > struct foo *r1, *r2, *r3;
> > 
> > void cpu0(void)
> > {
> > WRITE_ONCE(x.next, );
> > }
> > 
> > void cpu1(void)
> > {
> > r1 = lockless_dereference(x.next);
> > WRITE_ONCE(r1->next, );
> > }
> > 
> > void cpu2(void)
> > {
> > r2 = lockless_dereference(y.next);
> > r3 = READ_ONCE(r2->next);
> > }
> > 
> > In this case, it is legal to end the run with:
> > 
> > r1 ==  && r2 ==  && r3 == 
> > 
> > Please see below for a ppcmem litmus test.
> > 
> > So, did I get it right this time?  ;-)
> 
> The code above looks correct to me (in that it matches WRC+addrs),
> but your litmus test:
> 
> > PPC WRCnf+addrs
> > ""
> > {
> > 0:r2=x; 0:r3=y;
> > 1:r2=x; 1:r3=y;
> > 2:r2=x; 2:r3=y;
> > c=a; d=b; x=c; y=d;
> > }
> >  P0   | P1| P2;
> >  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
> >   | stw r2,0(r3)  | lwz r9,0(r8)  ;
> > exists
> > (1:r8=y /\ 2:r8=x /\ 2:r9=c)
> 
> Seems to be missing the address dependency on P1.

You are quite correct!  How about the following?

As before, both herd and ppcmem say that the cycle is allowed, as
expected, given non-transitive ordering.  To prohibit the cycle, P1
needs a suitable memory-barrier instruction.

Thanx, Paul



PPC WRCnf+addrs
""
{
0:r2=x; 0:r3=y;
1:r2=x; 1:r3=y;
2:r2=x; 2:r3=y;
c=a; d=b; x=c; y=d;
}
 P0   | P1| P2;
 stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
  | stw r2,0(r8)  | lwz r9,0(r8)  ;
exists
(1:r8=y /\ 2:r8=x /\ 2:r9=c)

-- 
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.



Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Will Deacon
On Fri, Jan 15, 2016 at 11:28:45AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 09:54:01AM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 15, 2016 at 10:24:32AM +, Will Deacon wrote:
> > > See my earlier reply [1] (but also, your WRC Linux example looks more
> > > like a variant on WWC and I couldn't really follow it).
> > 
> > I will revisit my WRC Linux example.  And yes, creating litmus tests
> > that use non-fake dependencies is still a bit of an undertaking.  :-/
> > I am sure that it will seem more natural with time and experience...
> 
> Hmmm...  You are quite right, I did do WWC.  I need to change cpu2()'s
> last access from a store to a load to get WRC.  Plus the levels of
> indirection definitely didn't match up, did they?

Nope, it was pretty baffling!

>   struct foo {
>   struct foo *next;
>   };
>   struct foo a;
>   struct foo b;
>   struct foo c = {  };
>   struct foo d = {  };
>   struct foo x = {  };
>   struct foo y = {  };
>   struct foo *r1, *r2, *r3;
> 
>   void cpu0(void)
>   {
>   WRITE_ONCE(x.next, );
>   }
> 
>   void cpu1(void)
>   {
>   r1 = lockless_dereference(x.next);
>   WRITE_ONCE(r1->next, );
>   }
> 
>   void cpu2(void)
>   {
>   r2 = lockless_dereference(y.next);
>   r3 = READ_ONCE(r2->next);
>   }
> 
> In this case, it is legal to end the run with:
> 
>   r1 ==  && r2 ==  && r3 == 
> 
> Please see below for a ppcmem litmus test.
> 
> So, did I get it right this time?  ;-)

The code above looks correct to me (in that it matches WRC+addrs),
but your litmus test:

> PPC WRCnf+addrs
> ""
> {
> 0:r2=x; 0:r3=y;
> 1:r2=x; 1:r3=y;
> 2:r2=x; 2:r3=y;
> c=a; d=b; x=c; y=d;
> }
>  P0   | P1| P2;
>  stw r3,0(r2) | lwz r8,0(r2)  | lwz r8,0(r3)  ;
>   | stw r2,0(r3)  | lwz r9,0(r8)  ;
> exists
> (1:r8=y /\ 2:r8=x /\ 2:r9=c)

Seems to be missing the address dependency on P1.

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Will Deacon
On Fri, Jan 15, 2016 at 01:58:53PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 10:27:14PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 15, 2016 at 09:46:12AM -0800, Paul E. McKenney wrote:
> > > On Fri, Jan 15, 2016 at 10:13:48AM +0100, Peter Zijlstra wrote:
> > 
> > > > And the stuff we're confused about is how best to express the difference
> > > > and guarantees of these two forms of transitivity and how exactly they
> > > > interact.
> > > 
> > > Hoping my memory-barrier.txt patch helps here...
> > 
> > Yes, that seems a good start. But yesterday you raised the 'fun' point
> > of two globally ordered sequences connected by a single local link.
> 
> The conclusion that I am slowly coming to is that litmus tests should
> not be thought of as linear chains, but rather as cycles.  If you think
> of it as a cycle, then it doesn't matter where the local link is, just
> how many of them and how they are connected.

Do you have some examples of this? I'm struggling to make it work in my
mind, or are you talking specifically in the context of the kernel
memory model?

> But I will admit that there are some rather strange litmus tests that
> challenge this cycle-centric view, for example, the one shown below.
> It turns out that herd and ppcmem disagree on the outcome.  (The Power
> architects side with ppcmem.)
> 
> > And I think I'm still confused on LWSYNC (in the smp_wmb case) when one
> > of the stores looses a conflict, and if that scenario matters. If it
> > does, we should inspect the same case for other barriers.
> 
> Indeed.  I am still working on how these should be described.  My
> current thought is to be quite conservative on what ordering is
> actually respected, however, the current task is formalizing how
> RCU plays with the rest of the memory model.
> 
>   Thanx, Paul
> 
> 
> 
> PPC Overlapping Group-B sets version 4
> ""
> (* When the Group-B sets from two different barriers involve instructions in
>the same thread, within that thread one set must contain the other.
> 
>   P0  P1  P2
>   Rx=1Wy=1Wz=2
>   dep.lwsync  lwsync
>   Ry=0Wz=1Wx=1
>   Rz=1
> 
>   assert(!(z=2))
> 
>Forbidden by ppcmem, allowed by herd.
> *)
> {
> 0:r1=x; 0:r2=y; 0:r3=z;
> 1:r1=x; 1:r2=y; 1:r3=z; 1:r4=1;
> 2:r1=x; 2:r2=y; 2:r3=z; 2:r4=1; 2:r5=2;
> }
>  P0   | P1| P2;
>  lwz r6,0(r1) | stw r4,0(r2)  | stw r5,0(r3)  ;
>  xor r7,r6,r6 | lwsync| lwsync;
>  lwzx r7,r7,r2| stw r4,0(r3)  | stw r4,0(r1)  ;
>  lwz r8,0(r3) |   |   ;
> 
> exists
> (z=2 /\ 0:r6=1 /\ 0:r7=0 /\ 0:r8=1)

That really hurts. Assuming that the "assert(!(z=2))" is actually there
to constrain the coherence order of z to be {0->1->2}, then I think that
this test is forbidden on arm using dmb instead of lwsync. That said, I
also don't think the Rz=1 in P0 changes anything.

The double negatives don't help here! (it is forbidden to guarantee that
z is not always 2).

Will


Re: [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-25 Thread Will Deacon
Hi Paul,

On Fri, Jan 15, 2016 at 09:39:12AM -0800, Paul E. McKenney wrote:
> On Fri, Jan 15, 2016 at 09:55:54AM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 14, 2016 at 01:29:13PM -0800, Paul E. McKenney wrote:
> > > So smp_mb() provides transitivity, as do pairs of smp_store_release()
> > > and smp_read_acquire(), 
> > 
> > But they provide different grades of transitivity, which is where all
> > the confusion lays.
> > 
> > smp_mb() is strongly/globally transitive, all CPUs will agree on the order.
> > 
> > Whereas the RCpc release+acquire is weakly so, only the two cpus
> > involved in the handover will agree on the order.
> 
> Good point!
> 
> Using grace periods in place of smp_mb() also provides strong/global
> transitivity, but also insanely high latencies.  ;-)
> 
> The patch below updates Documentation/memory-barriers.txt to define
> local vs. global transitivity.  The corresponding ppcmem litmus test
> is included below as well.
> 
> Should we start putting litmus tests for the various examples
> somewhere, perhaps in a litmus-tests directory within each participating
> architecture?  I have a pile of powerpc-related litmus tests on my laptop,
> but they probably aren't doing all that much good there.

I too would like to have the litmus tests in the kernel so that we can
refer to them from memory-barriers.txt. Ideally they wouldn't be targetted
to a particular arch, however.

> PPC local-transitive
> ""
> {
> 0:r1=1; 0:r2=u; 0:r3=v; 0:r4=x; 0:r5=y; 0:r6=z;
> 1:r1=1; 1:r2=u; 1:r3=v; 1:r4=x; 1:r5=y; 1:r6=z;
> 2:r1=1; 2:r2=u; 2:r3=v; 2:r4=x; 2:r5=y; 2:r6=z;
> 3:r1=1; 3:r2=u; 3:r3=v; 3:r4=x; 3:r5=y; 3:r6=z;
> }
>  P0   | P1   | P2   | P3   ;
>  lwz r9,0(r4) | lwz r9,0(r5) | lwz r9,0(r6) | stw r1,0(r3) ;
>  lwsync   | lwsync   | lwsync   | sync ;
>  stw r1,0(r2) | lwz r8,0(r3) | stw r1,0(r7) | lwz r9,0(r2) ;
>  lwsync   | lwz r7,0(r2) |  |  ;
>  stw r1,0(r5) | lwsync   |  |  ;
>   | stw r1,0(r6) |  |  ;
> exists
> (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r8=0 /\ 3:r9=0) *)
> (* (0:r9=1 /\ 1:r9=1 /\ 2:r9=1) *)
> (* (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0) *)
> (0:r9=0 /\ 1:r9=1 /\ 2:r9=1 /\ 1:r7=0)

i.e. we should rewrite this using READ_ONCE/WRITE_ONCE and smp_mb() etc.

> 
> 
> commit 2cb4e83a1b5c89c8e39b8a64bd89269d05913e41
> Author: Paul E. McKenney 
> Date:   Fri Jan 15 09:30:42 2016 -0800
> 
> documentation: Distinguish between local and global transitivity
> 
> The introduction of smp_load_acquire() and smp_store_release() had
> the side effect of introducing a weaker notion of transitivity:
> The transitivity of full smp_mb() barriers is global, but that
> of smp_store_release()/smp_load_acquire() chains is local.  This
> commit therefore introduces the notion of local transitivity and
> gives an example.
> 
> Reported-by: Peter Zijlstra 
> Reported-by: Will Deacon 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index c66ba46d8079..d8109ed99342 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1318,8 +1318,82 @@ or a level of cache, CPU 2 might have early access to 
> CPU 1's writes.
>  General barriers are therefore required to ensure that all CPUs agree
>  on the combined order of CPU 1's and CPU 2's accesses.
>  
> -To reiterate, if your code requires transitivity, use general barriers
> -throughout.
> +General barriers provide "global transitivity", so that all CPUs will
> +agree on the order of operations.  In contrast, a chain of release-acquire
> +pairs provides only "local transitivity", so that only those CPUs on
> +the chain are guaranteed to agree on the combined order of the accesses.

Thanks for having a go at this. I tried defining something axiomatically,
but got stuck pretty quickly. In my scheme, I used "data-directed
transitivity" instead of "local transitivity", since the latter seems to
be a bit of a misnomer.

> +For example, switching to C code in deference to Herman Hollerith:
> +
> + int u, v, x, y, z;
> +
> + void cpu0(void)
> + {
> + r0 = smp_load_acquire();
> + WRITE_ONCE(u, 1);
> + smp_store_release(, 1);
> + }
> +
> + void cpu1(void)
> + {
> + r1 = smp_load_acquire();
> + r4 = READ_ONCE(v);
> + r5 = READ_ONCE(u);
> + smp_store_release(, 1);
> + }
> +
> + void cpu2(void)
> + {
> + r2 = smp_load_acquire();
> + smp_store_release(, 1);
> + }
> +
> + void cpu3(void)
> + {
> + WRITE_ONCE(v, 1);
> + smp_mb();
> +