Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-26 Thread Boqun Feng
On Mon, Oct 26, 2015 at 02:20:21PM +1100, Paul Mackerras wrote:
> On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > > I am not seeing a sync there, but I really have to defer to the
> > > maintainers on this one.  I could easily have missed one.
> > 
> > So x86 implies a full barrier for everything that changes the CPL; and
> > some form of implied ordering seems a must if you change the privilege
> > level unless you tag every single load/store with the priv level at that
> > time, which seems the more expensive option.
> > 
> > So I suspect the typical implementation will flush all load/stores,
> > change the effective priv level and continue.
> > 
> > This can of course be implemented at a pure per CPU ordering (RCpc),
> > which would be in line with the rest of Power, in which case you do
> > indeed need an explicit sync to make it visible to other CPUs.
> 
> Right - interrupts and returns from interrupt are context
> synchronizing operations, which means they wait until all outstanding
> instructions have got to the point where they have reported any
> exceptions they're going to report, which means in turn that loads and
> stores have completed address translation.  But all of that doesn't
> imply anything about the visibility of the loads and stores.
> 
> There is a full barrier in the context switch path, but not in the
> system call entry/exit path.
> 

Thank you, Paul. That's much clearer now ;-)

Regards,
Boqun




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-26 Thread Boqun Feng
On Mon, Oct 26, 2015 at 11:20:01AM +0900, Michael Ellerman wrote:
> 
> Sorry guys, these threads are so long I tend not to read them very actively :}
> 
> Looking at the system call path, the straight line path does not include any
> barriers. I can't see any hidden in macros either.
> 
> We also have an explicit sync in the switch_to() path, which suggests that we
> know system call is not a full barrier.
> 
> Also looking at the architecture, section 1.5 which talks about the
> synchronisation that occurs on system calls, defines nothing in terms of
> memory ordering, and includes a programming note which says "Unlike the
> Synchronize instruction, a context synchronizing operation does not affect the
> order in which storage accesses are performed.".
> 

Thank you, Michael. So IIUC, "sc" and "rfid" just imply an execution
barrier like "isync" rather than a memory barrier. So memory barriers
are needed if a system call needs a memory ordering guarantee.
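
As a concrete illustration, here is a simplified sketch of the lwarx/stwcx.
pattern the patch is concerned with, with explicit "sync" instructions
standing in for what PPC_ATOMIC_ENTRY_BARRIER/PPC_ATOMIC_EXIT_BARRIER expand
to when they are "sync". The function name is made up and this is a sketch,
not the kernel's exact code:

static inline unsigned int
cmpxchg_u32_fully_ordered(unsigned int *p, unsigned int old,
			  unsigned int newval)
{
	unsigned int prev;

	__asm__ __volatile__(
	"	sync			\n"	/* full barrier before the RMW */
	"1:	lwarx	%0,0,%2		\n"	/* load-reserve *p */
	"	cmpw	0,%0,%3		\n"
	"	bne-	2f		\n"	/* compare failed, skip the store */
	"	stwcx.	%4,0,%2		\n"	/* store-conditional newval */
	"	bne-	1b		\n"	/* lost the reservation, retry */
	"	sync			\n"	/* full barrier after a successful RMW */
	"2:"
	: "=&r" (prev), "+m" (*p)
	: "r" (p), "r" (old), "r" (newval)
	: "cc", "memory");

	return prev;
}

Since the sc/rfid boundary itself contributes no memory ordering, whatever
ordering *cmpxchg provides has to come from explicit barriers like the two
sync instructions above.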

Regards,
Boqun

> Whether that's actually how it's implemented I don't know, I'll see if I can
> find out.
> 
> cheers
> 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-25 Thread Paul Mackerras
On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > I am not seeing a sync there, but I really have to defer to the
> > maintainers on this one.  I could easily have missed one.
> 
> So x86 implies a full barrier for everything that changes the CPL; and
> some form of implied ordering seems a must if you change the privilege
> level unless you tag every single load/store with the priv level at that
> time, which seems the more expensive option.
> 
> So I suspect the typical implementation will flush all load/stores,
> change the effective priv level and continue.
> 
> This can of course be implemented at a pure per CPU ordering (RCpc),
> which would be in line with the rest of Power, in which case you do
> indeed need an explicit sync to make it visible to other CPUs.

Right - interrupts and returns from interrupt are context
synchronizing operations, which means they wait until all outstanding
instructions have got to the point where they have reported any
exceptions they're going to report, which means in turn that loads and
stores have completed address translation.  But all of that doesn't
imply anything about the visibility of the loads and stores.

There is a full barrier in the context switch path, but not in the
system call entry/exit path.

Paul.


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-25 Thread Michael Ellerman
On Wed, 2015-10-21 at 12:36 -0700, Paul E. McKenney wrote:

> On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:

> > On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:

> > > I am not seeing a sync there, but I really have to defer to the
> > > maintainers on this one.  I could easily have missed one.
> > 
> > So x86 implies a full barrier for everything that changes the CPL; and
> > some form of implied ordering seems a must if you change the privilege
> > level unless you tag every single load/store with the priv level at that
> > time, which seems the more expensive option.
> 
> And it is entirely possible that there is some similar operation
> somewhere in the powerpc entry/exit code.  I would not trust myself
> to recognize it, though.

> > So I suspect the typical implementation will flush all load/stores,
> > change the effective priv level and continue.
> > 
> > This can of course be implemented at a pure per CPU ordering (RCpc),
> > which would be in line with the rest of Power, in which case you do
> > indeed need an explicit sync to make it visible to other CPUs.
> > 
> > But yes, if Michael or Ben could clarify this it would be good.
> 
> :-) ;-) ;-)

Sorry guys, these threads are so long I tend not to read them very actively :}

Looking at the system call path, the straight line path does not include any
barriers. I can't see any hidden in macros either.

We also have an explicit sync in the switch_to() path, which suggests that we
know system call is not a full barrier.

Also looking at the architecture, section 1.5 which talks about the
synchronisation that occurs on system calls, defines nothing in terms of
memory ordering, and includes a programming note which says "Unlike the
Synchronize instruction, a context synchronizing operation does not affect the
order in which storage accesses are performed.".

Whether that's actually how it's implemented I don't know, I'll see if I can
find out.

cheers



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-25 Thread Boqun Feng
On Wed, Oct 21, 2015 at 12:36:38PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > > I am not seeing a sync there, but I really have to defer to the
> > > maintainers on this one.  I could easily have missed one.
> > 
> > So x86 implies a full barrier for everything that changes the CPL; and
> > some form of implied ordering seems a must if you change the privilege
> > level unless you tag every single load/store with the priv level at that
> > time, which seems the more expensive option.
> 
> And it is entirely possible that there is some similar operation
> somewhere in the powerpc entry/exit code.  I would not trust myself
> to recognize it, though.
> 
> > So I suspect the typical implementation will flush all load/stores,
> > change the effective priv level and continue.
> > 
> > This can of course be implemented at a pure per CPU ordering (RCpc),
> > which would be in line with the rest of Power, in which case you do
> > indeed need an explicit sync to make it visible to other CPUs.
> > 
> > But yes, if Michael or Ben could clarify this it would be good.
> > 

Michael and Ben, ping for this, thank you ;-)

Regards,
Boqun

> > Back then I talked to Ralf about what MIPS says on this, and MIPS arch
> > spec is entirely quiet on this, it allows implementations full freedom
> > IIRC.
> 
> :-) ;-) ;-)
> 
> > 
> 
>   Thanx, Paul
> 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-25 Thread Boqun Feng
On Sat, Oct 24, 2015 at 07:53:56PM +0800, Boqun Feng wrote:
> On Sat, Oct 24, 2015 at 12:26:27PM +0200, Peter Zijlstra wrote:
> > 
> > Right, futexes are a pain; and I think we all agreed we didn't want to
> > go rely on implementation details unless we absolutely _have_ to.
> > 
> 
> Agreed.
> 
> Besides, after I have read why futex_wake_op(the caller of
> futex_atomic_op_inuser()) is introduced, I think your worries are quite
> reasonable. I thought the futex_atomic_op_inuser() only operated on
> futex related variables, but it turns out it can actually operate any
> userspace variable if userspace code likes, therefore we don't have
> control of all memory ordering guarantee of the variable. So if PPC
> doesn't provide a full barrier at user<->kernel boundries, we should
> make futex_atomic_op_inuser() fully ordered.
> 
> 
> Still looking into futex_atomic_cmpxchg_inatomic() ...
> 

I thought that the futex-related variables (the userspace variables that the
first parameter of the futex system call points to) were only accessed
through the futex system call from userspace, but that turns out not to be
the case. So the memory ordering guarantees for these variables are also out
of the kernel's control. Therefore we should make
futex_atomic_cmpxchg_inatomic() fully ordered, assuming, of course, that PPC
doesn't provide a full barrier at user<->kernel boundaries.
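
For example, a userspace futex-based lock touches the futex word directly
with its own atomics and only falls back to futex(2) on contention. A
simplified sketch in the style of the classic futex mutex (futex_lock,
futex_unlock and lock_word are illustrative names, not real code):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

static atomic_uint lock_word;	/* 0 = free, 1 = locked, 2 = contended */

static void futex_lock(void)
{
	unsigned int expected = 0;

	/* Fast path: a userspace CAS on the futex word, no kernel involved. */
	if (atomic_compare_exchange_strong(&lock_word, &expected, 1))
		return;

	/* Slow path: mark the lock contended and sleep in the kernel. */
	while (atomic_exchange(&lock_word, 2) != 0)
		syscall(SYS_futex, &lock_word, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void futex_unlock(void)
{
	if (atomic_exchange(&lock_word, 0) == 2)
		syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}

The kernel's futex_atomic_cmpxchg_inatomic() and the userspace atomics above
operate on the same word, so the kernel alone cannot arrange whatever
ordering that word needs.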

Regards
Boqun





Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-24 Thread Boqun Feng
On Sat, Oct 24, 2015 at 12:26:27PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2015 at 08:07:16PM +0800, Boqun Feng wrote:
> > On Wed, Oct 21, 2015 at 09:48:25PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > > > > 
> > > > > > > > https://lkml.org/lkml/2015/8/26/596
> > > 
> > > > > So a full barrier on one side of these operations is enough, I think.
> > > > > IOW, there is no need to strengthen these operations.
> > > > 
> > > > Do we need to also worry about other futex use cases?
> > > 
> > > Worry, always!
> > > 
> > > But yes, there is one more specific usecase, which is that of a
> > > condition variable.
> > > 
> > > When we go sleep on a futex, we might want to assume visibility of the
> > > stores done by the thread that woke us by the time we wake up.
> > > 
> > 
> > But the thing is futex atomics in PPC are already RELEASE(pc)+ACQUIRE
> > and imply a full barrier, is an RELEASE(sc) semantics really needed
> > here?
> 
> For this, no, the current code should be fine I think.
> 
> > Further more, is this condition variable visibility guaranteed by other
> > part of futex? Because in futex_wake_op:
> > 
> > futex_wake_op()
> >   ...
> >   double_unlock_hb(hb1, hb2);  <- RELEASE(pc) barrier here.
> >   wake_up_q(&wake_q);
> > 
> > and in futex_wait():
> > 
> > futex_wait()
> >   ...
> >   futex_wait_queue_me(hb, &q, to); <- schedule() here
> >   ...
> >   unqueue_me()
> >     drop_futex_key_refs(&q->key);
> >       iput()/mmdrop(); <- a full barrier
> >   
> > 
> > The RELEASE(pc) barrier pairs with the full barrier, therefore the
> > userspace wakee can observe the condition variable modification.
> 
> Right, futexes are a pain; and I think we all agreed we didn't want to
> go rely on implementation details unless we absolutely _have_ to.
> 

Agreed.

Besides, after reading why futex_wake_op() (the caller of
futex_atomic_op_inuser()) was introduced, I think your worries are quite
reasonable. I thought futex_atomic_op_inuser() only operated on
futex-related variables, but it turns out it can actually operate on any
userspace variable the userspace code likes, so we don't control all the
memory ordering guarantees for that variable. So if PPC doesn't provide a
full barrier at user<->kernel boundaries, we should make
futex_atomic_op_inuser() fully ordered.
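
For reference, a sketch of the kind of call that makes this possible:
FUTEX_WAKE_OP lets the caller name an arbitrary second userspace word
(uaddr2) on which the kernel performs the encoded atomic operation via
futex_atomic_op_inuser(). The variables and function name here are purely
illustrative:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

static uint32_t futex_word;	/* waiters sleep on this word... */
static uint32_t any_user_word;	/* ...but the encoded op is applied to this one */

static void wake_and_op(void)
{
	/*
	 * Atomically set any_user_word to 1, wake one waiter on futex_word,
	 * and wake one waiter on any_user_word if its old value was > 0.
	 * any_user_word need not otherwise be known to the kernel at all.
	 */
	syscall(SYS_futex, &futex_word, FUTEX_WAKE_OP, 1,
		(void *)(unsigned long)1,	/* val2: waiters to wake on uaddr2 */
		&any_user_word,
		FUTEX_OP(FUTEX_OP_SET, 1, FUTEX_OP_CMP_GT, 0));
}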


Still looking into futex_atomic_cmpxchg_inatomic() ...

> > > And.. aside from the thoughts I outlined in the email referenced above,
> > > there is always the chance people accidentally rely on the strong
> > > ordering on their x86 CPU and find things come apart when ran on their
> > > ARM/MIPS/etc..
> > > 
> > > There are a fair number of people who use the raw futex call and we have
> > > 0 visibility into many of them. The assumed and accidental ordering
> > > guarantees will forever remain a mystery.
> > > 
> > 
> > Understood. That's truely a potential problem. Considering not all the
> > architectures imply a full barrier at user<->kernel boundries, maybe we
> > can use one bit in the opcode of the futex system call to indicate
> > whether userspace treats futex as fully ordered. Like:
> > 
> > #define FUTEX_ORDER_SEQ_CST  0
> > #define FUTEX_ORDER_RELAXED  64 (bit 7 and bit 8 are already used)
> 
> Not unless there's an actual performance problem with any of this.
> Futexes are painful enough as is.

Makes sense, and we still have options like modifying the userspace code
if there actually turns out to be a performance problem ;-)

Regards,
Boqun




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-24 Thread Peter Zijlstra
On Thu, Oct 22, 2015 at 08:07:16PM +0800, Boqun Feng wrote:
> On Wed, Oct 21, 2015 at 09:48:25PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > > > 
> > > > > > > https://lkml.org/lkml/2015/8/26/596
> > 
> > > > So a full barrier on one side of these operations is enough, I think.
> > > > IOW, there is no need to strengthen these operations.
> > > 
> > > Do we need to also worry about other futex use cases?
> > 
> > Worry, always!
> > 
> > But yes, there is one more specific usecase, which is that of a
> > condition variable.
> > 
> > When we go sleep on a futex, we might want to assume visibility of the
> > stores done by the thread that woke us by the time we wake up.
> > 
> 
> But the thing is futex atomics in PPC are already RELEASE(pc)+ACQUIRE
> and imply a full barrier, is an RELEASE(sc) semantics really needed
> here?

For this, no, the current code should be fine I think.

> Further more, is this condition variable visibility guaranteed by other
> part of futex? Because in futex_wake_op:
> 
>   futex_wake_op()
>     ...
>     double_unlock_hb(hb1, hb2);  <- RELEASE(pc) barrier here.
>     wake_up_q(&wake_q);
> 
> and in futex_wait():
> 
>   futex_wait()
>     ...
>     futex_wait_queue_me(hb, &q, to); <- schedule() here
>     ...
>     unqueue_me()
>       drop_futex_key_refs(&q->key);
>         iput()/mmdrop(); <- a full barrier
> 
> 
> The RELEASE(pc) barrier pairs with the full barrier, therefore the
> userspace wakee can observe the condition variable modification.

Right, futexes are a pain; and I think we all agreed we didn't want to
go rely on implementation details unless we absolutely _have_ to.

> > And.. aside from the thoughts I outlined in the email referenced above,
> > there is always the chance people accidentally rely on the strong
> > ordering on their x86 CPU and find things come apart when ran on their
> > ARM/MIPS/etc..
> > 
> > There are a fair number of people who use the raw futex call and we have
> > 0 visibility into many of them. The assumed and accidental ordering
> > guarantees will forever remain a mystery.
> > 
> 
> Understood. That's truely a potential problem. Considering not all the
> architectures imply a full barrier at user<->kernel boundries, maybe we
> can use one bit in the opcode of the futex system call to indicate
> whether userspace treats futex as fully ordered. Like:
> 
> #define FUTEX_ORDER_SEQ_CST  0
> #define FUTEX_ORDER_RELAXED  64 (bit 7 and bit 8 are already used)

Not unless there's an actual performance problem with any of this.
Futexes are painful enough as is.



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-22 Thread Boqun Feng
On Wed, Oct 21, 2015 at 09:48:25PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > > 
> > > > > > https://lkml.org/lkml/2015/8/26/596
> 
> > > So a full barrier on one side of these operations is enough, I think.
> > > IOW, there is no need to strengthen these operations.
> > 
> > Do we need to also worry about other futex use cases?
> 
> Worry, always!
> 
> But yes, there is one more specific usecase, which is that of a
> condition variable.
> 
> When we go sleep on a futex, we might want to assume visibility of the
> stores done by the thread that woke us by the time we wake up.
> 

But the thing is that the futex atomics on PPC are already RELEASE(pc)+ACQUIRE
and imply a full barrier, so are RELEASE(sc) semantics really needed
here?

Furthermore, is this condition variable visibility guaranteed by other
parts of futex? Because in futex_wake_op():

futex_wake_op()
  ...
  double_unlock_hb(hb1, hb2);  <- RELEASE(pc) barrier here.
  wake_up_q(&wake_q);

and in futex_wait():

futex_wait()
  ...
  futex_wait_queue_me(hb, &q, to); <- schedule() here
  ...
  unqueue_me()
    drop_futex_key_refs(&q->key);
      iput()/mmdrop(); <- a full barrier
  

The RELEASE(pc) barrier pairs with the full barrier, therefore the
userspace wakee can observe the condition variable modification.

> 
> 
> And.. aside from the thoughts I outlined in the email referenced above,
> there is always the chance people accidentally rely on the strong
> ordering on their x86 CPU and find things come apart when ran on their
> ARM/MIPS/etc..
> 
> There are a fair number of people who use the raw futex call and we have
> 0 visibility into many of them. The assumed and accidental ordering
> guarantees will forever remain a mystery.
> 

Understood. That's truly a potential problem. Considering that not all
architectures imply a full barrier at user<->kernel boundaries, maybe we
can use one bit in the opcode of the futex system call to indicate
whether userspace treats futex as fully ordered. Like:

#define FUTEX_ORDER_SEQ_CST  0
#define FUTEX_ORDER_RELAXED  64 (bit 7 and bit 8 are already used)

Therefore all existing code will run with a strongly ordered version of
futex (of course, we would need to check and modify the kernel code first to
guarantee that), and if userspace code uses FUTEX_ORDER_RELAXED, it must
not rely on futex() for strong ordering and should add memory barriers
itself if necessary.
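
Purely as an illustration of this proposal (FUTEX_ORDER_RELAXED does not
exist today; the flag value and the function below are hypothetical):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <stdint.h>

#define FUTEX_ORDER_RELAXED	64	/* hypothetical flag, as proposed above */

static void futex_wake_one_relaxed(uint32_t *uaddr)
{
	/* With the relaxed flag, ordering would be the caller's job. */
	atomic_thread_fence(memory_order_seq_cst);
	syscall(SYS_futex, uaddr, FUTEX_WAKE | FUTEX_ORDER_RELAXED, 1,
		NULL, NULL, 0);
}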

Regards,
Boqun





Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Peter Zijlstra
On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > 
> > > > > https://lkml.org/lkml/2015/8/26/596

> > So a full barrier on one side of these operations is enough, I think.
> > IOW, there is no need to strengthen these operations.
> 
> Do we need to also worry about other futex use cases?

Worry, always!

But yes, there is one more specific usecase, which is that of a
condition variable.

When we go sleep on a futex, we might want to assume visibility of the
stores done by the thread that woke us by the time we wake up.



And.. aside from the thoughts I outlined in the email referenced above,
there is always the chance people accidentally rely on the strong
ordering on their x86 CPU and find things come apart when run on their
ARM/MIPS/etc..

There are a fair number of people who use the raw futex call and we have
0 visibility into many of them. The assumed and accidental ordering
guarantees will forever remain a mystery.
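
A sketch of that condition-variable usecase (illustrative only; the names
waker, sleeper, futex_word and message are made up): the waker publishes data
with a plain store and then wakes, and the woken side hopes the store is
visible:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t futex_word;	/* 0 = not signalled, 1 = signalled */
static int message;			/* data the waker publishes */

static void waker(void)
{
	message = 42;					/* plain store */
	atomic_store_explicit(&futex_word, 1, memory_order_relaxed);
	syscall(SYS_futex, &futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void sleeper(void)
{
	while (atomic_load_explicit(&futex_word, memory_order_relaxed) == 0)
		syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0, NULL, NULL, 0);

	/*
	 * The hope is that message == 42 is visible here; with only relaxed
	 * ordering in userspace, that relies entirely on whatever barriers
	 * the kernel's sleep/wake path happens to provide.
	 */
}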



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Paul E. McKenney
On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > I am not seeing a sync there, but I really have to defer to the
> > maintainers on this one.  I could easily have missed one.
> 
> So x86 implies a full barrier for everything that changes the CPL; and
> some form of implied ordering seems a must if you change the privilege
> level unless you tag every single load/store with the priv level at that
> time, which seems the more expensive option.

And it is entirely possible that there is some similar operation
somewhere in the powerpc entry/exit code.  I would not trust myself
to recognize it, though.

> So I suspect the typical implementation will flush all load/stores,
> change the effective priv level and continue.
> 
> This can of course be implemented at a pure per CPU ordering (RCpc),
> which would be in line with the rest of Power, in which case you do
> indeed need an explicit sync to make it visible to other CPUs.
> 
> But yes, if Michael or Ben could clarify this it would be good.
> 
> Back then I talked to Ralf about what MIPS says on this, and MIPS arch
> spec is entirely quiet on this, it allows implementations full freedom
> IIRC.

:-) ;-) ;-)

> 

Thanx, Paul



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Paul E. McKenney
On Wed, Oct 21, 2015 at 04:45:03PM +0800, Boqun Feng wrote:
> On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 20, 2015 at 11:21:47AM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> > > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > Am I missing something here?  If not, it seems to me that you need
> > > > > the leading lwsync to instead be a sync.
> > > > > 
> > > > > Of course, if I am not missing something, then this applies also to the
> > > > > value-returning RMW atomic operations that you pulled this pattern from.
> > > > > If so, it would seem that I didn't think through all the possibilities
> > > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > > but not as the load/store itself.  :-/
> > > > > 
> > > > 
> > > > Paul, I know this may be difficult, but could you recall why the
> > > > __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> > > > involved into the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> > > > 
> > > > I did some search, but couldn't find the discussion of that patch.
> > > > 
> > > > I ask this because I recall Peter once bought up a discussion:
> > > > 
> > > > https://lkml.org/lkml/2015/8/26/596
> > > > 
> > > > Peter's conclusion seems to be that we could(though didn't want to) live
> > > > with futex atomics not being full barriers.
> > 
> > I have heard of user-level applications relying on unlock-lock being a
> > full barrier.  So paranoia would argue for the full barrier.
> 
> Understood.
> 
> So a full barrier on one side of these operations is enough, I think.
> IOW, there is no need to strengthen these operations.

Do we need to also worry about other futex use cases?

Thanx, Paul

> > > > Peter, just be clear, I'm not in favor of relaxing futex atomics. But if
> > > > I make PPC_ATOMIC_ENTRY_BARRIER being "sync", it will also strengthen
> > > > the futex atomics, just wonder whether such strengthen is a -fix- or
> > > > not, considering that I want this patch to go to -stable tree.
> > > 
> > > So Linus' argued that since we only need to order against user accesses
> > > (true) and priv changes typically imply strong barriers (open) we might
> > > want to allow archs to rely on those instead of mandating they have
> > > explicit barriers in the futex primitives.
> > > 
> > > And I indeed forgot to follow up on that discussion.
> > > 
> > > So; does PPC imply full barriers on user<->kernel boundaries? If so, its
> > > not critical to the futex atomic implementations what extra barriers are
> > > added.
> > > 
> > > If not; then strengthening the futex ops is indeed (probably) a good
> > > thing :-)
> 
> Peter, that's probably a good thing, but I'm not that familiar with
> futex right now, so I won't touch that part if unnecessary in this
> series.
> 
> Regards,
> Boqun
> 
> > 
> > I am not seeing a sync there, but I really have to defer to the
> > maintainers on this one.  I could easily have missed one.
> > 
> > Thanx, Paul
> > 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Boqun Feng
On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 20, 2015 at 11:21:47AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> > > > 
> > > > Of course, if I am not missing something, then this applies also to the
> > > > value-returning RMW atomic operations that you pulled this pattern from.
> > > > If so, it would seem that I didn't think through all the possibilities
> > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > but not as the load/store itself.  :-/
> > > > 
> > > 
> > > Paul, I know this may be difficult, but could you recall why the
> > > __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> > > involved into the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> > > 
> > > I did some search, but couldn't find the discussion of that patch.
> > > 
> > > I ask this because I recall Peter once bought up a discussion:
> > > 
> > > https://lkml.org/lkml/2015/8/26/596
> > > 
> > > Peter's conclusion seems to be that we could(though didn't want to) live
> > > with futex atomics not being full barriers.
> 
> I have heard of user-level applications relying on unlock-lock being a
> full barrier.  So paranoia would argue for the full barrier.
> 

Understood.

So a full barrier on one side of these operations is enough, I think.
IOW, there is no need to strengthen these operations.

> > > Peter, just be clear, I'm not in favor of relaxing futex atomics. But if
> > > I make PPC_ATOMIC_ENTRY_BARRIER being "sync", it will also strengthen
> > > the futex atomics, just wonder whether such strengthen is a -fix- or
> > > not, considering that I want this patch to go to -stable tree.
> > 
> > So Linus' argued that since we only need to order against user accesses
> > (true) and priv changes typically imply strong barriers (open) we might
> > want to allow archs to rely on those instead of mandating they have
> > explicit barriers in the futex primitives.
> > 
> > And I indeed forgot to follow up on that discussion.
> > 
> > So; does PPC imply full barriers on user<->kernel boundaries? If so, its
> > not critical to the futex atomic implementations what extra barriers are
> > added.
> > 
> > If not; then strengthening the futex ops is indeed (probably) a good
> > thing :-)

Peter, that's probably a good thing, but I'm not that familiar with
futex right now, so I won't touch that part in this series unless it's
necessary.

Regards,
Boqun

> 
> I am not seeing a sync there, but I really have to defer to the
> maintainers on this one.  I could easily have missed one.
> 
>   Thanx, Paul
> 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Peter Zijlstra
On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> I am not seeing a sync there, but I really have to defer to the
> maintainers on this one.  I could easily have missed one.

So x86 implies a full barrier for everything that changes the CPL; and
some form of implied ordering seems a must if you change the privilege
level unless you tag every single load/store with the priv level at that
time, which seems the more expensive option.

So I suspect the typical implementation will flush all load/stores,
change the effective priv level and continue.

This can of course be implemented at a pure per CPU ordering (RCpc),
which would be in line with the rest of Power, in which case you do
indeed need an explicit sync to make it visible to other CPUs.

But yes, if Michael or Ben could clarify this it would be good.

Back then I talked to Ralf about what MIPS says on this, and MIPS arch
spec is entirely quiet on this, it allows implementations full freedom
IIRC.




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Paul E. McKenney
On Wed, Oct 21, 2015 at 04:45:03PM +0800, Boqun Feng wrote:
> On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 20, 2015 at 11:21:47AM +0200, Peter Zijlstra wrote:
> > > On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> > > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > Am I missing something here?  If not, it seems to me that you need
> > > > > the leading lwsync to instead be a sync.
> > > > > 
> > > > > Of course, if I am not missing something, then this applies also to 
> > > > > the
> > > > > value-returning RMW atomic operations that you pulled this pattern 
> > > > > from.
> > > > > If so, it would seem that I didn't think through all the possibilities
> > > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > > but not as the load/store itself.  :-/
> > > > > 
> > > > 
> > > > Paul, I know this may be difficult, but could you recall why the
> > > > __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> > > > involved into the movement of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> > > > 
> > > > I did some search, but couldn't find the discussion of that patch.
> > > > 
> > > > I ask this because I recall Peter once bought up a discussion:
> > > > 
> > > > https://lkml.org/lkml/2015/8/26/596
> > > > 
> > > > Peter's conclusion seems to be that we could(though didn't want to) live
> > > > with futex atomics not being full barriers.
> > 
> > I have heard of user-level applications relying on unlock-lock being a
> > full barrier.  So paranoia would argue for the full barrier.
> 
> Understood.
> 
> So a full barrier on one side of these operations is enough, I think.
> IOW, there is no need to strengthen these operations.

Do we need to also worry about other futex use cases?

Thanx, Paul

> > > > Peter, just be clear, I'm not in favor of relaxing futex atomics. But if
> > > > I make PPC_ATOMIC_ENTRY_BARRIER being "sync", it will also strengthen
> > > > the futex atomics, just wonder whether such strengthen is a -fix- or
> > > > not, considering that I want this patch to go to -stable tree.
> > > 
> > > So Linus' argued that since we only need to order against user accesses
> > > (true) and priv changes typically imply strong barriers (open) we might
> > > want to allow archs to rely on those instead of mandating they have
> > > explicit barriers in the futex primitives.
> > > 
> > > And I indeed forgot to follow up on that discussion.
> > > 
> > > So; does PPC imply full barriers on user<->kernel boundaries? If so, its
> > > not critical to the futex atomic implementations what extra barriers are
> > > added.
> > > 
> > > If not; then strengthening the futex ops is indeed (probably) a good
> > > thing :-)
> 
> Peter, that's probably a good thing, but I'm not that familiar with
> futex right now, so I won't touch that part if unnecessary in this
> series.
> 
> Regards,
> Boqun
> 
> > 
> > I am not seeing a sync there, but I really have to defer to the
> > maintainers on this one.  I could easily have missed one.
> > 
> > Thanx, Paul
> > 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Peter Zijlstra
On Wed, Oct 21, 2015 at 12:35:23PM -0700, Paul E. McKenney wrote:
> > > > > I ask this because I recall Peter once bought up a discussion:
> > > > > 
> > > > > https://lkml.org/lkml/2015/8/26/596

> > So a full barrier on one side of these operations is enough, I think.
> > IOW, there is no need to strengthen these operations.
> 
> Do we need to also worry about other futex use cases?

Worry, always!

But yes, there is one more specific usecase, which is that of a
condition variable.

When we go sleep on a futex, we might want to assume visibility of the
stores done by the thread that woke us by the time we wake up.



And.. aside from the thoughts I outlined in the email referenced above,
there is always the chance people accidentally rely on the strong
ordering on their x86 CPU and find things come apart when ran on their
ARM/MIPS/etc..

There are a fair number of people who use the raw futex call and we have
0 visibility into many of them. The assumed and accidental ordering
guarantees will forever remain a mystery.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Paul E. McKenney
On Wed, Oct 21, 2015 at 10:18:33AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> > I am not seeing a sync there, but I really have to defer to the
> > maintainers on this one.  I could easily have missed one.
> 
> So x86 implies a full barrier for everything that changes the CPL; and
> some form of implied ordering seems a must if you change the privilege
> level unless you tag every single load/store with the priv level at that
> time, which seems the more expensive option.

And it is entirely possible that there is some similar operation
somewhere in the powerpc entry/exit code.  I would not trust myself
to recognize it, though.

> So I suspect the typical implementation will flush all load/stores,
> change the effective priv level and continue.
> 
> This can of course be implemented at a pure per CPU ordering (RCpc),
> which would be in line with the rest of Power, in which case you do
> indeed need an explicit sync to make it visible to other CPUs.
> 
> But yes, if Michael or Ben could clarify this it would be good.
> 
> Back then I talked to Ralf about what MIPS says on this, and MIPS arch
> spec is entirely quiet on this, it allows implementations full freedom
> IIRC.

:-) ;-) ;-)

> 

Thanx, Paul



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Peter Zijlstra
On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> I am not seeing a sync there, but I really have to defer to the
> maintainers on this one.  I could easily have missed one.

So x86 implies a full barrier for everything that changes the CPL; and
some form of implied ordering seems a must if you change the privilege
level unless you tag every single load/store with the priv level at that
time, which seems the more expensive option.

So I suspect the typical implementation will flush all load/stores,
change the effective priv level and continue.

This can of course be implemented at a pure per CPU ordering (RCpc),
which would be in line with the rest of Power, in which case you do
indeed need an explicit sync to make it visible to other CPUs.

But yes, if Michael or Ben could clarify this it would be good.

Back then I talked to Ralf about what MIPS says on this, and MIPS arch
spec is entirely quiet on this; it allows implementations full freedom,
IIRC.




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-21 Thread Boqun Feng
On Tue, Oct 20, 2015 at 02:28:35PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 20, 2015 at 11:21:47AM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> > > > 
> > > > Of course, if I am not missing something, then this applies also to the
> > > > value-returning RMW atomic operations that you pulled this pattern from.
> > > > If so, it would seem that I didn't think through all the possibilities
> > > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > > that I worried about the RMW atomic operation acting as a barrier,
> > > > but not as the load/store itself.  :-/
> > > > 
> > > 
> > > Paul, I know this may be difficult, but could you recall why the
> > > __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> > > involved in the move of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> > > 
> > > I did some search, but couldn't find the discussion of that patch.
> > > 
> > > I ask this because I recall Peter once brought up a discussion:
> > > 
> > > https://lkml.org/lkml/2015/8/26/596
> > > 
> > > Peter's conclusion seems to be that we could (though didn't want to) live
> > > with futex atomics not being full barriers.
> 
> I have heard of user-level applications relying on unlock-lock being a
> full barrier.  So paranoia would argue for the full barrier.
> 

Understood.

So a full barrier on one side of these operations is enough, I think.
IOW, there is no need to strengthen these operations.

> > > Peter, just to be clear, I'm not in favor of relaxing futex atomics. But if
> > > I make PPC_ATOMIC_ENTRY_BARRIER "sync", it will also strengthen
> > > the futex atomics; I just wonder whether such strengthening is a -fix- or
> > > not, considering that I want this patch to go to the -stable tree.
> > 
> > So Linus argued that since we only need to order against user accesses
> > (true) and priv changes typically imply strong barriers (open) we might
> > want to allow archs to rely on those instead of mandating they have
> > explicit barriers in the futex primitives.
> > 
> > And I indeed forgot to follow up on that discussion.
> > 
> > So; does PPC imply full barriers on user<->kernel boundaries? If so, it's
> > not critical to the futex atomic implementations what extra barriers are
> > added.
> > 
> > If not; then strengthening the futex ops is indeed (probably) a good
> > thing :-)

Peter, that's probably a good thing, but I'm not that familiar with
futex right now, so I won't touch that part in this series unless
necessary.

Regards,
Boqun

> 
> I am not seeing a sync there, but I really have to defer to the
> maintainers on this one.  I could easily have missed one.
> 
>   Thanx, Paul
> 


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-20 Thread Paul E. McKenney
On Tue, Oct 20, 2015 at 11:21:47AM +0200, Peter Zijlstra wrote:
> On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > 
> > > Am I missing something here?  If not, it seems to me that you need
> > > the leading lwsync to instead be a sync.
> > > 
> > > Of course, if I am not missing something, then this applies also to the
> > > value-returning RMW atomic operations that you pulled this pattern from.
> > > If so, it would seem that I didn't think through all the possibilities
> > > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > > that I worried about the RMW atomic operation acting as a barrier,
> > > but not as the load/store itself.  :-/
> > > 
> > 
> > Paul, I know this may be difficult, but could you recall why the
> > __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> > involved in the move of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> > 
> > I did some search, but couldn't find the discussion of that patch.
> > 
> > I ask this because I recall Peter once brought up a discussion:
> > 
> > https://lkml.org/lkml/2015/8/26/596
> > 
> > Peter's conclusion seems to be that we could (though didn't want to) live
> > with futex atomics not being full barriers.

I have heard of user-level applications relying on unlock-lock being a
full barrier.  So paranoia would argue for the full barrier.

> > Peter, just to be clear, I'm not in favor of relaxing futex atomics. But if
> > I make PPC_ATOMIC_ENTRY_BARRIER "sync", it will also strengthen
> > the futex atomics; I just wonder whether such strengthening is a -fix- or
> > not, considering that I want this patch to go to the -stable tree.
> 
> So Linus argued that since we only need to order against user accesses
> (true) and priv changes typically imply strong barriers (open) we might
> want to allow archs to rely on those instead of mandating they have
> explicit barriers in the futex primitives.
> 
> And I indeed forgot to follow up on that discussion.
> 
> So; does PPC imply full barriers on user<->kernel boundaries? If so, it's
> not critical to the futex atomic implementations what extra barriers are
> added.
> 
> If not; then strengthening the futex ops is indeed (probably) a good
> thing :-)

I am not seeing a sync there, but I really have to defer to the
maintainers on this one.  I could easily have missed one.

Thanx, Paul



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-20 Thread Peter Zijlstra
On Tue, Oct 20, 2015 at 03:15:32PM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > 
> > Am I missing something here?  If not, it seems to me that you need
> > the leading lwsync to instead be a sync.
> > 
> > Of course, if I am not missing something, then this applies also to the
> > value-returning RMW atomic operations that you pulled this pattern from.
> > If so, it would seem that I didn't think through all the possibilities
> > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > that I worried about the RMW atomic operation acting as a barrier,
> > but not as the load/store itself.  :-/
> > 
> 
> Paul, I know this may be difficult, but could you recall why the
> __futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
> involved in the move of PPC_ATOMIC_EXIT_BARRIER to "sync"?
> 
> I did some search, but couldn't find the discussion of that patch.
> 
> I ask this because I recall Peter once brought up a discussion:
> 
> https://lkml.org/lkml/2015/8/26/596
> 
> Peter's conclusion seems to be that we could (though didn't want to) live
> with futex atomics not being full barriers.
> 
> 
> Peter, just to be clear, I'm not in favor of relaxing futex atomics. But if
> I make PPC_ATOMIC_ENTRY_BARRIER "sync", it will also strengthen
> the futex atomics; I just wonder whether such strengthening is a -fix- or
> not, considering that I want this patch to go to the -stable tree.

So Linus argued that since we only need to order against user accesses
(true) and priv changes typically imply strong barriers (open) we might
want to allow archs to rely on those instead of mandating they have
explicit barriers in the futex primitives.

And I indeed forgot to follow up on that discussion.

So; does PPC imply full barriers on user<->kernel boundaries? If so, it's
not critical to the futex atomic implementations what extra barriers are
added.

If not; then strengthening the futex ops is indeed (probably) a good
thing :-)


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-20 Thread Boqun Feng
On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.
> 
> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.
> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/
> 

Paul, I know this may be difficult, but could you recall why the
__futex_atomic_op() and futex_atomic_cmpxchg_inatomic() also got
involved in the move of PPC_ATOMIC_EXIT_BARRIER to "sync"?

I did some search, but couldn't find the discussion of that patch.

I ask this because I recall Peter once brought up a discussion:

https://lkml.org/lkml/2015/8/26/596

Peter's conclusion seems to be that we could (though didn't want to) live
with futex atomics not being full barriers.


Peter, just to be clear, I'm not in favor of relaxing futex atomics. But if
I make PPC_ATOMIC_ENTRY_BARRIER "sync", it will also strengthen
the futex atomics; I just wonder whether such strengthening is a -fix- or
not, considering that I want this patch to go to the -stable tree.


Of course, while waiting for your answer, I will try to
figure this out by myself ;-)

Regards,
Boqun


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-18 Thread Boqun Feng
On Thu, Oct 15, 2015 at 09:30:40AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 15, 2015 at 12:48:03PM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 08:07:05PM -0700, Paul E. McKenney wrote:
[snip]
> > 
> > > Why not try creating a longer litmus test that requires P0's write to
> > > "a" to propagate to P1 before both processes complete?
> > > 
> > 
> > I will try to write one, but to be clear, you mean we still observe 
> > 
> > 0:r3 == 0 && a == 2 && 1:r3 == 0 
> > 
> > at the end, right? Because I understand that if P1's write to 'a'
> > doesn't override P0's, P0's write to 'a' will propagate.
> 
> Your choice.  My question is whether you can come up with a similar
> litmus test where lwsync is allowing the behavior here, but clearly
> is affecting some other aspect of ordering.
> 

Got it, though my question about the propagation of P0's write to 'a'
was originally aimed at understanding the hardware behavior (or model) in
your sequence of events ;-)

To be clear, by "some other aspect of ordering", you mean something like
a paired RELEASE+ACQUIRE scenario (i.e. P1 observes P0's write to 'a' via
a load, which means P0's write to 'a' propagates at some point), right?

If so I haven't yet come up with one, and I think there's probably none,
so my worry about "lwsync" in other places is likely unnecessary.

Regards,
Boqun


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 12:48:03PM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 08:07:05PM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> [snip]
> > > 
> > > I'm afraid more than that, the above litmus also shows that
> > > 
> > >   CPU 0   CPU 1
> > >   -   -
> > > 
> > >   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > >   r3 = xchg_release(&a, 1);   smp_mb();
> > >   r3 = READ_ONCE(x);
> > > 
> > >   (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibited
> > > 
> > > in the implementation of this patchset, which should be disallowed by
> > > the semantics of RELEASE, right?
> > 
> > Not necessarily.  If you had the read first on CPU 1, and you had a
> > similar problem, I would be more worried.
> > 
> 
> Sometimes I think maybe we should say that a single unpaired ACQUIRE or
> RELEASE doesn't have any order guarantee because of the above case.
> 
> But it seems that's not a normal or even existing case, my bad ;-(
> 
> > > And even:
> > > 
> > >   CPU 0   CPU 1
> > >   -   -
> > > 
> > >   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > >   smp_store_release(&a, 1);   smp_mb();
> > >   r3 = READ_ONCE(x);
> > > 
> > >   (1:r3 == 0 && a == 2) is not prohibited
> > > 
> > > shows by:
> > > 
> > >   PPC weird-lwsync
> > >   ""
> > >   {
> > >   0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> > >   1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> > >   }
> > >P0 | P1 ;
> > >stw r1,0(r2)   | stw r1,0(r12)  ;
> > >lwsync | sync   ;
> > >stw  r1,0(r12) | lwz r3,0(r2)   ;
> > >   exists
> > >   (a=2 /\ 1:r3=0)
> > > 
> > > Please find something I'm (or the tool is) missing, maybe we can't use
> > > (a == 2) as an indication that STORE on CPU 1 happens after STORE on CPU
> > > 0?
> > 
> > Again, if you were pairing the smp_store_release() with an 
> > smp_load_acquire()
> > or even a READ_ONCE() followed by a barrier, I would be quite concerned.
> > I am not at all worried about the above two litmus tests.
> > 
> 
> Understood, thank you for thinking that through ;-)
> 
> > > And there is really something I find strange, see below.
> > > 
> > > > > 
> > > > > So the scenario that would fail would be this one, right?
> > > > > 
> > > > > a = x = 0
> > > > > 
> > > > >   CPU0            CPU1
> > > > > 
> > > > >   r3 = load_locked(&a);
> > > > >   a = 2;
> > > > >   sync();
> > > > >   r3 = x;
> > > > >   x = 1;
> > > > >   lwsync();
> > > > >   if (!store_cond(&a, 1))
> > > > >   goto again
> > > > > 
> > > > > 
> > > > > Where we hoist the load way up because lwsync allows this.
> > > > 
> > > > That scenario would end up with a==1 rather than a==2.
> > > > 
> > > > > I always thought this would fail because CPU1's store to @a would fail
> > > > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > > > load and now seeing the new value (2).
> > > > 
> > > > The stwcx. failure was one thing that prevented a number of other
> > > > misordering cases.  The problem is that we have to let go of the notion
> > > > of an implicit global clock.
> > > > 
> > > > To that end, the herd tool can make a diagram of what it thought
> > > > happened, and I have attached it.  I used this diagram to try and force
> > > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > > and succeeded.  Here is the sequence of events:
> > > > 
> > > > o   Commit P0's write.  The model offers to propagate this write
> > > > to the coherence point and to P1, but don't do so yet.
> > > > 
> > > > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > > > 
> > > > o   Commit P0's lwsync.
> > > > 
> > > > o   Execute P0's lwarx, which reads a=0.  Then commit it.
> > > > 
> > > > o   Commit P0's stwcx. as successful.  This stores a=1.
> > > > 
> > > > o   Commit P0's branch (not taken).
> > > 
> > > So at this point, P0's write to 'a' has propagated to P1, right? But
> > > P0's write to 'x' hasn't, even though there is an lwsync between them, right?
> > > Doesn't the lwsync prevent this from happening?
> > 
> > No, because lwsync is quite a bit weaker than sync aside from just
> > the store-load ordering.
> > 
> 
> Understood, I've tried ppcmem, and it's much clearer now ;-)
> 
> > > If at this point P0's write to 'a' hasn't propagated then when?
> > 
> > Later.  At the very end of the test, in this case.  ;-)
> > 
> 
> Hmm.. I tried exactly this sequence in ppcmem, and it seems propagation
> of P0's write to 'a' is never an option...
> 
> > Why not try creating a longer litmus test that requires P0's write to
> > "a" to 

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 03:50:44PM +0100, Will Deacon wrote:
> On Thu, Oct 15, 2015 at 11:35:10AM +0100, Will Deacon wrote:
> > Dammit guys, it's never simple is it?
> 
> I re-read this and it's even more confusing than I first thought.
> 
> > On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o Commit P0's write.  The model offers to propagate this write
> > >   to the coherence point and to P1, but don't do so yet.
> > > 
> > > o Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o Commit P0's lwsync.
> > > 
> > > o Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o Commit P0's stwcx. as successful.  This stores a=1.
> > 
> > On arm64, this is a conditional-store-*release* and therefore cannot be
> > observed before the initial write to x...
> > 
> > > o Commit P0's branch (not taken).
> > > 
> > > o Commit P0's final register-to-register move.
> > > 
> > > o Commit P1's sync instruction.
> > > 
> > > o There is now nothing that can happen in either processor.
> > >   P0 is done, and P1 is waiting for its sync.  Therefore,
> > >   propagate P1's a=2 write to the coherence point and to
> > >   the other thread.
> > 
> > ... therefore this is illegal, because you haven't yet propagated that
> > prior write...
> 
> I misread this as a propagation of P0's conditional store. What actually
> happens on arm64, is that the early conditional store can only succeed
> once it is placed into the coherence order of the location which it is
> updating (but note that this is subtly different from multi-copy
> atomicity!).
> 
> So, given that the previous conditional store succeeded, the coherence
> order on A must be either {0, 1, 2} or {0, 2, 1}.
> 
> If it's {0, 1, 2} (as required by your complete example), that means
> P1's a=2 write "observes" the conditional store by P0, and therefore
> (because the conditional store has release semantics), also observes
> P0's x=1 write.
> 
> On the other hand, if P1's a=2 write propagates first and we have a
> coherence order of {0, 2, 1}, then P0 must have r3=2, because an
> exclusive load returning zero would have led to a failed conditional
> store thanks to the intervening write by P1.
> 
> I find it pretty weird if PPC allows the conditional store to succeed in
> this way, as I think that would break simple cases like two threads
> incrementing a shared variable in parallel:
> 
> 
> ""
> {
> 0:r1=1; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> 1:r1=1; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> }
> P0 | P1 ;
> lwarx  r11,r10,r12 | lwarx  r11,r10,r12 ;
> add r11,r1,r11 | add r11,r1,r11 ;
> stwcx. r11,r10,r12 | stwcx. r11,r10,r12 ;
> bne Fail0  | bne Fail1  ;
> mr r3,r1   | mr r3,r1   ;
> Fail0: | Fail1: ;
> exists
> (0:r3=1 /\ a=1 /\ 1:r3=1)
> 
> 
> Is also allowed by herd and forbidden by ppcmem, for example.

I had no idea that either herd or ppcmem knew about the add instruction.
It looks like they at least try to understand.  Needless to say, in this
case I agree with ppcmem.

Thanx, Paul



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 10:49:23PM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:55:56PM +0800, Boqun Feng wrote:
> > > According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
> > > versions all need to imply a full barrier, however they are now just
> > > RELEASE+ACQUIRE, which is not a full barrier.
> > > 
> > > So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> > > PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> > > __{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
> > > semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().
> > > 
> > > This patch is a complement of commit b97021f85517 ("powerpc: Fix
> > > atomic_xxx_return barrier semantics").
> > > 
> > > Acked-by: Michael Ellerman 
> > > Cc:  # 3.4+
> > > Signed-off-by: Boqun Feng 
> > > ---
> > >  arch/powerpc/include/asm/cmpxchg.h | 16 
> > >  1 file changed, 8 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> > > b/arch/powerpc/include/asm/cmpxchg.h
> > > index ad6263c..d1a8d93 100644
> > > --- a/arch/powerpc/include/asm/cmpxchg.h
> > > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > > @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
> > >   unsigned long prev;
> > > 
> > >   __asm__ __volatile__(
> > > - PPC_RELEASE_BARRIER
> > > + PPC_ATOMIC_ENTRY_BARRIER
> > 
> > This looks to be the lwsync instruction.
> > 
> > >  "1:  lwarx   %0,0,%2 \n"
> > >   PPC405_ERR77(0,%2)
> > >  "stwcx.  %3,0,%2 \n\
> > >   bne-1b"
> > > - PPC_ACQUIRE_BARRIER
> > > + PPC_ATOMIC_EXIT_BARRIER
> > 
> > And this looks to be the sync instruction.
> > 
> > >   : "=&r" (prev), "+m" (*(volatile unsigned int *)p)
> > >   : "r" (p), "r" (val)
> > >   : "cc", "memory");
> > 
> > Hmmm...
> > 
> > Suppose we have something like the following, where "a" and "x" are both
> > initially zero:
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > r3 = xchg(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > If xchg() is fully ordered, we should never observe both CPUs'
> > r3 values being zero, correct?
> > 
> > And wouldn't this be represented by the following litmus test?
> > 
> > PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > ""
> > {
> > 0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > 1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > }
> >  P0 | P1 ;
> >  stw r1,0(r2)   | stw r1,0(r12)  ;
> >  lwsync | sync   ;
> >  lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> >  stwcx. r1,r10,r12  | ;
> >  bne Fail0  | ;
> >  mr r3,r11  | ;
> >  Fail0: | ;
> > exists
> > (0:r3=0 /\ a=2 /\ 1:r3=0)
> > 
> > I left off P0's trailing sync because there is nothing for it to order
> > against in this particular litmus test.  I tried adding it and verified
> > that it has no effect.
> > 
> > Am I missing something here?  If not, it seems to me that you need
> > the leading lwsync to instead be a sync.
> > 
> 
> If so, I will define PPC_ATOMIC_ENTRY_BARRIER as "sync" in the next
> version of this patch, any concern?
> 
> Of course, I will wait to do that until we all understand this is
> necessary and agree to make the change.

I am in favor, but I am not the maintainer.  ;-)

Thanx, Paul

> > Of course, if I am not missing something, then this applies also to the
> > value-returning RMW atomic operations that you pulled this pattern from.
> 
> For the value-returning RMW atomics, if the leading barrier necessarily
> has to be "sync", I will just remove my __atomic_op_fence() in
> patch 4, but I will keep patch 3 unchanged for the consistency of
> the __atomic_op_*() macros' definitions. Peter and Will, does that work for
> you both?
> 
> Regards,
> Boqun
> 
> > If so, it would seem that I didn't think through all the possibilities
> > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > that I worried about the RMW atomic operation acting as a barrier,
> > but not as the load/store itself.  :-/
> > 
> > Thanx, Paul
> > 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 11:35:44AM +0100, Will Deacon wrote:
> Dammit guys, it's never simple is it?
> 
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> > 
> > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o   Commit P0's lwsync.
> > 
> > o   Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o   Commit P0's stwcx. as successful.  This stores a=1.
> 
> On arm64, this is a conditional-store-*release* and therefore cannot be
> observed before the initial write to x...
> 
> > o   Commit P0's branch (not taken).
> > 
> > o   Commit P0's final register-to-register move.
> > 
> > o   Commit P1's sync instruction.
> > 
> > o   There is now nothing that can happen in either processor.
> > P0 is done, and P1 is waiting for its sync.  Therefore,
> > propagate P1's a=2 write to the coherence point and to
> > the other thread.
> 
> ... therefore this is illegal, because you haven't yet propagated that
> prior write...

OK.  Power distinguishes between propagating to the coherence point
and to each of the other CPUs.

> > o   There is still nothing that can happen in either processor.
> > So pick the barrier propagate, then the acknowledge sync.
> > 
> > o   P1 can now execute its read from x.  Because P0's write to
> > x is still waiting to propagate to P1, this still reads
> > x=0.  Execute and commit, and we now have both r3 registers
> > equal to zero and the final value a=2.
> 
> ... and P1 would have to read x == 1.

Good!  Do ARMMEM and herd agree with you?

> So arm64 is ok. Doesn't lwsync order store->store observability for PPC?

Yes.  But this is not store->store observability, but rather store->load
visibility.  Furthermore, as I understand it, lwsync controls the
visibility to other CPUs, but not necessarily the coherence order.

Let's look at the example C code again:

CPU 0   CPU 1
-   -

WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
r3 = xchg(&a, 1);   smp_mb();
r3 = READ_ONCE(x);

The problem is that we are applying intuitions obtained from a
release-acquire chain, which hands off from stores to loads.  In contrast,
this example is quite weird in that we have a store handing off to another
store, but with reads also involved.  Making that work on Power requires
full memory barriers on both sides.  Intuitively, the coherence order
can be established after the fact as long as all readers see a consistent
set of values based on the subset of the sequence that each reader sees.

Anyway, it looks like Power does need a sync before and after for
value-returning atomics.  That certainly simplifies the analysis.
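
To make that concrete, here is a sketch of __xchg_u32 with a full sync on
both sides (only an illustration of the conclusion above, not the actual
patch; PPC405_ERR77 and the 64-bit variant are omitted):

static inline unsigned long
__xchg_u32_fully_ordered(volatile void *p, unsigned long val)
{
        unsigned long prev;

        __asm__ __volatile__(
"       sync                    \n"     /* full barrier before the RMW */
"1:     lwarx   %0,0,%2         \n"     /* load-reserve the old value */
"       stwcx.  %3,0,%2         \n"     /* try to store the new value */
"       bne-    1b              \n"     /* retry if the reservation was lost */
"       sync                    \n"     /* full barrier after the RMW */
        : "=&r" (prev), "+m" (*(volatile unsigned int *)p)
        : "r" (p), "r" (val)
        : "cc", "memory");

        return prev;
}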

Thanx, Paul



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Will Deacon
On Thu, Oct 15, 2015 at 11:35:10AM +0100, Will Deacon wrote:
> Dammit guys, it's never simple is it?

I re-read this and it's even more confusing than I first thought.

> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> > 
> > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o   Commit P0's lwsync.
> > 
> > o   Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o   Commit P0's stwcx. as successful.  This stores a=1.
> 
> On arm64, this is a conditional-store-*release* and therefore cannot be
> observed before the initial write to x...
> 
> > o   Commit P0's branch (not taken).
> > 
> > o   Commit P0's final register-to-register move.
> > 
> > o   Commit P1's sync instruction.
> > 
> > o   There is now nothing that can happen in either processor.
> > P0 is done, and P1 is waiting for its sync.  Therefore,
> > propagate P1's a=2 write to the coherence point and to
> > the other thread.
> 
> ... therefore this is illegal, because you haven't yet propagated that
> prior write...

I misread this as a propagation of P0's conditional store. What actually
happens on arm64, is that the early conditional store can only succeed
once it is placed into the coherence order of the location which it is
updating (but note that this is subtly different from multi-copy
atomicity!).

So, given that the previous conditional store succeeded, the coherence
order on A must be either {0, 1, 2} or {0, 2, 1}.

If it's {0, 1, 2} (as required by your complete example), that means
P1's a=2 write "observes" the conditional store by P0, and therefore
(because the conditional store has release semantics), also observes
P0's x=1 write.

On the other hand, if P1's a=2 write propagates first and we have a
coherence order of {0, 2, 1}, then P0 must have r3=2, because an
exclusive load returning zero would have led to a failed conditional
store thanks to the intervening write by P1.

I find it pretty weird if PPC allows the conditional store to succeed in
this way, as I think that would break simple cases like two threads
incrementing a shared variable in parallel:


""
{
0:r1=1; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=1; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
}
P0 | P1 ;
lwarx  r11,r10,r12 | lwarx  r11,r10,r12 ;
add r11,r1,r11 | add r11,r1,r11 ;
stwcx. r11,r10,r12 | stwcx. r11,r10,r12 ;
bne Fail0  | bne Fail1  ;
mr r3,r1   | mr r3,r1   ;
Fail0: | Fail1: ;
exists
(0:r3=1 /\ a=1 /\ 1:r3=1)


Is also allowed by herd and forbidden by ppcmem, for example.

Will


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Boqun Feng
On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 14, 2015 at 11:55:56PM +0800, Boqun Feng wrote:
> > According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
> > versions all need to imply a full barrier, however they are now just
> > RELEASE+ACQUIRE, which is not a full barrier.
> > 
> > So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> > PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> > __{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
> > semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().
> > 
> > This patch is a complement of commit b97021f85517 ("powerpc: Fix
> > atomic_xxx_return barrier semantics").
> > 
> > Acked-by: Michael Ellerman 
> > Cc:  # 3.4+
> > Signed-off-by: Boqun Feng 
> > ---
> >  arch/powerpc/include/asm/cmpxchg.h | 16 
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> > b/arch/powerpc/include/asm/cmpxchg.h
> > index ad6263c..d1a8d93 100644
> > --- a/arch/powerpc/include/asm/cmpxchg.h
> > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
> > unsigned long prev;
> > 
> > __asm__ __volatile__(
> > -   PPC_RELEASE_BARRIER
> > +   PPC_ATOMIC_ENTRY_BARRIER
> 
> This looks to be the lwsync instruction.
> 
> >  "1:lwarx   %0,0,%2 \n"
> > PPC405_ERR77(0,%2)
> >  "  stwcx.  %3,0,%2 \n\
> > bne-1b"
> > -   PPC_ACQUIRE_BARRIER
> > +   PPC_ATOMIC_EXIT_BARRIER
> 
> And this looks to be the sync instruction.
> 
> > : "=" (prev), "+m" (*(volatile unsigned int *)p)
> > : "r" (p), "r" (val)
> > : "cc", "memory");
> 
> Hmmm...
> 
> Suppose we have something like the following, where "a" and "x" are both
> initially zero:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg(, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
> If xchg() is fully ordered, we should never observe both CPUs'
> r3 values being zero, correct?
> 
> And wouldn't this be represented by the following litmus test?
> 
>   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
>stwcx. r1,r10,r12  | ;
>bne Fail0  | ;
>mr r3,r11  | ;
>Fail0: | ;
>   exists
>   (0:r3=0 /\ a=2 /\ 1:r3=0)
> 
> I left off P0's trailing sync because there is nothing for it to order
> against in this particular litmus test.  I tried adding it and verified
> that it has no effect.
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.
> 

If so, I will define PPC_ATOMIC_ENTRY_BARRIER as "sync" in the next
version of this patch, any concern?

Of course, I will wait to do that until we all understand this is
necessary and agree to make the change.
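
For reference, a hypothetical sketch of what that change would amount to
(the real definitions live in arch/powerpc/include/asm/synch.h; this is
not the actual header text):

/* leading barrier for fully-ordered atomics: full sync instead of lwsync */
#define PPC_ATOMIC_ENTRY_BARRIER        "\n\tsync\n"    /* currently lwsync */
/* trailing barrier is already a full sync */
#define PPC_ATOMIC_EXIT_BARRIER         "\n\tsync\n"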

> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.

For the value-returning RMW atomics, if the leading barrier necessarily
has to be "sync", I will just remove my __atomic_op_fence() in
patch 4, but I will keep patch 3 unchanged for the consistency of
the __atomic_op_*() macros' definitions. Peter and Will, does that work for
you both?

Regards,
Boqun

> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/
> 
>   Thanx, Paul
> 


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Boqun Feng
On Thu, Oct 15, 2015 at 11:35:44AM +0100, Will Deacon wrote:
> 
> So arm64 is ok. Doesn't lwsync order store->store observability for PPC?
> 

I did some litmus tests and put the results here. My understanding might be
wrong, and I think Paul can explain the lwsync and store->store order
better ;-)


When a store->lwsync->store pairs with load->lwsync->load, according to
herd, YES.

PPC W+lwsync+W-R+lwsync+R
"
  2015-10-15
  herd said (1:r1=0 /\ 1:r2=2) doesn't exist,
  so if P1 observes the write to 'b', it must also observe P0's write
  to 'a'
"
{
0:r1=1; 0:r2=2; 0:r12=a; 0:r13=b;
1:r1=0; 1:r2=0; 1:r12=a; 1:r13=b;
}

 P0  | P1 ;
 stw r1, 0(r12)  | lwz r2, 0(r13) ;
 lwsync  | lwsync ;
 stw r2, 0(r13)  | lwz r1, 0(r12) ;

exists
(1:r1=0 /\ 1:r2=2)
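
At the C level that first test is just the usual message-passing pattern;
a rough kernel-style sketch (an illustration only, not code from any
particular place) would be:

int a, b;

void P0(void)
{
        WRITE_ONCE(a, 1);
        smp_wmb();              /* becomes lwsync on Power */
        WRITE_ONCE(b, 2);
}

void P1(int *r1, int *r2)
{
        *r2 = READ_ONCE(b);
        smp_rmb();              /* becomes lwsync on Power */
        *r1 = READ_ONCE(a);
        /* (*r1 == 0 && *r2 == 2) is forbidden, matching herd's answer above */
}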


If observation also includes "a write on one CPU -overriding- another
write on another CPU", then

when a store->lwsync->store pairs(?) with store->sync->load, according
to herd, NO(?).

PPC W+lwsync+W-W+sync+R
"
  2015-10-15
  herd said (1:r1=0 /\ b=3) exists sometimes,
  so if P1 observes P0's write to 'b' (by 'overriding' this write to
  'b'), it may not observe P0's write to 'a'.
"
{
0:r1=1; 0:r2=2; 0:r12=a; 0:r13=b;
1:r1=0; 1:r2=3; 1:r12=a; 1:r13=b;
}

 P0  | P1 ;
 stw r1, 0(r12)  | stw r2, 0(r13) ;
 lwsync  | sync ;
 stw r2, 0(r13)  | lwz r1, 0(r12) ;

exists
(1:r1=0 /\ b=3)


Regards,
Boqun


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Will Deacon
Dammit guys, it's never simple is it?

On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> To that end, the herd tool can make a diagram of what it thought
> happened, and I have attached it.  I used this diagram to try and force
> this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> and succeeded.  Here is the sequence of events:
> 
> o Commit P0's write.  The model offers to propagate this write
>   to the coherence point and to P1, but don't do so yet.
> 
> o Commit P1's write.  Similar offers, but don't take them up yet.
> 
> o Commit P0's lwsync.
> 
> o Execute P0's lwarx, which reads a=0.  Then commit it.
> 
> o Commit P0's stwcx. as successful.  This stores a=1.

On arm64, this is a conditional-store-*release* and therefore cannot be
observed before the initial write to x...

> o Commit P0's branch (not taken).
> 
> o Commit P0's final register-to-register move.
> 
> o Commit P1's sync instruction.
> 
> o There is now nothing that can happen in either processor.
>   P0 is done, and P1 is waiting for its sync.  Therefore,
>   propagate P1's a=2 write to the coherence point and to
>   the other thread.

... therefore this is illegal, because you haven't yet propagated that
prior write...

> 
> o There is still nothing that can happen in either processor.
>   So pick the barrier propagate, then the acknowledge sync.
> 
> o P1 can now execute its read from x.  Because P0's write to
>   x is still waiting to propagate to P1, this still reads
>   x=0.  Execute and commit, and we now have both r3 registers
>   equal to zero and the final value a=2.

... and P1 would have to read x == 1.

So arm64 is ok. Doesn't lwsync order store->store observability for PPC?

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Will Deacon
Dammit guys, it's never simple is it?

On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> To that end, the herd tool can make a diagram of what it thought
> happened, and I have attached it.  I used this diagram to try and force
> this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> and succeeded.  Here is the sequence of events:
> 
> o Commit P0's write.  The model offers to propagate this write
>   to the coherence point and to P1, but don't do so yet.
> 
> o Commit P1's write.  Similar offers, but don't take them up yet.
> 
> o Commit P0's lwsync.
> 
> o Execute P0's lwarx, which reads a=0.  Then commit it.
> 
> o Commit P0's stwcx. as successful.  This stores a=1.

On arm64, this is a conditional-store-*release* and therefore cannot be
observed before the initial write to x...

> o Commit P0's branch (not taken).
> 
> o Commit P0's final register-to-register move.
> 
> o Commit P1's sync instruction.
> 
> o There is now nothing that can happen in either processor.
>   P0 is done, and P1 is waiting for its sync.  Therefore,
>   propagate P1's a=2 write to the coherence point and to
>   the other thread.

... therefore this is illegal, because you haven't yet propagated that
prior write...

> 
> o There is still nothing that can happen in either processor.
>   So pick the barrier propagate, then the acknowledge sync.
> 
> o P1 can now execute its read from x.  Because P0's write to
>   x is still waiting to propagate to P1, this still reads
>   x=0.  Execute and commit, and we now have both r3 registers
>   equal to zero and the final value a=2.

... and P1 would have to read x == 1.

So arm64 is ok. Doesn't lwsync order store->store observability for PPC?

Will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 10:49:23PM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:55:56PM +0800, Boqun Feng wrote:
> > > According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
> > > versions all need to imply a full barrier, however they are now just
> > > RELEASE+ACQUIRE, which is not a full barrier.
> > > 
> > > So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> > > PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> > > __{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
> > > semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().
> > > 
> > > This patch is a complement of commit b97021f85517 ("powerpc: Fix
> > > atomic_xxx_return barrier semantics").
> > > 
> > > Acked-by: Michael Ellerman 
> > > Cc:  # 3.4+
> > > Signed-off-by: Boqun Feng 
> > > ---
> > >  arch/powerpc/include/asm/cmpxchg.h | 16 
> > >  1 file changed, 8 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> > > b/arch/powerpc/include/asm/cmpxchg.h
> > > index ad6263c..d1a8d93 100644
> > > --- a/arch/powerpc/include/asm/cmpxchg.h
> > > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > > @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
> > >   unsigned long prev;
> > > 
> > >   __asm__ __volatile__(
> > > - PPC_RELEASE_BARRIER
> > > + PPC_ATOMIC_ENTRY_BARRIER
> > 
> > This looks to be the lwsync instruction.
> > 
> > >  "1:  lwarx   %0,0,%2 \n"
> > >   PPC405_ERR77(0,%2)
> > >  "stwcx.  %3,0,%2 \n\
> > >   bne-1b"
> > > - PPC_ACQUIRE_BARRIER
> > > + PPC_ATOMIC_EXIT_BARRIER
> > 
> > And this looks to be the sync instruction.
> > 
> > >   : "=" (prev), "+m" (*(volatile unsigned int *)p)
> > >   : "r" (p), "r" (val)
> > >   : "cc", "memory");
> > 
> > Hmmm...
> > 
> > Suppose we have something like the following, where "a" and "x" are both
> > initially zero:
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > r3 = xchg(, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > If xchg() is fully ordered, we should never observe both CPUs'
> > r3 values being zero, correct?
> > 
> > And wouldn't this be represented by the following litmus test?
> > 
> > PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > ""
> > {
> > 0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > 1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > }
> >  P0 | P1 ;
> >  stw r1,0(r2)   | stw r1,0(r12)  ;
> >  lwsync | sync   ;
> >  lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> >  stwcx. r1,r10,r12  | ;
> >  bne Fail0  | ;
> >  mr r3,r11  | ;
> >  Fail0: | ;
> > exists
> > (0:r3=0 /\ a=2 /\ 1:r3=0)
> > 
> > I left off P0's trailing sync because there is nothing for it to order
> > against in this particular litmus test.  I tried adding it and verified
> > that it has no effect.
> > 
> > Am I missing something here?  If not, it seems to me that you need
> > the leading lwsync to instead be a sync.
> > 
> 
> If so, I will define PPC_ATOMIC_ENTRY_BARRIER as "sync" in the next
> version of this patch, any concern?
> 
> Of course, I will wait to do that until we all understand this is
> nececarry and agree to make the change.

I am in favor, but I am not the maintainer.  ;-)

Thanx, Paul

> > Of course, if I am not missing something, then this applies also to the
> > value-returning RMW atomic operations that you pulled this pattern from.
> 
> For the value-returning RMW atomics, if the leading barrier is
> necessarily to be "sync", I will just remove my __atomic_op_fence() in
> patch 4, but I will remain patch 3 unchanged for the consistency of
> __atomic_op_*() macros' definitions. Peter and Will, do that works for
> you both?
> 
> Regards,
> Boqun
> 
> > If so, it would seem that I didn't think through all the possibilities
> > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > that I worried about the RMW atomic operation acting as a barrier,
> > but not as the load/store itself.  :-/
> > 
> > Thanx, Paul
> > 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 11:35:44AM +0100, Will Deacon wrote:
> Dammit guys, it's never simple is it?
> 
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> > 
> > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o   Commit P0's lwsync.
> > 
> > o   Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o   Commit P0's stwcx. as successful.  This stores a=1.
> 
> On arm64, this is a conditional-store-*release* and therefore cannot be
> observed before the initial write to x...
> 
> > o   Commit P0's branch (not taken).
> > 
> > o   Commit P0's final register-to-register move.
> > 
> > o   Commit P1's sync instruction.
> > 
> > o   There is now nothing that can happen in either processor.
> > P0 is done, and P1 is waiting for its sync.  Therefore,
> > propagate P1's a=2 write to the coherence point and to
> > the other thread.
> 
> ... therefore this is illegal, because you haven't yet propagated that
> prior write...

OK.  Power distinguishes between propagating to the coherence point
and to each of the other CPUs.

> > o   There is still nothing that can happen in either processor.
> > So pick the barrier propagate, then the acknowledge sync.
> > 
> > o   P1 can now execute its read from x.  Because P0's write to
> > x is still waiting to propagate to P1, this still reads
> > x=0.  Execute and commit, and we now have both r3 registers
> > equal to zero and the final value a=2.
> 
> ... and P1 would have to read x == 1.

Good!  Do ARMMEM and herd agree with you?

> So arm64 is ok. Doesn't lwsync order store->store observability for PPC?

Yes.  But this is not store->store observability, but rather store->load
visibility.  Furthermore, as I understand it, lwsync controls the
visibility to other CPUs, but not necessarily the coherence order.

Let's look at the example C code again:

CPU 0   CPU 1
-   -

WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
r3 = xchg(&a, 1);   smp_mb();
r3 = READ_ONCE(x);

The problem is that we are applying intuitions obtained from a
release-acquire chain, which hands off from stores to loads.  In contrast,
this example is quite weird in that we have a store handing off to another
store, but with reads also involved.  Making that work on Power requires
full memory barriers on both sides.  Intuitively, the coherence order
can be established after the fact as long as all readers see a consistent
set of values based on the subset of the sequence that each reader sees.

Anyway, it looks like Power does need a sync before and after for
value-returning atomics.  That certainly simplifies the analysis.
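
For concreteness, a minimal sketch of what a fully ordered __xchg_u32()
would then look like, assuming the entry barrier becomes sync, leaving
out the PPC405_ERR77 workaround, and using an illustrative name:

        static inline unsigned long
        __xchg_u32_fully_ordered(volatile void *p, unsigned long val)
        {
                unsigned long prev;

                __asm__ __volatile__(
                "       sync                    \n" /* full barrier before the RMW */
                "1:     lwarx   %0,0,%2         \n" /* load-reserve the old value  */
                "       stwcx.  %3,0,%2         \n" /* try to store the new value  */
                "       bne-    1b              \n" /* reservation lost, retry     */
                "       sync                    \n" /* full barrier after the RMW  */
                : "=&r" (prev), "+m" (*(volatile unsigned int *)p)
                : "r" (p), "r" (val)
                : "cc", "memory");

                return prev;
        }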

Thanx, Paul



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Will Deacon
On Thu, Oct 15, 2015 at 11:35:10AM +0100, Will Deacon wrote:
> Dammit guys, it's never simple is it?

I re-read this and it's even more confusing than I first thought.

> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> > 
> > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o   Commit P0's lwsync.
> > 
> > o   Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o   Commit P0's stwcx. as successful.  This stores a=1.
> 
> On arm64, this is a conditional-store-*release* and therefore cannot be
> observed before the initial write to x...
> 
> > o   Commit P0's branch (not taken).
> > 
> > o   Commit P0's final register-to-register move.
> > 
> > o   Commit P1's sync instruction.
> > 
> > o   There is now nothing that can happen in either processor.
> > P0 is done, and P1 is waiting for its sync.  Therefore,
> > propagate P1's a=2 write to the coherence point and to
> > the other thread.
> 
> ... therefore this is illegal, because you haven't yet propagated that
> prior write...

I misread this as a propagation of P0's conditional store. What actually
happens on arm64, is that the early conditional store can only succeed
once it is placed into the coherence order of the location which it is
updating (but note that this is subtly different from multi-copy
atomicity!).

So, given that the previous conditional store succeeded, the coherence
order on A must be either {0, 1, 2} or {0, 2, 1}.

If it's {0, 1, 2} (as required by your complete example), that means
P1's a=2 write "observes" the conditional store by P0, and therefore
(because the conditional store has release semantics), also observes
P0's x=1 write.

On the other hand, if P1's a=2 write propagates first and we have a
coherence order of {0, 2, 1}, then P0 must have r3=2, because an
exclusive load returning zero would have led to a failed conditional
store thanks to the intervening write by P1.

I find it pretty weird if PPC allows the conditional store to succeed in
this way, as I think that would break simple cases like two threads
incrementing a shared variable in parallel:


""
{
0:r1=1; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=1; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
}
P0 | P1 ;
lwarx  r11,r10,r12 | lwarx  r11,r10,r12 ;
add r11,r1,r11 | add r11,r1,r11 ;
stwcx. r11,r10,r12 | stwcx. r11,r10,r12 ;
bne Fail0  | bne Fail1  ;
mr r3,r1   | mr r3,r1   ;
Fail0: | Fail1: ;
exists
(0:r3=1 /\ a=1 /\ 1:r3=1)


Is also allowed by herd and forbidden by ppcmem, for example.
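
For clarity, a rough C-level equivalent of that litmus test (a sketch
only; atomic_cmpxchg() stands in for the lwarx/add/stwcx. loop, and the
names are made up):

        #include <linux/atomic.h>

        static atomic_t a_var = ATOMIC_INIT(0);

        /* Each of the two threads runs this once. */
        static int increment_once(void)
        {
                int old, new;

                do {
                        old = atomic_read(&a_var);      /* plays the role of lwarx */
                        new = old + 1;                  /* add                     */
                } while (atomic_cmpxchg(&a_var, old, new) != old);      /* stwcx.  */

                return 1;       /* mirrors "mr r3,r1" on the success path */
        }

The forbidden outcome in the litmus test corresponds to both threads
returning 1 here while a_var ends up at 1, i.e. a lost increment.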

Will


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Boqun Feng
On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 14, 2015 at 11:55:56PM +0800, Boqun Feng wrote:
> > According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
> > versions all need to imply a full barrier, however they are now just
> > RELEASE+ACQUIRE, which is not a full barrier.
> > 
> > So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> > PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> > __{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
> > semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().
> > 
> > This patch is a complement of commit b97021f85517 ("powerpc: Fix
> > atomic_xxx_return barrier semantics").
> > 
> > Acked-by: Michael Ellerman 
> > Cc:  # 3.4+
> > Signed-off-by: Boqun Feng 
> > ---
> >  arch/powerpc/include/asm/cmpxchg.h | 16 
> >  1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> > b/arch/powerpc/include/asm/cmpxchg.h
> > index ad6263c..d1a8d93 100644
> > --- a/arch/powerpc/include/asm/cmpxchg.h
> > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
> > unsigned long prev;
> > 
> > __asm__ __volatile__(
> > -   PPC_RELEASE_BARRIER
> > +   PPC_ATOMIC_ENTRY_BARRIER
> 
> This looks to be the lwsync instruction.
> 
> >  "1:lwarx   %0,0,%2 \n"
> > PPC405_ERR77(0,%2)
> >  "  stwcx.  %3,0,%2 \n\
> > bne-1b"
> > -   PPC_ACQUIRE_BARRIER
> > +   PPC_ATOMIC_EXIT_BARRIER
> 
> And this looks to be the sync instruction.
> 
> > : "=" (prev), "+m" (*(volatile unsigned int *)p)
> > : "r" (p), "r" (val)
> > : "cc", "memory");
> 
> Hmmm...
> 
> Suppose we have something like the following, where "a" and "x" are both
> initially zero:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
> If xchg() is fully ordered, we should never observe both CPUs'
> r3 values being zero, correct?
> 
> And wouldn't this be represented by the following litmus test?
> 
>   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
>stwcx. r1,r10,r12  | ;
>bne Fail0  | ;
>mr r3,r11  | ;
>Fail0: | ;
>   exists
>   (0:r3=0 /\ a=2 /\ 1:r3=0)
> 
> I left off P0's trailing sync because there is nothing for it to order
> against in this particular litmus test.  I tried adding it and verified
> that it has no effect.
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.
> 

If so, I will define PPC_ATOMIC_ENTRY_BARRIER as "sync" in the next
version of this patch, any concerns?

Of course, I will wait to do that until we all understand this is
necessary and agree to make the change.
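
For concreteness, the change under discussion would roughly amount to
the following in asm/synch.h for the SMP case (a sketch from memory,
not a formal patch):

        /* current (roughly): lwsync before, sync after */
        #define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(LWSYNC) "\n"
        #define PPC_ATOMIC_EXIT_BARRIER  "\n" stringify_in_c(sync) "\n"

        /* proposed: a full sync on entry as well */
        #define PPC_ATOMIC_ENTRY_BARRIER "\n" stringify_in_c(sync) "\n"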

> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.

For the value-returning RMW atomics, if the leading barrier needs to be
"sync", I will just remove my __atomic_op_fence() in
patch 4, but I will leave patch 3 unchanged for the consistency of
__atomic_op_*() macros' definitions. Peter and Will, does that work for
you both?

Regards,
Boqun

> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/
> 
>   Thanx, Paul
> 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Boqun Feng
On Thu, Oct 15, 2015 at 11:35:44AM +0100, Will Deacon wrote:
> 
> So arm64 is ok. Doesn't lwsync order store->store observability for PPC?
> 

I did some litmus tests and put the results here. My understanding might be
wrong, and I think Paul can explain the lwsync and store->store order
better ;-)


When a store->lwsync->store pairs with load->lwsync->load, according to
herd, YES.

PPC W+lwsync+W-R+lwsync+R
"
  2015-10-15
  herd said (1:r1=0 /\ 1:r2=2) doesn't exist,
  so if P1 observes the write to 'b', it must also observe P0's write
  to 'a'
"
{
0:r1=1; 0:r2=2; 0:r12=a; 0:r13=b;
1:r1=0; 1:r2=0; 1:r12=a; 1:r13=b;
}

 P0  | P1 ;
 stw r1, 0(r12)  | lwz r2, 0(r13) ;
 lwsync  | lwsync ;
 stw r2, 0(r13)  | lwz r1, 0(r12) ;

exists
(1:r1=0 /\ 1:r2=2)
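
For reference, a C-level rendering of the same pattern (a sketch; on
Power both of these barriers would typically be lwsync):

        #include <linux/compiler.h>
        #include <asm/barrier.h>

        int a, b;
        int r1, r2;

        void p0(void)                   /* store a, then b */
        {
                WRITE_ONCE(a, 1);
                smp_wmb();              /* lwsync */
                WRITE_ONCE(b, 2);
        }

        void p1(void)                   /* read b, then a */
        {
                r2 = READ_ONCE(b);
                smp_rmb();              /* lwsync */
                r1 = READ_ONCE(a);
        }

        /* herd says the outcome (r2 == 2 && r1 == 0) never happens */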


If observation also includes "a write on one CPU *overriding* another
write on another CPU", then

when a store->lwsync->store pairs(?) with store->sync->load, according
to herd, NO(?).

PPC W+lwsync+W-W+sync+R
"
  2015-10-15
  herd said (1:r1=0 /\ b=3) exists sometimes,
  so if P1 observes P0's write to 'b' (by 'overriding' this write to
  'b'), it may not observe P0's write to 'a'.
"
{
0:r1=1; 0:r2=2; 0:r12=a; 0:r13=b;
1:r1=0; 1:r2=3; 1:r12=a; 1:r13=b;
}

 P0  | P1 ;
 stw r1, 0(r12)  | stw r2, 0(r13) ;
 lwsync  | sync ;
 stw r2, 0(r13)  | lwz r1, 0(r12) ;

exists
(1:r1=0 /\ b=3)


Regards,
Boqun




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 12:48:03PM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 08:07:05PM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> [snip]
> > > 
> > > I'm afraid more than that, the above litmus also shows that
> > > 
> > >   CPU 0   CPU 1
> > >   -   -
> > > 
> > >   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > >   r3 = xchg_release(&a, 1);   smp_mb();
> > >   r3 = READ_ONCE(x);
> > > 
> > >   (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibited
> > > 
> > > in the implementation of this patchset, which should be disallowed by
> > > the semantics of RELEASE, right?
> > 
> > Not necessarily.  If you had the read first on CPU 1, and you had a
> > similar problem, I would be more worried.
> > 
> 
> Sometimes I think maybe we should say that a single unpaired ACQUIRE or
> RELEASE doesn't have any order guarantee because of the above case.
> 
> But it seems that's not a normal or even existing case, my bad ;-(
> 
> > > And even:
> > > 
> > >   CPU 0   CPU 1
> > >   -   -
> > > 
> > >   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > >   smp_store_release(&a, 1);   smp_mb();
> > >   r3 = READ_ONCE(x);
> > > 
> > >   (1:r3 == 0 && a == 2) is not prohibited
> > > 
> > > shows by:
> > > 
> > >   PPC weird-lwsync
> > >   ""
> > >   {
> > >   0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> > >   1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> > >   }
> > >P0 | P1 ;
> > >stw r1,0(r2)   | stw r1,0(r12)  ;
> > >lwsync | sync   ;
> > >stw  r1,0(r12) | lwz r3,0(r2)   ;
> > >   exists
> > >   (a=2 /\ 1:r3=0)
> > > 
> > > Please find something I'm (or the tool is) missing, maybe we can't use
> > > (a == 2) as an indication that STORE on CPU 1 happens after STORE on CPU
> > > 0?
> > 
> > Again, if you were pairing the smp_store_release() with an 
> > smp_load_acquire()
> > or even a READ_ONCE() followed by a barrier, I would be quite concerned.
> > I am not at all worried about the above two litmus tests.
> > 
> 
> Understood, thank you for thinking through that ;-)
> 
> > > And there is really something I find strange, see below.
> > > 
> > > > > 
> > > > > So the scenario that would fail would be this one, right?
> > > > > 
> > > > > a = x = 0
> > > > > 
> > > > >   CPU0CPU1
> > > > > 
> > > > >   r3 = load_locked (&a);
> > > > >   a = 2;
> > > > >   sync();
> > > > >   r3 = x;
> > > > >   x = 1;
> > > > >   lwsync();
> > > > >   if (!store_cond(&a, 1))
> > > > >   goto again
> > > > > 
> > > > > 
> > > > > Where we hoist the load way up because lwsync allows this.
> > > > 
> > > > That scenario would end up with a==1 rather than a==2.
> > > > 
> > > > > I always thought this would fail because CPU1's store to @a would fail
> > > > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > > > load and now seeing the new value (2).
> > > > 
> > > > The stwcx. failure was one thing that prevented a number of other
> > > > misordering cases.  The problem is that we have to let go of the notion
> > > > of an implicit global clock.
> > > > 
> > > > To that end, the herd tool can make a diagram of what it thought
> > > > happened, and I have attached it.  I used this diagram to try and force
> > > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > > and succeeded.  Here is the sequence of events:
> > > > 
> > > > o   Commit P0's write.  The model offers to propagate this write
> > > > to the coherence point and to P1, but don't do so yet.
> > > > 
> > > > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > > > 
> > > > o   Commit P0's lwsync.
> > > > 
> > > > o   Execute P0's lwarx, which reads a=0.  Then commit it.
> > > > 
> > > > o   Commit P0's stwcx. as successful.  This stores a=1.
> > > > 
> > > > o   Commit P0's branch (not taken).
> > > 
> > > So at this point, P0's write to 'a' has propagated to P1, right? But
> > > P0's write to 'x' hasn't, even there is a lwsync between them, right?
> > > Doesn't the lwsync prevent this from happening?
> > 
> > No, because lwsync is quite a bit weaker than sync aside from just
> > the store-load ordering.
> > 
> 
> Understood, I've tried the ppcmem, much clear now ;-)
> 
> > > If at this point P0's write to 'a' hasn't propagated then when?
> > 
> > Later.  At the very end of the test, in this case.  ;-)
> > 
> 
> Hmm.. I tried exactly this sequence in ppcmem, seems propagation of P0's
> write to 'a' is never an option...
> 
> > Why not try creating a longer litmus test that requires P0's write to
> > "a" to 

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-15 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 03:50:44PM +0100, Will Deacon wrote:
> On Thu, Oct 15, 2015 at 11:35:10AM +0100, Will Deacon wrote:
> > Dammit guys, it's never simple is it?
> 
> I re-read this and it's even more confusing than I first thought.
> 
> > On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o Commit P0's write.  The model offers to propagate this write
> > >   to the coherence point and to P1, but don't do so yet.
> > > 
> > > o Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o Commit P0's lwsync.
> > > 
> > > o Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o Commit P0's stwcx. as successful.  This stores a=1.
> > 
> > On arm64, this is a conditional-store-*release* and therefore cannot be
> > observed before the initial write to x...
> > 
> > > o Commit P0's branch (not taken).
> > > 
> > > o Commit P0's final register-to-register move.
> > > 
> > > o Commit P1's sync instruction.
> > > 
> > > o There is now nothing that can happen in either processor.
> > >   P0 is done, and P1 is waiting for its sync.  Therefore,
> > >   propagate P1's a=2 write to the coherence point and to
> > >   the other thread.
> > 
> > ... therefore this is illegal, because you haven't yet propagated that
> > prior write...
> 
> I misread this as a propagation of P0's conditional store. What actually
> happens on arm64, is that the early conditional store can only succeed
> once it is placed into the coherence order of the location which it is
> updating (but note that this is subtly different from multi-copy
> atomicity!).
> 
> So, given that the previous conditional store succeeded, the coherence
> order on A must be either {0, 1, 2} or {0, 2, 1}.
> 
> If it's {0, 1, 2} (as required by your complete example), that means
> P1's a=2 write "observes" the conditional store by P0, and therefore
> (because the conditional store has release semantics), also observes
> P0's x=1 write.
> 
> On the other hand, if P1's a=2 write propagates first and we have a
> coherence order of {0, 2, 1}, then P0 must have r3=2, because an
> exclusive load returning zero would have led to a failed conditional
> store thanks to the intervening write by P1.
> 
> I find it pretty weird if PPC allows the conditional store to succeed in
> this way, as I think that would break simple cases like two threads
> incrementing a shared variable in parallel:
> 
> 
> ""
> {
> 0:r1=1; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> 1:r1=1; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> }
> P0 | P1 ;
> lwarx  r11,r10,r12 | lwarx  r11,r10,r12 ;
> add r11,r1,r11 | add r11,r1,r11 ;
> stwcx. r11,r10,r12 | stwcx. r11,r10,r12 ;
> bne Fail0  | bne Fail1  ;
> mr r3,r1   | mr r3,r1   ;
> Fail0: | Fail1: ;
> exists
> (0:r3=1 /\ a=1 /\ 1:r3=1)
> 
> 
> Is also allowed by herd and forbidden by ppcmem, for example.

I had no idea that either herd or ppcmem knew about the add instruction.
It looks like they at least try to understand.  Needless to say, in this
case I agree with ppcmem.

Thanx, Paul



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
On Wed, Oct 14, 2015 at 08:07:05PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
[snip]
> > 
> > I'm afraid more than that, the above litmus also shows that
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > r3 = xchg_release(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibited
> > 
> > in the implementation of this patchset, which should be disallowed by
> > the semantics of RELEASE, right?
> 
> Not necessarily.  If you had the read first on CPU 1, and you had a
> similar problem, I would be more worried.
> 

Sometimes I think maybe we should say that a single unpaired ACQUIRE or
RELEASE doesn't have any order guarantee because of the above case.

But it seems that's not a normal or even existing case, my bad ;-(

> > And even:
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > smp_store_release(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > (1:r3 == 0 && a == 2) is not prohibited
> > 
> > shows by:
> > 
> > PPC weird-lwsync
> > ""
> > {
> > 0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> > 1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> > }
> >  P0 | P1 ;
> >  stw r1,0(r2)   | stw r1,0(r12)  ;
> >  lwsync | sync   ;
> >  stw  r1,0(r12) | lwz r3,0(r2)   ;
> > exists
> > (a=2 /\ 1:r3=0)
> > 
> > Please find something I'm (or the tool is) missing, maybe we can't use
> > (a == 2) as an indication that STORE on CPU 1 happens after STORE on CPU
> > 0?
> 
> Again, if you were pairing the smp_store_release() with an smp_load_acquire()
> or even a READ_ONCE() followed by a barrier, I would be quite concerned.
> I am not at all worried about the above two litmus tests.
> 

Understood, thank you for thinking through that ;-)

> > And there is really something I find strange, see below.
> > 
> > > > 
> > > > So the scenario that would fail would be this one, right?
> > > > 
> > > > a = x = 0
> > > > 
> > > > CPU0CPU1
> > > > 
> > > > r3 = load_locked (&a);
> > > > a = 2;
> > > > sync();
> > > > r3 = x;
> > > > x = 1;
> > > > lwsync();
> > > > if (!store_cond(&a, 1))
> > > > goto again
> > > > 
> > > > 
> > > > Where we hoist the load way up because lwsync allows this.
> > > 
> > > That scenario would end up with a==1 rather than a==2.
> > > 
> > > > I always thought this would fail because CPU1's store to @a would fail
> > > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > > load and now seeing the new value (2).
> > > 
> > > The stwcx. failure was one thing that prevented a number of other
> > > misordering cases.  The problem is that we have to let go of the notion
> > > of an implicit global clock.
> > > 
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o Commit P0's write.  The model offers to propagate this write
> > >   to the coherence point and to P1, but don't do so yet.
> > > 
> > > o Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o Commit P0's lwsync.
> > > 
> > > o Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o Commit P0's stwcx. as successful.  This stores a=1.
> > > 
> > > o Commit P0's branch (not taken).
> > 
> > So at this point, P0's write to 'a' has propagated to P1, right? But
> > P0's write to 'x' hasn't, even there is a lwsync between them, right?
> > Doesn't the lwsync prevent this from happening?
> 
> No, because lwsync is quite a bit weaker than sync aside from just
> the store-load ordering.
> 

Understood, I've tried the ppcmem, much clear now ;-)

> > If at this point P0's write to 'a' hasn't propagated then when?
> 
> Later.  At the very end of the test, in this case.  ;-)
> 

Hmm.. I tried exactly this sequence in ppcmem, seems propagation of P0's
write to 'a' is never an option...

> Why not try creating a longer litmus test that requires P0's write to
> "a" to propagate to P1 before both processes complete?
> 

I will try to write one, but to be clear, you mean we still observe 

0:r3 == 0 && a == 2 && 1:r3 == 0 

at the end, right? Because I understand that if P1's write to 'a'
doesn't override P0's, P0's write to 'a' will propagate.

Regards,
Boqun



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 11:11:01AM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> [snip]
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o Commit P0's write.  The model offers to propagate this write
> > >   to the coherence point and to P1, but don't do so yet.
> > > 
> > > o Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o Commit P0's lwsync.
> > > 
> > > o Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o Commit P0's stwcx. as successful.  This stores a=1.
> > > 
> > > o Commit P0's branch (not taken).
> > > 
> > 
> > So at this point, P0's write to 'a' has propagated to P1, right? But
> > P0's write to 'x' hasn't, even there is a lwsync between them, right?
> > Doesn't the lwsync prevent this from happening?
> > 
> > If at this point P0's write to 'a' hasn't propagated then when?
> 
> Hmm.. I played around with ppcmem, and figured out what happens to
> propagation of P0's write to 'a':
> 
> At this point, or at some point after the store of 1 to 'a' and before the sync
> on P1 finishes, the writes to 'a' reach a coherence point at which 'a' is 2, so
> P0's write to 'a' "fails" and will not propagate.
> 
> I probably misunderstood the word "propagate", which actually means an
> already coherent write gets seen by another CPU, right?

It is quite possible for a given write to take a position in the coherence
order that guarantees that no one will see it, as is the case here.
But yes, all readers will see an order of values for a given memory
location that is consistent with the coherence order.

> So my question should be:
> 
> As lwsync can order P0's write to 'a' to happen after P0's write to 'x',
> why isn't P0's write to 'x' seen by P1 after P1's write to 'a' overrides
> P0's?

There is no global clock for PPC's memory model.

> But ppcmem gave me the answer ;-) lwsync won't wait until P0's write to
> 'x' gets propagated, and if P0's write to 'a' "wins" in write coherence,
> lwsync will guarantee propagation of 'x' happens before that of 'a', but
> if P0's write to 'a' "fails", there will be no propagation of 'a' from
> P0. So that lwsync can't do anything here.

I believe that this is consistent, but the corners can get tricky.

Thanx, Paul

> Regards,
> Boqun
> 
> > 
> > > o Commit P0's final register-to-register move.
> > > 
> > > o Commit P1's sync instruction.
> > > 
> > > o There is now nothing that can happen in either processor.
> > >   P0 is done, and P1 is waiting for its sync.  Therefore,
> > >   propagate P1's a=2 write to the coherence point and to
> > >   the other thread.
> > > 
> > > o There is still nothing that can happen in either processor.
> > >   So pick the barrier propagate, then the acknowledge sync.
> > > 
> > > o P1 can now execute its read from x.  Because P0's write to
> > >   x is still waiting to propagate to P1, this still reads
> > >   x=0.  Execute and commit, and we now have both r3 registers
> > >   equal to zero and the final value a=2.
> > > 
> > > o Clean up by propagating the write to x everywhere, and
> > >   propagating the lwsync.
> > > 
> > > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > > 
> > > I am still not 100% confident of my litmus test.  It is quite possible
> > > that I lost something in translation, but that is looking less likely.
> > > 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
Hi Paul,

On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
[snip]
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> > 
> > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o   Commit P0's lwsync.
> > 
> > o   Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o   Commit P0's stwcx. as successful.  This stores a=1.
> > 
> > o   Commit P0's branch (not taken).
> > 
> 
> So at this point, P0's write to 'a' has propagated to P1, right? But
> P0's write to 'x' hasn't, even there is a lwsync between them, right?
> Doesn't the lwsync prevent this from happening?
> 
> If at this point P0's write to 'a' hasn't propagated then when?
> 

Hmm.. I played around with ppcmem, and figured out what happens to
propagation of P0's write to 'a':

At this point, or at some point after the store of 1 to 'a' and before the sync
on P1 finishes, the writes to 'a' reach a coherence point at which 'a' is 2, so
P0's write to 'a' "fails" and will not propagate.


I probably misunderstood the word "propagate", which actually means an
already coherent write gets seen by another CPU, right?

So my question should be:

As lwsync can order P0's write to 'a' to happen after P0's write to 'x',
why isn't P0's write to 'x' seen by P1 after P1's write to 'a' overrides
P0's?

But ppcmem gave me the answer ;-) lwsync won't wait until P0's write to
'x' gets propagated, and if P0's write to 'a' "wins" in write coherence,
lwsync will guarantee propagation of 'x' happens before that of 'a', but
if P0's write to 'a' "fails", there will be no propagation of 'a' from
P0. So that lwsync can't do anything here.

Regards,
Boqun

> 
> > o   Commit P0's final register-to-register move.
> > 
> > o   Commit P1's sync instruction.
> > 
> > o   There is now nothing that can happen in either processor.
> > P0 is done, and P1 is waiting for its sync.  Therefore,
> > propagate P1's a=2 write to the coherence point and to
> > the other thread.
> > 
> > o   There is still nothing that can happen in either processor.
> > So pick the barrier propagate, then the acknowledge sync.
> > 
> > o   P1 can now execute its read from x.  Because P0's write to
> > x is still waiting to propagate to P1, this still reads
> > x=0.  Execute and commit, and we now have both r3 registers
> > equal to zero and the final value a=2.
> > 
> > o   Clean up by propagating the write to x everywhere, and
> > propagating the lwsync.
> > 
> > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > 
> > I am still not 100% confident of my litmus test.  It is quite possible
> > that I lost something in translation, but that is looking less likely.
> > 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 09:22:26AM +0800, Boqun Feng wrote:
> On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > > Suppose we have something like the following, where "a" and "x" are 
> > > > > both
> > > > > initially zero:
> > > > > 
> > > > >   CPU 0   CPU 1
> > > > >   -   -
> > > > > 
> > > > >   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > > > >   r3 = xchg(&a, 1);   smp_mb();
> > > > >   r3 = READ_ONCE(x);
> > > > > 
> > > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > > r3 values being zero, correct?
> > > > > 
> > > > > And wouldn't this be represented by the following litmus test?
> > > > > 
> > > > >   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > >   ""
> > > > >   {
> > > > >   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > >   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > >   }
> > > > >P0 | P1 ;
> > > > >stw r1,0(r2)   | stw r1,0(r12)  ;
> > > > >lwsync | sync   ;
> > > > >lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > > > >stwcx. r1,r10,r12  | ;
> > > > >bne Fail0  | ;
> > > > >mr r3,r11  | ;
> > > > >Fail0: | ;
> > > > >   exists
> > > > >   (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > > 
> > > > > I left off P0's trailing sync because there is nothing for it to order
> > > > > against in this particular litmus test.  I tried adding it and 
> > > > > verified
> > > > > that it has no effect.
> > > > > 
> > > > > Am I missing something here?  If not, it seems to me that you need
> > > > > the leading lwsync to instead be a sync.
> > 
> > I'm afraid more than that, the above litmus also shows that
> 
> I mean there will be more things we need to fix, perhaps even smp_wmb()
> needs to be sync then..

That should not be necessary.  For smp_wmb(), lwsync should be just fine.
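
For reference, the mapping assumed here is roughly the following
(simplified; the real asm/barrier.h has more variants, and some
embedded parts use eieio rather than lwsync for smp_wmb()):

        #define smp_mb()        __asm__ __volatile__ ("sync"   : : : "memory")
        #define smp_rmb()       __asm__ __volatile__ ("lwsync" : : : "memory")
        #define smp_wmb()       __asm__ __volatile__ ("lwsync" : : : "memory")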

Thanx, Paul

> Regards,
> Boqun
> 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > r3 = xchg_release(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibited
> > 
> > in the implementation of this patchset, which should be disallowed by
> > the semantics of RELEASE, right?
> > 
> > And even:
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > smp_store_release(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > (1:r3 == 0 && a == 2) is not prohibited
> > 
> > shows by:
> > 
> > PPC weird-lwsync
> > ""
> > {
> > 0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> > 1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> > }
> >  P0 | P1 ;
> >  stw r1,0(r2)   | stw r1,0(r12)  ;
> >  lwsync | sync   ;
> >  stw  r1,0(r12) | lwz r3,0(r2)   ;
> > exists
> > (a=2 /\ 1:r3=0)
> > 
> > 
> > Please find something I'm (or the tool is) missing, maybe we can't use
> > (a == 2) as an indication that STORE on CPU 1 happens after STORE on CPU
> > 0?
> > 
> > And there is really something I find strange, see below.
> > 
> > > > 
> > > > So the scenario that would fail would be this one, right?
> > > > 
> > > > a = x = 0
> > > > 
> > > > CPU0CPU1
> > > > 
> > > > r3 = load_locked (&a);
> > > > a = 2;
> > > > sync();
> > > > r3 = x;
> > > > x = 1;
> > > > lwsync();
> > > > if (!store_cond(&a, 1))
> > > > goto again
> > > > 
> > > > 
> > > > Where we hoist the load way up because lwsync allows this.
> > > 
> > > That scenario would end up with a==1 rather than a==2.
> > > 
> > > > I always thought this would fail because CPU1's store to @a would fail
> > > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > > load and now seeing the new value (2).
> > > 
> > > The stwcx. failure was one thing that prevented a number of other
> > > misordering cases.  The problem is that we have to let go of the notion
> > > of an implicit global clock.
> > > 
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have 

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > Suppose we have something like the following, where "a" and "x" are both
> > > > initially zero:
> > > > 
> > > > CPU 0   CPU 1
> > > > -   -
> > > > 
> > > > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > > > r3 = xchg(&a, 1);   smp_mb();
> > > > r3 = READ_ONCE(x);
> > > > 
> > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > r3 values being zero, correct?
> > > > 
> > > > And wouldn't this be represented by the following litmus test?
> > > > 
> > > > PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > ""
> > > > {
> > > > 0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > 1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > }
> > > >  P0 | P1 ;
> > > >  stw r1,0(r2)   | stw r1,0(r12)  ;
> > > >  lwsync | sync   ;
> > > >  lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > > >  stwcx. r1,r10,r12  | ;
> > > >  bne Fail0  | ;
> > > >  mr r3,r11  | ;
> > > >  Fail0: | ;
> > > > exists
> > > > (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > 
> > > > I left off P0's trailing sync because there is nothing for it to order
> > > > against in this particular litmus test.  I tried adding it and verified
> > > > that it has no effect.
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> 
> I'm afraid more than that, the above litmus also shows that
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibited
> 
> in the implementation of this patchset, which should be disallowed by
> the semantics of RELEASE, right?

Not necessarily.  If you had the read first on CPU 1, and you had a
similar problem, I would be more worried.

> And even:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   smp_store_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (1:r3 == 0 && a == 2) is not prohibited
> 
> shows by:
> 
>   PPC weird-lwsync
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>stw  r1,0(r12) | lwz r3,0(r2)   ;
>   exists
>   (a=2 /\ 1:r3=0)
> 
> Please find something I'm (or the tool is) missing, maybe we can't use
> (a == 2) as an indication that STORE on CPU 1 happens after STORE on CPU
> 0?

Again, if you were pairing the smp_store_release() with an smp_load_acquire()
or even a READ_ONCE() followed by a barrier, I would be quite concerned.
I am not at all worried about the above two litmus tests.

> And there is really something I find strange, see below.
> 
> > > 
> > > So the scenario that would fail would be this one, right?
> > > 
> > > a = x = 0
> > > 
> > >   CPU0CPU1
> > > 
> > >   r3 = load_locked (&a);
> > >   a = 2;
> > >   sync();
> > >   r3 = x;
> > >   x = 1;
> > >   lwsync();
> > >   if (!store_cond(&a, 1))
> > >   goto again
> > > 
> > > 
> > > Where we hoist the load way up because lwsync allows this.
> > 
> > That scenario would end up with a==1 rather than a==2.
> > 
> > > I always thought this would fail because CPU1's store to @a would fail
> > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > load and now seeing the new value (2).
> > 
> > The stwcx. failure was one thing that prevented a number of other
> > misordering cases.  The problem is that we have to let go of the notion
> > of an implicit global clock.
> > 
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this 

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > Suppose we have something like the following, where "a" and "x" are both
> > > > initially zero:
> > > > 
> > > > CPU 0   CPU 1
> > > > -   -
> > > > 
> > > > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > > > r3 = xchg(&a, 1);   smp_mb();
> > > > r3 = READ_ONCE(x);
> > > > 
> > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > r3 values being zero, correct?
> > > > 
> > > > And wouldn't this be represented by the following litmus test?
> > > > 
> > > > PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > ""
> > > > {
> > > > 0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > 1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > }
> > > >  P0 | P1 ;
> > > >  stw r1,0(r2)   | stw r1,0(r12)  ;
> > > >  lwsync | sync   ;
> > > >  lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > > >  stwcx. r1,r10,r12  | ;
> > > >  bne Fail0  | ;
> > > >  mr r3,r11  | ;
> > > >  Fail0: | ;
> > > > exists
> > > > (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > 
> > > > I left off P0's trailing sync because there is nothing for it to order
> > > > against in this particular litmus test.  I tried adding it and verified
> > > > that it has no effect.
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> 
> I'm afraid more than that, the above litmus also shows that
> 

I mean there will be more things we need to fix, perhaps even smp_wmb()
needs to be sync then..

Regards,
Boqun

>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibited
> 
> in the implementation of this patchset, which should be disallowed by
> the semantics of RELEASE, right?
> 
> And even:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   smp_store_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (1:r3 == 0 && a == 2) is not prohibited
> 
> shows by:
> 
>   PPC weird-lwsync
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>stw  r1,0(r12) | lwz r3,0(r2)   ;
>   exists
>   (a=2 /\ 1:r3=0)
> 
> 
> Please find something I'm (or the tool is) missing, maybe we can't use
> (a == 2) as an indication that STORE on CPU 1 happens after STORE on CPU
> 0?
> 
> And there is really something I find strange, see below.
> 
> > > 
> > > So the scenario that would fail would be this one, right?
> > > 
> > > a = x = 0
> > > 
> > >   CPU0CPU1
> > > 
> > >   r3 = load_locked (&a);
> > >   a = 2;
> > >   sync();
> > >   r3 = x;
> > >   x = 1;
> > >   lwsync();
> > >   if (!store_cond(&a, 1))
> > >   goto again
> > > 
> > > 
> > > Where we hoist the load way up because lwsync allows this.
> > 
> > That scenario would end up with a==1 rather than a==2.
> > 
> > > I always thought this would fail because CPU1's store to @a would fail
> > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > load and now seeing the new value (2).
> > 
> > The stwcx. failure was one thing that prevented a number of other
> > misordering cases.  The problem is that we have to let go of the notion
> > of an implicit global clock.
> > 
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> > 
> > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o   Commit P0's lwsync.
> > 
> > o   Execute 

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > Suppose we have something like the following, where "a" and "x" are both
> > > initially zero:
> > > 
> > >   CPU 0   CPU 1
> > >   -   -
> > > 
> > >   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > >   r3 = xchg(&a, 1);   smp_mb();
> > >   r3 = READ_ONCE(x);
> > > 
> > > If xchg() is fully ordered, we should never observe both CPUs'
> > > r3 values being zero, correct?
> > > 
> > > And wouldn't this be represented by the following litmus test?
> > > 
> > >   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > >   ""
> > >   {
> > >   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > >   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > >   }
> > >P0 | P1 ;
> > >stw r1,0(r2)   | stw r1,0(r12)  ;
> > >lwsync | sync   ;
> > >lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > >stwcx. r1,r10,r12  | ;
> > >bne Fail0  | ;
> > >mr r3,r11  | ;
> > >Fail0: | ;
> > >   exists
> > >   (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > 
> > > I left off P0's trailing sync because there is nothing for it to order
> > > against in this particular litmus test.  I tried adding it and verified
> > > that it has no effect.
> > > 
> > > Am I missing something here?  If not, it seems to me that you need
> > > the leading lwsync to instead be a sync.

I'm afraid more than that, the above litmus also shows that

CPU 0   CPU 1
-   -

WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
r3 = xchg_release(&a, 1);   smp_mb();
r3 = READ_ONCE(x);

(0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibited

in the implementation of this patchset, which should be disallowed by
the semantics of RELEASE, right?

And even:

CPU 0   CPU 1
-   -

WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
smp_store_release(&a, 1);   smp_mb();
r3 = READ_ONCE(x);

(1:r3 == 0 && a == 2) is not prohibited

shows by:

PPC weird-lwsync
""
{
0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
}
 P0 | P1 ;
 stw r1,0(r2)   | stw r1,0(r12)  ;
 lwsync | sync   ;
 stw  r1,0(r12) | lwz r3,0(r2)   ;
exists
(a=2 /\ 1:r3=0)


Please find something I'm (or the tool is) missing, maybe we can't use
(a == 2) as an indication that STORE on CPU 1 happens after STORE on CPU
0?

And there is really something I find strange, see below.

> > 
> > So the scenario that would fail would be this one, right?
> > 
> > a = x = 0
> > 
> > CPU0CPU1
> > 
> > r3 = load_locked (&a);
> > a = 2;
> > sync();
> > r3 = x;
> > x = 1;
> > lwsync();
> > if (!store_cond(&a, 1))
> > goto again
> > 
> > 
> > Where we hoist the load way up because lwsync allows this.
> 
> That scenario would end up with a==1 rather than a==2.
> 
> > I always thought this would fail because CPU1's store to @a would fail
> > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > load and now seeing the new value (2).
> 
> The stwcx. failure was one thing that prevented a number of other
> misordering cases.  The problem is that we have to let go of the notion
> of an implicit global clock.
> 
> To that end, the herd tool can make a diagram of what it thought
> happened, and I have attached it.  I used this diagram to try and force
> this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> and succeeded.  Here is the sequence of events:
> 
> o Commit P0's write.  The model offers to propagate this write
>   to the coherence point and to P1, but don't do so yet.
> 
> o Commit P1's write.  Similar offers, but don't take them up yet.
> 
> o Commit P0's lwsync.
> 
> o Execute P0's lwarx, which reads a=0.  Then commit it.
> 
> o Commit P0's stwcx. as successful.  This stores a=1.
> 
> o Commit P0's branch (not taken).
> 

So at this point, P0's write to 'a' has propagated to P1, right? But
P0's write to 'x' hasn't, even there is a lwsync between them, right?
Doesn't the lwsync prevent this from happening?

If at this point P0's write to 'a' hasn't propagated then when?

Regards,
Boqun

> o Commit P0's final register-to-register move.
> 
> o  

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Paul E. McKenney
On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > Suppose we have something like the following, where "a" and "x" are both
> > initially zero:
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > r3 = xchg(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > If xchg() is fully ordered, we should never observe both CPUs'
> > r3 values being zero, correct?
> > 
> > And wouldn't this be represented by the following litmus test?
> > 
> > PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > ""
> > {
> > 0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > 1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > }
> >  P0 | P1 ;
> >  stw r1,0(r2)   | stw r1,0(r12)  ;
> >  lwsync | sync   ;
> >  lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> >  stwcx. r1,r10,r12  | ;
> >  bne Fail0  | ;
> >  mr r3,r11  | ;
> >  Fail0: | ;
> > exists
> > (0:r3=0 /\ a=2 /\ 1:r3=0)
> > 
> > I left off P0's trailing sync because there is nothing for it to order
> > against in this particular litmus test.  I tried adding it and verified
> > that it has no effect.
> > 
> > Am I missing something here?  If not, it seems to me that you need
> > the leading lwsync to instead be a sync.
> 
> So the scenario that would fail would be this one, right?
> 
> a = x = 0
> 
>   CPU0CPU1
> 
>   r3 = load_locked (&a);
>   a = 2;
>   sync();
>   r3 = x;
>   x = 1;
>   lwsync();
>   if (!store_cond(&a, 1))
>   goto again
> 
> 
> Where we hoist the load way up because lwsync allows this.

That scenario would end up with a==1 rather than a==2.

> I always thought this would fail because CPU1's store to @a would fail
> the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> load and now seeing the new value (2).

The stwcx. failure was one thing that prevented a number of other
misordering cases.  The problem is that we have to let go of the notion
of an implicit global clock.

To that end, the herd tool can make a diagram of what it thought
happened, and I have attached it.  I used this diagram to try and force
this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
and succeeded.  Here is the sequence of events:

o   Commit P0's write.  The model offers to propagate this write
to the coherence point and to P1, but don't do so yet.

o   Commit P1's write.  Similar offers, but don't take them up yet.

o   Commit P0's lwsync.

o   Execute P0's lwarx, which reads a=0.  Then commit it.

o   Commit P0's stwcx. as successful.  This stores a=1.

o   Commit P0's branch (not taken).

o   Commit P0's final register-to-register move.

o   Commit P1's sync instruction.

o   There is now nothing that can happen in either processor.
P0 is done, and P1 is waiting for its sync.  Therefore,
propagate P1's a=2 write to the coherence point and to
the other thread.

o   There is still nothing that can happen in either processor.
So pick the barrier propagate, then the acknowledge sync.

o   P1 can now execute its read from x.  Because P0's write to
x is still waiting to propagate to P1, this still reads
x=0.  Execute and commit, and we now have both r3 registers
equal to zero and the final value a=2.

o   Clean up by propagating the write to x everywhere, and
propagating the lwsync.

And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;

I am still not 100% confident of my litmus test.  It is quite possible
that I lost something in translation, but that is looking less likely.

> > Of course, if I am not missing something, then this applies also to the
> > value-returning RMW atomic operations that you pulled this pattern from.
> > If so, it would seem that I didn't think through all the possibilities
> > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > that I worried about the RMW atomic operation acting as a barrier,
> > but not as the load/store itself.  :-/
> 
> AARGH64 does something very similar; it does something like:
> 
>   ll
>   ...
>   sc-release
> 
>   mb
> 
> Which I assumed worked for the same reason, any change to the variable
> would fail the sc, and we go for round 2, now observing the new value.

I have to defer to Will on this one.  You are right that ARM and PowerPC
do have similar memory models, but there are some differences.

Thanx, Paul



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Peter Zijlstra
On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> Suppose we have something like the following, where "a" and "x" are both
> initially zero:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
> If xchg() is fully ordered, we should never observe both CPUs'
> r3 values being zero, correct?
> 
> And wouldn't this be represented by the following litmus test?
> 
>   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
>stwcx. r1,r10,r12  | ;
>bne Fail0  | ;
>mr r3,r11  | ;
>Fail0: | ;
>   exists
>   (0:r3=0 /\ a=2 /\ 1:r3=0)
> 
> I left off P0's trailing sync because there is nothing for it to order
> against in this particular litmus test.  I tried adding it and verified
> that it has no effect.
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.

So the scenario that would fail would be this one, right?

a = x = 0

CPU0CPU1

r3 = load_locked (&a);
a = 2;
sync();
r3 = x;
x = 1;
lwsync();
if (!store_cond(&a, 1))
goto again


Where we hoist the load way up because lwsync allows this.

I always thought this would fail because CPU1's store to @a would fail
the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
load and now seeing the new value (2).

> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.
> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/

AARGH64 does something very similar; it does something like:

ll
...
sc-release

mb

Which I assumed worked for the same reason, any change to the variable
would fail the sc, and we go for round 2, now observing the new value.
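
Roughly like the following sketch, if I read the code right (ptr and
new are placeholders, and in the real code the trailing dmb ish comes
from a separate smp_mb() after the ll/sc loop):

        unsigned long ret, tmp;

        asm volatile(
        "1:     ldxr    %w0, %2         \n"     /* ll: load-exclusive          */
        "       stlxr   %w1, %w3, %2    \n"     /* sc-release: store-exclusive */
        "       cbnz    %w1, 1b         \n"     /* exclusivity lost, retry     */
        "       dmb     ish             \n"     /* trailing full barrier (mb)  */
        : "=&r" (ret), "=&r" (tmp), "+Q" (*(u32 *)ptr)
        : "r" (new)
        : "memory");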


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Paul E. McKenney
On Wed, Oct 14, 2015 at 11:55:56PM +0800, Boqun Feng wrote:
> According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
> versions all need to imply a full barrier, however they are now just
> RELEASE+ACQUIRE, which is not a full barrier.
> 
> So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
> PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
> __{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
> semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().
> 
> This patch is a complement of commit b97021f85517 ("powerpc: Fix
> atomic_xxx_return barrier semantics").
> 
> Acked-by: Michael Ellerman 
> Cc:  # 3.4+
> Signed-off-by: Boqun Feng 
> ---
>  arch/powerpc/include/asm/cmpxchg.h | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/cmpxchg.h 
> b/arch/powerpc/include/asm/cmpxchg.h
> index ad6263c..d1a8d93 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
>   unsigned long prev;
> 
>   __asm__ __volatile__(
> - PPC_RELEASE_BARRIER
> + PPC_ATOMIC_ENTRY_BARRIER

This looks to be the lwsync instruction.

>  "1:  lwarx   %0,0,%2 \n"
>   PPC405_ERR77(0,%2)
>  "stwcx.  %3,0,%2 \n\
>   bne-1b"
> - PPC_ACQUIRE_BARRIER
> + PPC_ATOMIC_EXIT_BARRIER

And this looks to be the sync instruction.

>   : "=" (prev), "+m" (*(volatile unsigned int *)p)
>   : "r" (p), "r" (val)
>   : "cc", "memory");

Hmmm...

Suppose we have something like the following, where "a" and "x" are both
initially zero:

CPU 0   CPU 1
-   -

WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
r3 = xchg(&a, 1);   smp_mb();
r3 = READ_ONCE(x);

If xchg() is fully ordered, we should never observe both CPUs'
r3 values being zero, correct?

And wouldn't this be represented by the following litmus test?

PPC SB+lwsync-RMW2-lwsync+st-sync-leading
""
{
0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
}
 P0 | P1 ;
 stw r1,0(r2)   | stw r1,0(r12)  ;
 lwsync | sync   ;
 lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
 stwcx. r1,r10,r12  | ;
 bne Fail0  | ;
 mr r3,r11  | ;
 Fail0: | ;
exists
(0:r3=0 /\ a=2 /\ 1:r3=0)

I left off P0's trailing sync because there is nothing for it to order
against in this particular litmus test.  I tried adding it and verified
that it has no effect.

Am I missing something here?  If not, it seems to me that you need
the leading lwsync to instead be a sync.

Of course, if I am not missing something, then this applies also to the
value-returning RMW atomic operations that you pulled this pattern from.
If so, it would seem that I didn't think through all the possibilities
back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
that I worried about the RMW atomic operation acting as a barrier,
but not as the load/store itself.  :-/

Thanx, Paul

> @@ -61,12 +61,12 @@ __xchg_u64(volatile void *p, unsigned long val)
>   unsigned long prev;
> 
>   __asm__ __volatile__(
> - PPC_RELEASE_BARRIER
> + PPC_ATOMIC_ENTRY_BARRIER
>  "1:  ldarx   %0,0,%2 \n"
>   PPC405_ERR77(0,%2)
>  "stdcx.  %3,0,%2 \n\
>   bne-1b"
> - PPC_ACQUIRE_BARRIER
> + PPC_ATOMIC_EXIT_BARRIER
>   : "=" (prev), "+m" (*(volatile unsigned long *)p)
>   : "r" (p), "r" (val)
>   : "cc", "memory");
> @@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long 
> old, unsigned long new)
>   unsigned int prev;
> 
>   __asm__ __volatile__ (
> - PPC_RELEASE_BARRIER
> + PPC_ATOMIC_ENTRY_BARRIER
>  "1:  lwarx   %0,0,%2 # __cmpxchg_u32\n\
>   cmpw0,%0,%3\n\
>   bne-2f\n"
>   PPC405_ERR77(0,%2)
>  "stwcx.  %4,0,%2\n\
>   bne-1b"
> - PPC_ACQUIRE_BARRIER
> + PPC_ATOMIC_EXIT_BARRIER
>   "\n\
>  2:"
>   : "=" (prev), "+m" (*p)
> @@ -197,13 +197,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long 
> old, unsigned long new)
>   unsigned long prev;
> 
>   __asm__ __volatile__ (
> - PPC_RELEASE_BARRIER
> + PPC_ATOMIC_ENTRY_BARRIER
>  "1:  ldarx   %0,0,%2 # __cmpxchg_u64\n\
>   cmpd0,%0,%3\n\
>   bne-2f\n\
>   stdcx.  %4,0,%2\n\
>   bne-1b"
> - PPC_ACQUIRE_BARRIER
> + PPC_ATOMIC_EXIT_BARRIER
>   "\n\
>  2:"
>   : "=" (prev), "+m" (*p)
> -- 
> 2.5.3
> 


[PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
According to memory-barriers.txt, xchg, cmpxchg and their atomic{,64}_
versions all need to imply a full barrier, however they are now just
RELEASE+ACQUIRE, which is not a full barrier.

So replace PPC_RELEASE_BARRIER and PPC_ACQUIRE_BARRIER with
PPC_ATOMIC_ENTRY_BARRIER and PPC_ATOMIC_EXIT_BARRIER in
__{cmp,}xchg_{u32,u64} respectively to guarantee a full barrier
semantics of atomic{,64}_{cmp,}xchg() and {cmp,}xchg().

This patch is a complement of commit b97021f85517 ("powerpc: Fix
atomic_xxx_return barrier semantics").

Acked-by: Michael Ellerman 
Cc:  # 3.4+
Signed-off-by: Boqun Feng 
---
 arch/powerpc/include/asm/cmpxchg.h | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/cmpxchg.h 
b/arch/powerpc/include/asm/cmpxchg.h
index ad6263c..d1a8d93 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -18,12 +18,12 @@ __xchg_u32(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stwcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned int *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -61,12 +61,12 @@ __xchg_u64(volatile void *p, unsigned long val)
unsigned long prev;
 
__asm__ __volatile__(
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 \n"
PPC405_ERR77(0,%2)
 "  stdcx.  %3,0,%2 \n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
: "=" (prev), "+m" (*(volatile unsigned long *)p)
: "r" (p), "r" (val)
: "cc", "memory");
@@ -151,14 +151,14 @@ __cmpxchg_u32(volatile unsigned int *p, unsigned long 
old, unsigned long new)
unsigned int prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:lwarx   %0,0,%2 # __cmpxchg_u32\n\
cmpw0,%0,%3\n\
bne-2f\n"
PPC405_ERR77(0,%2)
 "  stwcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=" (prev), "+m" (*p)
@@ -197,13 +197,13 @@ __cmpxchg_u64(volatile unsigned long *p, unsigned long 
old, unsigned long new)
unsigned long prev;
 
__asm__ __volatile__ (
-   PPC_RELEASE_BARRIER
+   PPC_ATOMIC_ENTRY_BARRIER
 "1:ldarx   %0,0,%2 # __cmpxchg_u64\n\
cmpd0,%0,%3\n\
bne-2f\n\
stdcx.  %4,0,%2\n\
bne-1b"
-   PPC_ACQUIRE_BARRIER
+   PPC_ATOMIC_EXIT_BARRIER
"\n\
 2:"
: "=" (prev), "+m" (*p)
-- 
2.5.3



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Peter Zijlstra
On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> Suppose we have something like the following, where "a" and "x" are both
> initially zero:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
> If xchg() is fully ordered, we should never observe both CPUs'
> r3 values being zero, correct?
> 
> And wouldn't this be represented by the following litmus test?
> 
>   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
>stwcx. r1,r10,r12  | ;
>bne Fail0  | ;
>mr r3,r11  | ;
>Fail0: | ;
>   exists
>   (0:r3=0 /\ a=2 /\ 1:r3=0)
> 
> I left off P0's trailing sync because there is nothing for it to order
> against in this particular litmus test.  I tried adding it and verified
> that it has no effect.
> 
> Am I missing something here?  If not, it seems to me that you need
> the leading lwsync to instead be a sync.

So the scenario that would fail would be this one, right?

a = x = 0

	CPU0				CPU1

	r3 = load_locked (&a);
					a = 2;
					sync();
					r3 = x;
	x = 1;
	lwsync();
	if (!store_cond(&a, 1))
		goto again


Where we hoist the load way up because lwsync allows this.

I always thought this would fail because CPU1's store to @a would fail
the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
load and now seeing the new value (2).
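
(That intuition is about the retry shape of the loop itself; a stand-alone C11
sketch of the same shape, using a CAS loop only because plain C has no direct
lwarx/stwcx. equivalent:)

	#include <stdatomic.h>

	unsigned long xchg_sketch(_Atomic unsigned long *p, unsigned long newval)
	{
		unsigned long old = atomic_load_explicit(p, memory_order_relaxed);

		/* Retry until no other store slipped in between read and update;
		 * on failure, "old" is refreshed with the newly observed value. */
		while (!atomic_compare_exchange_weak_explicit(p, &old, newval,
							      memory_order_relaxed,
							      memory_order_relaxed))
			;

		return old;
	}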

> Of course, if I am not missing something, then this applies also to the
> value-returning RMW atomic operations that you pulled this pattern from.
> If so, it would seem that I didn't think through all the possibilities
> back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> that I worried about the RMW atomic operation acting as a barrier,
> but not as the load/store itself.  :-/

AARGH64 does something very similar; it does something like:

ll
...
sc-release

mb

Which I assumed worked for the same reason, any change to the variable
would fail the sc, and we go for round 2, now observing the new value.


Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Paul E. McKenney
On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > Suppose we have something like the following, where "a" and "x" are both
> > initially zero:
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > r3 = xchg(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > If xchg() is fully ordered, we should never observe both CPUs'
> > r3 values being zero, correct?
> > 
> > And wouldn't this be represented by the following litmus test?
> > 
> > PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > ""
> > {
> > 0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > 1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > }
> >  P0 | P1 ;
> >  stw r1,0(r2)   | stw r1,0(r12)  ;
> >  lwsync | sync   ;
> >  lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> >  stwcx. r1,r10,r12  | ;
> >  bne Fail0  | ;
> >  mr r3,r11  | ;
> >  Fail0: | ;
> > exists
> > (0:r3=0 /\ a=2 /\ 1:r3=0)
> > 
> > I left off P0's trailing sync because there is nothing for it to order
> > against in this particular litmus test.  I tried adding it and verified
> > that it has no effect.
> > 
> > Am I missing something here?  If not, it seems to me that you need
> > the leading lwsync to instead be a sync.
> 
> So the scenario that would fail would be this one, right?
> 
> a = x = 0
> 
>   CPU0CPU1
> 
>   r3 = load_locked (&a);
>   a = 2;
>   sync();
>   r3 = x;
>   x = 1;
>   lwsync();
>   if (!store_cond(&a, 1))
>   goto again
> 
> 
> Where we hoist the load way up because lwsync allows this.

That scenario would end up with a==1 rather than a==2.

> I always thought this would fail because CPU1's store to @a would fail
> the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> load and now seeing the new value (2).

The stwcx. failure was one thing that prevented a number of other
misordering cases.  The problem is that we have to let go of the notion
of an implicit global clock.

To that end, the herd tool can make a diagram of what it thought
happened, and I have attached it.  I used this diagram to try and force
this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
and succeeded.  Here is the sequence of events:

o   Commit P0's write.  The model offers to propagate this write
to the coherence point and to P1, but don't do so yet.

o   Commit P1's write.  Similar offers, but don't take them up yet.

o   Commit P0's lwsync.

o   Execute P0's lwarx, which reads a=0.  Then commit it.

o   Commit P0's stwcx. as successful.  This stores a=1.

o   Commit P0's branch (not taken).

o   Commit P0's final register-to-register move.

o   Commit P1's sync instruction.

o   There is now nothing that can happen in either processor.
P0 is done, and P1 is waiting for its sync.  Therefore,
propagate P1's a=2 write to the coherence point and to
the other thread.

o   There is still nothing that can happen in either processor.
So pick the barrier propagate, then the acknowledge sync.

o   P1 can now execute its read from x.  Because P0's write to
x is still waiting to propagate to P1, this still reads
x=0.  Execute and commit, and we now have both r3 registers
equal to zero and the final value a=2.

o   Clean up by propagating the write to x everywhere, and
propagating the lwsync.

And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;

I am still not 100% confident of my litmus test.  It is quite possible
that I lost something in translation, but that is looking less likely.

> > Of course, if I am not missing something, then this applies also to the
> > value-returning RMW atomic operations that you pulled this pattern from.
> > If so, it would seem that I didn't think through all the possibilities
> > back when PPC_ATOMIC_EXIT_BARRIER moved to sync...  In fact, I believe
> > that I worried about the RMW atomic operation acting as a barrier,
> > but not as the load/store itself.  :-/
> 
> AARGH64 does something very similar; it does something like:
> 
>   ll
>   ...
>   sc-release
> 
>   mb
> 
> Which I assumed worked for the same reason, any change to the variable
> would fail the sc, and we go for round 2, now observing the new value.

I have to defer to Will on this one.  You are right that ARM and PowerPC
do have similar memory models, but there are some differences.

Thanx, Paul



Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > Suppose we have something like the following, where "a" and "x" are both
> > > > initially zero:
> > > > 
> > > > CPU 0   CPU 1
> > > > -   -
> > > > 
> > > > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > > > r3 = xchg(&a, 1);   smp_mb();
> > > > r3 = READ_ONCE(x);
> > > > 
> > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > r3 values being zero, correct?
> > > > 
> > > > And wouldn't this be represented by the following litmus test?
> > > > 
> > > > PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > ""
> > > > {
> > > > 0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > 1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > }
> > > >  P0 | P1 ;
> > > >  stw r1,0(r2)   | stw r1,0(r12)  ;
> > > >  lwsync | sync   ;
> > > >  lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > > >  stwcx. r1,r10,r12  | ;
> > > >  bne Fail0  | ;
> > > >  mr r3,r11  | ;
> > > >  Fail0: | ;
> > > > exists
> > > > (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > 
> > > > I left off P0's trailing sync because there is nothing for it to order
> > > > against in this particular litmus test.  I tried adding it and verified
> > > > that it has no effect.
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> 
> I'm afraid more than that, the above litmus also shows that
> 

I mean there will be more things we need to fix; perhaps even smp_wmb()
needs to be sync then..

Regards,
Boqun

>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted
> 
> in the implementation of this patchset, which should be disallowed by
> the semantics of RELEASE, right?
> 
> And even:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   smp_store_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (1:r3 == 0 && a == 2) is not prohibitted
> 
> shows by:
> 
>   PPC weird-lwsync
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>stw  r1,0(r12) | lwz r3,0(r2)   ;
>   exists
>   (a=2 /\ 1:r3=0)
> 
> 
> Please find something I'm (or the tool is) missing, maybe we can't use
> (a == 2) as a indication that STORE on CPU 1 happens after STORE on CPU
> 0?
> 
> And there is really something I find strange, see below.
> 
> > > 
> > > So the scenario that would fail would be this one, right?
> > > 
> > > a = x = 0
> > > 
> > >   CPU0CPU1
> > > 
> > >   r3 = load_locked (&a);
> > >   a = 2;
> > >   sync();
> > >   r3 = x;
> > >   x = 1;
> > >   lwsync();
> > >   if (!store_cond(&a, 1))
> > >   goto again
> > > 
> > > 
> > > Where we hoist the load way up because lwsync allows this.
> > 
> > That scenario would end up with a==1 rather than a==2.
> > 
> > > I always thought this would fail because CPU1's store to @a would fail
> > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > load and now seeing the new value (2).
> > 
> > The stwcx. failure was one thing that prevented a number of other
> > misordering cases.  The problem is that we have to let go of the notion
> > of an implicit global clock.
> > 
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> > 
> > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o   Commit P0's lwsync.
> > 
> > o   Execute 

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > Suppose we have something like the following, where "a" and "x" are both
> > > initially zero:
> > > 
> > >   CPU 0   CPU 1
> > >   -   -
> > > 
> > >   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > >   r3 = xchg(&a, 1);   smp_mb();
> > >   r3 = READ_ONCE(x);
> > > 
> > > If xchg() is fully ordered, we should never observe both CPUs'
> > > r3 values being zero, correct?
> > > 
> > > And wouldn't this be represented by the following litmus test?
> > > 
> > >   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > >   ""
> > >   {
> > >   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > >   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > >   }
> > >P0 | P1 ;
> > >stw r1,0(r2)   | stw r1,0(r12)  ;
> > >lwsync | sync   ;
> > >lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > >stwcx. r1,r10,r12  | ;
> > >bne Fail0  | ;
> > >mr r3,r11  | ;
> > >Fail0: | ;
> > >   exists
> > >   (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > 
> > > I left off P0's trailing sync because there is nothing for it to order
> > > against in this particular litmus test.  I tried adding it and verified
> > > that it has no effect.
> > > 
> > > Am I missing something here?  If not, it seems to me that you need
> > > the leading lwsync to instead be a sync.

I'm afraid more than that, the above litmus also shows that

	CPU 0				CPU 1
	-				-

	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
	r3 = xchg_release(&a, 1);	smp_mb();
					r3 = READ_ONCE(x);

(0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted

in the implementation of this patchset, which should be disallowed by
the semantics of RELEASE, right?

And even:

	CPU 0				CPU 1
	-				-

	WRITE_ONCE(x, 1);		WRITE_ONCE(a, 2);
	smp_store_release(&a, 1);	smp_mb();
					r3 = READ_ONCE(x);

(1:r3 == 0 && a == 2) is not prohibitted

shows by:

PPC weird-lwsync
""
{
0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
}
 P0 | P1 ;
 stw r1,0(r2)   | stw r1,0(r12)  ;
 lwsync | sync   ;
 stw  r1,0(r12) | lwz r3,0(r2)   ;
exists
(a=2 /\ 1:r3=0)


Please find something I'm (or the tool is) missing; maybe we can't use
(a == 2) as an indication that the STORE on CPU 1 happens after the STORE on
CPU 0?

And there is really something I find strange, see below.

> > 
> > So the scenario that would fail would be this one, right?
> > 
> > a = x = 0
> > 
> > CPU0CPU1
> > 
> > r3 = load_locked (&a);
> > a = 2;
> > sync();
> > r3 = x;
> > x = 1;
> > lwsync();
> > if (!store_cond(&a, 1))
> > goto again
> > 
> > 
> > Where we hoist the load way up because lwsync allows this.
> 
> That scenario would end up with a==1 rather than a==2.
> 
> > I always thought this would fail because CPU1's store to @a would fail
> > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > load and now seeing the new value (2).
> 
> The stwcx. failure was one thing that prevented a number of other
> misordering cases.  The problem is that we have to let go of the notion
> of an implicit global clock.
> 
> To that end, the herd tool can make a diagram of what it thought
> happened, and I have attached it.  I used this diagram to try and force
> this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> and succeeded.  Here is the sequence of events:
> 
> o Commit P0's write.  The model offers to propagate this write
>   to the coherence point and to P1, but don't do so yet.
> 
> o Commit P1's write.  Similar offers, but don't take them up yet.
> 
> o Commit P0's lwsync.
> 
> o Execute P0's lwarx, which reads a=0.  Then commit it.
> 
> o Commit P0's stwcx. as successful.  This stores a=1.
> 
> o Commit P0's branch (not taken).
> 

So at this point, P0's write to 'a' has propagated to P1, right? But
P0's write to 'x' hasn't, even there is a lwsync between them, right?
Doesn't the lwsync prevent this from happening?

If at this point P0's write to 'a' hasn't propagated then when?

Regards,
Boqun

> o Commit P0's final register-to-register move.
> 
> o  

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 11:11:01AM +0800, Boqun Feng wrote:
> Hi Paul,
> 
> On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> [snip]
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o Commit P0's write.  The model offers to propagate this write
> > >   to the coherence point and to P1, but don't do so yet.
> > > 
> > > o Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o Commit P0's lwsync.
> > > 
> > > o Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o Commit P0's stwcx. as successful.  This stores a=1.
> > > 
> > > o Commit P0's branch (not taken).
> > > 
> > 
> > So at this point, P0's write to 'a' has propagated to P1, right? But
> > P0's write to 'x' hasn't, even there is a lwsync between them, right?
> > Doesn't the lwsync prevent this from happening?
> > 
> > If at this point P0's write to 'a' hasn't propagated then when?
> 
> Hmm.. I played around ppcmem, and figured out what happens to
> propagation of P0's write to 'a':
> 
> At this point, or some point after store 'a' to 1 and before sync on
> P1 finish, writes to 'a' reachs a coherence point which 'a' is 2, so
> P0's write to 'a' "fails" and will not propagate.
> 
> I probably misunderstood the word "propagate", which actually means an
> already coherent write gets seen by another CPU, right?

It is quite possible for a given write to take a position in the coherence
order that guarantees that no one will see it, as is the case here.
But yes, all readers will see an order of values for a given memory
location that is consistent with the coherence order.

> So my question should be:
> 
> As lwsync can order P0's write to 'a' happens after P0's write to 'x',
> why P0's write to 'x' isn't seen by P1 after P1's write to 'a' overrides
> P0's?

There is no global clock for PPC's memory model.

> But ppcmem gave me the answer ;-) lwsync won't wait under P0's write to
> 'x' gets propagated, and if P0's write to 'a' "wins" in write coherence,
> lwsync will guarantee propagation of 'x' happens before that of 'a', but
> if P0's write to 'a' "fails", there will be no propagation of 'a' from
> P0. So that lwsync can't do anything here.

I believe that this is consistent, but the corners can get tricky.

Thanx, Paul

> Regards,
> Boqun
> 
> > 
> > > o Commit P0's final register-to-register move.
> > > 
> > > o Commit P1's sync instruction.
> > > 
> > > o There is now nothing that can happen in either processor.
> > >   P0 is done, and P1 is waiting for its sync.  Therefore,
> > >   propagate P1's a=2 write to the coherence point and to
> > >   the other thread.
> > > 
> > > o There is still nothing that can happen in either processor.
> > >   So pick the barrier propagate, then the acknowledge sync.
> > > 
> > > o P1 can now execute its read from x.  Because P0's write to
> > >   x is still waiting to propagate to P1, this still reads
> > >   x=0.  Execute and commit, and we now have both r3 registers
> > >   equal to zero and the final value a=2.
> > > 
> > > o Clean up by propagating the write to x everywhere, and
> > >   propagating the lwsync.
> > > 
> > > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > > 
> > > I am still not 100% confident of my litmus test.  It is quite possible
> > > that I lost something in translation, but that is looking less likely.
> > > 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 09:22:26AM +0800, Boqun Feng wrote:
> On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> > On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > > Suppose we have something like the following, where "a" and "x" are 
> > > > > both
> > > > > initially zero:
> > > > > 
> > > > >   CPU 0   CPU 1
> > > > >   -   -
> > > > > 
> > > > >   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > > > >   r3 = xchg(&a, 1);   smp_mb();
> > > > >   r3 = READ_ONCE(x);
> > > > > 
> > > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > > r3 values being zero, correct?
> > > > > 
> > > > > And wouldn't this be represented by the following litmus test?
> > > > > 
> > > > >   PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > >   ""
> > > > >   {
> > > > >   0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > >   1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > >   }
> > > > >P0 | P1 ;
> > > > >stw r1,0(r2)   | stw r1,0(r12)  ;
> > > > >lwsync | sync   ;
> > > > >lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > > > >stwcx. r1,r10,r12  | ;
> > > > >bne Fail0  | ;
> > > > >mr r3,r11  | ;
> > > > >Fail0: | ;
> > > > >   exists
> > > > >   (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > > 
> > > > > I left off P0's trailing sync because there is nothing for it to order
> > > > > against in this particular litmus test.  I tried adding it and 
> > > > > verified
> > > > > that it has no effect.
> > > > > 
> > > > > Am I missing something here?  If not, it seems to me that you need
> > > > > the leading lwsync to instead be a sync.
> > 
> > I'm afraid more than that, the above litmus also shows that
> 
> I mean there will be more things we need to fix, perhaps even smp_wmb()
> need to be sync then..

That should not be necessary.  For smp_wmb(), lwsync should be just fine.
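
(A small sketch of the pairing smp_wmb() has to serve -- store->store ordering
against a reader's smp_rmb(); kernel-style API, made-up function names:)

	int x, y;

	void writer(void)		/* CPU 0 */
	{
		WRITE_ONCE(x, 1);
		smp_wmb();		/* lwsync on powerpc: orders the two stores */
		WRITE_ONCE(y, 1);
	}

	void reader(void)		/* CPU 1 */
	{
		if (READ_ONCE(y)) {
			smp_rmb();	/* pairs with the smp_wmb() above */
			BUG_ON(READ_ONCE(x) != 1);
		}
	}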

Thanx, Paul

> Regards,
> Boqun
> 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > r3 = xchg_release(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted
> > 
> > in the implementation of this patchset, which should be disallowed by
> > the semantics of RELEASE, right?
> > 
> > And even:
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > smp_store_release(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > (1:r3 == 0 && a == 2) is not prohibitted
> > 
> > shows by:
> > 
> > PPC weird-lwsync
> > ""
> > {
> > 0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> > 1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> > }
> >  P0 | P1 ;
> >  stw r1,0(r2)   | stw r1,0(r12)  ;
> >  lwsync | sync   ;
> >  stw  r1,0(r12) | lwz r3,0(r2)   ;
> > exists
> > (a=2 /\ 1:r3=0)
> > 
> > 
> > Please find something I'm (or the tool is) missing, maybe we can't use
> > (a == 2) as a indication that STORE on CPU 1 happens after STORE on CPU
> > 0?
> > 
> > And there is really something I find strange, see below.
> > 
> > > > 
> > > > So the scenario that would fail would be this one, right?
> > > > 
> > > > a = x = 0
> > > > 
> > > > CPU0CPU1
> > > > 
> > > > r3 = load_locked (&a);
> > > > a = 2;
> > > > sync();
> > > > r3 = x;
> > > > x = 1;
> > > > lwsync();
> > > > if (!store_cond(&a, 1))
> > > > goto again
> > > > 
> > > > 
> > > > Where we hoist the load way up because lwsync allows this.
> > > 
> > > That scenario would end up with a==1 rather than a==2.
> > > 
> > > > I always thought this would fail because CPU1's store to @a would fail
> > > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > > load and now seeing the new value (2).
> > > 
> > > The stwcx. failure was one thing that prevented a number of other
> > > misordering cases.  The problem is that we have to let go of the notion
> > > of an implicit global clock.
> > > 
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have 

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Paul E. McKenney
On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
> > On Wed, Oct 14, 2015 at 11:04:19PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 14, 2015 at 01:19:17PM -0700, Paul E. McKenney wrote:
> > > > Suppose we have something like the following, where "a" and "x" are both
> > > > initially zero:
> > > > 
> > > > CPU 0   CPU 1
> > > > -   -
> > > > 
> > > > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > > > r3 = xchg(&a, 1);   smp_mb();
> > > > r3 = READ_ONCE(x);
> > > > 
> > > > If xchg() is fully ordered, we should never observe both CPUs'
> > > > r3 values being zero, correct?
> > > > 
> > > > And wouldn't this be represented by the following litmus test?
> > > > 
> > > > PPC SB+lwsync-RMW2-lwsync+st-sync-leading
> > > > ""
> > > > {
> > > > 0:r1=1; 0:r2=x; 0:r3=3; 0:r10=0 ; 0:r11=0; 0:r12=a;
> > > > 1:r1=2; 1:r2=x; 1:r3=3; 1:r10=0 ; 1:r11=0; 1:r12=a;
> > > > }
> > > >  P0 | P1 ;
> > > >  stw r1,0(r2)   | stw r1,0(r12)  ;
> > > >  lwsync | sync   ;
> > > >  lwarx  r11,r10,r12 | lwz r3,0(r2)   ;
> > > >  stwcx. r1,r10,r12  | ;
> > > >  bne Fail0  | ;
> > > >  mr r3,r11  | ;
> > > >  Fail0: | ;
> > > > exists
> > > > (0:r3=0 /\ a=2 /\ 1:r3=0)
> > > > 
> > > > I left off P0's trailing sync because there is nothing for it to order
> > > > against in this particular litmus test.  I tried adding it and verified
> > > > that it has no effect.
> > > > 
> > > > Am I missing something here?  If not, it seems to me that you need
> > > > the leading lwsync to instead be a sync.
> 
> I'm afraid more than that, the above litmus also shows that
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   r3 = xchg_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted
> 
> in the implementation of this patchset, which should be disallowed by
> the semantics of RELEASE, right?

Not necessarily.  If you had the read first on CPU 1, and you had a
similar problem, I would be more worried.

> And even:
> 
>   CPU 0   CPU 1
>   -   -
> 
>   WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
>   smp_store_release(&a, 1);   smp_mb();
>   r3 = READ_ONCE(x);
> 
>   (1:r3 == 0 && a == 2) is not prohibitted
> 
> shows by:
> 
>   PPC weird-lwsync
>   ""
>   {
>   0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
>   1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
>   }
>P0 | P1 ;
>stw r1,0(r2)   | stw r1,0(r12)  ;
>lwsync | sync   ;
>stw  r1,0(r12) | lwz r3,0(r2)   ;
>   exists
>   (a=2 /\ 1:r3=0)
> 
> Please find something I'm (or the tool is) missing, maybe we can't use
> (a == 2) as a indication that STORE on CPU 1 happens after STORE on CPU
> 0?

Again, if you were pairing the smp_store_release() with an smp_load_acquire()
or even a READ_ONCE() followed by a barrier, I would be quite concerned.
I am not at all worried about the above two litmus tests.
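
(The pairing meant here, as a sketch -- the message-passing case that would be
worrying if it could fail; kernel-style API, names made up:)

	int data, flag;

	void producer(void)		/* CPU 0 */
	{
		WRITE_ONCE(data, 42);
		smp_store_release(&flag, 1);	/* orders the data store before flag */
	}

	void consumer(void)		/* CPU 1 */
	{
		if (smp_load_acquire(&flag))	/* pairs with the release above */
			BUG_ON(READ_ONCE(data) != 42);
	}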

> And there is really something I find strange, see below.
> 
> > > 
> > > So the scenario that would fail would be this one, right?
> > > 
> > > a = x = 0
> > > 
> > >   CPU0CPU1
> > > 
> > >   r3 = load_locked (&a);
> > >   a = 2;
> > >   sync();
> > >   r3 = x;
> > >   x = 1;
> > >   lwsync();
> > >   if (!store_cond(&a, 1))
> > >   goto again
> > > 
> > > 
> > > Where we hoist the load way up because lwsync allows this.
> > 
> > That scenario would end up with a==1 rather than a==2.
> > 
> > > I always thought this would fail because CPU1's store to @a would fail
> > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > load and now seeing the new value (2).
> > 
> > The stwcx. failure was one thing that prevented a number of other
> > misordering cases.  The problem is that we have to let go of the notion
> > of an implicit global clock.
> > 
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this 

Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
Hi Paul,

On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
> On Wed, Oct 14, 2015 at 02:44:53PM -0700, Paul E. McKenney wrote:
[snip]
> > To that end, the herd tool can make a diagram of what it thought
> > happened, and I have attached it.  I used this diagram to try and force
> > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > and succeeded.  Here is the sequence of events:
> > 
> > o   Commit P0's write.  The model offers to propagate this write
> > to the coherence point and to P1, but don't do so yet.
> > 
> > o   Commit P1's write.  Similar offers, but don't take them up yet.
> > 
> > o   Commit P0's lwsync.
> > 
> > o   Execute P0's lwarx, which reads a=0.  Then commit it.
> > 
> > o   Commit P0's stwcx. as successful.  This stores a=1.
> > 
> > o   Commit P0's branch (not taken).
> > 
> 
> So at this point, P0's write to 'a' has propagated to P1, right? But
> P0's write to 'x' hasn't, even there is a lwsync between them, right?
> Doesn't the lwsync prevent this from happening?
> 
> If at this point P0's write to 'a' hasn't propagated then when?
> 

Hmm.. I played around ppcmem, and figured out what happens to
propagation of P0's write to 'a':

At this point, or at some point after the store of 1 to 'a' and before the
sync on P1 finishes, the writes to 'a' reach a coherence point at which 'a'
is 2, so P0's write to 'a' "fails" and will not propagate.


I probably misunderstood the word "propagate", which actually means an
already coherent write gets seen by another CPU, right?

So my question should be:

As lwsync orders P0's write to 'a' after P0's write to 'x', why isn't P0's
write to 'x' seen by P1 once P1's write to 'a' overrides P0's?

But ppcmem gave me the answer ;-) lwsync won't wait until P0's write to 'x'
gets propagated, and if P0's write to 'a' "wins" the write coherence, lwsync
will guarantee that the propagation of 'x' happens before that of 'a'; but if
P0's write to 'a' "fails", there will be no propagation of 'a' from P0, so
lwsync can't do anything here.

Regards,
Boqun

> 
> > o   Commit P0's final register-to-register move.
> > 
> > o   Commit P1's sync instruction.
> > 
> > o   There is now nothing that can happen in either processor.
> > P0 is done, and P1 is waiting for its sync.  Therefore,
> > propagate P1's a=2 write to the coherence point and to
> > the other thread.
> > 
> > o   There is still nothing that can happen in either processor.
> > So pick the barrier propagate, then the acknowledge sync.
> > 
> > o   P1 can now execute its read from x.  Because P0's write to
> > x is still waiting to propagate to P1, this still reads
> > x=0.  Execute and commit, and we now have both r3 registers
> > equal to zero and the final value a=2.
> > 
> > o   Clean up by propagating the write to x everywhere, and
> > propagating the lwsync.
> > 
> > And the "exists" clause really does trigger: 0:r3=0; 1:r3=0; [a]=2;
> > 
> > I am still not 100% confident of my litmus test.  It is quite possible
> > that I lost something in translation, but that is looking less likely.
> > 




Re: [PATCH tip/locking/core v4 1/6] powerpc: atomic: Make *xchg and *cmpxchg a full barrier

2015-10-14 Thread Boqun Feng
On Wed, Oct 14, 2015 at 08:07:05PM -0700, Paul E. McKenney wrote:
> On Thu, Oct 15, 2015 at 08:53:21AM +0800, Boqun Feng wrote:
[snip]
> > 
> > I'm afraid more than that, the above litmus also shows that
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > r3 = xchg_release(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > (0:r3 == 0 && 1:r3 == 0 && a == 2) is not prohibitted
> > 
> > in the implementation of this patchset, which should be disallowed by
> > the semantics of RELEASE, right?
> 
> Not necessarily.  If you had the read first on CPU 1, and you had a
> similar problem, I would be more worried.
> 

Sometimes I think maybe we should say that a single unpaired ACQUIRE or
RELEASE doesn't have any order guarantee because of the above case.

But it seems that's not a normal or even an existing case, my bad ;-(

> > And even:
> > 
> > CPU 0   CPU 1
> > -   -
> > 
> > WRITE_ONCE(x, 1);   WRITE_ONCE(a, 2);
> > smp_store_release(&a, 1);   smp_mb();
> > r3 = READ_ONCE(x);
> > 
> > (1:r3 == 0 && a == 2) is not prohibitted
> > 
> > shows by:
> > 
> > PPC weird-lwsync
> > ""
> > {
> > 0:r1=1; 0:r2=x; 0:r3=3; 0:r12=a;
> > 1:r1=2; 1:r2=x; 1:r3=3; 1:r12=a;
> > }
> >  P0 | P1 ;
> >  stw r1,0(r2)   | stw r1,0(r12)  ;
> >  lwsync | sync   ;
> >  stw  r1,0(r12) | lwz r3,0(r2)   ;
> > exists
> > (a=2 /\ 1:r3=0)
> > 
> > Please find something I'm (or the tool is) missing, maybe we can't use
> > (a == 2) as a indication that STORE on CPU 1 happens after STORE on CPU
> > 0?
> 
> Again, if you were pairing the smp_store_release() with an smp_load_acquire()
> or even a READ_ONCE() followed by a barrier, I would be quite concerned.
> I am not at all worried about the above two litmus tests.
> 

Understood, thank you for think through that ;-)

> > And there is really something I find strange, see below.
> > 
> > > > 
> > > > So the scenario that would fail would be this one, right?
> > > > 
> > > > a = x = 0
> > > > 
> > > > CPU0CPU1
> > > > 
> > > > r3 = load_locked (&a);
> > > > a = 2;
> > > > sync();
> > > > r3 = x;
> > > > x = 1;
> > > > lwsync();
> > > > if (!store_cond(&a, 1))
> > > > goto again
> > > > 
> > > > 
> > > > Where we hoist the load way up because lwsync allows this.
> > > 
> > > That scenario would end up with a==1 rather than a==2.
> > > 
> > > > I always thought this would fail because CPU1's store to @a would fail
> > > > the store_cond() on CPU0 and we'd do the 'again' thing, re-issuing the
> > > > load and now seeing the new value (2).
> > > 
> > > The stwcx. failure was one thing that prevented a number of other
> > > misordering cases.  The problem is that we have to let go of the notion
> > > of an implicit global clock.
> > > 
> > > To that end, the herd tool can make a diagram of what it thought
> > > happened, and I have attached it.  I used this diagram to try and force
> > > this scenario at https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html#PPC,
> > > and succeeded.  Here is the sequence of events:
> > > 
> > > o Commit P0's write.  The model offers to propagate this write
> > >   to the coherence point and to P1, but don't do so yet.
> > > 
> > > o Commit P1's write.  Similar offers, but don't take them up yet.
> > > 
> > > o Commit P0's lwsync.
> > > 
> > > o Execute P0's lwarx, which reads a=0.  Then commit it.
> > > 
> > > o Commit P0's stwcx. as successful.  This stores a=1.
> > > 
> > > o Commit P0's branch (not taken).
> > 
> > So at this point, P0's write to 'a' has propagated to P1, right? But
> > P0's write to 'x' hasn't, even there is a lwsync between them, right?
> > Doesn't the lwsync prevent this from happening?
> 
> No, because lwsync is quite a bit weaker than sync aside from just
> the store-load ordering.
> 

Understood, I've tried the ppcmem, much clear now ;-)

> > If at this point P0's write to 'a' hasn't propagated then when?
> 
> Later.  At the very end of the test, in this case.  ;-)
> 

Hmm.. I tried exactly this sequence in ppcmem; it seems the propagation of
P0's write to 'a' is never an option...

> Why not try creating a longer litmus test that requires P0's write to
> "a" to propagate to P1 before both processes complete?
> 

I will try to write one, but to be clear, you mean we still observe 

0:r3 == 0 && a == 2 && 1:r3 == 0 

at the end, right? Because I understand that if P1's write to 'a'
doesn't override P0's, P0's write to 'a' will propagate.

Regards,
Boqun

