Re: [PATCH RFC LKMM 1/7] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

2018-09-07 Thread Daniel Lustig
On 9/7/2018 10:38 AM, Alan Stern wrote:
> On Fri, 7 Sep 2018, Daniel Lustig wrote:
> 
>> On 9/7/2018 9:09 AM, Will Deacon wrote:
>>> On Fri, Sep 07, 2018 at 12:00:19PM -0400, Alan Stern wrote:
>>>> On Thu, 6 Sep 2018, Andrea Parri wrote:
>>>>
>>>>>> Have you noticed any part of the generic code that relies on ordinary 
>>>>>> acquire-release (rather than atomic RMW acquire-release) in order to 
>>>>>> implement locking constructs?
>>>>>
>>>>> There are several places in code where the "lock-acquire" seems to be
>>>>> provided by an atomic_cond_read_acquire/smp_cond_load_acquire: I have
>>>>> mentioned one in qspinlock in this thread; qrwlock and mcs_spinlock
>>>>> provide other examples (grep for the primitives...).
>>>>>
>>>>> As long as we don't consider these primitives as RMW (which would seem
>>>>> odd...) or as acquires for which "most people expect strong ordering"
>>>>> (see above), these provide other examples of the _gap_ I mentioned.
>>>>
>>>> Okay, now I understand your objection.  It does appear that on RISC-V,
>>>> if nowhere else, the current implementations of qspinlock, qrwlock,
>>>> etc. will not provide "RCtso" ordering.
>>>>
>>>> The discussions surrounding this topic have been so lengthy and 
>>>> confusing that I have lost track of any comments Palmer or Daniel may 
>>>> have made concerning this potential problem.
>>>>
>>>> One possible resolution would be to define smp_cond_load_acquire() 
>>>> specially on RISC-V so that it provided the same ordering guarantees as 
>>>> RMW-acquire.  (Plus adding a comment in the asm-generic/barrier.h 
>>>> pointing out the necessity for the stronger guarantee on all 
>>>> architectures.)
>>>>
>>>> Another would be to replace the usages of atomic/smp_cond_load_acquire 
>>>> in the locking constructs with a new function that would otherwise be 
>>>> the same but would provide the ordering guarantee we want.
>>>>
>>>> Do you think either of these would be an adequate fix?
>>>
>>> I didn't think RISC-V used qspinlock or qrwlock, so I'm not sure there's
>>> actually anything to fix, is there?
>>>
>>> Will
>>
>> I've also lost track of whether the current preference is or is not for
>> RCtso, or in which subset of cases RCtso is currently preferred.  For
>> whichever cases do in fact need to be RCtso, the RISC-V approach would
>> still be the same as what I've written in the past, as far as I can
>> tell [1].
> 
> The patch which Paul plans to send in for the next merge window makes 
> the LKMM require RCtso ordering for spinlocks, and by extension, for 
> all locking operations.  As I understand it, the current RISC-V 
> implementation of spinlocks does provide this ordering.
> 
> We have discussed creating another patch for the LKMM which would
> require RMW-acquire/ordinary-release also to have RCtso ordering.  
>>>> Nobody has written the patch yet, but it would be straightforward.  The
> rationale is that many types of locks are implemented in terms of
> RMW-acquire, so if the locks are required to be RCtso then so should
> the lower-level operations they are built from.
> 
> Will feels strongly (and Linus agrees) that the LKMM should not require
> ordinary acquire and release to be any stronger than RCpc.
> 
> The issue that Andrea raised has to do with qspinlock, qrwlock, and
> mcs_spinlock, which are implemented using smp_cond_load_acquire()  
> instead of RMW-acquire.  This provides only the ordering properties of
> smp_load_acquire(), namely RCpc, which means that qspinlocks etc. might
> not be RCtso.
> 
> Since we do want locks to be RCtso, the question is how to resolve this 
> discrepancy.

Thanks for the summary Alan!

I think RISC-V might actually get RCtso with smp_cond_load_acquire()
implemented using fence r,rw, believe it or not :)

The read->read and read->write requirements are covered by the fence r,rw, so
what we need to add on is the write->write ordering requirement.  On RISC-V,
we can get release semantics in three ways: fence rw,w, AMO.rl, and SC.rl.

If we use fence rw,w for release, then the "w,w" part covers it.

If we use AMO.rl for release, then the prior stores are ordered before the
AMO, and the fence r,rw orders the AMO before subsequent stores.

If we use SC.rl, then the prior stores are ordered before the SC, and the
branch to check whether the SC succeeded creates a control dependency that
orders the SC before the subsequent stores.
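
Concretely, the write->write requirement can be phrased as an LKMM-style
litmus sketch (an editorial illustration with made-up names;
smp_load_acquire() stands in for the smp_cond_load_acquire() loop, whose
ordering it shares).  RCtso locks must forbid the outcome below, even though
the LKMM's ordinary release/acquire rules, once rfi-rel-acq is removed, no
longer forbid it on their own:

	P0(int *x, int *y, int *lock)
	{
		WRITE_ONCE(*x, 1);
		smp_store_release(lock, 0);	/* unlock */
		r0 = smp_load_acquire(lock);	/* lock-acquire */
		WRITE_ONCE(*y, 1);
	}

	P1(int *x, int *y)
	{
		r1 = READ_ONCE(*y);
		smp_rmb();
		r2 = READ_ONCE(*x);
	}

	exists (1:r1=1 /\ 1:r2=0)

The case analysis above is arguing that each of the three RISC-V release
mappings orders the two WRITE_ONCE()s here, so the outcome stays forbidden.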

Re: [PATCH RFC LKMM 1/7] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

2018-09-07 Thread Daniel Lustig
On 9/7/2018 9:09 AM, Will Deacon wrote:
> On Fri, Sep 07, 2018 at 12:00:19PM -0400, Alan Stern wrote:
>> On Thu, 6 Sep 2018, Andrea Parri wrote:
>>
>>>> Have you noticed any part of the generic code that relies on ordinary
>>>> acquire-release (rather than atomic RMW acquire-release) in order to
>>>> implement locking constructs?
>>>
>>> There are several places in code where the "lock-acquire" seems to be
>>> provided by an atomic_cond_read_acquire/smp_cond_load_acquire: I have
>>> mentioned one in qspinlock in this thread; qrwlock and mcs_spinlock
>>> provide other examples (grep for the primitives...).
>>>
>>> As long as we don't consider these primitives as RMW (which would seem
>>> odd...) or as acquires for which "most people expect strong ordering"
>>> (see above), these provide other examples of the _gap_ I mentioned.
>>
>> Okay, now I understand your objection.  It does appear that on RISC-V,
>> if nowhere else, the current implementations of qspinlock, qrwlock,
>> etc. will not provide "RCtso" ordering.
>>
>> The discussions surrounding this topic have been so lengthy and 
>> confusing that I have lost track of any comments Palmer or Daniel may 
>> have made concerning this potential problem.
>>
>> One possible resolution would be to define smp_cond_load_acquire() 
>> specially on RISC-V so that it provided the same ordering guarantees as 
>> RMW-acquire.  (Plus adding a comment in the asm-generic/barrier.h 
>> pointing out the necessity for the stronger guarantee on all 
>> architectures.)
>>
>> Another would be to replace the usages of atomic/smp_cond_load_acquire 
>> in the locking constructs with a new function that would otherwise be 
>> the same but would provide the ordering guarantee we want.
>>
>> Do you think either of these would be an adequate fix?
> 
> I didn't think RISC-V used qspinlock or qrwlock, so I'm not sure there's
> actually anything to fix, is there?
> 
> Will

I've also lost track of whether the current preference is or is not for
RCtso, or in which subset of cases RCtso is currently preferred.  For
whichever cases do in fact need to be RCtso, the RISC-V approach would
still be the same as what I've written in the past, as far as I can
tell [1].

In a nutshell, if a data structure uses only atomics with .aq/.rl,
RISC-V provides RCtso already anyway.  If a data structure uses fences,
or mixes fences and atomics, we can replace a "fence r,rw" or a
"fence rw,w" with a "fence.tso" (== fence r,rw + fence rw,w) as
necessary, at the cost of some amount of performance.
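
(Spelled out: "fence r,rw" enforces R->R and R->W; "fence rw,w" enforces
R->W and W->W; their union is everything except W->R, which is exactly the
set "fence.tso" enforces.)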

I suppose the answer to the question of whether smp_cond_load_acquire()
needs to change depends on where exactly RCtso is needed, and which
data structures actually use that vs. some other macro. 

Does that answer your question Alan?  Does it make sense?

[1] 
https://lore.kernel.org/lkml/11b27d32-4a8a-3f84-0f25-723095ef1...@nvidia.com/

Dan


Re: [PATCH v2] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

2018-07-12 Thread Daniel Lustig
On 7/12/2018 2:45 AM, Will Deacon wrote:
> On Thu, Jul 12, 2018 at 11:34:32AM +0200, Peter Zijlstra wrote:
>> On Thu, Jul 12, 2018 at 09:40:40AM +0200, Peter Zijlstra wrote:
>>> And I think if we raise atomic*_acquire() to require TSO (but ideally
>>> raise it to RCsc) we're there.
>>
>> To clarify, just the RmW-acquire. Things like atomic_read_acquire() can
>> stay smp_load_acquire() and be RCpc.
> 
> I don't have strong opinions about strengthening RmW atomics to TSO, so
> if it helps to unblock Alan's patch (which doesn't go near this!) then I'll
> go with it. The important part is that we continue to allow roach motel
> into the RmW for other accesses in the non-fully-ordered cases.
> 
> Daniel -- your AMO instructions are cool with this, right? It's just the
> fence-based implementations that will need help?
> 
> Will
Right, let me pull this part out of the overly-long response I just gave
on the thread with Linus :)

if we pair AMOs with AMOs, we get RCsc, and everything is fine.  If we
start mixing in fences (mostly because we don't currently have native
load-acquire or store-release opcodes), then that's when all the rest of the
complexity comes in.

Dan


Re: [PATCH v2] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

2018-07-12 Thread Daniel Lustig
On 7/12/2018 11:10 AM, Linus Torvalds wrote:
> On Thu, Jul 12, 2018 at 11:05 AM Peter Zijlstra  wrote:
>>
>> The locking pattern is fairly simple and shows where RCpc comes apart
>> from expectation real nice.
> 
> So who does RCpc right now for the unlock-lock sequence? Somebody
> mentioned powerpc. Anybody else?
> 
> How nasty would it be to make powerpc conform? I will always advocate
> tighter locking and ordering rules over looser ones..
> 
> Linus

RISC-V probably would have been RCpc if we weren't having this discussion.
Depending on how we map atomics/acquire/release/unlock/lock, we can end up
producing RCpc, "RCtso" (feel free to find a better name here...), or RCsc
behaviors, and we're trying to figure out which we actually need.

I think the debate is this:

Obviously programmers would prefer just to have RCsc and not have to figure out
all the complexity of the other options.  On x86 or architectures with native
RCsc operations (like ARMv8), that's generally easy enough to get.

For weakly-ordered architectures that use fences for ordering (including
PowerPC and sometimes RISC-V, see below), though, it takes extra fences to go
from RCpc to either "RCtso" or RCsc.  People using these architectures are
concerned about whether there's a negative performance impact from those extra
fences.

However, some scheduler code, some RCU code, and probably some other examples
already implicitly or explicitly assume unlock()/lock() provides stronger
ordering than RCpc.  So, we have to decide whether to:
1) define unlock()/lock() to enforce "RCtso" or RCsc, insert more fences on
PowerPC and RISC-V accordingly, and probably negatively regress PowerPC
2) leave unlock()/lock() as enforcing only RCpc, fix any code that currently
assumes something stronger than RCpc is being provided, and hope people don't
get it wrong in the future
3) some mixture like having unlock()/lock() be "RCtso" but smp_store_release()/
smp_cond_load_acquire() be only RCpc
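
As a sketch of what option 3 would mean in practice (same pattern, two
spellings; the names are illustrative):

	WRITE_ONCE(x, 1);
	spin_unlock(&s);
	spin_lock(&s);
	WRITE_ONCE(y, 1);	/* "RCtso": x and y propagate in order */

	WRITE_ONCE(x, 1);
	smp_store_release(&s, 1);
	r0 = smp_cond_load_acquire(&s, VAL == 1);
	WRITE_ONCE(y, 1);	/* RCpc: no such guarantee */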

Also, FWIW, if other weakly-ordered architectures come along in the future and
also use any kind of lightweight fence rather than native RCsc operations,
they'll likely be in the same boat as RISC-V and Power here, in the sense of
not providing RCsc by default either.

Is that a fair assessment everyone?



I can also not-so-briefly summarize RISC-V's status here, since I think there's
been a bunch of confusion about where we're coming from:

First of all, I promise we're not trying to start a fight about all this :)
We're trying to understand the LKMM requirements so we know what instructions
to use.

With that, the easy case: RISC-V is RCsc if we use AMOs or load-reserved/
store-conditional, all of which have RCsc .aq and .rl bits:

  (a) ...
  amoswap.w.rl x0, x0, [lock]  // unlock()
  ...
loop:
  amoswap.w.aq a0, t1, [lock]  // lock()
  bnez a0, loop  // lock()
  (b) ...

(a) is ordered before (b) here, regardless of what (a) and (b) are.  Likewise
for our load-reserved/store-conditional instructions, which also have .aq and
.rl.  That's similar to how ARM behaves, and is no problem.  We're happy with that
too.

Unfortunately, we don't (currently?) have plain load-acquire or store-release
opcodes in the ISA.  (That's a different discussion...)  For those, we need
fences instead.  And that's where it gets messier.

RISC-V *would* end up providing only RCpc if we use what I'd argue is the most
"natural" fence-based mapping for store-release operations, and then pair that
with LR/SC:

  (a) ...
  fence rw,w // unlock()
  sw x0, [lock]  // unlock()
  ...
loop:
  lr.w.aq a0, [lock]  // lock()
  sc.w t1, [lock] // lock()
  bnez t1, loop   // lock()
  (b) ...

However, if (a) and (b) are loads to different addresses, then (a) is not
ordered before (b) here.  One unpaired RCsc operation is not a full fence.
Clearly "fence rw,w" is not sufficient if the scheduler, RCU, and elsewhere
depend on "RCtso" or RCsc.
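
As an LKMM-style sketch of the same point (names illustrative), with (a) and
(b) being the two READ_ONCE()s on P0: the lock ordering the kernel wants
forbids the outcome below, but the fence rw,w mapping above would admit its
machine-level analogue, because the lr.w.aq can forward from the sw still
sitting in the store buffer:

	P0(int *x, int *y, spinlock_t *s)
	{
		spin_lock(s);
		r0 = READ_ONCE(*x);	/* (a) */
		spin_unlock(s);
		spin_lock(s);
		r1 = READ_ONCE(*y);	/* (b) */
		spin_unlock(s);
	}

	P1(int *x, int *y)
	{
		WRITE_ONCE(*y, 1);
		smp_mb();
		WRITE_ONCE(*x, 1);
	}

	exists (0:r0=1 /\ 0:r1=0)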

RISC-V can get back to "RCtso", matching PowerPC, by using a stronger fence:

  (a) ...
  fence.tso  // unlock(), fence.tso == fence rw,w + fence r,r
  sw x0, [lock]  // unlock()
  ...
loop:
  lr.w.aq a0, [lock]  // lock()
  sc.w t1, [lock] // lock()
  bnez t1, loop   // lock()
  (b) ...

(a) is ordered before (b), unless (a) is a store and (b) is a load to a
different address.

(Modeling note: this example is why I asked for Alan's v3 patch over the v2
patch, which I believe would only have worked if the fence.tso were at the end)

To get full RCsc here, we'd need a fence rw,rw in between the unlock store and
the lock load, much like PowerPC would I believe need a heavyweight sync:

  (a) ...
  fence rw,w // unlock()
  sw x0, [lock]  // unlock()
  ...
  fence rw,rw    // can attach either to lock() or to unlock()
  ...
loop:
  lr.w.aq a0, [lock]  // lock()
  sc.w t1, [lock] // lock()
  bnez t1, loop   // lock()
  (b) ...

In general, RISC-V's fence.tso will suffice wherever PowerPC's lwsync does, and
RISC-V's fence rw,rw will suffice wherever PowerPC's full sync does.

Re: [PATCH v2] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

2018-07-11 Thread Daniel Lustig
On 7/11/2018 10:00 AM, Peter Zijlstra wrote:
> On Wed, Jul 11, 2018 at 04:57:51PM +0100, Will Deacon wrote:
> 
>> It might be simple to model, but I worry this weakens our locking
>> implementations to a point where they will not be understood by the average
>> kernel developer. As I've said before, I would prefer "full" RCsc locking,
> 
> Another vote for RCsc locks. The (in)famous hold-out is of course
> PowerPC, but it now looks like RISC-V is following where they lead. I really
> rather wish they didn't.

That's not entirely fair.  We came in wanting to do the "natural" or
"expected" thing: use RCsc atomics where we have them available in the
ISA, and use "fence r,rw" and "fence rw,w" where we don't.

I would argue that the idea of having data races on variables that are
also protected by locks is equally if not more counterintuitive and
unexpected than the "natural" mapping we had originally proposed to use.

All the discussion here[1] for example is about having ordering and
doing an smp_cond_load_acquire() on a variable which is sometimes
protected by a CPU's rq->lock and sometimes not?  Isn't that one of the
key use cases for this whole discussion?

FWIW, this code also mixes locking, smp_rmb(), and
smp_cond_load_acquire() all in very close proximity with each other...
I suppose there's no mixing and matching on the same variable, but
we are here trying to reason about how all those pieces interact, no?

[1] https://lkml.org/lkml/2015/10/6/805

Dan


Re: [PATCH v2] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

2018-07-10 Thread Daniel Lustig
On 7/9/2018 1:01 PM, Alan Stern wrote:
> More than one kernel developer has expressed the opinion that the LKMM
> should enforce ordering of writes by locking.  In other words, given
> the following code:
> 
>   WRITE_ONCE(x, 1);
>   spin_unlock();
>   spin_lock();
>   WRITE_ONCE(y, 1);
> 
> the stores to x and y should be propagated in order to all other CPUs,
> even though those other CPUs might not access the lock s.  In terms of
> the memory model, this means expanding the cumul-fence relation.
> 
> Locks should also provide read-read (and read-write) ordering in a
> similar way.  Given:
> 
>   READ_ONCE(x);
>   spin_unlock();
>   spin_lock();
>   READ_ONCE(y);   // or WRITE_ONCE(y, 1);
> 
> the load of x should be executed before the load of (or store to) y.
> The LKMM already provides this ordering, but it provides it even in
> the case where the two accesses are separated by a release/acquire
> pair of fences rather than unlock/lock.  This would prevent
> architectures from using weakly ordered implementations of release and
> acquire, which seems like an unnecessary restriction.  The patch
> therefore removes the ordering requirement from the LKMM for that
> case.
> 
> All the architectures supported by the Linux kernel (including RISC-V)
> do provide this ordering for locks, albeit for varying reasons.
> Therefore this patch changes the model in accordance with the
> developers' wishes.
> 
> Signed-off-by: Alan Stern 
> 
> ---
> 
> v.2: Restrict the ordering to lock operations, not general release
> and acquire fences.
> 
> [as1871b]
> 
> 
>  tools/memory-model/Documentation/explanation.txt                           | 186 +++---
>  tools/memory-model/linux-kernel.cat                                        |   8 
>  tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus |   5 
>  3 files changed, 149 insertions(+), 50 deletions(-)
> 
> Index: usb-4.x/tools/memory-model/linux-kernel.cat
> ===================================================================
> --- usb-4.x.orig/tools/memory-model/linux-kernel.cat
> +++ usb-4.x/tools/memory-model/linux-kernel.cat
> @@ -38,7 +38,7 @@ let strong-fence = mb | gp
>  (* Release Acquire *)
>  let acq-po = [Acquire] ; po ; [M]
>  let po-rel = [M] ; po ; [Release]
> -let rfi-rel-acq = [Release] ; rfi ; [Acquire]
> +let unlock-rf-lock-po = [UL] ; rf ; [LKR] ; po

It feels slightly weird that unlock-rf-lock-po is asymmetrical.  And in
fact, I think the current RISC-V solution we've been discussing (namely,
putting a fence.tso instead of a fence rw,w in front of the release)
may not even technically respect that particular sequence.  The
fence.tso solution really enforces "po; [UL]; rf; [LKR]", right?

Does something like "po; [UL]; rf; [LKR]; po" fit in with the rest
of the model?  If so, maybe that solves the asymmetry and also
legalizes the approach of putting fence.tso in front?

Or, other suggestions?

Dan

>  (**)
>  (* Fundamental coherence ordering *)
> @@ -60,13 +60,13 @@ let dep = addr | data
>  let rwdep = (dep | ctrl) ; [W]
>  let overwrite = co | fr
>  let to-w = rwdep | (overwrite & int)
> -let to-r = addr | (dep ; rfi) | rfi-rel-acq
> +let to-r = addr | (dep ; rfi)
>  let fence = strong-fence | wmb | po-rel | rmb | acq-po
> -let ppo = to-r | to-w | fence
> +let ppo = to-r | to-w | fence | (unlock-rf-lock-po & int)
>  
>  (* Propagation: Ordering from release operations and strong fences. *)
>  let A-cumul(r) = rfe? ; r
> -let cumul-fence = A-cumul(strong-fence | po-rel) | wmb
> +let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | unlock-rf-lock-po
>  let prop = (overwrite & ext)? ; cumul-fence* ; rfe?


Re: [PATCH 2/2] tools/memory-model: Add write ordering by release-acquire and by locks

2018-07-09 Thread Daniel Lustig
On 7/9/2018 9:52 AM, Will Deacon wrote:
> On Fri, Jul 06, 2018 at 02:10:55PM -0700, Paul E. McKenney wrote:
>> On Fri, Jul 06, 2018 at 04:37:21PM -0400, Alan Stern wrote:
>>> On Thu, 5 Jul 2018, Andrea Parri wrote:
>>>
>>>>> At any rate, it looks like instead of strengthening the relation, I
>>>>> should write a patch that removes it entirely.  I also will add new,
>>>>> stronger relations for use with locking, essentially making spin_lock
>>>>> and spin_unlock be RCsc.
>>>>
>>>> Thank you.
>>>>
>>>> Ah let me put this forward: please keep an eye on the (generic)
>>>>
>>>>   queued_spin_lock()
>>>>   queued_spin_unlock()
>>>>
>>>> (just to point out an example). Their implementation (in part.,
>>>> the fast-path) suggests that if we will stick to RCsc lock then
>>>> we should also stick to RCsc acq. load from RMW and rel. store.

Just to be clear, this is "RCsc with W->R exception" again, right?

>>> A very good point.  The implementation of those routines uses
>>> atomic_cmpxchg_acquire() to acquire the lock.  Unless this is
>>> implemented with an operation or fence that provides write-write
>>> ordering (in conjunction with a suitable release), qspinlocks won't
>>> have the ordering properties that we want.
>>>
>>> I'm going to assume that the release operations used for unlocking
>>> don't need to have any extra properties; only the lock-acquire
>>> operations need to be special (i.e., stronger than a normal
>>> smp_load_acquire). This suggests that atomic RMW functions with acquire
>>> semantics should also use this stronger form of acquire.

It's not clear to me that the burden of enforcing "RCsc with W->R
ordering" should always be placed only on the acquire half.
RISC-V currently places some of the burden on the release half, as
we discussed last week.  Specifically, there are a few cases where
fence.tso is used instead of fence rw,w on the release side.

If we always use fence.tso here, following the current recommendation,
we'll still be fine.  If LKMM introduces an RCpc vs. RCsc distinction
of some kind, though, I think we would want to distinguish the two
types of release accordingly as well.
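
For concreteness, the qspinlock fast path Alan describes can be sketched as a
litmus test (hypothetical names; cmpxchg_acquire() playing the role of the
atomic_cmpxchg_acquire() fast path).  The stronger-RMW-acquire proposal
amounts to requiring that this outcome be forbidden even when the cmpxchg
reads from the unlock on the same CPU; a release-half approach (fence.tso in
front of the release, as above) would discharge the same test:

	P0(int *x, int *y, int *l)
	{
		WRITE_ONCE(*x, 1);
		smp_store_release(l, 0);	/* queued_spin_unlock() */
		r0 = cmpxchg_acquire(l, 0, 1);	/* fast-path lock */
		WRITE_ONCE(*y, 1);
	}

	P1(int *x, int *y)
	{
		r1 = READ_ONCE(*y);
		smp_rmb();
		r2 = READ_ONCE(*x);
	}

	exists (0:r0=0 /\ 1:r1=1 /\ 1:r2=0)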

>>> Does anybody have a different suggestion?
>>
>> The approach you suggest makes sense to me.  Will, Peter, Daniel, any
>> reasons why this approach would be a problem for you guys?
> 
> qspinlock is very much opt-in per arch, so we can simply require that
> an architecture must have RCsc RmW atomics if they want to use qspinlock.
> Should an architecture arise where that isn't the case, then we could
> consider an arch hook in the qspinlock code, but I don't think we have
> to solve that yet.
> 
> Will

This sounds reasonable to me.

Dan



Re: [PATCH 2/2] tools/memory-model: Add write ordering by release-acquire and by locks

2018-07-05 Thread Daniel Lustig
On 7/5/2018 9:56 AM, Paul E. McKenney wrote:
> On Thu, Jul 05, 2018 at 05:22:26PM +0100, Will Deacon wrote:
>> On Thu, Jul 05, 2018 at 08:44:39AM -0700, Daniel Lustig wrote:
>>> On 7/5/2018 8:31 AM, Paul E. McKenney wrote:
>>>> On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
>>>>> At any rate, it looks like instead of strengthening the relation, I
>>>>> should write a patch that removes it entirely.  I also will add new,
>>>>> stronger relations for use with locking, essentially making spin_lock
>>>>> and spin_unlock be RCsc.
>>>>
>>>> Only in the presence of smp_mb__after_unlock_lock() or
>>>> smp_mb__after_spinlock(), correct?  Or am I confused about RCsc?
>>>>
>>>>    Thanx, Paul
>>>>
>>>
>>> In terms of naming...is what you're asking for really RCsc?  To me,
>>> that would imply that even stores in the first critical section would
>>> need to be ordered before loads in the second critical section.
>>> Meaning that even x86 would need an mfence in either lock() or unlock()?
>>
>> I think a LOCK operation always implies an atomic RmW, which will give
>> full ordering guarantees on x86. I know there have been interesting issues
>> involving I/O accesses in the past, but I think that's still out of scope
>> for the memory model.

Yes, you're right about atomic RMWs on x86, and I'm not worried about I/O
here either.  But see below.

>>
>> Peter will know.
> 
> Agreed, x86 locked operations imply full fences, so x86 will order the
> accesses in consecutive critical sections with respect to an observer
> not holding the lock, even stores in earlier critical sections against
> loads in later critical sections.  We have been discussing tightening
> LKMM to make an unlock-lock pair order everything except earlier stores
> vs. later loads.  (Of course, if everyone holds the lock, they will see
> full ordering against both earlier and later critical sections.)
> 
> Or are you pushing for something stronger?
> 
>   Thanx, Paul
> 

No, I'm definitely not pushing for anything stronger.  I'm still just
wondering if the name "RCsc" is right for what you described.  For
example, Andrea just said this in a parallel email:

> "RCsc" as ordering everything except for W -> R, without the [extra]
> barriers

If it's "RCsc with exceptions", doesn't it make sense to find a
different name, rather than simply overloading the term "RCsc" with
a subtly different meaning, and hoping nobody gets confused?

I suppose on x86 and ARM you'd happen to get "true RCsc" anyway, just
due to the way things are currently mapped: LOCKed RMWs and "true RCsc"
instructions, respectively.  But on Power and RISC-V, it would really
be more "RCsc with a W->R exception", right?

In fact, the more I think about it, this doesn't seem to be RCsc at all.
It seems closer to "RCpc plus extra PC ordering between critical
sections".  No?

The synchronization accesses themselves aren't sequentially consistent
with respect to each other under the Power or RISC-V mappings, unless
there's a hwsync in there somewhere that I missed?  Or a rule
preventing stw from forwarding to lwarx?  Or some other higher-order
effect preventing it from being observed anyway?

So that's all I'm suggesting here.  If you all buy that, maybe "RCpccs"
for "RCpc with processor consistent critical section ordering"?
I don't have a strong opinion on the name itself; I just want to find
a name that's less ambiguous or overloaded.

Dan


Re: [PATCH 2/2] tools/memory-model: Add write ordering by release-acquire and by locks

2018-07-05 Thread Daniel Lustig
On 7/5/2018 8:31 AM, Paul E. McKenney wrote:
> On Thu, Jul 05, 2018 at 10:21:36AM -0400, Alan Stern wrote:
>> At any rate, it looks like instead of strengthening the relation, I
>> should write a patch that removes it entirely.  I also will add new,
>> stronger relations for use with locking, essentially making spin_lock
>> and spin_unlock be RCsc.
> 
> Only in the presence of smp_mb__after_unlock_lock() or
> smp_mb__after_spinlock(), correct?  Or am I confused about RCsc?
> 
>   Thanx, Paul
> 

In terms of naming...is what you're asking for really RCsc?  To me,
that would imply that even stores in the first critical section would
need to be ordered before loads in the second critical section.
Meaning that even x86 would need an mfence in either lock() or unlock()?

If you're only looking for R->R, R->W, and W->W ordering between
critical sections, is it better to find a new unique name for this?
"RCtso" possibly, or something to that effect?

Dan


Re: [PATCH 2/2] tools/memory-model: Add write ordering by release-acquire and by locks

2018-07-05 Thread Daniel Lustig
On 7/5/2018 8:16 AM, Daniel Lustig wrote:
> On 7/5/2018 7:44 AM, Will Deacon wrote:
>> Andrea,
>>
>> On Thu, Jul 05, 2018 at 04:00:29PM +0200, Andrea Parri wrote:
>>> On Wed, Jul 04, 2018 at 01:11:04PM +0100, Will Deacon wrote:
>>>> On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
>>>>> There's also read-write ordering, in the form of the LB pattern:
>>>>>
>>>>> P0(int *x, int *y, int *z)
>>>>> {
>>>>>   r0 = READ_ONCE(*x);
>>>>>   smp_store_release(z, 1);
>>>>>   r1 = smp_load_acquire(z);
>>>>>   WRITE_ONCE(*y, 1);
>>>>> }
>>>>>
>>>>> P1(int *x, int *y)
>>>>> {
>>>>>   r2 = READ_ONCE(*y);
>>>>>   smp_mb();
>>>>>   WRITE_ONCE(*x, 1);
>>>>> }
>>>>>
>>>>> exists (0:r0=1 /\ 1:r2=1)
>>>>
>>>> The access types are irrelevant to the acquire/release primitives, so yes
>>>> that's also allowed.
>>>>
>>>>> Would this be allowed if smp_load_acquire() was implemented with LDAPR?
>>>>> If the answer is yes then we will have to remove the rfi-rel-acq and
>>>>> rel-rf-acq-po relations from the memory model entirely.
>>>>
>>>> I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
>>>> rel-rfi-acq-po for the other? Sounds like I'm confused here.
>>>
>>> [Your reply about 1/2 suggests that you've figured this out now, IAC ...]
>>
>> Yeah, the naming threw me because it's not the same order as the composition
>> in the actual definition (why not?). I typoed an extra 'po' suffix. Well
>> done for spotting it.
>>
>>> "rfi-rel-acq" (as Alan wrote, and as I wrote before my question above...)
>>> is defined and currently used in linux-kernel.cat (and, FWIW, it has been
>>> so since when we upstreamed LKMM).
>>>
>>> My point is that, as exemplified by the two tests I reported above, this
>>> relation already prevents you from implementing your acquire with LDAPR;
>>> so my/our question was not "can you run herd7 for me?" but rather "do you
>>> think that changes are needed to the .cat file?".
>>
>> I mean, you can run herd on the armv8 memory model and see the answer to the
>> litmus test. So we have two options for the LKMM .cat file (the Arm one is
>> baked in silicon):
>>
>>   1. Say that architectures with RCpc acquire/release instructions must
>>  instead use RCsc instructions (if they have them) or full fences
>>
>>   2. Change the LKMM .cat file to allow RCpc acquire/release (note that I'd
>>  still like RCsc lock/unlock semantics, and I think that's actually the
>>  real case that matters here, but the current difficulty in
>>  distinguishing the two in the .cat file has led to this broader
>>  ordering being enforced as a side-effect)
>>
>> I prefer (2), because I think it's a safe and sensible relaxation to make
>> and accommodates what I consider to be a likely extension to weakly ordered
>> architectures that we might want to support efficiently. So yes, I think
>> changes are needed to the LKMM .cat file, but please don't view that as a
>> criticism. We change stuff all the time as long as it doesn't break 
>> userspace.
>>
>>> This question goes back _at least_ to:
>>>
>>>   
>>> http://lkml.kernel.org/r/1519301990-11766-1-git-send-email-parri.and...@gmail.com
>>>
>>> (see, in part., the "IMPORTANT" note at the bottom of the commit message)
>>> and that discussion later resulted in:
>>>
>>>   0123f4d76ca63b ("riscv/spinlock: Strengthen implementations with fences")
>>>   5ce6c1f3535fa8 ("riscv/atomic: Strengthen implementations with fences")
>>>
>>> (the latest _draft_ of the RISC-V specification, as pointed out by Daniel,
>>>
>>>   https://github.com/riscv/riscv-isa-manual, Appendix A.5
>>>
>>>  includes our "Linux mapping", although the currently-recommended mapping
>>>  differs and involves a "fence.tso [+ any acquire, including RCpc .aq]").
>>
>> [I think the discussion at hand is broader than RISC-V, and I looped in
>>  Daniel already]
> 
> Sorry for the delay, it was Independence Day here in the US.
> 
> I'm aligned with Will on all this so far.  As mentioned above, this issue
> definitely comes up on RISC-V,

Re: [PATCH 2/2] tools/memory-model: Add write ordering by release-acquire and by locks

2018-07-05 Thread Daniel Lustig
On 7/5/2018 8:16 AM, Daniel Lustig wrote:
> On 7/5/2018 7:44 AM, Will Deacon wrote:
>> Andrea,
>>
>> On Thu, Jul 05, 2018 at 04:00:29PM +0200, Andrea Parri wrote:
>>> On Wed, Jul 04, 2018 at 01:11:04PM +0100, Will Deacon wrote:
>>>> On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
>>>>> There's also read-write ordering, in the form of the LB pattern:
>>>>>
>>>>> P0(int *x, int *y, int *z)
>>>>> {
>>>>>   r0 = READ_ONCE(*x);
>>>>>   smp_store_release(z, 1);
>>>>>   r1 = smp_load_acquire(z);
>>>>>   WRITE_ONCE(*y, 1);
>>>>> }
>>>>>
>>>>> P1(int *x, int *y)
>>>>> {
>>>>>   r2 = READ_ONCE(*y);
>>>>>   smp_mp();
>>>>>   WRITE_ONCE(*x, 1);
>>>>> }
>>>>>
>>>>> exists (0:r0=1 /\ 1:r2=1)
>>>>
>>>> The access types are irrelevant to the acquire/release primitives, so yes
>>>> that's also allowed.
>>>>
>>>>> Would this be allowed if smp_load_acquire() was implemented with LDAPR?
>>>>> If the answer is yes then we will have to remove the rfi-rel-acq and
>>>>> rel-rf-acq-po relations from the memory model entirely.
>>>>
>>>> I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
>>>> rel-rfi-acq-po for the other? Sounds like I'm confused here.
>>>
>>> [Your reply about 1/2 suggests that you've figured this out now, IAC ...]
>>
>> Yeah, the naming threw me because it's not the same order as the composition
>> in the actual definition (why not?). I typoed an extra 'po' suffix. Well
>> done for spotting it.
>>
>>> "rfi-rel-acq" (as Alan wrote, and as I wrote before my question above...)
>>> is defined and currently used in linux-kernel.cat (and, FWIW, it has been
>>> so since when we upstreamed LKMM).
>>>
>>> My point is that, as exemplified by the two tests I reported above, this
>>> relation already prevents you from implementing your acquire with LDAPR;
>>> so my/our question was not "can you run herd7 for me?" but rather "do you
>>> think that changes are needed to the .cat file?".
>>
>> I mean, you can run herd on the armv8 memory model and see the answer to the
>> litmus test. So we have two options for the LKMM .cat file (the Arm one is
>> baked in silicon):
>>
>>   1. Say that architectures with RCpc acquire/release instructions must
>>  instead use RCsc instructions (if they have them) or full fences
>>
>>   2. Change the LKMM .cat file to allow RCpc acquire/release (note that I'd
>>  still like RCsc lock/unlock semantics, and I think that's actually the
>>  real case that matters here, but the currently difficulty in
>>  distinguishing the two in the .cat file has led to this broader
>>  ordering being enforced as a side-effect)
>>
>> I prefer (2), because I think it's a safe and sensible relaxation to make
>> and accomodates what I consider to be a likely extension to weakly ordered
>> architectures that we might want to support efficiently. So yes, I think
>> changes are needed to the LKMM .cat file, but please don't view that as a
>> criticism. We change stuff all the time as long as it doesn't break 
>> userspace.
>>
>>> This question goes back _at least_ to:
>>>
>>>   
>>> http://lkml.kernel.org/r/1519301990-11766-1-git-send-email-parri.and...@gmail.com
>>>
>>> (see, in part., the "IMPORTANT" note at the bottom of the commit message)
>>> and that discussion later resulted in:
>>>
>>>   0123f4d76ca63b ("riscv/spinlock: Strengthen implementations with fences")
>>>   5ce6c1f3535fa8 ("riscv/atomic: Strengthen implementations with fences")
>>>
>>> (the latest _draft_ of the RISC-V specification, as pointed out by Daniel,
>>>
>>>   https://github.com/riscv/riscv-isa-manual, Appendix A.5
>>>
>>>  includes our "Linux mapping", although the currently-recommended mapping
>>>  differs and involves a "fence.tso [+ any acquire, including RCpc .aq]").
>>
>> [I think the discussion at hand is broader than RISC-V, and I looped in
>>  Daniel already]
> 
> Sorry for the delay, it was Independence Day here in the US.
> 
> I'm aligned with Will on all this so far.  As mentioned above, this issue
> definitely comes up on RISC-V,

Re: [PATCH 2/2] tools/memory-model: Add write ordering by release-acquire and by locks

2018-07-05 Thread Daniel Lustig
On 7/5/2018 7:44 AM, Will Deacon wrote:
> Andrea,
> 
> On Thu, Jul 05, 2018 at 04:00:29PM +0200, Andrea Parri wrote:
>> On Wed, Jul 04, 2018 at 01:11:04PM +0100, Will Deacon wrote:
>>> On Tue, Jul 03, 2018 at 01:28:17PM -0400, Alan Stern wrote:
>>>> There's also read-write ordering, in the form of the LB pattern:
>>>>
>>>> P0(int *x, int *y, int *z)
>>>> {
>>>> 	r0 = READ_ONCE(*x);
>>>> 	smp_store_release(z, 1);
>>>> 	r1 = smp_load_acquire(z);
>>>> 	WRITE_ONCE(*y, 1);
>>>> }
>>>>
>>>> P1(int *x, int *y)
>>>> {
>>>> 	r2 = READ_ONCE(*y);
>>>> 	smp_mb();
>>>> 	WRITE_ONCE(*x, 1);
>>>> }
>>>>
>>>> exists (0:r0=1 /\ 1:r2=1)
>>>
>>> The access types are irrelevant to the acquire/release primitives, so yes
>>> that's also allowed.
>>>
>>>> Would this be allowed if smp_load_acquire() was implemented with LDAPR?
>>>> If the answer is yes then we will have to remove the rfi-rel-acq and
>>>> rel-rf-acq-po relations from the memory model entirely.
>>>
>>> I don't understand what you mean by "rfi-rel-acq-po", and I assume you mean
>>> rel-rfi-acq-po for the other? Sounds like I'm confused here.
>>
>> [Your reply about 1/2 suggests that you've figured this out now, IAC ...]
> 
> Yeah, the naming threw me because it's not the same order as the composition
> in the actual definition (why not?). I typoed an extra 'po' suffix. Well
> done for spotting it.
> 
>> "rfi-rel-acq" (as Alan wrote, and as I wrote before my question above...)
>> is defined and currently used in linux-kernel.cat (and, FWIW, it has been
>> so since when we upstreamed LKMM).
>>
>> My point is that, as exemplified by the two tests I reported above, this
>> relation already prevents you from implementing your acquire with LDAPR;
>> so my/our question was not "can you run herd7 for me?" but rather "do you
>> think that changes are needed to the .cat file?".
> 
> I mean, you can run herd on the armv8 memory model and see the answer to the
> litmus test. So we have two options for the LKMM .cat file (the Arm one is
> baked in silicon):
> 
>   1. Say that architectures with RCpc acquire/release instructions must
>  instead use RCsc instructions (if they have them) or full fences
> 
>   2. Change the LKMM .cat file to allow RCpc acquire/release (note that I'd
>  still like RCsc lock/unlock semantics, and I think that's actually the
>  real case that matters here, but the current difficulty in
>  distinguishing the two in the .cat file has led to this broader
>  ordering being enforced as a side-effect)
> 
> I prefer (2), because I think it's a safe and sensible relaxation to make
> and accommodates what I consider to be a likely extension to weakly ordered
> architectures that we might want to support efficiently. So yes, I think
> changes are needed to the LKMM .cat file, but please don't view that as a
> criticism. We change stuff all the time as long as it doesn't break userspace.
> 
>> This question goes back _at least_ to:
>>
>>   http://lkml.kernel.org/r/1519301990-11766-1-git-send-email-parri.and...@gmail.com
>>
>> (see, in part., the "IMPORTANT" note at the bottom of the commit message)
>> and that discussion later resulted in:
>>
>>   0123f4d76ca63b ("riscv/spinlock: Strengthen implementations with fences")
>>   5ce6c1f3535fa8 ("riscv/atomic: Strengthen implementations with fences")
>>
>> (the latest _draft_ of the RISC-V specification, as pointed out by Daniel,
>>
>>   https://github.com/riscv/riscv-isa-manual, Appendix A.5
>>
>>  includes our "Linux mapping", although the currently-recommended mapping
>>  differs and involves a "fence.tso [+ any acquire, including RCpc .aq]").
> 
> [I think the discussion at hand is broader than RISC-V, and I looped in
>  Daniel already]

Sorry for the delay, it was Independence Day here in the US.

I'm aligned with Will on all this so far.  As mentioned above, this issue
definitely comes up on RISC-V, but for reasons even beyond RCpc vs. RCsc.
Why?  On RISC-V, we have RCsc atomics, but no native load-acquire or
store-release opcodes.  You might expect you can use "ld; fence r,rw"
and "fence rw,w; sd", respectively, but those don't interact with the
RCsc .aq/.rl mappings.  For example, suppose we did this:

(a)
amoswap.w.rl x0, x0, 0(a1)   // st-rel using amoswap with dummy dest
...
lw a0, 0(a1) // ld-acq using fence-based mapping
fence r,rw
(b)

Nothing here orders (a) before (b) in general.  A mapping always using
fences would cover what's being asked for (I believe), and using only
RCsc operations (like ARM does) would also work if we had native
load-acquire and store-release operations, but this mixed mapping
doesn't.  That's why the spec currently recommends the stronger
fence.tso (~= multi-copy atomic Power lwsync) instead.
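
For contrast, a sketch of the all-fences version of the same sequence
(illustrative only; registers are arbitrary):

(a)
fence rw,w   // st-rel using fence-based mapping
sd t0, 0(a1)
...
lw a0, 0(a1) // ld-acq using fence-based mapping
fence r,rw
(b)

With both halves fence-based, (a) does end up ordered before (b), as far
as I can tell; it's only the mixed AMO-plus-fence pairing above that
falls apart.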

...but it's very non-obvious.  Is there really a hurry to rush all
this in?

>> My understanding is that your answer to this question is "Yes", but I am
>> not sure about the "How/Which changes?"; of course, an answer to this question
>> 


Re: [PATCH] MAINTAINERS: Add Daniel Lustig as a LKMM reviewer

2018-06-22 Thread Daniel Lustig
On 6/22/2018 2:17 PM, Palmer Dabbelt wrote:
> Dan runs the RISC-V memory model working group.  I've been forwarding
> him LKMM emails that end up in my inbox, but I'm far from an expert in
> this stuff.  He requested to be added as a reviewer, which seems sane to
> me as it'll take a human out of the loop.
> 
> CC: Daniel Lustig 
> Signed-off-by: Palmer Dabbelt 

Looks like there's an accidental backquote before my name?

Once that gets fixed:

Acked-by: Daniel Lustig 

> ---
>  MAINTAINERS | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9d5eeff51b5f..b980635220e5 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8316,6 +8316,7 @@ M:  Jade Alglave 
>  M:   Luc Maranget 
>  M:   "Paul E. McKenney" 
>  R:   Akira Yokosawa 
> +R:`  Daniel Lustig 
>  L:   linux-kernel@vger.kernel.org
>  S:   Supported
>  T:   git git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
> 


Re: [PATCH v2 2/2] riscv/atomic: Strengthen implementations with fences

2018-03-09 Thread Daniel Lustig
On 3/9/2018 2:57 PM, Palmer Dabbelt wrote:
> On Fri, 09 Mar 2018 13:30:08 PST (-0800), parri.and...@gmail.com wrote:
>> On Fri, Mar 09, 2018 at 10:54:27AM -0800, Palmer Dabbelt wrote:
>>> On Fri, 09 Mar 2018 10:36:44 PST (-0800), parri.and...@gmail.com wrote:
>>
>> [...]
>>
>>> >This proposal relies on the generic definition,
>>> >
>>> >   include/linux/atomic.h ,
>>> >
>>> >and on the
>>> >
>>> >   __atomic_op_acquire()
>>> >   __atomic_op_release()
>>> >
>>> >above to build the acquire/release atomics (except for the xchg,cmpxchg,
>>> >where the ACQUIRE_BARRIER is inserted conditionally/on success).
>>>
>>> I thought we wanted to use the AQ and RL bits for AMOs, just not for LR/SC
>>> sequences.  IIRC the AMOs are safe with the current memory model, but I might
>>> just have some version mismatches in my head.
>>
>> AMO.aqrl are "safe" w.r.t. the LKMM (as they provide "full-ordering"); OTOH,
>> AMO.aq and AMO.rl present weaknesses that LKMM (and some kernel developers)
>> do not "expect".  I was probing this issue in:
>>
>>   https://marc.info/?l=linux-kernel&m=151930201102853&w=2
>>
>> (c.f., e.g., test "RISCV-unlock-lock-read-ordering" from that post).
>>
>> Quoting from the commit message of my patch 1/2:
>>
>>   "Referring to the "unlock-lock-read-ordering" test reported below,
>>    Daniel wrote:
>>
>>  I think an RCpc interpretation of .aq and .rl would in fact
>>  allow the two normal loads in P1 to be reordered [...]
>>
>>  [...]
>>
>>  Likewise even if the unlock()/lock() is between two stores.
>>  A control dependency might originate from the load part of
>>  the amoswap.w.aq, but there still would have to be something
>>  to ensure that this load part in fact performs after the store
>>  part of the amoswap.w.rl performs globally, and that's not
>>  automatic under RCpc.
>>
>>    Simulation of the RISC-V memory consistency model confirmed this
>>    expectation."
>>
>> I have just (re)checked these observations against the latest specification,
>> and my results _confirmed_ these verdicts.
> 
> Thanks, I must have just gotten confused about a draft spec or something.  I'm
> pulling these on top of your other memory model related patch.  I've renamed
> the branch "next-mm" to be a bit more descriptive.

(Sorry for being out of the loop this week, I was out to deal with
a family matter.)

I assume you're using the herd model?  Luc's doing a great job with
that, but even so, nothing is officially confirmed until we ratify
the model.  In other words, the herd model may end up changing too.
If something is broken on our end, there's still time to fix it.

Regarding AMOs, let me copy from something I wrote in a previous
offline conversation:

> it seems to us that pairing a store-release of "amoswap.rl" with
> a "ld; fence r,rw" doesn't actually give us the RC semantics we've
> been discussing for LKMM.  For example:
> 
> (a) sd t0,0(s0)
> (b) amoswap.d.rl x0,t1,0(s1)
> ...
> (c) ld a0,0(s1)
> (d) fence r,rw
> (e) sd t2,0(s2)
> 
> There, we won't get (a) ordered before (e) regardless of whether
> (b) is RCpc or RCsc.  Do you agree?

At the moment, only the load part of (b) is in the predecessor
set of (d), but the store part of (b) is not.  Likewise, the
.rl annotation applies only to the store part of (b), not the
load part.

This gets back to a question Linus asked last week about
whether the AMO is a single unit or whether it can be
considered to split into a load and a store part (which still
perform atomically).  For RISC-V, for right now at least, the
answer is the latter.  Is it still the latter for Linux too?

https://lkml.org/lkml/2018/2/26/606

> So I think we'll need to make sure we pair .rl with .aq, or that
> we pair fence-based mappings with fence-based mappings, in order
> to make the acquire/release operations work.

This assumes we'll say that .aq and .rl are RCsc, not RCpc.
But in this case, I think .aq and .rl could still be safe to use,
as long as you don't ever try to mix in a fence-based mapping
on the same data structure like in the example above.  That
might be important if we want to find the most compact legal
implementation, and hence do want to use .aq and .rl after all.
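
As a sketch, the matched pairing would be something like the following
(assuming an RCsc reading of .aq/.rl; "amoor" with a zero operand is
used here as a dummy read-only AMO):

(a)
amoswap.w.rl x0, t1, 0(a1)   // st-rel half: AMO with .rl
...
amoor.w.aq a0, x0, 0(a1)     // ld-acq half: AMO with .aq; or'ing in 0
                             // reads the location without modifying it
(b)

With .aq and .rl both RCsc and the acquire reading from the release,
(a) would be ordered before (b) here, as I understand it.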

> And since we don't have native "ld.aq" today in RISC-V, that
> would mean smp_store_release would have to remain implemented
> as "fence rw,w; s{w|d}", rather than "amoswap.{w|d}.rl", for
> example.

Thoughts?

Thanks,
Dan

> 
> Thanks!


Re: [RFC PATCH] riscv/locking: Strengthen spin_lock() and spin_unlock()

2018-03-01 Thread Daniel Lustig
On 3/1/2018 1:54 PM, Palmer Dabbelt wrote:
> On Thu, 01 Mar 2018 07:11:41 PST (-0800), parri.and...@gmail.com wrote:
>> Hi Daniel,
>>
>> On Thu, Feb 22, 2018 at 11:47:57AM -0800, Daniel Lustig wrote:
>>> On 2/22/2018 10:27 AM, Peter Zijlstra wrote:
>>> > On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
>>> >> So we have something that is not all that rare in the Linux kernel
>>> >> community, namely two conflicting more-or-less concurrent changes.
>>> >> This clearly needs to be resolved, either by us not strengthening the
>>> >> Linux-kernel memory model in the way we were planning to or by you
>>> >> strengthening RISC-V to be no weaker than PowerPC for these sorts of
>>> >> externally viewed release-acquire situations.
>>> >>
>>> >> Other thoughts?
>>> >
>>> > Like said in the other email, I would _much_ prefer to not go weaker
>>> > than PPC, I find that PPC is already painfully weak at times.
>>>
>>> Sure, and RISC-V could make this work too by using RCsc instructions
>>> and/or by using lightweight fences instead.  It just wasn't clear at
>>> first whether smp_load_acquire() and smp_store_release() were RCpc,
>>> RCsc, or something else, and hence whether RISC-V would actually need
>>> to use something stronger than pure RCpc there.  Likewise for
>>> spin_unlock()/spin_lock() and everywhere else this comes up.
>>
>> while digging into riscv's locks and atomics to fix the issues discussed
>> earlier in this thread, I became aware of another issue:
>>
>> Considering here the CMPXCHG primitives, for example, I noticed that the
>> implementation currently provides the four variants
>>
>> ATOMIC_OPS(    , .aqrl)
>> ATOMIC_OPS(_acquire,   .aq)
>> ATOMIC_OPS(_release,   .rl)
>> ATOMIC_OPS(_relaxed,  )
>>
>> (corresponding, resp., to
>>
>> atomic_cmpxchg()
>> atomic_cmpxchg_acquire()
>> atomic_cmpxchg_release()
>> atomic_cmpxchg_relaxed()  )
>>
>> so that the first variant above ends up doing
>>
>> 0:	lr.w.aqrl	%0, %addr
>> 	bne	%0, %old, 1f
>> 	sc.w.aqrl	%1, %new, %addr
>> 	bnez	%1, 0b
>> 1:
>>
>> From Sect. 2.3.5. ("Acquire/Release Ordering") of the Spec.,
>>
>>  "AMOs with both .aq and .rl set are fully-ordered operations.  Treating
>>   the load part and the store part as independent RCsc operations is not
>>   in and of itself sufficient to enforce full fencing behavior, but this
>>   subtle weak behavior is counterintuitive and not much of an advantage
>>   architecturally, especially with lr and sc also available [...]."
>>
>> I understand that
>>
>>     { x = y = u = v = 0 }
>>
>> P0()
>>
>> WRITE_ONCE(x, 1);    ("relaxed" store, sw)
>> atomic_cmpxchg(&u, 0, 1);
>> r0 = READ_ONCE(y);    ("relaxed" load, lw)
>>
>> P1()
>>
>> WRITE_ONCE(y, 1);
>> atomic_cmpxchg(&v, 0, 1);
>> r1 = READ_ONCE(x);
>>
>> could result in (u = v = 1 and r0 = r1 = 0) at the end; can you confirm?
> 
> cmpxchg isn't an AMO, it's an LR SC sequence, so that blurb doesn't
> apply.  I think "lr.w.aqrl" and "sc.w.aqrl" is not sufficient to
> perform a fully ordered operation (ie, it's an incorrect
> implementation of atomic_cmpxchg()), but I was hoping to get some
> time to actually internalize this part of the RISC-V memory model at
> some point to be sure.

Agreed, the current intention is that we'll add a fence rw,rw in there
when we use lr/sc as the implementation, likely similar to what ARM does:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e86f0b409a44193f1587e87b69c5dcf8f65be67
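
As a sketch (not the final patch), the fully ordered variant would then
look something like:

0:	lr.w	%0, %addr
	bne	%0, %old, 1f
	sc.w.rl	%1, %new, %addr
	bnez	%1, 0b
	fence	rw, rw
1:

i.e. keep the release on the sc and add a trailing full fence, analogous
to the trailing dmb in the ARM commit above.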

It's also probably not the only piece of synchronization-related code
we'll have to go back and audit as we finalize the RISC-V memory
consistency model over the next couple months or so.

Dan


Re: [PATCH] riscv/barrier: Define __smp_{store_release,load_acquire}

2018-02-27 Thread Daniel Lustig
On 2/27/2018 10:21 AM, Palmer Dabbelt wrote:
> On Mon, 26 Feb 2018 18:24:11 PST (-0800), parri.and...@gmail.com wrote:
>> Introduce __smp_{store_release,load_acquire}, and rely on the generic
>> definitions for smp_{store_release,load_acquire}. This avoids the use
>> of full ("rw,rw") fences on SMP.
>>
>> Signed-off-by: Andrea Parri 
>> ---
>>  arch/riscv/include/asm/barrier.h | 15 +++
>>  1 file changed, 15 insertions(+)
>>
>> diff --git a/arch/riscv/include/asm/barrier.h b/arch/riscv/include/asm/barrier.h
>> index 5510366d169ae..d4628e4b3a5ea 100644
>> --- a/arch/riscv/include/asm/barrier.h
>> +++ b/arch/riscv/include/asm/barrier.h
>> @@ -38,6 +38,21 @@
>>  #define __smp_rmb()    RISCV_FENCE(r,r)
>>  #define __smp_wmb()    RISCV_FENCE(w,w)
>>
>> +#define __smp_store_release(p, v)    \
>> +do {    \
>> +    compiletime_assert_atomic_type(*p);    \
>> +    RISCV_FENCE(rw,w);    \
>> +    WRITE_ONCE(*p, v);    \
>> +} while (0)
>> +
>> +#define __smp_load_acquire(p)    \
>> +({    \
>> +    typeof(*p) ___p1 = READ_ONCE(*p);    \
>> +    compiletime_assert_atomic_type(*p);    \
>> +    RISCV_FENCE(r,rw);    \
>> +    ___p1;    \
>> +})
>> +
>>  /*
>>   * This is a very specific barrier: it's currently only used in two places in
>>   * the kernel, both in the scheduler.  See include/linux/spinlock.h for the two
> 
> I'm adding Daniel just in case I misunderstood what's going on here,
> but these look good to me.  As this is a non-trivial memory model
> change I'm going to let it bake in linux-next for a bit just so it
> gets some visibility.

Looks good to me too.  In particular, it also covers the
Write->release(p)->acquire(p)->Write ordering that we were debating
in the broader LKMM thread, which is good.
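
For the record, that pattern as a litmus-style sketch (test name and
variables are mine):

C write-rel-acq-write

{}

P0(int *x, int *y, int *z)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(z, 1);
	r0 = smp_load_acquire(z);
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	r1 = READ_ONCE(*y);
	smp_rmb();
	r2 = READ_ONCE(*x);
}

exists (0:r0=1 /\ 1:r1=1 /\ 1:r2=0)

The fence-based definitions above are meant to forbid the exists clause:
"fence rw,w" keeps the x-store ahead of the z-store, and "fence r,rw"
keeps the z-load ahead of the y-store.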

Dan

> 
> Thanks


Re: [RFC PATCH] riscv/locking: Strengthen spin_lock() and spin_unlock()

2018-02-22 Thread Daniel Lustig
On 2/22/2018 10:27 AM, Peter Zijlstra wrote:
> On Thu, Feb 22, 2018 at 10:13:17AM -0800, Paul E. McKenney wrote:
>> So we have something that is not all that rare in the Linux kernel
>> community, namely two conflicting more-or-less concurrent changes.
>> This clearly needs to be resolved, either by us not strengthening the
>> Linux-kernel memory model in the way we were planning to or by you
>> strengthening RISC-V to be no weaker than PowerPC for these sorts of
>> externally viewed release-acquire situations.
>>
>> Other thoughts?
> 
> Like said in the other email, I would _much_ prefer to not go weaker
> than PPC, I find that PPC is already painfully weak at times.

Sure, and RISC-V could make this work too by using RCsc instructions
and/or by using lightweight fences instead.  It just wasn't clear at
first whether smp_load_acquire() and smp_store_release() were RCpc,
RCsc, or something else, and hence whether RISC-V would actually need
to use something stronger than pure RCpc there.  Likewise for
spin_unlock()/spin_lock() and everywhere else this comes up.

As Paul's email in the other thread observed, RCpc seems to be
OK for smp_load_acquire()/smp_store_release() at least according
to the current LKMM herd spec.  Unlock/lock are stronger already
I guess.  But if there's an active proposal to strengthen them all
to something stricter than pure RCpc, then that's good to know.

My understanding from earlier discussions is that ARM has no plans
to use their own RCpc instruction for smp_load_acquire() instead
of their RCsc instructions.  Is that still true?  If they were to
use the RCpc load there, that would cause them to have the same
problem we're discussing here, right?  Just checking.

Dan


Re: [RFC PATCH] riscv/locking: Strengthen spin_lock() and spin_unlock()

2018-02-22 Thread Daniel Lustig
On 2/22/2018 6:12 AM, Andrea Parri wrote:
> On Thu, Feb 22, 2018 at 02:40:04PM +0100, Peter Zijlstra wrote:
>> On Thu, Feb 22, 2018 at 01:19:50PM +0100, Andrea Parri wrote:
>>
>>> C unlock-lock-read-ordering
>>>
>>> {}
>>> /* s initially owned by P1 */
>>>
>>> P0(int *x, int *y)
>>> {
>>> WRITE_ONCE(*x, 1);
>>> smp_wmb();
>>> WRITE_ONCE(*y, 1);
>>> }
>>>
>>> P1(int *x, int *y, spinlock_t *s)
>>> {
>>> int r0;
>>> int r1;
>>>
>>> r0 = READ_ONCE(*y);
>>> spin_unlock(s);
>>> spin_lock(s);
>>> r1 = READ_ONCE(*x);
>>> }
>>>
>>> exists (1:r0=1 /\ 1:r1=0)
>>>
>>> RISCV RISCV-unlock-lock-read-ordering
>>> {
>>> 0:x2=x; 0:x4=y;
>>> 1:x2=y; 1:x4=x; 1:x6=s;
>>> s=1;
>>> }
>>>  P0   |  P1  ;
>>>  ori x1,x0,1  | lw x1,0(x2)  ;
>>>  sw x1,0(x2)  | amoswap.w.rl x0,x0,(x6)  ;
>>>  fence w,w| ori x5,x0,1  ;
>>>  ori x3,x0,1  | amoswap.w.aq x0,x5,(x6)  ;
>>>  sw x3,0(x4)  | lw x3,0(x4)  ;
>>> exists
>>> (1:x1=1 /\ 1:x3=0)
>>
>> So I would indeed expect this to be forbidden. Could someone please
>> explain how this could be allowed?
> 
> As mentioned in IRC, my understanding here is only based on the spec.
> referred below and on its (available) formalizations.  I expect that
> RISC-V people will be able to provide more information.

I think an RCpc interpretation of .aq and .rl would in fact allow
the two normal loads in P1 to be reordered.  Wouldn't it?  Does that
go against what the LKMM requires?

The intuition would be that the amoswap.w.aq can forward from the
amoswap.w.rl while that's still in the store buffer, and then the
lw x3,0(x4) can also perform while the amoswap.w.rl is still in
the store buffer, all before the lw x1,0(x2) executes.  That's
not forbidden unless the amoswaps are RCsc, unless I'm missing
something.

Likewise even if the unlock()/lock() is between two stores.  A
control dependency might originate from the load part of the
amoswap.w.aq, but there still would have to be something to
ensure that this load part in fact performs after the store
part of the amoswap.w.rl performs globally, and that's not
automatic under RCpc.
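
(For comparison, an all-fences unlock/lock sketch -- illustrative, not
actual kernel code:

// spin_unlock: release via fence
fence rw,w
sw x0, 0(a0)                 // store 0 to the lock word
...
// spin_lock: acquire via fence
   li t1, 1
0: amoswap.w t0, t1, 0(a0)   // try to take the lock
   bnez t0, 0b               // spin while it was held
   fence r,rw

Something along these lines would give the ordering LKMM wants here
regardless of whether .aq/.rl end up RCpc or RCsc, I believe.)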

In any case, our concern that amoswap.w.rl and amoswap.w.aq might
not actually be sufficient for spin_unlock() and spin_lock() was
what prompted us to start our own internal discussion on this.

FWIW we have made a few clarifications to our spec that haven't
propagated to the public upstream yet, but that should be happening
soon.  In any case as it relates to the questions here, it's
more a matter of us deciding what we should put into the spec in
the end than it is a matter of checking the current text.  In
other words, we're still finalizing things, and there's still
some time to tweak as necessary.

>>
>>> C unlock-lock-write-ordering
>>>
>>> {}
>>> /* s initially owned by P0 */
>>>
>>> P0(int *x, int *y, spinlock_t *s)
>>> {
>>> WRITE_ONCE(*x, 1);
>>> spin_unlock(s);
>>> spin_lock(s);
>>> WRITE_ONCE(*y, 1);
>>> }
>>>
>>> P1(int *x, int *y)
>>> {
>>> int r0;
>>> int r1;
>>>
>>> r0 = READ_ONCE(*y);
>>> smp_rmb();
>>> r1 = READ_ONCE(*x);
>>> }
>>>
>>> exists (1:r0=1 /\ 1:r1=0)
>>>
>>> RISCV RISCV-unlock-lock-write-ordering
>>> {
>>> 0:x2=x; 0:x4=y; 0:x6=s;
>>> 1:x2=y; 1:x4=x;
>>> s=1;
>>> }
>>>  P0   | P1   ;
>>>  ori x1,x0,1  | lw x1,0(x2)  ;
>>>  sw x1,0(x2)  | fence r,r;
>>>  amoswap.w.rl x0,x0,(x6)  | lw x3,0(x4)  ;
>>>  ori x5,x0,1  |  ;
>>>  amoswap.w.aq x0,x5,(x6)  |  ;
>>>  ori x3,x0,1  |  ;
>>>  sw x3,0(x4)  |  ;
>>> exists
>>> (1:x1=1 /\ 1:x3=0)
>>
>> And here I think the RISCV conversion is flawed, there should be a ctrl
>> dependency. The second store-word in P0 should depend on the result of
>> amoswap.w.aq being 0.
> 
> You're right: AFAICT, this can be remedied by inserting "beq x0,x5,FAIL00"
> right after amoswap.w.aq (and this label at the end of P0); this does not
> change the verdict of the available formalizations reported above however.
> 
> (So, AFAICT, the above question remains valid/open.)

This makes sense, but one other question: Paul mentioned earlier in the
thread that

> The PowerPC lock implementation's unlock-lock pair does not order writes
> from the previous critical section against reads from the later critical
> section, but it does order other combinations of reads and writes.

My understanding is that the same logic about control dependencies applies
here too, right?  In other words, in spite of Peter's claim, the control
dependency doesn't automatically fix this for RISC-V or for PowerPC?

> 
>   Andrea
> 
> 
>>
>> (strictly speaking there should be a ctrl-dep in the read example too,
>> except it'd be pointless for ordering reads, so I accept it being left
>> out)
>>
>> Again, I cannot see how this could be allowed.
>>

Peter, in this test you mentioned earlier:


Re: [PATCH RFC tools/lkmm 10/12] tools/memory-model: Add a S lock-based external-view litmus test

2018-02-21 Thread Daniel Lustig
On 2/21/2018 9:27 PM, Boqun Feng wrote:
> On Wed, Feb 21, 2018 at 08:13:57PM -0800, Paul E. McKenney wrote:
>> On Thu, Feb 22, 2018 at 11:23:49AM +0800, Boqun Feng wrote:
>>> On Tue, Feb 20, 2018 at 03:25:10PM -0800, Paul E. McKenney wrote:
>>>> From: Alan Stern 
>>>>
>>>> This commit adds a litmus test in which P0() and P1() form a lock-based S
>>>> litmus test, with the addition of P2(), which observes P0()'s and P1()'s
>>>> accesses with a full memory barrier but without the lock.  This litmus
>>>> test asks whether writes carried out by two different processes under the
>>>> same lock will be seen in order by a third process not holding that lock.
>>>> The answer to this question is "yes" for all architectures supporting
>>>
>>> Hmm.. is this true? Our spin_lock() is RCpc because of PowerPC, so
>>> spin_lock()+spin_unlock() pairs don't provide transitivity, and that's
>>> why we have smp_mb__after_unlock_lock(). Is there something I'm missing?
>>> Or is there an upcoming commit to switch PowerPC's lock implementation?
>>
>> The PowerPC lock implementation's unlock-lock pair does not order writes
>> from the previous critical section against reads from the later critical
>> section, but it does order other combinations of reads and writes.
> 
> Ah.. right! Thanks for the explanation ;-)
> 
>> Some have apparently said that RISC-V's unlock-lock pair also does not
>> order writes from the previous critical section against writes from the
>> later critical section.  And no, I don't claim to have yet gotten my
>> head around RISC-V memory ordering.  ;-)
>>
> 
> Me neither. Now I remember this: we had an off-list (accidental)
> discussion about this, and IIRC at that moment riscv people confirmed
> that riscv's unlock-lock pair doesn't order write->write, but that was
> before their memory model draft was posted for discussion, so things may
> change now... 
> 
> Besides, I think the smp_mb() on P2 can be relaxed to smp_rmb(), no?
> 
> Regards,
> Boqun
> 
>>  Thanx, Paul
>>

As a matter of fact, the RISC-V memory model committee is debating this
exact question at the moment.  Now that I see this thread I'll have to
go back and catch up on what I've missed, but at least I can be on cc
from this point on to answer any RISC-V questions that come up from
here on.

(Is there some place I should add my name as a RISC-V memory model
contact, so I don't miss threads like this in the future?)

And yes, if we go with a purely RCpc interpretation of acquire and
release, then I don't believe the writes in the previous critical
section would be ordered with the writes in the subsequent critical
section.  That's really all the argument boils down to.  We'd like
to support RCpc if we can since that's all some software needs, but
we also obviously want to make sure we can support RCsc and these
kinds of LKMM subtleties efficiently too when needed.  So we have a
few encoding details that we're still finalizing, because questions
like this one are still popping up :)
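
For reference, my understanding of the shape of the test being debated,
as a sketch (probably not literally the one from the patch):

C lock-based-S-external-view

{}

P0(int *x, spinlock_t *s)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
}

P1(int *x, int *y, spinlock_t *s)
{
	spin_lock(s);
	r0 = READ_ONCE(*x);
	WRITE_ONCE(*y, 1);
	spin_unlock(s);
}

P2(int *x, int *y)
{
	r1 = READ_ONCE(*y);
	smp_mb();
	r2 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)

Under a purely RCpc reading of acquire and release, nothing forbids the
exists clause.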

Let me know if you had other RISC-V-specific questions I can help
answer.

Dan


Re: Unlock-lock questions and the Linux Kernel Memory Model

2017-12-01 Thread Daniel Lustig
On 12/1/2017 7:32 AM, Alan Stern wrote:
> On Fri, 1 Dec 2017, Boqun Feng wrote:
>>> But even on a non-other-multicopy-atomic system, there has to be some 
>>> synchronization between the memory controller and P1's CPU.  Otherwise, 
>>> how could the system guarantee that P1's smp_load_acquire would see the 
>>> post-increment value of y?  It seems reasonable to assume that this 
>>> synchronization would also cause P1 to see x=1.
>>>
>>
>> I agree with you on the "reasonable" part ;-) So basically, the memory
>> controller can only perform the write of the AMO after P0's second write
>> has propagated to the memory controller (and because of the wmb(), P0's
>> first write must already have propagated to the memory controller, too),
>> so it makes sense that when the write of the AMO propagates from the
>> memory controller to P1, P0's first write has also propagated to P1.
>> IOW, the write of the AMO on the memory controller acts at least like a
>> release.
>>
>> However, part of me is still a little paranoid, because to my
>> understanding the point of AMOs is to execute atomic operations as fast
>> as possible, so maybe an AMO has some fast path for the memory
>> controller to forward the write to the CPU that issued the AMO, in
>> which case the reasoning above would no longer hold ;-)
> 
> It's true that a hardware design in the future might behave differently 
> from current hardware.  If that ever happens, we will need to rethink 
> the situation.  Maybe the designers will change their hardware to make 
> it match the memory model.  Or maybe the memory model will change.

Do you mean all of the above in the context of increment etc, as opposed
to swap?  ARM hardware in the wild is already documented as forwarding
SWP values to subsequent loads early, even past control dependencies.
Paul sent this link earlier in the thread.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0735r0.html

The reason swap is special is because its store value is available to be
forwarded even before the AMO goes out to the memory controller or
wherever else it gets its load value from.

Also, the case I described is an acquire rather than a control
dependency, but it's similar enough that it doesn't seem completely
unrealistic to think hardware might try to do this.
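
Concretely, the scenario I have in mind looks something like this (a
hypothetical timeline, not something I'm claiming any shipping part
does):

	r1 = atomic_xchg_relaxed(y, 2);	// the store value "2" is known
					// before the AMO's load completes
	r2 = smp_load_acquire(y);	// can read y==2 forwarded straight
					// from the store buffer, so the
					// acquire is satisfied without
					// waiting for the AMO's read of y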

Dan


Re: Unlock-lock questions and the Linux Kernel Memory Model

2017-11-29 Thread Daniel Lustig
On 11/29/2017 12:42 PM, Paul E. McKenney wrote:
> On Wed, Nov 29, 2017 at 02:53:06PM -0500, Alan Stern wrote:
>> On Wed, 29 Nov 2017, Peter Zijlstra wrote:
>>
>>> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
>>>
>>>> While we're here, let me ask about another test which isn't directly
>>>> about unlock/lock but which is still somewhat related to this
>>>> discussion:
>>>>
>>>> "MP+wmb+xchg-acq" (or some such)
>>>>
>>>> {}
>>>>
>>>> P0(int *x, int *y)
>>>> {
>>>> WRITE_ONCE(*x, 1);
>>>> smp_wmb();
>>>> WRITE_ONCE(*y, 1);
>>>> }
>>>>
>>>> P1(int *x, int *y)
>>>> {
>>>> r1 = atomic_xchg_relaxed(y, 2);
>>>> r2 = smp_load_acquire(y);
>>>> r3 = READ_ONCE(*x);
>>>> }
>>>>
>>>> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
>>>>
>>>> C/C++ would call the atomic_xchg_relaxed part of a release sequence
>>>> and hence would forbid this outcome.
>>>
>>> That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
>>> mean _release is just daft.
>>
>> The C11 memory model specifically allows atomic operations to be 
>> interspersed within a release sequence.  But it doesn't say why.
> 
> The use case put forward within the committee is for atomic quantities
> with mode bits.  The most frequent has the atomic quantity having
> lock-like properties, in which case you don't want to lose the ordering
> effects of the lock handoff just because a mode bit got set or cleared.
> Some claim to actually use something like this, but details have not
> been forthcoming.
> 
> I confess to being a bit skeptical.  If the mode changes are infrequent,
> the update could just as well be ordered.

Aren't reference counting implementations which use memory_order_relaxed
for incrementing the count another important use case?  Specifically,
the synchronization between a memory_order_release decrement and the
eventual memory_order_acquire/consume free shouldn't be interrupted by
other (relaxed) increments and (release-only) decrements that happen in
between.  At least that's my understanding of this use case.  I wasn't
there when the C/C++ committee decided this.
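
Concretely, the idiom I have in mind is something like this (a sketch in
C11; the names are mine):

#include <stdatomic.h>
#include <stdlib.h>

struct obj {
	atomic_int refs;
	/* ... payload ... */
};

static void obj_get(struct obj *o)
{
	/* Taking a reference needs no ordering by itself. */
	atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);
}

static void obj_put(struct obj *o)
{
	/* Release: this thread's accesses happen-before the free. */
	if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_release) == 1) {
		/* Acquire: synchronize with all earlier release decrements. */
		atomic_thread_fence(memory_order_acquire);
		free(o);
	}
}

The relaxed increments in obj_get() are exactly the operations that end
up sitting in the middle of the release sequence.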

> That said, Daniel, the C++ memory model really does require that the
> above litmus test be forbidden, my denigration of it notwithstanding.

Yes I agree, that's why I'm curious what the Linux memory model has
in mind here :)

Dan

>   Thanx, Paul
> 


Re: Unlock-lock questions and the Linux Kernel Memory Model

2017-11-29 Thread Daniel Lustig
On 11/27/2017 1:16 PM, Alan Stern wrote:
> This is essentially a repeat of an email I sent out before the
> Thanksgiving holiday, the assumption being that lack of any responses
> was caused by the holiday break.  (And this time the message is CC'ed
> to LKML, so there will be a public record of it.)
> 
> A few people have said they believe the Linux Kernel Memory Model
> should make unlock followed by lock (of the same variable) act as a
> write memory barrier.  In other words, they want the memory model to
> forbid the following litmus test:
>

> 
> I (and others!) would like to know people's opinions on these matters.
> 
> Alan Stern

While we're here, let me ask about another test which isn't directly
about unlock/lock but which is still somewhat related to this
discussion:

"MP+wmb+xchg-acq" (or some such)

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	r1 = atomic_xchg_relaxed(y, 2);
	r2 = smp_load_acquire(y);
	r3 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)

C/C++ would call the atomic_xchg_relaxed part of a release sequence
and hence would forbid this outcome.

x86 and Power would forbid this.  ARM forbids this via a special-case
rule in the memory model, ordering atomics with later load-acquires.

RISC-V, however, wouldn't forbid this by default using RCpc or RCsc
atomics for smp_load_acquire().  It's an "fri; rfi" type of pattern,
because xchg doesn't have an inherent internal data dependency.

If the Linux memory model is going to forbid this outcome, then
RISC-V would either need to use fences instead, or maybe we'd need to
add a similar special rule to our own memory model.  This is one detail
where RISC-V is still actively deciding what to do.

Have you all thought about this test before?  Any idea which way you
are leaning regarding the outcome above?
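
(If it helps anyone experiment: assuming you have a copy of the
in-progress LKMM sources and save the test above to a file, something
like

	herd7 -conf linux-kernel.cfg MP+wmb+xchg-acq.litmus

should report whether the model as currently drafted allows the
outcome.)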

Thanks,
Dan


Re: Unlock-lock questions and the Linux Kernel Memory Model

2017-11-27 Thread Daniel Lustig
On 11/27/2017 1:16 PM, Alan Stern wrote:
> C rel-acq-write-ordering-3
> 
> {}
> 
> P0(int *x, int *s, int *y)
> {
>   WRITE_ONCE(*x, 1);
>   smp_store_release(s, 1);
>   r1 = smp_load_acquire(s);
>   WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
>   r2 = READ_ONCE(*y);
>   smp_rmb();
>   r3 = READ_ONCE(*x);
> }
> 
> exists (1:r2=1 /\ 1:r3=0)
> 

> 
> And going to extremes...

Sorry if I'm missing something obvious, but before going to extremes...
what about this one?

"SB+rel-acq" (or please rename if you have a different scheme)

{}

P0(int *x, int *s, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(s, 1);
	r1 = smp_load_acquire(s);
	r2 = READ_ONCE(*y);
}

P1(int *x, int *s, int *y)
{
	WRITE_ONCE(*y, 1);
	smp_store_release(s, 2);
	r3 = smp_load_acquire(s);
	r4 = READ_ONCE(*x);
}

exists (0:r2=0 /\ 1:r4=0)

If smp_store_release() and smp_load_acquire() map to normal TSO loads
and stores on x86, then this test can't be forbidden, can it?
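
Here's the mapping I'm assuming (plain MOVs on both sides; register
names and addressing are schematic):

	P0:			P1:
	mov [x], 1		mov [y], 1
	mov [s], 1		mov [s], 2
	mov r1, [s]		mov r3, [s]
	mov r2, [y]		mov r4, [x]

Each CPU can satisfy its load of s by forwarding from its own store
buffer, and its two stores can still be sitting in that buffer when the
final load executes.  So r2=0 /\ r4=0 looks reachable, the same way
plain SB is.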

Similar question for the other tests, but this is probably the
easiest one to analyze.

Dan


RE: [patches] Re: [PATCH v9 05/12] RISC-V: Atomic and Locking Code

2017-11-16 Thread Daniel Lustig
> From: Will Deacon [mailto:will.dea...@arm.com]
> Hi Daniel,
> 
> On Thu, Nov 16, 2017 at 06:40:46AM +0000, Daniel Lustig wrote:
> > > > In that case, maybe we should just start out having a fence on
> > > > both sides for
> > >
> > > Actually, given your architecture is RCsc rather than RCpc, I
> > > think maybe you could follow the way that ARM uses (i.e. relaxed load
> > > + release store + a full barrier). You can see the commit log of
> > > 8e86f0b409a4
> > > ("arm64: atomics: fix use of acquire + release for full barrier
> > > semantics") for the reasoning:
> > >
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e86f0b409a44193f1587e87b69c5dcf8f65be67
> >
> > OK I'm catching up now...and I have to say, that is clever!  Thanks
> > for the corrections.  It would definitely be good to avoid the double
> > fence.  Let me think this over a bit more.
> >
> > One subtle point about RCpc vs. RCsc, though: see below.
> >
> > >
> > > Sounds great! Any estimation when we can see that(maybe a draft)?
> >
> > The latest should be November 28-30, coinciding with the next RISC-V
> > workshop.
> > Possibly even sooner.  We just recently resolved a big debate that's
> > been holding us up for a while, and so now it's just a matter of me
> > writing it all up so it's understandable.
> >
> > In the meantime, though, let me quickly and informally summarize some
> > of the highlights, in case it helps unblock any further review here:
> >
> > - The model is now (other-)multi-copy atomic
> > - .aq and .rl alone mean RCpc
> > - .aqrl means RCsc
> 
> That presents you with an interesting choice when implementing locks: do
> you use .aqrl for lock and unlock and elide smp_mb__after_spinlock, or do
> you use .aq/.rl for lock/unlock respectively and emit a fence for
> smp_mb__after_spinlock?

Good question, and I should probably do my homework before weighing in strongly
either way.

> 
> > - .aq applies only to the read part of an RMW
> > - .rl applies only to the write part of an RMW
> > - Syntactic dependencies are now respected
> 
> Thanks for the update, even this brief summary is better than the current
> architecture document ;)
> 
> > I recognize this isn't enough detail to do it proper justice, but
> > we'll get the full model out as soon as we can.
> 
> Ok, I'll bite! Do you forbid LB?

This was definitely one of the topics we debated.  In the end, we chose to
allow LB but forbid LB+deps or stronger.  CPUs may not produce LB all that
often, but we didn't see a super strong need to rule it out either.
Out-of-thin-air is covered now that we enforce dependencies.  Also, GPUs do
produce LB, and that's relevant to our (NVIDIA's) use case at least.
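
For concreteness, by "LB+deps" I mean a shape like this (a sketch):

C LB+deps

{}

P0(int *x, int *y)
{
	r1 = READ_ONCE(*x);
	WRITE_ONCE(*y, r1);
}

P1(int *x, int *y)
{
	r2 = READ_ONCE(*y);
	WRITE_ONCE(*x, r2);
}

exists (0:r1=1 /\ 1:r2=1)

We forbid that one, since the data dependencies are respected (which is
what rules out out-of-thin-air).  Plain LB is the same shape with the
writes storing constant 1 instead of r1/r2, and that one we allow.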

I'm happy to keep answering other questions too.

Dan


RE: [patches] Re: [PATCH v9 05/12] RISC-V: Atomic and Locking Code

2017-11-15 Thread Daniel Lustig
> > In that case, maybe we should just start out having a fence on both
> > sides for
> 
> Actually, given your architecture is RCsc rather than RCpc, I think maybe
> you could follow the way that ARM uses (i.e. relaxed load + release store + a
> full barrier). You can see the commit log of 8e86f0b409a4
> ("arm64: atomics: fix use of acquire + release for full barrier
> semantics") for the reasoning:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e86f0b409a44193f1587e87b69c5dcf8f65be67

OK I'm catching up now...and I have to say, that is clever!  Thanks for the
corrections.  It would definitely be good to avoid the double fence.  Let me
think this over a bit more.
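
For anyone following along, here's my reading of the mapping in that
commit, for atomic_add_return() (simplified; register choices are mine):

1:	ldxr	w0, [x2]	// relaxed load-exclusive
	add	w0, w0, w1
	stlxr	w3, w0, [x2]	// store-release exclusive
	cbnz	w3, 1b
	dmb	ish		// trailing full barrier

The trailing dmb ish is what upgrades the pair to a full barrier; the
release on the store alone wouldn't be enough.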

One subtle point about RCpc vs. RCsc, though: see below.

> 
> Sounds great! Any estimation when we can see that(maybe a draft)?

The latest should be November 28-30, coinciding with the next RISC-V workshop.
Possibly even sooner.  We just recently resolved a big debate that's been
holding us up for a while, and so now it's just a matter of me writing it all
up so it's understandable.

In the meantime, though, let me quickly and informally summarize some of the
highlights, in case it helps unblock any further review here:

- The model is now (other-)multi-copy atomic
- .aq and .rl alone mean RCpc
- .aqrl means RCsc
- .aq applies only to the read part of an RMW
- .rl applies only to the write part of an RMW
- Syntactic dependencies are now respected

I recognize this isn't enough detail to do it proper justice, but we'll get the
full model out as soon as we can.

> 
> Regards,
> Boqun
> 
> > Dan


RE: [patches] Re: [PATCH v9 05/12] RISC-V: Atomic and Locking Code

2017-11-15 Thread Daniel Lustig
> -----Original Message-----
> From: Boqun Feng [mailto:boqun.f...@gmail.com]
> Sent: Wednesday, November 15, 2017 5:19 PM
> To: Daniel Lustig <dlus...@nvidia.com>
> Cc: Palmer Dabbelt <pal...@dabbelt.com>; will.dea...@arm.com; Arnd
> Bergmann <a...@arndb.de>; Olof Johansson <o...@lixom.net>; linux-
> ker...@vger.kernel.org; patc...@groups.riscv.org; pet...@infradead.org
> Subject: Re: [patches] Re: [PATCH v9 05/12] RISC-V: Atomic and Locking Code
> 
> On Wed, Nov 15, 2017 at 11:59:44PM +0000, Daniel Lustig wrote:
> > > On Wed, 15 Nov 2017 10:06:01 PST (-0800), will.dea...@arm.com wrote:
> > >> On Tue, Nov 14, 2017 at 12:30:59PM -0800, Palmer Dabbelt wrote:
> > >> > On Tue, 24 Oct 2017 07:10:33 PDT (-0700), will.dea...@arm.com
> > >> > wrote:
> > >> >>On Tue, Sep 26, 2017 at 06:56:31PM -0700, Palmer Dabbelt wrote:
> > > >
> > > > Hi Palmer,
> > > >
> > > >> >>+ATOMIC_OPS(add, add, +,  i,  , _relaxed)
> > > >> >>+ATOMIC_OPS(add, add, +,  i, .aq  , _acquire)
> > > >> >>+ATOMIC_OPS(add, add, +,  i, .rl  , _release)
> > > >> >>+ATOMIC_OPS(add, add, +,  i, .aqrl, )
> > > >> >
> > > >> >Have you checked that .aqrl is equivalent to "ordered", since
> > > >> >there are interpretations where that isn't the case. Specifically:
> > > >> >
> > > >> >// all variables zero at start of time
> > > >> >P0:
> > > >> >WRITE_ONCE(x) = 1;
> > > >> >atomic_add_return(y, 1);
> > > >> >WRITE_ONCE(z) = 1;
> > > >> >
> > > >> >P1:
> > > >> >READ_ONCE(z) // reads 1
> > > >> >smp_rmb();
> > > >> >READ_ONCE(x) // must not read 0
> > > >>
> > > >> I haven't.  We don't quite have a formal memory model specification
> > > >> yet.
> > > >> I've added Daniel Lustig, who is creating that model.  He should
> > > >> have a better idea
> > > >
> > > > Thanks. You really do need to ensure that, as it's heavily relied upon.
> > >
> > > I know it's the case for our current processors, and I'm pretty sure
> > > it's the case for what's formally specified, but we'll have to wait
> > > for the spec in order to prove it.
> >
> > I think Will is right.  In the current spec, using .aqrl converts an
> > RCpc load or store into an RCsc load or store, but the acquire(-RCsc)
> > annotation still only applies to the load part of the atomic, and the
> > release(-RCsc) annotation applies only to the store part of the atomic.
> >
> > Why is that?  Picture a machine which implements AMOs using something
> > that looks more like an LR/SC under the covers, or one that uses cache
> > line locking, or anything else along those same lines.  In some such
> > machines, there could be a window between lock/reserve and
> > unlock/store-conditional where other later stores could squeeze into, and
> that would break Will's example among others.
> >
> > It's likely the same reasoning that causes ARM to use a trailing dmb
> > here, rather than just using ldaxr/stlxr.  Is that right Will?  I know
> > that's LL/SC and this particular case uses AMOADD, but it's the same
> > principle.  Well, at least according to how we have it in the current memory
> model draft.
> >
> > Also, RISC-V currently prefers leading fence mappings, so I think the
> > result here, for atomic_add_return() for example, should be this:
> >
> > fence rw,rw
> > amoadd.aq ...
> >
> 
> Hmm.. if atomic_add_return() is implemented like that, how about the
> following case:
> 
>   {x=0, y=0}
> 
>   P1:
> 
>   r1 = atomic_add_return(&x, 1); // r1 == 1, x will be 1 afterwards
>   WRITE_ONCE(y, 1);
> 
>   P2:
> 
>   r2 = READ_ONCE(y); // r2 = 1
>   smp_rmb();
>   r3 = atomic_read(&x); // r3 = 0?
> 
> , could this result in r1 == 1 && r2 == 1 && r3 == 0? Given you said .aq only
> affects the load part of the AMO, and I don't see anything here preventing the
> reordering between the store of y and the store part of the AMO on P1.
> 
> Note: we don't allow (r1 == 1 && r2 == 1 && r3 == 0) in above case for linux
> kernel. Please see Documentation/atomic_t.txt:
> 
> "Fully ordered primitives are ordered against everything prior and everything
> subsequent. Therefore a fully ordered primitive is like having an smp_mb()
> before and an smp_mb() after the primitive."

Yes, you're right Boqun.  Good catch, and sorry for over-optimizing too quickly.

In that case, maybe we should just start out having a fence on both sides for
now, and then we'll discuss offline whether we want to change the model's
behavior here.
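
That is, something like this for atomic_add_return() (a sketch; register
choices are arbitrary):

	fence rw,rw		# leading full fence
	amoadd.w a0, a1, (a2)	# the AMO itself, unannotated
	fence rw,rw		# trailing full fence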

Dan


RE: [patches] Re: [PATCH v9 05/12] RISC-V: Atomic and Locking Code

2017-11-15 Thread Daniel Lustig
> On Wed, 15 Nov 2017 10:06:01 PST (-0800), will.dea...@arm.com wrote:
>> On Tue, Nov 14, 2017 at 12:30:59PM -0800, Palmer Dabbelt wrote:
>> > On Tue, 24 Oct 2017 07:10:33 PDT (-0700), will.dea...@arm.com wrote:
>> >>On Tue, Sep 26, 2017 at 06:56:31PM -0700, Palmer Dabbelt wrote:
> >
> > Hi Palmer,
> >
> >> >>+ATOMIC_OPS(add, add, +,  i,  , _relaxed)
> >> >>+ATOMIC_OPS(add, add, +,  i, .aq  , _acquire)
> >> >>+ATOMIC_OPS(add, add, +,  i, .rl  , _release)
> >> >>+ATOMIC_OPS(add, add, +,  i, .aqrl, )
> >> >
> >> >Have you checked that .aqrl is equivalent to "ordered", since there
> >> >are interpretations where that isn't the case. Specifically:
> >> >
> >> >// all variables zero at start of time
> >> >P0:
> >> >WRITE_ONCE(x) = 1;
> >> >atomic_add_return(y, 1);
> >> >WRITE_ONCE(z) = 1;
> >> >
> >> >P1:
> >> >READ_ONCE(z) // reads 1
> >> >smp_rmb();
> >> >READ_ONCE(x) // must not read 0
> >>
> >> I haven't.  We don't quite have a formal memory model specification yet.
> >> I've added Daniel Lustig, who is creating that model.  He should have
> >> a better idea
> >
> > Thanks. You really do need to ensure that, as it's heavily relied upon.
> 
> I know it's the case for our current processors, and I'm pretty sure it's the
> case for what's formally specified, but we'll have to wait for the spec in
> order to prove it.

I think Will is right.  In the current spec, using .aqrl converts an RCpc load
or store into an RCsc load or store, but the acquire(-RCsc) annotation still
only applies to the load part of the atomic, and the release(-RCsc) annotation
applies only to the store part of the atomic.

Why is that?  Picture a machine which implements AMOs using something that
looks more like an LR/SC under the covers, or one that uses cache line locking,
or anything else along those same lines.  In some such machines, there could be
a window between lock/reserve and unlock/store-conditional that other later
stores could squeeze into, and that would break Will's example among others.
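
Roughly, on such a machine the AMO in Will's example could behave as if
it expanded to something like this (purely hypothetical):

	# atomic_add_return(y, 1), implemented as LR/SC under the covers:
1:	lr.w.aq	t0, (a0)	# .aq attaches to the load part
	add	t1, t0, 1
	sc.w.rl	t2, t1, (a0)	# .rl attaches to the store part
	bnez	t2, 1b
	# window: nothing above forces the later WRITE_ONCE(z) = 1 to wait
	# for the store part, so it could become visible in between, letting
	# P1 observe z=1 while x is still 0.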

It's likely the same reasoning that causes ARM to use a trailing dmb here,
rather than just using ldaxr/stlxr.  Is that right Will?  I know that's LL/SC
and this particular case uses AMOADD, but it's the same principle.  Well, at
least according to how we have it in the current memory model draft.

Also, RISC-V currently prefers leading fence mappings, so I think the result
here, for atomic_add_return() for example, should be this:

fence rw,rw
amoadd.aq ...

Note that at this point, I think you could even elide the .rl.  If I'm reading
it right it looks like the ARM mapping does this too (well, the reverse: ARM
elides the "a" in ldaxr due to the trailing dmb making it redundant).

Does that seem reasonable to you all?

Dan


RE: [PATCH 13/17] RISC-V: Add include subdirectory

2017-06-26 Thread Daniel Lustig
> > https://github.com/riscv/riscv-isa-manual/releases/download/riscv-user-2.2/riscv-spec-v2.2.pdf
> 
> That's the most up to date spec.

Yes, that's the most up to date public spec.  Internally, the RISC-V memory
model task group has been working on fixing the memory model spec for the
past couple of months now.  We're aiming to release it for public review
well before the end of the year.  Hopefully in the coming weeks even.

> > which, on the one hand is reassuring (because ignoring dependency
> > ordering is plain broken), but on the other it doesn't go quite far
> > enough in defining exactly what constitutes a "syntactic data
> > dependency". The cumulativity of your fences also needs defining,
> > because I think this was up in the air at some point and the document
> > above doesn't seem to tackle it (it doesn't seem to describe what
> > constitutes being a member of the predecessor or successor sets)

That will all be covered in the new spec.

> > P.S. You should also totally get your architects to write a formal
> > model ;)

Also in progress :)

Were there any more specific questions I can answer in the meantime?  Or
any specific concern you'd like to point me to?

Dan

