Re: Linux-kernel examples for LKMM recipes

2017-10-18 Thread Paul E. McKenney
On Wed, Oct 18, 2017 at 05:18:37PM -0400, Alan Stern wrote:
> On Wed, 18 Oct 2017, Paul E. McKenney wrote:
> 
> > > Well, you could explicitly mention that in the multi-thread case, this
> > > means all accesses to the shared variable had better use READ_ONCE() or
> > > WRITE_ONCE().
> > 
> > Like this?
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > d.  If there are multiple CPUs, accesses to shared variables
> > should use READ_ONCE() and WRITE_ONCE() or stronger
> > to prevent load/store tearing, load/store fusing, and
> > invented loads and stores.  There are exceptions to
> > this rule, for example:
> > 
> > i.  When there is no possibility of a given
> > shared variable being updated, for example,
> > while holding the update-side lock, reads
> > from that variable need not use READ_ONCE().
> > 
> > ii. When there is no possibility of a given shared
> > variable being either read or updated, for
> > example, when running during early boot, reads
> > from that variable need not use READ_ONCE() and
> > writes to that variable need not use WRITE_ONCE().
> 
> Yeah, except that you mean being read or updated by another thread.

Good point, added that qualifier.

Thanx, Paul
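A minimal sketch of rule (d) and exception (i) above; the lock, the
variable, and the values here are hypothetical rather than taken from
any particular kernel code:

	static DEFINE_SPINLOCK(mylock);		/* hypothetical update-side lock */
	static int shared;			/* hypothetical shared variable */
	int r1, r2;

	/* Racing with updates from other CPUs: use marked accesses. */
	WRITE_ONCE(shared, 1);
	r1 = READ_ONCE(shared);

	/*
	 * Exception (i): while the update-side lock is held, the value
	 * cannot change underneath us, so a plain read suffices.
	 * (Writes still need WRITE_ONCE() if there are lockless readers.)
	 */
	spin_lock(&mylock);
	r2 = shared;
	spin_unlock(&mylock);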



Re: Linux-kernel examples for LKMM recipes

2017-10-18 Thread Alan Stern
On Wed, 18 Oct 2017, Paul E. McKenney wrote:

> > Well, you could explicitly mention that in the multi-thread case, this
> > means all accesses to the shared variable had better use READ_ONCE() or
> > WRITE_ONCE().
> 
> Like this?
> 
>   Thanx, Paul
> 
> 
> 
>   d.  If there are multiple CPUs, accesses to shared variables
>   should use READ_ONCE() and WRITE_ONCE() or stronger
>   to prevent load/store tearing, load/store fusing, and
>   invented loads and stores.  There are exceptions to
>   this rule, for example:
> 
>   i.  When there is no possibility of a given
>   shared variable being updated, for example,
>   while holding the update-side lock, reads
>   from that variable need not use READ_ONCE().
> 
>   ii. When there is no possibility of a given shared
>   variable being either read or updated, for
>   example, when running during early boot, reads
>   from that variable need not use READ_ONCE() and
>   writes to that variable need not use WRITE_ONCE().

Yeah, except that you mean being read or updated by another thread.

Alan



Re: Linux-kernel examples for LKMM recipes

2017-10-18 Thread Paul E. McKenney
On Wed, Oct 18, 2017 at 10:43:42AM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> 
> > > > > > b.  Compilers are permitted to use the "as-if" rule.
> > > > > > That is, a compiler can emit whatever code it likes,
> > > > > > as long as the results appear just as if the compiler
> > > > > > had followed all the relevant rules.  To see this,
> > > > > > compile with a high level of optimization and run
> > > > > > the debugger on the resulting binary.
> > > > > 
> > > > > You might omit the last sentence.  Furthermore, if the accesses don't
> > > > > use READ_ONCE/WRITE_ONCE then the code might not get the same result 
> > > > > as
> > > > > if it had executed in order (even for a single variable!), and if you
> > > > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > > > it likes.
> > > > 
> > > > Ah, I omitted an important qualifier:
> > > > 
> > > > b.  Compilers are permitted to use the "as-if" rule.  That 
> > > > is,
> > > > a compiler can emit whatever code it likes, as long as
> > > > the results of a single-threaded execution appear just
> > > > as if the compiler had followed all the relevant rules.
> > > > To see this, compile with a high level of optimization
> > > > and run the debugger on the resulting binary.
> > > 
> > > That's okay for the single-CPU case.  I don't think it covers the
> > > multiple-CPU single-variable case correctly, though.  If you don't use
> > > READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> > > and stores?  And won't that potentially cause the end result to be
> > > different from what you would get if the code had appeared to execute
> > > in order?
> > 
> > Ah, good point, I need yet another qualifier.  How about the following?
> > 
> > b.  Compilers are permitted to use the "as-if" rule.  That is,
> > a compiler can emit whatever code it likes for normal
> > accesses, as long as the results of a single-threaded
> > execution appear just as if the compiler had followed
> > all the relevant rules.  To see this, compile with a
> > high level of optimization and run the debugger on the
> > resulting binary.
> > 
> > I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
> > and atomics.  This, in conjunction with the previously added
> > "single-threaded execution" means that yes, the compiler is permitted
> > to tear normal loads and stores.  The reason is that a single-threaded
> > run could not tell the difference.  Interrupt handlers or multiple
> > threads are required to detect load/store tearing.
> > 
> > So, what am I still missing?  ;-)
> 
> Well, you could explicitly mention that in the multi-thread case, this
> means all accesses to the shared variable had better use READ_ONCE() or
> WRITE_ONCE().

Like this?

Thanx, Paul



d.  If there are multiple CPUs, accesses to shared variables
should use READ_ONCE() and WRITE_ONCE() or stronger
to prevent load/store tearing, load/store fusing, and
invented loads and stores.  There are exceptions to
this rule, for example:

i.  When there is no possibility of a given
shared variable being updated, for example,
while holding the update-side lock, reads
from that variable need not use READ_ONCE().

ii. When there is no possibility of a given shared
variable being either read or updated, for
example, when running during early boot, reads
from that variable need not use READ_ONCE() and
writes to that variable need not use WRITE_ONCE().
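A minimal illustration of the load fusing that rule (d) guards
against; need_to_stop is a hypothetical flag that some other CPU sets
with WRITE_ONCE():

	static int need_to_stop;	/* hypothetical flag, set elsewhere */

	/*
	 * BUG: with a plain load, the compiler may fuse the repeated
	 * loads into a single load hoisted out of the loop, so a store
	 * by another CPU might never be noticed.
	 */
	while (!need_to_stop)
		cpu_relax();

	/* A marked load forbids such fusing (and tearing): */
	while (!READ_ONCE(need_to_stop))
		cpu_relax();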



Re: Linux-kernel examples for LKMM recipes

2017-10-18 Thread Alan Stern
On Tue, 17 Oct 2017, Paul E. McKenney wrote:

> > > > >   b.  Compilers are permitted to use the "as-if" rule.
> > > > >   That is, a compiler can emit whatever code it likes,
> > > > >   as long as the results appear just as if the compiler
> > > > >   had followed all the relevant rules.  To see this,
> > > > >   compile with a high level of optimization and run
> > > > >   the debugger on the resulting binary.
> > > > 
> > > > You might omit the last sentence.  Furthermore, if the accesses don't
> > > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > > if it had executed in order (even for a single variable!), and if you
> > > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > > it likes.
> > > 
> > > Ah, I omitted an important qualifier:
> > > 
> > >   b.  Compilers are permitted to use the "as-if" rule.  That is,
> > >   a compiler can emit whatever code it likes, as long as
> > >   the results of a single-threaded execution appear just
> > >   as if the compiler had followed all the relevant rules.
> > >   To see this, compile with a high level of optimization
> > >   and run the debugger on the resulting binary.
> > 
> > That's okay for the single-CPU case.  I don't think it covers the
> > multiple-CPU single-variable case correctly, though.  If you don't use
> > READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> > and stores?  And won't that potentially cause the end result to be
> > different from what you would get if the code had appeared to execute
> > in order?
> 
> Ah, good point, I need yet another qualifier.  How about the following?
> 
>   b.  Compilers are permitted to use the "as-if" rule.  That is,
>   a compiler can emit whatever code it likes for normal
>   accesses, as long as the results of a single-threaded
>   execution appear just as if the compiler had followed
>   all the relevant rules.  To see this, compile with a
>   high level of optimization and run the debugger on the
>   resulting binary.
> 
> I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
> and atomics.  This, in conjunction with the previously added
> "single-threaded execution" means that yes, the compiler is permitted
> to tear normal loads and stores.  The reason is that a single-threaded
> run could not tell the difference.  Interrupt handlers or multiple
> threads are required to detect load/store tearing.
> 
> So, what am I still missing?  ;-)

Well, you could explicitly mention that in the multi-thread case, this
means all accesses to the shared variable had better use READ_ONCE() or
WRITE_ONCE().

Alan



Re: Linux-kernel examples for LKMM recipes

2017-10-17 Thread Paul E. McKenney
On Tue, Oct 17, 2017 at 05:03:01PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> 
> > On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> > > On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> > > 
> > > > How about this?
> > > > 
> > > > 0.  Simple special cases
> > > > 
> > > > If there is only one CPU on the one hand or only one variable
> > > > on the other, the code will execute in order.  There are (as
> > > > usual) some things to be careful of:
> > > > 
> > > > a.  There are some aspects of the C language that are
> > > > unordered.  For example, the compiler can output code
> > > > computing arguments of a multi-parameter function in
> > > > any order it likes, or even interleaved if it so 
> > > > chooses.
> > > 
> > > That parses a little oddly.  I wouldn't agree that the compiler outputs
> > > the code in any order it likes!
> > 
> > When was the last time you talked to a compiler writer?  ;-)
> > 
> > > In fact, I wouldn't even mention the compiler at all.  Just say that
> > > (with a few exceptions) the language doesn't specify the order in which
> > > the arguments of a function or operation should be evaluated.  For
> > > example, in the expression "f(x) + g(y)", the order in which f and g
> > > are called is not defined; the object code is allowed to use either
> > > order or even to interleave the computations.
> > 
> > Nevertheless, I took your suggestion:
> > 
> > a.  There are some aspects of the C language that are
> > unordered.  For example, in the expression "f(x) + g(y)",
> > the order in which f and g are called is not defined;
> > the object code is allowed to use either order or even
> > to interleave the computations.
> 
> Good.
> 
> > > > b.  Compilers are permitted to use the "as-if" rule.
> > > > That is, a compiler can emit whatever code it likes,
> > > > as long as the results appear just as if the compiler
> > > > had followed all the relevant rules.  To see this,
> > > > compile with a high level of optimization and run
> > > > the debugger on the resulting binary.
> > > 
> > > You might omit the last sentence.  Furthermore, if the accesses don't
> > > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > > if it had executed in order (even for a single variable!), and if you
> > > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > > it likes.
> > 
> > Ah, I omitted an important qualifier:
> > 
> > b.  Compilers are permitted to use the "as-if" rule.  That is,
> > a compiler can emit whatever code it likes, as long as
> > the results of a single-threaded execution appear just
> > as if the compiler had followed all the relevant rules.
> > To see this, compile with a high level of optimization
> > and run the debugger on the resulting binary.
> 
> That's okay for the single-CPU case.  I don't think it covers the
> multiple-CPU single-variable case correctly, though.  If you don't use
> READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
> and stores?  And won't that potentially cause the end result to be
> different from what you would get if the code had appeared to execute
> in order?

Ah, good point, I need yet another qualifier.  How about the following?

b.  Compilers are permitted to use the "as-if" rule.  That is,
a compiler can emit whatever code it likes for normal
accesses, as long as the results of a single-threaded
execution appear just as if the compiler had followed
all the relevant rules.  To see this, compile with a
high level of optimization and run the debugger on the
resulting binary.

I added "for normal accesses", which excludes READ_ONCE(), WRITE_ONCE(),
and atomics.  This, in conjunction with the previously added
"single-threaded execution" means that yes, the compiler is permitted
to tear normal loads and stores.  The reason is that a single-threaded
run could not tell the difference.  Interrupt handlers or multiple
threads are required to detect load/store tearing.

So, what am I still missing?  ;-)

> > I have seen people (including kernel hackers) surprised by what optimizers
> > do, so I would prefer that the last sentence remain.
> > 
> > > > c.  If there is only one variable but multiple CPUs, all
> > > > accesses to that variable must be aligned and full 
> > > > sized.
> > > 
> > > I would say that the variable is what needs to be aligned, not the
> > > accesses.  (Although, if the variable is aligned and all the accesses
> > > are full sized, then they must necessarily be aligned as well.)
> > 
> > I was 

Re: Linux-kernel examples for LKMM recipes

2017-10-17 Thread Alan Stern
On Tue, 17 Oct 2017, Paul E. McKenney wrote:

> On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> > On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> > 
> > > How about this?
> > > 
> > > 0.Simple special cases
> > > 
> > >   If there is only one CPU on the one hand or only one variable
> > >   on the other, the code will execute in order.  There are (as
> > >   usual) some things to be careful of:
> > > 
> > >   a.  There are some aspects of the C language that are
> > >   unordered.  For example, the compiler can output code
> > >   computing arguments of a multi-parameter function in
> > >   any order it likes, or even interleaved if it so chooses.
> > 
> > That parses a little oddly.  I wouldn't agree that the compiler outputs
> > the code in any order it likes!
> 
> When was the last time you talked to a compiler writer?  ;-)
> 
> > In fact, I wouldn't even mention the compiler at all.  Just say that
> > (with a few exceptions) the language doesn't specify the order in which
> > the arguments of a function or operation should be evaluated.  For
> > example, in the expression "f(x) + g(y)", the order in which f and g
> > are called is not defined; the object code is allowed to use either
> > order or even to interleave the computations.
> 
> Nevertheless, I took your suggestion:
> 
>   a.  There are some aspects of the C language that are
>   unordered.  For example, in the expression "f(x) + g(y)",
>   the order in which f and g are called is not defined;
>   the object code is allowed to use either order or even
>   to interleave the computations.

Good.

> > >   b.  Compilers are permitted to use the "as-if" rule.
> > >   That is, a compiler can emit whatever code it likes,
> > >   as long as the results appear just as if the compiler
> > >   had followed all the relevant rules.  To see this,
> > >   compile with a high level of optimization and run
> > >   the debugger on the resulting binary.
> > 
> > You might omit the last sentence.  Furthermore, if the accesses don't
> > use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> > if it had executed in order (even for a single variable!), and if you
> > do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> > it likes.
> 
> Ah, I omitted an important qualifier:
> 
>   b.  Compilers are permitted to use the "as-if" rule.  That is,
>   a compiler can emit whatever code it likes, as long as
>   the results of a single-threaded execution appear just
>   as if the compiler had followed all the relevant rules.
>   To see this, compile with a high level of optimization
>   and run the debugger on the resulting binary.

That's okay for the single-CPU case.  I don't think it covers the
multiple-CPU single-variable case correctly, though.  If you don't use
READ_ONCE or WRITE_ONCE, isn't the compiler allowed to tear the loads
and stores?  And won't that potentially cause the end result to be
different from what you would get if the code had appeared to execute
in order?

> I have seen people (including kernel hackers) surprised by what optimizers
> do, so I would prefer that the last sentence remain.
> 
> > >   c.  If there is only one variable but multiple CPUs, all
> > >   accesses to that variable must be aligned and full sized.
> > 
> > I would say that the variable is what needs to be aligned, not the
> > accesses.  (Although, if the variable is aligned and all the accesses
> > are full sized, then they must necessarily be aligned as well.)
> 
> I was thinking in terms of an unaligned 16-bit access to a 32-bit
> variable.

That wouldn't be full sized.

>  But how about this?
> 
>   c.  If there is only one variable but multiple CPUs, all

Extra "all".  Otherwise okay.

>   that variable must be properly aligned and all accesses
>   to that variable must be full sized.
> 
> > >   Variables that straddle cachelines or pages void your
> > >   full-ordering warranty, as do undersized accesses that
> > >   load from or store to only part of the variable.
> > 
> > How can a variable straddle pages without also straddling cache lines?
> 
> Well, a variable -can- straddle cachelines without straddling pages,
> which justifies the "or".  Furthermore, given that cacheline sizes have
> been growing, but pages are still 4KB, it is probably only a matter
> of time.  ;-)

By that time, we'll probably be using 64-KB pages.  Or even bigger!

Alan



Re: Linux-kernel examples for LKMM recipes

2017-10-17 Thread Paul E. McKenney
On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote:
> On Wed, Oct 11, 2017 at 10:32:30PM +, Paul E. McKenney wrote:
> > Hello!
> > 
> > At Linux Plumbers Conference, we got requests for a recipes document,
> > and a further request to point to actual code in the Linux kernel.
> > I have pulled together some examples for various litmus-test families,
> > as shown below.  The decoder ring for the abbreviations (ISA2, LB, SB,
> > MP, ...) is here:
> > 
> > https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
> > 
> > This document is also checked into the memory-models git archive:
> > 
> > https://github.com/aparri/memory-model.git
> > 
> > I would be especially interested in simpler examples in general, and
> > of course any example at all for the cases where I was unable to find
> > any.  Thoughts?
> > 
> > Thanx, Paul
> > 
> > 
> > 
> > This document lists the litmus-test patterns that we have been discussing,
> > along with examples from the Linux kernel.  This is intended to feed into
> > the recipes document.  All examples are from v4.13.
> > 
> > 0.  Single-variable SC.
> > 
> > a.  Within a single CPU, the use of the ->dynticks_nmi_nesting
> > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> > (see kernel/rcu/tree.c).  The counter is accessed by
> > interrupts and NMIs as well as by process-level code.
> > This counter can be accessed by other CPUs, but only
> > for debug output.
> > 
> > b.  Between CPUs, I would put forward the ->dflags
> > updates, but this is anything but simple.  But maybe
> > OK for an illustration?
> > 
> > 1.  MP (see test6.pdf for nickname translation)
> > 
> > a.  smp_store_release() / smp_load_acquire()
> > 
> > init_stack_slab() in lib/stackdepot.c uses release-acquire
> > to handle initialization of a slab of the stack.  Working
> > out the mutual-exclusion design is left as an exercise for
> > the reader.
> > 
> > b.  rcu_assign_pointer() / rcu_dereference()
> > 
> > expand_to_next_prime() does the rcu_assign_pointer(),
> > and next_prime_number() does the rcu_dereference().
> > This mediates access to a bit vector that is expanded
> > as additional primes are needed.  These two functions
> > are in lib/prime_numbers.c.
> > 
> > c.  smp_wmb() / smp_rmb()
> > 
> > xlog_state_switch_iclogs() contains the following:
> > 
> > log->l_curr_block -= log->l_logBBsize;
> > ASSERT(log->l_curr_block >= 0);
> > smp_wmb();
> > log->l_curr_cycle++;
> > 
> > And xlog_valid_lsn() contains the following:
> > 
> > cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> > smp_rmb();
> > cur_block = ACCESS_ONCE(log->l_curr_block);
> > 
> > d.  Replacing either of the above with smp_mb()
> > 
> > Holding off on this one for the moment...
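A generic sketch of the MP patterns in items 1.a and 1.b above, using
hypothetical variables rather than the in-kernel code cited; p is a
hypothetical, fully initialized structure:

	/* Hypothetical shared state, all initially zero/NULL. */
	int data;
	int flag;
	struct foo __rcu *gp;

	/* 1.a, CPU 0 (publisher): */
	data = 42;
	smp_store_release(&flag, 1);

	/* 1.a, CPU 1 (consumer): */
	if (smp_load_acquire(&flag))
		r1 = data;		/* guaranteed to observe 42 */

	/* 1.b, CPU 0 (publisher): */
	p->a = 42;
	rcu_assign_pointer(gp, p);

	/* 1.b, CPU 1 (consumer, within an RCU read-side critical section): */
	q = rcu_dereference(gp);
	if (q)
		r2 = q->a;		/* guaranteed to observe 42 */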
> > 
> > 2.  Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> > 
> > Lots of variety here, can in some cases substitute:
> > 
> > a.  READ_ONCE() for smp_load_acquire()
> > b.  WRITE_ONCE() for smp_store_release()
> > c.  Dependencies for both smp_load_acquire() and
> > smp_store_release().
> > d.  smp_wmb() for smp_store_release() in first thread
> > of ISA2 and Z6.2.
> > e.  smp_rmb() for smp_load_acquire() in last thread of ISA2.
> > 
> > The canonical illustration of LB involves the various memory
> > allocators, where you don't want a load from about-to-be-freed
> > memory to see a store initializing a later incarnation of that
> > same memory area.  But the per-CPU caches make this a very
> > long and complicated example.
> > 
> > I am not aware of any three-CPU release-acquire chains in the
> > Linux kernel.  There are three-CPU lock-based chains in RCU,
> > but these are not at all simple, either.
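A minimal LB sketch with control dependencies, again with hypothetical
shared variables x and y, both initially zero:

	/* CPU 0 */				/* CPU 1 */
	r0 = READ_ONCE(x);			r1 = READ_ONCE(y);
	if (r0)					if (r1)
		WRITE_ONCE(y, 1);			WRITE_ONCE(x, 1);

	/* The control dependencies forbid the outcome r0 == 1 && r1 == 1. */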
> > 
> 
> The "Program-Order guarantees" case in scheduler? See the comments
> written by Peter above try_to_wake_up():
> 
>  * The basic program-order guarantee on SMP systems is that when a task [t]
>  * migrates, all its activity on its old CPU [c0] happens-before any 
> subsequent
>  * execution on its new CPU [c1].
> ...
>  * For blocking we (obviously) need to provide the same guarantee as for
>  * migration. However the means are completely different as there is no lock
>  * chain to provide order. Instead we do:
>  *
>  *   1) smp_store_release(X->on_cpu, 0)
>  *   2) smp_cond_load_acquire(!X->on_cpu)
>  *
>  * Example:
>  *
>  *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)

Re: Linux-kernel examples for LKMM recipes

2017-10-17 Thread Paul E. McKenney
On Thu, Oct 12, 2017 at 12:27:19PM +0100, Will Deacon wrote:
> On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote:
> > On Wed, Oct 11, 2017 at 10:32:30PM +, Paul E. McKenney wrote:
> > >   I am not aware of any three-CPU release-acquire chains in the
> > >   Linux kernel.  There are three-CPU lock-based chains in RCU,
> > >   but these are not at all simple, either.
> > > 
> > 
> > The "Program-Order guarantees" case in scheduler? See the comments
> > written by Peter above try_to_wake_up():
> > 
> >  * The basic program-order guarantee on SMP systems is that when a task [t]
> >  * migrates, all its activity on its old CPU [c0] happens-before any 
> > subsequent
> >  * execution on its new CPU [c1].
> > ...
> >  * For blocking we (obviously) need to provide the same guarantee as for
> >  * migration. However the means are completely different as there is no lock
> >  * chain to provide order. Instead we do:
> >  *
> >  *   1) smp_store_release(X->on_cpu, 0)
> >  *   2) smp_cond_load_acquire(!X->on_cpu)
> >  *
> >  * Example:
> >  *
> >  *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> >  *
> >  *   LOCK rq(0)->lock LOCK X->pi_lock
> >  *   dequeue X
> >  *   sched-out X
> >  *   smp_store_release(X->on_cpu, 0);
> >  *
> >  *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
> >  *                    X->state = WAKING
> >  *                    set_task_cpu(X,2)
> >  *
> >  *                    LOCK rq(2)->lock
> >  *                    enqueue X
> >  *                    X->state = RUNNING
> >  *                    UNLOCK rq(2)->lock
> >  *
> >  *                                          LOCK rq(2)->lock // orders against CPU1
> >  *                                          sched-out Z
> >  *                                          sched-in X
> >  *                                          UNLOCK rq(2)->lock
> >  *
> >  *                    UNLOCK X->pi_lock
> >  *   UNLOCK rq(0)->lock
> > 
> > This is a chain mixed with lock and acquire-release(maybe even better?).
> > 
> > 
> > And another example would be osq_{lock,unlock}() on multiple(more than
> > three) CPUs. 
> 
> I think the qrwlock also has something similar with the writer fairness
> issue fixed:
> 
> CPU0: (writer doing an unlock)
> smp_store_release(&lock->wlocked, 0); // Bottom byte of lock->cnts
> 
> 
> CPU1: (waiting writer on slowpath)
> atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> ...
> arch_spin_unlock(&lock->wait_lock);
> 
> 
> CPU2: (reader on slowpath)
> arch_spin_lock(&lock->wait_lock);
> 
> and there's mixed-size accesses here too. Fun stuff!

You had me going there until you mentioned the mixed-size accesses.  ;-)

Thanx, Paul



Re: Linux-kernel examples for LKMM recipes

2017-10-17 Thread Paul E. McKenney
On Tue, Oct 17, 2017 at 03:38:23PM -0400, Alan Stern wrote:
> On Tue, 17 Oct 2017, Paul E. McKenney wrote:
> 
> > How about this?
> > 
> > 0.  Simple special cases
> > 
> > If there is only one CPU on the one hand or only one variable
> > on the other, the code will execute in order.  There are (as
> > usual) some things to be careful of:
> > 
> > a.  There are some aspects of the C language that are
> > unordered.  For example, the compiler can output code
> > computing arguments of a multi-parameter function in
> > any order it likes, or even interleaved if it so chooses.
> 
> That parses a little oddly.  I wouldn't agree that the compiler outputs
> the code in any order it likes!

When was the last time you talked to a compiler writer?  ;-)

> In fact, I wouldn't even mention the compiler at all.  Just say that
> (with a few exceptions) the language doesn't specify the order in which
> the arguments of a function or operation should be evaluated.  For
> example, in the expression "f(x) + g(y)", the order in which f and g
> are called is not defined; the object code is allowed to use either
> order or even to interleave the computations.

Nevertheless, I took your suggestion:

a.  There are some aspects of the C language that are
unordered.  For example, in the expression "f(x) + g(y)",
the order in which f and g are called is not defined;
the object code is allowed to use either order or even
to interleave the computations.
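A tiny illustration of that unspecified evaluation order; f() and g()
are hypothetical functions with visible side effects:

	static int f(int x) { pr_info("f\n"); return x; }
	static int g(int y) { pr_info("g\n"); return y; }

	/*
	 * "f" and "g" may be printed in either order here, and the
	 * generated code may even interleave the two calls.
	 */
	sum = f(1) + g(2);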

> > b.  Compilers are permitted to use the "as-if" rule.
> > That is, a compiler can emit whatever code it likes,
> > as long as the results appear just as if the compiler
> > had followed all the relevant rules.  To see this,
> > compile with a high level of optimization and run
> > the debugger on the resulting binary.
> 
> You might omit the last sentence.  Furthermore, if the accesses don't
> use READ_ONCE/WRITE_ONCE then the code might not get the same result as
> if it had executed in order (even for a single variable!), and if you
> do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
> it likes.

Ah, I omitted an important qualifier:

b.  Compilers are permitted to use the "as-if" rule.  That is,
a compiler can emit whatever code it likes, as long as
the results of a single-threaded execution appear just
as if the compiler had followed all the relevant rules.
To see this, compile with a high level of optimization
and run the debugger on the resulting binary.

I have seen people (including kernel hackers) surprised by what optimizers
do, so I would prefer that the last sentence remain.

> > c.  If there is only one variable but multiple CPUs, all
> > accesses to that variable must be aligned and full sized.
> 
> I would say that the variable is what needs to be aligned, not the
> accesses.  (Although, if the variable is aligned and all the accesses
> are full sized, then they must necessarily be aligned as well.)

I was thinking in terms of an unaligned 16-bit access to a 32-bit
variable.  But how about this?

c.  If there is only one variable but multiple CPUs, all
that variable must be properly aligned and all accesses
to that variable must be full sized.

> > Variables that straddle cachelines or pages void your
> > full-ordering warranty, as do undersized accesses that
> > load from or store to only part of the variable.
> 
> How can a variable straddle pages without also straddling cache lines?

Well, a variable -can- straddle cachelines without straddling pages,
which justifies the "or".  Furthermore, given that cacheline sizes have
been growing, but pages are still 4KB, it is probably only a matter
of time.  ;-)
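A hypothetical naturally aligned 32-bit counter makes the point about
alignment and access size:

	u32 counter;	/* naturally aligned, so it cannot straddle
			   a cacheline or a page */
	u32 r1;

	/* Aligned, full-sized accesses keep the single-variable
	 * ordering guarantee: */
	WRITE_ONCE(counter, 1);
	r1 = READ_ONCE(counter);

	/* An undersized access to only part of the variable voids
	 * that guarantee: */
	WRITE_ONCE(*(u16 *)&counter, 1);	/* don't do this */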

Thanx, Paul



Re: Linux-kernel examples for LKMM recipes

2017-10-17 Thread Paul E. McKenney
On Thu, Oct 12, 2017 at 03:27:44PM +0200, Andrea Parri wrote:
> Hi Paul,
> 
> On Wed, Oct 11, 2017 at 03:32:30PM -0700, Paul E. McKenney wrote:
> > Hello!
> > 
> > At Linux Plumbers Conference, we got requests for a recipes document,
> > and a further request to point to actual code in the Linux kernel.
> > I have pulled together some examples for various litmus-test families,
> > as shown below.  The decoder ring for the abbreviations (ISA2, LB, SB,
> > MP, ...) is here:
> > 
> > https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
> > 
> > This document is also checked into the memory-models git archive:
> > 
> > https://github.com/aparri/memory-model.git
> > 
> > I would be especially interested in simpler examples in general, and
> > of course any example at all for the cases where I was unable to find
> > any.  Thoughts?
> 
> Below are some examples we did discuss (at some point):
> 
> The comment in kernel/events/ring_buffer.c:perf_output_put_handle()
> describes instances of MP+wmb+rmb and LB+ctrl+mb.

I added this as an alternative for MP and as the example for LB.
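
For readers who want the bare MP+wmb+rmb shape without the ring-buffer
details, here is a minimal userspace sketch.  The C11 fences and
pthreads are stand-ins for the kernel's smp_wmb()/smp_rmb(), and the
variable names are illustrative only:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int data, flag;

static void *producer(void *arg)
{
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);      /* ~ smp_wmb() */
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return NULL;
}

static void *consumer(void *arg)
{
        while (!atomic_load_explicit(&flag, memory_order_relaxed))
                ;                                       /* wait for flag */
        atomic_thread_fence(memory_order_acquire);      /* ~ smp_rmb() */
        printf("data=%d\n",                             /* must print 42 */
               atomic_load_explicit(&data, memory_order_relaxed));
        return NULL;
}

int main(void)
{
        pthread_t p, c;

        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
}

The C11 release/acquire fences are if anything stronger than
smp_wmb()/smp_rmb(), but the shape of the guarantee is the same: if the
consumer observes the flag, it also observes the data.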

> The comments in kernel/sched/core.c:try_to_wake_up() describe more
> instances of MP ("plus locking") and LB (see finish_lock_switch()).

This one looks a bit more messy, so I will set it aside, for the moment,
anyway.

> The comment in kernel/sched/core.c:task_rq_lock() describes an
> instance of MP+wmb+addr-acqpo.

This one also looks a bit messy, so I am setting it aside as well.

> The comment in include/linux/wait.h:waitqueue_active() describes an
> instance of SB+mb+mb.

Very good, I took this as the generic pattern for the current pair
of SB examples.
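
For the same reason, here is the bare SB+mb+mb shape as a hedged
userspace sketch (C11 seq_cst fences standing in for smp_mb(); this is
not the waitqueue code itself):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int x, y;
static int r0, r1;

static void *t0(void *arg)
{
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* ~ smp_mb() */
        r0 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
}

static void *t1(void *arg)
{
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);      /* ~ smp_mb() */
        r1 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, t0, NULL);
        pthread_create(&b, NULL, t1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* With both fences in place, r0 == 0 && r1 == 0 is forbidden. */
        printf("r0=%d r1=%d\n", r0, r1);
        return 0;
}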

> 63cae12bce986 ("perf/core: Fix sys_perf_event_open() vs. hotplug")
> describes an instance of W+RWC+porel+mb+mb.

Well, this one certainly is of historical interest.  After all, it might
well be the first Linux-kernel commit containing a litmus test.  ;-)

I put it in recipes-LKcode-63cae12bce986.txt and reference it from
recipes-LKcode.txt.

> [...]
> 
> I wish we could say "any barrier (explicit or implicit) in sources
> is accompanied by a comment mentioning the pattern of interest...",
> but life is not always this simple. ;-)

Well, at least scripts/checkpatch.pl now complains if you try to add
a new comment-free barrier.  Not that these complaints are always
paid attention to...

Thanx, Paul



Re: Linux-kernel examples for LKMM recipes

2017-10-17 Thread Alan Stern
On Tue, 17 Oct 2017, Paul E. McKenney wrote:

> How about this?
> 
> 0.Simple special cases
> 
>   If there is only one CPU on the one hand or only one variable
>   on the other, the code will execute in order.  There are (as
>   usual) some things to be careful of:
> 
>   a.  There are some aspects of the C language that are
>   unordered.  For example, the compiler can output code
>   computing arguments of a multi-parameter function in
>   any order it likes, or even interleaved if it so chooses.

That parses a little oddly.  I wouldn't agree that the compiler outputs
the code in any order it likes!

In fact, I wouldn't even mention the compiler at all.  Just say that
(with a few exceptions) the language doesn't specify the order in which
the arguments of a function or operation should be evaluated.  For
example, in the expression "f(x) + g(y)", the order in which f and g
are called is not defined; the object code is allowed to use either
order or even to interleave the computations.

>   b.  Compilers are permitted to use the "as-if" rule.
>   That is, a compiler can emit whatever code it likes,
>   as long as the results appear just as if the compiler
>   had followed all the relevant rules.  To see this,
> compile with a high level of optimization and run
>   the debugger on the resulting binary.

You might omit the last sentence.  Furthermore, if the accesses don't
use READ_ONCE/WRITE_ONCE then the code might not get the same result as
if it had executed in order (even for a single variable!), and if you
do use READ_ONCE/WRITE_ONCE then the compiler can't emit whatever code
it likes.

>   c.  If there is only one variable but multiple CPUs, all
>   accesses to that variable must be aligned and full sized.

I would say that the variable is what needs to be aligned, not the
accesses.  (Although, if the variable is aligned and all the accesses
are full sized, then they must necessarily be aligned as well.)

>   Variables that straddle cachelines or pages void your
>   full-ordering warranty, as do undersized accesses that
>   load from or store to only part of the variable.

How can a variable straddle pages without also straddling cache lines?

Alan



Re: Linux-kernel examples for LKMM recipes

2017-10-17 Thread Paul E. McKenney
On Fri, Oct 13, 2017 at 04:09:26PM -0400, Alan Stern wrote:
> On Fri, 13 Oct 2017, Paul E. McKenney wrote:
> > On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote:
> > > On Wed, 11 Oct 2017, Paul E. McKenney wrote:

[ . . . ]

> > Perhaps the recipes document should just baldly state that any execution
> > having only one thread and/or having only one variable will be fully
> > ordered?
> 
> That wouldn't be a bad idea.  (Although the part about only one thread 
> should be pretty obvious.)
> 
> Also, you have to be a little careful because the ordering of the
> execution may not always agree with the ordering of the source code.  
> The compiler is allowed to evaluate the arguments to a function call,
> for example, in any order it likes.

How about this?

0.  Simple special cases

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

a.  There are some aspects of the C language that are
unordered.  For example, the compiler can output code
computing arguments of a multi-parameter function in
any order it likes, or even interleaved if it so chooses.

b.  Compilers are permitted to use the "as-if" rule.
That is, a compiler can emit whatever code it likes,
as long as the results appear just as if the compiler
had followed all the relevant rules.  To see this,
compile with a high level of optimization and run
the debugger on the resulting binary.

c.  If there is only one variable but multiple CPUs, all
accesses to that variable must be aligned and full sized.
Variables that straddle cachelines or pages void your
full-ordering warranty, as do undersized accesses that
load from or store to only part of the variable.

Thanx, Paul



Re: Linux-kernel examples for LKMM recipes

2017-10-13 Thread Alan Stern
On Fri, 13 Oct 2017, Paul E. McKenney wrote:

> On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote:
> > On Wed, 11 Oct 2017, Paul E. McKenney wrote:
> > 
> > > This document lists the litmus-test patterns that we have been discussing,
> > > along with examples from the Linux kernel.  This is intended to feed into
> > > the recipes document.  All examples are from v4.13.
> > > 
> > > 0.Single-variable SC.
> > > 
> > >   a.  Within a single CPU, the use of the ->dynticks_nmi_nesting
> > >   counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> > >   (see kernel/rcu/tree.c).  The counter is accessed by
> > >   interrupts and NMIs as well as by process-level code.
> > >   This counter can be accessed by other CPUs, but only
> > >   for debug output.
> > 
> > I'm not sure that single-variable SC can really be represented by an 
> > example.  It gets used literally all over the kernel -- it's such a 
> > large part of the way we think about computer programs that we rely on 
> > it unconsciously.
> > 
> > For example, the very first function in the very first C source file 
> > in the kernel/ directory (namely, check_free_space() in kernel/acct.c) 
> > includes this code:
> > 
> > if (acct->active) {
> > u64 suspend = sbuf.f_blocks * SUSPEND;
> > do_div(suspend, 100);
> > 
> > How do we know that the value which gets divided by 100 is
> > sbuf.f_blocks * SUSPEND and not the random garbage which was stored in
> > suspend's memory location before it was initialized?  Answer:
> > per-variable SC.
> > 
> > Okay, maybe that's not really applicable, since it doesn't involve
> > accesses to shared memory.  Here's an example that does.  
> > get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice.  
> > How do we know that the value retrieved by the second call was not
> > stored _earlier_ than the value retrieved by the first call?  
> > Per-variable SC.
> > 
> > >   b.  Between CPUs, I would put forward the ->dflags
> > >   updates, but this is anything but simple.  But maybe
> > >   OK for an illustration?
> > 
> > Pretty much any code that accesses the same shared variable twice on
> > the same CPU could be an example of per-variable SC.  But I don't think 
> > people would learn much by studying such examples.
> 
> Perhaps the recipes document should just baldly state that any execution
> having only one thread and/or having only one variable will be fully
> ordered?

That wouldn't be a bad idea.  (Although the part about only one thread 
should be pretty obvious.)

Also, you have to be a little careful because the ordering of the
execution may not always agree with the ordering of the source code.  
The compiler is allowed to evaluate the arguments to a function call,
for example, in any order it likes.

Alan



Re: Linux-kernel examples for LKMM recipes

2017-10-13 Thread Paul E. McKenney
On Fri, Oct 13, 2017 at 03:44:07PM -0400, Alan Stern wrote:
> On Wed, 11 Oct 2017, Paul E. McKenney wrote:
> 
> > This document lists the litmus-test patterns that we have been discussing,
> > along with examples from the Linux kernel.  This is intended to feed into
> > the recipes document.  All examples are from v4.13.
> > 
> > 0.  Single-variable SC.
> > 
> > a.  Within a single CPU, the use of the ->dynticks_nmi_nesting
> > counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
> > (see kernel/rcu/tree.c).  The counter is accessed by
> > interrupts and NMIs as well as by process-level code.
> > This counter can be accessed by other CPUs, but only
> > for debug output.
> 
> I'm not sure that single-variable SC can really be represented by an 
> example.  It gets used literally all over the kernel -- it's such a 
> large part of the way we think about computer programs that we rely on 
> it unconsciously.
> 
> For example, the very first function in the very first C source file 
> in the kernel/ directory (namely, check_free_space() in kernel/acct.c) 
> includes this code:
> 
> if (acct->active) {
> u64 suspend = sbuf.f_blocks * SUSPEND;
> do_div(suspend, 100);
> 
> How do we know that the value which gets divided by 100 is
> sbuf.f_blocks * SUSPEND and not the random garbage which was stored in
> suspend's memory location before it was initialized?  Answer:
> per-variable SC.
> 
> Okay, maybe that's not really applicable, since it doesn't involve
> accesses to shared memory.  Here's an example that does.  
> get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice.  
> How do we know that the value retrieved by the second call was not
> stored _earlier_ than the value retrieved by the first call?  
> Per-variable SC.
> 
> > b.  Between CPUs, I would put forward the ->dflags
> > updates, but this is anything but simple.  But maybe
> > OK for an illustration?
> 
> Pretty much any code that accesses the same shared variable twice on
> the same CPU could be an example of per-variable SC.  But I don't think 
> people would learn much by studying such examples.

Perhaps the recipes document should just baldly state that any execution
having only one thread and/or having only one variable will be fully
ordered?

Thanx, Paul



Re: Linux-kernel examples for LKMM recipes

2017-10-13 Thread Alan Stern
On Wed, 11 Oct 2017, Paul E. McKenney wrote:

> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel.  This is intended to feed into
> the recipes document.  All examples are from v4.13.
> 
> 0.Single-variable SC.
> 
>   a.  Within a single CPU, the use of the ->dynticks_nmi_nesting
>   counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
>   (see kernel/rcu/tree.c).  The counter is accessed by
>   interrupts and NMIs as well as by process-level code.
>   This counter can be accessed by other CPUs, but only
>   for debug output.

I'm not sure that single-variable SC can really be represented by an 
example.  It gets used literally all over the kernel -- it's such a 
large part of the way we think about computer programs that we rely on 
it unconsciously.

For example, the very first function in the very first C source file 
in the kernel/ directory (namely, check_free_space() in kernel/acct.c) 
includes this code:

if (acct->active) {
u64 suspend = sbuf.f_blocks * SUSPEND;
do_div(suspend, 100);

How do we know that the value which gets divided by 100 is
sbuf.f_blocks * SUSPEND and not the random garbage which was stored in
suspend's memory location before it was initialized?  Answer:
per-variable SC.

Okay, maybe that's not really applicable, since it doesn't involve
accesses to shared memory.  Here's an example that does.  
get_futex_key() in kernel/futex.c calls READ_ONCE(page->mapping) twice.  
How do we know that the value retrieved by the second call was not
stored _earlier_ than the value retrieved by the first call?  
Per-variable SC.
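
One hedged way to state that property outside the kernel: even C11
relaxed atomics provide per-location coherence, so two back-to-back
reads of the same variable in one thread can never appear to run
backward in time.  The names below are illustrative, not the futex code:

#include <stdatomic.h>
#include <stdio.h>

static _Atomic long shared_var;         /* illustrative name only */

/*
 * Per-variable SC (read-read coherence): the second relaxed load can
 * never return a value that comes earlier in shared_var's modification
 * order than the value the first load returned, no matter what other
 * threads are doing to shared_var in the meantime.
 */
static void read_twice(long *first, long *second)
{
        *first  = atomic_load_explicit(&shared_var, memory_order_relaxed);
        *second = atomic_load_explicit(&shared_var, memory_order_relaxed);
}

int main(void)
{
        long a, b;

        atomic_store_explicit(&shared_var, 1, memory_order_relaxed);
        read_twice(&a, &b);
        printf("first=%ld second=%ld\n", a, b);
        return 0;
}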

>   b.  Between CPUs, I would put forward the ->dflags
>   updates, but this is anything but simple.  But maybe
>   OK for an illustration?

Pretty much any code that accesses the same shared variable twice on
the same CPU could be an example of per-variable SC.  But I don't think 
people would learn much by studying such examples.

Alan



Re: Linux-kernel examples for LKMM recipes

2017-10-12 Thread Andrea Parri
Hi Paul,

On Wed, Oct 11, 2017 at 03:32:30PM -0700, Paul E. McKenney wrote:
> Hello!
> 
> At Linux Plumbers Conference, we got requests for a recipes document,
> and a further request to point to actual code in the Linux kernel.
> I have pulled together some examples for various litmus-test families,
> as shown below.  The decoder ring for the abbreviations (ISA2, LB, SB,
> MP, ...) is here:
> 
>   https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
> 
> This document is also checked into the memory-models git archive:
> 
>   https://github.com/aparri/memory-model.git
> 
> I would be especially interested in simpler examples in general, and
> of course any example at all for the cases where I was unable to find
> any.  Thoughts?

Below are some examples we did discuss (at some point):

The comment in kernel/events/ring_buffer.c:perf_output_put_handle()
describes instances of MP+wmb+rmb and LB+ctrl+mb.

The comments in kernel/sched/core.c:try_to_wake_up() describe more
instances of MP ("plus locking") and LB (see finish_lock_switch()).

The comment in kernel/sched/core.c:task_rq_lock() describes an
instance of MP+wmb+addr-acqpo.

The comment in include/linux/wait.h:waitqueue_active() describes an
instance of SB+mb+mb.

63cae12bce986 ("perf/core: Fix sys_perf_event_open() vs. hotplug")
describes an instance of W+RWC+porel+mb+mb.

[...]

I wish we could say "any barrier (explicit or implicit) in sources
is accompanied by a comment mentioning the pattern of interest...",
but life is not always this simple. ;-)

  Andrea


> 
>   Thanx, Paul
> 
> 
> 
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel.  This is intended to feed into
> the recipes document.  All examples are from v4.13.
> 
> 0.Single-variable SC.
> 
>   a.  Within a single CPU, the use of the ->dynticks_nmi_nesting
>   counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
>   (see kernel/rcu/tree.c).  The counter is accessed by
>   interrupts and NMIs as well as by process-level code.
>   This counter can be accessed by other CPUs, but only
>   for debug output.
> 
>   b.  Between CPUs, I would put forward the ->dflags
>   updates, but this is anything but simple.  But maybe
>   OK for an illustration?
> 
> 1.MP (see test6.pdf for nickname translation)
> 
>   a.  smp_store_release() / smp_load_acquire()
> 
>   init_stack_slab() in lib/stackdepot.c uses release-acquire
>   to handle initialization of a slab of the stack.  Working
>   out the mutual-exclusion design is left as an exercise for
>   the reader.
> 
>   b.  rcu_assign_pointer() / rcu_dereference()
> 
>   expand_to_next_prime() does the rcu_assign_pointer(),
>   and next_prime_number() does the rcu_dereference().
>   This mediates access to a bit vector that is expanded
>   as additional primes are needed.  These two functions
>   are in lib/prime_numbers.c.
> 
>   c.  smp_wmb() / smp_rmb()
> 
>   xlog_state_switch_iclogs() contains the following:
> 
>   log->l_curr_block -= log->l_logBBsize;
>   ASSERT(log->l_curr_block >= 0);
>   smp_wmb();
>   log->l_curr_cycle++;
> 
>   And xlog_valid_lsn() contains the following:
> 
>   cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
>   smp_rmb();
>   cur_block = ACCESS_ONCE(log->l_curr_block);
> 
>   d.  Replacing either of the above with smp_mb()
> 
>   Holding off on this one for the moment...
> 
> 2.Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> 
>   Lots of variety here, can in some cases substitute:
>   
>   a.  READ_ONCE() for smp_load_acquire()
>   b.  WRITE_ONCE() for smp_store_release()
>   c.  Dependencies for both smp_load_acquire() and
>   smp_store_release().
>   d.  smp_wmb() for smp_store_release() in first thread
>   of ISA2 and Z6.2.
>   e.  smp_rmb() for smp_load_acquire() in last thread of ISA2.
> 
>   The canonical illustration of LB involves the various memory
>   allocators, where you don't want a load from about-to-be-freed
>   memory to see a store initializing a later incarnation of that
>   same memory area.  But the per-CPU caches make this a very
>   long and complicated example.
> 
>   I am not aware of any three-CPU release-acquire chains in the
>   Linux kernel.  There are three-CPU lock-based chains in RCU,
>   but these are not at all simple, either.
> 
>   Thoughts?
> 
> 3.  

Re: Linux-kernel examples for LKMM recipes

2017-10-12 Thread Will Deacon
On Thu, Oct 12, 2017 at 09:23:59AM +0800, Boqun Feng wrote:
> On Wed, Oct 11, 2017 at 10:32:30PM +0000, Paul E. McKenney wrote:
> > I am not aware of any three-CPU release-acquire chains in the
> > Linux kernel.  There are three-CPU lock-based chains in RCU,
> > but these are not at all simple, either.
> > 
> 
> The "Program-Order guarantees" case in scheduler? See the comments
> written by Peter above try_to_wake_up():
> 
>  * The basic program-order guarantee on SMP systems is that when a task [t]
>  * migrates, all its activity on its old CPU [c0] happens-before any subsequent
>  * execution on its new CPU [c1].
> ...
>  * For blocking we (obviously) need to provide the same guarantee as for
>  * migration. However the means are completely different as there is no lock
>  * chain to provide order. Instead we do:
>  *
>  *   1) smp_store_release(X->on_cpu, 0)
>  *   2) smp_cond_load_acquire(!X->on_cpu)
>  *
>  * Example:
>  *
>  *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
>  *
>  *   LOCK rq(0)->lock LOCK X->pi_lock
>  *   dequeue X
>  *   sched-out X
>  *   smp_store_release(X->on_cpu, 0);
>  *
>  *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
>  *                    X->state = WAKING
>  *                    set_task_cpu(X,2)
>  *
>  *                    LOCK rq(2)->lock
>  *                    enqueue X
>  *                    X->state = RUNNING
>  *                    UNLOCK rq(2)->lock
>  *
>  *                                          LOCK rq(2)->lock // orders against CPU1
>  *                                          sched-out Z
>  *                                          sched-in X
>  *                                          UNLOCK rq(2)->lock
>  *
>  *                    UNLOCK X->pi_lock
>  *   UNLOCK rq(0)->lock
> 
> This is a chain mixing locks and acquire-release (maybe even better?).
> 
> 
> And another example would be osq_{lock,unlock}() on multiple (more than
> three) CPUs. 

I think the qrwlock also has something similar with the writer fairness
issue fixed:

CPU0: (writer doing an unlock)
smp_store_release(&lock->wlocked, 0);   // Bottom byte of lock->cnts


CPU1: (waiting writer on slowpath)
atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
...
arch_spin_unlock(&lock->wait_lock);


CPU2: (reader on slowpath)
arch_spin_lock(&lock->wait_lock);

and there's mixed-size accesses here too. Fun stuff!

Will


Re: Linux-kernel examples for LKMM recipes

2017-10-11 Thread Boqun Feng
On Wed, Oct 11, 2017 at 10:32:30PM +0000, Paul E. McKenney wrote:
> Hello!
> 
> At Linux Plumbers Conference, we got requests for a recipes document,
> and a further request to point to actual code in the Linux kernel.
> I have pulled together some examples for various litmus-test families,
> as shown below.  The decoder ring for the abbreviations (ISA2, LB, SB,
> MP, ...) is here:
> 
>   https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf
> 
> This document is also checked into the memory-models git archive:
> 
>   https://github.com/aparri/memory-model.git
> 
> I would be especially interested in simpler examples in general, and
> of course any example at all for the cases where I was unable to find
> any.  Thoughts?
> 
>   Thanx, Paul
> 
> 
> 
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel.  This is intended to feed into
> the recipes document.  All examples are from v4.13.
> 
> 0.Single-variable SC.
> 
>   a.  Within a single CPU, the use of the ->dynticks_nmi_nesting
>   counter by rcu_nmi_enter() and rcu_nmi_exit() qualifies
>   (see kernel/rcu/tree.c).  The counter is accessed by
>   interrupts and NMIs as well as by process-level code.
>   This counter can be accessed by other CPUs, but only
>   for debug output.
> 
>   b.  Between CPUs, I would put forward the ->dflags
>   updates, but this is anything but simple.  But maybe
>   OK for an illustration?
> 
> 1.MP (see test6.pdf for nickname translation)
> 
>   a.  smp_store_release() / smp_load_acquire()
> 
>   init_stack_slab() in lib/stackdepot.c uses release-acquire
>   to handle initialization of a slab of the stack.  Working
>   out the mutual-exclusion design is left as an exercise for
>   the reader.
> 
>   b.  rcu_assign_pointer() / rcu_dereference()
> 
>   expand_to_next_prime() does the rcu_assign_pointer(),
>   and next_prime_number() does the rcu_dereference().
>   This mediates access to a bit vector that is expanded
>   as additional primes are needed.  These two functions
>   are in lib/prime_numbers.c.
> 
>   c.  smp_wmb() / smp_rmb()
> 
>   xlog_state_switch_iclogs() contains the following:
> 
>   log->l_curr_block -= log->l_logBBsize;
>   ASSERT(log->l_curr_block >= 0);
>   smp_wmb();
>   log->l_curr_cycle++;
> 
>   And xlog_valid_lsn() contains the following:
> 
>   cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
>   smp_rmb();
>   cur_block = ACCESS_ONCE(log->l_curr_block);
> 
>   d.  Replacing either of the above with smp_mb()
> 
>   Holding off on this one for the moment...
> 
> 2.Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> 
>   Lots of variety here, can in some cases substitute:
>   
>   a.  READ_ONCE() for smp_load_acquire()
>   b.  WRITE_ONCE() for smp_store_release()
>   c.  Dependencies for both smp_load_acquire() and
>   smp_store_release().
>   d.  smp_wmb() for smp_store_release() in first thread
>   of ISA2 and Z6.2.
>   e.  smp_rmb() for smp_load_acquire() in last thread of ISA2.
> 
>   The canonical illustration of LB involves the various memory
>   allocators, where you don't want a load from about-to-be-freed
>   memory to see a store initializing a later incarnation of that
>   same memory area.  But the per-CPU caches make this a very
>   long and complicated example.
> 
>   I am not aware of any three-CPU release-acquire chains in the
>   Linux kernel.  There are three-CPU lock-based chains in RCU,
>   but these are not at all simple, either.
> 

The "Program-Order guarantees" case in scheduler? See the comments
written by Peter above try_to_wake_up():

 * The basic program-order guarantee on SMP systems is that when a task [t]
 * migrates, all its activity on its old CPU [c0] happens-before any subsequent
 * execution on its new CPU [c1].
...
 * For blocking we (obviously) need to provide the same guarantee as for
 * migration. However the means are completely different as there is no lock
 * chain to provide order. Instead we do:
 *
 *   1) smp_store_release(X->on_cpu, 0)
 *   2) smp_cond_load_acquire(!X->on_cpu)
 *
 * Example:
 *
 *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
 *
 *   LOCK rq(0)->lock LOCK X->pi_lock
 *   dequeue X
 *   sched-out X
 *   smp_store_release(X->on_cpu, 0);
 *
 *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
 * 
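
At its core, the handoff described in that comment is a store-release
paired with an acquire-mode busy wait.  Here is a minimal userspace
sketch of just that shape, with C11 atomics standing in for
smp_store_release()/smp_cond_load_acquire() and purely illustrative
names:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int on_cpu = 1;  /* illustrative stand-in for X->on_cpu */
static int task_state;          /* data protected by the handoff */

static void *prev_cpu(void *arg)        /* plays the role of CPU0 */
{
        task_state = 42;                /* "sched-out X" side effects */
        /* ~ smp_store_release(&X->on_cpu, 0) */
        atomic_store_explicit(&on_cpu, 0, memory_order_release);
        return NULL;
}

static void *waker(void *arg)           /* plays the role of CPU1 */
{
        /* ~ smp_cond_load_acquire(&X->on_cpu, !VAL) */
        while (atomic_load_explicit(&on_cpu, memory_order_acquire))
                ;
        /* Everything the releasing thread did is now visible. */
        printf("task_state=%d\n", task_state);
        return NULL;
}

int main(void)
{
        pthread_t t0, t1;

        pthread_create(&t1, NULL, waker, NULL);
        pthread_create(&t0, NULL, prev_cpu, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
}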
